Why Is... Snot Salty? (And Other Interesting Questions)

Andrew B. Collier, Ph.D. Physics

Google Autocomplete dynamically offers suggestions as you type a search query. Many of these suggestions are spectacularly appropriate, and the system almost appears to read your mind. But where do these suggestions come from? And can we use them to get a general feeling for what people are thinking?

Autocomplete has been a feature on Google since 2008. Since then, every query typed into Google has been consumed by the insatiable Autocomplete engine. Its sophisticated algorithm crunches the enormous volume of queries (currently around 6 billion per day) and uses the results to make suggestions… while you are still busy typing. It does not attempt to anticipate the results of your search. That’s the job of Google Instant. Autocomplete tries to predict the query itself! The system seems to know what you are searching for even before you do, which can be rather unnerving (and in many cases rather amusing).

Suppose, for example, that I typed in a query starting with “pizza oven”. Now this simple query would yield a wide range of search results, many of which would be far from what I was looking for. So it would be better to use a more specific query. Autocomplete suggests a variety of less ambiguous queries.

Although popularity plays a role in ranking the suggestions, relevance is more important. Fresh searches that are currently popular are also likely to be promoted. However, personalized suggestions based on your search history are always at the top of the list. The suggestion “pizza oven refractory cement” comes from my search history. These personalized suggestions are differentiated from the others by having a “Remove” option.

Numerous amusing Autocompletions have made the rounds on Facebook and other social media sites. However, since Autocomplete is continually evolving with new data, it is worthwhile revisiting some of them. For example, suggested completions for “can a human” currently include “live on mars”, “get a sheep pregnant”, “get pregnant by a dog”, “get worms from a cat”, “catch a cold from a cat” and “give a dog a cold.” Evidently there is considerable interest in interspecies fertilization and transmission of ailments and parasites.

Many people are looking for either their iPhones or Chuck Norris, both of which are suggested completions for “where is”.

The search phrase “why are we” yields some existential suggestions, such as “here”, “here on Earth” and “alive”, as well as “at war with Afghanistan” and “yelling.” The related search “why do we” is completed with bodily functions like “yawn”, “dream”, “cry”, “hiccup”, “sneeze” and the somewhat incongruous “have daylight savings time.”

Equally deep are queries which start off with “is god” and spawn suggestions like “real”, “a man”, “a woman”, “dead” [1] and “an alien.”

In fact, aliens seem to be a popular theme. The partial query “aliens are” is completed by “fake”, “real”, “among us”, “coming”, “demons” and “from hell.” Meanwhile, “can an alien” generates “buy a house”, “join the US military”, “own a firearm” and “vote.” Evidently, the two queries refer to completely different types of aliens.

Some suggestions are explicitly suppressed by Google. Generally this only happens for suggestions which might be considered hate speech, offensive or illegal. Yet sexism is alive and well on Google Autocomplete. The query “women should” results in suggestions like “not work”, “stay at home”, “not vote” and “be seen and not heard.” The equivalent query for men currently results in no suggestions at all! Maybe this is because queries about men are more likely to be phrased as a question? There is some evidence to support this idea. If you type in “why are men”, you get suggestions like “attracted to breasts”, “so mean”, “players” and “jerks.” You also get “better than women.” The equivalent query for women results in “so emotional”, “so difficult”, “so insecure” and “so complicated”, which I hope expresses genuine bewilderment rather than prejudice. Somewhat disturbingly, though, “posting inches on facebook” is given as a suggested completion for “why are ladies”.

No matter what the suggestion, it is always based on query data gathered by Google. Granted, there have been cases of people manipulating Autocomplete results, but generally the queries are typed in by real people who are genuinely looking for answers.

The way that Autocomplete gathers data makes it susceptible to stereotypes. For example, it responds to “South Africa is” with “dangerous”, “a mess”, “racist” and “a hell hole.” Rather damning, but probably quite representative of international attitudes. Some more positive suggestions are “truly an amazing country” and “where I come from.”

Recently a few interesting maps have been generated from Autocomplete results. One shows the responses to “Why is State so” [2], with interesting (and sometimes obvious) results, including “big” for Texas and “haunted” for Pennsylvania. Another map illustrates suggested completions for “State wants” [3] with (ironically) “an aircraft carrier” for Wyoming and “its own currency” for Virginia. But, by giving only a single Autocomplete response per state, both of these maps simply scratch the surface.

We can really dig a lot deeper.

What about looking at the complete list of suggestions for the “Why is State so” query? Performing the query across all states and retaining all of the suggestions gives 113 unique results. The frequency of these results is plotted in the histogram below. The overwhelming favorite is “boring”. Now that is a bit of an indictment! Next in order of popularity are “cold” and “humid”, followed by a pair of antonyms: “liberal” and “conservative”. All of these terms have been used in queries relating to 18 or more states. At the other end of the spectrum, terms like “Mormon”, “square” and “salty” have only been suggested for a single state. No prizes for guessing which states they were associated with!
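The tally behind a histogram like this can be sketched in a few lines of Python. This is only an illustration, not the code used for the article: the `suggestions` dictionary is a hand-made toy sample restricted to terms quoted in this article, and the function name `term_frequencies` is hypothetical.

```python
from collections import Counter

# Toy subset of completions for "Why is <State> so", limited to terms quoted
# in the article. The real data set covered every state (113 unique terms).
suggestions = {
    "Alaska": ["cold", "big", "dark", "far away"],
    "California": ["big", "hot", "strict"],
    "New York": ["cold", "big"],
    "Texas": ["big", "hot", "strict"],
    "Utah": ["salty", "Mormon"],
    "Wyoming": ["boring", "square"],
}

def term_frequencies(suggestions):
    """Count, for each term, the number of states it was suggested for."""
    counts = Counter()
    for terms in suggestions.values():
        counts.update(set(terms))  # set() so a term counts once per state
    return counts

frequencies = term_frequencies(suggestions)
for term, count in frequencies.most_common(3):
    print(term, count)
```

In this toy sample “big” comes out on top with four states, while “salty” and “square” each appear for a single state, mirroring the long tail described above.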

The figure below shows which terms are associated with each of the states. The diagonal line at the right of the figure is an artifact of the way that these data were gathered: the states were queried in alphabetical order. So, whenever a new term was introduced, it added to the ascending line.
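The origin of that diagonal can be reproduced with a small sketch. It assumes (this is my guess at the layout, not a description of the original plotting code) that each term is given the next free position the first time it is encountered, with states processed alphabetically. The data are again a toy sample of terms quoted in this article, and `first_seen_columns` is a hypothetical helper.

```python
# Toy sample, restricted to terms quoted in the article.
suggestions = {
    "Alaska": ["cold", "big", "dark"],
    "California": ["big", "hot", "strict"],
    "Texas": ["big", "hot", "strict"],
    "Utah": ["salty", "Mormon"],
    "Wyoming": ["boring", "square"],
}

def first_seen_columns(suggestions):
    """Give each term a position index in order of first appearance."""
    columns = {}
    for state in sorted(suggestions):  # alphabetical, as in the article
        for term in suggestions[state]:
            # Only a term that has not been seen before gets a new index.
            columns.setdefault(term, len(columns))
    return columns

columns = first_seen_columns(suggestions)
for row, state in enumerate(sorted(suggestions)):
    marks = sorted(columns[term] for term in suggestions[state])
    print(row, state, marks)
```

States later in the alphabet can only introduce terms with higher indices than any seen before, so the newly introduced terms trace out an ascending line, while shared terms such as “big” fall back to earlier positions.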

This figure is a tessellation of libel and praise. We have confirmation that Wyoming is thought to be “square”; Kansas is considered “backward”, but Iowa is “progressive”; Oregon is “devoted to Tartuffe” (who would have thought?); California is “rich in life zones” (which is a fact, not just opinion [4]); New Mexico is “enchanting” (a sentiment with which I thoroughly concur); Alaska is “dark”, “far away” and “prone to earthquakes” (which does not paint a very attractive picture!); Connecticut, Maryland and Michigan are all described as “ghetto” (presumably the disparaging slang form of the adjective); wide open spaces are said to be found in Maine, Montana, Nevada, Vermont and Wyoming; while Utah has the distinction of being the only state which is described as both “salty” and “Mormon.”

Although some states are assigned unique attributes, there are many terms which are applied to multiple states. This suggests that we could use these data to group states according to common perceptions. The tree (also known as a dendrogram) below was generated using a method called Hierarchical Clustering [5]. The states that are close together at the ends of branches are the most similar in terms of query completions. For example, according to the data, Connecticut and Maryland are similar, but both are quite different from Florida… at least according to Autocomplete.
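To make the idea concrete, here is a minimal pure-Python sketch of agglomerative hierarchical clustering. The article does not state which distance measure or linkage rule was used, so the Jaccard distance between term sets and average linkage are assumptions on my part, and the data are a toy sample of associations quoted in this article. All function names are hypothetical.

```python
from itertools import combinations

# Toy sample: terms per state, taken from associations quoted in the article.
terms = {
    "California": {"big", "hot", "strict"},
    "Texas": {"big", "hot", "strict"},
    "New York": {"cold", "big", "expensive to live"},
    "Alaska": {"cold", "big", "expensive to live", "dark"},
    "Utah": {"salty", "Mormon"},
}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0 for identical sets, 1 for disjoint sets."""
    return 1.0 - len(a & b) / len(a | b)

def cluster_distance(c1, c2, terms):
    """Average linkage: mean distance over all cross-cluster state pairs."""
    pairs = [(s, t) for s in c1 for t in c2]
    return sum(jaccard_distance(terms[s], terms[t]) for s, t in pairs) / len(pairs)

def hierarchical_clustering(terms):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(state,) for state in sorted(terms)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], terms),
        )
        merges.append((c1, c2, cluster_distance(c1, c2, terms)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = hierarchical_clustering(terms)
for left, right, distance in merges:
    print(left, "+", right, f"(distance {distance:.2f})")
```

The merge history is what a dendrogram draws: pairs merged at a small distance sit together at the ends of branches, and the last merges form the trunk. In this toy sample California and Texas join first (identical term sets), followed by Alaska and New York, while Utah, which shares no terms with the others, joins last.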

Some of the associations are fairly obvious. Nevada and New Mexico are both “dangerous”, “windy”, “liberal” and “dry.” They also share the last two attributes with Colorado.

New York and Alaska seem like an unlikely couple, but both are perceived to be “cold”, “big” and “expensive to live.” Hawaii is the only other state described as “expensive to live.” A number of states are simply “expensive.” They are scattered more or less uniformly across the dendrogram, their associations being more strongly determined by other terms.

California and Texas are both regarded as “big”, “hot” and “strict.” The last term comes as a bit of a surprise: to my mind California is liberal. But maybe “strict” is referring to laws, particularly those pertaining to weapons and emissions?

In terms of large-scale structure, the 16 states between Wyoming and Georgia are the “boring” cluster (it’s the term that all of them have in common, although the majority are also described as “conservative” and “racist”). The 13 states extending from Illinois to New Mexico can be considered the “liberal” cluster (although 11 of these states are also described as “boring”). Then there is the “racist” and “poor” cluster between Louisiana and West Virginia, and the “expensive” cluster of seven states between New Hampshire and New York. Back on the other side of the dendrogram, there is the “hot” cluster consisting of Florida, California and Texas, and finally the “cold” cluster of five states between Michigan and Utah.

Autocomplete yields a wealth of interesting social information and gives an indication of what a large selection of people are thinking. If Google can bring us driverless cars and space elevators [6], why shouldn’t they read our minds?

Postscript
To ensure that the data for the analysis above were not biased by personalization, they were generated using software outside of a browser. To achieve similar objectivity within a browser, you would need to sign out of any Google accounts and clear the browser history.

References

[1] http://ti.me/1fMf0q1
[2] http://bit.ly/M8J0EB
[3] http://on.mash.to/N5tv0R
[4] H. M. Hall and J. Grinnell, “Life-Zone Indicators in California,” Proc. Calif. Acad. Sci., vol. 9, pp. 37–67, 1919.
[5] https://en.wikipedia.org/wiki/Hierarchical_clustering

About Andrew B. Collier

Andrew lives in Durban, South Africa, with his wife and an extensive collection of used running shoes. He has a Ph.D. in Physics from the Royal Institute of Technology, Stockholm, but is currently masquerading as a Mathematician. He is interested in data analysis, automated FOREX trading, photography, cooking and running.
