Password linguistics: words in your security codes

Alexander Fishkov, Ph.D. student Computer Science

In a previous post about the 10 million password dataset, we treated passwords as atomic objects, focusing on the various usage statistics. Here we will study the contents of the passwords, looking into English words that people use in their credentials.

For this task we need a dictionary of sorts: a list of common English words to look for in the password text. A popular source of wordlists is the SCOWL database. Along with a set of pre-built dictionaries, it can generate a custom wordlist for your specific needs. We used the largest base U.S. dictionary, retained only common American spelling variants, and added a specialized “hacker” mini-dictionary (so that, e.g., “qwerty” counts as a word). The total number of words in the dictionary is around 500,000. When matching passwords against the dictionary, we follow a “greedy” approach: the password “baseball123” matches only the word “baseball” and not “base” or “ball.”
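To make the greedy rule concrete, here is a minimal sketch in Python. It is one plausible reading of the procedure, not the exact code behind this post: it scans the password left to right and, at each position, takes the longest dictionary word that starts there. The function name `greedy_match`, the `min_len` cutoff and the toy `WORDS` set are assumptions made for illustration.

```python
def greedy_match(password, words, min_len=3):
    """Return dictionary words found by a left-to-right longest-match scan.

    `min_len` is an assumed cutoff that skips very short words; the post
    does not say whether such a cutoff was used.
    """
    text = password.lower()
    max_len = max((len(w) for w in words), default=0)
    found = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "baseball" wins over "base".
        for end in range(min(len(text), i + max_len), i + min_len - 1, -1):
            if text[i:end] in words:
                found.append(text[i:end])
                i = end  # consume the matched word
                break
        else:
            i += 1  # no word starts here; move on one character
    return found


# Toy dictionary for illustration; the real list has around 500,000 entries.
WORDS = {"base", "ball", "baseball", "love", "qwerty", "password"}
print(greedy_match("baseball123", WORDS))
```

Because the longest candidate is tried first, “base” and “ball” are never reported once “baseball” has been consumed.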

Using the above procedure we get a list of words from the dictionary with the number of times they were used in passwords. First, we plot this count against the word’s rank in the ordered list to see if the relationship is similar to that of a natural language. We can clearly see that the relationship is far from classic Zipf’s law.

The difference is most visible in the top ranks, where the most popular words are. Zipf’s law suggests that there will be a few extremely popular words that dominate the lower ranks by orders of magnitude, like “the,” “be” or “and” in English. In the passwords dataset we see that at the head of the list, the counts lie very close to one another.
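One rough way to quantify such a comparison is to fit a straight line to log(count) against log(rank): classic Zipf behavior gives a slope near -1, while a head of nearly equal counts gives a slope much closer to 0. The sketch below assumes this least-squares check, which is not necessarily how the plot for this post was analyzed, and the two count lists are synthetic, made-up numbers rather than values from the dataset.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Synthetic counts for illustration only (NOT the article's data).
ideal = [30000 / r for r in range(1, 11)]          # classic Zipf: count ~ 1/rank
flat_head = [30000 - 500 * r for r in range(10)]   # top counts close together

print(round(zipf_slope(ideal), 2))
print(round(zipf_slope(flat_head), 2))
```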

The most popular word is “love” (used 30,337 times), with “password” and “qwerty” in second and third places. In terms of the number of characters used, the longest words are “anthropomorphically,” “antivivisectionist” and “anachronistically.” Notice that all of them have a prefix of “an.”

If we split the dataset by the number of words used in the password, we see that over 55 percent have no words in them (or at least no words from the dictionary used).

The highest number of distinct words in a password is eight. Here are some peculiar examples of long passwords: “steel running wild with the stars above” and “club penguin is my life and its cool.” It may seem silly at first, but passwords like this are more “efficient”: as we saw in a previous post, the average eight-symbol password gives 7*10^14 combinations, while three words from our dictionary create 500000^3 = 1.25*10^17 combinations. Remembering three words is much easier than eight possibly unrelated characters, and it also gives you roughly two orders of magnitude (a factor of about 180) in strength against a brute force attack!
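The arithmetic is easy to check. The short Python snippet below takes the 7*10^14 figure as given from the earlier post and assumes each of the three words is drawn independently from the full 500,000-word dictionary; the variable names are ours.

```python
import math

chars_8 = 7e14           # eight-symbol figure quoted from the earlier post
words_3 = 500_000 ** 3   # three independent words, ~500,000-word dictionary

ratio = words_3 / chars_8
print(f"three words: {words_3:.3e} combinations")
print(f"ratio: {ratio:.0f}, about {math.log10(ratio):.1f} orders of magnitude")
```

The ratio comes out at about 180, a little over two orders of magnitude.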


About Alexander Fishkov


Alexander is a Ph.D. student in Computer Science. He currently holds B.S. and M.S. degrees in Applied Math. He has experience working for industry major companies performing research in the fields of machine learning, data mining and natural language processing. In his free time, Alexander enjoys hiking, Nordic skiing and traveling.

One thought on “Password linguistics: words in your security codes”

Oscar says:

March 25, 2017 at 10:03 am

Nice, informative article. I know it’s 2 years old; however, I disagree with you on the use of grouped words. It would still be susceptible to a brute force dictionary attack and reverse hashing. Grouping words would serve users best only if they use it as a pass-phrase. An example is “I love my dog skip he is cute,” then using letter combinations from this phrase with number substitution for letters, as such: ILmD5h1c

Great article, I enjoyed the reading