Passwords on the internet: publicly available dataset

Alexander Fishkov, Ph.D. student Computer Science

Preamble

Protecting your personal data on the web has always been a big concern. Right now, a lot of sensitive information about an individual can be accessed using the internet, including social media and online banking services. Your passwords become the first line of defense against cyber-criminals of any kind. Having a good password has become a must. Most people don’t even think about going out without locking the house, and many will not leave a car window down in an unknown neighborhood. The situation is very different on the internet, however; many people still use “qwerty” or “123456” as their password, and it is often written down on a piece of paper stored near their PC.

Now, instead of further scolding the readers about the security issues with their passwords and giving the usual advice on making a better password, we would like to analyze the properties of real passwords in use on the internet. It can be difficult to find a good dataset for these purposes. Luckily, security specialist Mark Burnett has released a collection of ten million usernames and passwords gathered from the internet. This is one of the largest data collections of this sort available to the public. One may question whether this sample is representative and up-to-date, but this discussion is beyond the scope of this post. The main point is that these combinations were used by real people at some point, and they can still lead to interesting findings.

Popularity and uniqueness

Let’s begin by examining the popularity of different passwords. Among the 10 million passwords in the dataset, only about one-half are unique. We ranked these by usage count and looked at the 20 most popular. About half of them include number sequences. Folklore text passwords like “qwerty” and “letmein” are also popular. Together, these 20 passwords constitute only 1.8 percent of the dataset.
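A minimal sketch of how such counts could be obtained, assuming the passwords are already loaded into a Python list. The function name popularity_summary and the toy sample are illustrative only and are not taken from the original analysis.

```python
from collections import Counter

def popularity_summary(passwords, top_n=20):
    """Return (unique_fraction, top_share, top) for a non-empty list of passwords.

    unique_fraction: distinct passwords divided by total entries
    top_share:       share of all entries covered by the top_n passwords
    top:             list of (password, count) pairs, most common first
    """
    counts = Counter(passwords)
    total = len(passwords)
    top = counts.most_common(top_n)
    unique_fraction = len(counts) / total
    top_share = sum(count for _, count in top) / total
    return unique_fraction, top_share, top

# Toy sample standing in for the real dataset (hypothetical values)
sample = ["123456", "qwerty", "123456", "letmein", "hunter2", "123456"]
unique_fraction, top_share, top = popularity_summary(sample, top_n=2)
print(f"unique: {unique_fraction:.2f}, top-2 share: {top_share:.2f}")
print(top[0])
```

On the real dataset, the same call with the default top_n=20 would produce the kind of figures quoted above.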

If we look at the full ranking by usage count, passwords from the dataset roughly follow Zipf’s law, an empirical relationship that is most notably observed in natural languages. In this case, it can be described as follows: each order-of-magnitude increase in a password’s rank corresponds to a proportional decrease in the order of magnitude of its usage count, so the ranking looks close to a straight line on a log-log plot.

[Figure: password usage count by rank (passwords_zipf2)]
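One way to check a Zipf-like relationship is to fit a straight line to log(count) against log(rank) and look at its slope. The sketch below does this with an ordinary least-squares fit on synthetic data; the function name zipf_slope and the sample are assumptions made for illustration, and this is not necessarily how the original plot was produced.

```python
import math
from collections import Counter

def zipf_slope(passwords):
    """Least-squares slope of log10(count) against log10(rank).

    Needs at least two distinct passwords. A slope near -1 corresponds
    to the classic form of Zipf's law.
    """
    counts = sorted(Counter(passwords).values(), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic sample: the password at rank r appears about 1000 / r times
sample = []
for rank in range(1, 51):
    sample.extend([f"pw{rank}"] * (1000 // rank))

slope = zipf_slope(sample)
print(f"fitted slope: {slope:.2f}")
```

A slope close to -1 means that a tenfold increase in rank goes with roughly a tenfold drop in usage count.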

Password properties

Some modern websites and services impose certain restrictions on your password length and character set. Let’s explore how this affects the 10 million password dataset. To do this, we count the absolute number of occurrences as well as the number of unique passwords.

Password length is distributed approximately exponentially, with the peak at a length of eight characters: 2.98 million passwords, or about 30 percent of all the passwords in the dataset. This is not surprising, since a majority of internet services now require a minimum password length of eight characters. Together, passwords of six to nine characters cover 73 percent of all entries.
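The length statistics can be reproduced with a simple tally. This is a sketch under the assumption that the passwords are held in a list; length_shares and the sample values are hypothetical.

```python
from collections import Counter

def length_shares(passwords):
    """Map each password length to its share of all entries."""
    lengths = Counter(len(p) for p in passwords)
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}

# Toy sample standing in for the real dataset (hypothetical values)
sample = ["123456", "password", "qwerty12", "letmein", "abc", "iloveyou"]
shares = length_shares(sample)

# Share of entries whose length is between six and nine characters
covered = sum(share for length, share in shares.items() if 6 <= length <= 9)
print(shares)
print(f"lengths 6-9 cover {covered:.0%} of the sample")
```

Passing set(passwords) instead of the full list gives the same breakdown for unique passwords only.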

Length is not the only factor in a password’s strength. The size of the alphabet that its characters are drawn from is also important. This might seem strange at first, since a single password cannot contain more unique characters than its length. However, a brute-force attack is usually assumed, and the attacker has to define the set of possible characters before running it. This set is usually specified as a union of groups. Canonically, these groups are letters (a-z, A-Z), digits (0-9) and special characters (!@#$…). One can also distinguish between capital and lowercase letters, but it will make our graph less pretty. The majority of passwords (99 percent) can be described by the pattern (a-zA-Z0-9).
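The grouping can be sketched as a small classifier that reports which character groups a password draws from. Treating string.punctuation as the "special characters" group and lumping everything else into "other" are assumptions made for this example, not definitions from the original analysis.

```python
import string
from collections import Counter

# Character groups; using string.punctuation for "special" is an assumption
GROUPS = {
    "letters": set(string.ascii_letters),
    "digits": set(string.digits),
    "special": set(string.punctuation),
}

def char_groups(password):
    """Return the sorted names of the character groups used by a password."""
    found = set()
    for ch in password:
        for name, chars in GROUPS.items():
            if ch in chars:
                found.add(name)
                break
        else:
            # Whitespace, non-ASCII characters and anything else unlisted
            found.add("other")
    return tuple(sorted(found))

# Toy sample standing in for the real dataset (hypothetical values)
sample = ["qwerty", "123456", "abc123", "p@ss1", "letmein"]
pattern_counts = Counter(char_groups(p) for p in sample)
for pattern, count in pattern_counts.most_common():
    print("+".join(pattern), count)
```

A password counts as matching (a-zA-Z0-9) when its groups are a subset of letters and digits.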

The alphanumeric set (a-zA-Z0-9) contains 62 characters (26 lowercase letters, 26 capital letters and 10 digits), which gives us 62^8 (around 2.2*10^14) combinations for the most popular password length.
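The size of the search space can be checked directly by building the alphabet and raising its size to the password length. The helper name keyspace is hypothetical, and the comparison with smaller alphabets is added purely for illustration.

```python
import string

def keyspace(alphabet, length):
    """Number of distinct strings of the given length over the alphabet."""
    return len(set(alphabet)) ** length

alphanumeric = string.ascii_lowercase + string.ascii_uppercase + string.digits
length = 8  # the most popular password length in the dataset

combinations = keyspace(alphanumeric, length)
print(len(alphanumeric), f"{combinations:.2e}")

# Smaller alphabets for comparison
print("digits only:", f"{keyspace(string.digits, length):.2e}")
print("lowercase only:", f"{keyspace(string.ascii_lowercase, length):.2e}")
```

Restricting an eight-character password to digits shrinks the search space to 10^8, which is why the character set matters as much as the length.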

 

About Alexander Fishkov

Alexander is a Ph.D. student in Computer Science. He currently holds B.S. and M.S. degrees in Applied Math. He has experience working for industry major companies performing research in the fields of machine learning, data mining and natural language processing. In his free time, Alexander enjoys hiking, Nordic skiing and traveling.
