The art and science of passwords

Humanicus
17 min readNov 30, 2021

--

Photo by Towfiqu barbhuiya on Unsplash

Choose passwords, keep them while protecting them, succeed in revealing them: the science of passwords is a treasure trove of subtleties.

We all use passwords to protect our accounts, email, access to particular sites, computers, phones, etc. Often, when we choose a password, the system refuses it, telling us that our choice is too simple and that annoys us. Among the advice given was not to write down passwords on a piece of paper next to the computer. It is also necessary to give up using the same password on several websites, to change it regularly and… not to forget it.

To fully understand all this, is a little common sense enough? Nay! Do you know why 6 characters are not enough for a good password and why you should not use all lowercase letters? Did you know that there are methods of indirectly storing passwords on a computer server which, even if visited by a hacker, will prevent the hacker from profiting from what it finds? Did you know that between “doing a lot of calculations” and “memorizing a lot of data”, you can find compromises and thus succeed in breaking passwords that would otherwise be inaccessible? Understanding all of this will be our goal.

When you are asked for a password, it falls within a certain space of possibilities, the size of which you should be aware of. If you choose 6 lowercase letters for your password, for example “afzjxd”, “uncle”, “secret”, “wwwwww”, the space has 266 = 308 915 776 possibilities. There are indeed 26 possible choices for the first letter, 26 possible choices for the second, etc. These choices are independent, you are not required that all letters be different, and therefore the size of the password space is the product of the possibilities, which gives 26 × 26 × 26 × 26 × 26 × 26 × 26 = 26^6.

Combinatorics

If you choose a password of 12 characters taken from upper case, lower case, 10 digits and 10 symbols & é §? ç ù œ / + =, there are 72 possibilities for each of the 10 signs of the word; the space of possibilities is then 7212, that is to say:

19 408 409 961 765 342 806 016 ≈ 19 × 10^21.

This is over 62 trillion times the size of the first space. Considering all the passwords in the second space one at a time will take 62 trillion times longer. If it took your computer a second to visit the first space, it would take two million years for it to examine each of the passwords in the second space: the proliferation of possibilities has made some possible operations on the first space impractical.

To measure the size of these spaces, it is common to count the number of binary digits of the number N of possibilities. This number is given by the formula 1 + integer part (log2 (N)). For the first example, the measurement gives 29 bits, and for the second 75 bits. We also talk about 29-bit entropy, or 75 bits.

The National Information Systems Security Agency (Anssi) advises on these issues. When it comes to passwords or secret keys for encryption systems that you absolutely want to be secure, she advises only using spaces of at least 100 bits. It is even recommended a size of at least 128 bits if you want the security to be ensured for several years. The Anssi specifies that spaces of less than 64 bits must be considered as very small, as weak those between 64 and 80 bits, as medium those between 80 and 100. The difference between instantaneous security and long-term security comes from Moore’s Law, according to which the computing power of a computer at a given price doubles about every two years. Even if Moore’s Law seems to be losing its pace, it is wise to take it into account for passwords that we want to secure permanently.

How long to go through all the passwords?

To make a password difficult to find, it must be chosen uniformly from a wide range of possibilities. According to the length Ade the list of characters authorized to compose it and according to the number N of symbols which compose the password, the size T = AN of this space varies considerably. Here are five examples.

In each of the examples, we specify A, N, T and the number D of hours it takes to explore the space of possibilities, followed by the number X of years that it would take, assuming that the law of Moore (“doubling of computing capacity every two years”) is still valid, so that this space becomes explorable in less than an hour. We take as a starting point the assumption that in 2018 your machine explores a billion possibilities per second. So we have :

  • T = AN, D = T / (109 × 3600),
  • X = 2 log2 [T / (109 × 3600)].

_____________________________________________________

  • A = 26, N = 6
  • T = 308 915 776
  • D = 0.000 085 8 hour of calculation
  • X = 0 year waiting

_____________________________________________________

  • A = 26, N = 12
  • T = 9.5 × 1016
  • D = 26,508 hours of calculation
  • X = 29 years of waiting
  • A = 100, N = 10
  • T = 1020
  • D = 27,777,777 computing hours
  • X = 49 years of waiting

_____________________________________________________

  • A = 100, N = 15
  • T = 1030
  • D = 2.7 × 1017 hours of calculation
  • X = 115 years of waiting

_____________________________________________________

  • A = 200, N = 20
  • T = 1.05 × 1046
  • D = 2.7 × 1033 hours of calculation
  • X = 222 years of waiting

_____________________________________________________

For a truly robust password in the Anssi sense, it would be necessary, for example, to request a sequence of 16 characters each taken from a set of 200 characters. This would make a 123-bit space.

This is hardly an option, so system designers are less demanding and are content with weak or medium password spaces; they only apply the long key requirement when the user does not have to memorize it themselves, as it is handled automatically by the system.

Other methods are used to prevent exploration of the space of possibilities and crack passwords. The simplest is well known and used by bank cards: after three unsuccessful attempts, the card is blocked. Other ideas have been suggested such as doubling the wait times between two retries with each new error, unless a long time has passed, for example 24 hours, in which case the system will reset.

These methods are ineffective, however, when the attacker of the system can work offline and secretly conduct tests in isolation, or the system cannot be configured to stop and curb unsuccessful attempts.

Quite often, an attacker succeeds in obtaining the encrypted password or password fingerprint (we’ll see what a fingerprint is) by gaining access to data on the system he is attacking. He can then, in complete peace, for days or even weeks, make any attempt he wishes to crack the passwords he has seized in encrypted form.

Before explaining the subtle devices used by such attackers, let’s review the size of the space of possibilities.

When we evaluated the size in bits of a key or password space, called entropy, we implicitly assumed that the user uniformly randomly chooses their password from that space. Most often, this is not the case, because to memorize a password we prefer a common word, for example “locomotive”, to a really arbitrary word, for example “xdichqewax”.

Dictionaries to attack

This is a serious problem that gives rise to dictionary attacks. Lists of commonly used passwords have been collected and categorized according to the frequency with which these passwords are used. An attacker will attempt tries by taking one of these lists and using it in order. This works remarkably well because, without particular constraints, we are all tempted to choose simple words, surnames, first names, short sentences, and the possibilities are then relatively few.

This uneven use of possibilities is tantamount to a reduction in the space of possibilities, which decreases the average number of attempts to crack a password.

Here are the first 25 items of one of these password dictionaries. It is taken from a database of 5 million passwords leaked in 2017 and spotted by SplashData:

1,123456; 2. Password; 3,12345678; 4. qwerty; 5. 12345; 6. 123456789; 7. letmein; 8. 1234567; 9. football; 10. iloveyou; 11. admin; 12. welcome; 13. monkey; 14. login; 15. abc123; 16. starwars; 17. 123123; 18. dragon; 19. passw0rd; 20. master; 21. hello; 22. freedom; 23. whatever; 24. qazwsx; 25. trustno1.

Notice the “password” in the second position and the nice “iloveyou” in position 10. Of course, such lists change according to the country where they are collected, the sites concerned and over time.

For 4-digit passwords (eg PIN code for phone SIM cards) the results are even more amazing. In 2013, the DataGenetics site, relying on a collection of 3.4 million passwords, each with 4 digits, measured that the most used 4-digit sequence was 1234, which represented 11% of the choices, followed by by 1111 and 0000 with scores of 6% and 2%.

Hackers disappointed with hash functions

Cryptographic hash functions are carefully developed processes that transform a computer file F, however long, into a small-sized sequence h (F) called hash of F. For example, the function denoted SHA256 transforms the sentence “The weather is nice” in: 316FF650DC5FFC6F8B9CFF25069942F4AA5088 78930A1410D81D5F1DD8C71077 (64 hexadecimal characters which are equivalent to 256 bits).

Changing a single character in the file completely changes its footprint. For example, for “the weather is nice”, the SHA256 function gives the fingerprint: D66D08EC93EBA6ACCDD552E2A38EBE6474285E0D4F2370A87E206553E29AB93C.

You can redo and verify these calculations at: https://passwordsgenerator.net/sha256-hash-generator/ or https://www.xorbin.com/tools/sha256-hash-calculator.

Good hash functions produce results analogous to those which would be obtained if they were drawn uniformly at random. Above all, for a possible arbitrary result (a sequence of 64 hexadecimal characters), it is impossible to find in a reasonable time an F file with this fingerprint.

There are several generations of hash functions. The SHA0 and SHA1 generations are considered obsolete and deprecated. The SHA2 generation, to which SHA256 belongs, is considered safe. Instead of memorizing their client’s passwords, Internet servers must memorize the fingerprints of these passwords. In the event of an intrusion, it will then be impossible or very difficult for the hacker to exploit what they find.

The least used 4-digit password was 8068. Beware though, that may not be true now that these results have been released. The choice 8068 was only present 25 times among the 3.4 million 4-digit sequences in the database, which is much less than the 340 uses that would have been found if each 4-digit combination had been used with the same frequency.

The first 20 series of 4 digits are: 1234; 1111; 0000; 1212; 7777; 1004; 2000; 4444; 2222; 6969; 9999; 3333; 5555; 6666; 1122; 1313; 8888; 4321; 2001; 1010.

Even without a password dictionary, using the differences in letter frequencies (or double letters) in a language can effectively organize an attack. Some attack methods also take into account that, to facilitate memorization, we choose words with a certain structure such as A1 = B2 = C3, AwX2AwX2, O0o.Ili. (which I have used for a long time), or obtained by combining several single words like password123 or johnABC0000. By exploiting such regularities, we determine an order of attack of the passwords which accelerates the research. The conclusion is simple: in a given space, you really have to choose your password at random. Some software does this for you and therefore gives you a random password. However, be aware that software that generates a random password risks using, deliberately or not, a bad pseudo-random generator, in which case what it offers will be imperfect.

Informed of all this, properly designed websites analyze the passwords offered at the time of their creation and refuse those that would be too easy to find. It’s annoying, but it’s for your good!

Test passwords …

A tool allows you to check if one of your passwords has not already been hacked. It uses a database of more than 500 million passwords obtained by grouping together lists revealed following various attacks:

https://haveibeenpwned.com/Passwords

I tried e = mc2e = mc2 which I liked and thought sure, and got the bewildering response: “This password has been seen 110 times before. A few other tests show that it is difficult to come up with a password that is easy to remember and that the base does not know. For example, aaaaaa has occurred 353,071 times; a1b2c3d4, 105134 times; abcdcba, 353 times; abczyx: 158 times; acegi, 111 times; chirac, 600 times; sarkozy, 680 times; holland, 832 times; macron, 279 times.

Rainbow tables

You are a hacker and you are looking for a way to exploit one or more data that you have obtained. These data are of the type [user ID, h (password)], where h is a known and fixed hash function (see Box 2). The password belongs to the resulting space with 12 lowercase letters, which corresponds to 56 bits of information and 2612 = 9.54 × 1016 possible passwords.

Method 1. You scroll through all possible M passwords. You calculate everyone’s h (M) footprint by seeing if it’s the right one. You don’t need a lot of memory, because each time you try again, you erase the previous results, except of course the cue point which lets you know where you are in your enumeration.
On the other hand, it takes a long time for you to scroll through all the possible passwords. If your computer can perform a billion tests per second, you need 2,612 / (109 × 3,600 × 24) = 1,104 days, or about 3 years. It is not impossible to imagine and, using a computer network of a thousand machines, a day will suffice. However, it is not possible to repeat such a calculation each time you discover new data in the same category.

Method 2. You say to yourself, “I’m going to calculate fingerprints for all the passwords, which will take some time, but I’m going to memorize these fingerprints in a large table.” Then I’ll just have to look at this table to find out what password was used, and I will use it for any new password in the same category that I can find the fingerprint of. “

It will take (9.54 × 1016) × (12 + 32) bytes of memory, because it takes 12 bytes for the password and 32 for the fingerprint if it has 256 bits as for the SHA256 function. That’s 4.2 × 1018 bytes, or 4.2 million hard drives with a capacity of 1 terabyte!

Memory required is too large. This method is no more feasible than the first. Method 1 requires too much calculation, method 2 too much memory. That gets stuck in both cases: the computation is too long for each new problem, or the precomputation with storage of the results too large to keep the computations made.

Could we find an intermediate method which would require less calculations than method 1 for each new data, but perhaps a little more memory, and which would require less memory than method 2, but perhaps a little more? calculations? Yes it’s possible. The idea was proposed by the American Martin Hellman in 1980, then perfected in 2003 by Philippe Oechslin, of the Federal Polytechnic School of Lausanne, and more recently still by Gildas Avoine, of the Insa of Rennes.

Here is how to proceed: we first equip ourselves with a C function which, from the h (M) hash of the password M produces a new password C (h (M)). Fingerprints are numbers written in binary, and passwords are numbers written in a number base with as many digits as there are allowed symbols (say K). This C function, known as conversion, will for example be a conversion from base 2 to base K. The precomputation creates data tables called rainbow tables, probably because they are like a set of parallel lines. . For a data item in this table, we start from any password M0, we calculate its fingerprint h (M0), then a new possible password C (h (M0)) = M1, and we start again from M1, and so on. Without memorizing anything other than M0, we thus calculate the sequence M1, M2, … until h (Mn) begins with 20 zeros, which only occurs once in about 1,000,000 ( because what produces a hash function is comparable to a uniform random selection
and that 220 is worth roughly 1,000,000). The pair [M0, h (Mn)] is then stored in the table.

We calculate a very large number of couples of this type. If, during a calculation, we fall on the same final h (Mn), we add nothing, except if the calculation was longer (in number of steps), in which case we keep the new pair and we erases the old one.

Each pair [M0, h (Mn)] represents the series of passwords M0, M1, …, Mk and their fingerprints, but without memorizing them. If you do not want to leave any empty space in the calculated table, the calculation time will be longer than that required to scroll through all the passwords in the space you are exploring. This can be done with a network of computers, but we can only partially calculate the table: the database created will only resolve part of the passwords whose fingerprint we will know. In the case of a complete calculation, the memory space to store (indirectly) all the calculated couples is, in order of magnitude, a million times smaller than that of method 2 considered above. It is therefore less than 4 hard drives of 1 terabyte. This time it’s easy.

Let’s see now how this data recorded in the disks allows to find each password of the space envisaged in a few seconds. We will assume for the sake of reasoning that in the precomputation phase of the table, all the passwords of the target category have been taken into account, for example 12-character passwords taken from the 26 letters of the alphabet.

To examine an e0 fingerprint, we proceed as follows. We calculate h (C (e0)) = e1, then h (C (e1)) = e2, etc., until we get an em fingerprint that starts with 20 zeros. We then look in the table to which M0 it is associated. There is necessarily one, because the password we are studying is in one of the series we have scrolled through. We then start from this M0 and calculate h (M0), h (C (h (M0))) = h1, etc., until we find e0, which necessarily happens. Suppose this happens for hk (i.e. hk = e0). The password we are looking for is then C (hk — 1).

The computation time required is that it takes to find em in the table, to which must be added the computation time of h1, h2, …, hk, which is about a million times shorter than the time of calculation having been necessary to calculate the table, that is to say very reasonable.

Thus, having made a precomputation (very long) and having partially recorded the results allows to find in a reasonable time any password of which we know the imprint.

In summary, the idea is: (1) to classify passwords into packets, in each of which are the passwords that result in the same fingerprint starting with 20 zeros; (2) when faced with a password whose fingerprint we know, to first find the package to which it belongs thanks to what we have memorized; (3) Determine in this package where the password is located by performing a fairly short calculation.

Note that being original remains possible. Here are five examples of passwords that the site does not have in its list: eyahaled (my name backwards); bizzzzard; bad space; modeuxpass; abcdef2018. Now that I have tried them, I am wondering if the base could add them during its next update and therefore I will not use them!

The National Institute of Standards and Technology (NIST) in the United States recently published an advisory recommending the dictionary method to filter the password chosen by a user who is creating one.

… And know how to keep them

Among the rules that a good web server designer must absolutely apply, there is this one: do not keep in the memory of the computer which operates the website the list in clear of the couples [user identifier, password].

The reason is obvious: a hacker could access the computer containing this list, either because the site is poorly protected, or because a deep flaw in the system, or in the chip at the heart of the machine, has been exploited. , while it is unknown to everyone (zero-day flaw)… except the attacker.

A first idea is to encrypt the passwords in the list, that is, to use a secret code that transforms them with an encryption key into apparently random sequences of signs for anyone who ignores the decryption key. This method works, but has two drawbacks. You have to decipher the password associated with a user each time you want to compare it to the one it indicates, which is not very easy. More seriously, to carry out this decryption, which is necessary for the comparison, it is necessary to keep the decryption key in the memory of the site computer. However, this key could be spotted by the hacker, which brings us back to the previous problem!

Chopping… preceded by salting!

The correct method is with hash functions which produce hashes. A hash function is a function which, with any given file F, associates a “fingerprint” (we also speak of “hash”, or of “hash”) h (F), which is a rather short word, specific to the file. F, but produced in such a way that in practice it is impossible to find F from h (F). We speak of a one-way function: going from F to h (F) is easy, going from h (F) to F is impossible in practice. In addition, the hash functions used have the property that even if two files F and F ‘may have the same hash h (F) = h (F’) (we speak of collision), in practice no one knows, for a given F, to find an F ‘having the same fingerprint as F.

With such a hash function, storing passwords on a computer server is easy: instead of the list of couples [username, password], only the list of couples [username, h (password)].

When a user wants to log in, the server will read their mdp password, calculate the h (mdp) fingerprint and see if this is what their list of memorized pairs indicates in front of the user ID. The hacker will be very annoyed, because even if he was able to access this list, he will not be able to know the users’ passwords (in practice impossible to switch from h (mdp) to mdp), and will also be in the impossibility of producing another password mdp ‘which would have the same fingerprint and would fool the server (practical impossibility of collisions).

This recommended caution is really important, because even large sites that take the trouble to protect themselves against intrusions are regularly hacked. In 2016, the Yahoo! the data of a billion accounts had been stolen!

To further complicate the use of a stolen list of the type [identifier, h (password)], the “salting” method is sometimes used which consists in adding to the password a string of random characters, different for each password. This way, even if two users use the same password, the memorized footprints will be different. Of course, it will then be necessary for the list held by the server to contain three elements for each user: [username, h (password + salting), salting]. When the server wants to verify the password a user has just given, it adds the salt to it, calculates the fingerprint, and compares it with what it has in its database.

Even when users’ passwords are weak, this method makes the hacker’s job much harder. Without salting, it can calculate all the fingerprints in a dictionary and look at which ones are present in the stolen data; all the passwords present in its dictionary will be identified. With salage, the work of exploiting his dictionary becomes more complicated: the hacker is obliged, for each user, to calculate the fingerprints of all the passwords in his dictionary to which he adds the user’s own salting (that he knows since he stole the list). If there are 1000 users, that multiplies by 1000 the calculation to be made to use his dictionary.

A compromise between too much computation and too much memory

To defend yourself well against an attacker, it is better to know the methods that he could use to locate passwords and more generally to exploit a table of couples [user identifier, h (password)]. We are caught between two difficulties: either it is necessary to calculate a lot, that is to say to have a great power of calculation, or it is necessary to have a lot of memory to carry out a long precomputation, possibly for weeks or months, by recording everything that we calculate and that we can then easily use.

Often, neither method works. There remains the hope of devising an intermediate method, which requires less memory to store the result of a calculation made in advance, and which, at the time of its use, requires a reasonable amount of calculation. Box 3 presents the rainbow table method, which achieves this compromise. The science of passwords, in the age of the Internet, powerful computers and computer networks, is the scene of a merciless struggle between those (each of us) who want to use them to protect ourselves, and those who seek to know them in order to profit from them… at our expense.

--

--

Humanicus
Humanicus

Written by Humanicus

Please follow me since now we need 100 min follower on medium

No responses yet