The Birthday Problem and Hash Collisions

Birthday Problem

In flipping through a book on American history, you notice that two of the 44 U.S. presidents, Warren G. Harding and James K. Polk, were born on November 2. An acquaintance with British history reveals that, of 52 prime ministers, two pairs share the same birthday. And similarly for Australia -- of its 27 prime ministers, the 1st and the 24th were born on the same day. Something might be said for the peculiar birthday clustering of heads of state, if this weren't also true of hockey teams (head coaches of the Montreal Canadiens, for one) and Academy Award-winning directors. These are relatively small groups of people -- 44, 52, 27 -- yet each contains members who share one of 365 possible birthdays. What's going on here?

Consider the following question: How large must a group of people be for there to be a 50% or greater chance that at least two of them have the same birthday? The answer, as we will see, is 23. With 57 people, we are almost guaranteed (99% probability) to find a match. These numbers are low enough to challenge intuition; for this reason, the problem is also known as the Birthday paradox, although it is not a paradox in the logical sense -- it only seems paradoxical at first blush, and we can make sense of it if we consider it with care. First, let's see why this result evades our naive expectation. It has to do with the fact that we are likely unconsciously mulling a different problem: How large must a group of people be for there to be a 50% or greater chance that at least one of them shares a given birthday? Say, your birthday?

We can answer this question without much trouble. Call the probability of finding a match \(p\); we will consider instead the probability that there is not a match, \(\bar{p} = 1 - p\). Assuming that all birthdays in the year are equiprobable, we find \begin{equation} \bar{p} = \left(\frac{364}{365}\right)^n \end{equation} for the probability that none of a group of \(n\) people share your birthday. With \(\bar{p} = p = 0.5\), there must be \(n = 253\) people. Now this is about large enough to seem plausible. As we noted, though, this is a different problem: we are not interested in matching a specific birthday, but in matching any birthday. This too isn't difficult, with 1
We can also find \(p\) by summing the individual probabilities that exactly 2 people, exactly 3 people, and so on, share birthdays. For \(k=2\) people, we have $$ p(k=2) = \frac{1}{365^2}\frac{1}{364^{n-2}}\cdot 364\cdot 363\cdots (364-n-3) = \frac{1}{365^2}\frac{1}{364^{n-2}}\frac{364!}{(364 - n -2)!} $$ Summing all cases, we have $$ p = \sum_{k=2}^\infty \frac{1}{365^k}\frac{1}{364^{n-k}}\frac{364!}{(364-n-k)!} $$ which is admittedly less wieldy than Eq. (\ref{bday}). It is nice, though, that the above infinite series sums to the closed-form expression Eq. (\ref{bday}).
$$ \bar{p} = \left(\frac{365}{365}\right)\left(\frac{364}{365}\right)\left(\frac{363}{365}\right)\cdots \left(\frac{365-n+1}{365}\right) = \frac{365!}{(365-n)!}\frac{1}{365^n} \label{bbday} $$ giving the probability that no two out of \(n\) people share a birthday. Note that this probability is zero when \(n > 365\), since then there must be a match (a guarantee that follows from the pigeonhole principle). Then, $$ p = 1 - \bar{p} = 1 - {365 \choose n}\frac{n!}{365^n}. \label{bday} $$ The quantity \({365 \choose n}n!\) gives the number of ways \(n\) distinct objects can be selected from a set of 365, without replacement, when the order is relevant. Solving Eq. (\ref{bday}) with \(p = 0.5\) gives \(n=23\) people. We have succeeded in demonstrating the truth of the Birthday Problem, but how are we to interpret it?
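Both thresholds are easy to check numerically. The following sketch (plain Python, standard library only) computes the smallest group with a 50% chance of some shared birthday and the smallest group with a 50% chance of matching a given birthday:

```python
def p_fixed_birthday(n: int) -> float:
    """Probability that at least one of n people has a *given* birthday."""
    return 1.0 - (364 / 365) ** n

def p_any_shared_birthday(n: int) -> float:
    """Probability that at least two of n people share *some* birthday."""
    p_bar = 1.0
    for k in range(n):  # product of (365 - k)/365 for k = 0 .. n-1
        p_bar *= (365 - k) / 365
    return 1.0 - p_bar

# Smallest group with a >= 50% chance that some two people share a birthday:
n_any = next(n for n in range(1, 400) if p_any_shared_birthday(n) >= 0.5)
print(n_any)    # 23

# Smallest group with a >= 50% chance that someone matches a given birthday:
n_fixed = next(n for n in range(1, 400) if p_fixed_birthday(n) >= 0.5)
print(n_fixed)  # 253
```

The same loop confirms the 99% figure: `p_any_shared_birthday(57)` evaluates to just over 0.99.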

Consider that the number of different pairs that can be formed from a group of 23 people is \({23 \choose 2} = 253\). We've just seen this number -- it's the number of people that we need to sample in order to have a 50% chance of finding one that shares our birthday. And so it's not the number of people that matters -- it's the number of comparisons. In the first case that we analyzed, we were looking to match a given birthday. We found that 253 comparisons would be needed to furnish a 50% chance of a match, and since we are singling out one birthday, we must compare one person with 253 others. The second case, the one addressed in the Birthday Problem, does not fix the birthday. We still need 253 comparisons to find a match, but since any birthday will do, these 253 comparisons are distributed among the sample group of 23 people. Each person is compared against 22 others, and there are 23 such sets of comparisons. After dividing by two to account for double-counting (it's the same whether we compare Bob to Alice or Alice to Bob), we find \(23\times 22/2 = {23 \choose 2} = 253\).
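The pair-counting argument takes one line to verify:

```python
from math import comb

# 253 comparisons against a fixed birthday, versus 253 pairwise comparisons
# distributed among 23 people -- the comparison count is what matters.
print(comb(23, 2))   # 253
print(23 * 22 // 2)  # 253, the same count after removing double-counting
```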

Fig. 1 (left) Topology of the first case considered, in which a specific birthday (center) is compared against 253 others (points around the circumference). (right) Topology of the case considered in the Birthday Problem, in which the 253 comparisons are distributed among 23 people (points around the circumference). In both cases 253 comparisons are made. Images created by M. Kresch using gnuplot.

Hash Collisions

The birthday problem has relevance to the security of hashing protocols. A hash function, \(h\), is an algorithm that takes an arbitrary-length input string (the pre-image), \(m\), and produces a fixed-size output (the image), \(h(m)\). Cryptographic hash functions should be effectively one-way: given a hash, \(y\), it should be computationally infeasible to find a pre-image, \(m\), such that \(y=h(m)\); this is known as pre-image resistance, and it is effectively a statement about the infeasibility of "inverting" the hash. Hash functions should also be resistant to collisions, of which there are a few types:

     Weak collision: Given a pre-image, \({\color{black}{m_1}}\), another pre-image, \({\color{black}{m_2}}\), is found such that \({\color{black}{h(m_1) = h(m_2)}}\). This is also known as a second-pre-image attack.

     Strong collision: Any two pre-images, \({\color{black}{m_1}}\) and \({\color{black}{m_2}}\), are found that hash to the same value, \({\color{black}{h(m_1) = h(m_2)}}\).

     Chosen-prefix collision: Given two different prefixes, \({\color{black}{p_1}}\) and \({\color{black}{p_2}}\), messages \({\color{black}{m_1}}\) and \({\color{black}{m_2}}\) can be found such that their respective concatenations satisfy \({\color{black}{h(p_1 || m_1) = h(p_2 || m_2)}}\).
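To make the definitions concrete, here is a minimal illustration using Python's standard hashlib (SHA-256 is chosen only as a familiar example, not because the article discusses it specifically): pre-images of any length map to images of a fixed size.

```python
import hashlib

# A hash maps an arbitrary-length pre-image m to a fixed-size image h(m).
# SHA-256's image is always 32 bytes (64 hex characters), however long m is.
for m in [b"a", b"a much longer pre-image " * 1000]:
    digest = hashlib.sha256(m).hexdigest()
    print(len(digest), digest[:16])  # 64 hex characters every time
```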

Interestingly, the "one-way-ness" and collision resistance of a hash function are in tension: while a many-to-one function is hard to invert, the chance of collisions increases along with the number of messages that hash to the same image.

A weak collision was the subject of the first case that we analyzed: we had a fixed birthday that we were trying to match. Modern hash functions, however, have image spaces that are much larger than 365; for example, the hash algorithm MD5 produces a 128-bit hash, with \({\color{black}{2^{128} = 3.4\times 10^{38}}}\) possible values. For an \({\color{black}{N}}\)-bit hash function, one has a probability \({\color{black}{p}}\) of finding a weak collision after hashing \({\color{black}{n}}\) trial messages, where $$ {\color{black}{n = \frac{\ln(1-p)}{\ln(1-2^{-N})}}}. $$ For a 128-bit hash, we need \({\color{black}{n=2.36\times 10^{38}}}\) trials 2
Note that a direct calculation of the difference \(1-2^{-N}\) requires \(\mathcal{O}(N)\) bits of precision.
to have a 50% chance of finding a weak collision, and so in general, an \({\color{black}{N}}\)-bit hash requires \({\color{black}{\mathcal{O}(2^N)}}\) comparisons. As expected, this is of the same order of effort required to exhaust (brute-force) the full space of possibilities.
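The precision caveat in the footnote bites in practice: in double precision, \(1-2^{-128}\) rounds to exactly 1.0, so the naive formula divides by zero. A small sketch in Python, where `math.log1p` evaluates \(\ln(1-2^{-N})\) without ever forming the difference:

```python
import math

def weak_collision_trials(N: int, p: float) -> float:
    """Trials for probability p of a weak collision against an ideal N-bit
    hash: n = ln(1-p) / ln(1 - 2**-N).  math.log1p(x) computes ln(1+x)
    accurately for tiny x, sidestepping the O(N)-bit precision a direct
    subtraction would require."""
    return math.log1p(-p) / math.log1p(-(2.0 ** -N))

# For a 128-bit hash and p = 0.5, this is ln(2) * 2**128, about 2.36e38.
print(f"{weak_collision_trials(128, 0.5):.3e}")
```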

What about a strong collision? This is where the Birthday problem comes in: we don't care here what the matching hashes are -- just that we find a match. The probability of finding a collision is given by Eq. \(\ref{bday}\) with 365 replaced by \(2^N\) for an \(N\)-bit hash. We can find an approximation for the probability of finding a collision, \(p\), as a function of the number of trials, by making some clever adjustments to the \(N\)-bit hash version of Eq. \(\ref{bbday}\). First, we recast the product of fractions as $$ 1\times\left(\frac{2^N - 1}{2^N}\right)\times \cdots \times\left(\frac{2^N-n+1}{2^N}\right) = \prod_{k=0}^{n-1} \left(1-\frac{k}{2^N}\right). $$ For small \(k\), the fraction \(x = k/{2^N} \ll 1\), and so \(1 - x \approx \exp(-x)\) and \begin{eqnarray} p \approx 1-\prod_{k=0}^{n-1}\exp\left(-\frac{k}{2^{N}}\right) &=& 1-\exp\left(-\sum_{k=0}^{n-1}\frac{k}{2^{N}}\right) \nonumber \\\\ &\approx& 1 - \exp\left[-\frac{(n-1)n}{2^{N+1}}\right], \end{eqnarray} where we have written the sum \(\sum_{k=0}^{n-1} k = (n-1)n/2\). We can now obtain the number of trials needed to find a strong collision as a function of probability $$ \label{bdaycoll} n \approx \sqrt{-2\ln(1-p)}\,2^{N/2}. $$ For \(p=0.5\), the coefficient of \(2^{N/2}\) is close to 1, with the result that for an \(N\)-bit hash only \(2^{N/2}\) messages need to be sampled to have a 50% chance of finding a strong collision 3
An alternative, more boneheaded way to arrive at this result (and the way I originally derived it) is to use Stirling's approximation to get rid of the factorial terms: \(\ln n! = n\ln n - n + \mathcal{O}(\ln n)\). This approximation is actually very good for large \(n\), which is right where we wish to apply it. One finds \begin{eqnarray} p &=& 1- \left(\frac{2^N}{2^N-n}\right)^{x}\exp(-n) \approx 1 - \exp\left[-\frac{(2n-1)n}{2^{N+1}}\right]\nonumber \end{eqnarray} where \(x = 2^N - n + 1/2\). This gives \(n = \sqrt{-\ln(1-p)}\,2^{N/2}\), which happens to be off by a factor of \(\sqrt{2}\) from Eq. \(\ref{bdaycoll}\).
. All of this assumes that the hash function is ideal -- that it is a random mapping from the space of messages to the set of possible output values. 4
Keep in mind that we aren't randomly sampling the hash values directly, as we did for birthdays in the Birthday problem. In reality, we sample messages that we then hash; that an arbitrary sampling of messages implies a random "sampling" of hashes is an assumption only true for ideal hash functions.
No real hash function, no matter how good, is truly random in this sense, and so the results we are obtaining here represent the greatest security we can expect out of a hash function. Any weakness of a real hash function might be leveraged to find a collision before the inherent limitation imposed by the Birthday problem is reached. For this reason, a hash function is said to be cryptographically broken if an attack exists that succeeds within fewer than \({\color{black}{\mathcal{O}(2^N)}}\) rounds for a weak collision and \({\color{black}{\mathcal{O}(2^{N/2})}}\) for a strong collision.
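The square-root workload can be made concrete with a toy experiment. Truncating SHA-256 to 32 bits (a stand-in of my own choosing for an ideal 32-bit hash -- no real protocol uses such a thing) lets us find a strong collision in roughly \(\sqrt{2\ln 2}\cdot 2^{16}\approx 77{,}000\) trials rather than the \(2^{32}\) a brute-force second-pre-image search would need:

```python
import hashlib

def truncated_hash(m: bytes) -> bytes:
    """A toy 32-bit 'hash': the first 4 bytes of SHA-256."""
    return hashlib.sha256(m).digest()[:4]

def birthday_attack():
    """Hash messages until two distinct pre-images share an image.
    For an ideal 32-bit hash, ~2**16 trials give even odds of a match."""
    seen = {}  # image -> first pre-image that produced it
    i = 0
    while True:
        m = f"msg-{i}".encode()
        h = truncated_hash(m)
        if h in seen:
            return seen[h], m, i + 1
        seen[h] = m
        i += 1

m1, m2, trials = birthday_attack()
assert truncated_hash(m1) == truncated_hash(m2) and m1 != m2
print(trials)  # on the order of 2**16, far below the 2**32 image space
```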

All of that said, for a 128-bit hash, we need roughly \(\mathcal{O}(10^{19})\) trials to have a good chance of finding a strong collision. How long will this take? It depends on the size of the message and the particulars of the hash function. For example, baseline implementations of MD5 can process around 300 MB/s on a modern 64-bit 1.8 GHz CPU [1], and the use of a GPU can increase this rate by a factor of 25 or so [2]. Suppose that we are seeking strong collisions of a 128-bit hash with the speed of MD5 among 1 KB text files: on a modern CPU, we'd be able to hash around 300,000 such messages per second. At this rate, it would take us a million years to have a 50% chance of finding a strong collision; a GPU would reduce this to tens of thousands of years. Either way, this is infeasible. Of course, one could throw multiple GPUs at the problem: with a 100,000-GPU cluster, the time can be reduced to about half a year. A 100,000-GPU cluster is a thing of nations, not organized crime or professional hackers.
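These back-of-envelope timings can be reproduced directly. The 300,000 messages/s rate is the illustrative figure from the text, not a benchmark; the trial count is the \(2^{N/2}\) birthday bound for \(N=128\):

```python
SECONDS_PER_YEAR = 3600 * 24 * 365
TRIALS = 2 ** 64  # ~O(10**19) hashes for a strong 128-bit collision

def years_to_collision(hashes_per_second: float) -> float:
    return TRIALS / hashes_per_second / SECONDS_PER_YEAR

cpu = 300_000  # assumed rate: 1 KB messages at ~300 MB/s
print(f"single CPU: {years_to_collision(cpu):.2g} years")           # millions
print(f"single GPU: {years_to_collision(cpu * 25):.2g} years")      # tens of thousands
print(f"100k GPUs:  {years_to_collision(cpu * 25 * 100_000):.2g} years")
```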

In summary, we have obtained limits on the number of trials necessary to have a good (\(p = 50\%\)) chance of finding weak and strong collisions of an ideal \(N\)-bit hash function: the number of trials needed increases exponentially with \(N\) for weak and with \(N/2\) for strong collisions. In designing a hash function, the specification should be based on the desired level of security; for example, the integrity of a certain document might need to be guaranteed for 50 years. Since the results obtained here apply to ideal functions, they should be interpreted as setting only upper limits on the computational resources required to perpetrate a successful attack. If the security requirement is a time limit, as in the above example, we can vary not just the size of the hash but also the speed of the function. As we saw above, a perfect hash function with the size and speed specifications of MD5 will not succumb to a collision attack in a practical amount of time. However, an ideal 64-bit hash function with the same performance as MD5 will yield a strong collision in around 4 hours on a single CPU -- easily achievable by unsophisticated attackers, especially given that a real hash function might have flaws that further reduce the required effort.
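The closing 64-bit figure checks out the same way, using the 300,000 messages/s CPU rate assumed in the text:

```python
# An ideal 64-bit hash needs ~2**32 trials for a strong collision.
trials = 2 ** 32
rate = 300_000  # messages/s, the MD5-like CPU rate from the text
hours = trials / rate / 3600
print(f"{hours:.1f} hours")  # ≈ 4.0 hours
```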

