A strong 64-bit hash function in Java

Consider a hash function and some random bunch of strings or objects that we apply that hash function to. In some situations, it would be advantageous if we could guarantee to all intents and purposes that each string or object will have a unique hash code. In that situation, we would not need to store the actual keys in a hash table, because the hash code alone would in effect identify that object.

On this page we look at what we'll call a strong hash function¹. A strong hash function is one that produces hash codes with good dispersal. In other words, given a random set of objects to hash, there is a high chance that our hash function will produce corresponding hash codes that are well dispersed over the possible range of hash codes. Hence:

whatever the size of the hash table, the number of collisions will be close to the number that we would "theoretically" expect;
for a given hash code width (i.e. the number of bits in the hash code), we can predict how likely it is for the abovementioned goal to be met— i.e. given a typical random selection of objects, for each object to be given a unique hash code.

Notice that Java's hash table implementation— and hence implementations of hashCode()— don't require this goal to be met. Java maps and sets, rather than storing just the hash code, store the actual key object. This means that implementations of hashCode() generally only need to be "fairly good". It isn't the end of the world if two key objects have the same hash function, because the keys themselves are also compared in deciding if a match has been found.

The limitations of 32-bit hash codes

Unfortunately, with 32-bit hash codes as returned by the Java hashCode() function, doing away with storing the keys will not usually be option: even with the best 32-bit hash function in the world, the chances of a duplicate are non-negligible with fairly modest numbers of objects. Figure 1 shows, for differently sized collectons of objects, the approximate percentage chance of two or more of those objects resolving to the same hash code given a strong hash code:

Table 1: For n random objects with a good 32-bit hash function, approximate chance of at least 2 of the objects having the same hash code.
Number of objects	Chance of a hash code collision
65536 (2¹⁶)	39.3%
32786 (2¹⁵)	11.8%
16384 (2¹⁴)	3.08%
8192 (2¹³)	0.78%
4096 (2¹²)	0.20%
2048 (2¹¹)	0.049%
1024 (2¹⁰)	0.012%
512 (2⁹)	0.0031%

For very small numbers of items added to a hash table keyed only on a 32-bit hash code, the risk of a collision (or chance of needing to take special action) may be acceptable. For example, with 512 items in the table, we'd expect about a 1 in 32,000 chance of a collision (0.0031%). On the other hand, adding 16,384 items means there's a 3.08% or about 1 in 32 chance of two different objects sharing the same hash code. For many applications, this is a modest number of items and too high a probability.

Why 64-bit hash codes?

The pattern of percentages shown in Table 1 essentially holds for hash codes of greater widths too. Or put another way, with an n-bit hash code, there's a roughly 39.3% chance of a collision with 2^n/2 items, an 11.8% chance with 2^{n/2 - 1} items, a 3.08% chance with 2^{n/2 - 2} items etc. Or to give a concrete example, using a 56-bit hash we reach the 0.0031% chance of a collision after adding 2^{56/2 - 7} = 2²¹ = 2,097,152 items. Or put another way, with a good 56-bit hash code, we can safely generate hashes for a few hundred thousand objects and expect those hash codes to be unique.

Rather than use 56-bit hash codes, we may as well use the next size up that it is convenient to manipulate: 64 bits. After all, if we used 56-bit hash codes in Java or practically any programming language, we'd probably end up storing them in 64-bit longs anyway. With 64-bit hash codes, we can store 2^{32 - 7} or over 30 million items in our hash table before there is a 0.0031% chance of a collision. With one million items and a perfect 64-bit hash function, the chance of getting a collision is 1 in 3.7x10⁷— or roughly half as likely as winning the UK National Lottery jackpot.

Implementation of a 64-bit strong hash code

On the next page, we look at an example implementation of a 64-bit hash code.

1. It should be noted that by "strong", we don't necessarily mean secure. In other words, it doesn't matter to us if the hash function is reversible and that, for a given hash code, it is feasible to compute a string/object that has that hash code. Conversely, so-called secure hash functions such as the SHA series aim to make such reversal unfeasible. A secure hash function is also a strong hash function, but not necessarily vice versa.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey