The mathematics of hash codes and hashing

On this page, we'll outline various aspects of the mathematics of hash codes which can be important in making implementation decisions about hash functions and hash tables.

Hash code distribution

Recall that an ideal hash function, for a random key, should produce a determined hash code that is randomly distributed among the possible hash values. This implies:

overall, each bit of a hash code should have a roughly 50-50 chance of being 0 or 1 (though of course for the entire range of hashes to be used, there will be indivudal keys with more 0s or more 1s);
for keys n and n+1, there should ideally be no fixed relationship between their corresponding hash codes: changing a bit in the key should affect essentially random bits in the hash code.

A poor hash function that has a bias towards certain hash codes will increase the chance of collisions (two hash codes needing to occupy the same slot in the hash table). In practice, in many applications it is sufficient to accept somewhat imperfect hash functions that are quick to compute. (Historically, "quick to compute" has meant a function that can be encoded in a few arithmetic instructions (such as adds, shifts and exclusive ORs) without relying on large operands that must be fetched from memory.) As discussed on this section, the String hash function works "well enough" for most purposes. A couple of tricks that help to alleviate a poor hash function are using a prime number as the hash table size, or using a secondary hash function that mixes bits of hash codes. The idea of the latter technique is that if there is good randomness in some of the bit range, this randomness is partly distributed across a greater range of bits¹.

When modelling the property of hash codes and hash tables mathematically, it is usual to assume that the hash function is perfectly-distributed (just as it's common in mechanics to deal with "light, frictionless, inextensible" bodies...). This gives us some kind of "ideal expected performance" that we can measure against. Given such a hash function, we next look at some statistical properties of hash codes and hash tables.

Notes:
1. The Java HashMap implementation uses the latter technique. In earlier versions of Hotspot, the former technique was used for the internal table kept for the String.intern() command.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey