The mathematics of hash codes and hashing
On this page, we'll outline various aspects of the mathematics of hash codes which can
be important in making implementation decisions about hash functions and hash tables.
Hash code distribution
Recall that an ideal hash function, for a random key, should produce a determined
hash code that is randomly distributed among the possible hash values. This implies:
- overall, each bit of a hash code should have a roughly 50-50 chance
of being 0 or 1 (though of course for the entire range of hashes to be used,
there will be indivudal keys with more 0s or more 1s);
- for keys n and n+1, there should ideally be no fixed relationship
between their corresponding hash codes: changing a bit in the key should affect
essentially random bits in the hash code.
A poor hash function that has a bias towards certain hash codes will increase
the chance of collisions (two hash codes needing to occupy the same slot
in the hash table). In practice, in many applications it is sufficient to accept
somewhat imperfect hash functions that are quick to compute. (Historically, "quick to
compute" has meant a function that can be encoded in a few arithmetic instructions
(such as adds, shifts and exclusive ORs) without relying on large operands that
must be fetched from memory.) As discussed on this section, the
String hash function works "well enough"
for most purposes. A couple of tricks that help to alleviate a poor hash function
are using a prime number as the hash table size, or using a secondary hash function
that mixes bits of hash codes. The idea of the latter technique is that if there is good randomness
in some of the bit range, this randomness is partly distributed across a greater
range of bits1.
When modelling the property of hash codes and hash tables mathematically, it is
usual to assume that the hash function is perfectly-distributed (just as it's
common in mechanics to deal with "light, frictionless, inextensible" bodies...).
This gives us some kind of "ideal expected performance" that we can measure against.
Given such a hash function, we next look at some
statistical properties of hash codes and hash tables.
Notes:
1. The Java HashMap implementation uses the latter technique.
In earlier versions of Hotspot, the former technique was used for the internal table
kept for the String.intern() command.
If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.