A strong 64-bit hash function in Java: example hash function

On the previous page, we explained the potential benefit of a 64-bit hash code: if the hash code is "strong" enough, then different objects are practically guaranteed to have unique hash codes for reasonably-sized collections of objects. This means that we can create hash tables and other related structures that can store only the hash codes of their keys, as opposed to the standard Java implementation which must store the key.

But how do we actually implement a strong hash function? Our goal is to create a hash function which, given two objects, will tend to assign each a different hash code, and where those hash codes will be dispersed: the relationship between an object and its hash code appears to all intents and purposes "random". From this point of view, there is a close relationship between a hash function and a typical software random number generator (RNG): the latter attemps to take a number and from it produce the next number in the series with a seemingly "random" (but actually predictable) relationship between the two. We can therefore see a hash function as an RNG that starts with a fixed seed and at each step of the way, produces the next "random" number in the series and combines it with the next byte or field from the data to be hashed.

One way of combining the data with an RNG is to use a linear congruential generator (LCG). Recall that in an LCG, each successive "random" number is usually produced by multiplying the previous number by a fixed, carefully chosen multiplier and then adding a constant. On each iteration, instead of adding the same constant, we can instead combine a constant dependent on the next byte or item from the data to be hashed. (We change the constant, rather than the multiplier, bceause the quality of an LCG is highly dependent on the multiplier.)

An implementation of this scheme suggested in Numerical Recipes (Section 7.6) uses a table of 256 random values indexed by successive bytes in the data, and recommends a multiplier suitable for an LCG with a modulus of 2⁶⁴ (which is effectively what we have when we mulitply using a 64-bit long). A possible implementation in Java runs along these lines:

public class DataHasher {
  private static final long[] byteTable = createLookupTable();
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;

  public static long hash(byte[] data) {
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    for (int len = data.length, i = 0; i < len; i++) {
      h = (h * hmult) ^ ht[data[i] & 0xff];
    }
    return h;
  }
}

As mentioned, the value of HMULT is found to be good in practice with 64-bit LCGs. It has roughly half its bits set and is "virtually" prime (it is composed of three prime factors). The value of HSTART isn't motivated by the authors, and I believe that essentially any value would do. But what about the byteTable?

Strong hash code implementation: continued

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey