Writing an adequate hash function for strings

Having seen the obvious problem of a hash function that just takes the string length as the hash code, we outlined that the general aim of a hash function can be seen as to distribute codes randomly over the entire range for given random inputs. Of course, for a given key X, the hash function must always generate the same hash code Y. But the point is that Y should essentially be a "random number across the possible range of hash codes". For typical keys, we don't want our hash codes to tend to "clump" within a certain range, or in binary terms, for certain bits of the hash code to tend to be always 0 or always 1. (If you're not familiary with binary and the way computers store numbers, it's worth looking at this site's tutorial on binary and number representation to help you understand the rest of this section.)

On the other hand, we also require our hash function to be fairly efficient. In practice, "reasonably random but quick to compute" is the tradeoff we want. For example, we'll see below that so long as a fair proportion of the bits of the hash code are random, then this randomness can be "ironed out" across all of the bits, and for most cases that's good enough.

So... back to the hash codes of strings...

Enough with the theory. Let's have a look at how the Java String class actually computes its hash codes. The algorithm goes something like his:

public int hashCode() {
  int hash = 0;
  for (int i = 0; i < length(); i++) {
    hash = hash * 31 + charAt(i);
  }
  return hash;
}

This isn't literally the code, because inside the String class, the code can access the characters of the string more efficiently than with the public charAt() method. And after it is calculated, the hash code is cached. But this is the essence of how the hash codes of Java strings are calculated.

To start with, without delving into all the details of this function, we can at least notice that:

Obviously, there will be specific sets of strings where this is a bit unnecessary. For example, if all your keys were telephone numbers with the same area code, then the first few characters will be identical for each key and won't help to make the hash codes more random. Or in the case of our country names, just taking the first two letters actually differentiates many countries. But the String class can't make assumptions such as this: it has to take the generic approach to try and work with pretty much any set of strings.

Like any generic algorithm, this one has some worst cases. But it turns out that for many "typical" sets of strings, the algorithm used by the String class works quite well. That is, it tends to generate hash codes with reasonably random lower bits (for reasons we'll see later, those are generally the ones that count most), and irons out some of the non-randomness in the distribution of characters in the string.

At this point, there are a couple of different routes you can take: