Introduction to hashing and hash maps: string hash functions

Writing an adequate hash function for strings

Having seen the obvious problem of a hash function that just takes the string length as the hash code, we outlined that the general aim of a hash function can be seen as to distribute codes randomly over the entire range for given random inputs. Of course, for a given key X, the hash function must always generate the same hash code Y. But the point is that Y should essentially be a "random number across the possible range of hash codes". For typical keys, we don't want our hash codes to tend to "clump" within a certain range, or in binary terms, for certain bits of the hash code to tend to be always 0 or always 1. (If you're not familiary with binary and the way computers store numbers, it's worth looking at this site's tutorial on binary and number representation to help you understand the rest of this section.)

On the other hand, we also require our hash function to be fairly efficient. In practice, "reasonably random but quick to compute" is the tradeoff we want. For example, we'll see below that so long as a fair proportion of the bits of the hash code are random, then this randomness can be "ironed out" across all of the bits, and for most cases that's good enough.

So... back to the hash codes of strings...

Enough with the theory. Let's have a look at how the Java String class actually computes its hash codes. The algorithm goes something like his:

public int hashCode() {
  int hash = 0;
  for (int i = 0; i < length(); i++) {
    hash = hash * 31 + charAt(i);
  }
  return hash;
}

This isn't literally the code, because inside the String class, the code can access the characters of the string more efficiently than with the public charAt() method. And after it is calculated, the hash code is cached. But this is the essence of how the hash codes of Java strings are calculated.

To start with, without delving into all the details of this function, we can at least notice that:

the function takes account of the values of all the characters in the string;
on each cycle, the function involves two different operations (in fact, arguably the multiplication can be analysed as a shift plus subtract– see the section on how hash functions work) and tries to introduce interactions between different bits of successive characters that will provoke (or at least spread) randomness (we'll look at exacltly how below).

Obviously, there will be specific sets of strings where this is a bit unnecessary. For example, if all your keys were telephone numbers with the same area code, then the first few characters will be identical for each key and won't help to make the hash codes more random. Or in the case of our country names, just taking the first two letters actually differentiates many countries. But the String class can't make assumptions such as this: it has to take the generic approach to try and work with pretty much any set of strings.

Like any generic algorithm, this one has some worst cases. But it turns out that for many "typical" sets of strings, the algorithm used by the String class works quite well. That is, it tends to generate hash codes with reasonably random lower bits (for reasons we'll see later, those are generally the ones that count most), and irons out some of the non-randomness in the distribution of characters in the string.

In fact, it works well enough that, if you have some objects that you want to use as keys and can't think what else to use as a hash code, appending the data fields to a string and using that string as the key can sometimes be a starting point to get you past the problem temporarily, even if not the most efficient or elegant solution.

At this point, there are a couple of different routes you can take:

If you just want to take a pragmatic approach and use Strings as keys to your map data without worrying about any more of the ins and outs of hashing, then you can start using the HashMap class immediately;
If you want to use objects of your own classes as a key but without worrying too much more about the technical details of hash functions, then you can see some practical hash function guidelines and a guide to overriding the hashCde() and equals() methods, which is effectively what you need to do to "plug in" your hash function and make your class work as a hash map key.
You may wish to understand more of the technical details of how hash functions work and what the basis is behind the guidelines mentioned.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey