Writing an adequate hash function for strings
Having seen the obvious problem of a hash function that just takes the
string length as the hash code, we outlined that the general
aim of a hash function can be seen as to distribute codes randomly over the
entire range for given random inputs. Of course, for a given key X, the hash
function must always generate the same hash code Y. But the
point is that Y should essentially be a "random number across the possible range
of hash codes". For typical keys, we don't want our hash codes to
tend to "clump" within a certain range, or in binary terms, for certain bits
of the hash code to tend to be always 0 or always 1. (If you're not
familiary with binary and the way computers store numbers, it's worth
looking at this site's tutorial on binary and
number representation to help you understand the rest of this section.)
On the other hand, we also require our hash function to be fairly
efficient. In practice, "reasonably random but quick to
compute" is the tradeoff we want. For example, we'll see below that so long as
a fair proportion of the bits of the hash code are random, then
this randomness can be "ironed out" across all of the bits, and for most cases
that's good enough.
So... back to the hash codes of strings...
Enough with the theory. Let's have a look at how the Java String
class actually computes its hash codes. The algorithm goes
something like his:
public int hashCode() {
int hash = 0;
for (int i = 0; i < length(); i++) {
hash = hash * 31 + charAt(i);
}
return hash;
}
This isn't literally the code, because inside the String
class, the code can access the characters of the string more efficiently than
with the public charAt() method. And after it is calculated, the hash
code is cached. But this is the essence of how the hash codes of
Java strings are calculated.
To start with, without delving into all the details of this function, we can at least
notice that:
- the function takes account of the values of all
the characters in the string;
- on each cycle, the function involves two different operations
(in fact, arguably the multiplication can be analysed as a shift plus subtract–
see the section on how hash functions work)
and tries to introduce interactions between different bits
of successive characters that will provoke (or at least spread) randomness (we'll look at exacltly
how below).
Obviously, there will be specific sets of strings where this is a bit
unnecessary. For example, if all your keys were telephone numbers with
the same area code, then the first few characters will be identical for
each key and won't help to make the hash codes more random. Or in the
case of our country names, just taking the first two letters actually
differentiates many countries. But the String class can't
make assumptions such as this: it has to take
the generic approach to try and work with pretty much
any set of strings.
Like any generic algorithm, this one has some worst cases.
But it turns out that for many "typical" sets of strings, the algorithm
used by the String class works quite well. That is, it tends to
generate hash codes with reasonably random lower bits (for reasons we'll see
later, those are generally the ones that count most), and irons out some
of the non-randomness in the distribution of characters in the string.
- In fact, it works well enough that, if you have some objects that you
want to use as keys and can't think what else to use as a hash code,
appending the data fields to a string and using that string as the key
can sometimes be a starting point to get you past the problem temporarily,
even if not the most efficient or elegant solution.
At this point, there are a couple of different routes you can take:
- If you just want to take a pragmatic approach and
use Strings as keys to your map data without worrying about any
more of the ins and outs of hashing,
then you can start using the HashMap class
immediately;
- If you want to use objects of your own classes as a key but without
worrying too much more about the technical details of hash functions, then
you can see some practical hash function
guidelines and a guide to overriding the
hashCde() and equals() methods, which is effectively what
you need to do to "plug in" your hash function and make your class work
as a hash map key.
- You may wish to understand more of the technical details of
how hash functions work and what the basis
is behind the guidelines mentioned.
If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.