Improving our hash function

On the previous page, we looked at the structure of a hash map, which places each key/value pair into one of a number of "buckets" (typically linked lists) based on a hash function applied to the key.

In our simple but impractical example, we took the length of the string as the hash function. If we solve the problem of collisions by putting a linked list in each bucket, then taking the string length as the hash function will work in principle. But it has an obvious problem if, say, we want our keys to be 100,000 words of English. In that case, the linked list at array position 6, along with those at other common string lengths, will still have thousands of entries, while the higher positions will have practically none. Sure, it's better to scan a list of 10,000 entries than a list of 100,000 when looking for a key. But our hash function of taking the string length clearly isn't adequate. We need to choose a better hash function: ideally, one that distributes the keys evenly among the buckets, whatever the number of buckets and whatever the keys we throw at it.
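To see the skew concretely, here is a small sketch (the class and method names are illustrative, not from any library) that uses string length as the bucket index and counts how a handful of English words land. Because so many English words are four to eight letters long, the six-letter words all pile into the same bucket:

```java
// A deliberately poor hash function: the key's string length.
// English word lengths cluster around 4-8 letters, so most entries
// pile into a few buckets while the rest stay nearly empty.
public class LengthHash {

    // Bucket index = string length modulo the number of buckets.
    static int bucketFor(String key, int numBuckets) {
        return key.length() % numBuckets;
    }

    public static void main(String[] args) {
        String[] words = { "cat", "dog", "house", "garden", "yellow",
                           "orange", "banana", "stream", "bridge" };
        int[] counts = new int[16];
        for (String w : words) {
            counts[bucketFor(w, 16)]++;
        }
        // Bucket 6 collects all six of the six-letter words above.
        for (int i = 0; i < counts.length; i++) {
            System.out.println("bucket " + i + ": " + counts[i] + " item(s)");
        }
    }
}
```

With a realistic dictionary of 100,000 words, the same clustering happens at scale: a few buckets hold thousands of entries each.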

In practice, our task boils down to coming up with a hash function that distributes hash codes evenly over the entire range of possible integers (in Java, that means a 4-byte number, or one with about 4 billion possible values)2. Then we can assume that whatever the size of the array in our hash map (whatever the modulus), the mappings will be distributed more or less evenly1.
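As a sketch of that last step (the class and method names here are made up for illustration): if the hash code is spread evenly over all 4 billion int values, reducing it with a modulus gives an even spread over any table size. Note that a Java hash code can be negative, so we use Math.floorMod rather than the % operator, which can return a negative result:

```java
// Reduce a full-range (possibly negative) int hash code to a bucket index.
public class BucketIndex {

    // floorMod always returns a value in 0 .. numBuckets-1,
    // even when hashCode is negative.
    static int indexFor(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("example".hashCode(), 16));
        System.out.println(indexFor(-123456789, 16)); // negative hash still maps into 0..15
    }
}
```

If the hash codes are unbiased over the whole int range, then for any modulus the bucket indices come out unbiased too, which is exactly the assumption stated above.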

On the next page, we'll look at the string hash function used by the Java String class, and examine the features that make it more suitable.


Notes:
1. Given an ideal hash code that distributes keys randomly over the entire hash code range, there is actually a stricter mathematical relationship between (a) the number of buckets, (b) the number of items to distribute among those buckets, and (c) the number of buckets that on average will have a given number of items in them. We'll explore this relationship later.
2. Put slightly more mathematically, we want to avoid certain bits in the integer being "biased". That is, over the range of possible inputs, each bit of the hash code should have a roughly 50% chance of being set.
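One way to picture this bias (a rough sketch with made-up names, not a rigorous test): hash a set of keys and count how often a given bit comes out set. For an unbiased hash the fraction should hover around 0.5; for our string-length "hash" the high bits are never set at all, because word lengths are tiny numbers:

```java
// Estimate bit bias: the fraction of hash codes in which a given bit is set.
public class BitBias {

    static double bitSetFraction(int[] hashes, int bit) {
        int set = 0;
        for (int h : hashes) {
            if (((h >>> bit) & 1) == 1) set++;
        }
        return (double) set / hashes.length;
    }

    public static void main(String[] args) {
        String[] keys = { "cat", "dog", "house", "garden", "stream", "bridge" };
        int[] lengthHashes = new int[keys.length];
        int[] stringHashes = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            lengthHashes[i] = keys[i].length();    // our poor hash
            stringHashes[i] = keys[i].hashCode();  // String's real hash
        }
        // Bit 20 is never set by the length hash: that would need a
        // string over a million characters long.
        System.out.println("length hash, bit 20: " + bitSetFraction(lengthHashes, 20));
        System.out.println("String hash, bit 20: " + bitSetFraction(stringHashes, 20));
    }
}
```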