Improving our hash function

On the previous page, we looked at the structure of a hash map, which places each key/value pair into one of a number of "buckets" (typically linked lists) based on a hash function applied to the key.

In our simple but impractical example, we took the length of the string as the hash function. If we solve the problem of collisions by putting a linked list in each bucket, then taking the string length as the hash function will work in principle. But it has an obvious problem if, say, we want our keys to be 100,000 words of English. In that case, the linked list at array position 6, along with those at other common string lengths, will still have thousands of entries, while the higher positions will have practically none. Sure, it's better to scan a list of 10,000 entries than a list of 100,000 when looking for a key. But our hash function of taking the string length clearly isn't adequate. We need to choose a better hash function: ideally, one that distributes the keys evenly among the buckets, whatever the number of buckets and whatever the keys we throw at it.
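To see the skew concretely, here is a small sketch (the class and method names are illustrative, not from any library) that uses string length as the bucket index and counts how a handful of English words land. Because so many English words are four to eight letters long, the six-letter words all pile into the same bucket:

```java
// A deliberately poor hash function: the key's string length.
// English word lengths cluster around 4-8 letters, so most entries
// pile into a few buckets while the rest stay nearly empty.
public class LengthHash {

    // Bucket index = string length modulo the number of buckets.
    static int bucketFor(String key, int numBuckets) {
        return key.length() % numBuckets;
    }

    public static void main(String[] args) {
        String[] words = { "cat", "dog", "house", "garden", "yellow",
                           "orange", "banana", "stream", "bridge" };
        int[] counts = new int[16];
        for (String w : words) {
            counts[bucketFor(w, 16)]++;
        }
        // Bucket 6 collects all six of the six-letter words above.
        for (int i = 0; i < counts.length; i++) {
            System.out.println("bucket " + i + ": " + counts[i] + " item(s)");
        }
    }
}
```

With a realistic dictionary of 100,000 words, the same clustering happens at scale: a few buckets hold thousands of entries each.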

In practice, our task boils down to coming up with a hash function that distributes hash codes evenly over the entire range of possible integers (in Java, that means a 4-byte number, or one with about 4 billion possible values)2. Then we can assume that whatever the size of the array in our hash map (whatever the modulus), the mappings will be distributed more or less evenly1.
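As a sketch of that last step (the class and method names here are made up for illustration): if the hash code is spread evenly over all 4 billion int values, reducing it with a modulus gives an even spread over any table size. Note that a Java hash code can be negative, so we use Math.floorMod rather than the % operator, which can return a negative result:

```java
// Reduce a full-range (possibly negative) int hash code to a bucket index.
public class BucketIndex {

    // floorMod always returns a value in 0 .. numBuckets-1,
    // even when hashCode is negative.
    static int indexFor(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("example".hashCode(), 16));
        System.out.println(indexFor(-123456789, 16)); // negative hash still maps into 0..15
    }
}
```

If the hash codes are unbiased over the whole int range, then for any modulus the bucket indices come out unbiased too, which is exactly the assumption stated above.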

On the next page, we'll look at the string hash function used by the Java String class, and examine the features that make it more suitable.


Notes:
1. Given an ideal hash code that distributes keys randomly over the entire hash code range, there is actually a stricter mathematical relationship between (a) the number of buckets, (b) the number of items to distribute among those buckets, and (c) the number of buckets that on average will have a given number of items in them. We'll explore this relationship later.
2. Put slightly more mathematically, we want to avoid certain bits in the integer being "biased". That is, over the range of possible inputs, each bit of the hash code should have a roughly 50% chance of being set.
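One way to picture this bias (a rough sketch with made-up names, not a rigorous test): hash a set of keys and count how often a given bit comes out set. For an unbiased hash the fraction should hover around 0.5; for our string-length "hash" the high bits are never set at all, because word lengths are tiny numbers:

```java
// Estimate bit bias: the fraction of hash codes in which a given bit is set.
public class BitBias {

    static double bitSetFraction(int[] hashes, int bit) {
        int set = 0;
        for (int h : hashes) {
            if (((h >>> bit) & 1) == 1) set++;
        }
        return (double) set / hashes.length;
    }

    public static void main(String[] args) {
        String[] keys = { "cat", "dog", "house", "garden", "stream", "bridge" };
        int[] lengthHashes = new int[keys.length];
        int[] stringHashes = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            lengthHashes[i] = keys[i].length();    // our poor hash
            stringHashes[i] = keys[i].hashCode();  // String's real hash
        }
        // Bit 20 is never set by the length hash: that would need a
        // string over a million characters long.
        System.out.println("length hash, bit 20: " + bitSetFraction(lengthHashes, 20));
        System.out.println("String hash, bit 20: " + bitSetFraction(stringHashes, 20));
    }
}
```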