Improving our hash function
On the previous page, we looked at the structure of a hash map,
which assigns each key/value pair to one of a number of "buckets" (linked lists) based on the key's hash code.
In our simple but impractical example, we took the length of the
string as the hash function.
If we solve the problem of collisions by having a linked list in each bucket,
then taking the string length as the hash function will theoretically work.
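To make this concrete, here is a minimal sketch of such a map, using the string's length as the hash function and a linked list per bucket. (The class and method names here are illustrative, not from any standard library.)

```java
import java.util.LinkedList;

public class LengthHashMap {
    // One linked list ("bucket") per array position.
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    public LengthHashMap(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // The "hash function": simply the key's length.
    static int hash(String key) {
        return key.length();
    }

    public void add(String key) {
        // Reduce the hash code to a valid array index via the modulus.
        buckets[hash(key) % buckets.length].add(key);
    }

    public boolean contains(String key) {
        // On a collision, we must scan the whole list in that bucket.
        return buckets[hash(key) % buckets.length].contains(key);
    }
}
```

Note that every key of the same length lands in the same bucket, which is exactly the weakness discussed next.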
But it has an obvious problem if, say, we want our keys to be
100,000 words of English.
In this case, the linked list at array position 6,
along with those at other common string lengths, will still have thousands
of entries, while higher positions will have practically none. Sure, it's better to
scan a list of 10,000 entries than a list of 100,000 when looking for the key. But really,
taking the string length just isn't an adequate hash function. We need to choose a better
hash function, one that ideally has these properties:
- It can distribute the keys more or less evenly over the range
of positions in the array. We want to avoid the situation where position 6 has a list of
10,000 entries and position 13 has a list of only 20 entries.
- It can distribute the keys over a larger range of values. If there are 100,000 words
in the map and a perfectly-distributed hash function, then having 50 positions in
the array would still give an average list size of 2,000 items. But if we could have 5,000
positions, there'd be just 20 items in each list. And with something in the region of
100,000 positions,
we'd make on average just one or two comparisons [1] to find the mapping we were looking for.
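The figures in the second point are simple arithmetic: with a perfectly even hash function, the average list size is just the number of keys divided by the number of buckets. This quick sketch (plain arithmetic, not tied to any particular map implementation) reproduces them:

```java
public class AverageBucketSize {
    // Average number of items per bucket, assuming a perfectly
    // even distribution of keys over the buckets.
    static double average(int keys, int buckets) {
        return (double) keys / buckets;
    }

    public static void main(String[] args) {
        System.out.println(average(100_000, 50));      // prints 2000.0
        System.out.println(average(100_000, 5_000));   // prints 20.0
        System.out.println(average(100_000, 100_000)); // prints 1.0
    }
}
```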
In practice, we combine these requirements and define our task as coming up with
a hash function that distributes hash codes evenly over the entire range of possible
integers (in Java, that means a 4-byte number, with about 4 billion possible
values) [2]. Then we assume that whatever the size of the array in our hash map
(whatever the modulus), the mappings will be distributed more or less evenly.
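As a sketch of that last step: a map can reduce a full-range (possibly negative) 32-bit hash code to an array index like this. (This is illustrative only; real implementations such as java.util.HashMap use slightly different bit-manipulation tricks.)

```java
public class BucketIndex {
    // Reduce an arbitrary 32-bit hash code to an index in [0, capacity).
    static int indexFor(int hashCode, int capacity) {
        // Mask off the sign bit so the modulus is never negative.
        return (hashCode & 0x7fffffff) % capacity;
    }

    public static void main(String[] args) {
        // Works for any capacity, and for negative hash codes too.
        System.out.println(indexFor("example".hashCode(), 16));
        System.out.println(indexFor(-123456789, 16));
    }
}
```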
On the next page, we'll look at the string hash function
used by the Java String class, and examine some of the features that make it
better suited to the task.
Notes:
1. Given an ideal hash code that distributes keys randomly over the entire hash code
range, there is actually a stricter mathematical relationship between (a) the number of
buckets, (b) the number of items to distribute among those buckets, and
(c) the number of buckets that on average will have a given number of items in them.
We'll explore this relationship later.
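One standard way of modelling the relationship mentioned in note 1 (a sketch under the assumption of a uniformly random hash function, ahead of the fuller treatment later): if n items are thrown at random into b buckets, the number of items per bucket is approximately Poisson-distributed with mean lambda = n / b, so the expected fraction of buckets holding exactly k items is e^(-lambda) * lambda^k / k!.

```java
public class BucketDistribution {
    // Expected fraction of buckets holding exactly k items, under the
    // Poisson approximation with mean lambda = items / buckets.
    static double fractionWithExactly(int k, double lambda) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        // 100,000 items in 100,000 buckets: lambda = 1.
        double lambda = 1.0;
        System.out.printf("empty:   %.3f%n", fractionWithExactly(0, lambda)); // ~0.368
        System.out.printf("1 item:  %.3f%n", fractionWithExactly(1, lambda)); // ~0.368
        System.out.printf("2 items: %.3f%n", fractionWithExactly(2, lambda)); // ~0.184
    }
}
```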
2. Put slightly more mathematically, we want to avoid certain bits in the integer being
"biased". That is, we want there to be a roughly 50% chance of a given bit being set
in the hash code given any particular input.
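One way to test for this kind of bias empirically is to hash a sample of inputs and count how often each of the 32 bits is set; for an unbiased hash function, each count should come out close to 50% of the sample size. (A rough sketch; the sample inputs here are arbitrary.)

```java
public class BitBiasCheck {
    // For each of the 32 bit positions, count in how many of the given
    // hash codes that bit is set.
    static int[] bitCounts(int[] hashCodes) {
        int[] counts = new int[32];
        for (int h : hashCodes) {
            for (int bit = 0; bit < 32; bit++) {
                if ((h & (1 << bit)) != 0) {
                    counts[bit]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 10_000;
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = ("key" + i).hashCode();
        }
        int[] counts = bitCounts(hashes);
        for (int bit = 0; bit < 32; bit++) {
            System.out.printf("bit %2d set in %5d of %d hash codes%n",
                    bit, counts[bit], n);
        }
    }
}
```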
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.