Advanced use of hash codes in Java: doing away with the keys

On the previous page, we discussed some hash code statistics: specifically the number of collisions that we expected if we took n hash codes from a given range of possible values. We mentioned, for example, that if we take in the order of 100,000 random 32-bit hash codes, we'd expect in the order of 2 collisions, with 8 collisions if we take about 200,000, 32 if we take 400,000 etc. Recall that we're talking purely about the number of pairs¹ of objects with identical hash codes, not about the actual number of collisions in the hash table (i.e. the number of buckets that would have more than one item), which we expect to be much greater. A couple of interesting possibilities emerge at this point:

We could use the hash code as the key to the hash table instead of the actual object. This would mean, of course, that there'd be a risk that we'd ask for the object mapped to a given key, and actually get the wrong value (i.e. one mapped to another object with the same hash code). But the risk would be calculable, and might be small enough for some applications.
If the number of collisions was too great for the number of objects we want to key on, we could use a wider hash code. For example, using 64-bit hash codes would mean we could take in the order of a billion random objects before expecting to get a hash code collision (we'd expect an average of 2 collisions with 2³³ = around 8 billion objects).

So why just store hash codes rather than the keys if we know it's not completely accurate? Well, partly for speed (comparing two numbers is quicker than comparing two strings) but mainly to reduce memory usage in the case where our keys are relatively large. Potentially we can also reduce the number of objects created to store the structure, if we're prepared to write our own map implementation: if our keys are primitive types (ints or doubles), we can actually store them in a single array. Of course, storing only the hash codes would also mean that we couldn't iterate over the keys as we can with a regular Java hash map or hash set. So the technique would only be suitable for retrieving the mapped values, or for determining if a given key is in the map.

On the next pages, we consider keying on hash code and using using hash codes with a BitSet to significantly reduce memory usage for the purpose of duplicate elimination.

1. We're of course making an approximation: we're not distinguishing between the case of n colliding pairs, and the case where more than 2 objects have an identical hash code. Our calculations are based on the assumption that each collision involves a different pair of colliding objects. For most practical cases, this turns out to be a good approximation.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey