Advanced use of hash codes in Java: doing away with the keys

On the previous page, we discussed some hash code statistics: specifically the number of collisions that we expected if we took n hash codes from a given range of possible values. We mentioned, for example, that if we take in the order of 100,000 random 32-bit hash codes, we'd expect in the order of 2 collisions, with 8 collisions if we take about 200,000, 32 if we take 400,000 etc. Recall that we're talking purely about the number of pairs1 of objects with identical hash codes, not about the actual number of collisions in the hash table (i.e. the number of buckets that would have more than one item), which we expect to be much greater. A couple of interesting possibilities emerge at this point:

So why just store hash codes rather than the keys if we know it's not completely accurate? Well, partly for speed (comparing two numbers is quicker than comparing two strings) but mainly to reduce memory usage in the case where our keys are relatively large. Potentially we can also reduce the number of objects created to store the structure, if we're prepared to write our own map implementation: if our keys are primitive types (ints or doubles), we can actually store them in a single array. Of course, storing only the hash codes would also mean that we couldn't iterate over the keys as we can with a regular Java hash map or hash set. So the technique would only be suitable for retrieving the mapped values, or for determining if a given key is in the map.

On the next pages, we consider keying on hash code and using using hash codes with a BitSet to significantly reduce memory usage for the purpose of duplicate elimination.


1. We're of course making an approximation: we're not distinguishing between the case of n colliding pairs, and the case where more than 2 objects have an identical hash code. Our calculations are based on the assumption that each collision involves a different pair of colliding objects. For most practical cases, this turns out to be a good approximation.