Advanced use of hash codes in Java: keying on hash code

On the previous page, we mentioned the possibility of reducing memory usage of a hash map by keying only on the 32-bit hash code. We trade less memory needed to store the keys for a loss of functionality: we can't iterate over the keys (since we're not storing them!), and there's a calculable risk that a given key will actually fetch or override the value of a different key.

Let's suppose we want to gather some statistics from a web server. We want to be able to query the number of times that, say, a given search term was entered into our search engine. Or the number of times that a given referrer string was what led the user to our site. We could use a plain old HashMap for this, mapping from a String to an Integer count. Alternatively, we could declare our hash map as mapping from integer to integer. When we put an item into the map (or increase the count), we call the string's hashCode() method directly and use the result as the key:

private Map<Integer,Integer> termCounts =
  new HashMap<Integer,Integer>(10000);

public void increaseCount(String term) {
  int hc = term.hashCode();
  synchronized (termCounts) {
    Integer cnt = termCounts.get(hc);
    termCounts.put(hc, (cnt == null) ? 1 : (cnt+1));
  }
}

Now, to find out how many times a particular term was entered, we perform a similar operation, getting the value corresponding to term.hashCode(). If an average search term is 20 characters (stored as 40 bytes per String), then storing instead as a 4-byte integer will save us 32 bytes per term. After 50,000 terms, that saves us about 1.5MB. And after that many terms, there'd be around a 50% chance of a collision: i.e. that there'd be a particular term that would incorrectly return and update the count of a different term.

Now, that might not sound like the greatest saving in the universe. But the more interesting saving comes if we use an alternative data structure. Instead of storing the counts in a regular HashMap, which would thus store tens of thousands of individual objects, we can store the data in a small number of arrays. Effectively, we build our own hash structure, but allocate "memory" for the buckets and items in them from an array rather than creating actual Java objects.

Duplicate elimination with hash codes only

Another interesting case is the HashSet used in duplicate elimination in which the keys are in effect not mapped values. On the next page, we consider duplicate elimination using hash codes only as the means of identifying if an object is in the set of "already seen" values.