Advanced use of hash codes in Java: duplicate elimination

A common use of the HashSet is for various types of duplicate elimination. The hash set, like the case of hash map keys, stores items in their entirety. Let's say we're collecting some information about the different clients that connect to our web server. The first time that we see a particular client (identified by IP address for this example), we want to log some information about the client to a database. But if the same client subsequently makes another request, we don't want to log the information again.

Ideally, we don't want to have to hit the database unnecessarily. So as a first attempt, we can use a HashSet to store which IP addresses we've already seen. Let's say we want to log data from about 50,000 clients:

Set<String> ipsSeen = new HashSet<String>(50000);

public void logInformation(String ipAddress, String info) {
  boolean isNew;
  synchronized (ipsSeen) {
    isNew = ipsSeen.add(ipAddress);
  }
  if (isNew) {
    // persist (ipAddress,info) to the database
  }
}

This will reduce the number of database hits because we don't have to hit the database to see if we've already saved information on a particular client. But it's also quite a memory-hungry way of achieving that purpose.

Now, supposing we don't actually store the IP addresses in memory, but just store the hash codes of those we've already seen. (An alert reader will have spotted that in 'raw bytes', a 'normal' IP address (i.e. not an IPv6 address) is actually 4 bytes, i.e. on today's Internet, a numeric IP address can be its own hash code.) So we could use a HashSet<Integer>, calculating the hash code of ipAddress each time before adding it to the set:

  Set<Integer> ipsSeen = new HashSet<Integer>(50000);
  ...
  isNew = ipsSeen.add(ipAddress.hashCode());

Putting just the hsah code in the HashSet is a bit better than the entire string, but it's still quite heavy-handed. On the next page, we'll consider an alternative technique. In effect, we can allocate a hash table that has a very high load factor– i.e. the number of buckets is several times the number of items– and store only one bit per item. This allows us to store a hash set as a BitSet, with some caveats.