Bloom Filters

This section looks at a structure often called a Bloom filter¹. A Bloom filter is an implementation of a set which allows a certain probability of false positives when determining if a given object is a member of the set, in return for a reduction in the amount of memory required for the set. It is essentially an extension of the idea of using hash codes as an index to a BitSet for the purpose of duplicate elimination.

It effectively works as follows:

we allocate m bits to represent the set data;
we write a hash function that, instead of a single hash code, produces k hash codes for a given object;
to add an object to the set, we derive bit indexes from all k hash codes and set those bits;
to determine if an object is in the set, we again calculate the corresponding hash codes and bit indexes, and say that it is present if and only if all corresponding bits are set.

The margin of error (or false positive rate) thus comes from the fact that as we add more and more objects to the set, we increase the likelihood of "accidentally" setting a combination of bits that corresponds to an object that isn't actually in the set. In an example of using a bloom filter to represent a set of strings, we find that allocating 8 bits per item gives us a false positive rate of around 4.9% with 2 hash codes and 2.3% with 4 hash codes². Within certain parameters, we can achieve a certain false positive rate either by increasing the number of bits per item (hence overall memory used) or by increasing the number of hash codes (and hence CPU usage required per lookup).

Bloom filters are useful for "weeding out" items. For example, when placed on top of a database query, they can be used to avoid database "hits" for items that we know in advance (because our Bloom filter reports that they are not present) are not in the database. Because of the "false positive" phenomenon, some acceptable percentage of unnecessary database lookups will still occur, but the number of unnecessary lookups will be reduced. This tradeoff is worthwhile because accessing an area of memory locally is usually much less resource-intensive than performing a database query, which may involve disk accesses and potentially a connection to a remote server.

Now we have introduced the concept, we look at an example Bloom filter implementation in Java. Later on, we also look at the false positive rate of our Bloom filter, and thus how to tune it for our purposes.

1. Bloom filters take their name from Burton Bloom, who appears to be have been the first to formally describe such structures in his paper Space/Time Trade-offs in Hash Coding with Allowable Errors (1970). The name "Bloom filter" appears to be a later coining, since the original paper does not give any specific name for the structure other than "Method 2" of two "hash-coding methods with allowable errors". ("Method 1" was in effect a hash set that stores only the hash codes of items.)
2. Theoretical bounds on positive rates can be calculated given a particular size of filter, number of items and number of hash codes. As I'll discuss below, because of the complexity of factors involved— including the nature of the data and the actual quality of the hash codes in a real life situation— I find it simpler and more useful to give data from a real-life simulation.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey