Bloom Filters: the false positive rate (analysis)

On the previous page we looked at behaviour of the false positive rate of our Java Bloom filter implementation. Ultimately, this behaviour is a combination of various factors involved: not only the "theoretical" behaviour assuming truly random selection of objects and perfect hash codes, but to a small extent, the actual distribution of possible object values and the quality of our hash function. So I think that looking at some "real life" data is important. However, for completeness, I will look briefly here at how we can approximate the behaviour of the Bloom filter from a more mathematical point of view.

It turns out that the exact mathematics of how Bloom filters perform is rather complex. Anyone with a deep interest in this subject is encouraged to consult Bose et al (2007), On the false-positive rate of Bloom filters. For our purposes, we will make do with two simple approximations, which turn out to be good enough in most cases.

Mathematical approximation of the false positive rate

There are two simple mathematical approximations to the false positive rate. The logic in each case is similar and both approximations give similar results, differing subtly in the assumptions they make.

The first approximation appears in Bloom's original 1970 paper. Consider a Bloom filter with k hash codes and an underlying bit set of m bits. This means that each time we add an item, we expect to set k bits, or proportionally, k/m of the bits. So after the first item is inserted, we assume the proportion of bits still not set is 1-k/m. After the second item is inserted, we assume that proportion will be (1-k/m)(1-k/m). After the third item, (1-k/m)(1-k/m)(1-k/m) etc, so that after n items are inserted, we expect the proportion of bits still to be clear to be (1-k/m)n.

This is an oversimplification on two grounds. It ignores the fact that very occasionally two or more of the k bits set for a given item will coincide and we won't actually change the state of k bits. (Clearly, this chance becomes more miniscule as m is much greater than k, which will generally be the case.) Perhaps more significantly, it assumes that bits are set independently of one another, whereas in reality, the more bits that are set, the more chance of a bit "colliding" with one already set and not altering the proportion of bits set. But for Bloom filters that aren't "too full", it's a good enough approximation. Then, given a random item that we happen to know isn't actually in the set, the chance that our Bloom filter will erroneously report that it is present is equal to the likelihood that all corresponding k bits for that item are set. In other words, if (1-k/m)n is the proportion of bits that are clear, then for a each individual bit to be set it must be in the remaining 1-(1-k/m)n proportion of bits. Thus the overall likelihood of all k bits being set is assumed to be this proportion multiplied by itself k times:

approx f. p. rate = (1-(1-k/m)n)k

The second approximation— and the one that appears to be most common in descriptions of Bloom filters— follows a slightly different line of logic but arrives at a similar result.