Bloom Filters: the false positive rate (ctd)

Running our simulation to measure the false positive rate of Bloom filters under different configurations, Figure 1 shows how the percentage varies as a function of size of the filter and number of hash codes. In effect, the first value along the X axis represents a bit set size of 1 bit per item, the second value (2¹⁵) 2 bits per item, the third (2¹⁶) 4 bits per item etc¹.

Figure 1: False positive rate of a Bloom filter representing a set of 2¹⁴ strings, as a function of number of bits allocated to the filter and the number of hash codes.

As intuitively expected, assigning only a very small number of bits per item results in very high false positive rates. With one or two bits per item, adding more hash codes is also of no benefit: it simply serves to "fill up" the bit set more quickly and hence increases rather than decreases the false positive rate. At 4 bits per item, adding more hash codes initially does improve the false positive rate, up to an optimum value of 3 hash codes and a false positive rate of 14.6%. Beyond this, we have gone past the "point of no return" where more hash codes effectively fill the bit space too quickly.

With 8 bits per item, the optimum false positive rate (2.14%) occurs with 6 hash codes. However, as is often the case, there is a range of values where we might argue that the benefit of adding an extra hash code rapidly diminishes. Figure 2 shows more clearly how false positive rate varies as a function of the number of hash codes, for Bloom filters with 8 bits and 16 bits per item. After a clear benefit for the first 2 or 3 hash codes, the curve rapidly plateaus in both cases.

Figure 2: False positive rate by number of hash codes, showing the increasingly diminishing returns of adding an extra hash code up to the optimum value.

The information we have seen means that for many applications, reasonable Bloom filter configurations will have between 4 and 16 bits per item and between 2 and 4 hash codes. Table 1 below gives the actual false positive rates observed² in our experiment for values within these ranges:

Hash codes	4 bits/item	8 bits/item	16 bits/item
2	15.4%	4.8%	1.4%
3	14.6%	3.0%	0.49%
4	15.8%	2.4%	0.24%

Note that the mean bits per item does not actually need to be a power of 2: intermediate values, giving intermediate false positive rates, are in principle possible.

1. Of course, this actually means that a size of 2²³, or 1024 bits per item, is vastly unrealistic: our implementation is based on 64-bit hash codes, so our purpose of using a Bloom filter is to improve on the 64 bits or so per item that would ordinarily be required for a "normal" hash set that stored only the hash codes.
2. All of the percentages given are the mean of 10 runs of testing 1 million random strings (so 10 million strings in total) as described in the previous section.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey