Bloom Filters: the false positive rate
In the previous section, we looked at a Java implementation
of a Bloom filter to represent a set of strings. In this section, we look more closely at
how the Bloom filter performs as we change two key parameters:
- the number of bits per item: that is, how many times more bits
there are in the filter than the maximum number of items we want to represent in the set;
- the number of hash codes, and hence the number of bits that we actually
set for each object that we add to the set.
Intuitively, we can envisage the following general principles coming into play:
- the fuller the bit set is (or, put another way, the fewer bits we allow per item),
the greater the chance of false positives, as we "accidentally" mark items
as present when we add more items to the set;
- when the bit set is relatively empty (with more bits per item), then the more bits
we require to be set in order to report an item as "present" (i.e. the more hash
codes per item), the lower the chance of false positives: for a given item not
actually in the set, there's less chance of some random combination of bits
from other items accidentally marking that item as present;
- for a given filter size, there's a point of no return beyond which adding more
hash codes simply fills up the bit set more quickly as we add items, and hence
produces more false positives than fewer hash codes would. (We quantify these
trade-offs below.)
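Under the standard idealised model, in which each hash code picks a bit uniformly
and independently, these intuitions can be quantified. After inserting n items into
a filter of m bits using k hash codes per item, the probability that any given bit
is still zero is approximately e^(-kn/m), and so the expected false positive rate is
roughly (1 - e^(-kn/m))^k. The following small method (an illustrative sketch, not
part of the implementation from the previous section) computes this estimate:

private static double expectedFalsePositiveRate(long m, long n, int k) {
    // Probability that any given bit is still 0 after n insertions,
    // each of which sets k (approximately independent) bits
    double bitStillZero = Math.exp(-(double) k * n / m);
    // A false positive requires all k probed bits to be set
    return Math.pow(1 - bitStillZero, k);
}

Minimising this expression with respect to k gives an optimal number of hash codes
of k = (m/n) ln 2, or roughly 0.7 times the number of bits per item; beyond that,
we pass the "point of no return" described above.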
To see what this all adds up to in practice, we look at the example of adding
65,536 (= 2^14) random strings to a Bloom filter whose size varies from
2^14 to 2^23 bits, and where the number of hash codes varies
from 1 to 8. For each of these combinations of size and number of hash codes, we
create the corresponding Bloom filter then add 65,536 items to it. We also
instantiate a regular HashSet and add the strings to this, so that we
can determine what strings we "really" added to the Bloom filter and thus detect
false positives. Then, the final step is to generate a number of random strings
(in this case, 1 million). For each of these not in the set (which we of course
expect to be the vast majority), we count how many are reported as present by the Bloom filter:
these are the false positives. The code looks essentially like this:
// (Requires imports of java.security.SecureRandom and java.util.*)
private static final int NO_FALSE_POSITIVE_TESTS = 1_000_000;

Random r = new SecureRandom();
final int noItems = 1 << 14;
for (int log2bits = 14; log2bits <= 23; log2bits++) {
  for (int noHashes = 1; noHashes <= 8; noHashes++) {
    int noFalsePositives = 0;
    int noNotIn = 0;
    BloomFilter bf = new BloomFilter(log2bits, noHashes);
    Set<String> already = new HashSet<>(noItems);
    // Add items to Bloom filter, remembering in a regular set
    // which strings we really added
    for (int itemNo = 0; itemNo < noItems; itemNo++) {
      String s = randomString(r);
      already.add(s);
      bf.add(s);
    }
    // Now test for false positives: strings never added, but
    // which the filter nonetheless reports as present
    for (int n = 0; n < NO_FALSE_POSITIVE_TESTS; n++) {
      String s = randomString(r);
      if (!already.contains(s)) {
        noNotIn++;
        if (bf.contains(s)) noFalsePositives++;
      }
    }
    double falsePositiveRate = noNotIn == 0 ? 0d :
        (double) noFalsePositives / noNotIn;
    System.out.printf("2^%d bits, %d hashes: %.6f%n",
        log2bits, noHashes, falsePositiveRate);
  }
}
Notice that we use SecureRandom rather than the regular
java.util.Random class. Due to weaknesses in the linear congruential
generator (LCG) algorithm used by java.util.Random, the latter is not suitable
for this kind of simulation, where we need to generate a large number of highly
random combinations.
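The experiment code above also assumes the BloomFilter class developed in the
previous section, with a constructor taking the log2 of the filter size plus the
number of hash codes. For reference, a minimal sketch consistent with that interface
might look like the following; the hashing scheme shown here (a simple seeded
polynomial hash) is just one possibility, and is not necessarily how the previous
section's implementation derives its hash codes:

import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int log2bits, noHashes;

    public BloomFilter(int log2bits, int noHashes) {
        this.log2bits = log2bits;
        this.noHashes = noHashes;
        this.bits = new BitSet(1 << log2bits);
    }

    // Derive the i-th bit index for a string by running a polynomial
    // hash from a different seed for each i, then masking down to the
    // filter size (a power of two)
    private int bitIndex(String s, int seed) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = h * 31 + s.charAt(i);
        }
        h ^= (h >>> 16);
        return h & ((1 << log2bits) - 1);
    }

    public void add(String s) {
        for (int i = 0; i < noHashes; i++) {
            bits.set(bitIndex(s, i));
        }
    }

    public boolean contains(String s) {
        for (int i = 0; i < noHashes; i++) {
            if (!bits.get(bitIndex(s, i))) return false;
        }
        return true;
    }
}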
To create our random strings, we use the following method: we arbitrarily
choose to create strings whose length centres around 5 (a "normal word length")
and whose characters are randomly chosen from "regular letters" plus a few
accented characters. In other words, we aim to create "vaguely typical" strings
without going to too much effort¹.
private static final String LETTERS =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZéèêàôû";
private static String randomString(Random r) {
    int wordLen;
    // Pick a Gaussian-distributed word length centred around 5,
    // clamped by rejection to the range 1..12
    do {
        wordLen = 5 + 2 * (int) (r.nextGaussian() + 0.5d);
    } while (wordLen < 1 || wordLen > 12);
    StringBuilder sb = new StringBuilder(wordLen);
    for (int i = 0; i < wordLen; i++) {
        char ch = LETTERS.charAt(r.nextInt(LETTERS.length()));
        sb.append(ch);
    }
    return sb.toString();
}
Next: graph of results
On the next page, we show the results of this experiment, and hence
how the false positive rate varies
as a function of filter size and number of hash codes.
1. Of course in reality, typical words of a given language behave in a much more
complex manner than this: the probability of a given letter is highly dependent on the
previous letter, for example.