Bloom Filters: the false positive rate

In the previous section, we looked at a Java implementation of a Bloom filter to represent a set of strings. In this section, we look more closely at how the bloom filter performs as we change two key parameters:

Intuitively, we can envisage the following general principles coming into play:

To see what this all adds up to in practice, we look at the example of adding 65,536 (=214) random strings to a Bloom filter whose size varies from 214 to 223 bits, and where the number of hash codes varies from 1 to 8. For each of these combinations of size and number of hash codes, we create the corresponding Bloom filter then add 65,536 items to it. We also instantiate a regular HashSet and add the strings to this, so that we can determine what strings we "really" added to the Bloom filter and thus detect false positives. Then, the final step is to generate a number (in this case 1 million) random strings. For each of these not in the set (which of course we expect to be the vast majority), we count the number reported to be present by the Bloom filter: these are the false positives. The code looks essentially like this:

Random r = new SecureRandom();
final int noItems = 1 << 14;

for (int log2bits = 14; log2bits <= 23; log2bits++) {
  for (int noHashes = 1; noHashes <= 8; noHashes++) {
    double noFalsePositives = 0;
    int noNotIn = 0;

    BloomFilter bf = new BloomFilter(log2bits, noHashes);
    Set already = new HashSet(noItems);
    // Add items fo Bloom filter
    for (int itemNo = 0; itemNo < noItems; itemNo++) {
      String s = randomString(r);
      already.add(s);
      bf.add(s);
    }
    // Now test for false positives
    for (int n = 0; n < NO_FALSE_POSITIVE_TESTS; n++) {
      String s = randomString(r);
      if (!already.contains(s)) {
        noNotIn++;
        if (bf.contains(s)) noFalsePositives++;
      }
    }
    double falsePositiveRate = noNotIn == 0 ? 0d :
      noFalsePositives / noNotIn;
  }
}

Notice that we use SecureRandom rather than the regular java.lang.Random class. Due to weaknesses in the LCG algorithm used by java.lang.Random, the latter is not suitable for this kind of simulation where we need to generate a large numbe of highly random combinations.

To create our random strings, we use the following method: we arbitrarily choose to create strings whose length centres around 5 (a "normal word length") and whose characters are randomly chosen from "regular letters" plus a few accented characters. In other words, we aim to create "vaguely typical" strings without going to too much effort1.

private static final String LETTERS =
  "abcdefghijklmnopqrstuvexyABCDEFGHIJKLMNOPQRSTUVWYXZzéèêàôû";
private static String randomString(Random r) {
  int wordLen;
  do {
    wordLen = 5 + 2 * (int) (r.nextGaussian() + 0.5d);
  } while (wordLen < 1 || wordLen > 12);
  StringBuilder sb = new StringBuilder(wordLen);
  for (int i = 0; i < wordLen; i++) {
    char ch = LETTERS.charAt(r.nextInt(LETTERS.length()));
    sb.append(ch);
  }
  return new String(sb);
}

Next: graph of results

On the next page, we show the result of this experiment, and hence how false positive rate varies as a function of filter size and number of hash codes.


1. Of course in reality, typical words of a given language behave in a much more complex manner than this: the probability of a given letter is highly dependent on the previous letter, for example.