In the previous section, we looked at a Java implementation of a Bloom filter to represent a set of strings. In this section, we look more closely at how the bloom filter performs as we change two key parameters:

- the number of
**bits per item**: that is, how many times more bits there are in the filter than the maximum number of items we want to represent in the set; - the number of
**hash codes**, and hence the number of bits that we actually set for each object that we add to the set.

Intuitively, we can envisage the following general principles coming into play:

- the
**fuller**the bit set is (or put another way, the fewer bits we allow per item), the**greater the chance of false positives**as we "accidentally" marking items as present as we add more items to the set; - when the bit set is
**relatively empty**(with more bits per item), then the more bits we require to be marked as set in order to mark an item as "present"— i.e. the more**hash codes**per item— the**lower the chance of false positives**, because for a given item potentially in the set, there's less chance of some random combination of bits from other items also accidentally marking that item as present when it isn't; - for a given filter size,
there's a
**point of no return**, at which having more hash codes simply means that we fill up the bit set too quickly as we add items— and hence get more false positives— than with fewer hash codes.

To see what this all adds up to in practice, we look at the example of adding
65,536 (=2^{14}) random strings to a Bloom filter whose size varies from
2^{14} to 2^{23} bits, and where the number of hash codes varies
from 1 to 8. For each of these combinations of size and number of hash codes, we
create the corresponding Bloom filter then add 65,536 items to it. We also
instantiate a regular `HashSet` and add the strings to this, so that we
can determine what strings we "really" added to the Bloom filter and thus detect
false positives. Then, the final step is to generate a number (in this case 1 million)
random strings. For each of these *not* in the set (which of course we expect to
be the vast majority), we count the number reported to be present by the Bloom filter:
these are the false positives. The code looks essentially like this:

Random r = new SecureRandom(); final int noItems = 1 << 14; for (int log2bits = 14; log2bits <= 23; log2bits++) { for (int noHashes = 1; noHashes <= 8; noHashes++) { double noFalsePositives = 0; int noNotIn = 0; BloomFilter bf = new BloomFilter(log2bits, noHashes); Setalready = new HashSet (noItems); // Add items fo Bloom filter for (int itemNo = 0; itemNo < noItems; itemNo++) { String s = randomString(r); already.add(s); bf.add(s); } // Now test for false positives for (int n = 0; n < NO_FALSE_POSITIVE_TESTS; n++) { String s = randomString(r); if (!already.contains(s)) { noNotIn++; if (bf.contains(s)) noFalsePositives++; } } double falsePositiveRate = noNotIn == 0 ? 0d : noFalsePositives / noNotIn; } }

Notice that we use `SecureRandom` rather than the regular
`java.lang.Random` class. Due to weaknesses in
the LCG algorithm used by `java.lang.Random`, the latter is not suitable
for this kind of simulation where we need to generate a large numbe of highly
random combinations.

To create our random strings, we use the following method: we arbitrarily
choose to create strings whose length centres around 5 (a "normal word length")
and whose characters are randomly chosen from "regular letters" plus a few
accented characters. In other words, we aim to create "vaguely typical" strings
without going to too much effort^{1}.

private static final String LETTERS = "abcdefghijklmnopqrstuvexyABCDEFGHIJKLMNOPQRSTUVWYXZzéèêàôû"; private static String randomString(Random r) { int wordLen; do { wordLen = 5 + 2 * (int) (r.nextGaussian() + 0.5d); } while (wordLen < 1 || wordLen > 12); StringBuilder sb = new StringBuilder(wordLen); for (int i = 0; i < wordLen; i++) { char ch = LETTERS.charAt(r.nextInt(LETTERS.length())); sb.append(ch); } return new String(sb); }

On the next page, we show the result of this experiment, and hence how false positive rate varies as a function of filter size and number of hash codes.

1. Of course in reality, typical words of a given language behave in a much more complex manner than this: the probability of a given letter is highly dependent on the previous letter, for example.

Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.