Having introduced the concept of a Bloom filter, we now look at an example Java implementation. We will consider the case of representing a set of strings.

The basic component that we need to implement our Bloom filter is a way of
generating an arbitrary number of good-quality hash codes for a given string.
Previously, we discussed a strong
hash code implementation which produced 64-bit hash codes based on a table
of random 64-bit numbers. We can easily modify this method to produce an arbitrary
number of hash codes simply by extending the table of random values. In the original
implementation, we use each consecutive byte of the string as an index into 256
random values. In our modified version, we have a table of 256 * *k* random
values (where *k* is the number of hash codes) and then for byte value *b*
of the *n*the hash code, we look up the value at index 256 * *n* + *b*.
The code for this part of the implementation looks as follows:

public class BloomFilter { private static final int MAX_HASHES = 8; private static final long[] byteTable; private static final long HSTART = 0xBB40E64DA205B064L; private static final long HMULT = 7664345821815920749L; static { byteTable = new long[256 * MAX_HASHES]; long h = 0x544B2FBACAAF1684L; for (int i = 0; i < byteTable.length; i++) { for (int j = 0; j < 31; j++) h = (h >>> 7) ^ h; h = (h << 11) ^ h; h = (h >>> 10) ^ h; byteTable[i] = h; } } private long hashCode(String s, int hcNo) { long h = HSTART; final long hmult = HMULT; final long[] ht = byteTable; int startIx = 256 * hcNo; for (int len = s.length(), i = 0; i < len; i++) { char ch = s.charAt(i); h = (h * hmult) ^ ht[startIx + (ch & 0xff)]; h = (h * hmult) ^ ht[startIx + ((ch >>> 8) & 0xff)]; } return h; } }

Recalling that each `char` making up our string is actually two bytes
in width, the scary lines with the bit manipulation effectively pull out the
two bytes from each `char` and use them as an index into the random value
table as described above.

The value of `MAX_HASHES` is arbitrary. In practice, as we'll see below,
sensible numbers of hash codes will generally lie within the range 1-8.

Now we have a way of generating *k* hash codes for a given string,
the `add()` and `contains()` methods are relatively straightforward.
Internally, our Bloom filter consists of a `BitSet`. We allow instances
of our filter to be initialised with the number of bits and number of hashes.
For efficiency, we actually require that the number of bits be a power of 2,
as with our hash table implementation. Thus, the first constructor will take
the base 2 logarithm of the number of bits (so that passing in a value of 16 gives
you a Bloom filter of size 2^16 etc):

public class BloomFilter { ... code as above ... private final BitSet data; private final int noHashes; private final int hashMask; public BloomFilter(int log2noBits, int noHashes) { if (log2noBits < 1 || log2noBits > 31) throw new IllegalArgumentException("Invalid number of bits"); if (noHashes < 1 || noHashes > MAX_HASHES) throw new IllegalArgumentException("Invalid number of hashes"); this.data = new BitSet(1 << log2noBits); this.noHashes = noHashes; this.hashMask = (1 << log2noBits) - 1; }

Now, as with bucket index calculations in our hash table implementation, we can calculate bit indexes from hash codes simply by ANDing with the mask. In actual practice, we may not want to have to pass in the logarithm of the number of bits, so we also implemet a constructor that in effect calculates the required value from a maximum number of items and a number of bits per item:

public BloomFilter(int noItems, int bitsPerItem, int noHashes) { int bitsRequired = noItems * bitsPerItem; if (bitsRequired >= Integer.MAX_VALUE) { throw new IllegalArgumentException("Bloom filter would be too big"); } int logBits = 4; while ((1 << logBits) < bitsRequired) logBits++; if (noHashes < 1 || noHashes > MAX_HASHES) throw new IllegalArgumentException("Invalid number of hashes"); this.data = new BitSet(1 << logBits); this.noHashes = noHashes; this.hashMask = (1 << logBits) - 1; } }

Now, the `add()` and `contins()` methods are trivial: we simply calculate
the required hash codes, each time deriving a setting/checking the corresponding bit
index:

public void add(String s) { for (int n = 0; n < noHashes; n++) { long hc = hashCode(s, n); int bitNo = (int) (hc) & this.hashMask; data.set(bitNo); } } public boolean contains(String s) { for (int n = 0; n < noHashes; n++) { long hc = hashCode(s, n); int bitNo = (int) (hc) & this.hashMask; if (!data.get(bitNo)) return false; } return true; }

In the next section, we look at the behaviour of the false positive rate of our Bloom filter implementation.

Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.