Bloom Filters: an example Java implementation

Having introduced the concept of a Bloom filter, we now look at an example Java implementation. We will consider the case of representing a set of strings.

Generating hash codes

The basic component that we need to implement our Bloom filter is a way of generating an arbitrary number of good-quality hash codes for a given string. Previously, we discussed a strong hash code implementation which produced 64-bit hash codes based on a table of random 64-bit numbers. We can easily modify this method to produce an arbitrary number of hash codes simply by extending the table of random values. In the original implementation, we use each consecutive byte of the string as an index into 256 random values. In our modified version, we have a table of 256 * k random values (where k is the number of hash codes) and then for byte value b of the nthe hash code, we look up the value at index 256 * n + b. The code for this part of the implementation looks as follows:

public class BloomFilter {
  private static final int MAX_HASHES = 8;
  private static final long[] byteTable;
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;
  
  static {
    byteTable = new long[256 * MAX_HASHES];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < byteTable.length; i++) {
      for (int j = 0; j < 31; j++)
        h = (h >>> 7) ^ h; h = (h << 11) ^ h; h = (h >>> 10) ^ h;
      byteTable[i] = h;
    }
  }

  private long hashCode(String s, int hcNo) {
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    int startIx = 256 * hcNo;
    for (int len = s.length(), i = 0; i < len; i++) {
      char ch = s.charAt(i);
      h = (h * hmult) ^ ht[startIx + (ch & 0xff)];
      h = (h * hmult) ^ ht[startIx + ((ch >>> 8) & 0xff)];
    }
    return h;
  }
}

Recalling that each char making up our string is actually two bytes in width, the scary lines with the bit manipulation effectively pull out the two bytes from each char and use them as an index into the random value table as described above.

The value of MAX_HASHES is arbitrary. In practice, as we'll see below, sensible numbers of hash codes will generally lie within the range 1-8.

Implementing the constructor, add() and contains() functions

Now we have a way of generating k hash codes for a given string, the add() and contains() methods are relatively straightforward. Internally, our Bloom filter consists of a BitSet. We allow instances of our filter to be initialised with the number of bits and number of hashes. For efficiency, we actually require that the number of bits be a power of 2, as with our hash table implementation. Thus, the first constructor will take the base 2 logarithm of the number of bits (so that passing in a value of 16 gives you a Bloom filter of size 2^16 etc):

public class BloomFilter {
  ... code as above ...

  private final BitSet data;
  private final int noHashes;
  private final int hashMask;

  public BloomFilter(int log2noBits, int noHashes) {
    if (log2noBits < 1 || log2noBits > 31)
      throw new IllegalArgumentException("Invalid number of bits");
    if (noHashes < 1 || noHashes > MAX_HASHES)
      throw new IllegalArgumentException("Invalid number of hashes");

    this.data = new BitSet(1 << log2noBits);
    this.noHashes = noHashes;
    this.hashMask = (1 << log2noBits) - 1;
  }

Now, as with bucket index calculations in our hash table implementation, we can calculate bit indexes from hash codes simply by ANDing with the mask. In actual practice, we may not want to have to pass in the logarithm of the number of bits, so we also implemet a constructor that in effect calculates the required value from a maximum number of items and a number of bits per item:

  public BloomFilter(int noItems, int bitsPerItem, int noHashes) {
    int bitsRequired = noItems * bitsPerItem;
    if (bitsRequired >= Integer.MAX_VALUE) {
      throw new IllegalArgumentException("Bloom filter would be too big");
    }
    int logBits = 4;
    while ((1 << logBits) < bitsRequired)
      logBits++;
    if (noHashes < 1 || noHashes > MAX_HASHES)
      throw new IllegalArgumentException("Invalid number of hashes");
    this.data = new BitSet(1 << logBits);
    this.noHashes = noHashes;
    this.hashMask = (1 << logBits) - 1;
  }
}

Now, the add() and contins() methods are trivial: we simply calculate the required hash codes, each time deriving a setting/checking the corresponding bit index:

  public void add(String s) {
    for (int n = 0; n < noHashes; n++) {
      long hc = hashCode(s, n);
      int bitNo = (int) (hc) & this.hashMask;
      data.set(bitNo);
    }
  }

  public boolean contains(String s) {
    for (int n = 0; n < noHashes; n++) {
      long hc = hashCode(s, n);
      int bitNo = (int) (hc) & this.hashMask;
      if (!data.get(bitNo)) return false;
    }
    return true;
  }

Next: analysis of false positive rates

In the next section, we look at the behaviour of the false positive rate of our Bloom filter implementation.