How Deflater works (ctd)
On the previous page, we took a brief look at dictionary compression in the
DEFLATE algorithm. After applying dictionary compression, the DEFLATE
algorithm applies a form of Huffman encoding, also known as
Huffman compression.
It is worth understanding Huffman encoding because one of the
options that the Deflater offers is to apply Huffman encoding only.
Huffman encoding
Huffman encoding is one of the best-known compression techniques. Even when it
is not used on its own, it is often used in conjunction with another compression scheme
(as is indeed the case here).
Huffman encoding turns a stream of symbols into a stream of bits. It takes an "alphabet"
of symbols (for example, the possible "symbols" could be the byte values 0-255,
but they could be any arbitrary range of numbers/values), and re-codes each symbol in
the alphabet as a sequence of bits, so that the length of the sequence corresponding to each symbol
reflects the frequency of that symbol: the most frequently occurring symbol is
represented by the shortest bit sequence, and the rarer a symbol, the longer
its bit sequence.
Possible Huffman encodings can be worked out by creating a "Huffman tree". In the example
illustrated in the figures below, the alphabet consists of the symbols A, B, C and D, with
frequencies of 2, 3 and 9 for A, C and B respectively, and with D the most frequent symbol.
The procedure is as follows (a Java sketch of it appears after the figures):
- Start with a list of "nodes" consisting of the symbols and their
frequencies.
- Create a new node connecting the two nodes with the lowest frequencies: in this example,
we'd connect A and C first.
- Label the two branches with 0 and 1 (adopting some convention for whether
the branch to the less or more frequent child gets the 0).
- The newly created node is given a frequency which is the sum of the frequencies
of its two children: in this case it would be 2+3=5 (Fig 1).
- Repeat these steps until you end up with a single node, which is the "root"
of the tree. So next we'd connect the newly created node with B (5 and 9 are
still the two lowest frequencies), as in Fig 2; Fig 3 shows the completed tree.
- Start at the root and trace down the branches until you get to a "leaf" node
with one of the symbols: the sequence of 0s and 1s on the branches that you pass on the way
gives the sequence for encoding that symbol. So in Fig 3, tracing from the root to the
symbol B gives a code of 01; C has a code of 000; D, the most frequently
occurring symbol, has a single-bit code, 1.
[Fig 1: A and C merged into a node of frequency 5] [Fig 2: that node merged with B] [Fig 3: the completed Huffman tree]
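To make the construction concrete, here is a minimal Java sketch of it using a priority
queue. The class and method names are my own, and D's frequency of 20 is an assumption
(the figures only tell us that D is the most frequent symbol). Which branch gets labelled
0 and which 1 is just a convention, so the exact bit patterns may differ from Fig 3, but
the code lengths match.

import java.util.PriorityQueue;

public class HuffmanSketch {

    // A node is either a leaf carrying a symbol, or an internal node with two children.
    static class Node {
        final char symbol;      // only meaningful for leaves
        final int frequency;
        final Node zero, one;   // children reached via a 0 or a 1 bit (null for leaves)

        Node(char symbol, int frequency, Node zero, Node one) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.zero = zero;
            this.one = one;
        }

        boolean isLeaf() { return zero == null; }
    }

    public static void main(String[] args) {
        // Start with a list of nodes consisting of the symbols and their frequencies,
        // ordered so that the lowest-frequency node comes out first.
        PriorityQueue<Node> nodes = new PriorityQueue<>(
                (a, b) -> Integer.compare(a.frequency, b.frequency));
        nodes.add(new Node('A', 2, null, null));
        nodes.add(new Node('C', 3, null, null));
        nodes.add(new Node('B', 9, null, null));
        nodes.add(new Node('D', 20, null, null));   // frequency assumed for illustration

        // Repeatedly connect the two lowest-frequency nodes; the new node's frequency
        // is the sum of the frequencies of its two children.
        while (nodes.size() > 1) {
            Node first = nodes.poll();
            Node second = nodes.poll();
            nodes.add(new Node('\0', first.frequency + second.frequency, first, second));
        }

        // The single remaining node is the root: trace down to each leaf, accumulating
        // the 0s and 1s passed on the way to get each symbol's code.
        printCodes(nodes.poll(), "");
    }

    static void printCodes(Node node, String code) {
        if (node.isLeaf()) {
            System.out.println(node.symbol + " -> " + code);
        } else {
            printCodes(node.zero, code + "0");
            printCodes(node.one, code + "1");
        }
    }
}

Running this prints A -> 000, C -> 001, B -> 01 and D -> 1: the bit patterns for A and C
are swapped relative to Fig 3 (a different 0/1 convention), but each symbol's code length
is the same.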
When does Huffman coding perform well?
The idea of Huffman encoding is to assign each symbol a representation whose length
reflects how frequently it occurs: ideally, a symbol occurring with probability p would
be coded in log2(1/p) bits. But in reality, this is not usually possible.
The problem is that this ideal code length, log2(1/p), is generally a fractional number
of bits, whereas a Huffman code must assign each symbol a whole number of bits.
Huffman codes are only actually optimal when the probabilities of symbols are
negative powers of 2 (1/2, 1/4 etc): in other words, where there's one symbol occurring
around 50% of the time, then another around 25% of the time, etc.
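As a quick illustration of the rounding problem (the class and method names here are my
own, not a standard API):

public class IdealCodeLength {

    // The ideal (information-theoretic) code length, log2(1/p), for a symbol
    // occurring with probability p.
    static double idealBits(double p) {
        return -Math.log(p) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(idealBits(0.5));     // 1.0   : achievable with a whole number of bits
        System.out.println(idealBits(0.25));    // 2.0   : achievable
        System.out.println(idealBits(1.0 / 3)); // ~1.585: Huffman must round to 1 or 2 bits
    }
}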
However, if you do have data distributed in this way, you may be
able to take advantage of the Deflater's HUFFMAN_ONLY mode.
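For example, the following sketch selects Huffman-only compression via setStrategy()
(the setStrategy() method and HUFFMAN_ONLY constant are standard java.util.zip API;
the surrounding class and the sample data are my own, for illustration):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class HuffmanOnlyExample {

    // Compress data using Huffman encoding only, with no dictionary compression.
    static byte[] compressHuffmanOnly(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setStrategy(Deflater.HUFFMAN_ONLY);
        deflater.setInput(data);
        deflater.finish();

        // Drain the compressed output in chunks until the Deflater is finished.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Sample data whose byte frequencies are roughly negative powers of 2:
        // byte 0 about half the time, byte 1 about a quarter, byte 2 the rest.
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            double r = Math.random();
            data[i] = (byte) (r < 0.5 ? 0 : r < 0.75 ? 1 : 2);
        }
        byte[] compressed = compressHuffmanOnly(data);
        System.out.println(data.length + " bytes -> " + compressed.length + " bytes");
    }
}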
You may also be able to transform your data so that the distribution of
bytes is closer to the negative power of 2 relationship just described. On the next page,
I discuss a simple transformation that generally
improves text compression with the Deflater
by between 10 and 20%.