Home  Java compression intro  Deflater how-to  Deflater algorithm  Deflater configuration  Text compression performance  GZIP files  ZIP files

Search this site:
Threads Database Profiling Regular expressions Random numbers Compression Exceptions C Equivalents in Java

How Deflater works

On this page we look briefly at how the compression algorithm which underlies Java's Deflater works. As mentioned in the introduction, Deflater is actually a wrapper around the zlib library, which implements the deflate algorithm.

The deflate algorithm is essentially a combination of two mechanisms:

  • Firstly, dictionary-based compression is applied; this tries to replace recurring patterns in the data with a reference to the first (or a previous) occurrence of the pattern;
  • Then, Huffman encoding is applied: this encodes a byte as a variable number of bits, with more frequently-occurring bytes being given a correspondingly shorter code.

Dictionary compression

The compressor works within a "window" of data. It looks at the upcoming data to be compressed, and looks to see if it already encountered a matching sequence in the data recently processed. For example, consider the following sequence, where the character in bold is the character currently being processed, and those to the left have already been processed:

A B A B C D A B A B C D A...

The compressor sees that it has a sequence "AB" to compress, and that that sequence occurred 2 characters ago. So it can encode a reference looking something like (2, 2) to mean "same sequence as two characters ago, length 2 characters" (of course, that in turn was also probably encoded as a repetition of the earlier AB sequence). However, if it is configured to search deeper, it might also find that "ABC"– or even "ABCD"– occurred 6 characters ago. As you can see, how exactly the compressor matches upcoming input against previous sequences is something that is up to the Deflater (in fact the zlib) implementation, and it turns out that it can be configured to some extent. Specifically, we can configure a Deflater's compression level to indicate how far the deflater looks for matching sequences, resulting in a tradeoff between CPU and compression ratio.

If a character occurs which the compressor decides doesn't match any previous sequence (possibly because it doesn't look far enough), then it is simply encoded as itself.

As you can see from this description, the deflate algorithm will generally favour streams with repeating sequences of bytes. (Note that we've used letters for the sake of illustration, but all byte values are treated equally by the compressor: it doesn't specifically favour those representing printable characters.)

Huffman encoding

After applying dictionary compression, Huffman encoding is applied to the output. On the next page, we look at Huffman encoding in the DEFLATE algorithm.

comments powered by Disqus

Written by Neil Coffey. Copyright © Javamex UK 2012. All rights reserved.