Transforming data to improve Deflater performance

DEFLATE is a generic algorithm which on the whole works quite well with various types of data. In particular, it can typically compress textual data (including 'pure' human language text and text-based structured formats such as HTML and XML) to around 25-30% of its original size.

Of course, deflate and Deflater provide a generic algorithm designed to cope reasonably with "typical" data. But on its own, Deflater doesn't know about specific properties of your data that might help it achieve better compression. In the real world, most people don't have the time and resources to develop a completely new compression algorithm from scratch. But sometimes it's possible to "go halfway" and transform your data to work better with the deflater.

Using the FILTERED or HUFFMAN_ONLY modes

The zlib library (and hence the Deflater class) allows a couple of strategy settings beyond the default of dictionary-plus-Huffman compression:

So when are these options useful? Well, occasionally we can get our data into such a format that certain bytes are predominant in the stream, but with few "repeating patterns".

Example: compressing text

We'll look here at the specific example of compressing what is basically "ordinary text" in human languages. Specifically we'll look at some XML files containing some minimal markup and slabs of text in common European languages written in Roman script: English, French and Spanish. These all have the characteristic that: (a) they use a relatively small alphabet; (b) the probability of a given letter is highly uneven; (c) the probability of a given letter is somewhat dependent on the previous letter.

The fact that these are XML files means that in addition to letters of the alphabet, the data actually contains a few extra markup symbols, but we'll treat the problem as though we were just dealing with "pure" text. The approach still works quite well, and compressing XML files is a useful and common thing to want to do.

As mentioned previously, throwing text through a bog-standard Deflater actually performs reasonably. So if you have absolutely no development time available and need to compress text, you can typically get it down to 25-30% of its original size "for free". But it turns out that with a little bit of extra judicious processing, you can squeeze another few per cent out of the Deflater.

We essentially rely on some knowledge about the data that the Deflater doesn't have: namely that– and the way in which– the probability of a given letter depends on what the previous letter was. As an example, the graph below shows the frequency of various 'next letters' following the letter 't' in a sample of text (in this case the language was French).

Eyeballing the graph, we see that frequency generally tails off pretty sharply after the first couple of letters. The second-most-frequent letter is more or less half as frequent as the most frequent, and a similar relationship exists between the next pair. After that, we don't quite have the perfect negative-powers-of-two distribution that would be the optimum case for Huffman coding. But we do have a graph that starts off with halving frequencies, and then frequencies are pretty low in any case after that. It turns out that if we consider the distribution of "next letters" after most letters, they have a graph quite similar to that for 't'.

In other words, there's a transformation that we can apply to our data that makes it a fairly good candidate for Huffman encoding. We turn our data into a list of bytes where each byte represents the index of the given character in an ordered list such as in the X axis of the graph above. In other words, if the last byte in the stream was a 't' and the next byte is a space, we write 0 because a space is the 0th most frequent character after a 't'; if after a 't' we have an 'e' we write 1, etc.

We end up with a data stream that hugely favours low numbers, in particular sequences of consecutive 0s. This pattern of data then compresses well with the FILTERED option of Deflater. In my test sample of text extracted from HTML pages containing news articles1, I got the following improvements in compression ration over "standard" deflation:

LanguageImprovement in compression
English11.9 %
French20.4 %
Spanish13.0 %

Of course, it's also possible that these improvements reflect other features of the text sources in question other than the language. But there does seem to be a general trend: if we're prepared to put in a bit of extra effort, we can improve on the standard Deflater at relatively low cost in terms of development time. It's a fairly simple task to write a routine to throw the frequencies of character-given-last-character into an array and then output indices from this array– certainly a much simpler task than writing and debugging the Huffman coding routines (or other compression routiens) yourself...!

Of course, this compression technique relies on "shared knowledge" in the form of the frequency tables which would need to be transmitted along with the data. However, these can be got down to around 3K in the typical case, so if you're transmitting a reasonable amount of data there is definitely still a saving.

1. You'll see from the graph that in fact, my text extraction process wasn't perfect, and some basic HTML tags remain in the text (notice that the end-of-tag character figures in the statistics). But if anything, this suggests that "cleaner" text would actually compress a little better using this technique.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.

Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.