Problems with ZIP files (in Java)

The ZIP file format is generally a widely-accepted, useful format. However, it does have a couple of quirks that you should be aware of. Both relate in some way to internationalisation.

Encoding of filenames

Many operating systems now allow non-ASCII characters in filenames. If you're not too familiar with character sets and character encoding, ASCII characters are essentially unaccented letters, numbers and a few symbols (plus a few "unprintable" control codes that we'll ignore for now). As bytes, they're usually all encoded in the range 0-127. Or put less technically, they're the characters "available on the average 1980s micro sold in the US".

In the early days of operating systems such as DOS or various home computing platforms, using anything other than these characters (such as accented characters, other alphabets etc) was not standardised and not well supported. Putting things such as accented characters and even spaces in file names was just "something that you didn't do", and was often not actually possible. So formats such as ZIP tended not to worry too much about the issue.

Unfortunately, nowadays it generally is possible to put any character in file names and in strings in general, and there are various different standards for dealing with this (common standards include Unicode encoded with UTF-8, and ISO-8859-1, which is a one-byte-per-character encoding for various Western European languages). For ZIP files, there has traditionally been no reliably followed standard encoding, and no standard way to indicate which encoding you've used (later versions of the format specification add a flag that marks names as UTF-8, but plenty of archives are written without it). So different tools will generally pick some encoding arbitrarily. If the tool that creates the ZIP file uses the same file name encoding as the tool that reads it, then all is generally merry. If not, beware the dragons.
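To make the mismatch concrete, here is a minimal sketch that involves no ZIP code at all: it simply encodes a file name to bytes with one character set and decodes those bytes with another, which is essentially what happens when the writing and reading tools disagree. The class and method names (FilenameEncodingMismatch, reinterpret) are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FilenameEncodingMismatch {

    // Hypothetical helper: encode the name as the "writing" tool would,
    // then decode the resulting bytes as the "reading" tool would.
    static String reinterpret(String name, Charset writerCharset, Charset readerCharset) {
        byte[] stored = name.getBytes(writerCharset);
        return new String(stored, readerCharset);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as a Unicode escape
        // so that the source file's own encoding doesn't matter.
        String name = "caf\u00e9.txt";

        // Same encoding on both sides: the name survives.
        System.out.println(reinterpret(name,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Written as UTF-8, read as ISO-8859-1: the accented letter
        // turns into two unrelated characters.
        System.out.println(reinterpret(name,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // Written as ISO-8859-1, read as UTF-8: the byte for the accented
        // letter is not valid UTF-8, so a replacement character appears.
        System.out.println(reinterpret(name,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

Plain ASCII names come out the same in all three cases, which is why the problem tends to go unnoticed until an accented name turns up.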

Java currently expects UTF-8 encoding. This generally means that if you create a ZIP file in Windows, filenames with non-ASCII characters will be mangled.

The issue was raised as Java bug ID 4244499, with a proposal to add a ZipFile constructor to take the character encoding (assuming the caller knows it). Since Java 7, ZipFile, ZipInputStream and ZipOutputStream have had constructors that accept a Charset for exactly this purpose. Another possible solution is to use the Arcmexer library, which allows you to set the file name encoding. This also has the advantage of being able to read encrypted ZIP files.
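If you do know the encoding, the standard library can be told about it: from Java 7 onwards, ZipOutputStream and ZipInputStream (and ZipFile) have constructors taking a Charset. The following is a minimal sketch rather than a recommended implementation. It writes a one-entry archive in memory with ISO-8859-1 file names and reads the name back with the same character set. The class and method names are invented for the example, and ISO-8859-1 merely stands in for "whatever encoding the other tool really used".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipNameCharsetDemo {

    // Writes a one-entry archive, encoding the entry name with nameCharset.
    static byte[] writeZip(String entryName, Charset nameCharset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Reads the name of the first entry, decoding it with nameCharset.
    static String readFirstName(byte[] zipBytes, Charset nameCharset) throws IOException {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        String name = "caf\u00e9.txt"; // accented e, as a Unicode escape
        byte[] archive = writeZip(name, StandardCharsets.ISO_8859_1);

        // The reader is told the encoding the writer used, so the
        // accented name should come back unchanged.
        String decoded = readFirstName(archive, StandardCharsets.ISO_8859_1);
        System.out.println(name.equals(decoded));
    }
}
```

The catch, of course, is the "assuming the caller knows it" part: nothing in the archive itself tells you that ISO-8859-1 was the right choice here.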

My advice is generally:

Don't use accents or other non-ASCII characters in filenames [1]! You really don't need them that much.

One historical problem is that filenames were never really intended to be a human-readable "title", but increasingly, that's how many "average users" are treating them. For the time being, there's no elegant solution to this problem, and the best we can really do is live with filenames with missing diacritics or occasional spelling changes to fit ASCII.

Time zones

A similar issue occurs with time zones. Timestamps in the basic ZIP format are stored as an MS-DOS-style local date and time (to a resolution of two seconds), but there's no way to say "relative to which time zone". Most ZIP tools, when either reading or writing, will simply use the time zone of the local machine. So if an archive is created in one time zone and read in another, the same stored timestamp will be interpreted as a different moment in time.
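As a rough illustration of why this matters, the sketch below takes one "wall clock" date and time, as might be stored in an archive, and interprets it in two different time zones using java.time. The class name, the helper method and the choice of zones and date are all just assumptions for the example. In Java itself, ZipEntry.getTime() does much the same thing using the JVM's default time zone.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class ZipTimestampZones {

    // Hypothetical helper: turn a stored local date/time into an absolute
    // instant, given the zone that the reader *assumes* it was written in.
    static Instant interpret(LocalDateTime storedLocalTime, ZoneId assumedZone) {
        return storedLocalTime.atZone(assumedZone).toInstant();
    }

    public static void main(String[] args) {
        // The same stored fields: 15 January 2021, 12:00 noon, no zone.
        LocalDateTime stored = LocalDateTime.of(2021, 1, 15, 12, 0);

        Instant asLondon = interpret(stored, ZoneId.of("Europe/London"));
        Instant asNewYork = interpret(stored, ZoneId.of("America/New_York"));

        System.out.println("Read in London:   " + asLondon);
        System.out.println("Read in New York: " + asNewYork);

        // The two readings disagree by the offset between the zones.
        System.out.println("Difference: " + Duration.between(asLondon, asNewYork));
    }
}
```

Neither reading is "wrong" as far as the archive is concerned: it simply doesn't record which one was meant.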

Again, a possible solution is to use Arcmexer, which allows you to specify the locale in which the zip file is assumed to have been created, and will convert times to the current locale accordingly.

1. By the way, call me a fuddy-duddy, but I would also say don't use spaces in filenames. I guess I just use command-line tools too much.


Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.