Secure hash functions in Java

As you probably know if you've worked on certain algorithms such as hash tables, a hash function takes pieces of data and "randomly" maps them to integers. Generally, the resulting integers, called hashes or hash codes, are of some specified fixed number of bits. For example, Java collections use 32-bit hash codes.

In cryptography applications, we often need a so-called secure hash function. Secure hash functions are generaelly a subset of hash functions that fulfil at least two extra criteria:

it must be computationally impossible to reverse the mapping, that is, go from a hash code to a message or piece of data that would have generated that hash code;
it must be infeasible for a collision to occur: that is, for two messages to be found that have the same hash code.

In order to fulfil these criteria (or at least, as a by-product of needing to fulfil these criteria), secure hash functions generally have these characteristics:

they are slower to compute than the hash codes typically used to key hash maps;
they are wider (i.e. have more bits) than weak hash codes.

Secure hash codes are typically 128 bits wide at the very least; compare that, for example, to the 32-bit codes returned by Java's hashCode() method, or the 64-bit hash codes recommended for key-less hash maps in Numerical Recipes.

Applications of secure hash functions

Secure hash functions actually have various applications. A very common case is verifying the integrity of data. When we send some data, we append a hash of that data; on the receiving end, we re-hash the received data and check that the computed hash equals that sent; if any of the data has changed then (with overwhelming probability), the computed hash value will no longer match the original. Another case is where we need to authenticate some data, i.e. produce a kind of integrity check that only a party with a given private key could produce. (In this case, the general solution is to combine a hash code with encryption.)

In other cases, a secure hash function is useful to represent a particular item of data. For example, for the purpose of checking passwords, we need only store a hash of that password. When somebody enters their password, if the computed hash of what they entered matches the hash stored in the password file/database, we assume they knew the password.

This scheme, sometimes called compare by hash (CBH) can be used to search for duplicates of data on a hard drive or for synching data between multiple machines. Similarly, another example are the databases that various law enforcement agencies keep of known "disapproved" files that they want to search people's hard drives for. In these applications, keeping a database of the actual file contents, and/or transmitting and comparing those entire contents, would be impractical. Instead, only hashes are stored and compared.

More broadly, secure hash functions are useful in a variety of cases where we need a trapdoor function (i.e. one that cannot feasibly be reversed), especially where we need one with a limited or fixed-size result.

Next: computing hashes and choosing a hash algorithm

On the next page, we look at two issues:

how to compute a secure hash (message digest) in Java;
which hash algorithm to choose: we give an overview of the different options (although in reality, we see that there aren't too many to choose from!).

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants. Follow @BitterCoffey