Secure hash functions
If we use a strong hash function such as
the 64-bit hash mentioned, we expect the chances of stumbling upon two items with an
identical hash to be negligible in practice. But for some applications, we actually
need it to be impossible in practice to find a collision. Such applications
include:
- Representing an item by its hash, but where a hash collision is more critical.
For example, consider the databases of "illegal files" kept by law enforcement agencies
so that they can search people's computers for those files. In practice,
it would probably not be feasible to store and run the search against the actual
bytes of every file in the universe that the Chinese government deems unacceptable.
So instead, a database of hashes is used. For various reasons, it could be
a "bad thing" if the text file containing your chocolate chip cookie recipe was
mistakenly identified as an "incitement to hatred" by our friends at Scotland Yard.
(Really, my chocolate chip cookies aren't that bad...)
So in practice, we want every distinct file content found on any computer in the world to
have a distinct hash code.
- Where a hash is uesd to validate data. For example, when sending
data across a network, it's common to append a hash of that data. The recipient can
then re-hash the data and check that his calculated hash matches that sent by the
sender. If they don't match, either a network error occurred, or an eavesdropper tampered
with the data. It would be a bad thing if an eavesdropper could change the data without
us realising (which could occur if they were able to find some other sequence of data
that hashed to the same value).
- In some protocols where broadly speaking
we want to commit to an answer without revealing what answer we're committing to.
The essential idea is that we can initially communicate the hash of the answer
we're committing to, and then later communicate the actual answer.
A common variant is where we want to check a password entered by a user is correct, but we don't
actually want to store the password on disk; in this case, if our hash function is
secure, we can store a hash of the password; then, at the time of login,
if the hash of the password they enter matches the hash on disk, we assume they
knew the password, as we assume it would be impossible for somebody to come up with an aribitrary
password to match a given hash. Similarly, in a voting system,
different counters could commit to having counted a particular number without actually
revealing what number they had counted until the moment of final correlation.
(In practice, we usually don't hash literally just the information: we
need to take extra steps in both cases to
protect against so-called dictionary attacks.)
Note that what we call secure hash functions here
are sometimes called cryptographic
hash functions, one-way hash functions
or message digests.