The Java Collator class

When applied to localisation, the term collation generally refers to the conventions for ordering strings in a particular language and, by extension, for deciding when two strings should be considered equal.

When working purely with English texts, it is common not to think very much about collation at all. Most programmers take the order of strings to be that produced by Collections.sort(), which for strings is determined by the numeric values of their characters (the ASCII values, for plain English text); two strings are considered equal if String.equals() deems them to be so, or if the underlying byte values of the strings are identical. But there are cases when this clearly isn't adequate, even in English. Consider the following example, where we use the simple Collections.sort() method to sort three of my favourite words:

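// Requires java.util.ArrayList, java.util.Collections and java.util.List
List<String> list = new ArrayList<>();
list.add("caffeine");
list.add("café");
list.add("cafeteria");
// Sort using the natural ordering of String (character values)
Collections.sort(list);
System.out.println(list);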

When we run this, we get the following output:

[cafeteria, caffeine, café]

Depending on the sorting convention we want to use, there are at least two orderings of these words that would be considered acceptable, for example, in dictionaries. But sadly the above, with the letter e (albeit with an accent) following the letter f, is not generally one of them...
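The result follows from the way String's natural ordering works: it compares the numeric values of the characters, and the accented é has a higher value than any unaccented lower-case letter. A minimal sketch of this (the class name CharValueDemo and the helper codeUnitOrder are purely illustrative, not part of any library):

```java
public class CharValueDemo {
    // Hypothetical helper: exposes String's natural ordering, which is what
    // Collections.sort() uses when no Comparator is supplied.
    public static int codeUnitOrder(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        // Numeric (UTF-16) values of the characters involved; \u00e9 is the accented e
        System.out.println((int) 'e');      // 101
        System.out.println((int) 'f');      // 102
        System.out.println((int) '\u00e9'); // 233

        // Positive result: "caf\u00e9" sorts after "caffeine" under natural ordering
        System.out.println(codeUnitOrder("caf\u00e9", "caffeine") > 0); // true
    }
}
```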

Correcting the sort order: introducing the Collator

If we're prepared to accept some default behaviour, then fixing the sort order is actually very simple. We obtain an instance of a Collator object and then pass this to the Collections.sort() method:

Collator coll = Collator.getInstance(); 
Collections.sort(list, coll);

With this modification, the sort() method puts the word café in a more conventionally acceptable place, before the word caffeine (and in fact before the word cafeteria, treating it as though it were spelt without the accent).
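Putting the two snippets together gives something like the following self-contained sketch. The class and method names are illustrative, and Locale.ENGLISH is passed explicitly here only as an assumption to make the result reproducible; the snippet above relies on the default locale instead.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollatorSortDemo {
    // Sorts the three example words using an English collator.
    public static List<String> sortedWords() {
        List<String> list = new ArrayList<>();
        list.add("caffeine");
        list.add("caf\u00e9"); // the accented word, escaped so source encoding does not matter
        list.add("cafeteria");

        // Assumption: an explicit locale, so the output does not depend on
        // the machine's default locale.
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        Collections.sort(list, coll);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(sortedWords()); // [café, cafeteria, caffeine]
    }
}
```

Since Collator implements Comparator, the same object could equally be passed to List.sort() or used as the comparator of a TreeSet.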

Configuring the collator

So what is a Collator? A collator is essentially an object that knows how to sort and compare strings. The default is appropriate for many applications. However, the collator's behaviour can be customised slightly:

  • you can request a collator that implements standard sorting rules for a particular locale by supplying the corresponding Locale object to the Collator.getInstance() method;
  • you can set the strength of the collator, which effectively means what type of difference needs to exist between two letters/strings for them to be considered different;
  • you can set something called the decomposition mode, which affects how composite characters (such as characters with accents) are treated.
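As a rough illustration of these three options, the sketch below requests a collator for a specific locale, then sets its strength and decomposition mode using the standard constants on java.text.Collator. The class name, the helper method and the choice of Locale.FRENCH are assumptions made for the example.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorConfigDemo {
    // Hypothetical helper: a French collator with the given strength and
    // canonical decomposition switched on.
    public static Collator collator(int strength) {
        Collator coll = Collator.getInstance(Locale.FRENCH);      // locale-specific rules
        coll.setStrength(strength);                               // which differences count
        coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);  // split composite characters
        return coll;
    }

    public static void main(String[] args) {
        Collator primary = collator(Collator.PRIMARY);
        Collator secondary = collator(Collator.SECONDARY);
        Collator tertiary = collator(Collator.TERTIARY);

        // PRIMARY strength ignores accents; TERTIARY does not
        System.out.println(primary.compare("cafe", "caf\u00e9") == 0);  // true
        System.out.println(tertiary.compare("cafe", "caf\u00e9") == 0); // false

        // SECONDARY strength ignores case; TERTIARY does not
        System.out.println(secondary.compare("cafe", "Cafe") == 0); // true
        System.out.println(tertiary.compare("cafe", "Cafe") == 0);  // false

        // With canonical decomposition, a precomposed accented e (U+00E9) and
        // 'e' followed by a combining acute accent (U+0301) compare as equal
        System.out.println(tertiary.compare("caf\u00e9", "cafe\u0301") == 0); // true
    }
}
```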


Written by Neil Coffey. Copyright © Javamex UK 2009. All rights reserved.