The Java Collator class
When applied to localisation, the term collation generally refers to the conventions for
ordering strings in a particular language and, by extension, for when to consider two strings
to be equal.
When working purely with English texts, it is common not to think very much about
collation at all. Most programmers take the order of strings to be that produced by
Collections.sort(), which is determined by the numeric (ASCII or, more generally, Unicode)
values of their characters; two strings are considered equal if String.equals() deems them
to be so, or if the underlying byte values of the strings are identical. But there are cases
when this clearly isn't adequate, even in English. Consider the following example, where we
use the simple Collections.sort() method to sort three of my favourite words:
List<String> list = new ArrayList<>();
list.add("caffeine");
list.add("café");
list.add("cafeteria");
Collections.sort(list);
System.out.println(list);
When we run this, we get the following output:
[cafeteria, caffeine, café]
Depending on the sorting convention we want to use, there are at least two orderings of
these words that would be considered acceptable, for example, in dictionaries.
But sadly the above, with the letter e (albeit with an accent) following the letter f,
is not generally one of them...
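To see why the default sort behaves this way, it can help to look at the character values directly. The following is a minimal sketch; the class name CodePointOrderDemo and the helper naturalCompare() are hypothetical names chosen for illustration, and the accented character is written as a Unicode escape so that the source does not depend on file encoding.

```java
class CodePointOrderDemo {
    // Hypothetical helper: the default String ordering, which compares
    // the numeric values of the characters position by position.
    static int naturalCompare(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // "cafe" with an acute accent on the final e

        // The accented e has a larger numeric value than the plain letter f...
        System.out.println((int) 'f');
        System.out.println((int) '\u00e9');

        // ...so the default comparison places the accented word after "caffeine".
        System.out.println(naturalCompare(cafe, "caffeine") > 0);
    }
}
```

The comparison stops at the first position where the two strings differ, which here is the fourth character: the accented e against the plain f.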
Correcting the sort order: introducing the Collator
If we're prepared to accept some default behaviour, then fixing the sort order is actually
very simple. We obtain an instance of a Collator object and then pass this to the
Collections.sort() method:
Collator coll = Collator.getInstance();
Collections.sort(list, coll);
With this modification, the sort() method puts the word café in
a more conventionally acceptable place, before the word caffeine (and in fact before
the word cafeteria, treating it as though it were spelt without the accent).
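Putting the pieces together, a complete version of the example might look like the sketch below. The class name CollatorSortDemo and the helper sortWithCollator() are hypothetical, and the sketch passes Locale.ENGLISH explicitly (an assumption made here so that the result does not depend on the default locale of the machine running it).

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollatorSortDemo {
    // Hypothetical helper: returns a sorted copy of the words, ordered
    // according to the collation rules of the given locale.
    static List<String> sortWithCollator(List<String> words, Locale locale) {
        List<String> sorted = new ArrayList<>(words);
        Collator coll = Collator.getInstance(locale);
        Collections.sort(sorted, coll); // Collator is a Comparator, so it can be passed directly
        return sorted;
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final e
        List<String> words = Arrays.asList("caffeine", "caf\u00e9", "cafeteria");
        System.out.println(sortWithCollator(words, Locale.ENGLISH));
    }
}
```

With an English collator, the accented word should come first, followed by cafeteria and then caffeine.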
Configuring the collator
So what is a Collator? A collator is essentially an object that knows how to
sort and compare strings. The default is appropriate for many applications. However, the
collator's behaviour can be customised slightly:
- you can request a collator that implements standard sorting rules for
a particular locale by supplying the corresponding Locale object to
the Collator.getInstance() method;
- you can set the strength of the collator, which effectively means
what type of difference needs to exist between two letters/strings for them to be
considered different;
- you can set something called the decomposition rule, which affects
how composite characters (such as characters with accents) are treated.
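The three options above can be sketched as follows. This is only an illustration under stated assumptions: the class name CollatorConfigDemo and the helper configure() are hypothetical, and Locale.ENGLISH is chosen arbitrarily. The calls themselves (Collator.getInstance(Locale), setStrength() and setDecomposition(), with the constants PRIMARY, TERTIARY and CANONICAL_DECOMPOSITION) are part of java.text.Collator.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorConfigDemo {
    // Hypothetical helper: builds a collator for a locale with the
    // requested strength and decomposition mode.
    static Collator configure(Locale locale, int strength, int decomposition) {
        Collator coll = Collator.getInstance(locale);
        coll.setStrength(strength);
        coll.setDecomposition(decomposition);
        return coll;
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9";   // precomposed e with acute accent
        String combining = "cafe\u0301"; // plain e followed by a combining acute accent

        // PRIMARY strength: only differences in base letters count,
        // so the accented and unaccented words compare as equal.
        Collator primary = configure(Locale.ENGLISH,
                Collator.PRIMARY, Collator.CANONICAL_DECOMPOSITION);
        System.out.println(primary.compare("cafe", accented) == 0);

        // TERTIARY strength: accent differences now count.
        Collator tertiary = configure(Locale.ENGLISH,
                Collator.TERTIARY, Collator.CANONICAL_DECOMPOSITION);
        System.out.println(tertiary.compare("cafe", accented) < 0);

        // With canonical decomposition, the two ways of writing the
        // accented e are treated as the same character.
        System.out.println(tertiary.compare(accented, combining) == 0);
    }
}
```

A weaker strength such as PRIMARY is often a reasonable choice for searching, where ignoring accents is helpful, whereas a stronger setting is usually preferable when a stable, fully determined sort order is needed.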
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.