Tokenising a string with regular expressions

A regular expression can be used to split, or tokenise, a string. The result is similar to what StringTokenizer offers, but more flexible. To split a string, we can write a line such as the following:

String[] words = str.split("\\s+");

The regular expression that we pass to the String's split() method defines the pattern that separates the tokens. In case you haven't come across it, the sequence \s denotes any whitespace character, including line breaks and tabs, and the + that follows it means "one or more". (The backslash is doubled because a backslash is itself the escape character inside a Java string literal: \\ in the source code produces the single backslash that the regular expression engine needs to see.) For more information, see the page on named character classes.
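As a minimal sketch of the behaviour described above, the following splits a string that contains a tab, a line break and a run of several spaces. The class name SplitWhitespaceDemo and the helper tokens() are made up for illustration.

```java
import java.util.Arrays;

public class SplitWhitespaceDemo {

    // Hypothetical helper: splits on runs of one or more whitespace characters.
    static String[] tokens(String str) {
        // In source code, "\\s+" is the three-character regex \s+ :
        // the compiler turns \\ into a single backslash.
        return str.split("\\s+");
    }

    public static void main(String[] args) {
        String str = "one two\tthree\nfour   five";
        System.out.println(Arrays.toString(tokens(str)));
        // prints [one, two, three, four, five]
    }
}
```

Note that if the string starts with whitespace, the first element of the array is an empty string, so it can be worth calling trim() on the string before splitting.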

Because we can use almost any regular expression to define the split points, the String.split() method is more flexible than a StringTokenizer. For example, the following specifies that a run of between two and four spaces acts as the separator between tokens:

String[] words = str.split(" {2,4}");
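To see what this pattern does in practice, here is a small sketch. The class name SplitSpacesDemo, the helper tokens() and the sample string are invented for the example. A single space does not match the pattern, so it stays inside its token.

```java
import java.util.Arrays;

public class SplitSpacesDemo {

    // Hypothetical helper: a separator is a run of two to four space characters.
    static String[] tokens(String str) {
        return str.split(" {2,4}");
    }

    public static void main(String[] args) {
        // "a", two spaces, "b c" (one space inside), four spaces, "d"
        String str = "a  b c    d";
        System.out.println(Arrays.toString(tokens(str)));
        // prints [a, b c, d]
    }
}
```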

Having the tokens directly in an array makes the syntax much less fussy for looping through the tokens. Using the Java 5 foreach syntax, we can write:

for (String word : str.split("\\s+")) {
  ...
}

Compare this to the clumsier syntax we'd have to use with a StringTokenizer.
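For comparison, here is a sketch of the two approaches side by side. The class and method names (TokenizerLoopDemo, withTokenizer, withSplit) are hypothetical, and the tokens are simply collected into a list so that the results can be compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerLoopDemo {

    // Hypothetical helper: the StringTokenizer idiom needs an explicit
    // hasMoreTokens()/nextToken() loop.
    static List<String> withTokenizer(String str) {
        List<String> words = new ArrayList<String>();
        // With no delimiter argument, StringTokenizer splits on space, tab,
        // newline, carriage return and form feed.
        StringTokenizer st = new StringTokenizer(str);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        return words;
    }

    // Hypothetical helper: the same job with split() and the foreach loop.
    static List<String> withSplit(String str) {
        List<String> words = new ArrayList<String>();
        for (String word : str.split("\\s+")) {
            words.add(word);
        }
        return words;
    }

    public static void main(String[] args) {
        String str = "the quick\tbrown fox";
        System.out.println(withTokenizer(str)); // prints [the, quick, brown, fox]
        System.out.println(withSplit(str));     // prints [the, quick, brown, fox]
    }
}
```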

Performance

The String.split() method is more flexible than a StringTokenizer, but this flexibility comes at a price: using String.split() is around twice as slow as using a StringTokenizer. The next page discusses the performance of string splitting in more detail.
