Efficient substring matching with regular expressions in Java
In our introduction to Java regular expressions, we mentioned that one of the advantages
of using regular expressions was efficiency. This may sound surprising at first glance: in order to
interpret and apply our regular expression, the regex API clearly has to do some non-trivial work
beyond that of a naive algorithm that simply reads and compares or counts characters in sequence.
In our comparison of String.split() vs StringTokenizer,
the String.split() method performed admirably, but it still ran at half the speed of the
(legacy and less flexible) StringTokenizer.
You may therefore be asking: how efficient are regular expressions for performing typical string
matching operations?
In fact, they are often surprisingly efficient. As we will see below, the regex API
contains optimisations that can make it more scalable than a naive implementation of the equivalent task.
As an example, let us consider the common case of substring matching, where we wish to determine
the locations where a particular substring occurs within another larger string.
A naive routine to find substring matches might look as follows:
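import java.util.ArrayList;
import java.util.List;

public static List<Integer> findMatchPoints(String str, String searchFor) {
    List<Integer> matchPoints = new ArrayList<>();
    // Number of start positions at which searchFor could still fit inside str
    int max = str.length() - searchFor.length() + 1;
    next_location:
    for (int i = 0; i < max; i++) {
        // Compare character by character; give up on this position at the first mismatch
        for (int j = 0; j < searchFor.length(); j++) {
            if (str.charAt(i + j) != searchFor.charAt(j)) {
                continue next_location;
            }
        }
        // Every character matched, so i is a match site
        matchPoints.add(i);
    }
    return matchPoints;
}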
On short strings, this naive algorithm may be sufficient and may even outperform its regular expression
equivalent. But from a scalability perspective, it is inefficient: in the worst case, where no match is found
but many positions almost match, it compares nearly every character in the input string with every character
in the substring being matched. Put another way, whenever a non-match occurs at a particular position, the
naive algorithm "throws away" information that was gathered along the way: of the non-matching subsequence,
how many characters did match, and can this be used as a hint as to where to resume searching for the next
potential match site?
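To make the worst case concrete, here is a minimal sketch that counts the character comparisons performed by the naive scan. The class and method names (NaiveComparisonCounter, countComparisons) are hypothetical and exist only for illustration. For a string of n characters and a search string of m characters, the count can reach (n - m + 1) × m.

```java
// Hypothetical helper for illustration only: counts the character
// comparisons made by the naive scan described above.
final class NaiveComparisonCounter {

    static long countComparisons(String str, String searchFor) {
        long comparisons = 0;
        // Number of start positions at which searchFor could still fit
        int max = str.length() - searchFor.length() + 1;
        next_location:
        for (int i = 0; i < max; i++) {
            for (int j = 0; j < searchFor.length(); j++) {
                comparisons++;
                if (str.charAt(i + j) != searchFor.charAt(j)) {
                    // Mismatch: everything learned at this position is discarded
                    continue next_location;
                }
            }
        }
        return comparisons;
    }
}
```

Searching the ten-character string "AAAAAAAAAA" for "AAAB" costs 28 comparisons (7 positions × 4 characters), because every position almost matches. Searching "XXXXXXXXXX" for the same substring costs only 7, because each position fails on its first character.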
Of course, we could implement a more efficient algorithm from scratch (the Knuth-Morris-Pratt
and Boyer-Moore-Horspool algorithms are two well-known approaches). But the regular expression API already
offers such an algorithm out of the box. To gain the benefit, we can replace our method with the following:
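import java.util.List;
import java.util.regex.*;
import java.util.stream.Collectors;

public static List<Integer> findMatchPoints(CharSequence str, String searchFor) {
    // Pattern.LITERAL treats searchFor as plain text rather than as a regex
    Pattern p = Pattern.compile(searchFor, Pattern.LITERAL);
    Matcher m = p.matcher(str);
    return m.results().map(MatchResult::start).collect(Collectors.toList());
}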
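For comparison, here is a minimal sketch of what a hand-written version of the Boyer-Moore-Horspool algorithm mentioned above might look like. It is an illustration under stated assumptions, not the code used inside the regex API: the class name HorspoolSearch is hypothetical, the shift table is held in a HashMap for simplicity, and an empty search string is assumed to yield no matches. Like the naive routine, it reports overlapping matches, whereas repeated calls to Matcher.find() resume after the end of the previous match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration: a basic Boyer-Moore-Horspool search.
final class HorspoolSearch {

    static List<Integer> findMatchPoints(String str, String searchFor) {
        List<Integer> matchPoints = new ArrayList<>();
        int n = str.length();
        int m = searchFor.length();
        if (m == 0 || m > n) {
            return matchPoints; // assumption: empty search string gives no matches
        }

        // For every pattern character except the last, record how far the
        // window may move when that character sits under the window's end.
        Map<Character, Integer> shift = new HashMap<>();
        for (int j = 0; j < m - 1; j++) {
            shift.put(searchFor.charAt(j), m - 1 - j);
        }

        int i = 0;
        while (i <= n - m) {
            // Compare right to left within the current window
            int j = m - 1;
            while (j >= 0 && str.charAt(i + j) == searchFor.charAt(j)) {
                j--;
            }
            if (j < 0) {
                matchPoints.add(i);
            }
            // Characters that do not occur in the pattern allow a full-length skip
            i += shift.getOrDefault(str.charAt(i + m - 1), m);
        }
        return matchPoints;
    }
}
```

The saving comes from the final line of the loop: instead of always advancing by one position, the search uses the character under the end of the window to decide how far it can safely jump.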
Let us consider a slightly contrived example:
String: SPOONS AND SPIN SPAN IN PINS AND SNIPS, SPAN INTO SIPS
Search for: SPAN
Our naive algorithm finds the two match sites in 64 comparisons, while the regex implementation (as of Java 9) finds them
in 28 comparisons. In a real-world application where we needed to perform many such searches, we would also
consider other optimisations, such as re-using the same compiled Pattern instance where possible.
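As a minimal sketch of re-using a compiled Pattern, the search string can be compiled once and the resulting object kept for every subsequent search. The class name SubstringFinder is hypothetical. Pattern instances are immutable and safe to share between threads, whereas each Matcher should be used by one thread at a time, so a new Matcher is created per call.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class for illustration: compiles the search string once
// and re-uses the compiled Pattern for every input string.
final class SubstringFinder {
    private final Pattern pattern;

    SubstringFinder(String searchFor) {
        // Pattern.LITERAL treats searchFor as plain text rather than as a regex
        this.pattern = Pattern.compile(searchFor, Pattern.LITERAL);
    }

    List<Integer> findMatchPoints(CharSequence str) {
        // A fresh Matcher per call; Matcher.results() requires Java 9 or later
        return pattern.matcher(str).results()
                .map(MatchResult::start)
                .collect(Collectors.toList());
    }
}
```

A caller would create one SubstringFinder for "SPAN" and then call findMatchPoints() on each input string in turn, paying the compilation cost only once.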
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.