Efficient substring matching with regular expressions in Java

In our introduction to Java regular expressions, we mentioned that one of the advantages of using regular expressions was efficiency. This may sound surprising at first glance. In order to interpret and apply our regular expression, the regex API clearly has to do some non-trivial work beyond what a naive algorithm does when it simply reads and compares or counts characters in sequence. In a comparison of String.split() vs StringTokenizer, while the String.split() method performed admirably, it was still half the speed of the (legacy and less flexible) StringTokenizer.

You may therefore be asking: how efficient are regular expressions for performing typical string matching operations? In fact, they are often surprisingly efficient. As we will see below, the regex API contains optimisations that can make it more scalable than a naive implementation of the equivalent task.

As an example, let us consider the common case of substring matching, where we wish to determine the locations where a particular substring occurs within another larger string.

A naive routine to find substring matches might look as follows:


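import java.util.ArrayList;
import java.util.List;

public static List<Integer> findMatchPoints(String str, String searchFor) {
    List<Integer> matchPoints = new ArrayList<>();
    // One past the last index at which searchFor could still fit inside str
    int max = str.length() - searchFor.length() + 1;
next_location:
    for (int i = 0; i < max; i++) {
        // Compare character by character; abandon this position at the first mismatch
        for (int j = 0; j < searchFor.length(); j++) {
            if (str.charAt(i + j) != searchFor.charAt(j)) {
                continue next_location;
            }
        }
        matchPoints.add(i);
    }
    return matchPoints;
}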

On short strings, this naive algorithm may be sufficient and may even outperform its regular expression equivalent. But from a scalability perspective, it is inefficient: in the worst case, where the input contains many partial matches but few or no complete ones, it can end up comparing almost every character in the input string with every character in the substring being matched. For an input of length n and a substring of length m, that is on the order of n × m comparisons. Put another way: this naive algorithm is inefficient because, whenever a non-match occurs at a particular position, it "throws away" potential information that was gathered along the way (of the non-matching subsequence, how many characters did match, and can this be used as a hint as to where to resume searching for the next potential match site?).

We could, of course, implement a more efficient algorithm from scratch (the Knuth-Morris-Pratt and Boyer-Moore-Horspool algorithms are two well-known approaches). But the regular expression API already offers such an algorithm out of the box. To gain the benefit, we can replace our method with the following:


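import java.util.List;
import java.util.regex.*;
import java.util.stream.Collectors;

public static List<Integer> findMatchPoints(CharSequence str, String searchFor) {
    // Pattern.LITERAL treats searchFor as plain text, so characters such as '.' or '*' are not read as regex syntax
    Pattern p = Pattern.compile(searchFor, Pattern.LITERAL);
    Matcher m = p.matcher(str);
    return m.results().map(MatchResult::start).collect(Collectors.toList());
}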
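
For comparison, here is a rough idea of what a from-scratch version using a skip-ahead approach might involve. This is a minimal sketch of a Horspool-style search, not the code used inside the regex API; the class name HorspoolSearch and the choice to return no matches for an empty search string are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HorspoolSearch {

    // Hypothetical helper (not part of the JDK): returns every index at which
    // searchFor occurs in str, using a Horspool-style shift table.
    public static List<Integer> findMatchPoints(CharSequence str, String searchFor) {
        List<Integer> matchPoints = new ArrayList<>();
        int m = searchFor.length();
        int n = str.length();
        if (m == 0 || m > n) {
            // Assumption for this sketch: an empty search string yields no matches
            return matchPoints;
        }

        // For every pattern character except the last, record how far the window
        // may move when that character is found under the window's last position.
        Map<Character, Integer> shift = new HashMap<>();
        for (int k = 0; k < m - 1; k++) {
            shift.put(searchFor.charAt(k), m - 1 - k);
        }

        int i = 0;
        while (i <= n - m) {
            // Compare the window against the pattern from right to left
            int j = m - 1;
            while (j >= 0 && str.charAt(i + j) == searchFor.charAt(j)) {
                j--;
            }
            if (j < 0) {
                matchPoints.add(i);
            }
            // Characters that do not occur in the table allow a full-length jump
            i += shift.getOrDefault(str.charAt(i + m - 1), m);
        }
        return matchPoints;
    }

    public static void main(String[] args) {
        System.out.println(findMatchPoints("ABRACADABRA", "ABRA")); // prints [0, 7]
    }
}
```

The point of the shift table is that a single mismatching character can rule out several start positions at once, which is the kind of "hint" the naive loop discards. Like the naive loop, this sketch reports overlapping occurrences.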

Let us consider a slightly contrived example:


	String: SPOONS AND SPIN SPAN IN PINS AND SNIPS, SPAN INTO SIPS
	Search for: SPAN
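
To try this example out, a small harness along the following lines could be used. It is only a sketch: the class name SubstringExample is made up for illustration, and it assumes Java 9 or later (for Matcher.results()) and literal matching via the Pattern.LITERAL flag.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SubstringExample {

    // Regex-based search; LITERAL makes the search string plain text rather than regex syntax
    static List<Integer> findMatchPoints(CharSequence str, String searchFor) {
        Pattern p = Pattern.compile(searchFor, Pattern.LITERAL);
        Matcher m = p.matcher(str);
        return m.results().map(MatchResult::start).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "SPOONS AND SPIN SPAN IN PINS AND SNIPS, SPAN INTO SIPS";
        // The two occurrences of SPAN start at offsets 16 and 40
        System.out.println(findMatchPoints(text, "SPAN")); // prints [16, 40]
    }
}
```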

Our naive algorithm finds the two match sites in 64 comparisons, while the regex implementation (as of Java 9) finds them in 28 comparisons. In a real-world application that needed to perform many such searches, we would consider other optimisations, such as re-using the same compiled Pattern instance where possible.
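
As a sketch of what re-using a compiled Pattern might look like, the following wrapper compiles the search string once and then creates a fresh Matcher for each input. The class name SubstringFinder is an assumption made for illustration; the approach relies on the documented fact that Pattern instances are immutable and safe to share between threads, whereas Matcher instances are not.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical wrapper class: compiles the search string once, then reuses it
public final class SubstringFinder {

    private final Pattern pattern;

    public SubstringFinder(String searchFor) {
        // Compilation (including any internal optimisation) happens only here
        this.pattern = Pattern.compile(searchFor, Pattern.LITERAL);
    }

    public List<Integer> findMatchPoints(CharSequence str) {
        // A new Matcher per call: Matcher holds per-search state
        return pattern.matcher(str)
                .results()
                .map(MatchResult::start)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SubstringFinder finder = new SubstringFinder("SPAN");
        for (String line : List.of("SPIN SPAN", "NO MATCH HERE", "SPANSPAN")) {
            System.out.println(finder.findMatchPoints(line));
        }
        // prints [5], then [], then [0, 4]
    }
}
```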
