Repetition operators (ctd):
greedy and reluctant operators

A problem that you'll come across sooner or later with repetition operators in regular expressions occurs when the expression contains several operators that could each match a variable number of characters. In such cases there is potential ambiguity as to which operator matches what.

We've actually already met such an example. Recall our expression to match a string containing ten digits, which we composed as follows:

.*[0-9]{10}.*

Now at this point, it may occur to you that .* can match "any sequence of any character". So if we have a string, say, aab0123456789, why doesn't the initial .* "swallow up" the entire string in one go, preventing a match?

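The answer comes in the form of two matching rules: the items in an expression are matched from left to right, and each repetition operator is by default "greedy", meaning that it matches as many characters as it can without preventing the rest of the expression from matching.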

So, let's look at what this means. Supposing we match the following string against the above expression:

aax11012345678922bbx

The string contains a total of 14 digits: the sequence 0123456789, with 11 and 22 on either side. So what happens when we come to match? Going from left to right, the first 'item' in the expression is .*. How many characters does it match? As many as it can without preventing the other parts from matching if they can. So the first .* matches up to the end of the digits, minus ten, the number that [0-9]{10} requires in order to match; in other words, it matches aax1101. Then, the latter item takes its ten digits, 2345678922 in the above string. Finally, the second .* matches the rest of the string, bbx.
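To see this division for yourself, the following is a minimal sketch rather than a definitive implementation. It assumes we wrap each of the three parts of the expression in parentheses so that we can ask afterwards what each part matched, and it uses the Pattern and Matcher classes to do so. The class name GreedyDemo and the method name split are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns what each of the three parts matched, or null if the string doesn't match.
    // (GreedyDemo and split are hypothetical names chosen for this sketch.)
    static String[] split(String input) {
        // Same expression as in the text, with parentheses added around each part
        Matcher m = Pattern.compile("(.*)([0-9]{10})(.*)").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("aax11012345678922bbx");
        // The greedy first part leaves only the last ten digits for the middle part
        System.out.println(String.join(" | ", parts));
        // prints: aax1101 | 2345678922 | bbx
    }
}
```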

So what if we wanted [0-9]{10} to match against the first ten digits instead? One way is to transform the "greedy" .* into a so-called reluctant operator.

Turning greed into reluctance

A reluctant operator matches against as few characters as it can, while still letting the rest of the expression match if it can. To make an operator reluctant, we add a question mark after the operator. So the following expression:

.*?[0-9]{10}.*

means that the first .* matches against as few characters as it can, whilst still allowing [0-9]{10} to match against ten digits, and still allowing the final .* to match against "any sequence". The smallest number of characters that .*? can match against whilst leaving ten subsequent digits is the sequence aax; then, the digits matched by the middle element are 1101234567, and the final .* takes the remaining 8922bbx.
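The effect of the question mark can be checked with a short sketch along the following lines. It is an illustration only: it assumes we add parentheses around each part of the expression so that we can read back what each part matched, and the class name ReluctantDemo and method name split are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns what each of the three parts matched, or null if the string doesn't match.
    // (ReluctantDemo and split are hypothetical names chosen for this sketch.)
    static String[] split(String input) {
        // The first part is now reluctant: note the ? after the *
        Matcher m = Pattern.compile("(.*?)([0-9]{10})(.*)").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("aax11012345678922bbx");
        // The reluctant first part stops as soon as ten digits can follow it
        System.out.println(String.join(" | ", parts));
        // prints: aax | 1101234567 | 8922bbx
    }
}
```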

Alternative to reluctant operators

A sometimes clearer alternative to using reluctant operators is to replace the dot with a more exact character class. For example, we could write the following:

[^0-9]*[0-9]{10}.*

Recall from our section on character classes that [^0-9] means "any character that isn't a digit". So we match any sequence of non-digits, followed by ten digits (in effect, the first ten in the string), followed by the rest of the string.
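As a rough sketch of how this version behaves, the example below again assumes parentheses added around each part so that we can read back what was matched; the class name CharClassDemo and method name split are made up for the illustration. It also tries a second, made-up string, a1b0123456789, to show that the character class version is not an exact substitute for the reluctant one: a stray digit before the run of ten stops [^0-9]* from reaching it, so the whole match fails.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns what each of the three parts matched, or null if the string doesn't match.
    // (CharClassDemo and split are hypothetical names chosen for this sketch.)
    static String[] split(String input) {
        // Leading part restricted to non-digits instead of "any character"
        Matcher m = Pattern.compile("([^0-9]*)([0-9]{10})(.*)").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("aax11012345678922bbx");
        System.out.println(String.join(" | ", parts));
        // prints: aax | 1101234567 | 8922bbx

        // The lone digit 1 is followed by b, so ten digits never directly
        // follow the leading non-digits and the expression cannot match
        System.out.println(split("a1b0123456789") == null);
        // prints: true
    }
}
```

If the strings you are matching can contain stray digits like this, the reluctant form may be the safer choice.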

Why do I need to know which part matches what?

Controlling which part of the expression matches which part of the string is important when you use a feature called capturing.

To understand capturing, we need to start by looking at how to use two explicit classes to control regular expression matching: the Pattern and Matcher classes.



Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.