Repetition operators (ctd): greedy and reluctant operators
A problem that you'll come across sooner or later with repetition operators in regular expressions arises when an expression contains several operators that can each match a variable number of characters. In such cases it is potentially ambiguous which operator matches which part of the string.
We've actually already met such an example. Recall our expression to match a string containing ten digits, which we composed as follows:
    .*[0-9]{10}.*
Now at this point, it may occur to you that .* can match "any sequence of any
character". So if we have a string, say, aab0123456789, why doesn't the initial
.* "swallow up" the entire string in one go, preventing a match?
The answer comes in the form of the following matching rules:
- operators match from left to right;
- repetition operators are greedy: they match as many characters as they can;
- however, a greedy operator gives characters back if holding on to them would prevent an overall match that is otherwise possible.
So, let's look at what this means. Suppose we match the above expression
against a string that starts with aax and then contains a total of 14 digits:
the sequence 0123456789, with 11 and 22 on either side.
So what happens when we come to match?
Well, going from left to right, the first 'item' in the expression is .*.
How many characters does it match? Well, as many as it can without preventing the
other parts from matching if they can. So the first .* matches up to the
end of the digits, minus ten, the number that [0-9]{10} requires in
order to match. Then, the latter item takes its ten digits, 2345678922
in the above string. Finally, the second .* can match the rest of the string.
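The walkthrough above can be checked with a minimal sketch. It assumes the expression is .*[0-9]{10}.* and the sample string is aax11012345678922, both reconstructed from the description in the text rather than copied from it. The parentheses (capturing groups) and the class and method names are additions for illustration only, so that we can see what each part matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Each of the three items is wrapped in parentheses purely so that
    // we can inspect which part of the input it matched.
    private static final Pattern GREEDY =
            Pattern.compile("(.*)([0-9]{10})(.*)");

    // Returns what the three items matched, or null if there is no match.
    static String[] groups(String input) {
        Matcher m = GREEDY.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Assumed sample string: "aax" followed by the 14 digits described above.
        String[] g = groups("aax11012345678922");
        System.out.println("first .*   : '" + g[0] + "'"); // 'aax1101'
        System.out.println("[0-9]{10}  : '" + g[1] + "'"); // '2345678922'
        System.out.println("second .*  : '" + g[2] + "'"); // ''
    }
}
```

With this assumed input, the greedy first item keeps everything except the last ten digits, so the second .* is left with an empty match.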
But what if we wanted [0-9]{10} to match the
first ten digits instead? One way is to transform the "greedy" .*
into a so-called reluctant operator.
Turning greed into reluctance
A reluctant operator matches against as few
characters as it can, while still letting the rest of the expression match if
it can. To make an operator reluctant, we add a question mark
after the operator. So the following expression:
    .*?[0-9]{10}.*
means that the initial .*? matches as few characters as it can,
whilst still allowing [0-9]{10} to match ten digits, and still
allowing the final .* to match "any sequence". The smallest number of
characters that .*? can match whilst leaving ten subsequent
digits is the sequence aax; the digits matched by the middle element
are then 1101234567.
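As a rough check of the reluctant behaviour described above, here is a small sketch. It assumes the expression .*?[0-9]{10}.* and the sample string aax11012345678922, both inferred from the surrounding description. The capturing groups and the names ReluctantDemo and groups are hypothetical, added only to expose what each item matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Same three items as before, but the first one is now reluctant (.*?).
    private static final Pattern RELUCTANT =
            Pattern.compile("(.*?)([0-9]{10})(.*)");

    // Returns what the three items matched, or null if there is no match.
    static String[] groups(String input) {
        Matcher m = RELUCTANT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Assumed sample string: "aax" followed by 14 digits.
        String[] g = groups("aax11012345678922");
        System.out.println("first .*?  : '" + g[0] + "'"); // 'aax'
        System.out.println("[0-9]{10}  : '" + g[1] + "'"); // '1101234567'
        System.out.println("second .*  : '" + g[2] + "'"); // '8922'
    }
}
```

Note that the final .* is still greedy: once the first two items have matched, it takes whatever is left.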
Alternative to reluctant operators
A sometimes clearer alternative to using reluctant operators
is to replace the dot with a more exact character class. For example, we could
write the following:
    [^0-9]*[0-9]{10}.*
Recall from our section on character
classes that [^0-9] means "any character that isn't a digit".
So we match any sequence of non-digits, followed by ten digits
(in effect, the first ten in the string), followed by the rest of the string.
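The character-class version can be sketched in the same way. This assumes the expression [^0-9]*[0-9]{10}.* and the sample string aax11012345678922, both inferred from the text. The capturing groups, the class name and the second test string a1b11012345678922 are hypothetical additions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // The first item can now only consume non-digits, so it cannot
    // "swallow" any of the digits we want the middle item to match.
    private static final Pattern NON_DIGITS_FIRST =
            Pattern.compile("([^0-9]*)([0-9]{10})(.*)");

    // Returns what the three items matched, or null if there is no match.
    static String[] groups(String input) {
        Matcher m = NON_DIGITS_FIRST.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] g = groups("aax11012345678922");
        System.out.println("[^0-9]*    : '" + g[0] + "'"); // 'aax'
        System.out.println("[0-9]{10}  : '" + g[1] + "'"); // '1101234567'
        System.out.println(".*         : '" + g[2] + "'"); // '8922'

        // A stray digit before the run of ten stops this version matching.
        System.out.println(groups("a1b11012345678922") == null); // true
    }
}
```

The last line shows that the two approaches are not interchangeable: the character-class version insists that the ten digits come immediately after the leading non-digits, whereas the reluctant .*? would skip over an earlier, shorter run of digits. Which behaviour is preferable depends on the input you expect.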
Why do I need to know which part matches what?
Controlling which part of the expression matches which part of the string
is important when you use a feature called capturing.
To understand capturing, we need to start by looking at how to
use two explicit classes to control regular expression matching: the
Pattern and Matcher classes.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.