Repetition operators (ctd):
greedy and reluctant operators

A problem that you'll come across sooner or later with repetition operators in regular expressions occurs when the expression has several operators, each of which could match a variable number of characters. In such cases it is potentially ambiguous which operator matches what.

We've actually already met such an example. Recall our expression to match a string containing ten consecutive digits, which we composed as follows:

.*[0-9]{10}.*

Now at this point, it may occur to you that .* can match "any sequence of any character". So if we have a string, say, aab0123456789, why doesn't the initial .* "swallow up" the entire string in one go, preventing a match?

The answer comes in the form of the following matching rules:

  • operators match from left to right;
  • repetition operators are greedy by default: they match as many characters as they can;
  • but operators are not allowed to prevent a match if one is possible.
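As a quick check of the third rule, the following minimal sketch confirms that the leading .* does not prevent a match on aab0123456789. The class and method names (TenDigitCheck, containsTenDigits) are made up for illustration; only String.matches() is part of the standard library.

```java
// Hypothetical demo class: the names are illustrative, not from the tutorial.
class TenDigitCheck {
    // The expression from the text: anything, then ten digits, then anything.
    static final String TEN_DIGITS = ".*[0-9]{10}.*";

    static boolean containsTenDigits(String s) {
        // String.matches() succeeds only if the WHOLE string matches the expression
        return s.matches(TEN_DIGITS);
    }

    public static void main(String[] args) {
        // The greedy .* gives characters back so that [0-9]{10} can match
        System.out.println(containsTenDigits("aab0123456789")); // true
        // Only nine digits here, so no match is possible
        System.out.println(containsTenDigits("aab012345678"));  // false
    }
}
```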

So, let's look at what this means. Suppose we match the following string against the above expression:

aax11012345678922bbx

The string contains a total of 14 digits: the sequence 0123456789, with 11 and 22 on either side. So what happens when we come to match? Going from left to right, the first 'item' in the expression is .*. How many characters does it match? As many as it can without preventing the other parts from matching. So the first .* matches up to the end of the run of digits minus ten, the number of digits that [0-9]{10} requires in order to match: in other words, it matches aax1101. The middle item then takes its ten digits, 2345678922 in the above string. Finally, the second .* matches the rest of the string, bbx.
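To see this division for yourself, one option is to wrap each part of the expression in brackets and read the parts back with the Pattern and Matcher classes. The sketch below assumes that approach; the class name GreedySplit and its split() method are hypothetical helpers, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows which part of the greedy expression matches what.
class GreedySplit {
    // Same expression as in the text, with brackets added around each part
    static final Pattern GREEDY = Pattern.compile("(.*)([0-9]{10})(.*)");

    // Returns the three matched parts, or null if the whole string does not match
    static String[] split(String s) {
        Matcher m = GREEDY.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("aax11012345678922bbx");
        System.out.println(parts[0]); // aax1101    (greedy .* takes all it can)
        System.out.println(parts[1]); // 2345678922 (the LAST ten digits)
        System.out.println(parts[2]); // bbx
    }
}
```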

So suppose we wanted [0-9]{10} to match the first ten digits instead? One way is to turn the "greedy" .* into a so-called reluctant operator.

Turning greed into reluctance

A reluctant operator matches against as few characters as it can, while still letting the rest of the expression match if it can. To make an operator reluctant, we add a question mark after the operator. So the following expression:

.*?[0-9]{10}.*

means that the first .*? matches as few characters as it can, whilst still allowing [0-9]{10} to match ten digits, and still allowing the final .* to match "any sequence". The smallest number of characters that .*? can match whilst leaving ten subsequent digits is the sequence aax; the digits matched by the middle element are then 1101234567.
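The effect of the question mark can be checked with a small sketch along the following lines. It assumes bracketed parts read back through Pattern and Matcher; the class name ReluctantSplit and its split() method are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows which part of the reluctant expression matches what.
class ReluctantSplit {
    // The first operator is now reluctant (.*?); brackets added around each part
    static final Pattern RELUCTANT = Pattern.compile("(.*?)([0-9]{10})(.*)");

    // Returns the three matched parts, or null if the whole string does not match
    static String[] split(String s) {
        Matcher m = RELUCTANT.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("aax11012345678922bbx");
        System.out.println(parts[0]); // aax        (as few characters as possible)
        System.out.println(parts[1]); // 1101234567 (the FIRST ten digits)
        System.out.println(parts[2]); // 8922bbx
    }
}
```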

Alternative to reluctant operators

In some cases, a clearer alternative to reluctant operators is to replace the dot with a more exact character class. For example, we could write the following:

[^0-9]*[0-9]{10}.*

Recall from our section on character classes that [^0-9] means "any character that isn't a digit". So we match any sequence of non-digits, followed by ten digits (in effect, the first ten in the string), followed by the rest of the string.
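The sketch below compares the character-class version with the reluctant version. On our example string the two pick out the same ten digits, but they are not interchangeable in general: if a shorter run of digits appears before the ten-digit run, [^0-9]* cannot skip over it, so the character-class version fails to match at all. The class name TenDigitFinder and its middle() helper are hypothetical, and the second test string is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two ways of matching the first ten digits.
class TenDigitFinder {
    static final Pattern RELUCTANT  = Pattern.compile("(.*?)([0-9]{10})(.*)");
    static final Pattern NON_DIGITS = Pattern.compile("([^0-9]*)([0-9]{10})(.*)");

    // Returns the ten digits matched by the middle part, or null if there is no match
    static String middle(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String s = "aax11012345678922bbx";
        System.out.println(middle(RELUCTANT, s));  // 1101234567
        System.out.println(middle(NON_DIGITS, s)); // 1101234567

        // A stray digit before the ten-digit run: the two versions now differ
        String t = "a1b0123456789";
        System.out.println(middle(RELUCTANT, t));  // 0123456789
        System.out.println(middle(NON_DIGITS, t)); // null
    }
}
```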

Why do I need to know which part matches what?

Controlling which part of the expression matches which part of the string is important when you use a feature called capturing.

To understand capturing, we need to start by looking at how to use two explicit classes to control regular expression matching: the Pattern and Matcher classes.

Written by Neil Coffey. Copyright © Javamex UK 2012. All rights reserved.