Regular expression example: IP location
This example shows how we can use a regular expression to guess the country code
of the host requesting a web page by looking at the referrer string.
The referrer string is essentially the URL which the user
clicked on in order to reach a given web page. Rightly or wrongly,
most browsers pass this information on to the web server with every page request.
Here are some examples of referrer strings:
http://es.search.yahoo.com/search?p=country+music
http://uk.search.yahoo.com/search?p=jacques+chirac
http://www.google.co.in/search?hl=en&q=java+programming
http://www.google.com.au/search?hl=en&q=sidney+shopping
http://www.google.bg/search?hl=bg&q=red+wine
As you can see, where a site has been reached via a search engine, we can look
at which search engine the user was using as a clue to their location.
Of course, this isn't perfect: there's nothing to stop a user from Spain from
using a Bulgarian search engine or vice versa. But it turns out that many users
are surprisingly patriotic about which search engine they use. If you are running
a site that is primarily reached via search engines and you don't want to go to
the hassle of installing a database of IP addresses to country codes, looking at
the referrer string is a reasonable compromise. So let's see how we'd
construct some regular expressions to pull out the country code from strings
such as the above.
The Yahoo format referrer string
In these typical examples, Yahoo's format is a little simpler than Google's.
For our purposes, we're really only interested in the two letters before
search.yahoo.com, which appears to be fixed. So here is a possible expression:
Pattern p = Pattern.compile("http://" +
    "([a-z]{2})" +                      // group 1: two-letter country code
    "\\.search\\.yahoo\\.com/.*");
In all the examples here, the URL is prefixed with http://. But we could
be more flexible by making this part optional. Remember that to do so, we need
to wrap it in a non-capturing group, written (?:...), and then suffix that group
with a ? quantifier:
Pattern p = Pattern.compile("(?:http://)?" +
    "([a-z]{2})" +                      // still group 1: (?:...) does not capture
    "\\.search\\.yahoo\\.com/.*");
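To see what the optional prefix changes in practice, here is a minimal sketch that runs both versions of the pattern against a referrer with and without the http:// prefix. The class name YahooPrefixDemo, the constant names and the prefix-less test string are assumptions made for illustration; only the two patterns come from the text above.

```java
import java.util.regex.Pattern;

public class YahooPrefixDemo {
    // Strict version: the referrer must start with http://
    static final Pattern STRICT = Pattern.compile("http://" +
        "([a-z]{2})" +
        "\\.search\\.yahoo\\.com/.*");

    // Lenient version: the http:// prefix may be absent
    static final Pattern LENIENT = Pattern.compile("(?:http://)?" +
        "([a-z]{2})" +
        "\\.search\\.yahoo\\.com/.*");

    public static void main(String[] args) {
        String withPrefix = "http://es.search.yahoo.com/search?p=country+music";
        // Hypothetical input: the same referrer with the prefix stripped
        String withoutPrefix = "es.search.yahoo.com/search?p=country+music";

        // matches() requires the whole string to match the pattern
        System.out.println(STRICT.matcher(withPrefix).matches());     // true
        System.out.println(STRICT.matcher(withoutPrefix).matches());  // false
        System.out.println(LENIENT.matcher(withPrefix).matches());    // true
        System.out.println(LENIENT.matcher(withoutPrefix).matches()); // true
    }
}
```

Because the prefix group is non-capturing, the country code remains group 1 in both patterns.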
Either way, the two-character country code will be captured as group 1. Note that
in this case we aren't interested in the search parameters or indeed anything that
occurs after the domain name. We match as far as the end of the domain name and
the slash (.com/) to be sure that this really is a referral from a Yahoo
search engine. But then we simply end the expression in .* to match
any subpath and/or parameters of the referrer path.
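Putting this together, the following sketch shows one way the country code might be pulled out of group 1. The class name YahooCountryCode and the helper method countryCode are hypothetical names chosen for this example, and returning null for a non-matching referrer is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YahooCountryCode {
    // Compile once and reuse: Pattern objects are immutable and thread-safe
    private static final Pattern YAHOO = Pattern.compile("(?:http://)?" +
        "([a-z]{2})" +
        "\\.search\\.yahoo\\.com/.*");

    /**
     * Returns the two-letter code captured as group 1, or null if the
     * referrer does not look like a Yahoo search referral.
     */
    static String countryCode(String referrer) {
        Matcher m = YAHOO.matcher(referrer);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(countryCode(
            "http://es.search.yahoo.com/search?p=country+music"));  // es
        System.out.println(countryCode(
            "http://uk.search.yahoo.com/search?p=jacques+chirac")); // uk
        System.out.println(countryCode(
            "http://www.google.bg/search?hl=bg&q=red+wine"));       // null
    }
}
```

Note that group(1) may only be called after matches() has returned true; calling it on a failed match throws an IllegalStateException.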
We'll see on the next page that in the case of parsing
the Google referrer string, we may also want to pull out one of the parameters.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.