Regular expression example: IP location

On the previous page, we showed how a regular expression can be used to extract the country code from the referrer string, looking at the simplest case of a Yahoo referrer string.

Parsing the Google referrer string

Recall that the Google referrer strings look as follows:

http://www.google.com/search?hl=fr&q=dictionary+french
http://www.google.co.in/search?hl=en&q=java+programming
http://www.google.com.au/search?hl=en&q=sidney+shopping
http://www.google.bg/search?hl=bg&q=red+wine

As you can see:

  • in many cases these referrer strings contain a language code as a parameter which we could additionally use;
  • even if the country-neutral google.com domain is used, a language code can still be specified;
  • in country-specific domains, before the two-letter country code suffix of the domain, there can be an additional suffix (.com or .co).

We propose treating these referrers as two types of case:

  • if the domain is google.com, we'll look for a language code (the hl parameter) and use it as a clue to location;
  • otherwise, we'll extract the two-letter country code from the end of the domain.
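
As a rough illustration, the two cases above could be sketched in Java along the following lines. This is only a minimal sketch under stated assumptions, not necessarily the code developed on the next page: the class name GoogleReferrerSketch, the method guessCode and the three patterns are hypothetical names invented for this example, and the sketch assumes the referrer begins with http:// or https:// and that the hl value starts with a two-letter code.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; all names here are assumptions.
class GoogleReferrerSketch {

    // Case 1: the country-neutral domain google.com, immediately followed by the path.
    private static final Pattern NEUTRAL_DOMAIN = Pattern.compile(
            "^https?://(?:www\\.)?google\\.com/", Pattern.CASE_INSENSITIVE);

    // Case 2: "google", an optional ".com" or ".co", then a two-letter country code,
    // e.g. google.com.au, google.co.in, google.bg.
    private static final Pattern COUNTRY_DOMAIN = Pattern.compile(
            "^https?://(?:www\\.)?google(?:\\.com?)?\\.([a-z]{2})/", Pattern.CASE_INSENSITIVE);

    // The hl parameter: capture its first two letters, provided no further letter follows.
    // A locational variant such as fr-CA is deliberately truncated to "fr" here.
    private static final Pattern HL_PARAM = Pattern.compile(
            "[?&]hl=([a-z]{2})(?![a-z])", Pattern.CASE_INSENSITIVE);

    // Returns the two-letter code used as a location clue, or empty if none is found.
    static Optional<String> guessCode(String referrer) {
        if (NEUTRAL_DOMAIN.matcher(referrer).find()) {
            Matcher hl = HL_PARAM.matcher(referrer);
            return hl.find()
                    ? Optional.of(hl.group(1).toLowerCase(Locale.ROOT))
                    : Optional.empty();
        }
        Matcher country = COUNTRY_DOMAIN.matcher(referrer);
        return country.find()
                ? Optional.of(country.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    public static void main(String[] args) {
        String[] referrers = {
            "http://www.google.com/search?hl=fr&q=dictionary+french",
            "http://www.google.co.in/search?hl=en&q=java+programming",
            "http://www.google.com.au/search?hl=en&q=sidney+shopping",
            "http://www.google.bg/search?hl=bg&q=red+wine"
        };
        for (String r : referrers) {
            // Prints fr, in, au and bg respectively for the four example URLs.
            System.out.println(guessCode(r).orElse("(unknown)"));
        }
    }
}
```

Note that the neutral-domain check is made first, so that a google.com referrer is never mistaken for a country-specific one; for the country-specific domains, the hl parameter is simply ignored in this sketch.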

Of course, this isn't perfect. For example, many Spanish speakers living in the southern states of the US may well use google.com with the language configured as es; with our simplistic method, we'd mistakenly say they were in Spain. In the first URL above, we would say the user is in France on the basis of the language code fr, but they could just as easily be in Canada. And ultimately, there's nothing to stop a user in Spanish-speaking Peru from using the Australian site google.com.au and configuring their language to be Italian.

We are also, slightly erroneously, pretending that country codes and language codes are the same thing. In some cases they differ, and in some cases a language can be specified with a locational variant (e.g. fr-CA for Canadian French), which would be a better clue. We'll ignore these issues here: in many cases, the simplistic method we outline turns out to be a reasonable first approximation.

On the next page, we consider these two types of case in turn: google.com with a language code, and a country-specific Google domain.

Written by Neil Coffey. Copyright © Javamex UK 2012. All rights reserved.