Regex Group Search

Regular expressions are powerful and useful for many things, e.g. searching for group patterns on web pages. www.hitta.se is a popular Swedish web site for looking up phone numbers in Sweden. Called with the right parameters, it returns the search results as one long list, which makes it a good example site for demonstrating group search in regular expressions.

In this example you first need to declare a simple class called Person. This class will contain all information for each person gathered from the web site.

public class Person
{
  public string Name { get; set; }
  public string Address { get; set; }
  public string PostalAddress { get; set; }
  public string FixedPhone { get; set; }
  public string MobilePhone { get; set; }
}

You also need to declare usage of the following namespaces.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

Further down on this page you’ll see the complete method code. The code is somewhat simplified; you should always include exception handling, especially when making requests over the Internet. The first lines make a WebRequest to www.hitta.se and retrieve a string of HTML data. Search my other posts on Internet downloads for more detailed information on that area.
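As a rough idea of what that exception handling could look like, here is a minimal sketch. The class name Downloader, the helper TryDownload and the 10 second timeout are my own assumptions for illustration; they are not part of the complete code further down.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class Downloader
{
    // Hypothetical helper: returns the page as a string, or null if the request fails.
    public static string TryDownload(string url)
    {
        try
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.UserAgent = "My Client";
            req.Timeout = 10000; // milliseconds; an arbitrary value for this sketch

            // The using blocks close the response and the reader even if reading fails
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            using (StreamReader sr = new StreamReader(resp.GetResponseStream(), Encoding.GetEncoding("iso-8859-1")))
            {
                return sr.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // Covers DNS failures, timeouts and HTTP error status codes
            Console.Error.WriteLine("Request failed: " + ex.Status + " " + ex.Message);
            return null;
        }
        catch (UriFormatException ex)
        {
            // Thrown by WebRequest.Create when the string is not a valid URI
            Console.Error.WriteLine("Invalid URL: " + ex.Message);
            return null;
        }
    }

    public static void Main(string[] args)
    {
        string html = TryDownload(args.Length > 0 ? args[0] : "http://www.hitta.se/");
        Console.WriteLine(html == null ? "No data" : html.Length + " characters downloaded");
    }
}
```

Returning null keeps the sketch short; in a real application you would probably want to tell the caller why the request failed.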

If you’re new to regular expressions, I suggest you read through this Regular Expressions Overview first. Here we’ll go through the expressions only briefly. The pattern is built up one group at a time, starting at the comment "Create search pattern for a single group" in the complete code. Each line is almost identical, so we’ll only look at the first one. The code assumes that every group is present in the HTML for each person, even if it contains no value.

string regexp = "LabelFirstName\">(?<FirstName>(.*?))</span>.*?";

The pattern can be divided into the following sub parts:

  • LabelFirstName\"> Pattern preceding the group variable.
  • (?<FirstName>(.*?)) Creates a group variable named FirstName matching the pattern (.*?). The dot matches any character, including newline (since the Singleline option is used), and the * character makes zero or more repetitions of it a valid match. The final question mark is important since it makes the match non-greedy. The pattern following the variable occurs many times in the page, and the question mark makes the group stop at the first occurrence instead of the last.
  • </span>.*? Pattern following the variable. It ends with a pattern that matches all characters up to the next variable (declared on the following rows in the complete code). This pattern is non-greedy as well.
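To see what the non-greedy question mark does, here is a small sketch that runs the same kind of pattern with and without it. The HTML fragment is made up for illustration and is not real output from the site; NamedGroupDemo is just a name I picked.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Made-up HTML fragment for illustration, not real output from the site
    public const string Sample =
        "<span id=\"LabelFirstName\">Eric</span>\n" +
        "<span id=\"LabelLastName\">Smith</span>";

    // Returns whatever the FirstName group captured for the given pattern
    public static string FirstName(string pattern)
    {
        Match m = Regex.Match(Sample, pattern, RegexOptions.Singleline);
        return m.Groups["FirstName"].Value;
    }

    public static void Main()
    {
        // Non-greedy: the group stops at the first </span>
        Console.WriteLine(FirstName("LabelFirstName\">(?<FirstName>.*?)</span>"));

        // Greedy: the group runs on to the last </span> in the input
        Console.WriteLine(FirstName("LabelFirstName\">(?<FirstName>.*)</span>"));
    }
}
```

The first call prints Eric. The second prints everything from Eric up to and including Smith, with the markup in between, which is exactly what we don’t want.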

When defining the Regex object we also set some Regex options. The most important RegexOption is Singleline, which lets the dot match every character including newline. By default the dot stops at a newline, so a pattern like ours can never stretch past the end of a line. Each of our groups is spread across several rows, so without Singleline the search wouldn’t find any results at all.

Regex r = new Regex(regexp, RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
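The effect of Singleline is easy to check on a small input. The sketch below uses a made-up fragment where the value is split across two lines; the names SinglelineDemo, Sample and Pattern are mine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Made-up fragment where the value is split across two lines
    public const string Sample = "<span id=\"LabelTitle\">Senior\nEngineer</span>";
    public const string Pattern = "LabelTitle\">(?<Title>.*?)</span>";

    // True if the pattern matches the sample with the given options
    public static bool Matches(RegexOptions options)
    {
        return Regex.IsMatch(Sample, Pattern, options);
    }

    public static void Main()
    {
        // Without Singleline the dot cannot cross the newline, so nothing matches
        Console.WriteLine(Matches(RegexOptions.None));       // False

        // With Singleline the dot also matches the newline
        Console.WriteLine(Matches(RegexOptions.Singleline)); // True
    }
}
```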

Then it’s just a matter of looping through the matches found in the HTML code. The group variables defined earlier can be read by calling Result() with a replacement pattern, as in the code snippet below.

for (Match m = r.Match(data); m.Success; m = m.NextMatch())
{
  // ...
  p.FixedPhone = m.Result("${FixedAreaCode}${FixedNetNumber}").Trim();
  // ...
}
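Result() is not the only way to read a group. The sketch below, on a made-up input string, compares it with the Groups indexer and shows what happens when a group name is misspelled. ResultDemo and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResultDemo
{
    // Made-up input; the group names Area and Number exist only in this sketch
    static readonly Match M = Regex.Match(
        "area=08 number=123456",
        @"area=(?<Area>\d+) number=(?<Number>\d+)");

    // Combine groups with a replacement pattern
    public static string ViaResult()
    {
        return M.Result("${Area}-${Number}");
    }

    // Same value, read group by group
    public static string ViaGroups()
    {
        return M.Groups["Area"].Value + "-" + M.Groups["Number"].Value;
    }

    // A name that is not a group in the pattern is left in the output as literal text
    public static string UnknownGroup()
    {
        return M.Result("${Missing}");
    }

    public static void Main()
    {
        Console.WriteLine(ViaResult());    // 08-123456
        Console.WriteLine(ViaGroups());    // 08-123456
        Console.WriteLine(UnknownGroup()); // ${Missing}
    }
}
```

Result() is handy for combining several groups in one call, but a misspelled group name ends up silently in your data, so it’s worth checking the names against the pattern.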

To call the function, pass the same values you would have typed into the search fields on the web site.

List<Person> list = FindPerson("Eric Smith", "Stockholm");

And finally the complete code…

public static List<Person> FindPerson(string what, string where)
{
  List<Person> list = new List<Person>();

  // Make webrequest
  HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.hitta.se/SearchWhite.aspx?vad=" + what + "&var=" + where + "&Rows=100");
  req.UserAgent = "My Client";

  // Put response in a string. StreamReader's default encoding, UTF-8, won't work here
  string data;
  using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
  using (StreamReader sr = new StreamReader(resp.GetResponseStream(), Encoding.GetEncoding("iso-8859-1")))
  {
    data = sr.ReadToEnd();
  }

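  // Create search pattern for a single group
  string regexp = "LabelFirstName\">(?<FirstName>(.*?))</span>.*?";
  regexp += "LabelMiddleName\">(?<MiddleName>(.*?))</span>.*?";
  regexp += "LabelLastName\">(?<LastName>(.*?))</span>.*?";
  regexp += "LabelTitle\">(?<Title>(.*?))</span>.*?";
  regexp += "LabelFixedAreaCode\">(Telefon: )?(?<FixedAreaCode>(.*?))</span>.*?";
  regexp += "LabelFixedNetNumber\">(?<FixedNetNumber>(.*?))</span>()?.*?";
  regexp += "LabelMobileAreaCode\">(Mobil: )?(?<MobileAreaCode>(.*?))</span>.*?";
  regexp += "LabelMobileNetNumber\">(?<MobileNetNumber>(.*?))</span>.*?";
  regexp += "LabelStreetName\">(?<StreetName>(.*?))</span>.*?";
  regexp += "LabelStreetNumber\">(?<StreetNumber>(.*?))</span>.*?";
  regexp += "LabelStreetSuffix\">(?<StreetSuffix>(.*?))</span>.*?";
  regexp += "LabelZipCode\">(?<ZipCode>(.*?))</span>.*?";
  // The municipality label is captured as City, the name used in PostalAddress below
  regexp += "LabelMunicipality\">(?<City>(.*?))</span>.*?";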

  // Singleline option is important
  Regex r = new Regex(regexp, RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
  for (Match m = r.Match(data); m.Success; m = m.NextMatch())
  {
    Person p = new Person();
    // Combine groups
    // Empty groups leave double spaces behind, so collapse them
    p.Name = m.Result("${FirstName} ${MiddleName} ${LastName}${Title}").Replace("  ", " ").Trim();
    p.Address = m.Result("${StreetName} ${StreetNumber} ${StreetSuffix}").Replace("  ", " ").Trim();
    p.PostalAddress = m.Result("${ZipCode} ${City}").Trim();
    p.FixedPhone = m.Result("${FixedAreaCode}${FixedNetNumber}").Trim();
    p.MobilePhone = m.Result("${MobileAreaCode}${MobileNetNumber}").Trim();
    list.Add(p);
  }
  return list;
}
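
The assumption that every group is present, even when empty, can be checked offline without calling the web site. The sketch below uses a cut-down pattern with only two of the groups and two made-up HTML fragments; PresenceDemo and Find are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PresenceDemo
{
    // Cut-down version of the search pattern with only two groups
    const string Pattern =
        "LabelFirstName\">(?<FirstName>(.*?))</span>.*?" +
        "LabelLastName\">(?<LastName>(.*?))</span>";

    public static Match Find(string html)
    {
        return Regex.Match(html, Pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Both labels present, the first one empty
        string withEmpty =
            "<span id=\"LabelFirstName\"></span>\n" +
            "<span id=\"LabelLastName\">Smith</span>";

        // First label missing entirely
        string missing = "<span id=\"LabelLastName\">Smith</span>";

        Match m = Find(withEmpty);
        Console.WriteLine(m.Success);                  // True
        Console.WriteLine(m.Groups["LastName"].Value); // Smith

        Console.WriteLine(Find(missing).Success);      // False
    }
}
```

An empty label still gives a match with an empty group value, while a missing label makes the whole match fail. If the site ever leaves out a label for some persons, those persons would silently disappear from the result list.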