Code Focused

Extracting Data with Regular Expressions

Regular expressions give Tim Patrick the creeps, but he overcame his fears by discovering that specially crafted regex patterns can access data in a way that's actually kind of cool.

I've always been secretly afraid of regular expressions. I spent many of my earliest programming years working with tenacious development tools in a mainframe-hosted Unix environment. I even used to write AWK scripts. (Pause for gasps of horror.) So, regular expressions should be more of the same. And yet they give me the creeps. That level of fear kept me from learning, until recently, that you can simultaneously validate and extract named data elements from text strings using specially crafted regex patterns. And it turns out that my fears were unfounded, because accessing data in this way is actually kind of cool.

Consider the mundane example of obtaining a 10-digit North American phone number from a user, and parsing out the three-digit area code, the three-digit exchange office and the four-digit subscriber number components. The components can be stored in a specially crafted class:

private class PhoneParts
{
  public string AreaCode; // First 3 digits
  public string Exchange; // Next 3 digits
  public string Number;   // Last 4 digits
}

Once you've obtained some text that the user claims to be a phone number, you can throw away anything that isn’t a digit, confirm that there are 10 digits in all, and locate the components by grabbing substrings of the right lengths from fixed positions:

// ----- This only works on 10-digit numbers.
string cleanText = DigitsOnly(originalText);
if (cleanText.Length == 10)
{
  // ----- Extract out the parts by physical position.
  PhoneParts foundPhone = new PhoneParts()
  {
    AreaCode = cleanText.Substring(0, 3),
    Exchange = cleanText.Substring(3, 3),
    Number = cleanText.Substring(6, 4)
  };
}

The DigitsOnly helper method returns just the digits found in an original text string:

string DigitsOnly(string originalText)
{
  // ----- Return just the digits from a source string.
  string result = "";
  if (originalText == null)
    return result;
  for (int position = 0; position < originalText.Length; position++)
    if (char.IsDigit(originalText[position]))
      result += originalText[position];
  return result;
}

That all works fine. But I had to add a helper method (DigitsOnly) and perform validation using magic numbers (the positions and lengths) to get the code working. It turns out that you can delegate a lot of that work to a well-crafted regular expression:

const string PhonePattern =
  @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

The PhonePattern expression looks for the 10 digits of the phone number, granting forgiveness to the user for an optional space and an optional hyphen between the numeric components and allowing parentheses around the area code. (Each parenthesis is optional on its own, so the pattern does not insist that they come as a balanced pair.) The pattern also rejects content that has a zero or a one as the first digit of the area code or the exchange, because those are forbidden in North American phone numbers.
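To see what the pattern accepts and rejects, here is a minimal sketch that runs it against a handful of sample strings. The PhonePatternDemo class, the IsPhone method and the sample numbers are my own illustrative assumptions and not part of the original code; only the pattern itself is taken from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhonePatternDemo
{
    // ----- Same pattern as shown above.
    public const string PhonePattern =
        @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

    // ----- Hypothetical helper: true when the text matches the pattern.
    public static bool IsPhone(string text)
    {
        // Regex.IsMatch throws on null input, so guard against it first.
        return text != null && Regex.IsMatch(text, PhonePattern);
    }

    public static void Main()
    {
        string[] samples =
        {
            "(425) 555-0123",  // True:  parentheses, space and hyphen
            "425-555-0123",    // True:  hyphens only
            "4255550123",      // True:  bare digits
            "(425 555-0123",   // True:  unbalanced parenthesis still passes
            "125-555-0123",    // False: area code starts with 1
            "425-155-0123",    // False: exchange starts with 1
            "425-555-012",     // False: only nine digits
            "425.555.0123"     // False: dots are not an allowed separator
        };

        foreach (string sample in samples)
            Console.WriteLine($"{sample,-16} {IsPhone(sample)}");
    }
}
```

The unbalanced-parenthesis case is worth noticing: if that kind of input matters to you, the pattern would need tightening before use.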

The magical parts of the expression are the parenthesized capturing groups that include group names in angle brackets:

// ----- A capturing group named "AreaCode", matching three digits.
(?<AreaCode>\d{3})

If the incoming text matches the pattern, the text matched by each capturing group is saved within the results, tied to the indicated group name. Those matching portions can be accessed by name from the Match object that comes back:

// ----- Assumes: using System.Text.RegularExpressions;
Regex processor = new Regex(PhonePattern);
Match results = processor.Match(originalText);
if (results.Success)
{
  // ----- Named groups are ready to use.
}

The Regex.Match method analyzes and validates the incoming text, isolates the capturing groups and stores the matching text for those groups alongside the supplied names. To access those groups, use the Groups collection, referencing group elements by name:

PhoneParts foundPhone = new PhoneParts()
{
  AreaCode = results.Groups["AreaCode"].Value,
  Exchange = results.Groups["Exchange"].Value,
  Number = results.Groups["Number"].Value
};
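Putting the pieces together, the following is one possible sketch of the whole regex-based flow in a single compilable file. The PhoneParser class and its TryParse method are hypothetical names I've introduced for illustration, PhoneParts is made public here only so the file stands on its own, and reusing one static Regex instance is my own choice rather than something the snippets above require.

```csharp
using System;
using System.Text.RegularExpressions;

public class PhoneParts
{
    public string AreaCode; // First 3 digits
    public string Exchange; // Next 3 digits
    public string Number;   // Last 4 digits
}

public static class PhoneParser
{
    private const string PhonePattern =
        @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

    // ----- Built once and reused for every call (an assumption, not a requirement).
    private static readonly Regex Processor = new Regex(PhonePattern);

    // ----- Hypothetical helper: validates and extracts in one step.
    public static bool TryParse(string originalText, out PhoneParts foundPhone)
    {
        foundPhone = null;
        if (originalText == null)
            return false; // Regex.Match throws ArgumentNullException on null.

        Match results = Processor.Match(originalText);
        if (!results.Success)
            return false;

        // ----- Named groups are ready to use.
        foundPhone = new PhoneParts()
        {
            AreaCode = results.Groups["AreaCode"].Value,
            Exchange = results.Groups["Exchange"].Value,
            Number = results.Groups["Number"].Value
        };
        return true;
    }

    public static void Main()
    {
        PhoneParts phone;
        if (PhoneParser.TryParse("(425) 555-0123", out phone))
            Console.WriteLine($"{phone.AreaCode} / {phone.Exchange} / {phone.Number}");
        else
            Console.WriteLine("Not a valid phone number.");
    }
}
```

Unlike the substring version, there is no separate cleanup pass and no length check here; a failed match is the only error path.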

The updated code reduces dependencies on supporting logic and hardcoded numeric values. Perhaps regular expression parsing is slower than string manipulation; I didn't run any performance analysis on the code samples presented here. But even if the regex-based code is a tad slower, the ability to swap in a different pattern, stored as a resource string for each locale, is much better than developing fresh logic for every international phone number format. And that type of forward thinking is nothing to be afraid of.
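As a rough illustration of that last point, the sketch below reads whatever named groups a pattern happens to define, so the calling code doesn't change when the pattern does. The NamedGroupReader class and its Extract method are hypothetical, and the second pattern in Main is a made-up toy format used purely for demonstration; it does not describe any real country's numbering plan. In a real application the pattern strings would presumably come from resources rather than being inlined.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupReader
{
    // ----- Hypothetical helper: returns name/value pairs for every
    //       named group, or null when the text doesn't match.
    public static Dictionary<string, string> Extract(string pattern, string text)
    {
        Regex processor = new Regex(pattern);
        Match results = processor.Match(text ?? "");
        if (!results.Success)
            return null;

        Dictionary<string, string> parts = new Dictionary<string, string>();
        foreach (string name in processor.GetGroupNames())
        {
            // ----- Skip numbered groups such as "0" (the whole match).
            int ignored;
            if (int.TryParse(name, out ignored))
                continue;
            parts[name] = results.Groups[name].Value;
        }
        return parts;
    }

    public static void Main()
    {
        // ----- The North American pattern from this article.
        string northAmerica =
            @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

        // ----- A made-up toy format, for demonstration only.
        string toyFormat = @"^(?<Prefix>\d{4})-(?<Line>\d{4})$";

        Show(Extract(northAmerica, "425-555-0123"));
        Show(Extract(toyFormat, "1234-5678"));
    }

    private static void Show(Dictionary<string, string> parts)
    {
        if (parts == null)
        {
            Console.WriteLine("No match.");
            return;
        }
        foreach (KeyValuePair<string, string> part in parts)
            Console.WriteLine($"{part.Key} = {part.Value}");
    }
}
```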

About the Author

Tim Patrick has spent more than thirty years as a software architect and developer. His two most recent books on .NET development -- Start-to-Finish Visual C# 2015, and Start-to-Finish Visual Basic 2015 -- are available from http://owanipress.com. He blogs regularly at http://wellreadman.com.
