Code Focused

Extracting Data with Regular Expressions

Regular expressions give Tim Patrick the creeps, but he overcame his fears by discovering specially crafted regex patterns can access data in a way that’s actually kind of cool.

I've always been secretly afraid of regular expressions. I spent many of my earliest programming years working with tenacious development tools in a mainframe-hosted Unix environment. I even used to write AWK scripts. (Pause for gasps of horror.) So, regular expressions should be more of the same. And yet they give me the creeps. That level of fear kept me from learning, until recently, that you can simultaneously validate and extract named data elements from text strings using specially crafted regex patterns. And it turns out that my fears were unfounded, because accessing data in this way is actually kind of cool.

Consider the mundane example of obtaining a 10-digit North American phone number from a user, and parsing out the three-digit area code, the three-digit exchange office and the four-digit subscriber number components. The components can be stored in a specially crafted class:

private class PhoneParts
{
  public string AreaCode; // First 3 digits
  public string Exchange; // Next 3 digits
  public string Number;   // Last 4 digits
}

Once you've obtained some text that the user claims to be a phone number, you can throw away anything that isn’t a digit, confirm that there are 10 digits in all, and locate the components by grabbing substrings of the right lengths from fixed positions:

// ----- This only works on 10-digit numbers.
string cleanText = DigitsOnly(originalText);
if (cleanText.Length == 10)
{
  // ----- Extract out the parts by physical position.
  PhoneParts foundPhone = new PhoneParts()
  {
    AreaCode = cleanText.Substring(0, 3),
    Exchange = cleanText.Substring(3, 3),
    Number = cleanText.Substring(6, 4)
  };
}

The DigitsOnly helper method returns just the digits found in an original text string:

string DigitsOnly(string originalText)
{
  // ----- Return just the digits from a source string.
  string result = "";
  if (originalText == null)
    return result;
  for (int position = 0; position < originalText.Length; position++)
    if (char.IsDigit(originalText[position]))
      result += originalText[position];
  return result;
}

That all works fine. But I had to add a helper method (DigitsOnly) and perform validation using magic numbers (the positions and lengths) to get the code working. It turns out that you can delegate a lot of that work to a well-crafted regular expression:

const string PhonePattern =
  @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

The PhonePattern expression looks for the 10 digits of the phone number, granting forgiveness to the user for any hyphens or spaces between the numeric components and allowing a set of parentheses to surround the area code. The pattern also rejects content that includes a zero or one digit at the start of an area code or exchange, because those are forbidden in North American phone numbers.

The magical parts of the expression are the parenthesized capturing groups that include group names in angle brackets:

// ----- A capturing group named "AreaCode," matching three digits.
(?<AreaCode>\d{3})

If the incoming text matches the pattern, each capturing group is saved within the results, tied to the indicated group name. Those matching portions can be accessed by name from the results set:

// ----- Assumes: using System.Text.RegularExpressions;
Regex processor = new Regex(PhonePattern);
Match results = processor.Match(originalText);
if (results.Success)
{
  // ----- Named groups are ready to use.
}

The Regex.Match method analyzes and validates the incoming text, isolates the capturing groups and stores the matching text for those groups alongside the supplied names. To access those groups, use the Groups collection, referencing group elements by name:

PhoneParts foundPhone = new PhoneParts()
{
  AreaCode = results.Groups["AreaCode"].Value,
  Exchange = results.Groups["Exchange"].Value,
  Number = results.Groups["Number"].Value
};

The updated code reduces dependencies on supporting logic and hardcoded numeric values. Perhaps regular expression parsing is slower than string manipulation; I didn't run any performance analysis on the code samples presented here. But even if the regex-based code is a tad slower, the ability to switch out a resource string in a globalized manner is much better than developing fresh logic for every international phone number platform. And that type of forward thinking is nothing to be afraid of.

About the Author

Tim Patrick has spent more than thirty years as a software architect and developer. His two most recent books on .NET development -- Start-to-Finish Visual C# 2015, and Start-to-Finish Visual Basic 2015 -- are available from http://owanipress.com. He blogs regularly at http://wellreadman.com.

comments powered by Disqus

Featured

  • Hands On: New VS Code Insiders Build Creates Web Page from Image in Seconds

    New Vision support with GitHub Copilot in the latest Visual Studio Code Insiders build takes a user-supplied mockup image and creates a web page from it in seconds, handling all the HTML and CSS.

  • Naive Bayes Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the naive Bayes regression technique, where the goal is to predict a single numeric value. Compared to other machine learning regression techniques, naive Bayes regression is usually less accurate, but is simple, easy to implement and customize, works on both large and small datasets, is highly interpretable, and doesn't require tuning any hyperparameters.

  • VS Code Copilot Previews New GPT-4o AI Code Completion Model

    The 4o upgrade includes additional training on more than 275,000 high-quality public repositories in over 30 popular programming languages, said Microsoft-owned GitHub, which created the original "AI pair programmer" years ago.

  • Microsoft's Rust Embrace Continues with Azure SDK Beta

    "Rust's strong type system and ownership model help prevent common programming errors such as null pointer dereferencing and buffer overflows, leading to more secure and stable code."

  • Xcode IDE from Microsoft Archrival Apple Gets Copilot AI

    Just after expanding the reach of its Copilot AI coding assistant to the open-source Eclipse IDE, Microsoft showcased how it's going even further, providing details about a preview version for the Xcode IDE from archrival Apple.

Subscribe on YouTube

Upcoming Training Events