Code Focused

Extracting Data with Regular Expressions

Regular expressions give Tim Patrick the creeps, but he overcame his fears by discovering that specially crafted regex patterns can access data in a way that’s actually kind of cool.

I've always been secretly afraid of regular expressions. I spent many of my earliest programming years working with tenacious development tools in a mainframe-hosted Unix environment. I even used to write AWK scripts. (Pause for gasps of horror.) So, regular expressions should be more of the same. And yet they give me the creeps. That level of fear kept me from learning, until recently, that you can simultaneously validate and extract named data elements from text strings using specially crafted regex patterns. And it turns out that my fears were unfounded, because accessing data in this way is actually kind of cool.

Consider the mundane example of obtaining a 10-digit North American phone number from a user, and parsing out the three-digit area code, the three-digit exchange office and the four-digit subscriber number components. The components can be stored in a specially crafted class:

private class PhoneParts
{
  public string AreaCode; // First 3 digits
  public string Exchange; // Next 3 digits
  public string Number;   // Last 4 digits
}

Once you've obtained some text that the user claims to be a phone number, you can throw away anything that isn’t a digit, confirm that there are 10 digits in all, and locate the components by grabbing substrings of the right lengths from fixed positions:

// ----- This only works on 10-digit numbers.
string cleanText = DigitsOnly(originalText);
if (cleanText.Length == 10)
{
  // ----- Extract out the parts by physical position.
  PhoneParts foundPhone = new PhoneParts()
  {
    AreaCode = cleanText.Substring(0, 3),
    Exchange = cleanText.Substring(3, 3),
    Number = cleanText.Substring(6, 4)
  };
}

The DigitsOnly helper method returns just the digits found in an original text string:

// ----- Assumes: using System.Text;
string DigitsOnly(string originalText)
{
  // ----- Return just the digits from a source string.
  //       char.IsDigit accepts any Unicode decimal digit, not only 0-9.
  if (originalText == null)
    return "";
  StringBuilder result = new StringBuilder();
  foreach (char oneChar in originalText)
    if (char.IsDigit(oneChar))
      result.Append(oneChar);
  return result.ToString();
}

That all works fine. But I had to add a helper method (DigitsOnly) and perform validation using magic numbers (the positions and lengths) to get the code working. It turns out that you can delegate a lot of that work to a well-crafted regular expression:

const string PhonePattern =
  @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

The PhonePattern expression looks for the 10 digits of the phone number, forgiving the user for an optional space, an optional hyphen, or both (space first) between the numeric components, and allowing parentheses around the area code. Each parenthesis is optional on its own, so the pattern does not insist that they be balanced. The negative lookaheads, (?!0|1), reject content that has a zero or a one at the start of an area code or exchange, because those digits are forbidden in those positions in North American phone numbers.
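To get a feel for what the pattern accepts and rejects, it can help to run a few sample strings through Regex.IsMatch. The following is only a minimal sketch: the PhonePatternDemo class, the LooksLikePhone method and the sample phone numbers are illustrative assumptions, and only the pattern itself comes from the listing above.

```csharp
using System.Text.RegularExpressions;

// ----- Hypothetical demo class; only PhonePattern comes from the article.
public static class PhonePatternDemo
{
  public const string PhonePattern =
    @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

  // ----- True when the text fits the phone number pattern.
  public static bool LooksLikePhone(string text)
  {
    return text != null && Regex.IsMatch(text, PhonePattern);
  }
}

// ----- Sample calls and their results:
//   LooksLikePhone("(206) 555-1234")  -> true
//   LooksLikePhone("206-555-1234")    -> true
//   LooksLikePhone("2065551234")      -> true
//   LooksLikePhone("(206 555-1234")   -> true  (unbalanced parenthesis)
//   LooksLikePhone("106-555-1234")    -> false (area code starts with 1)
//   LooksLikePhone("206-055-1234")    -> false (exchange starts with 0)
//   LooksLikePhone("206.555.1234")    -> false (periods are not allowed)
```

Unlike the DigitsOnly approach, which discards every non-digit, the pattern is strict about which separators may appear and where.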

The magical parts of the expression are the parenthesized capturing groups that include group names in angle brackets:

// ----- A capturing group named "AreaCode," matching three digits.
(?<AreaCode>\d{3})

If the incoming text matches the pattern, the text matched by each capturing group is saved within the results, tied to the indicated group name. Those matching portions can be accessed by name from the results:

// ----- Assumes: using System.Text.RegularExpressions;
Regex processor = new Regex(PhonePattern);
Match results = processor.Match(originalText);
if (results.Success)
{
  // ----- Named groups are ready to use.
}

The Regex.Match method analyzes and validates the incoming text, isolates the capturing groups and stores the matching text for those groups alongside the supplied names. To access those groups, use the Groups collection, referencing group elements by name:

PhoneParts foundPhone = new PhoneParts()
{
  AreaCode = results.Groups["AreaCode"].Value,
  Exchange = results.Groups["Exchange"].Value,
  Number = results.Groups["Number"].Value
};
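Pulled together, the regex-based pieces might look something like the following sketch. The RegexPhoneParser class and its TryParse method are hypothetical names that do not appear in the original listings, and PhoneParts is declared public here only so that the sample compiles on its own.

```csharp
using System.Text.RegularExpressions;

public class PhoneParts
{
  public string AreaCode; // First 3 digits
  public string Exchange; // Next 3 digits
  public string Number;   // Last 4 digits
}

// ----- Hypothetical wrapper; the name is not from the original article.
public static class RegexPhoneParser
{
  private const string PhonePattern =
    @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

  // ----- Build the Regex once and reuse it for every call.
  private static readonly Regex Processor = new Regex(PhonePattern);

  // ----- Returns true and fills foundPhone when the text matches.
  public static bool TryParse(string originalText, out PhoneParts foundPhone)
  {
    foundPhone = null;

    // ----- Regex.Match throws ArgumentNullException on null input.
    if (originalText == null)
      return false;

    Match results = Processor.Match(originalText);
    if (!results.Success)
      return false;

    // ----- Validation passed, so the named groups are ready to use.
    foundPhone = new PhoneParts()
    {
      AreaCode = results.Groups["AreaCode"].Value,
      Exchange = results.Groups["Exchange"].Value,
      Number = results.Groups["Number"].Value
    };
    return true;
  }
}
```

A call such as RegexPhoneParser.TryParse("(206) 555-1234", out phone) should return true with the three parts set to "206", "555" and "1234", while text that fails the pattern returns false and leaves phone set to null.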

The updated code reduces dependencies on supporting logic and hardcoded positions. Regular expression parsing may well be slower than string manipulation; I didn't run any performance analysis on the code samples presented here. But even if the regex-based code is a tad slower, the pattern is just a string, so it can be stored as a resource and swapped out for each locale. That is much better than developing fresh logic for every international phone number format. And that type of forward thinking is nothing to be afraid of.
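One way to picture that swap is a parser that accepts the pattern as data and hands back whatever named groups the pattern defines. This is only a sketch under stated assumptions: the PatternDrivenParser class and its Extract method are hypothetical, and HypotheticalPattern is a deliberately simplified made-up format used purely for illustration, not any real country's numbering plan.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// ----- Hypothetical helper; names are illustrative, not from the article.
public static class PatternDrivenParser
{
  // ----- The North American pattern from the article.
  public const string NorthAmericanPattern =
    @"^\(?(?!0|1)(?<AreaCode>\d{3})\)?\s?-?(?!0|1)(?<Exchange>\d{3})\s?-?(?<Number>\d{4})$";

  // ----- ASSUMPTION: an invented format (2 digits, optional space,
  //       8 digits) that stands in for "some other locale's" pattern.
  public const string HypotheticalPattern =
    @"^(?<Prefix>\d{2})\s?(?<Rest>\d{8})$";

  // ----- Returns the named groups for a match, or null when the
  //       text does not fit the supplied pattern.
  public static Dictionary<string, string> Extract(string pattern, string text)
  {
    Regex processor = new Regex(pattern);
    Match results = processor.Match(text);
    if (!results.Success)
      return null;

    Dictionary<string, string> parts = new Dictionary<string, string>();
    foreach (string groupName in processor.GetGroupNames())
    {
      // ----- Skip numbered groups, such as "0" for the whole match.
      int ignored;
      if (int.TryParse(groupName, out ignored))
        continue;
      parts[groupName] = results.Groups[groupName].Value;
    }
    return parts;
  }
}
```

With this arrangement, Extract(PatternDrivenParser.NorthAmericanPattern, "206-555-1234") should produce the AreaCode, Exchange and Number entries, and passing a different pattern string changes both the validation rules and the set of extracted names without touching the parsing code.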

About the Author

Tim Patrick has spent more than thirty years as a software architect and developer. His two most recent books on .NET development -- Start-to-Finish Visual C# 2015, and Start-to-Finish Visual Basic 2015 -- are available from http://owanipress.com. He blogs regularly at http://wellreadman.com.
