Code Focused

Regular Expressions: Not as Tricky as They May Seem

Regular expressions are like power tools: They may look scary, but are easy to use once you understand their basic building blocks.

Regular expressions -- those scary strings that might as well be written in Klingon to the average person -- can be a vast time-saver. They help in one of the most common tasks of programming: string manipulation. The .NET Framework has an excellent, built-in regular expressions engine that's relatively straightforward to use.

The use and understanding of regular expressions (hereafter referred to as RegEx) traditionally comes in three parts: testing a string to see whether a pattern exists within it (pattern matching); reading strings and extracting useful information; and making changes to those strings.

Pattern Matching
The first thing you'll likely want to use a RegEx for is to do pattern matching, the simplest example of which is determining whether a particular set of characters exists within a string. In fact, if you've ever used the "Find" feature in any text editor, you've effectively used a regular expression composed only of the most basic pieces: literals. In a RegEx pattern, a literal is a character that must be matched exactly. Searching for the pattern "BCD" within the string "ABCDE" would result in a match, for example. In this case, the search pattern would read as, "The literal B, immediately followed by the literal C, immediately followed by the literal D."

As beneficial as that is, there are certainly simpler ways to perform a plain search without regular expressions. The power of RegEx doesn't become apparent until you start using elements like character classes: sets of characters, any one of which is accepted as a match. Character classes are denoted by the square brackets surrounding them. Take the character class [0123456789], for example, which instructs the RegEx engine to match any single digit from 0 through 9. Alternatively, you could use [0-9], which has the same meaning to the engine but is more readable to a human. RegEx additionally has some helpful shortcuts for commonly used classes like this one. [0-9] can be further shortened to \d, the shortcut for "Match any numeric character." (In .NET, \d by default also matches decimal digits from other Unicode scripts; specify RegexOptions.ECMAScript if you need it to mean exactly [0-9].)

The Regex Object
Using RegEx in your code requires the Regex class, located within the System.Text.RegularExpressions namespace. For a usage example, see Listing 1 below. The code in this example reads a string as input from the console, then shows a response depending on whether the string contains a numeric character anywhere within it (using \d as the pattern). You can experiment with different patterns, such as \d\d, which will match only if two digits are next to each other somewhere in the string (e.g., "12ab" will match, but "1a2" will not).

Listing 1: Test To See If the String Entered Contains a Number
Imports System
Imports System.Text.RegularExpressions
Module Module1
  Sub Main()
    ' \d matches a single numeric character.
    Dim pattern As String = "\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop on Q, or when there is no more input (ReadLine returns Nothing).
      If input Is Nothing OrElse input.ToUpper() = "Q" Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a number")
      Else
        Console.WriteLine("String does NOT contain a number")
      End If
    Loop
  End Sub
End Module
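As a non-interactive illustration of the \d\d experiment suggested above, the following sketch checks the two sample strings directly. The module and function names are hypothetical and chosen only for this example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module AdjacentDigitsDemo
  ' Hypothetical helper: True when two digits sit side by side anywhere in the input.
  Function HasAdjacentDigits(input As String) As Boolean
    Return Regex.IsMatch(input, "\d\d")
  End Function

  Sub Main()
    Console.WriteLine(HasAdjacentDigits("12ab")) ' True: "12" are adjacent digits
    Console.WriteLine(HasAdjacentDigits("1a2"))  ' False: the digits are separated by "a"
  End Sub
End Module
```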

Wildcards
In RegEx, wildcards are really just shortcuts to character classes. Just as \d is a shortcut for [0-9], there are other shortcuts, too. See Table 1 below for a list of the most commonly used character classes. Note that the period is special in that it matches any character except the line break (\n) by default.

Table 1: A List of Character Class Shortcuts
Shortcut   Class              Description
\d         [0-9]              Numeric character
\D         [^0-9]             NOT a numeric character
\w         [a-zA-Z0-9_]       "Word" character
\W         [^a-zA-Z0-9_]      NOT a "word" character
\s         [ \f\n\r\t\v]      Whitespace character
\S         [^ \f\n\r\t\v]     NOT a whitespace character
.          [^\n]              Any character except line break
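To make a few of these shortcuts concrete, here is a minimal sketch that tests some short ASCII strings against \s, \W and \D. The helper names are hypothetical, not part of the .NET API.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ShortcutClassDemo
  ' Hypothetical helpers, each wrapping Regex.IsMatch with one shortcut class.
  Function HasWhitespace(input As String) As Boolean
    Return Regex.IsMatch(input, "\s")
  End Function

  Function HasNonWordCharacter(input As String) As Boolean
    Return Regex.IsMatch(input, "\W")
  End Function

  Function HasNonDigit(input As String) As Boolean
    Return Regex.IsMatch(input, "\D")
  End Function

  Sub Main()
    Console.WriteLine(HasWhitespace("hello world"))  ' True: contains a space
    Console.WriteLine(HasWhitespace("helloworld"))   ' False: no whitespace
    Console.WriteLine(HasNonWordCharacter("a_1"))    ' False: letter, underscore and digit are all "word" characters
    Console.WriteLine(HasNonWordCharacter("a-1"))    ' True: the hyphen is not a "word" character
    Console.WriteLine(HasNonDigit("123"))            ' False: every character is a digit
    Console.WriteLine(HasNonDigit("12a"))            ' True: "a" is not a digit
  End Sub
End Module
```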

Validation
One common use of the Regex.IsMatch() method is in validation. For example, if you wanted to validate that a phone number was entered, a simple approach to that would be a pattern of "\d\d\d-\d\d\d\d" -- in this case, the hyphen (-) acts as a literal hyphen and nothing else, so the pattern reads as, "Three digits, followed directly by a hyphen, followed directly by four more digits." Of course, this requires the hyphen to exist; if it's omitted from the input, there won't be a match. To solve this, you can add a question mark after the hyphen, as "\d\d\d-?\d\d\d\d" shows. The question mark acts as a quantifier in this case, meaning "Match zero or one" of the preceding character, the hyphen.
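The effect of the optional hyphen can be sketched as follows. The function name HasLocalNumber is a hypothetical choice for this example; the pattern is the one from the paragraph above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module OptionalHyphenDemo
  ' Hypothetical helper: three digits, an optional hyphen, then four digits,
  ' found anywhere in the input.
  Function HasLocalNumber(input As String) As Boolean
    Return Regex.IsMatch(input, "\d\d\d-?\d\d\d\d")
  End Function

  Sub Main()
    Console.WriteLine(HasLocalNumber("555-1234")) ' True: hyphen present
    Console.WriteLine(HasLocalNumber("5551234"))  ' True: hyphen omitted
    Console.WriteLine(HasLocalNumber("555 1234")) ' False: a space is not a hyphen
  End Sub
End Module
```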

Quantifiers
There are a few other quantifiers aside from the question mark. The most commonly used is the asterisk, which means, "Match zero or more, as many as possible." In the preceding example, an asterisk could have been used in place of the question mark like this: "\d\d\d-*\d\d\d\d"; this would result in a match regardless of how many hyphens were entered. For example, "123----5555" would result in a match, as would "1235555".

Another quantifier almost the same as the asterisk is the plus sign, which means, "Match as many as possible, with a minimum of one." If you were to use the plus instead of the asterisk, like this:

"\d\d\d-+\d\d\d\d"

a string of "123-5555" would result in a match, but "1235555" would not.
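The difference between the asterisk and the plus sign can be seen by running both patterns over the same inputs. This is a small sketch; the two function names are hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module StarVersusPlusDemo
  ' Asterisk: zero or more hyphens between the digit groups.
  Function MatchesWithStar(input As String) As Boolean
    Return Regex.IsMatch(input, "\d\d\d-*\d\d\d\d")
  End Function

  ' Plus sign: one or more hyphens between the digit groups.
  Function MatchesWithPlus(input As String) As Boolean
    Return Regex.IsMatch(input, "\d\d\d-+\d\d\d\d")
  End Function

  Sub Main()
    Console.WriteLine(MatchesWithStar("1235555"))     ' True: zero hyphens is acceptable
    Console.WriteLine(MatchesWithStar("123----5555")) ' True: any number of hyphens
    Console.WriteLine(MatchesWithPlus("1235555"))     ' False: at least one hyphen is required
    Console.WriteLine(MatchesWithPlus("123-5555"))    ' True: exactly one hyphen
    Console.WriteLine(MatchesWithPlus("123----5555")) ' True: several hyphens
  End Sub
End Module
```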

There's also a quantifier that allows you to be more precise about how many times something must be repeated for it to trigger a match. You specify the count in curly brackets, like this: "\d{3}-\d{4}". This reads as, "Match exactly three digits, followed by one hyphen, followed by exactly four digits." An alternative usage allows a range to be specified instead. An example would be "\d{3,5}", a pattern matching anywhere from three to five digits. Note that this pattern will also match if the input is "123456", because the RegEx engine only needs to find three to five digits in a row somewhere in the input; here it matches "12345" and ignores the trailing "6".
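The sketch below exercises both curly-bracket forms. It also uses Regex.Match, which this article has not covered, purely to show which part of "123456" the range pattern picks up; the helper names are hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CountedQuantifierDemo
  ' Hypothetical helper: exactly three digits, a hyphen, exactly four digits.
  Function HasLocalNumber(input As String) As Boolean
    Return Regex.IsMatch(input, "\d{3}-\d{4}")
  End Function

  ' Hypothetical helper: returns the text matched by \d{3,5},
  ' or an empty string when nothing matches.
  Function FirstDigitRun(input As String) As String
    Return Regex.Match(input, "\d{3,5}").Value
  End Function

  Sub Main()
    Console.WriteLine(HasLocalNumber("123-5555")) ' True
    Console.WriteLine(HasLocalNumber("12-5555"))  ' False: only two digits before the hyphen
    Console.WriteLine(FirstDigitRun("123456"))    ' 12345: the first five digits are matched
    Console.WriteLine(FirstDigitRun("ab1234"))    ' 1234
  End Sub
End Module
```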

Grouping
So far, each quantifier has acted on the single character or shortcut immediately before it. To repeat a sequence of more than one character, wrap the sequence in parentheses to create a group; a quantifier placed after the closing parenthesis then applies to the whole group. Back to the phone number example, try using the pattern "(\d\d\d-){1,2}\d\d\d\d" to allow for area codes to be entered. With this pattern, both "123-5555" and "444-123-5555" would match.

In Listing 2 below, I bring it all together and create a somewhat more robust phone validation pattern.

Listing 2: A More Robust Phone Validation Pattern
Imports System
Imports System.Text.RegularExpressions
Module Module1
  Sub Main()
    ' One or two groups of "three digits and a hyphen", then four digits.
    Dim pattern As String = "(\d\d\d-){1,2}\d\d\d\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop on Q, or when there is no more input (ReadLine returns Nothing).
      If input Is Nothing OrElse input.ToUpper() = "Q" Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a valid phone number")
      Else
        Console.WriteLine("String does NOT contain a valid phone number")
      End If
    Loop
  End Sub
End Module
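The grouped pattern can also be written more compactly by combining it with the curly-bracket quantifiers described earlier. The following sketch assumes that rewriting, "(\d{3}-){1,2}\d{4}", and checks it against fixed inputs rather than console input; the function name is hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GroupingDemo
  ' Hypothetical helper: one or two "three digits and a hyphen" groups,
  ' followed by four digits, found anywhere in the input.
  Function HasPhoneNumber(input As String) As Boolean
    Return Regex.IsMatch(input, "(\d{3}-){1,2}\d{4}")
  End Function

  Sub Main()
    Console.WriteLine(HasPhoneNumber("123-5555"))     ' True: one group
    Console.WriteLine(HasPhoneNumber("444-123-5555")) ' True: two groups (area code included)
    Console.WriteLine(HasPhoneNumber("1235555"))      ' False: each group requires its hyphen
    Console.WriteLine(HasPhoneNumber("5555"))         ' False: at least one group is required
  End Sub
End Module
```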

Getting Started
Regular expressions are a deep topic, and this article only scratches the surface. While I've only shown their use in validation thus far, they're also quite handy in many scenarios involving parsing or string manipulation. With a solid understanding, regular expressions will become an invaluable tool to have in your tool belt. Check back next month for the next article in this series, in which I go deeper into the internals of regular expressions and how you can get even more out of them.

About the Author

Ondrej Balas owns UseTech Design, a Michigan development company focused on .NET and Microsoft technologies. Ondrej is a Microsoft MVP in Visual Studio and Development Technologies and an active contributor to the Michigan software development community. He works across many industries -- finance, healthcare, manufacturing, and logistics -- and has expertise with large data sets, algorithm design, distributed architecture, and software development practices.
