Working with Parser Combinators

Parser combinators are put to work in a real-life scenario as custom configurations are designed for neuro-optical scientific experiments in which optical tissue is stimulated and the results are recorded.

I have a client that's neck-deep in the neuro-optical scientific world and asked me to work with him on a new project designed to make conducting experiments on optical tissue easier. Specifically, I'm working on a software system to control a microscope rig that will drive various stimulus devices (LEDs, lights and so on) that will trigger responses from the optical tissue, and then capture the results measured by hardware watching the optical tissue.

If it all sounds vaguely Matrix-y to you, you're not entirely alone. When I first heard about this project, my reaction was simultaneously, "Oh, wow, that's cool!" and, "Oh, wait, I just threw up in my mouth a little."

At any rate, one of the key things about the rig is that it will have a fairly complex configuration associated with each experiment run, and that led us to contemplate how to specify that configuration. On the one hand, it seemed an obvious fit for an XML file. However, the people running the rig aren't going to be computer programmers, but rather scientists and lab assistants, so it seemed a little heavy-handed to expect them to write well-formed XML files (and get it right at every turn). The thought of producing some kind of GUI-based configuration system struck us as highly over-engineered, particularly as it would quickly turn into discussions of how best to capture open-ended kinds of data.

In the end, it seemed more appropriate to give them a custom configuration format, which meant tons of parsing text on my part. (To some, this would imply that I'm building a DSL; this is a debate best left to philosophers and others involved in the serious task of alcohol consumption.) Fortunately, solutions abound in this space.

Thoughts
A parser serves two interesting and useful purposes: converting text into some other, more meaningful form, and verifying/validating that the text follows a certain structure (which typically is part of helping to convert it into a more meaningful form). So, for example, a phone number, which is at its heart just a sequence of numbers, still has a structure to it that requires verification. That format varies from continent to continent, but the numbers are still numbers. In fact, a phone number is a great example of a case where the "more meaningful form" isn't an integer value -- the digits aren't an integer value; they're a symbolic representation that's usually better represented as a domain type. (Treating them as "just" a number makes it difficult to extract the country code or area code, for example.)

If a phone number is made up of digits, and so are numbers (salaries, employee IDs and so on), then there's going to be some duplication in code where we parse and verify digits, unless we somehow extend a parser. This implies, then, that we'd like whatever parser we build to be open-ended, allowing somebody using the parser/library to extend it in different ways (Canadian postal codes, for example) without having to modify the source itself. This is known as the "open-closed principle": Software entities should be open for extension, but closed for modification.

Solution: Generative Metaprogramming
One solution is the traditional "lex/yacc" approach, known more formally as a "parser generator." This entails specifying the syntax for the configuration file in an abstract format -- usually some variation on the Backus-Naur form (BNF) notation used to describe formal grammars, such as those of most programming languages -- then running a tool to generate code that picks the string input apart and yields some kind of tree of structures or objects as a result. Generally, this process is split into two steps, "lexing" and "parsing," in which the lexer first transforms the string input into tokens, validating that the characters do in fact form legitimate tokens along the way. Then the parser takes the tokens and validates that they appear in the appropriate order and contain the appropriate values, and so on, usually transforming the tokens into some kind of abstract tree structure for further analysis.

The problems with parser generators are the same as for any generative metaprogramming approach: The generated code will need to be regenerated whenever the syntax changes. But more important for this kind of scenario, the code will be computer-generated, with all the wonderful variable naming that comes with computer-generated code (anybody ready to stand up for variables such as "integer431" and "string$$x$y$z"?), and thus difficult to debug.

Solution: Functional
In a certain kind of light, parsing is fundamentally functional: It takes input, performs some kind of operation on it and produces output as a result. The critical insight, it turns out, is that a parser can be created out of lots of little parsers, each of which parses a tiny bit of the string input, then returns a token and another function to parse the next little bit of string input. These techniques, which I believe were introduced in Haskell, are formally known as parser combinators, and they turn out to be an elegant solution to "midsize" parsing problems -- parsers that aren't necessarily as complex as a programming language would require, but something beyond what String.Split (or a hacked-up series of regex scans) can do.
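
To make that idea concrete, here's a minimal conceptual sketch in C# -- toy types of my own invention, not Sprache's or FParsec's actual definitions -- of what "a parser is just a function" looks like:

// Toy types for illustration only -- not any real library's API.
// A parse either succeeds with a value plus the remaining input, or fails.
public class ParseResult<T>
{
  public bool Success;
  public T Value;
  public string Remainder;
}

// A parser is nothing more than a function from input text to a ParseResult.
public delegate ParseResult<T> TinyParser<T>(string input);

public static class TinyParsers
{
  // A primitive parser: consume a single digit character.
  public static TinyParser<char> Digit = input =>
    (input.Length > 0 && char.IsDigit(input[0]))
      ? new ParseResult<char> { Success = true, Value = input[0], Remainder = input.Substring(1) }
      : new ParseResult<char> { Success = false, Remainder = input };
}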

In the case of parser combinators, the open-for-extension requirement is achieved by creating small functions, then using functional techniques to "combine" them into larger functions (which is where we get the name "combinators"). Larger parsers can be composed by anyone with sufficient skill to understand function composition. This technique is a general one that bears exploration, but I'll save that for a future column.
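
As a small taste of what that combining looks like -- again using the toy types from the sketch above, not a real library -- a hypothetical sequencing combinator might be written like this:

// Illustration only: a sequencing combinator over the toy TinyParser type.
// Run "first"; if it succeeds, use its value to choose the next parser and
// run that against whatever input remains.
public static class TinyCombinators
{
  public static TinyParser<U> Then<T, U>(this TinyParser<T> first, Func<T, TinyParser<U>> next)
  {
    return input =>
    {
      var r1 = first(input);
      if (!r1.Success)
        return new ParseResult<U> { Success = false, Remainder = input };
      return next(r1.Value)(r1.Remainder);
    };
  }
}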

As it turns out, there are several parser combinator libraries available for the Microsoft .NET Framework, many of them based on the Parsec module written in Haskell, which more or less set the standard for parser combinator libraries. Two such libraries are FParsec, written for F#, and Sprache, written for C#. Each is open source and relatively well-documented, such that each serves a dual purpose: useful out of the box, and a model from which to study design ideas. I'll also leave FParsec for a future column.

"Sprache Sie Parsing?"
Sprache, available at code.google.com/p/sprache, describes itself as a "simple, lightweight library for constructing parsers directly in C# code," which "doesn't compete with 'industrial strength' language workbenches. It fits somewhere in between regular expressions and a full-featured toolset such as ANTLR." (ANTLR is a parser generator, fitting into the Generative Metaprogramming category, like lex/yacc.)

Getting started with Sprache is straightforward: Download the code, build the project, then copy the Sprache.dll assembly into your project's dependency directory and add the reference to the project. From here, all the parser definition work is done by declaring Parser<T> instances (defined in the Sprache namespace) and combining them in particular ways into larger Parser<T> instances, which in turn may, if desired (and it usually is), return domain objects containing some or all of the parsed values.

Sprache Simple
To begin, let's start with a parser that knows how to parse user-entered phone numbers into a PhoneNumber domain type. For simplicity, I'll stick with the U.S.-style format -- (nnn) nnn-nnnn -- but we want to specifically recognize the breakdown in area codes, prefix and line, and allow for letters in the place of digits (so somebody can enter their phone number as "(800) EAT-NUTS" if they desire). Ideally, the PhoneNumber domain type will convert between alpha and all-numeric forms on demand, but that functionality will be left as an exercise to the reader (meaning, essentially, that I don't want to bother with it).

(The pedant in me demands to point out that simply converting all the alphas to numbers isn't a fully compliant solution, by the way. In college, it was common in my circle of friends to try to come up with phone numbers that spelled out "cool" things -- one ex-roommate is still waiting for 1-800-CTHULHU to become free, in fact, so he can win the game for all eternity.)

The easiest place to start is with the PhoneNumber domain type:

class PhoneNumber
{
  public string AreaCode { get; set; }
  public string Prefix { get; set; }
  public string Line { get; set; }
}

Were this a "real" domain type, AreaCode, Prefix and Line would have validation code in their property-set methods, but that would lead to a repetition of code between the parser and the domain class (which, by the way, we'll fix before this is all done).

Next, we need to create a simple parser that knows how to parse a run of one or more digits:

public static Parser<string> numberParser =
  Parse.Digit.AtLeastOnce().Text();

Defining the numberParser is straightforward. Begin with the primitive parser Digit (an instance of a Parser<T> defined on the Sprache.Parse class), and describe that we want at least one digit in the input stream, implicitly consuming all digits until the input stream either runs dry or the parser encounters a non-digit character. The Text method converts the stream of parsed results into a single string for our consumption.

Testing this is pretty easy -- feed it a string and let 'er rip:

[TestMethod]
public void ParseANumber()
{
  string result = numberParser.Parse("101");
  Assert.AreEqual("101", result);
}

[TestMethod]
public void FailToParseANumberBecauseItHasTextInIt()
{
  string result = numberParser.TryParse("abc").ToString();
  Assert.IsTrue(result.StartsWith("Parsing failure"));
}

When run, this stores "101" into result. If the Parse method is fed an input string of "abc," it will yield an exception. (If nonthrowing behavior is preferred, Sprache also has a TryParse method that returns a Result object that can be interrogated regarding success or failure.)
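
If you go the nonthrowing route, the result object can be inspected directly; something along these lines does the trick (the property names -- WasSuccessful, Message, Value -- reflect the Sprache build I'm working with):

// Inspecting the TryParse result instead of catching an exception. The
// property names here (WasSuccessful, Message, Value) reflect the Sprache
// build I'm working with.
var result = numberParser.TryParse("abc");
if (!result.WasSuccessful)
  Console.WriteLine(result.Message);   // A human-readable description of the failure
else
  Console.WriteLine("Parsed: " + result.Value);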

The phone-number parsing situation is a little bit more complicated, though; it needs to parse exactly three digits (or exactly four, for the line number) -- no more, no less. Defining one such parser (the three-digit parser) is a bit trickier, but still doable:

public static Parser<string> threeNumberParser =
  Parse.Numeric.Then(first =>
    Parse.Numeric.Then(second =>
      Parse.Numeric.Then(third =>
        Parse.Return(first.ToString() +
          second.ToString() + third.ToString()))));

The Numeric parser consumes a character if it's a digit and advances to the next character. The Then method takes a function (in the form of a lambda expression) to execute against the value just parsed. The innermost lambda concatenates the three parsed characters into a single string, and the Return method, as its name implies, wraps that string up as the parser's result (see Listing 1).
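
As a quick check along the same lines as the earlier tests, threeNumberParser can be exercised like so:

// A quick check of threeNumberParser, in the same style as the earlier tests.
[TestMethod]
public void ParseExactlyThreeNumbers()
{
  string result = threeNumberParser.Parse("123");
  Assert.AreEqual("123", result);
}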

Success. So far. (Yes, the definition of threeNumberParser is awkward -- surely there has to be a better way to define this! Fear not: there is, but to understand how to extend the parser, we have to dive deeper into how Sprache is constructed, and that's the subject of the next part in this series.)

Now, however, we need to handle the left paren, the right paren and the dash, and convert everything into a PhoneNumber object. It might seem a bit awkward given what we've seen so far, but watch what happens next, as shown in Listing 2.
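
To give a feel for the shape of that wiring, here's a rough sketch using Sprache's LINQ query support -- assuming a fourNumberParser built the same way as threeNumberParser, and leaving out the letters-for-digits support described earlier; the actual listing may differ in its details:

// A rough sketch only -- not Listing 2 verbatim. Assumes fourNumberParser is
// defined along the same lines as threeNumberParser, and omits the
// letters-for-digits support for brevity.
public static Parser<PhoneNumber> phoneParser =
  from lparen in Parse.Char('(')
  from areaCode in threeNumberParser
  from rparen in Parse.Char(')')
  from space in Parse.WhiteSpace.Many()
  from prefix in threeNumberParser
  from dash in Parse.Char('-')
  from line in fourNumberParser
  select new PhoneNumber { AreaCode = areaCode, Prefix = prefix, Line = line };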

Using the parser becomes pretty straightforward at this point:

[TestMethod]
public void ParseAFullPhoneNumberWithSomeWhitespace()
{
  var result = phoneParser.Parse("(425) 647-4526");
  Assert.AreEqual("425", result.AreaCode);
  Assert.AreEqual("647", result.Prefix);
  Assert.AreEqual("4526", result.Line);
}

Best of all, the parser is entirely extensible, because it, too, can be composed into a larger parser that transforms text input into an Address object or ContactInfo object or anything else imaginable.
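
For instance, a hypothetical contact format -- say, a name, a colon and a phone number, entirely made up for illustration -- could reuse phoneParser as just another building block:

// Purely illustrative: a made-up "Name: (nnn) nnn-nnnn" format that reuses
// phoneParser as one of its building blocks.
class ContactInfo
{
  public string Name { get; set; }
  public PhoneNumber Phone { get; set; }
}

public static Parser<ContactInfo> contactParser =
  from name in Parse.Letter.AtLeastOnce().Text()
  from colon in Parse.Char(':')
  from space in Parse.WhiteSpace.Many()
  from phone in phoneParser
  select new ContactInfo { Name = name, Phone = phone };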

The Combinatorics Concept
Historically, parsing text has been the province of "language researchers" and academia, largely due to the complicated and difficult edit-generate-compile-test-debug cycle inherent in generative metaprogramming solutions. Trying to walk through computer-generated code -- particularly the finite-state-machine-based versions that many parser generators churn out -- in a debugger is a challenge to even the most hard-bitten developer. For that reason, most developers don't think about solutions along parsing lines when presented with a text-based problem. And, in truth, most of the time, a parser-generator-based solution would be drastic overkill.

Parser combinators serve as a nice in-between solution: flexible enough and powerful enough to handle some nontrivial parsing, without requiring a Ph.D. in computer science to understand how to use them.

About the Author

Ted Neward is a programming language, virtual machine, and enterprise-scale architect. He has written a dozen books and hundreds of articles on .NET, Java, enterprise systems, mobile development, and programming languages. He resides in the Pacific Northwest, and can be found on the Internet at www.tedneward.com, www.itrellis.com, @tedneward on Twitter, and blogs at blogs.tedneward.com.
