Modern C++

Compiler Basics How-To, Part 4: Improving the Parser

Now you have compiler fundamentals down pat, let's move on to some of the techniques you can use to improve the parser's design and performance.

So far in this series on compiler basics, I've shown:

  1. A simple yet efficient way of reading the original source code into memory using a memory-mapped file.
  2. Lexical analysis, which is the process of turning some source code into a stream of tokens.
  3. Syntax analysis, which is part of compilation that's often called parsing and is responsible for enforcing a given syntax or grammar.

Now, it's time to discuss a few techniques to improve the overall design of the parser. I'll cover performance, general structure and what can be done to considerably improve error handling.

Before I can do any of that I need to revisit the scanner once again. I discussed the scanner in part 2. I then added the Backtrack method in part 3 to solve the problem of the parser consuming a token that it can't deal with. Although it might seem convenient to have the ability to backtrack the scanner, it presents a number of challenges that aren't very easy to solve. 

The first and most obvious problem is that the last token is analyzed repeatedly. Modern processors are great at optimizing a scan of large chunks of memory, but backtracking represents a nontrivial speed bump to that process. Practically, it also poses a design challenge. I want to add the ability to pinpoint the line number for any error that may be detected. This would be simple enough, but for backtracking, the problem is somewhat alleviated by the Scanner Read method that first skips any whitespace before marking the previous scan position. Still, it makes me a little uncomfortable to think through all the possible side effects of a semi-bidirectional scanner.

As I continue to develop the compiler, I need to consider where to invest in some refactoring and this seems like a good bet. I'll improve the Scanner class to keep track of source line numbers and also change its public interface to avoid backtracking. First, I'll deal with line numbers by adding a member variable to keep track of this position:

unsigned m_line = 1;

Because line numbers are normally counted by starting with one rather than zero, I've initialized the variable to reflect this convention. Updating this line number is surprisingly simple, thanks to the Scanner SkipWhitespace method. As it stands, it simply skips over whitespace characters:

while (m_next != m_end)
{
  if (isspace(*m_next))
  {
    ++m_next;
  }
  else
  {
    break;
  }
}

I'm going to match `\n' -- line-feed -- to indicate a new line in the source file. This will accommodate both Windows- and Unix-style line endings. Some Windows programs use "\r\n" -- carriage-return, line-feed -- to denote new lines, but the line-feed is always at the end. If you'd like to handle a carriage-return on its own, then it's only marginally more complicated. Of course, a line-feed character is considered whitespace so I need to check for this before calling the more generic isspace function, as shown in Listing 1.

Listing 1: Check for Whitespace Before Calling Isspace Function
while (m_next != m_end)
{
  if ('\n' == *m_next)
  {
    ++m_next;
    ++m_line;
  }
  else if (isspace(*m_next))
  {
    ++m_next;
  }
  else
  {
    break;
  }
}

I can then simply add a public GetLine method to the Scanner class that the caller might use to report the line number in the event that a parsing error occurs:

unsigned GetLine() const noexcept
{
  return m_line;
}

Now I'll get rid of backtracking. As it stands, the Scanner class provides the Read method that consumes the next token in the stream and returns a constant indicating what type of token was read. To avoid backtracking, I need to give callers the ability to consider the next token without affecting the position of the scanner, so I'll start by adding a member variable to hold the last token:

Token m_token = Token::Invalid;

And I'll add a corresponding public method that callers may use to retrieve this value:

Token GetToken() const noexcept
{
  return m_token;
}

Now, instead of the Read method, which previously consumed one token and returned the corresponding enumerator, I'll add a Next method that returns nothing. The name change isn't strictly necessary, but it will highlight the change in semantics for existing callers. Callers can then assume a token is ready for consumption and only call the Next method after they've actually consumed the token. In this way, backtracking is avoided because one part of the parser can delegate the consumption of the token, should it decide not to handle it. The Next method isn't much different to the original Read method:

void Next() noexcept
{
  SkipWhitespace();

  if (m_next == m_end)
  {
    m_token = Token::EndOfFile;
    return;
  }

  char const c = *m_next;

The m_prev member variable is gone and this method is responsible for updating the m_token member variable instead. The sequence of if statements must, however, be changed because the token enumerator is not returned directly, as shown in Listing 2.

Listing 2: Change Sequence of If Statements
if ('+' == c)
{
  ++m_next;
  m_token = Token::Add;
}
else if ('-' == c)
{
  ++m_next;
  m_token = Token::Subtract;
}
// And so on...
else
{
  m_token = Token::Invalid;
}

So that takes care of the Scanner. Rather than cram this into the existing parser, I'm simply going to rewrite the three parsing functions and I'll take this opportunity to improve their error handling, as well. The existing parser returned a bool value to indicate success or failure, because there wasn't much more to report. But now that I have line number information, I'd like to improve the quality of the parsing experience. I'll begin by defining some typical errors that syntax analysis is likely to encounter:

enum class SyntaxError
{
  InvalidToken,
  PrimaryExpected,
  DivideByZero,
  RightParenthesisExpected,
  UnexpectedToken,
};

An invalid token is really just an acknowledgement that the underlying Scanner returned the Token::Invalid enumerator, but this is certainly an error. An unexpected token indicates that a valid token arrived when no token was expected, or at least that particular token wasn't expected. I'll define a simple exception class to carry this information, along with a line number, to the catch site:

struct SyntaxException
{
  SyntaxError Error;
  unsigned Line;

  SyntaxException(SyntaxError const error,
                  unsigned const line) :
    Error(error),
    Line(line)
  {}
};

As before, I'll start with some idealistic code, as shown in Listing 3.

Listing 3: Idealistic Code
try
{
  Scanner scanner(file.Begin(), file.End());
  scanner.Next();
  float const value = ScanExpression(scanner);

  if (Token::EndOfFile != scanner.GetToken())
  {
    throw SyntaxException(SyntaxError::UnexpectedToken,
                          scanner.GetLine());
  }

  printf("Result: %.2f\n", value);
}
catch (SyntaxException const & e)
{
  // Report error to user
}

Consider the code in Listing 3 for a moment. I start by priming the scanner with the first token in the stream. I do so before calling the ScanExpression function so that a token is ready for evaluation. The parsing functions can then simply call the Next method if and when they actually consume a token. Once the recursive parser returns I also need to ensure the end of the file has been reached. Because I'm using exceptions, I can take advantage of the parsing function's return value to return the result of the expression, further simplifying the overall design:

float ScanExpression(Scanner & scanner);
float ScanTerm(Scanner & scanner);
float ScanPrimary(Scanner & scanner);

I'll begin at the bottom, logically speaking, with the ScanPrimary function. It's responsible for parsing primary expressions. This is either a number or some expression within parentheses. Because the functions can assume the Scanner is primed with the next token, it can simply start by getting the current token:

float ScanPrimary(Scanner & scanner)
{
  Token const token = scanner.GetToken();

Assuming the token refers to a number, it can simply return the number itself:

if (Token::Number == token)
{
  float const value = scanner.GetNumber();

  scanner.Next();
  return value;
}

Notice that I'm careful to move the scanner forward because I've decided to consume this token. I'm also careful to get the number from the Scanner before calling Next. This is just a bit of defensive programming to ensure the number returned is valid. The same applies when processing the alternative:

if (Token::LeftParenthesis == token)
{
  scanner.Next();

Here, I'm consuming the left parentheses so I'm careful to move the Scanner forward again. I can then scan the expression itself:

float const value = ScanExpression(scanner);

I must then confirm the expression is followed by a right parenthesis. Because the ScanExpression function will also move the Scanner forward as needed, I can simply check whether the current token is as expected. If not, I can use the handy exception class to report the error up the stack to the caller:

if (Token::RightParenthesis != scanner.GetToken())
{
  throw SyntaxException(SyntaxError::RightParenthesisExpected,
                        scanner.GetLine());
}

Having consumed the right parenthesis, I can return the expression's value while remembering to move the Scanner forward:

scanner.Next();
return value;

The ScanPrimary function concludes by throwing an exception if neither of the alternatives are found:

throw SyntaxException(SyntaxError::PrimaryExpected,
                      scanner.GetLine());

Notice how each time an error occurs, I can safely capture the line number before unwinding the stack. This ensures the syntax error code and line number can outlive the Scanner object itself.

The ScanTerm and ScanExpression functions follow a similar pattern, except they continue to scan indefinitely until a token with which they cannot deal is encountered. Fortunately, backtracking is no longer required to break out of these parsing loops. ScanTerm expects a primary expression so it can simply call ScanPrimary before jumping in the loop:

float ScanTerm(Scanner & scanner)
{
  float value = ScanPrimary(scanner);

  for (;;)
  {
  }

  return value;
}

Ironically, the act of adding error reporting greatly simplifies the implementation of these parsing functions. Within the loop, ScanTerm can simply check for its grammar alternatives. First, multiplication:

Token const token = scanner.GetToken();

if (Token::Multiply == token)
{
  scanner.Next();
  value *= ScanPrimary(scanner);
}

And then division:

else if (Token::Divide == token)
{
  scanner.Next();
  float const other = ScanPrimary(scanner);

  if (0 == other)
  {
    throw SyntaxException(SyntaxError::DivideByZero,
                          scanner.GetLine());
  }

  value /= other;
}

Again, each time the parser consumes a token it must move the Scanner forward. If a token that doesn't represent a term is encountered, I simply punt and break out of the loop to allow the ScanExpression function to deal with it:

else
{
  break;
}

Notice again that I've removed the backtracking that would've been necessary at this point. The ScanExpression is almost identical, bar the obvious -- add and subtract rather than multiply and divide -- and the reporting of invalid tokens. I've included the updated code with this article for your convenience.

The compiler is maturing! Performance is improved with large source files. Non-trivial expressions demand better error handling and the parser now delivers. The refactoring has also led to an improved implementation that's easier to extend and simpler to reason about. I meant to touch on code comments and more, but that will have to wait until next time.

About the Author

Kenny Kerr is a computer programmer based in Canada, an author for Pluralsight and a Microsoft MVP. He blogs at kennykerr.ca and you can follow him on Twitter at twitter.com/kennykerr.

comments powered by Disqus

Featured

  • Get Started Using .NET Aspire with SQL Server & Azure SQL Database

    Microsoft experts are making the rounds educating developers about the company's new, opinionated, cloud-ready stack for building observable, production ready, distributed, cloud-native applications with .NET.

  • Microsoft Revamps Fledgling AutoGen Framework for Agentic AI

    Only at v0.4, Microsoft's AutoGen framework for agentic AI -- the hottest new trend in AI development -- has already undergone a complete revamp, going to an asynchronous, event-driven architecture.

  • IDE Irony: Coding Errors Cause 'Critical' Vulnerability in Visual Studio

    In a larger-than-normal Patch Tuesday, Microsoft warned of a "critical" vulnerability in Visual Studio that should be fixed immediately if automatic patching isn't enabled, ironically caused by coding errors.

  • Building Blazor Applications

    A trio of Blazor experts will conduct a full-day workshop for devs to learn everything about the tech a a March developer conference in Las Vegas keynoted by Microsoft execs and featuring many Microsoft devs.

  • Gradient Boosting Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the gradient boosting regression technique, where the goal is to predict a single numeric value. Compared to existing library implementations of gradient boosting regression, a from-scratch implementation allows much easier customization and integration with other .NET systems.

Subscribe on YouTube