Modern C++

Compiler Basics, Part 2: Building the Scanner

Kenny Kerr continues his series about compiler fundamentals by considering how the original source code may be converted into a stream of tokens.

More in this series:

Have you ever thought of writing your own compiler? This is Part 2 of a series on compiler basics. Check out Part 1, where I begin by demonstrating a simple yet efficient way of reading the original source code into memory using a memory-mapped file. Compiler texts often explore a variety of buffering strategies for loading source code, but the reality is that memory-mapped files are incredibly hard to beat in terms of both performance and simplicity.

In this installment I want to introduce the basics of lexical analysis. As I mentioned in my previous column, this is just a fancy term referring to the process of turning some source code into a stream of tokens. Specifically, I need to convert a stream of characters into such a stream of tokens. This is an abstraction of sorts, greatly simplifying subsequent stages of compilation.

If you read a compiler textbook, you'll discover that lexical analysis, specifically the part of the compiler that performs this task, goes by many different names. I'm just going to call it a scanner. Before I look at the implementation of the scanner, I need to define some tokens. I'm going to define a small set of tokens for the time being. Here's an enum class that covers some essential tokens for simple mathematical expressions:

enum class Token
{
  Invalid,
  EndOfFile,

  Number,
  Add,
  Subtract,
  Multiply,
  Divide,
  LeftParenthesis,
  RightParenthesis,
};

If the scanner comes across a character or character sequence that it doesn't recognize, it should use Token::Invalid to report as much. If it reaches the end of the source file, then Token::EndOfFile should be returned. The remaining Token members represent actual tokens and are self-explanatory.

In the same way that the FileMap class in my previous column hid the details for file mapping from my main program, I'm going to write a Scanner class to abstract or hide the details of lexical analysis from the rest of the compiler. I'll begin again by stylizing the code that I'd like to write. First, I'd like to load the source code and use the resulting FileMap object to initialize the Scanner class:

FileMap file(L"C:\\temp\\Sample.b");

if (!file)
{
  printf("Error: failed to open file\n");
  return 1;
}

Scanner scanner(file.Begin(), file.End());

Next, I should be able to simply print out each token in turn, terminating when the end of the file is reached or an invalid token is encountered. Something like what's shown in Listing 1.

Listing 1: Enumerating Tokens
for (;;)
{
  Token const token = scanner.Read();

  if (token == Token::Invalid)
  {
    printf("Invalid\n");
    break;
  }
  else if (token == Token::EndOfFile)
  {
    printf("End of file\n");
    break;
  }
  else if (token == Token::Number)
  {
    printf("%.2f\n", scanner.GetNumber());
  }
  else if (token == Token::Add)
  {
    printf("+\n");
  }
  else if (token == Token::Subtract)
  {
    printf("-\n");
  }
  // And so on ...
}

Clearly, the bulk of the work is going to go into writing the Read method of the Scanner class. The Scanner class essentially needs to keep track of the scanning position as well as any non-trivial tokens that may have been scanned. By non-trivial I mean any token that might have an associated value. At this point, the only one that qualifies is Token::Number. It might be sufficient to tell the caller that a plus character (+) translates into Token::Add, but the caller will want to know what the value of any Token::Number happens to be.

Therefore, the Scanner class needs to keep track of a few variables:

class Scanner
{
  char  const * const m_end    = nullptr;
  char  const *       m_next   = nullptr;
  float               m_number = 0.0f;

  // More to follow ...
};

The m_end member variable represents the constant bounds of the source file being scanned. This is essentially the end of the file. If the scanner reaches this point, then Token::EndOfFile should be returned by the Read method. Notice that both the pointer and what's pointed to are const -- there's no reason for either to be modified. Of course, that also means that the compiler won't be able to generate the default copy operations, and is sure to whine about that. This is easily solved by explicitly deleting the special copy operations, because they won't be needed. Here's how you can do that:

Scanner(Scanner const &) = delete;
Scanner & operator=(Scanner const &) = delete;

Although not required, it's best to place these declarations in the public part of the class. This tends to improve the diagnostics provided by the compiler. The m_next member variable points to a const character string, but the pointer itself is not const. While it can initially default to the start of the source file, it must also keep track of the current position -- in other words, the next character to scan. Both m_end and m_next can be initialized to something more meaningful inside the constructor:

Scanner(char const * const begin,
        char const * const end) throw() :
  m_end(end),
  m_next(begin)
{}

Initialization in the constructor takes precedence over any initializing value at the point where the member variable is declared. For this reason, I try to make it a habit to initialize member variables at the point of declaration, just in case I accidentally miss one of them in a constructor.

So the m_next member variable keeps track of the current scanning position and may be updated as needed. One common requirement is to update the scanner to skip over any whitespace characters. This can be done by simply scanning ahead until a non-whitespace character is reached. The CRT isspace function will do for the time being (see Listing 2).

Listing 2: Skipping Whitespace
void SkipWhitespace() throw()
{
  while (m_next != m_end)
  {
    if (isspace(*m_next))
    {
      ++m_next;
    }
    else
    {
      break;
    }
  }
}

The SkipWhitespace method moves the m_next pointer forward, so that it either points to the next non-whitespace character to be scanned or points to the end of the file. If no whitespace characters are found, it simply returns without updating the Scanner object. The SkipWhitespace method is just an implementation detail of the Scanner's Read method, and should be private. As you'd expect, the Read method begins like this:

Token Read() throw()
{
  SkipWhitespace();

  // More to follow ...

  return Token::Invalid;
}

The next thing the scanner must do is check whether there are even any characters to scan. While the FileMap class I explored in last month's column will not allow empty files to be loaded, eventually the Read method will reach the end of the file. There's certainly no point in carrying on if there are no further characters to analyze:

if (m_next == m_end)
{
  return Token::EndOfFile;
}

Assuming at least one character remains, I'll just grab a local copy for convenience:

char const c = *m_next;

I can then compare this character with some of the simpler tokens that have well-known representations but no associated value:

if ('+' == c)
{
  ++m_next;
  return Token::Add;
}

Here, I'm comparing the character against the symbol for adding two numbers together. If the characters are equal I don't need to scan any further, and can confidently inform the caller that Token::Add is the next token in the sequence. Of course, I need to remember to first increment the m_next member variable so that any subsequent call to the Read method will not simply rescan the same token. The same goes for the remaining trivial tokens, as shown in Listing 3.

Listing 3: Matching Additional Tokens
if ('-' == c)
{
  ++m_next;
  return Token::Subtract;
}

if ('*' == c)
{
  ++m_next;
  return Token::Multiply;
}

if ('/' == c)
{
  ++m_next;
  return Token::Divide;
}

// And so on ...

The only remaining token worth mentioning is that of the number. It would be quite dreary if I could only support single-digit numbers, so I'll instead write a simple parser for floating-point values. This takes a bit of thought, but is a good exercise in lexical analysis. To keep things simple, I'll restrict the scanner to some common representations supported by C++. Obviously, "123.456" should literally be 123.456 -- but there are a few edge cases that programmers take for granted but, perhaps, don't think too much about. All of these are valid in C++:

"1"    -> 1.0
"1."   -> 1.0
"1.0"  -> 1.0
".1"   -> 0.1

This, however, isn't valid:

"."    -> Syntax error!

Thinking about this prior to coding can help avoid a lot of debugging. The obvious stumbling block is a single decimal point standing on its own. While a number may begin, end and contain a decimal point, it does require at least one digit. That last example isn't a valid floating-point value in C++, so I'll make sure the scanner doesn't get confused by it, either.

You might be tempted to use the CRT strtof (string to float) function, as follows:

if ('.' == c || '0' <= c && '9' >= c)
{
  m_number = strtof(m_next, const_cast<char **>(&m_next));
  return Token::Number;
}

At first glance this seems reasonable enough. If the next character is a point or a digit, use the strtof function to interpret the floating-point value. Except for the annoying lack of const-correctness, even in the C++ Standard Library's version of this function, this really doesn't solve the problem. Sure, it works well most of the time. It simply and accurately interprets valid floating-point values in the source file. And it updates the m_next variable to point to the next character to be scanned. But when it comes to invalid characters, it's far from simple. Once you've added all of the pre- and post- processing to deal with all of the corner cases for which strtof either doesn't cater to or doesn't easily report as an error, you've lost much of the simplicity of this one-liner. I'll instead write my own method to handle this class of token:

if ('.' == c || '0' <= c && '9' >= c)
{
  return ReadNumber();
}

In this way, the ReadNumber method can deal with any invalid input, either returning Token::Number or Token::Invalid. It can store the token's value in the m_number member variable of the class, and update the m_next pointer as the strtof function would've done. That's not to say it's entirely simple: The ReadNumber method, which can join SkipWhitespace in the private part of the class, needs to keep track of a bit of state:

Token ReadNumber() throw()
{
  bool digit = false;
  bool point = false;
  unsigned divide = 1;

  // More to follow ...

  return Token::Invalid;
}

The digit local variable will tell me whether I've encountered a decimal digit at some point. I need at least one of those to have a valid floating-point value. The point local variable will tell me whether I've come across the decimal point. I'll need to use that to know by how much to adjust the final whole number, thinking in base 10 notation. Finally, that divide local variable will keep track of the number of decimal places with which to shift the number once scanning completes. Stick with me here -- it's not as complex as it might sound. I'll use the m_number member variable as part of the process of reading the number, so I'll start by initializing it to zero:

m_number = 0.0f;

The ReadNumber method is a bit like the generic Read method in that it must be able to scan a number of characters without overflowing the bounds of the source file. A simple loop will suffice:

for (char c = *m_next;
     m_next != m_end;
     c = *++m_next)
{
  // Scan for digits and decimal point ...
}

This just keeps the m_next variable moving forward, avoids reading beyond the end of the file, and provides a simple character variable to which the loop can refer. Inside the loop is where I need to distinguish between decimal digits, the decimal point and any other character that will signal the end of the number:

if ('0' <= c && '9' >= c)
{
  // More to follow ...
}
else if ('.' == c)
{
  point = true;
}
else
{
  break;
}

If a digit character is encountered, I need to flag the method's digit local variable and fold in the new digit. If the decimal point was previously encountered, I update the divide variable. Here's what it looks like:

digit = true;
m_number = 10 * m_number + (c - '0');

if (point)
{
  divide *= 10;
}

Finally, with the loop concluded, I need to check whether a digit was encountered. If so, I divide the number and return Token::Number to indicate success:

if (digit)
{
  m_number /= divide;
  return Token::Number;
}

And that's all I have time for this month. Given the following source file:

> type c:\temp\sample.b
1.2 + 3. * .4 / (4 + 5)

The program comes up with the expected output shown in Listing 4.

Listing 4: The Expected Output from the Program
> Sample.exe c:\temp\sample.b
1.20
+
3.00
*
0.40
/
(
4.00
+
5.00
)
End of file

The Scanner class is looking good, but there's still room for improvement. Did you notice it can't handle negative numbers? What about other types? Is a forward-only scanner sufficient? How could the SkipWhitespace method be improved? Clearly, there's a lot more to explore. What about performance? The implementation of ReadNumber is probably slower than most implementations of strtof, but it isn't less accurate. Remember that fast code that's incorrect is of little use. First, ensure the code is correct. Only then should you profile to find out where you can improve performance. There's no point optimizing code that doesn't even do what it's intended to do. Listing 5 shows the complete Scanner class, for clarity.

Join me next month, when I consider a simple interpreter and operator precedence, as I continue to explore compiler basics.

Listing 5: The Scanner Class
class Scanner
{
  char  const * const m_end    = nullptr;
  char  const *       m_next   = nullptr;
  float               m_number = 0.0f;

  void SkipWhitespace() throw()
  {
    while (m_next != m_end)
    {
      if (isspace(*m_next))
      {
        ++m_next;
      }
      else
      {
        break;
      }
    }
  }

  Token ReadNumber() throw()
  {
    bool digit      = false;
    bool point      = false;
    unsigned divide = 1;
    m_number        = 0.0f;

    for (char c = *m_next;
         m_next != m_end;
         c = *++m_next)
    {
      if ('0' <= c && '9' >= c)
      {
        digit = true;
        m_number = 10 * m_number + (c - '0');

        if (point)
        {
          divide *= 10;
        }
      }
      else if ('.' == c)
      {
        point = true;
      }
      else
      {
        break;
      }
    }

    if (digit)
    {
      m_number /= divide;
      return Token::Number;
    }

    return Token::Invalid;
  }

public:

  Scanner(Scanner const &) = delete;
  Scanner & operator=(Scanner const &) = delete;

  Scanner(char const * const begin,
          char const * const end) throw() :
    m_end(end),
    m_next(begin)
  {}

  float GetNumber() const throw()
  {
    return m_number;
  }

  Token Read() throw()
  {
    SkipWhitespace();

    if (m_next == m_end)
    {
      return Token::EndOfFile;
    }

    char const c = *m_next;

    if ('+' == c)
    {
      ++m_next;
      return Token::Add;
    }

    if ('-' == c)
    {
      ++m_next;
      return Token::Subtract;
    }

    if ('*' == c)
    {
      ++m_next;
      return Token::Multiply;
    }

    if ('/' == c)
    {
      ++m_next;
      return Token::Divide;
    }

    if ('(' == c)
    {
      ++m_next;
      return Token::LeftParenthesis;
    }

    if (')' == c)
    {
      ++m_next;
      return Token::RightParenthesis;
    }

    if ('.' == c || '0' <= c && '9' >= c)
    {
      return ReadNumber();
    }

    return Token::Invalid;
  }
};
comments powered by Disqus

Featured

  • Get Started Using .NET Aspire with SQL Server & Azure SQL Database

    Microsoft experts are making the rounds educating developers about the company's new, opinionated, cloud-ready stack for building observable, production ready, distributed, cloud-native applications with .NET.

  • Microsoft Revamps Fledgling AutoGen Framework for Agentic AI

    Only at v0.4, Microsoft's AutoGen framework for agentic AI -- the hottest new trend in AI development -- has already undergone a complete revamp, going to an asynchronous, event-driven architecture.

  • IDE Irony: Coding Errors Cause 'Critical' Vulnerability in Visual Studio

    In a larger-than-normal Patch Tuesday, Microsoft warned of a "critical" vulnerability in Visual Studio that should be fixed immediately if automatic patching isn't enabled, ironically caused by coding errors.

  • Building Blazor Applications

    A trio of Blazor experts will conduct a full-day workshop for devs to learn everything about the tech a a March developer conference in Las Vegas keynoted by Microsoft execs and featuring many Microsoft devs.

  • Gradient Boosting Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the gradient boosting regression technique, where the goal is to predict a single numeric value. Compared to existing library implementations of gradient boosting regression, a from-scratch implementation allows much easier customization and integration with other .NET systems.

Subscribe on YouTube