Modern C++

How To Write Your Own Compiler, Part 1: Mapping Source Files

Kenny Kerr starts his new series about compiler basics by looking at how to use memory-mapped files to read the original source code.

More in this series:

Have you ever thought of writing your own compiler? There are a number of good reasons to do this. It's incredibly educational and can be useful. It can solve problems and produce abstractions that can simplify some previously complex procedures.

However, there are just as many reasons not to write your own compiler. Good compilers already exist. If you're looking for a scripting language, there are good embeddable compilers that suit many scenarios. Still, if you love to program, chances are you've always had some romantic fascination with compilers and compiler development. When I first started exploring compiler development, I was frustrated that the available material tended toward the high end. So you want to write a C++ compiler! Just parsing C++ is a monumental effort. Most programmers can find endless entertainment writing a compiler for a simple Basic-style dialect. It's a great place to start because you can get a lot of practical experience without having to imbibe a lot of theory. Plus, you can often write it directly in C++ and not have to resort to compiler generators.

Over the next few months, I'm going to explore a few foundational concepts in writing simple compilers. I'm going to look at concepts such as lexical analysis (a fancy term referring to the process of turning some source code into a stream of tokens). Then, I'll show you how to write a simple interpreter before moving on to code generation for a stack-based virtual machine. I'm going to keep things simple and practical, glossing over many topics with which commercial compilers need to deal. At the end of the day, you should have enough experience to write useful compilers that can replace poor-performing interpreters in a number of applications. I might also explore ready-to-use compilers and scripting languages such as JavaScript and Lua. But I want to start by showing you what you can produce on your own.

The first thing I need is a way to read the original source code. Although I'll begin by parsing one-liners, you'll eventually want to load larger source files, so I'll do it properly from the start. I need a simple abstraction to load the contents of a text file into memory. A memory-mapped file is often the best choice. Of course, I want to hide the details away from my main program. I often suggest you write the code you'd like to use, before diving into the implementation. Perhaps something like this:

int main()
{
  FileMap file(L"C:\\temp\\Sample.b");

  if (!file)
  {
    printf("Error: failed to open file\n");
    return 1;
  }

  printf("Source: %.*s\n", file.Size(), file.Begin());

  Scanner scanner(file.Begin(), file.End());

  // ...
}

Given this compelling example, I need a FileMap class that maps the contents of the given file inside its constructor. I should be able to use the explicit Boolean operator of the class to check for failure. Because this will be implemented with the Windows API, the main function could even use the GetLastError function to determine the specific reason for the failure. The FileMap class should provide a Size method, as well as Begin and End methods, for convenient access to the file contents. And I'll just assume ASCII source code for simplicity.

This is going to be relatively simple with my unique_handle class template that can be downloaded here. I'll start by defining the FileMap class with only two pointers for members:

class FileMap
{
  char const * m_begin;
  char const * m_end;

  // ...
};

As with most resource managers, I'll disable copy semantics by placing this in the public part of the class:

FileMap(FileMap const &) = delete;
FileMap & operator=(FileMap const &) = delete;

Although these declarations could live in the private part of the class, the resulting error messages for violating callers will be less clear. I'll enable move semantics, again by placing the members shown in Listing 1 in the public part of the class.

Listing 1: Enabling Move Semantics
FileMap(FileMap && other) throw()
  : m_begin(other.m_begin),
    m_end(other.m_end)
{
  other.m_begin = nullptr;
  other.m_end = nullptr;
}

FileMap & operator=(FileMap && other) throw()
{
  if (this != &other)
  {
    m_begin = other.m_begin;
    m_end   = other.m_end;

    other.m_begin = nullptr;
    other.m_end = nullptr;
  }

  return *this;
}

I happen to be using the Visual Studio 2013 compiler, so although there are some C++11 features I can make use of, others are still lacking. Notably, I'm unable to use the noexcept specifier and instead use the deprecated throw exception specification. I'm also unable to use a defaulted move constructor and move assignment operator, which the next version of the Visual C++ compiler should be able to provide for me. That would avoid a lot of boilerplate code.

Now, the particular mechanics. A memory-mapped file involves three objects or resources. First, you need to create the file object representing the file on disk. Then, you need to use the resulting file handle to create a file mapping object. Next, you need to map a view of the file using the file mapping handle. Although the API makes it appear as if the view is just a pointer, that pointer is really a resource that needs managing, just like the file and file mapping objects. This might seem like a lot of work. The good news is that the OS reference counts these objects, which means you only need to hold on to the view and the system will take care of destroying the file and file mapping object in due course. However, you do need to close any handles you create. That's where the unique_handle class template comes in. I'll begin by opening the file in an explicit constructor:

explicit FileMap(wchar_t const * filename) throw() :
  m_begin(0),
  m_end(0)
{
  // ...
}

In the body of this constructor I can open the source file by calling the CreateFile function and passing the resulting handle directly to a unique_handle:

invalid_handle file
(
  CreateFile(filename,
             GENERIC_READ,
             FILE_SHARE_READ,
             nullptr,
             OPEN_EXISTING,
             FILE_ATTRIBUTE_NORMAL,
             nullptr)
);

Here I'm using invalid_handle, which is just a type alias provided by the handle.h header file, providing the appropriate traits for file handles. Recall that CreateFile returns INVALID_HANDLE_VALUE on failure. If the name "invalid_handle" throws you off, you can define your own typedef or type alias to make things clear, perhaps calling it "file_handle."

Given that the file handle is local to the constructor, I'd better use it directly before it goes out of scope and is closed. First, I'll check that the file was opened successfully:

if (!file) return;

The caller can deal with failure by using the explicit Boolean operator of the FileMap class. The GetLastError function can also be used to gather further information. Next, I can create the file mapping object with the following file handle:

null_handle map
(
  CreateFileMapping(file.get(),
                    nullptr,
                    PAGE_READONLY,
                    0, 0, 
                    nullptr)
);

Again, I need to check this was successful:

if (!map) return;

Like the file handle, the map handle is defined local to the constructor. Before creating the view, which the FileMap object will hold on to, I need to get the size of the file:

LARGE_INTEGER size = {};

if (!GetFileSizeEx(file.get(), &size)) return;

The size isn't needed to create the view, but dealing with this possible failure first will keep the error handling around the view nice and simple. Now I can create or map the view of the file and assign the result directly to the FileMap m_begin member:

m_begin = static_cast<char *>(MapViewOfFile(map.get(),
                                            FILE_MAP_READ,
                                            0, 0, // offset
                                            0));  // size

It's this m_begin pointer that represents the resource the FileMap class must take care of. But this, too, may fail:

if (!m_begin) return;

If all goes well, I can use the previously determined file size to mark the end of the view or the end of the file in memory:

m_end = m_begin + size.QuadPart;

So that's the constructor that does all the hard work of mapping the file into memory. The file will be available in memory until the view is unmapped, so I'll make that happen in the FileMap destructor, provided it was successfully mapped to begin with:

~FileMap() throw()
{
  if (m_begin)
  {
    VERIFY(UnmapViewOfFile(m_begin));
  }
}

The rest of the class is quite straightforward. I need the explicit Boolean operator the caller must use to determine whether the file was mapped successfully:

explicit operator bool() const throw()
{
  return m_begin != nullptr;
}

Now I need a Size method. This isn't strictly necessary, but it's mighty convenient:

size_t Size() const throw()
{
  return m_end - m_begin;
}

If you're a purist, you might make this a non-member function, and that wouldn't be a bad idea.

Finally, here's the Begin and End methods that define the half-open range of characters in the source file:

char const * Begin() const throw()
{
  return m_begin;
}

char const * End() const throw()
{
  return m_end;
}

And that takes care of mapping source files into memory, by far the simplest and most efficient way of getting at the source code that needs to be scanned. Now, given a source file such as this:

> type c:\temp\sample.b
1.2 + .23 * 3 / 1

I can write a simple program to map the file as follows:

int main()
{
  FileMap file(L"C:\\temp\\Sample.b");

  if (!file)
  {
    printf("Error: failed to open file\n");
    return 1;
  }

  printf("Source: %.*s\n", file.Size(), file.Begin());
}

And I get the expected results:

> Sample.exe
Source: 1.2 + .23 * 3 / 1

Listing 2 shows the complete FileMap class.

Listing 2: The File Map Class
class FileMap
{
    char const * m_begin;
    char const * m_end;

public:

    FileMap(FileMap const &) = delete;
    FileMap & operator=(FileMap const &) = delete;

    FileMap(FileMap && other) throw() :
        m_begin(other.m_begin),
        m_end(other.m_end)
    {
        other.m_begin = nullptr;
        other.m_end = nullptr;
    }

    FileMap & operator=(FileMap && other) throw()
    {
        if (this != &other)
        {
            m_begin = other.m_begin;
            m_end   = other.m_end;

            other.m_begin = nullptr;
            other.m_end = nullptr;
        }

        return *this;
    }

    explicit FileMap(wchar_t const * filename) throw() :
        m_begin(0),
        m_end(0)
    {
        invalid_handle file
        (
            CreateFile(filename,
                       GENERIC_READ,
                       FILE_SHARE_READ,
                       nullptr,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       nullptr)
        );

        if (!file) return;

        null_handle map
        (
            CreateFileMapping(file.get(),
                              nullptr,
                              PAGE_READONLY,
                              0, 0, 
                              nullptr)
        );

        if (!map) return;

        LARGE_INTEGER size = {};

        if (!GetFileSizeEx(file.get(), &size)) return;

        m_begin = static_cast<char *>(MapViewOfFile(map.get(),
                                                    FILE_MAP_READ,
                                                    0, 0, // offset
                                                    0));  // size

        if (!m_begin) return;

        m_end = m_begin + size.QuadPart;
    }

    ~FileMap() throw()
    {
        if (m_begin)
        {
            VERIFY(UnmapViewOfFile(m_begin));
        }
    }

    explicit operator bool() const throw()
    {
        return m_begin != nullptr;
    }

    size_t Size() const throw()
    {
        return m_end - m_begin;
    }

    char const * Begin() const throw()
    {
        return m_begin;
    }

    char const * End() const throw()
    {
        return m_end;
    }
};

And that's all for this month. Join me next month as I explore lexical analysis, the process of turning this pile of characters into a sequence of tokens with which the compiler can reason.

About the Author

Kenny Kerr is a computer programmer based in Canada, an author for Pluralsight and a Microsoft MVP. He blogs at kennykerr.ca and you can follow him on Twitter at twitter.com/kennykerr.

comments powered by Disqus

Featured

  • AdaBoost Binary Classification Using C#

    Dr. James McCaffrey from Microsoft Research presents a C# program that illustrates using the AdaBoost algorithm to perform binary classification for spam detection. Compared to other classification algorithms, AdaBoost is powerful and works well with small datasets, but is sometimes susceptible to model overfitting.

  • From Core to Containers to Orchestration: Modernizing Your Azure Compute

    The cloud changed IT forever. And then containers changed the cloud. And then Kubernetes changed containers. And then microservices usurped monoliths, and so it goes in the cloudscape. Here's help to sort it all out.

  • The Well-Architected Architect on Azure

    In the dynamic field of cloud computing, the architect's role is increasingly pivotal as they must navigate a complex landscape, considering everything from the overarching architecture and individual service configurations to the various trade-offs involved. Here's help.

  • Windows Community Toolkit Update Improves Controls

    The Windows Community Toolkit advanced to version 8.1, adding new features, improving existing controls and making dependency changes.

Subscribe on YouTube