How To Write Your Own Compiler, Part 1: Mapping Source Files -- Visual Studio Magazine

How To Write Your Own Compiler, Part 1: Mapping Source Files

Kenny Kerr starts his new series about compiler basics by looking at how to use memory-mapped files to read the original source code.

By Kenny Kerr
05/07/2014

More in this series:

Compiler Basics, Part 2: Building the Scanner

Have you ever thought of writing your own compiler? There are a number of good reasons to do this. It's incredibly educational and can be useful. It can solve problems and produce abstractions that can simplify some previously complex procedures.

However, there are just as many reasons not to write your own compiler. Good compilers already exist. If you're looking for a scripting language, there are good embeddable compilers that suit many scenarios. Still, if you love to program, chances are you've always had some romantic fascination with compilers and compiler development. When I first started exploring compiler development, I was frustrated that the available material tended toward the high end. So you want to write a C++ compiler! Just parsing C++ is a monumental effort. Most programmers can find endless entertainment writing a compiler for a simple Basic-style dialect. It's a great place to start because you can get a lot of practical experience without having to imbibe a lot of theory. Plus, you can often write it directly in C++ and not have to resort to compiler generators.

Over the next few months, I'm going to explore a few foundational concepts in writing simple compilers. I'm going to look at concepts such as lexical analysis (a fancy term referring to the process of turning some source code into a stream of tokens). Then, I'll show you how to write a simple interpreter before moving on to code generation for a stack-based virtual machine. I'm going to keep things simple and practical, glossing over many topics with which commercial compilers need to deal. At the end of the day, you should have enough experience to write useful compilers that can replace poor-performing interpreters in a number of applications. I might also explore ready-to-use compilers and scripting languages such as JavaScript and Lua. But I want to start by showing you what you can produce on your own.

The first thing I need is a way to read the original source code. Although I'll begin by parsing one-liners, you'll eventually want to load larger source files, so I'll do it properly from the start. I need a simple abstraction to load the contents of a text file into memory. A memory-mapped file is often the best choice. Of course, I want to hide the details away from my main program. I often suggest you write the code you'd like to use, before diving into the implementation. Perhaps something like this:

int main()
{
  FileMap file(L"C:\\temp\\Sample.b");

  if (!file)
  {
    printf("Error: failed to open file\n");
    return 1;
  }

  printf("Source: %.*s\n", file.Size(), file.Begin());

  Scanner scanner(file.Begin(), file.End());

  // ...
}

Given this compelling example, I need a FileMap class that maps the contents of the given file inside its constructor. I should be able to use the explicit Boolean operator of the class to check for failure. Because this will be implemented with the Windows API, the main function could even use the GetLastError function to determine the specific reason for the failure. The FileMap class should provide a Size method, as well as Begin and End methods, for convenient access to the file contents. And I'll just assume ASCII source code for simplicity.

This is going to be relatively simple with my unique_handle class template that can be downloaded here. I'll start by defining the FileMap class with only two pointers for members:

class FileMap
{
  char const * m_begin;
  char const * m_end;

  // ...
};

As with most resource managers, I'll disable copy semantics by placing this in the public part of the class:

FileMap(FileMap const &) = delete;
FileMap & operator=(FileMap const &) = delete;

Although these declarations could live in the private part of the class, the resulting error messages for violating callers will be less clear. I'll enable move semantics, again by placing the members shown in Listing 1 in the public part of the class.

Listing 1: Enabling Move Semantics

FileMap(FileMap && other) throw()
  : m_begin(other.m_begin),
    m_end(other.m_end)
{
  other.m_begin = nullptr;
  other.m_end = nullptr;
}

FileMap & operator=(FileMap && other) throw()
{
  if (this != &other)
  {
    m_begin = other.m_begin;
    m_end   = other.m_end;

    other.m_begin = nullptr;
    other.m_end = nullptr;
  }

  return *this;
}

I happen to be using the Visual Studio 2013 compiler, so although there are some C++11 features I can make use of, others are still lacking. Notably, I'm unable to use the noexcept specifier and instead use the deprecated throw exception specification. I'm also unable to use a defaulted move constructor and move assignment operator, which the next version of the Visual C++ compiler should be able to provide for me. That would avoid a lot of boilerplate code.

Now, the particular mechanics. A memory-mapped file involves three objects or resources. First, you need to create the file object representing the file on disk. Then, you need to use the resulting file handle to create a file mapping object. Next, you need to map a view of the file using the file mapping handle. Although the API makes it appear as if the view is just a pointer, that pointer is really a resource that needs managing, just like the file and file mapping objects. This might seem like a lot of work. The good news is that the OS reference counts these objects, which means you only need to hold on to the view and the system will take care of destroying the file and file mapping object in due course. However, you do need to close any handles you create. That's where the unique_handle class template comes in. I'll begin by opening the file in an explicit constructor:

explicit FileMap(wchar_t const * filename) throw() :
  m_begin(0),
  m_end(0)
{
  // ...
}

In the body of this constructor I can open the source file by calling the CreateFile function and passing the resulting handle directly to a unique_handle:

invalid_handle file
(
  CreateFile(filename,
             GENERIC_READ,
             FILE_SHARE_READ,
             nullptr,
             OPEN_EXISTING,
             FILE_ATTRIBUTE_NORMAL,
             nullptr)
);

Here I'm using invalid_handle, which is just a type alias provided by the handle.h header file, providing the appropriate traits for file handles. Recall that CreateFile returns INVALID_HANDLE_VALUE on failure. If the name "invalid_handle" throws you off, you can define your own typedef or type alias to make things clear, perhaps calling it "file_handle."

Given that the file handle is local to the constructor, I'd better use it directly before it goes out of scope and is closed. First, I'll check that the file was opened successfully:

if (!file) return;

The caller can deal with failure by using the explicit Boolean operator of the FileMap class. The GetLastError function can also be used to gather further information. Next, I can create the file mapping object with the following file handle:

null_handle map
(
  CreateFileMapping(file.get(),
                    nullptr,
                    PAGE_READONLY,
                    0, 0, 
                    nullptr)
);

Again, I need to check this was successful:

if (!map) return;

Like the file handle, the map handle is defined local to the constructor. Before creating the view, which the FileMap object will hold on to, I need to get the size of the file:

LARGE_INTEGER size = {};

if (!GetFileSizeEx(file.get(), &size)) return;

The size isn't needed to create the view, but dealing with this possible failure first will keep the error handling around the view nice and simple. Now I can create or map the view of the file and assign the result directly to the FileMap m_begin member:

m_begin = static_cast<char *>(MapViewOfFile(map.get(),
                                            FILE_MAP_READ,
                                            0, 0, // offset
                                            0));  // size

It's this m_begin pointer that represents the resource the FileMap class must take care of. But this, too, may fail:

if (!m_begin) return;

If all goes well, I can use the previously determined file size to mark the end of the view or the end of the file in memory:

m_end = m_begin + size.QuadPart;

So that's the constructor that does all the hard work of mapping the file into memory. The file will be available in memory until the view is unmapped, so I'll make that happen in the FileMap destructor, provided it was successfully mapped to begin with:

~FileMap() throw()
{
  if (m_begin)
  {
    VERIFY(UnmapViewOfFile(m_begin));
  }
}

The rest of the class is quite straightforward. I need the explicit Boolean operator the caller must use to determine whether the file was mapped successfully:

explicit operator bool() const throw()
{
  return m_begin != nullptr;
}

Now I need a Size method. This isn't strictly necessary, but it's mighty convenient:

size_t Size() const throw()
{
  return m_end - m_begin;
}

If you're a purist, you might make this a non-member function, and that wouldn't be a bad idea.

Finally, here's the Begin and End methods that define the half-open range of characters in the source file:

char const * Begin() const throw()
{
  return m_begin;
}

char const * End() const throw()
{
  return m_end;
}

And that takes care of mapping source files into memory, by far the simplest and most efficient way of getting at the source code that needs to be scanned. Now, given a source file such as this:

> type c:\temp\sample.b
1.2 + .23 * 3 / 1

I can write a simple program to map the file as follows:

int main()
{
  FileMap file(L"C:\\temp\\Sample.b");

  if (!file)
  {
    printf("Error: failed to open file\n");
    return 1;
  }

  printf("Source: %.*s\n", file.Size(), file.Begin());
}

And I get the expected results:

> Sample.exe
Source: 1.2 + .23 * 3 / 1

Listing 2 shows the complete FileMap class.

Listing 2: The File Map Class

class FileMap
{
    char const * m_begin;
    char const * m_end;

public:

    FileMap(FileMap const &) = delete;
    FileMap & operator=(FileMap const &) = delete;

    FileMap(FileMap && other) throw() :
        m_begin(other.m_begin),
        m_end(other.m_end)
    {
        other.m_begin = nullptr;
        other.m_end = nullptr;
    }

    FileMap & operator=(FileMap && other) throw()
    {
        if (this != &other)
        {
            m_begin = other.m_begin;
            m_end   = other.m_end;

            other.m_begin = nullptr;
            other.m_end = nullptr;
        }

        return *this;
    }

    explicit FileMap(wchar_t const * filename) throw() :
        m_begin(0),
        m_end(0)
    {
        invalid_handle file
        (
            CreateFile(filename,
                       GENERIC_READ,
                       FILE_SHARE_READ,
                       nullptr,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       nullptr)
        );

        if (!file) return;

        null_handle map
        (
            CreateFileMapping(file.get(),
                              nullptr,
                              PAGE_READONLY,
                              0, 0, 
                              nullptr)
        );

        if (!map) return;

        LARGE_INTEGER size = {};

        if (!GetFileSizeEx(file.get(), &size)) return;

        m_begin = static_cast<char *>(MapViewOfFile(map.get(),
                                                    FILE_MAP_READ,
                                                    0, 0, // offset
                                                    0));  // size

        if (!m_begin) return;

        m_end = m_begin + size.QuadPart;
    }

    ~FileMap() throw()
    {
        if (m_begin)
        {
            VERIFY(UnmapViewOfFile(m_begin));
        }
    }

    explicit operator bool() const throw()
    {
        return m_begin != nullptr;
    }

    size_t Size() const throw()
    {
        return m_end - m_begin;
    }

    char const * Begin() const throw()
    {
        return m_begin;
    }

    char const * End() const throw()
    {
        return m_end;
    }
};

And that's all for this month. Join me next month as I explore lexical analysis, the process of turning this pile of characters into a sequence of tokens with which the compiler can reason.

About the Author

Kenny Kerr is a computer programmer based in Canada, an author for Pluralsight and a Microsoft MVP. He blogs at kennykerr.ca and you can follow him on Twitter at twitter.com/kennykerr.

Printable Format

comments powered by Disqus

Featured

Low-Code Report Says AI Will Enhance, Not Replace DIY Dev Tools

Along with replacing software developers and possibly killing humanity, advanced AI is seen by many as a death knell for the do-it-yourself, low-code/no-code tooling industry, but a new report belies that notion.
Vibe Coding with Latest Visual Studio Preview

Microsoft's latest Visual Studio preview facilitates "vibe coding," where developers mainly use GitHub Copilot AI to do all the programming in accordance with spoken or typed instructions.
Steve Sanderson Previews AI App Dev: Small Models, Agents and a Blazor Voice Assistant

Blazor creator Steve Sanderson presented a keynote at the recent NDC London 2025 conference where he previewed the future of .NET application development with smaller AI models and autonomous agents, along with showcasing a new Blazor voice assistant project demonstrating cutting-edge functionality.
Microsoft Closes Request for Universal UI Builder: 'It's Baffling'

Microsoft last week closed a feedback request for a universal UI builder as capable as WinForms, putting an end to a long-sought coding nirvana with a decision that angered some developers.
Azure AI Foundry Gets NVIDIA Tech

AI powerhouse NVIDIA flexed its muscle at its GTC 2025 conference this week where several partnerships with Microsoft were announced, mostly concerning Microsoft's Azure AI Foundry offering.