Writing a Compiler in C#: Lexical Analysis

Sasha Goldshtein

Oct. 06, 10 · News

I’m going to write a compiler for a simple language. The compiler will be written in C#, and will have multiple back ends. The first back end will compile the source code to C, and use cl.exe (the Visual C++ compiler) to produce an executable binary.

But first, a minor digression.

Over my blogging years, I developed this tendency of abandoning blog post series just prior to their final installment. I abandoned the unit testing series, the primality testing series, and many other “series”.

Therefore, I’m not going to call this thing a “series”. I might be able to write another post on the same subject, or I might not; if I don’t, I’ll post the whole lump of source code here and have you decide if it’s worth continuing on your own.

With that said, let’s start by introducing the language for which we’ll write a compiler. It’s called Jack, and I haven’t made it up: it’s a teaching language used in the book The Elements of Computing Systems by Noam Nisan and Shimon Schocken, with some minor modifications I introduced. The language is designed to make lexical analysis, parsing, and code generation as easy as possible. (Indeed, the HUJI course From NAND to Tetris covers compiler construction in two lessons, and students complete a working Jack compiler, targeting an intermediate VM representation, in slightly less than three weeks.)

Next, the obligatory “hello world” program in Jack:

class main {
    function void main() {
        do system.print("hello world!");
        do system.println();
    }
}

And now a more realistic example that demonstrates some of Jack’s coding constructs:

class main {
    function array initnumbers() {
        var array numbers;
        let numbers = array.new(5);
        let numbers[0] = 13;
        let numbers[1] = 14;
        let numbers[2] = 41;
        let numbers[3] = 97;
        let numbers[4] = 101;
        return numbers;
    }
    function void main() {
        var array numbers;
        var int i, j;
        var boolean prime;
        let numbers = main.initnumbers();
        let j = 0;
        while (j < 5) {
            let i = 2;
            let prime = true;
            while (i < numbers[j] & prime) {
                if (numbers[j] % i = 0) {
                    do system.printint(numbers[j]);
                    do system.print(" is composite.");
                    do system.println();
                    let prime = false;
                }
                let i = i + 1;
            }
            if (prime) {
                do system.printint(numbers[j]);
                do system.print(" is prime.");
                do system.println();
            }
            let j = j + 1;
        }
    }
}

This program’s output is:

c:\jackcompiler>jackcompiler.exe helloworld.jack

c:\jackcompiler>out
13 is prime.
14 is composite.
41 is prime.
97 is prime.
101 is prime.

Assuming that we’re not interested in extraneous formalism, we can go ahead and think about the first part of the compiler: the lexical analyzer, or the tokenizer. The structure of a compiler is well illustrated by the following diagram [source]:

Before we attach semantic meaning to the language constructs, we have to take care of details such as skipping unnecessary whitespace, recognizing legal identifiers, separating symbols from keywords, and so on. This is the purpose of the lexical analyzer, which takes an input stream of characters and generates from it a stream of tokens, elements that can be processed by the parser. Sometimes the parser constructs a parse tree (or an abstract syntax tree) or some other intermediate representation of the source code; at other times, the parser directly instructs the compiler back end (or code generator) to synthesize the executable program.

Normally, you wouldn’t write the lexical analyzer by hand. Instead, you provide a tool such as flex with a list of regular expressions and rules, and obtain from it a working program capable of generating tokens. For example, the following regular expression recognizes all legal Jack identifiers:

[_a-zA-Z][_a-zA-Z0-9]*
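As a quick way to try that pattern out, here is a minimal sketch using .NET's Regex class. The class and method names (IdentifierPattern, IsLegalIdentifier) are invented for this illustration and are not part of the compiler; the \A and \z anchors are added so that the whole string has to match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class IdentifierPattern
{
    // An underscore or letter, followed by any number of underscores,
    // letters, or digits. \A and \z anchor the match to the whole input.
    private static readonly Regex Identifier =
        new Regex(@"\A[_a-zA-Z][_a-zA-Z0-9]*\z");

    public static bool IsLegalIdentifier(string text)
    {
        return Identifier.IsMatch(text);
    }
}
```

Note that a keyword such as class also matches this pattern, so a tokenizer still has to check a matched word against the keyword list before calling it an identifier.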

However, for didactic reasons, we will roll our own lexical analyzer by hand. It’s not a very challenging task, either; dealing with comments and extraneous whitespace is probably the hardest part.
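To give a feel for that part, here is a rough sketch of skipping whitespace and comments over a source string. It assumes the comment syntax of standard Jack (// to the end of the line, and /* ... */ blocks); the WhitespaceSkipper class and its Skip method are hypothetical and work on a plain string and index rather than on the analyzer's real input stream.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class WhitespaceSkipper
{
    // Returns the index of the first character at or after 'pos' that is
    // neither whitespace nor inside a comment (source.Length if there is none).
    public static int Skip(string source, int pos)
    {
        while (pos < source.Length)
        {
            if (char.IsWhiteSpace(source[pos]))
            {
                pos++;
            }
            else if (StartsWith(source, pos, "//"))
            {
                // Line comment: stop at the line break, which the
                // whitespace branch consumes on the next iteration.
                while (pos < source.Length && source[pos] != '\n') pos++;
            }
            else if (StartsWith(source, pos, "/*"))
            {
                // Block comment: jump past the closing "*/".
                int end = source.IndexOf("*/", pos + 2, StringComparison.Ordinal);
                if (end < 0)
                    throw new FormatException("Unterminated block comment");
                pos = end + 2;
            }
            else
            {
                break; // start of a real token (a lone '/' included)
            }
        }
        return pos;
    }

    private static bool StartsWith(string source, int pos, string marker)
    {
        return string.CompareOrdinal(source, pos, marker, 0, marker.Length) == 0;
    }
}
```

The one subtle point is that a single '/' must not be mistaken for the start of a comment, which is why the sketch checks two characters before committing.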

The following is the primary method of our lexical analyzer. (The rest of its implementation was omitted for brevity.)

//Note: the Contains call below requires "using System.Linq;"
//at the top of the file.
public void Advance()
{
    EatWhitespace();

    if (IsAtEnd)
    {
        _done = true;
        return;
    }
    char nextChar = NextChar();
    if (Syntax.IsSymbol(nextChar))
    {
        //This token is going to be a symbol. There are
        //three special look-ahead cases for '<=', '>=',
        //and '!='.
        if ((new[] { '<', '>', '!' }.Contains(nextChar))
            && LookAhead() == '=')
        {
            NextChar();//Eat the '='
            _currentToken = new Token(
                TokenType.Symbol, nextChar + "=");
        }
        else
        {
            _currentToken = new Token(
                TokenType.Symbol, nextChar.ToString());
        }
    }
    else if (Syntax.IsNumber(nextChar))
    {
        //This token is going to be an integer constant.
        string intConst = nextChar.ToString();
        intConst += EatWhile(Syntax.IsNumber);

        int result;
        if (!int.TryParse(intConst, out result))
        {
            throw new CompilationException(
            "int const must be in range [0,2147483648), " +
            "but got: " + intConst, _currentLine);
        }

        _currentToken = new Token(
            TokenType.IntConst, intConst);
    }
    else if (Syntax.IsCharOrdinalStart(nextChar))
    {
        //This token is going to be a character ordinal,
        //which is emitted as an integer constant.
        char marker = NextChar();
        if (marker == '\\')
        {
            //Escaped form: exactly three decimal digits.
            string code = EatWhile(Syntax.IsNumber);
            if (code.Length != 3)
            {
                throw new CompilationException(
                "expected: \\nnn where n are decimal digits",
                _currentLine);
            }
            int value = int.Parse(code);
            if (value >= 256)
            {
                throw new CompilationException(
                "character ordinal is out of range [0,255]",
                _currentLine);
            }
            _currentToken = new Token(
                TokenType.IntConst, value.ToString());
        }
        else
        {
            //Plain form: use the character's numeric value.
            _currentToken = new Token(
                TokenType.IntConst, ((int)marker).ToString());
        }
        NextChar();//Swallow the end of the character ordinal
    }
    else if (Syntax.IsStringConstantStart(nextChar))
    {
        //This token is going to be a string constant.
        string strConst = EatWhile(
            c => !Syntax.IsStringConstantStart(c));
        NextChar();//Swallow the end of the string constant
        _currentToken = new Token(
            TokenType.StrConst, strConst);
    }
    else if (Syntax.IsStartOfKeywordOrIdent(nextChar))
    {
        //This token is going to be a keyword or an identifier.
        string keywordOrIdent = nextChar.ToString();
        keywordOrIdent += EatWhile(
                          Syntax.IsPartOfKeywordOrIdent);
        if (Syntax.IsKeyword(keywordOrIdent))
        {
            _currentToken = new Token(
                TokenType.Keyword, keywordOrIdent);
        }
        else
        {
            _currentToken = new Token(
                TokenType.Ident, keywordOrIdent);
        }
    }
    else
    {
        throw new CompilationException(
            "unexpected character: " + nextChar, _currentLine);
    }
}

There are five interesting cases here from which five different token types can be generated:

1. A symbol [TokenType.Symbol], which may contain two characters, explaining the need for additional look-ahead with ‘<’, ‘>’, and ‘!’. Note that the additional look-ahead may fail if the symbol is placed at the end of the file, but this is not a legal language construct anyway.
2. A numeric constant [TokenType.IntConst]. We currently allow only integer constants, as the language doesn’t have floating-point support.
3. A character ordinal constant such as ‘h’ or ‘\032’. These are translated to numeric constants as in #2.
4. A literal string constant [TokenType.StrConst] such as “hello world”. Note that ‘”’ is not a legal character within a literal string constant. We leave it for now as a language limitation.
5. A keyword or an identifier [TokenType.Keyword or TokenType.Ident], matching the previously shown regular expression.
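The helper methods that the primary method relies on were omitted from the post. The following is only a guess at what they might look like, written as a standalone class over an in-memory string: the member names mirror the calls used above, but the CharCursor class itself and every method body are assumptions. In this sketch the look-ahead returns '\0' at the end of input instead of failing, which sidesteps the end-of-file case mentioned in the first item.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the omitted character-level helpers.
public sealed class CharCursor
{
    private readonly string _source;
    private int _position;

    public CharCursor(string source)
    {
        _source = source ?? throw new ArgumentNullException(nameof(source));
    }

    public bool IsAtEnd => _position >= _source.Length;

    // Peeks at the next character without consuming it.
    // Returns '\0' when there is no more input.
    public char LookAhead() => IsAtEnd ? '\0' : _source[_position];

    // Consumes and returns the next character.
    public char NextChar()
    {
        if (IsAtEnd)
            throw new InvalidOperationException("Unexpected end of input");
        return _source[_position++];
    }

    // Consumes characters for as long as the predicate holds
    // and returns everything that was consumed.
    public string EatWhile(Func<char, bool> predicate)
    {
        var eaten = new StringBuilder();
        while (!IsAtEnd && predicate(_source[_position]))
            eaten.Append(_source[_position++]);
        return eaten.ToString();
    }
}
```

With a cursor like this, an unterminated string constant surfaces as an exception from the NextChar call that swallows the closing quote, rather than as a silent read past the end of the input.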

This lexical analyzer is rather “dumb”: it does not record identifier information anywhere, and it doesn’t provide access to anything but the current token. It turns out that we don’t need anything else for the current Jack syntax. Formally speaking, it is almost an LL(1) language, i.e., most of its language constructs can be parsed with only one look-ahead token. The single LL(2) exception is subroutine calls within expressions, and we’ll craft a special case in the parser to work around this limitation.
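To see why one token is not enough there, consider an expression that begins with an identifier: numbers on its own is a variable, while main.initnumbers() is a subroutine call, and the identifier alone cannot tell the two apart. A minimal sketch of the decision, assuming calls are written as name(...) or name.name(...) as in standard Jack; the TermLookAhead class is hypothetical and says nothing about how the actual parser handles this case.

```csharp
// Hypothetical helper, for illustration only.
public static class TermLookAhead
{
    // Given the token that follows an identifier at the start of an
    // expression term, reports whether the term is a subroutine call.
    public static bool StartsSubroutineCall(string tokenAfterIdentifier)
    {
        return tokenAfterIdentifier == "("    // e.g. foo(...)
            || tokenAfterIdentifier == ".";   // e.g. main.initnumbers()
    }
}
```

Any other following token (an operator, a closing parenthesis, a semicolon, and so on) leaves the identifier as an ordinary variable reference.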

For the “hello world” program above, this lexical analyzer will produce the following sequence of tokens:

<keyword, class> <ident, main> <symbol, {>
<keyword, function> <keyword, void> <ident, main>
<symbol, (> <symbol, )> <symbol, {>
<keyword, do> <ident, system> <symbol, .>
<ident, print> <symbol, (> <strconst, hello world!>
<symbol, )> <symbol, ;> …

Next time, some parsing.

Some references if you want to keep reading while I’m writing the subsequent parts:

“The Dragon Book” (Compilers: Principles, Techniques, and Tools), the absolutely ultimate guide to compiler construction
Let’s Build a Compiler, an informal guide to compiler construction in Pascal; unfortunately, an unfinished series
Compiler writing resources on stackoverflow.com
