tsPEG - A Fully Featured TypeScript Parser Generator

As part of my final year project, I needed to generate a parser in TypeScript. However I couldn’t find a parser generator that I thought would do the trick.

There are some great JavaScript parser generators, with TypeScript plugins, out in the world, but they didn’t quite generate the strictly typed ASTs that I wanted.

So I decided to make my own, namely tsPEG.

Making my own

Unsurprisingly, rolling my own parser generator turned out to be no easy task. I took a large part of my approach from Guido Van Rossum’s fantastic series of blog posts about PEG Parsers.

Bootstrapping and Self-Generation

The approach I took was to first hand write a small parser that could read a simple grammar, and generate a very basic parser from them, in the same fashion. Once the parser was powerful enough, it could generate it’s own parser. This self-generation admits a powerful bootstrapping process to add the new features we want to the language.

Each subsequent version of the parser then added more and more complex features. Starting with regex based lexing, then Kleene closure operators, optionals, and then positive and negative lookahead operators.

The Generator

tspeg takes in a grammar described in a simple familiar syntax, here is an example syntax for simple arithmetic expression.

SUM  := head=FAC tail={ op='\+|-' sm=FAC }*
FAC  := head=ATOM tail={ op='\*|/' sm=ATOM }*
ATOM := val=INT
      | '\(' val=SUM '\)'
INT  := val='[0-9]+'

tspeg takes a syntax description like this and turns it into a well typed parser that can take in any arithmetic expression and parse it, with correct precedence and associativity.

Syntax

The meta-grammar syntax is self-described as follows.

GRAM      := RULEDEF+
RULEDEF   := _ name=NAME _ ':=' _ rule=RULE _
RULE      := head=ALT tail={_ '\|' _ alt=ALT }*
ALT       := MATCHSPEC+
MATCHSPEC := _ named={name=NAME '='}? rule=POSTOP _
POSTOP    := pre=PREOP op='\+|\*|\?'?
PREOP     := op='\&|!'? at=ATOM
ATOM      := name=NAME !'\s*:='
           | match=STRLIT
           | '{' _ sub=RULE _ '}'
NAME      := '[a-zA-Z_]+'
STRLIT    := '\'' val='([^\'\\]|(\\.))*' '\''
_         := '\s*'

Grammar File: A Grammar file consists of a sequence of rule definitions
Rule: A rule definition consists of a name followed by a rule description
Rule description: A rule description is given in the familiar format, a sequence of expressions, seperated by ‘|’s
Expression: An expression is a space separated sequence of match descriptions.
Match Descriptions: Match descriptions are either named or unnamed, named matches are preceded by a <name>=
Matches: Matches can be just regex literals, e.g. ‘[a-z]+’, subrules { <rule> }, or a reference to another named rule. They can also be modified by prefix and postfix operators.

Operators

`+` postfix	The `+` operator as a postfix operator modifies the match to greedily match one or more instances of the match
`*` postfix	The `*` operator acts the same as `+`, but matching zero or more instances
`?` postfix	The `?` operator marks the preceding match as optional, failing to match an optional match will not cause the rule expression to fail
`&` prefix	The `&` operator causes the following match to not consume the input, even if the match is successful
`!` prefix	The `!` operator causes the following match to succeed if and only if the match fails