tsPEG - A Fully Featured TypeScript Parser Generator
As part of my final year project, I needed to generate a parser in TypeScript. However I couldn’t find a parser generator that I thought would do the trick.
There are some great JavaScript parser generators, with TypeScript plugins, out in the world, but they didn’t quite generate the strictly typed ASTs that I wanted.
So I decided to make my own, namely tsPEG.
Making my own
Unsurprisingly, rolling my own parser generator turned out to be no easy task. I took a large part of my approach from Guido Van Rossum’s fantastic series of blog posts about PEG Parsers.
Bootstrapping and Self-Generation
The approach I took was to first hand write a small parser that could read a simple grammar, and generate a very basic parser from them, in the same fashion. Once the parser was powerful enough, it could generate it’s own parser. This self-generation admits a powerful bootstrapping process to add the new features we want to the language.
Each subsequent version of the parser then added more and more complex features. Starting with regex based lexing, then Kleene closure operators, optionals, and then positive and negative lookahead operators.
The Generator
tspeg
takes in a grammar described in a simple familiar syntax, here is an example syntax for simple arithmetic expression.
SUM := head=FAC tail={ op='\+|-' sm=FAC }*
FAC := head=ATOM tail={ op='\*|/' sm=ATOM }*
ATOM := val=INT
| '\(' val=SUM '\)'
INT := val='[0-9]+'
tspeg
takes a syntax description like this and turns it into a well typed parser that can take in any arithmetic expression and parse it, with correct precedence and associativity.
Syntax
The meta-grammar syntax is self-described as follows.
GRAM := RULEDEF+
RULEDEF := _ name=NAME _ ':=' _ rule=RULE _
RULE := head=ALT tail={_ '\|' _ alt=ALT }*
ALT := MATCHSPEC+
MATCHSPEC := _ named={name=NAME '='}? rule=POSTOP _
POSTOP := pre=PREOP op='\+|\*|\?'?
PREOP := op='\&|!'? at=ATOM
ATOM := name=NAME !'\s*:='
| match=STRLIT
| '{' _ sub=RULE _ '}'
NAME := '[a-zA-Z_]+'
STRLIT := '\'' val='([^\'\\]|(\\.))*' '\''
_ := '\s*'
- Grammar File
- A Grammar file consists of a sequence of rule definitions
- Rule
- A rule definition consists of a name followed by a rule description
- Rule description
- A rule description is given in the familiar format, a sequence of expressions, seperated by ‘|’s
- Expression
- An expression is a space separated sequence of match descriptions.
- Match Descriptions
- Match descriptions are either named or unnamed, named matches are preceded by a
<name>=
- Matches
- Matches can be just regex literals, e.g. ‘[a-z]+’, subrules
{ <rule> }
, or a reference to another named rule. They can also be modified by prefix and postfix operators.
Operators
+ postfix |
The + operator as a postfix operator modifies the match to greedily match one or more instances of the match |
* postfix |
The * operator acts the same as + , but matching zero or more instances |
? postfix |
The ? operator marks the preceding match as optional, failing to match an optional match will not cause the rule expression to fail |
& prefix |
The & operator causes the following match to not consume the input, even if the match is successful |
! prefix |
The ! operator causes the following match to succeed if and only if the match fails |