Writing a simple recursive descent parser in Python

Recursive descent parsing is a simple, powerful and expressive way to write parsers quickly. It is not the fastest or most efficient parsing method, but its advantages are ease of writing and ease of understanding.

I’m going to quickly cover the basics of writing a simple recursive descent parser in Python. The example language I will be parsing is boolean formulas, e.g. \((A \land B) \implies B\).

Parsing boolean formulas will allow us to write programs to evaluate our expressions, generate truth tables, convert to normal forms etc.

The syntax we will be using is C-style syntax, e.g. & for boolean and, ! for negation and | for or. We will also use -> to represent implication.

An example expression in our syntax is A -> (B -> ((B & A) & (B | !A))).

Defining our grammar

The first step in creating a recursive descent parser is defining your syntax in terms of a grammar. We will define our syntax using a context-free grammar, written so that the parser never needs to look beyond the current token, which keeps things simple.

A context-free grammar defines our syntax recursively, and the way we structure that recursion determines our operator precedence rules.

As an example, here is a context-free grammar for a binary string:
BinaryString ::= Digit | Digit BinaryString
Digit ::= 0 | 1
You can read this as a binary string is either a single digit, or a digit followed by another binary string, where the digits are 0 or 1.
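
To make the link between a grammar and code concrete, here is a minimal sketch of a recognizer for the binary string grammar above, with one function per rule. The names is_digit and binary_string are my own, chosen for illustration, and the functions only answer yes or no rather than building a tree.

```python
def is_digit(s, i):
    # Digit ::= 0 | 1
    return i < len(s) and s[i] in "01"


def binary_string(s, i=0):
    # BinaryString ::= Digit | Digit BinaryString
    if not is_digit(s, i):
        return False
    if i + 1 == len(s):
        # First alternative: a single Digit ends the string.
        return True
    # Second alternative: a Digit followed by another BinaryString.
    return binary_string(s, i + 1)


print(binary_string("0110"))  # True
print(binary_string("01a0"))  # False
print(binary_string(""))      # False
```

Each function mirrors one rule, and the recursive call in binary_string corresponds to the BinaryString on the right-hand side of its own rule.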

I am going to assume familiarity with context-free grammars from here on. There are plenty of resources for learning more about them if you are unfamiliar.

So to make the grammar for our syntax we want to think about how boolean formulas are constructed.

First we define our variable literals. We want to use capital Latin characters to stand for variables, so let’s add a rule for that to our grammar.
Lit ::= [A-Z]

Our negation operator “binds tightest”, meaning it has the highest precedence, so we add a rule like so:
Neg ::= Lit | ! Lit
This means that a Neg is either a literal or a negated literal.

Next highest in precedence is the & symbol for conjunction, so we add a rule for it.
Conj ::= Neg | Neg & Conj
This means that a Conj term is either a Neg, or a Neg followed by an & and another Conj.

Continuing this pattern for disjunction and implication, we get the full grammar:
Impl ::= Disj | Disj -> Impl
Disj ::= Conj | Conj ‘|’ Disj
Conj ::= Neg | Neg & Conj
Neg ::= Lit | ! Lit
Lit ::= [A-Z]

This grammar covers every expression built from variables and our four operators, and it establishes operator precedence.
However, sometimes we want to add brackets to force our own precedence, so let’s tweak the literal rule to allow this.
Lit ::= [A-Z] | ( Impl )
So a literal is now a variable, or a whole new expression in brackets.

Note that each rule is right recursive, i.e. the rule is Conj ::= Neg | Neg & Conj, not Conj ::= Neg | Conj & Neg.
Both versions accept exactly the same strings (they differ only in how a chain such as A & B & C is grouped), but the reason we write our rules “right recursive” will become apparent when we start writing our parser.

Writing our parser

Lexer

Before we start our parser we should write a lexer.
A lexer’s job is to take the input string and break it into tokens that the parser understands.
For example, in a parser for Java, the lexer will take the sequence of characters [p,r,i,v,a,t,e] and output a PRIVATE token, to make the parser’s job easier.
In our case every token is a single character, so all we really need to do is ignore spaces. Once the input runs out, the lexer yields '\0' forever as an end-of-input marker.
We want to support a function to get the next token, and Python generators are perfect for this task.
Our lexer is then simply

def lexer(s):
    for c in s:
        if c==" ":
            continue
        yield c
    while True:
        yield '\0'
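
As a quick check of what the parser will see, this sketch pulls the first few tokens out of the lexer. The lexer is repeated so the snippet runs on its own, and the use of itertools.islice to cut off the endless stream of '\0' markers is my own choice for the demonstration.

```python
from itertools import islice


def lexer(s):
    # Same generator as above: skip spaces, then yield '\0' forever.
    for c in s:
        if c == " ":
            continue
        yield c
    while True:
        yield '\0'


# Take the first 6 tokens of "A -> B": four real tokens, then two markers.
tokens = list(islice(lexer("A -> B"), 6))
print(tokens)  # ['A', '-', '>', 'B', '\x00', '\x00']
```

Note that -> arrives as two separate tokens, '-' and '>', so it is the parser that has to stitch them back together.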

Syntax tree

We will need a class to represent the nodes of our syntax tree.
Each operation has at most 2 operands, so our tree is a binary tree. We define our AST nodes as follows:

class Node:
    def __init__(self, left, right, name):
        self.left=left
        self.right=right
        self.name=name

At this point we should also define a helper function to print our syntax tree

def pr(node):
    a = "("
    if node.left is not None:
        a += pr(node.left)
    a += " " + node.name + " "
    if node.right is not None:
        a += pr(node.right)
    a += ")"
    return a
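
To see the output format before we have a parser, here is a small sketch that builds the tree for A & !B by hand and prints it. Node and pr are repeated so the snippet runs on its own, and putting the operand of the unary "not" on the right matches what the parser does further down.

```python
class Node:
    def __init__(self, left, right, name):
        self.left = left
        self.right = right
        self.name = name


def pr(node):
    a = "("
    if node.left is not None:
        a += pr(node.left)
    a += " " + node.name + " "
    if node.right is not None:
        a += pr(node.right)
    a += ")"
    return a


# Hand-built tree for A & !B; leaves have no children.
a = Node(None, None, "A")
not_b = Node(None, Node(None, None, "B"), "not")
tree = Node(a, not_b, "and")
print(pr(tree))  # (( A ) and ( not ( B )))
```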

Parser

We start by defining a new class for our parser

class parser:
    def __init__(self, s):
        self.lex = lexer(s)
        self.current = next(self.lex)

This sets up our lexer and gets our first token.
Next we’re going to define some helper functions that will handle the consumption of tokens for us. First, an accept function:

class parser:
    ...
    def accept(self, c):
        if self.current == c:
            self.current = next(self.lex)
            return True
        return False

This function checks the current token: if it is the one we are looking for, it consumes it and returns True; otherwise it leaves the token in place and returns False.
We also define a very similar function, expect:

class parser:
    ...
    def expect(self, c):
        if self.current == c:
            self.current = next(self.lex)
            return True
        print("Unexpected character %s expected %s" % (self.current, c))
        return False

This operates almost exactly like accept, except it prints an error if we do not get what we expect

So far all we have written is standard boilerplate for a recursive descent parser. Now we start writing the functions for our syntax, and we will see how easy they are to write.
Let’s write our functions in the same order as we defined our grammar.

First our variable literals, we write

class parser:
    ...
    def Lit(self):
        l = self.current
        self.current = next(self.lex)
        if ord(l) < ord('A') or ord(l) > ord('Z'):
            print("Expected a capital letter")
            return None
        return Node(None, None, l)

Amazingly this will be our most complex function. All it does is read the next character and, if it’s a capital letter, return the AST node for it; otherwise it prints an error and returns None. For the moment we will ignore the parentheses part of our syntax and add it at the end.

Next let’s do our negation rule, Neg ::= Lit | ! Lit

class parser:
    ...
    def Neg(self):
        if self.accept('!'):
            l = self.Lit()
            if l is None:
                return None
            return Node(None, l, "not")
        return self.Lit()

I’m sure you can see how we are essentially just writing code that exactly matches our recursively defined rules.

This pattern continues with our Conj and Disj functions, which just match the rules
Disj ::= Conj | Conj ‘|’ Disj
Conj ::= Neg | Neg & Conj

class parser:
    ...
    def Disj(self):
        l = self.Conj()
        if l is None:
            # the left operand failed to parse
            return None
        if self.accept('|'):
            r = self.Disj()
            if r is None:
                return None
            return Node(l, r, "or")
        return l

    def Conj(self):
        l = self.Neg()
        if l is None:
            # the left operand failed to parse
            return None
        if self.accept('&'):
            r = self.Conj()
            if r is None:
                return None
            return Node(l, r, "and")
        return l

Again it’s apparent how these functions just match their respective recursive rules, with some boilerplate error checking.
At this point you can see why our grammar rules are right recursive: if the first thing a function did was call itself, before consuming any input, it would recurse forever.
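
To see the infinite recursion concretely, here is a small sketch that transcribes both versions of the Conj rule. Everything in it is simplified for illustration: the function names are my own, Neg is reduced to "any single token", and the functions return a position in a token list instead of building a tree.

```python
def left_recursive_conj(tokens, pos=0):
    # Conj ::= Neg | Conj & Neg, transcribed literally: the first step
    # is a call to itself at the same position, so nothing is consumed.
    pos = left_recursive_conj(tokens, pos)
    # Never reached: the '&' and the Neg would be consumed here.
    return pos + 2


def right_recursive_conj(tokens, pos=0):
    # Conj ::= Neg | Neg & Conj: one token is consumed before recursing,
    # so every recursive call works on a shorter remaining input.
    pos += 1  # consume the (simplified) Neg
    if pos < len(tokens) and tokens[pos] == "&":
        return right_recursive_conj(tokens, pos + 1)
    return pos


def terminates(parse_fn, tokens):
    # Report whether parse_fn returns before Python's recursion limit.
    try:
        parse_fn(tokens)
        return True
    except RecursionError:
        return False


print(terminates(right_recursive_conj, list("A&B")))  # True
print(terminates(left_recursive_conj, list("A&B")))   # False
```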

The only things left to do are to add our Impl rule, and to add the parentheses logic to Lit as follows

class parser:
    ...
    def Impl(self):
        l = self.Disj()
        if l is None:
            # the left operand failed to parse
            return None
        if self.accept('-'):
            if self.expect('>'):
                r = self.Impl()
                if r is not None:
                    return Node(l, r, "implies")
            return None
        return l

    def Lit(self):
        if self.accept('('):
            r = self.Impl()
            if self.expect(')'):
                return r
            return None
        ...

And with that, our parser is complete. Let’s add some logic to run the parser and give it a test.

if __name__ == "__main__":
    p = parser(input("Enter a boolean formula: "))
    tree = p.Impl()
    if tree is not None:
        print(pr(tree))
    else:
        print("Failed to parse")

Now let’s run our parser on a test string:

eoin$ python parser.py 
Enter a boolean formula: !A -> B | A  
(( not ( A )) implies (( B ) or ( A )))

As we can see, our parser parsed our expression correctly, with the correct operator precedence.
Success!
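
For reference, here is one way the pieces above might fit together in a single file. Treat it as a sketch rather than the definitive version: it assumes Python 3, the class is spelled Parser, and the parse helper, which rejects input left over after the formula, is an addition of mine that the walkthrough above does not include.

```python
def lexer(s):
    # Yield every non-space character, then the '\0' marker forever.
    for c in s:
        if c != " ":
            yield c
    while True:
        yield "\0"


class Node:
    def __init__(self, left, right, name):
        self.left = left
        self.right = right
        self.name = name


def pr(node):
    left = pr(node.left) if node.left is not None else ""
    right = pr(node.right) if node.right is not None else ""
    return "(" + left + " " + node.name + " " + right + ")"


class Parser:
    def __init__(self, s):
        self.lex = lexer(s)
        self.current = next(self.lex)

    def accept(self, c):
        if self.current == c:
            self.current = next(self.lex)
            return True
        return False

    def expect(self, c):
        if self.accept(c):
            return True
        print("Unexpected character %r, expected %r" % (self.current, c))
        return False

    def Lit(self):
        # Lit ::= [A-Z] | ( Impl )
        if self.accept("("):
            r = self.Impl()
            return r if self.expect(")") else None
        l = self.current
        self.current = next(self.lex)
        if not "A" <= l <= "Z":
            print("Expected a capital letter, got %r" % l)
            return None
        return Node(None, None, l)

    def Neg(self):
        # Neg ::= Lit | ! Lit
        if self.accept("!"):
            l = self.Lit()
            return None if l is None else Node(None, l, "not")
        return self.Lit()

    def Conj(self):
        # Conj ::= Neg | Neg & Conj
        l = self.Neg()
        if l is not None and self.accept("&"):
            r = self.Conj()
            return None if r is None else Node(l, r, "and")
        return l

    def Disj(self):
        # Disj ::= Conj | Conj '|' Disj
        l = self.Conj()
        if l is not None and self.accept("|"):
            r = self.Disj()
            return None if r is None else Node(l, r, "or")
        return l

    def Impl(self):
        # Impl ::= Disj | Disj -> Impl
        l = self.Disj()
        if l is not None and self.accept("-"):
            r = self.Impl() if self.expect(">") else None
            return None if r is None else Node(l, r, "implies")
        return l


def parse(s):
    # Hypothetical helper: also reject input left over after the formula.
    p = Parser(s)
    tree = p.Impl()
    if tree is not None and p.current != "\0":
        print("Unexpected trailing input %r" % p.current)
        return None
    return tree


print(pr(parse("!A -> B | A")))  # (( not ( A )) implies (( B ) or ( A )))
print(parse("A B"))              # None, after reporting the trailing 'B'
```

The trailing-input check matters because Impl on its own stops as soon as it has read one complete formula, so a string like A B would otherwise be accepted as just A.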

The finished parser can be downloaded here

Example application

We can then write a quick script to use our parser to evaluate boolean formulas like so

from parser import parser

valMap = {}
def ev(node):
    if node.name=="not":
        return not ev(node.right)
    if node.name=="and":
        return ev(node.left) and ev(node.right)
    if node.name=="or":
        return ev(node.left) or ev(node.right)
    if node.name=="implies":
        return not ev(node.left) or ev(node.right)
    # anything else is a variable: ask for its value once and remember it
    if node.name not in valMap:
        valMap[node.name] = input("Truth value for %s [T/F]: " % node.name) == 'T'
    return valMap[node.name]

p = parser(input("Enter a boolean formula: ")).Impl()
if p is not None:
    print(ev(p))

This works as expected:

eoin$ python evaluator.py 
Enter a boolean formula: A -> (B & C)
Truth value for A [T/F]: T
Truth value for B [T/F]: T
Truth value for C [T/F]: T
True
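
Generating truth tables, mentioned at the start, is a small step from here. The sketch below is one possible approach, not part of the parser above: the function names variables and truth_table are my own, ev takes an extra env dictionary instead of prompting for values, and the tree for A -> (B & C) is built by hand so the snippet runs without the parser.

```python
from itertools import product


class Node:
    def __init__(self, left, right, name):
        self.left = left
        self.right = right
        self.name = name


def variables(node):
    # Collect the variable names found at the leaves of the tree.
    if node is None:
        return set()
    if node.left is None and node.right is None:
        return {node.name}
    return variables(node.left) | variables(node.right)


def ev(node, env):
    # Same cases as the evaluator above, but values come from env.
    if node.name == "not":
        return not ev(node.right, env)
    if node.name == "and":
        return ev(node.left, env) and ev(node.right, env)
    if node.name == "or":
        return ev(node.left, env) or ev(node.right, env)
    if node.name == "implies":
        return not ev(node.left, env) or ev(node.right, env)
    return env[node.name]


def truth_table(node):
    # Evaluate the formula under every assignment of its variables.
    names = sorted(variables(node))
    rows = []
    for values in product([True, False], repeat=len(names)):
        env = dict(zip(names, values))
        rows.append((values, ev(node, env)))
    return names, rows


def leaf(name):
    return Node(None, None, name)


# Hand-built tree for A -> (B & C), the formula evaluated above.
tree = Node(leaf("A"), Node(leaf("B"), leaf("C"), "and"), "implies")

names, rows = truth_table(tree)
print(" ".join(names), "| result")
for values, result in rows:
    cells = " ".join("T" if v else "F" for v in values)
    print(cells, "|", "T" if result else "F")
```

With the parser available, the hand-built tree can be replaced by the result of parsing a formula string.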