Writing a simple recursive descent parser in Python
Recursive descent parsing is a simple, powerful and expressive way to quickly and effectively create parsers. While not the fastest or most efficient method for parsing, it finds it advantages in ease of writing, and ease of understanding.
I’m going to quickly cover the basics of writing a simple recursive descent parser in Python. The example language I will be parsing is boolean formulas, e.g. \((A \land B) \implies B\).
Parsing boolean formulas will allow us to write programs to evaluate our expressions, generate truth tables, convert to normal forms etc.
The syntax we will be using is C-style syntax, e.g. & for boolean and, ! for negation and | for or. We will also use -> to represent implication.
An example expression in our syntax is A -> (B -> ((B & A) & (B | !A))).
Defining our grammar
The first step in creating a recursive descent parser is defining your syntax in terms of a grammar. We will define our syntax using a context free grammar. We will define it in a way that we don’t need any lookahead to keep things simple.
A context free grammar can be thought of as defining our grammar recursively, and the way we define our recursion defines our operator precedence rules.
As an example, here is a context-free grammar for a binary string
BinaryString ::= Digit | Digit BinaryString
Digit ::= 0 | 1
You can read this as a binary string is either a single digit, or a digit followed by another binary string, where the digits are 0 or 1.
I am going to assume a familiarity with context-free grammars from here, there’s a lot of resources to learn more about them if you are unfamiliar.
So to make the grammar for our syntax we want to think about how boolean forumlas are constructed.
First we define our variable literals, we want to use capital Latin characters to stand for variables, so lets add a rule like that to our grammar. Lit ::= [A-Z
Our negation operator “binds tightest” meaning it has the highest precedence, so we add a rule like so
Neg ::= Lit | ! Lit
Which means the Neg rule is either a literal, or a negated literal
Next highest precedence is our & symbol for conjunctions. We add a rule to add these.
Conj ::= Neg | Neg & Conj
This means that a Conj term is either a Neg, or a Neg followed by an & and another Conj.
Continuing this pattern we get
Impl ::= Disj | Disj -> Impl
Disj ::= Conj | Conj ‘|’ Disj
Conj ::= Neg | Neg & Conj
Neg ::= Lit | ! Lit
Lit ::= [A-Z]
This grammar covers any expression and establishes operator precedence.
However sometimes we want to add brackets to force our own operator precedence, so let’s tweak the literal rule to allow this.
Lit ::= [A-Z] | ( Impl )
So a literal is now a variable, or a whole new expression in brackets.
Note that each rule is right recursive, the reason for this will become apparent when we write our parsing functions
i.e. The rule is Conj ::= Neg | Neg & Conj, not Conj ::= Neg | Conj & Neg.
Either way of writing this technically defines the same grammar, but it will become apparent why we write our rules “right recursive” when we start writing our parser.
Writing our parser
Lexer
Before we start our parser we should write a lexer.
A lexers job is to take the input string and break it into tokens that the parser understands.
For example in a parser for Java, the lexer will take the sequence of characters [p,r,i,v,a,t,e] and output a PRIVATE token, to make the parsers job easier.
In our case all we really need to do is ignore whitespace.
We want to support a function to get the next token, Python generators are perfect for this task.
Our lexer is then simply
def lexer(s):
for c in s:
if c==" ":
continue
yield c
while True:
yield '\0'
Syntax tree
We will need a class to represent the nodes of our Syntax Tree.
Each operation has at most 2 parameters, so our tree is a binary tree. we define our AST nodes as follows then
class Node:
def __init__(self, left, right, name):
self.left=left
self.right=right
self.name=name
At this point we should also define a helper function to print our syntax tree
def pr(node):
a = "("
if node.left!=None:
a += pr(node.left)
a += " " + node.name + " "
if node.right != None:
a += pr(node.right)
a+=')'
return a
Parser
We start by defining a new class for our parser
class parser:
def __init__(self, s):
self.lex = lexer(s)
self.current = next(self.lex)
This sets up our lexer and gets our first token
Next we’re going to define some helper functions, that will handle the consumption of tokens for us
First an accept function
class parser:
...
def accept(self, c):
if self.current == c:
self.current = next(self.lex)
return True
return False
This function will consume a token and return True if it is the token we are looking for, otherwise it will return False.
We also define a very similar function expect
class parser:
...
def expect(self, c):
if self.current == c:
self.current = next(self.lex)
return True
print "Unexpected character", self.current, "expected", c
return False
This operates almost exactly like accept, except it prints an error if we do not get what we expect
So far all we have written is standard boilerplate for a recursive descent parser, now we start writing the functions for our syntax and we see how easy they are to write.
Lets write our functions in the same order as we defined our grammar.
First our variable literals, we write
class parser:
...
def Lit(self):
l = self.current
self.current = next(self.lex)
if ord(l) < ord('A') or ord(l) > ord('Z'):
print "Expected a capital letter"
return None
return Node(None, None, l)
Amazingly this will be our most complex function. All it does is read the next character, if it’s a capital letter, return the AST node for that, otherwise throw an error. At the moment we will ignore the parentheses part of our syntax, and add it at the end
Next lets do our negation rule. Neg ::= Lit | ! Lit
class parser:
...
def Neg(self):
if self.accept('!'):
l = self.Lit()
if l == None:
return None
return Node(None, l, "not")
return self.Lit()
I’m sure you can see how we are essentially just writing code that exactly matches our recursively defined rules.
This pattern continues with our Conj and Disj functions, which just match the rules
Disj ::= Conj | Conj ‘|’ Disj
Conj ::= Neg | Neg & Conj
class parser:
...
def Disj(self):
l = self.Conj()
if self.accept('|'):
r = self.Disj()
if r == None:
return None
return Node(l, r, "or")
return l
def Conj(self):
l = self.Neg()
if self.accept('&'):
r = self.Conj()
if r == None:
return None
return Node(l, r, "and")
return l
Again it’s apparent how these functions just match their respective recursive rules with some boilerplate error checking.
At this point you can see why our grammar rules are right recursive, as if the first call of our function was a call to itself it was recurse forever.
The only things left to do are to add our Impl rule, and to add the parentheses logic to Lit as follows
class parser:
...
def Impl(self):
l = self.Disj()
if self.accept('-'):
if self.expect('>'):
r = self.Impl()
if r != None:
return Node(l, r, "implies")
return None
return l
def Lit(self):
if self.accept('('):
r = self.Impl()
if self.expect(')'):
return r
return None
...
And with that, our parser is complete. Lets add some logic to run the parser and give it a test
if __name__ == "__main__":
p = parser(raw_input())
tree = p.Impl()
if tree != None:
print pr(tree)
else:
print "Failed to parse"
Now lets run our parser on a test string
eoin$ python parser.py
Enter a boolean formula: !A -> B | A
(( not ( A )) implies (( B ) or ( A )))
As we can see, our parser parsed our expression correctly, with the correct operator precedence.
Success!
The finished parser can be downloaded here
Example application
We can then write a quick script to use our parser to evaluate boolean formulas like so
from parser import parser
valMap = {}
def ev(node):
if node.name=="not":
return not ev(node.right)
if node.name=="and":
return ev(node.left) and ev(node.right)
if node.name=="or":
return ev(node.left) or ev(node.right)
if node.name=="implies":
return not ev(node.left) or ev(node.right)
if node.name not in valMap:
valMap[node.name] = raw_input("Truth value for %s [T/F]: " % node.name)=='T'
return valMap[node.name]
p = parser(raw_input("Enter a boolean formula: ")).Impl()
if p != None:
print ev(p)
Which works perfectly
eoin$ python evaluator.py
Enter a boolean formula: A -> (B & C)
Truth value for A [T/F]: T
Truth value for B [T/F]: T
Truth value for C [T/F]: T
True