Whenever I try out a new framework for writing parsers, I always write a Newick parser. It is the “Hello, World!” of parsers.
A little while back, I wrote a Newick parser for C++ using Boost.Spirit, and last month I needed to update my Python parser for Newick, so I tried out the Toy Parser Generator framework.
Some students had a problem with my old parser. It wouldn’t accept un-quoted labels with names starting with a number, such as 12b_foobar. Those labels are, strictly speaking, perfectly valid in the Newick format, but my parser would break the label into two tokens, a number, 12, and a label b_foobar, and there were no easy way to fix that the way the parser is designed.
So I had an opportunity to try out one of the many Python parser frameworks.
I picked the Toy Parser Generator becaused it looked to be very easy to use, and it was.
You write a parser by writing a class that inherits from tpg.Parser, and you then write your tokenizer and parser as the documentation string of this class. So, to ignore white space and consider every string not containing “^”, “,”, “:”, “;”, “(” and “)” a label — as per the Newick standard — you write the tokenizer like this:
class Parser(tpg.Parser):
r"""
separator space '\s+' ;
token label '[^,:;() \t\n]+' ;
...
"""
Similarly, you write the parser in the documentation string using essentially a BNF grammar, but annotated with the actions you want the parser to take when it has parsed a given part of the input.
For example, if we say that a Newick tree consists of a “sub-tree” followed by “;” where a “sub-tree” can be either a leaf or an inner node, we write:
class Parser(tpg.Parser):
r"""
...
START/tree ->
Subtree/tree ';'
;
Subtree/tree ->
Internal/tree
| Leaf/tree
;
..."""
Here START is the root of the grammar — and will be the result of the parser call — while Subtree, Internal and Leaf are non-terminals in the grammar.
The name after the slash in START/tree, Subtree/tree, Internal/tree and Leaf/tree is an identifier used for the actions.
The name after the non-terminal before the “->” will be the result of parsing the sub-expression, so START will result in tree and so will Subtree, where tree is whatever the Python variable tree contains.
On the right hand of the “->” the names after the (non-)terminals will be the result of parsing a sub-expression, so for the START rule, START will return tree that is whatever the Subtree sub-expression returns.
For the Subtree rule, the parser will again return tree, but tree is either the result of parsing an Internal rule or parsing a Leaf rule; the result of both will be named tree (because of the Internal/tree and Leaf/tree).
For a more interesting example, we can look at the way we handle nodes in the tree:
class Parser(tpg.Parser):
r"""
...
Leaf/leaf ->
Name/name $ leaf = Leaf(name)
;
Internal/tree ->
'\(' BranchList/st '\)' Name/n $ tree = Internal(n,st)
;
..."""
A Leaf is just a Name, where Name is a rule I’ve defined elsewhere. A Name will return a value, name, (it will be a string, but that is only because Name is defined that way). I translate that value into a Leaf using the expression after the $ where I set the value for “leaf” which is what the Leaf rule will return.
Internal will return a tree — named “tree” — and that consists of a list of branches in parenthesis followed by a name (the name of the inner node — I have actually never seen Newick trees with named inner nodes, but it is in the standard). The list of branches is parsed using the BranchList rule whos result is stored in the “st” variable (st for sub-trees). The Name is stored in the “n” variable, and the result of Internal, “tree” is set in the expression after $.
All in all, it is a very easy way to write a parser. Easier even that Boost.Spirit — that is also a very nice framework — and much easier than Yacc/Lex.
It doesn’t help me at all with the problem I had with labels starting with numbers, though.
If you first tokenize and then parse, there is no easy way around that. The string 12b_foobar really is two tokens, a number 12 and a string b_foobar. Maybe with a framework that would go for the longest legal token first or something like that would help, but that isn’t the case with this framework and I am not sure what other consequences that would have…
Anyway, I solved the problem by not solving it in the framework at all. I just consider both labels and numbers as “labels” and handle the “labels” that should really be numbers — the branch lengths — in the expressions in the parser:
class Parser(tpg.Parser):
r"""
...
Name/name ->
label/name
| Empty $ name = ""
;
Length/length ->
':' Number/length
| Empty $ length = 0
;
Number/n ->
label/l $ n = float(l)
;
..."""
The complete code for the parser can be downloaded here.
–
19-32=-13