Archive for January 19th, 2009

This might be worth trying...

Monday, January 19th, 2009

If you've read my earlier posts under the "teaching" category, you might know that I have had my problems activating my students.

Not the class I taught last term, but pretty much every single other class.

Maybe "clickers" is a way to go?

I'm not sure how to implement this in practise, though. I don't have the skills to wire up the electronics for it...

--

19-33=-14

Yet another Newick parser

Monday, January 19th, 2009

Whenever I try out a new framework for writing parsers, I always write a Newick parser.  It is the "Hello, World!" of parsers.

A little while back, I wrote a Newick parser for C++ using Boost.Spirit, and last month I needed to update my Python parser for Newick, so I tried out the Toy Parser Generator framework.

Some students had a problem with my old parser.  It wouldn't accept un-quoted labels with names starting with a number, such as 12b_foobar.  Those labels are, strictly speaking, perfectly valid in the Newick format, but my parser would break the label into two tokens, a number, 12, and a label b_foobar, and there were no easy way to fix that the way the parser is designed.

So I had an opportunity to try out one of the many Python parser frameworks.

I picked the Toy Parser Generator becaused it looked to be very easy to use, and it was.

You write a parser by writing a class that inherits from tpg.Parser, and you then write your tokenizer and parser as the documentation string of this class.  So, to ignore white space and consider every string not containing "^", ",", ":", ";", "(" and ")" a label -- as per the Newick standard -- you write the tokenizer like this:

class Parser(tpg.Parser):
r"""
separator space	'\s+' ;
token label		'[^,:;() \t\n]+' ;
...
"""

Similarly, you write the parser in the documentation string using essentially a BNF grammar, but annotated with the actions you want the parser to take when it has parsed a given part of the input.

For example, if we say that a Newick tree consists of a "sub-tree" followed by ";" where a "sub-tree" can be either a leaf or an inner node, we write:

class Parser(tpg.Parser):
r"""
...

START/tree ->
Subtree/tree ';'
;
Subtree/tree ->
Internal/tree
| Leaf/tree
;

..."""

Here START is the root of the grammar -- and will be the result of the parser call -- while Subtree, Internal and Leaf are non-terminals in the grammar.

The name after the slash in START/tree, Subtree/tree, Internal/tree and Leaf/tree is an identifier used for the actions.

The name after the non-terminal before the "->" will be the result of parsing the sub-expression, so START will result in tree and so will Subtree, where tree is whatever the Python variable tree contains.

On the right hand of the "->" the names after the (non-)terminals will be the result of parsing a sub-expression, so for the START rule, START will return tree that is whatever the Subtree sub-expression returns.

For the Subtree rule, the parser will again return tree, but tree is either the result of parsing an Internal rule or parsing a Leaf rule; the result of both will be named tree (because of the Internal/tree and Leaf/tree).

For a more interesting example, we can look at the way we handle nodes in the tree:

class Parser(tpg.Parser):
r"""
...
Leaf/leaf ->
Name/name                        $leaf = Leaf(name) ; Internal/tree -> '' Name/n$ tree = Internal(n,st)
;
..."""

A Leaf is just a Name, where Name is a rule I've defined elsewhere. A Name will return a value, name, (it will be a string, but that is only because Name is defined that way).  I translate that value into a Leaf using the expression after the $where I set the value for "leaf" which is what the Leaf rule will return. Internal will return a tree -- named "tree" -- and that consists of a list of branches in parenthesis followed by a name (the name of the inner node -- I have actually never seen Newick trees with named inner nodes, but it is in the standard). The list of branches is parsed using the BranchList rule whos result is stored in the "st" variable (st for sub-trees). The Name is stored in the "n" variable, and the result of Internal, "tree" is set in the expression after$.

All in all, it is a very easy way to write a parser.  Easier even that Boost.Spirit -- that is also a very nice framework -- and much easier than Yacc/Lex.

It doesn't help me at all with the problem I had with labels starting with numbers, though.

If you first tokenize and then parse, there is no easy way around that.  The string 12b_foobar really is two tokens, a number 12 and a string b_foobar.  Maybe with a framework that would go for the longest legal token first or something like that would help, but that isn't the case with this framework and I am not sure what other consequences that would have...

Anyway, I solved the problem by not solving it in the framework at all.  I just consider both labels and numbers as "labels" and handle the "labels" that should really be numbers -- the branch lengths -- in the expressions in the parser:

class Parser(tpg.Parser):
r"""
...
Name/name ->
label/name
| Empty               $name = "" ; Length/length -> ':' Number/length | Empty$ length = 0
;

Number/n ->
label/l	            $n = float(l) ; ...""" The complete code for the parser can be downloaded here. -- 19-32=-13 There and back again Monday, January 19th, 2009 We came back from Beijing yesterday evening -- our luggage this afternoon -- and I am still pretty tired after the trip. I had to leave the office early afternoon, when my brain just shut down. So don't expect any quality posts from me for a day or two. In the mean time, I'll just refer you to this list of predictions for 2009. Most I agree with, with the possible exception of this one: We will not see a retail complete genome sequence offered for less than$1000. I'd be happy to be proven wrong here, mind you, but I just can't see prices tumbling this far over the next twelve months - even with the huge competition and rapid technological advances in the DNA sequencing sector. Of course, it depends what you mean by "complete" - it will no doubt be possible to offer a fragmentary, low-coverage genome at this price by the end of the year, but such a product would be almost worse than no information at all. Alternatively, cut-price genome sequences may be offered by companies at a loss, to attract attention and create a more sustainable long-term market.

It sounds very reasonable, but I've been pretty pessimistic in my predictions about genotyping and sequencing technology in the past, so I will choose to be optimistic for 2009 and predict that we will see a \$1000 genome.

--

19-31=-12