Longest ordered subsequence

If you are into string algorithms (and typical bioinformatics applications) you will love these two posts:

Yeah, I know it doesn’t look like bioinformatics at first sight, but it is very similar to the kind of string algorithms we use there.

It is great posts with a nice mix of coding and algorithmcs.

72-90=-18

A string algorithms challenge

I am currently teaching a class on string algorithms, and while we have a few mandatory projects in the class, I wanted to give the students a bit of a challenge on an optional project.  A challenge where they can use everything they have learned so far, and compete among themselves to see who can solve a problem most efficiently.

The challenge I give them is described below.

While I will not offer everyone reading this to participate on the challenge (the Beer Prize is for my students), I would love to hear any ideas you have.  If you take up the challenge anyway, I would also love to compare your solution to my students’.  If you beat them, there might be a beer in it after all, if I ever meet you.

String algorithms challenge: Finding k-mers in strings

Finding motives in strings

A motif in a string, is a substring that occurs more frequently than would be expected by chance.  In bioinformatics we are interested in such motives because anything that doesn’t look random potentially has some functional importance.

Finding motives just from strings, using no further biological knowledge, is a string algorithmic computational problem.

The main challenge here is dealing with the typically very large sequences involved (ranging from millions to billions of characters).

For the challenge for our course, we will look at a simple approach to the problem: finding all $$k$$-mers (substrings of length $$k$$) occurring above a certain frequency in a string.

Finding high frequency k-mers

The computational challenge — should you choose to accept it — is as follows:

Given an input string x, an integer $$k$$ and a frequency $$f$$, report all $$k$$-mers that occur with frequency higher than $$f$$.

Expect the length of x to be from a few hundred thousands to a few millions and $$k$$ to be between 5 and 20.

Solving the problem

To participate in the challenge you should device and implement an algorithm for the problem.

You should implement the algorithm in a tool that takes three arguments: a file containing the string x and the numbers $$k$$ and $$f$$:

The program should output each $$k$$-mer occurring at frequency higher than $$f$$, and sorted with the highest frequency $$k$$-mers first.

Email the program — or a path to the executable — to me together with a short description of your algorithm.

The competition is only based on speed.  As long as you output the right $$k$$-mers, all that matters it the wall time of your program, and any trick you can think up can be used.

Approaches

I haven’t solved the problem myself yet, but a few approaches springs to mind.

But remember, anything goes, so you are not limited to any of these!

A table strategy

One approach is of course to scan along x and insert all substrings of length $$k$$ into a table to keep count of the occurrences.

The choice of table could be important for the running time, though.

For a hashing strategy, we can build hash keys “online” by updating the hash key for the string x[$$i$$..$$i+k$$] to a hash key for the string x[$$i+1$$..$$i+1+k$$] in constant time, but a major problem with the hashing strategy is that we probably see way too many $$k$$-mers to get the constant time lookup unless we use much more memory than we have.

We can of course also store the $$k$$-mers in a search tree (either directly or using a hash key).  In that case, consider using splay trees, since these will automatically adjust themselves to the frequency with which we look up keys.

A suffix tree or suffix array approach

If you build a suffix tree for x then you can very efficiently calculate the frequency of all $$k$$-mers.  The memory usage might be a problem, though.
A suffix array might solve that problem.

Compressed tries

If memory is a problem, you can also reduce the suffix tree to just the sub-strings of length $$k$$.

The straightforward approach would take time $$O(k\cdot{}|\mathbf{x}|)$$, but it should be possible to modify one of the suffix tree algorithms to get a better running time.

The prize

Besides the admiration of your peers, there is also a proper prize for the fastest solution.  A beer of your choice at the Friday Bar after the exam on the 20th.

Enjoy!

61-80=-19

First day of the new teaching term…

Today is the first day of this years first teaching term.  We have four a year, of seven weeks, with exams in between.

This term I’m teaching two classes:

Both are classes optional classes for the students, and developed by me together with my co-lecturers (Christian N.S. Pedersen for string algorithms and Carsten Wiuf for systems biology).  That means we have a lot of freedom in how we run the class and what we cover.

String algorithms

String algorithms we have taught three times before, so I expect it to be very little work.  I have lecture notes and slides from earlier, and we don’t plan to make any major changes this year.  We cover the most basics in string algorithms, so even if the text book is getting a bit dated, we don’t have to change the class much.  The basics in string algorithms is essentially just suffix trees (which is the most important data structure to know in this field) and a few classical string search algorithms.

Next year we are thinking about having two string algorithms classes; the basic one and a second one covering more recent topics, e.g. genome assembly for next generation sequencing data.  That would be a bit more work, but I’m planning on running a journal club this year to read up on it.

Systems biology

Systems biology is more of a problem.  We have had the class once before, but didn’t particularly like how we ran it, so we need to come up with a new strategy for this year.  We are meeting this afternoon to discuss it.  The first lecture is on Friday.

The problem is that it is such a large field, so picking some “basic” topics is pretty hard.

26-43=-17

Approximate pattern matching

Tomorrow I’m teaching string algorithms covering approximate pattern matching and the Wu-Manber algorithm.

I’m actually also teaching genome analysis but Mikkel is giving the lecture tomorrow, so I don’t have to worry about that.

The good thing about string algorithms is that I have taught it several times before, so there is very little preparation time. I probably ought to spend some more time for it this time, ’cause I don’t particularly find approximate pattern matching that intersting in this class (it is more interesting in algorithms in bioinformatics, the class that Storm is teaching) so I wanted to replace it with something else this year, but didn’t find the time.

Just for the fun of it, I’ve started using Slideshare to publish my presentations. I also put the slides on the course homepage, of course, but with Slideshare I can put the presentations directly on the web like this:

 

 

Now isn’t that cool?

Whether the slides make sense without someone presenting them, I don’t know. In some sense I hope not, because then I am really wasting my time giving the lectures…

Teaching a course the nth time is so much easier than teaching it the first time

I’ve just finished preparing slides for my lecture in string algorithms tomorrow.It took about 15 minutes, and that was spend on reformatting the slides to the new version of OpenOffice.org Impress (for some reason the text always jumps about a bit when the version of OOo changes).In comparison I spent all Sunday preparing for my lectures in genome analysis, a class I am teaching for the first time this term. Having a clear idea about what to cover in the lecture, and of course having old slides to pick from when preparing the presentation, really speeds up the preparation.I guess one of the reasons I have managed so little research this year is that I have taken on three completely new courses (in addition to two old ones) to teach. Next year I only have one new class to teach and four that I have already taught before, so that isn’t so bad.