A string algorithms challenge
Monday, March 2nd, 2009I am currently teaching a class on string algorithms, and while we have a few mandatory projects in the class, I wanted to give the students a bit of a challenge on an optional project. A challenge where they can use everything they have learned so far, and compete among themselves to see who can solve a problem most efficiently.
The challenge I give them is described below.
While I will not offer everyone reading this to participate on the challenge (the Beer Prize is for my students), I would love to hear any ideas you have. If you take up the challenge anyway, I would also love to compare your solution to my students’. If you beat them, there might be a beer in it after all, if I ever meet you.
String algorithms challenge: Finding k-mers in strings
Finding motives in strings
A motif in a string, is a substring that occurs more frequently than would be expected by chance. In bioinformatics we are interested in such motives because anything that doesn’t look random potentially has some functional importance.
Finding motives just from strings, using no further biological knowledge, is a string algorithmic computational problem.
The main challenge here is dealing with the typically very large sequences involved (ranging from millions to billions of characters).
For the challenge for our course, we will look at a simple approach to the problem: finding all
-mers (substrings of length
) occurring above a certain frequency in a string.
Finding high frequency k-mers
The computational challenge — should you choose to accept it — is as follows:
Given an input string x, an integer
and a frequency
, report all
-mers that occur with frequency higher than
.
Expect the length of x to be from a few hundred thousands to a few millions and
to be between 5 and 20.
Solving the problem
To participate in the challenge you should device and implement an algorithm for the problem.
You should implement the algorithm in a tool that takes three arguments: a file containing the string x and the numbers
and
:
$ program filename k f
The program should output each
-mer occurring at frequency higher than
, and sorted with the highest frequency
-mers first.
Email the program — or a path to the executable — to me together with a short description of your algorithm.
The competition is only based on speed. As long as you output the right
-mers, all that matters it the wall time of your program, and any trick you can think up can be used.
Approaches
I haven’t solved the problem myself yet, but a few approaches springs to mind.
But remember, anything goes, so you are not limited to any of these!
A table strategy
One approach is of course to scan along x and insert all substrings of length
into a table to keep count of the occurrences.
The choice of table could be important for the running time, though.
For a hashing strategy, we can build hash keys “online” by updating the hash key for the string x[
..
] to a hash key for the string x[
..
] in constant time, but a major problem with the hashing strategy is that we probably see way too many
-mers to get the constant time lookup unless we use much more memory than we have.
We can of course also store the
-mers in a search tree (either directly or using a hash key). In that case, consider using splay trees, since these will automatically adjust themselves to the frequency with which we look up keys.
A suffix tree or suffix array approach
If you build a suffix tree for x then you can very efficiently calculate the frequency of all
-mers. The memory usage might be a problem, though.
A suffix array might solve that problem.
Compressed tries
If memory is a problem, you can also reduce the suffix tree to just the sub-strings of length
.
The straightforward approach would take time
, but it should be possible to modify one of the suffix tree algorithms to get a better running time.
The prize
Besides the admiration of your peers, there is also a proper prize for the fastest solution. A beer of your choice at the Friday Bar after the exam on the 20th.
Enjoy!
–
61-80=-19