I am currently teaching a class on string algorithms, and while we have a few mandatory projects in the class, I wanted to give the students a bit of a challenge on an optional project. A challenge where they can use everything they have learned so far, and compete among themselves to see who can solve a problem most efficiently.

The challenge I give them is described below.

While I will not offer everyone reading this to participate on the challenge (the Beer Prize is for my students), I *would* love to hear any ideas you have. If you take up the challenge anyway, I would also love to compare your solution to my students’. If you beat them, there *might* be a beer in it after all, if I ever meet you.

## String algorithms challenge: Finding k-mers in strings

### Finding motives in strings

A *motif* in a string, is a substring that occurs more frequently than would be expected by chance. In bioinformatics we are interested in such motives because anything that doesn’t look random potentially has some functional importance.

Finding motives just from strings, using no further biological knowledge, is a string algorithmic computational problem.

The main challenge here is dealing with the typically very large sequences involved (ranging from millions to billions of characters).

For the challenge for our course, we will look at a simple approach to the problem: finding all $$k$$-mers (substrings of length $$k$$) occurring above a certain frequency in a string.

### Finding high frequency k-mers

The computational challenge — should you choose to accept it — is as follows:

Given an input string **x**, an integer $$k$$ and a frequency $$f$$, report all $$k$$-mers that occur with frequency higher than $$f$$.

Expect the length of **x** to be from a few hundred thousands to a few millions and $$k$$ to be between 5 and 20.

### Solving the problem

To participate in the challenge you should device and implement an algorithm for the problem.

You should implement the algorithm in a tool that takes three arguments: a file containing the string **x** and the numbers $$k$$ and $$f$$:

The program should output each $$k$$-mer occurring at frequency higher than $$f$$, and sorted with the highest frequency $$k$$-mers first.

Email the program — or a path to the executable — to me together with a short description of your algorithm.

The competition is only based on speed. As long as you output the right $$k$$-mers, all that matters it the wall time of your program, and any trick you can think up can be used.

### Approaches

I haven’t solved the problem myself yet, but a few approaches springs to mind.

But remember, anything goes, so you are not limited to any of these!

#### A table strategy

One approach is of course to scan along **x** and insert all substrings of length $$k$$ into a table to keep count of the occurrences.

The choice of table could be important for the running time, though.

For a hashing strategy, we can build hash keys “online” by updating the hash key for the string **x**[$$i$$..$$i+k$$] to a hash key for the string **x**[$$i+1$$..$$i+1+k$$] in constant time, but a major problem with the hashing strategy is that we probably see way too many $$k$$-mers to get the constant time lookup unless we use much more memory than we have.

We can of course also store the $$k$$-mers in a search tree (either directly or using a hash key). In that case, consider using splay trees, since these will automatically adjust themselves to the frequency with which we look up keys.

#### A suffix tree or suffix array approach

If you build a suffix tree for **x** then you can very efficiently calculate the frequency of all $$k$$-mers. The memory usage might be a problem, though.

A suffix array might solve *that* problem.

#### Compressed tries

If memory is a problem, you can also reduce the suffix tree to just the sub-strings of length $$k$$.

The straightforward approach would take time $$O(k\cdot{}|\mathbf{x}|)$$, but it should be possible to modify one of the suffix tree algorithms to get a better running time.

### The prize

Besides the admiration of your peers, there is also a proper prize for the fastest solution. A beer of your choice at the Friday Bar after the exam on the 20th.

Enjoy!

—

61-80=-19