A string algorithms challenge
I am currently teaching a class on string algorithms, and while we have a few mandatory projects in the class, I wanted to give the students a bit of a challenge on an optional project. A challenge where they can use everything they have learned so far, and compete among themselves to see who can solve a problem most efficiently.
The challenge I give them is described below.
While I will not offer everyone reading this to participate on the challenge (the Beer Prize is for my students), I would love to hear any ideas you have. If you take up the challenge anyway, I would also love to compare your solution to my students’. If you beat them, there might be a beer in it after all, if I ever meet you.
String algorithms challenge: Finding k-mers in strings
Finding motives in strings
A motif in a string, is a substring that occurs more frequently than would be expected by chance. In bioinformatics we are interested in such motives because anything that doesn’t look random potentially has some functional importance.
Finding motives just from strings, using no further biological knowledge, is a string algorithmic computational problem.
The main challenge here is dealing with the typically very large sequences involved (ranging from millions to billions of characters).
For the challenge for our course, we will look at a simple approach to the problem: finding all
-mers (substrings of length
) occurring above a certain frequency in a string.
Finding high frequency k-mers
The computational challenge — should you choose to accept it — is as follows:
Given an input string x, an integer
and a frequency
, report all
-mers that occur with frequency higher than
.
Expect the length of x to be from a few hundred thousands to a few millions and
to be between 5 and 20.
Solving the problem
To participate in the challenge you should device and implement an algorithm for the problem.
You should implement the algorithm in a tool that takes three arguments: a file containing the string x and the numbers
and
:
$ program filename k f
The program should output each
-mer occurring at frequency higher than
, and sorted with the highest frequency
-mers first.
Email the program — or a path to the executable — to me together with a short description of your algorithm.
The competition is only based on speed. As long as you output the right
-mers, all that matters it the wall time of your program, and any trick you can think up can be used.
Approaches
I haven’t solved the problem myself yet, but a few approaches springs to mind.
But remember, anything goes, so you are not limited to any of these!
A table strategy
One approach is of course to scan along x and insert all substrings of length
into a table to keep count of the occurrences.
The choice of table could be important for the running time, though.
For a hashing strategy, we can build hash keys “online” by updating the hash key for the string x[
..
] to a hash key for the string x[
..
] in constant time, but a major problem with the hashing strategy is that we probably see way too many
-mers to get the constant time lookup unless we use much more memory than we have.
We can of course also store the
-mers in a search tree (either directly or using a hash key). In that case, consider using splay trees, since these will automatically adjust themselves to the frequency with which we look up keys.
A suffix tree or suffix array approach
If you build a suffix tree for x then you can very efficiently calculate the frequency of all
-mers. The memory usage might be a problem, though.
A suffix array might solve that problem.
Compressed tries
If memory is a problem, you can also reduce the suffix tree to just the sub-strings of length
.
The straightforward approach would take time
, but it should be possible to modify one of the suffix tree algorithms to get a better running time.
The prize
Besides the admiration of your peers, there is also a proper prize for the fastest solution. A beer of your choice at the Friday Bar after the exam on the 20th.
Enjoy!
–
61-80=-19
March 3rd, 2009 at 11:31 am
Setting your students a problem you haven’t solved! Shame on you ;)
You should specify the expected alphabet the sequence will come from. If it is (1,0) then you probably don’t need that much space for any solution.
March 3rd, 2009 at 1:37 pm
And you should specify the platform where the implementations will be compared (i.e. OS, compiler, available memory and CPU features (number of cores, supported SSE instruction sets)).
March 3rd, 2009 at 3:04 pm
James: What’s the fun in setting a problem where you already know the solution? That’s no fun ;-)
Actually, I do have a number of ideas (and a simple solution so I can test that the output of the programs is correct), I just haven’t spent enough time on it to have a very efficient solution.
The point about the alphabet size is a good one, though, and I have received one solution where that is important. Now I’ve put examples of typical input on the web page where I’ve set the solution.
Asbjørn: I will test the programs on a 2Gb RAM dual core i686 Linux box.
March 3rd, 2009 at 3:28 pm
The Lempel-Ziv-Welch algorithm comes to mind. By having a look at the compression table and determining common substrings one might be able to approach this.
Prefix tries or huffman trees on the other hand seem also quite suitable.
March 3rd, 2009 at 4:36 pm
I hadn’t thought about approaching the problem from a compression perspective; it is not something we have covered in the class. Yeah, it might be a nice approach.
March 3rd, 2009 at 5:24 pm
Why would a Karp Rabin solution fail? A few million characters means at most a few million distinct k-mers. For the sake of generosity, let’s say it’s 10 million. If we keep a 64 bit fingerprint for each k-mer (10 mil < 2^24, 24*2 = 48 < 64, a fingerprint collision is exceedingly unlikely), we’d have 80 MB of fingerprints and 40 MB of 32 bit counts. If we allow a factor of 3 overhead for a hash table, we’d be using 320 MB of memory. It seems totally feasible to me.
March 3rd, 2009 at 5:46 pm
There is not much of a problem with collisions of hash keys, no, but you still need to chuck those hash keys into a table and you don’t want a 2^64 long table, and once you reduce the table size you will get collisions.
That is what I meant by there might being a problem with size using a hash table approach.
Karp-Rabin is what I had in mind when I wrote that it should be possible to compute the hash values of all k-mers in x in time O(|x|) rather than O(|x|k).
That still leaves you with ~|x| hash keys where you need to figure out the frequency of each, though, or am I misunderstanding your suggestion?
March 3rd, 2009 at 6:08 pm
Personally, I’d make two passes through the array:
Pass 1: Create a data structure that allows me to recognize all potential k-grams, with no false negatives, but with potential false positives. If s is a k-gram, it always returns true. If s is not a kgram, with high probability, it returns false. Pass 1 would be O(n) time, and except for stepping through the data, would fit in cache. I would be able to compute whether an element is a potential k-gram in O(1) time, and also, only use data in cache.
Pass 2: Go through the data again, but this time make an exact count of every string that is a potential k-gram, according to the data structure in pass 1. I would keep the count in a hash table. This hash table would be small (most likely, fit in cache), since it would only contain probable k-grams, as determined by pass 1. The pass would be O(n)*O(log m), where n is the size of x, and m is the number of acceptable k-grams, and therefore, a small constant factor smaller than the new hash table. Again, except for the source data x, this pass should fit in memory.
In detail, I’d do the following:
x[0..n] is the source data. f is an integer specifying the minimum number of occurrences of a valid k-gram in x (as opposed to a real number 0.0..1.0 or a percentage 0.0…100.0). I’ll assume f is something that is statistically significant (i.e. not a very small number).
Pass 1:
1) Allocate an array A. This should be sized such that size(A)>>len(x)/f, but it should still be small enough to fit in cache. Initialize A to 0.
2) Step through i=0…len(x)-k.
2a) Compute j=hash(x[i..i+k])
2b) Increment A[j], avoiding overflows. This can either be:
if(A[j]<f) A[j]++;
Or we can define A to have 32 bit data words, so we don’t need to worry about overflows.
At this point, we expect the typical elements of A to be statistically distributed in length, but typically, A[i]~=len(D)/len(A)<=f. Therefore, given a string s, if A[hash[s]]=f, then s may be an acceptable k-gram.
Pass 2:
1) Create a hash table T
2) Step through i=0…len(x)-k.
2a) Let s=x[i..i+k]
2a) Compute j=hash(s)
2b) Is A[j]>=f?
2b yes) If s is in T, T[s]++. If s is not in T, T[s]=1
Extract results:
Step through T, and return all elements where T[s]>f.
The core of optimizing this would consist of:
1) Preloading the data x into cache before I hit it (LWN had a series of articles on memory management that told how to do this). Since we are stepping through x serially, we can make use of the full memory bandwidth, and not have latency issues with cache misses
2) Coming up with good hash functions
3) Playing with appropriate metrics for how to size the array A and the table T
4) Coming up with a good way to use multicore
Of these, (4) is the hardest. We can do pass 1 over two halves of the data independently, and then add the two arrays. We can then, again, do pass 2 over two halves of the data, and add the hash tables.
The key to doing this well is to have a better understanding of the statistics of the data, as well as ballpark numbers for f. If f=1 (all strings are valid), the algorithms will be very different than if f=1000.
March 3rd, 2009 at 6:13 pm
Noticed a typo. Therefore, given a string s, if A[hash[s]]>f, then s may be an acceptable k-gram.
March 3rd, 2009 at 8:51 pm
I implemented the … brute-force solution…. :-)
See here: http://plindenbaum.blogspot.com/2009/03/string-challenge-my-brute-force.html
Pierre
March 3rd, 2009 at 9:01 pm
I’m not a computer scientist, but your problem sounds a lot like this tool we have for determining word frequencies in sequences.
See:
http://athena.bioc.uvic.ca/tools/JFreq
March 3rd, 2009 at 9:03 pm
My own brute force solution, in Python, looks like this:
import sys if len(sys.argv) != 4: print 'Usage:', sys.argv[0], 'infile k f' sys.exit(2) x = open(sys.argv[1]).read() k = int(sys.argv[2]) f = float(sys.argv[3]) count_table = dict() no_k_mers = len(x)-k+1 for i in xrange(0,len(x)-k+1): try: count_table[x[i:i+k]] += 1 except KeyError: count_table[x[i:i+k]] = 1 kmers = [(float(count)/no_k_mers, kmer) for kmer,count in count_table.items()] hits = [(freq,kmer) for freq,kmer in kmers if freq >= f] hits.sort() for freq,kmer in reversed(hits): print kmer, freqMarch 3rd, 2009 at 11:30 pm
Look for Rabin-Karp’s substring search algorithm and friends:
http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
March 4th, 2009 at 12:44 am
I understand this is just a brute-force solution, but still, I’d use something like this:
def substrings(x, k):
for ss in itertools.izip(*[x[i:] for i in xrange(k)]):
yield ss
instead of the for i in xrange(len(x)-k+1) business.
March 4th, 2009 at 1:55 am
In neuroscience, we are doing something similar but with GPUs. At IPDPS in May, I will present, Multi-Dimensional Characterization of Temporal Data Mining on Graphics Processors. The advance program is here: http://ipdps.org/ipdps2009/2009_advance_program.html. Also, we have another paper in submission for SIGKDD that explores the algorithm used for the GPU tests.
The short of all that, is that the data-parallelism of the GPUs can increase performance by orders of magnitude.
March 4th, 2009 at 2:06 am
I think the interesting case is when the output is too large to fit in RAM. it’s pretty simple to solve with a unix pipeline, relying on the “sort” command’s external merge-sort, but is there a faster way?
March 4th, 2009 at 7:49 am
Vineet:
I see how that solution works, but I am not sure I see the benefits of it. It is hardly as readable as the simple slicing, and a few quick experiments shows that it isn’t faster either.
Run:
import timeit import itertools def old_count(x,k): count_table = dict() for i in xrange(0,len(x)-k+1): try: count_table[x[i:i+k]] += 1 except KeyError: count_table[x[i:i+k]] = 1 def new_count(x,k): def substrings(x, k): for ss in itertools.izip(*[x[i:] for i in xrange(k)]): yield ss count_table = dict() for kmer in substrings(x,k): try: count_table[kmer] += 1 except KeyError: count_table[kmer] = 1 for n in xrange(100,1100,100): x = 'abc' * n k = 3 print n, print timeit.Timer('old_count(x,k)', 'from __main__ import old_count,x,k', ).timeit(number=100), print timeit.Timer('new_count(x,k)', 'from __main__ import new_count,x,k', ).timeit(number=100)to compare the two solutions. I get:
where the first column is the length of x, the second column the time for the old solution, and the third column is the time for your solution. The old solution is consistently faster.
Am I misunderstanding your comment?
March 7th, 2009 at 3:32 am
I think Rob’s idea is good.
>There is not much of a problem with collisions of hash keys, no, but you still need to chuck those >hash keys into a table and you don’t want a 2^64 long table, and once you reduce the table size >you will get collisions.
>
>That is what I meant by there might being a problem with size using a hash table approach.
There are two kinds of collisions to worry about. The first collision is among the 64-bit values (“fingerprints”), but Rob is going to ignore this possibility. I would point out that if the length of x is near a billion, then there could be like a billion squared collisions to worry about, which is perhaps a non-negligible fraction of 2^64. The second kind of collision is among smaller-sized hash values that correspond to the location in the <1Gb hash table. Collisions here are not a problem because they could be resolved by linear probing or something fancier. This is where the 3x factor is relevant; at least 2/3 of the hash table will be empty.
March 7th, 2009 at 11:02 am
It is the collision resolution of the second kind I am a bit worried about. I’m not saying it is a major problem — I don’t know, ’cause I haven’t experimented with it at all — but I do worry about it :)
It all depends on how many collisions you get, and how you resolve them. Hashing the strings first, and then resolving collisions in a splay tree might be an efficient solution, but I am not sure it will be much faster than just splay’ing to begin with…
September 9th, 2009 at 7:41 pm
I’m not totally sure, but I think either suffix arrays, or the Burrows-Wheeler transform (as used by TopHat etc.) should help.
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
Part of the problem is making a suffix array or BWT string, but libdivsufsort looks like it would make that pretty easy.
http://code.google.com/p/libdivsufsort/
September 9th, 2009 at 7:46 pm
… oops, I said TopHat, but meant Bowtie (or bwa, or probably several other short-read aligners.)