No new ideas

That algorithm I wrote about last week, for exact pattern matching using border arrays? Well, it turns out that Gusfield already described it in 1997 in his book Algorithms on Strings, Trees, and Sequences.

Both Storm and I have definitely read this at some point and must have forgotten it since. It is hard to say whether we reinvented the idea or simply suffer from cryptomnesia.

Fast admixture analysis and population tree estimation for SNP and NGS data

Jade Yu Cheng, Thomas Mailund, and Rasmus Nielsen

Bioinformatics (2017)

Motivation: Structure methods are highly used population genetic methods for classifying individuals in a sample fractionally into discrete ancestry components.
Contribution: We introduce a new optimization algorithm for the classical STRUCTURE model in a maximum likelihood framework. Using analyses of real data we show that the new method finds solutions with higher likelihoods than the state-of-the-art method in the same computational time. The optimization algorithm is also applicable to models based on genotype likelihoods, that can account for the uncertainty in genotype-calling associated with Next Generation Sequencing (NGS) data. We also present a new method for estimating population trees from ancestry components using a Gaussian approximation. Using coalescence simulations of diverging populations, we explore the adequacy of the STRUCTURE-style models and the Gaussian assumption for identifying ancestry components correctly and for inferring the correct tree. In most cases, ancestry components are inferred correctly, although sample sizes and times since admixture can influence the results. We show that the popular Gaussian approximation tends to perform poorly under extreme divergence scenarios e.g. with very long branch lengths, but the topologies of the population trees are accurately inferred in all scenarios explored. The new methods are implemented together with appropriate visualization tools in the software package Ohana.

Linear time exact pattern matching

Yesterday, Storm and I had exams in String Algorithms. During the exams we came up with a simple linear time exact pattern matching algorithm that is very similar to the Knuth–Morris–Pratt algorithm but a little bit simpler.

The Knuth-Morris-Pratt algorithm is based on border arrays. A border of a string \(x\) is a proper prefix of \(x\) that is also a suffix of \(x\). The border array for \(x\) is an array that for each index \(i\) holds the length of the longest border of the prefix \(x[1\ldots i]\). You can compute the border array for a string “key” using the algorithm below:

    unsigned long m = strlen(key);
    unsigned long ba[m];  /* ba[i] = length of the longest border of key[0..i] */
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        /* shrink the border until it can be extended or is empty */
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }
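For example, for the string “abaab” the border array is \([0, 0, 1, 1, 2]\): the prefixes “a” and “ab” have no border, “aba” and “abaa” both have the longest border “a”, and “abaab” has the longest border “ab”.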

There is a double loop, but the inner loop only ever decreases \(b\), and \(b\) grows by at most one per iteration of the outer loop, so the total number of iterations in the inner loop never exceeds the total number of iterations of the outer loop. The algorithm therefore computes the border array in time \(O(m)\) where \(m\) is the length of the string.

The Knuth-Morris-Pratt algorithm computes the border array for the key it searches for and then loops through the string it searches in, using the border array to step forward when it sees a mismatch. You can, however, also search in linear time using border arrays alone. This is something Storm has used as an exercise in our class for years now. The idea is this: if you want to search for a pattern \(p\) of length \(m\) in a string \(x\), you can construct the string \(p\#x\), where # is a sentinel symbol found in neither \(p\) nor \(x\). If you build the border array for this string, you can find all occurrences of \(p\) in \(x\) by finding the indices \(i>m\) where the border array contains \(m\); each such index corresponds to an occurrence starting at position \(i-2m\) in \(x\).
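For example, with \(p =\) “ab” and \(x =\) “abab”, the combined string is “ab#abab” and its border array is \([0, 0, 0, 1, 2, 1, 2]\). The entries equal to \(m = 2\) are at indices 5 and 7, corresponding to occurrences starting at positions \(5 - 4 = 1\) and \(7 - 4 = 3\) in \(x\).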

What we realised yesterday, while asking questions at the exam, was that this algorithm never actually uses the border array beyond the first \(m\) entries. You just need to know how long a border would be at any point, and by construction it can never grow beyond \(m\), so there is no need to store it. Instead, we can build the border array for the key we search for and then pretend to build the border array for the rest, while really just keeping track of the current border length.

A complete search function based on this can be written in a few lines of C:

#include <stdio.h>
#include <string.h>

/* Prints the (0-indexed) start position of every occurrence of key in
   buffer; e.g. ba_search("ab", "abab") prints 0 and 2. */
static void ba_search(char * key, char * buffer)
{
    unsigned long n = strlen(buffer);
    unsigned long m = strlen(key);

    /* Preprocessing: build the border array for the key. */
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }

    /* Search: b is the border length we would get at each position of
       key#buffer; it never exceeds m, so we never need to store it. */
    unsigned long b = 0;
    for (unsigned long i = 0; i < n; ++i) {
        while (b > 0 && buffer[i] != key[b])
            b = ba[b-1];
        b = (buffer[i] == key[b]) ? b + 1 : 0;
        if (b == m) /* full match ending at index i */
            printf("%lu\n", i - m + 1);
    }
}

The algorithm is essentially just Knuth-Morris-Pratt. It uses the border array for the key in exactly the same way. It is just slightly simpler because it doesn’t treat the search part of the algorithm as different from the preprocessing. Both loops in this function are essentially doing the same thing; the only difference is that the first builds the border array and only looks at the key, while the second only keeps track of border lengths and looks at the string we search in.

Lazy queues

Today I implemented one of the data structures from C. Okasaki, “Simple and efficient purely functional queues and deques”, Journal of Functional Programming, 5(4):583–592, 1995, in R, and wrote about it in the chapter I am working on for my next book.
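The code in this post uses the lazy-list primitives from my previous posts. I won’t repeat their implementation here, but a minimal sketch of what they could look like, reconstructed from how they are used below (so take it as an assumption rather than the exact code), is:

nil <- function() NULL  # a lazy list is a thunk; the empty list yields NULL

cons <- function(car, cdr) {
  force(car)  # cons evaluates its arguments up front; this choice is
  force(cdr)  # exactly what makes the problem discussed below bite
  function() list(car = car, cdr = cdr)
}

is_nil <- function(lst) is.null(lst())
car <- function(lst) lst()$car
cdr <- function(lst) lst()$cdr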

This is a queue implementation based on lazy lists that combines list concatenation and reversal, moving elements from a “back” list to a “front” list one step at a time, as the queue is modified, through a “rotate” function. It requires that the rotations are constructed as lazy operations, and in its simplest form the rotation function could look like this:

rot <- function(front, back, a) {
  # invariant: back is one element longer than front
  if (is_nil(front)) cons(car(back), a)
  else {
    # delay the recursive rotation until the cell is actually accessed
    lazy_thunk <- function(lst) function() lst()
    lazy_thunk(cons(car(front), rot(cdr(front), cdr(back), cons(car(back), a))))
  }
}

The function does a combination of concatenation and reversal, which makes sense if you have read my two previous posts. The back list is guaranteed to be one element longer than the front list when the function is called, so the base case of the recursion just puts the single remaining element from the back list onto an accumulator, while the recursive case puts the first element of the front list at the front of a list that contains the concatenation of the rest of the front list and the reversal of the back list.
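For example, rotating front = [1] and back = [2, 3] with an empty accumulator produces [1, 3, 2]: the front list followed by the reversed back list.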

Because R does not evaluate expressions lazily in general, we have to wrap the result in a thunk. And because R does evaluate function parameters lazily, just not in the way you might expect, the function doesn’t work. It was interesting (although also a little frustrating) to figure out why.

The three parameters to the function are passed as promises—unevaluated expressions—and the accumulator is updated by prepending an element through a cons call. When this expression is passed along in the recursion as an unevaluated promise, each call adds another cons call that needs to be evaluated later. Once you get to the end of the recursion and actually need to access the accumulator, you have to evaluate the whole chain: a nested sequence of cons calls as long as the accumulator itself. This will exceed the stack size in R for long lists.

You can get around the problem in different ways. You can explicitly make sure that the cons calls are evaluated immediately:

rot <- function(front, back, a) {
  if (is_nil(front)) cons(car(back), a)
  else {
    lazy_thunk <- function(lst) function() lst()
    tail <- cons(car(back), a)  # evaluated now, so promises never pile up
    lazy_thunk(cons(car(front), rot(cdr(front), cdr(back), tail)))
  }
}

or you can just force the accumulator:

rot <- function(front, back, a) {
  force(a)  # evaluate the accumulator promise before it can grow into a chain
  if (is_nil(front)) cons(car(back), a)
  else {
    lazy_thunk <- function(lst) function() lst()
    lazy_thunk(cons(car(front), rot(cdr(front), cdr(back), cons(car(back), a))))
  }
}

Usually, it is a good idea to force all parameters that you return in a closure, so that is the conservative approach and probably what you should aim for whenever you can get away with it.
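To see why, here is a minimal illustration (f and g are hypothetical functions, just for the demo). Without force, the closure captures a promise that is only evaluated when the closure is called, and by then the variable may have changed:

f <- function(a) function() a  # returns a closure capturing the promise a
g <- function() {
  x <- 1
  h <- f(x)  # the promise for a holds the expression x, not the value 1
  x <- 2
  h()        # the promise is forced here, so we get 2, not 1
}
g()  # 2

With a force(a) at the top of f, the promise would be evaluated while x is still 1, and g() would return 1.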

Anyway, you have to wait a little bit to see how this all fits in with functional queues. I need to implement one more data structure before I’m finished with the current chapter.

More lazy lists

As a follow-up to my previous post, I want to look a bit more at my lazy lists in R.

The implementation I showed delays evaluation of some expressions, but not as much as it could. Well, at least not for list concatenation.

I had these implementations of concatenation and reversal:

reverse <- function(lst) {
  # eager reversal; the thunk below postpones it until the first access
  do_reverse <- function(lst) {
    result <- nil
    while (!is_nil(lst)) {
      result <- cons(car(lst), result)
      lst <- cdr(lst)
    }
    result
  }
  force(lst)
  lazy_thunk <- function(lst) {
    function() lst()
  }
  lazy_thunk(do_reverse(lst))
}

cat <- function(l1, l2) {
  # eager concatenation: copies l1 twice; the thunk below postpones
  # the work until the first access
  do_cat <- function(l1, l2) {
    rev_l1 <- nil
    while (!is_nil(l1)) {
      rev_l1 <- cons(car(l1), rev_l1)
      l1 <- cdr(l1)
    }
    result <- l2
    while (!is_nil(rev_l1)) {
      result <- cons(car(rev_l1), result)
      rev_l1 <- cdr(rev_l1)
    }
    result
  }
  force(l1)
  force(l2)
  lazy_thunk <- function(lst) {
    function() lst()
  }
  lazy_thunk(do_cat(l1, l2))
}

They delay evaluation, but only until the first operation that touches the result; at that point all the work is done at once. For concatenation, we can actually delay evaluation a bit more, so that every operation, both the concatenation itself and each access to the concatenated list, can be done in constant time.

lazy_cat <- function(l1, l2) {
  force(l1)
  force(l2)
  first <- l1()
  if (is.null(first)) l2
  else {
    # build one cell per access; the rest of the concatenation stays delayed
    lazy_thunk <- function(lst) function() lst()
    lazy_thunk(cons(first$car, lazy_cat(first$cdr, l2)))
  }
}

library(microbenchmark)
# l1 and l2 are assumed to be long lazy lists built with cons
microbenchmark(lst <- cat(l1, reverse(l2)), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # slow operation -- needs to copy l1
microbenchmark(lst <- lazy_cat(l1, l2), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # fast

We can use a similar trick for reversal, but we don’t gain much from it. We can implement lazy reversal like this:

lazy_reverse <- function(lst) {
  rev <- function(l, t) {
    force(l)
    force(t)
    first <- l()
    if (is.null(first)) t
    else {
      # delays the recursion, but the first access still has to
      # unwind it all the way to the end of the original list
      lazy_thunk <- function(lst) function() lst()
      lazy_thunk(rev(first$cdr, cons(first$car, t)))
    }
  }
  rev(lst, nil)
}

But the first time we access the head of a reversed list, we still need to go all the way down the recursion; we cannot get the first element of the reversed list without going to the end of the original list. So in this case, the imperative solution I had before is actually still better—plus, it won’t run out of stack space on deep recursions, which can easily happen with the lazy version.
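In microbenchmark terms (again assuming a long lazy list l1 as above), the costs would look something like this:

microbenchmark(lst <- lazy_reverse(l1), times = 1) # fast -- only builds a thunk
microbenchmark(car(lst), times = 1) # slow -- has to recurse to the end of l1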