New chapter in Functional Data Structures in R

Yesterday, I finished the draft of the Heaps chapter of Functional Data Structures in R. It ended up a little shorter than I had planned, but in return the chapter on Immutable and persistent data got a little longer.

I had planned to include Brodal heaps in this chapter but ended up not doing it. They are pretty cool in having optimal worst-case complexity, but there isn't much novel in how you would implement them in R compared to binomial heaps. Instead, I described how skew binary numbers can be used as a framework for random access lists. That is one of the ideas used in Brodal heaps, but by using it for lists I get to show the trick without repeating too much code compared to what I would need for Brodal heaps.

Now I just have one more chapter to write before I start editing: the chapter on Sets, maps and search trees. There, I will include red-black trees (these can be handled very elegantly if you never delete elements, but deletion is possible with some extra work that I still have to implement) and I will include splay trees. I don't know exactly what to include beyond that.

Anyway, the current plan is that this will be the last chapter, but if I can think of something more to add, I might. If not, I will finish the draft and then let the book rest for a few weeks so I can edit it with fresh eyes. While the book is resting, I will find something else to write about.

Tell me your favourite functional data structures

I'm working on my next book, Functional Data Structures in R, again now and would love to hear suggestions for data structures to include in the book.

I have, of course, already described linked lists and trees. R doesn't have built-in linked lists: the lists in R really work more like arrays, although they are very dynamic. But you can implement linked lists via R lists by having elements in a list work as pointers. This naturally gives you a list implementation of stacks. With a little extra work, and it really doesn't take much, you can then also get queues with amortised constant-time operations.
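
To make this concrete, a linked list can be built from R lists along these lines (the names here are just for illustration, not necessarily what the book uses), and it immediately doubles as a stack:

cons <- function(head, tail) list(head = head, tail = tail)
nil  <- NULL  # the empty list

push <- cons
top  <- function(stack) stack$head
pop  <- function(stack) stack$tail

s <- push(3, push(2, push(1, nil)))
top(s)       # 3
top(pop(s))  # 2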

If you implement lazy evaluation, which you can do with some tricks exploiting that arguments to functions in R are lazily evaluated, you can get worst-case constant-time operations on queues as well.
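
The core of the trick looks something like this: a lazily evaluated argument acts as a thunk once you wrap it in a function, and R's promise semantics memoise the result. This is only a sketch of the mechanism, not the book's queue implementation:

make_thunk <- function(expr) function() expr  # expr is a promise, forced on first call

t <- make_thunk({ cat("evaluating\n"); 40 + 2 })
t()  # prints "evaluating" and returns 42
t()  # returns 42 without re-evaluating; the promise is memoised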

I've described how you can implement skew binary numbers as lists and thereby increment and decrement them in constant time, and how this can be used to implement random access lists.

Skew binary numbers can be represented as lists containing the positions of non-zero digits. The digit two, which can only occur at the least significant non-zero position, is represented as a duplicated position cell. By merging or splitting positions, you can increment and decrement in constant time.
You can represent lists as fully balanced trees and keep lists of these at sizes matching the positions in skew binary numbers. The cons and cdr/tail operations then match the operations you do to increment and decrement numbers.
Updating random access lists combines updating lists and trees, and because there are no more than log n trees, and these are fully balanced and so have depth at most log n, the operations are logarithmic in time.
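
As a rough sketch of the number part, storing the weights of the non-zero digits smallest-first in a plain vector for readability (an actual constant-time version would use a linked list), increment and decrement look like this:

skew_inc <- function(ws) {
  if (length(ws) >= 2 && ws[1] == ws[2])
    c(2 * ws[1] + 1, ws[-(1:2)])  # merge the duplicated digit: w + w + 1 = 2w + 1
  else
    c(1, ws)                      # otherwise add a new weight-one digit
}

skew_dec <- function(ws) {
  if (ws[1] == 1) ws[-1]                   # remove a weight-one digit
  else c(rep((ws[1] - 1) / 2, 2), ws[-1])  # split the smallest digit in two
}

ws <- NULL
for (i in 1:4) ws <- skew_inc(ws)
ws  # 1 3, i.e. 4 = 1 + 3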

I have leftist heaps and binomial heaps, and I plan to implement and describe skew binomial heaps and Brodal heaps (skew binomial heaps of heaps) as well.

Binomial heaps have a close correspondence to binary numbers, and updating them behaves much like arithmetic on binary numbers.
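
As a small illustration, linking two binomial trees of the same rank is the analogue of a carry: the smaller root becomes the new root and the rank grows by one. The representation here (a list with value, rank and children fields) is just one possible choice, not necessarily the book's:

link <- function(t1, t2) {
  # the smaller root becomes the new root, preserving the heap order
  if (t1$value <= t2$value)
    list(value = t1$value, rank = t1$rank + 1,
         children = c(list(t2), t1$children))
  else
    list(value = t2$value, rank = t2$rank + 1,
         children = c(list(t1), t2$children))
}

singleton <- function(x) list(value = x, rank = 0, children = list())
link(singleton(2), singleton(5))  # a rank-one tree with 2 at the root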

I have described splay heaps and plan to describe splay trees in general for search trees—although here I have to figure out how best to implement a structure that needs to return both values and trees in one operation.

Operations for deleting the smallest element in a splay tree.
Transitions needed to compute the partition of splay trees into elements smaller than and larger than a new element to be inserted at the root.
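
One way to handle the return-both problem, sketched here on a plain unbalanced search tree without the splay rotations, is to bundle the value and the restructured tree in a list; nodes are assumed to be lists with left, value and right fields:

pop_min <- function(tree) {
  if (is.null(tree$left))
    list(min = tree$value, rest = tree$right)  # the root is the minimum
  else {
    rec <- pop_min(tree$left)
    list(min = rec$min,
         rest = list(left = rec$rest, value = tree$value, right = tree$right))
  }
}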

For sets and maps, I have an implementation of red-black search trees, although I haven’t implemented the delete operation yet—that operation is a lot more complex than the insertion operation.
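
To show why insertion is the elegant part, here is a sketch of the classic Okasaki rebalancing step, using a representation chosen for illustration (nodes as lists, empty trees as NULL) rather than the book's code. All four red-red violations that can arise below a black node are rewritten to the same balanced shape:

node <- function(colour, left, value, right)
  list(colour = colour, left = left, value = value, right = right)
is_red <- function(t) !is.null(t) && t$colour == "red"

balance <- function(left, value, right) {
  if (is_red(left) && is_red(left$left)) {           # red left child, red left grandchild
    g <- left$left
    node("red", node("black", g$left, g$value, g$right),
         left$value, node("black", left$right, value, right))
  } else if (is_red(left) && is_red(left$right)) {   # red left child, red right grandchild
    g <- left$right
    node("red", node("black", left$left, left$value, g$left),
         g$value, node("black", g$right, value, right))
  } else if (is_red(right) && is_red(right$left)) {  # red right child, red left grandchild
    g <- right$left
    node("red", node("black", left, value, g$left),
         g$value, node("black", g$right, right$value, right$right))
  } else if (is_red(right) && is_red(right$right)) { # red right child, red right grandchild
    g <- right$right
    node("red", node("black", left, value, right$left),
         right$value, node("black", g$left, g$value, g$right))
  } else {
    node("black", left, value, right)
  }
}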

Are there any important data structures I am missing and that I should include?

No new ideas

That algorithm I wrote about last week, for exact pattern matching using border arrays? Well, it turns out that Gusfield already described it in 1997 in his book Algorithms on Strings, Trees, and Sequences.

Both Storm and I have definitely read this at some point and must have forgotten it since. It is hard to say whether we then reinvented the idea or just suffer from cryptomnesia.

Fast admixture analysis and population tree estimation for SNP and NGS data

Jade Yu Cheng, Thomas Mailund, and Rasmus Nielsen

Bioinformatics (2017)

Motivation: Structure methods are widely used population genetic methods for fractionally classifying individuals in a sample into discrete ancestry components.
Contribution: We introduce a new optimization algorithm for the classical STRUCTURE model in a maximum likelihood framework. Using analyses of real data, we show that the new method finds solutions with higher likelihoods than the state-of-the-art method in the same computational time. The optimization algorithm is also applicable to models based on genotype likelihoods, which can account for the uncertainty in genotype calling associated with Next Generation Sequencing (NGS) data. We also present a new method for estimating population trees from ancestry components using a Gaussian approximation. Using coalescence simulations of diverging populations, we explore the adequacy of the STRUCTURE-style models and the Gaussian assumption for identifying ancestry components correctly and for inferring the correct tree. In most cases, ancestry components are inferred correctly, although sample sizes and times since admixture can influence the results. We show that the popular Gaussian approximation tends to perform poorly under extreme divergence scenarios, e.g. with very long branch lengths, but the topologies of the population trees are accurately inferred in all scenarios explored. The new methods are implemented together with appropriate visualization tools in the software package Ohana.

KMP for comparison

For comparison with the previous post, this is what KMP would look like:

#include <stdio.h>
#include <string.h>

static void kmp_search(char * key, char * buffer)
{
    unsigned long n = strlen(buffer);
    unsigned long m = strlen(key);
    if (m == 0 || n < m) return; // guard against the unsigned n - m wrapping around

    // compute the border array of the key
    unsigned long ba[m];
    ba[0] = 0;
    for (unsigned long i = 1; i < m; ++i) {
        unsigned long b = ba[i-1];
        while (b > 0 && key[i] != key[b])
            b = ba[b-1];
        ba[i] = (key[i] == key[b]) ? b + 1 : 0;
    }

    // scan the buffer; i indexes the buffer, j the key
    unsigned long i, j; i = j = 0;
    while (i <= n - m + j) { // a full match must still be able to fit
        while (j < m && buffer[i] == key[j]) {
            ++i; ++j;
        }
        if (j == m) printf("%lu\n", i - m); // a match starts at i - m
        if (j == 0) ++i;
        else j = ba[j-1];
    }
}

This algorithm does essentially the same thing as the border-array based algorithm, but the two loops look very different, and you need to get both of them right. The previous algorithm also has two loops, but they do the same thing, so if you get one of them right you will very easily get the other right as well.