Update on Functional Data Structures in R

The manuscript for Functional Data Structures in R is coming along nicely. I’ve just finished the description of red-black search trees. Insertion there is simple, but deletion requires a bit more work.

I added some code for visualising trees and some experiments with runtime performance.

Now I want to go back through the previous chapters and update those with more experiments and visualisation, and also include some of the experiment code. I plan to do this before I return to the search tree chapter and write about splay trees.

Nice to see my books around

Rene Thomsen posted this picture on twitter with the text

@ThomasMailund we have a lot of your ‘Beginning Data Science in R’ books available for both current and future data scientists at Scio+

It is great to see that somebody buys them at least. And there’s more there than I ever had myself.

By now, I only have one copy left of the Data Science book and one copy of the Functional Programming book, but still plenty of Metaprogramming and Object-oriented programming.

New chapter in Functional Data Structures in R

Yesterday, I finished the draft of the Heaps chapter of Functional Data Structures in R. It got a little shorter than I had planned, but instead the chapter on Immutable and persistent data got a little longer.

I had planned to include Brodal heaps in this chapter, but ended up not doing it. They are pretty cool in having optimal worst-case complexity, but there isn’t much novel in how you would implement them in R compared to binomial heaps. Instead, I described how skew-binomial numbers can be used as a framework for random access lists. That is one of the ideas that are used in Brodal heaps, but by using it for lists I get to show the trick without repeating too much code compared to what I would need to do for Brodal heaps.

Now I just have one more chapter to write before I start editing. The chapter on Sets, maps and search trees. There, I will include red-black trees—these can be handled very elegantly if you never delete elements, but it is possible to delete elements with some work that I still have to implement—and I will include splay trees. I don’t know exactly what to include beyond that.

Anyway, the current plan is that this will be the last chapter, but if I can think about something more to add, I might. If not, I will finish the draft and then let the book rest a few weeks so I can edit it with fresh eyes. While this book is resting, I will find something else to write on.

Tell me your favourite functional data structures

I’m writing on my next book—Functional Data Structures in R—again now and would love to hear suggestions for data structures to include in the book.

I have, of course, already described linked lists and trees. R doesn’t have built-in linked lists—the lists in R are really more like arrays in how they work, although they are very dynamic—but you can implement linked lists via R lists by just having elements in the list work as pointers. This, naturally, gives you a list implementation of stacks. With a little bit of extra work, and it really doesn’t take much, you can then also get queues with amortised constant time operations.

If you implement lazy evaluation, which you can do with some tricks exploiting that arguments to functions in R are lazy evaluated, you can get worst case constant operations on queues as well.

I’ve described how you can implement skew binary numbers as lists and thus increase and decrease these in constant time, and how this can be used to implement random access lists.

Skew binary numbers can be represented as lists containing the positions of non-zero digits. The digit two, which can only occur at the least significant non-zero position, are represented as a duplicated position cell. By merging or splitting positions, you can increment and decrement in constant time.
Skew binary numbers can be represented as lists containing the positions of non-zero digits. The digit two, which can only occur at the least significant non-zero position, are represented as a duplicated position cell. By merging or splitting positions, you can increment and decrement in constant time.
You can represent lists as fully balanced trees and have lists of these at sizes matching the positions in skew binary numbers. The cons and cdr/tail operations then matches the operations you do to increment and decrement numbers.
You can represent lists as fully balanced trees and have lists of these at sizes matching the positions in skew binary numbers. The cons and cdr/tail operations then matches the operations you do to increment and decrement numbers.
Updating random access lists combines updating lists and trees, and because there are no more than log n trees, and these are fully balanced so have depth at most log n, the operations are logarithmic in time.
Updating random access lists combines updating lists and trees, and because there are no more than log n trees, and these are fully balanced so have depth at most log n, the operations are logarithmic in time.

I have leftist heaps and binomial heaps, and I plan to implement and describe skew binomial heaps and Brodal heaps (skew binomial heaps of heaps) as well.

Binomial heaps have a close correspondence to binary numbers, and updating them behaves much like arithmetic on binary numbers.
Binomial heaps have a close correspondence to binary numbers, and updating them behaves much like arithmetic on binary numbers.

I have described splay heaps and plan to describe splay trees in general for search trees—although here I have to figure out how best to implement a structure that needs to return both values and trees in one operation.

Operations for deleting the smallest element in a splay tree.
Operations for deleting the smallest element in a splay tree.
Transitions needed to compute the partition of splay trees into elements smaller than and larger than a new element to be inserted at the root.
Transitions needed to compute the partition of splay trees into elements smaller than and larger than a new element to be inserted at the root.

For sets and maps, I have an implementation of red-black search trees, although I haven’t implemented the delete operation yet—that operation is a lot more complex than the insertion operation.

Are there any important data structures I am missing and that I should include?

No new ideas

That algorithm I wrote about last week, for exact pattern matching using border arrays? Well, it turns out that Gusfield already described that in 1996 in his book Algorithms on Strings, Trees, and Sequences.

Both Storm and I have definitely read this at some point and must then have forgotten. It is hard to say if we then reinvented the idea or just suffer from cryptomnesia.