## Lazy lists in R

Playing around with functional implementations of queues, I want to implement a version with O(1) amortised running time that also works as a persistent data structure, meaning that any earlier version of the queue can be used again. This is, in general, a problem with amortised analysis, that consists of cheap and expensive operations where any sequence of n operations will take time O(n). If such a data structure is used as a persistent data structure, there is nothing preventing you from repeating expensive operations many times, breaking the complexity. It can be achieved for queues, however, by a combination of lazy evaluation and memorisation (C. Okasaki, Simple and efficient purely functional queues and deques, J. of Functional Programming, 5(4), 583-592, 1995).

So, I need to be able to delay evaluation of list operations. R doesn’t do lazy evaluation of expressions, though. It does, however, evaluate function arguments lazily, and this can be exploited to implement delayed evaluation through thunks.

My lazy lists will therefore be implemented as thunks with the invariant that they always evaluate to either NULL—for the empty list—or a list-object with a head and tail (car and cdr in lisp terminology). Functions for constructing and accessing lists look like this:

nil <- function() NULL
cons <- function(car, cdr) {
force(car)
force(cdr)
function() list(car = car, cdr = cdr)
}

is_nil <- function(lst) is.null(lst())
car <- function(lst) lst()$car cdr <- function(lst) lst()$cdr

The force() calls in the cons function are needed because of how lazy evaluation of function arguments are handled in R, but since cdr is a thunk, it doesn’t evaluate the actual body of the function, it just evaluates the parameter into the underlying function.

Operations on lists can now be delayed by wrapping them in thunks. I can give these thunks an expression that I want to delay the evaluation of. This will be an expression that evaluates to a list, and to make the thunk behave as if it was the list, I just need to evaluate the list in the body of the thunk.

Reversing a list, and concatenating two lists, can thus be implemented like this:

reverse <- function(lst) {
do_reverse <- function(lst) {
result <- nil
while (!is_nil(lst)) {
result <- cons(car(lst), result)
lst <- cdr(lst)
}
result
}
force(lst)
lazy_thunk <- function(lst) {
function() lst()
}
lazy_thunk(do_reverse(lst))
}

cat <- function(l1, l2) {
do_cat <- function(l1, l2) {
rev_l1 <- nil
while (!is_nil(l1)) {
rev_l1 <- cons(car(l1), rev_l1)
l1 <- cdr(l1)
}
result <- l2
while (!is_nil(rev_l1)) {
result <- cons(car(rev_l1), result)
rev_l1 <- cdr(rev_l1)
}
result
}
force(l1)
force(l2)
lazy_thunk <- function(lst) {
function() lst()
}
lazy_thunk(do_cat(l1, l2))
}

Again, we need to use force() for the arguments we give to the functions before we use them in the thunk we create—also when we use them in expressions we give to the thunk—because they might be referring to variables that change between the time we call the functions and the time we access the thunk. Do not force the list you give to the thunk, though. That would defy the purpose of making the thunk in the first place—we explicitly do not want the argument evaluated yet.

We can build some lists to test it all out:

vector_to_list <- function(v) {
lst <- nil
for (x in v) lst <- cons(x, lst)
reverse(lst)
}

l1 <- vector_to_list(1:10000)
l2 <- vector_to_list(1:10000)

First, we can try to concatenate two lists. If you want to try this at home, install the microbenchmark package and paste the code into R.

library(microbenchmark)
microbenchmark(lst <- cat(l1, l2), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # slow operation -- needs to copy l1
microbenchmark(car(lst), times = 1) # fast operation
microbenchmark(lst <- cat(l1, reverse(l2)), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # slow operation -- needs to copy l1
microbenchmark(car(lst), times = 1) # fast operation

length <- function(lst) {
n <- 0
while (!is_nil(lst)) {
lst <- cdr(lst)
n <- n + 1
}
n
}

microbenchmark(length(lst), times = 1) # slow operation -- needs to reverse l2
microbenchmark(length(lst), times = 1) # faster operation

This is slower still — we need to both copy l1 and reverse l2:

microbenchmark(length(cat(l1, l2)), times = 1)

## Looking for collaborators

There are a few R packages I just wrote to have examples for my books, but I now think they might be generally useful. I just haven’t taken them to the point where they are. I know I am lazy enough to just leave them as they are, but if I could get collaborators who would like to work on them with me, I think we can make something out of them.

There is the bindr package that makes it just a tad easier to work with functions that return more than a single value.

Then there is the dfdr package that computes the derivative of a function with a bit of meta programming, that I can see greatly helping in numerical optimisation.

And, of course, there is the ralgo package I am working on right now for efficient data structures in R; I implement data structures in it as I write my next R book.

If anyone out there would like to be involved in a little bit of R programming with me, I would love to collaborate.

## Meta-programming in R update

I sent the manuscript for Meta-programming in R to Apress a few days ago. Usually, there is a technical review and then I have to fix a few things before it goes into production. This time around, there were no changes required so it went directly to production. I am looking forward to seeing it in print. Because it is sold now, you cannot get it any longer on Amazon; you just have to wait.

## New chapter in Functional Data Structures in R

Over Easter I’ve been writing on a little book on Todoist together with my good friend Amir, and that has been a lot of fun, but I am back from my holiday today and did a little writing on my next R book, Functional Data Structures in R. I’ve finished another chapter, and you can get the current draft on Gumroad by following the link to the book.

The new chapter is on bags, stacks, and queues, and the reason it isn’t highlighted as drafted in the progress figure above is that I am thinking about adding one more implementation of queues to the chapter.

If you implement queues in a language where you can modify pointers, you can get a very efficient version through a double-linked list, but if you are in a language where data is immutable you have to do a little more work. R isn’t entirely pure, as a functional language, but it does have immutable data, so the implementation I present in the chapter is a mix of a functional and an ephemeral data structure. Strictly speaking, you can implement a double-linked list with a little work in R, but you have to abuse environments to do so, and the solution I present does use environments but is less abusive than a real double-linked list. It relies on an amortised analysis that lets you implement queues as two lists where you have to reverse one to pop from the other when the other is empty. You have to think this way: whenever you add an element to a queue, you take twice as much time as you really do, but then the extra time is put in the bank and it will pay for getting the front element of the list later on. The figure below illustrates it, and you can read the explanation in the new chapter.

Now, a nice property about purely functional data structures is that they are persistent. That means that you always have access to older versions of them after you “modify” them—but this property destroys the amortised analysis you can do on them. When we do amortised analysis we think in terms of a sequence of operations we can do on a data structure and do an analysis that shows that on average they have a certain runtime complexity. There are some cheap operations and some expensive operations, but we can make the cheap operations pay for the expensive operations such that the average complexity is bounded. That works great for ephemeral data structures—that is data structures that are modified whenever we work on them—but if we have persistent data structures—those that we can access earlier versions of, this analysis goes out the window.

For queues, it works like this: you can add to a queue with cheap operations, but some dequeueing operations are costly. If you always have a sequence of operations where the next operation operates on the data structure the resulted from the previous operation, then the amortised analysis is correct. But if you have a persistent data structure, then nothing prevents you from doing a number of cheap operations and then a sequence of expensive operations, going back to the same data structure you had after all the cheap operations. This means that you cannot control the balance between cheap and expensive operations. The amortised analysis breaks.

I give an explanation of that in the new chapter.

If you want a guaranteed complexity, and also persistent data structures, you need to work with worst-case complexity, not amortised complexity.

You can get out of this using lazy evaluation, and I have a paper on how to implement queues this way on my desk right now, but I haven’t implemented it yet. R doesn’t really have lazy evaluation—as with much else in R, it is a half-measure thing: R does have lazy evaluation for function arguments but not evaluation of expressions in general. I am pretty sure I can implement lazy evaluation via some non-standard evaluation where I can delay evaluation via thunks, but I need to think about it a little more before I experiment with it.

I will get there, but I wanted to share the current chapter today and leave that update for a later time. I might be able to implement it tomorrow, but if I don’t, I will be away for a seminar until Thursday, so I might as well share what I have right now.

I would really like to add some thoughts on this to the queue chapter, but it depends a little on how difficult it will be to implement. I suspect that the lazy evaluation solution will be a lot slower than the amortised solution for any non-persistent use of the data structure, and depending on how hard it is to implement I might leave it out of the book. I will definitely try to get it into the R package I am working on in parallel with the book, though. You can get the code for that on GitHub anyway.

For the next chapter, on functional heaps, I have a lot of stuff planned. I have only implemented leftist-heaps in the package so far, but I have a stack of papers of data structures that I look forward to implementing and writing about, so that is going to be a lot of fun. It will probably take me a few weeks to get it all done, though. And the coming weeks are a little busy at work, so that might mean a month or so. Anyway, I will keep you guys updated on the progress.

For the chapter after that—search trees and implementations of sets—I have come up with a neat way of using non-standard evaluation to implement the kind of pattern matching you have in Standard ML and Haskell for R. It is inspired by my bindr package I wrote for my Meta-programming book. You can find it in my ralgo package if you are impatient, but otherwise you can read about it when I get around to explaining it. I might use it already in the heaps chapter to implement splay-heaps. What it really is, is a way of both testing the structure of a tree and testing properties about it, so you can implement tree transformations with little code. I want to experiment with it a little more to get it just right, but I am pretty happy with what I have so far.

I am having a lot of fun working with data structures in R right now, after returning from my Easter vacation, and I can’t wait to implement more. It will have to wait, though, as I would really like to finish the book I’m writing with Amir first and then I have to go for a seminar on how to interact with the media on Tuesday and Wednesday—those of you who know me will agree that this is probably a good idea; I need to know how to combat the fake news, after all (and also learn how to not say controversial things on national television).

Anyway, as always, I would love any comments or criticism of my manuscript so far—and for this particular book where I am doing a lot more research and experiments I would love to collaborate on the data structure package I am working on as well. I am playing with the idea of writing a piece for the R journal about the package once I’m done with the book, and I would love to find collaborators on that.

Enough about the update for that book now, though. On to the next post…

## Evaluating queue data structures

Working on my book on functional data structures in R, I have implemented three versions of functional queues, two that are not really functional—they need to update a data structure as side effects—and one that is. As far as I can tell, they should have roughly the same performance, but they don’t.

Asymptotically, they all work in $$O(1)$$ time for all operations, this nice figure explains why:

The implementations have two lists and need to move elements from one to the other when you access the front of the queue, so you can imagine all enqueue operations costing twice as many computrones as the other operations and then that can pay for the movement of elements later.

However, the three implementations I have in R have different performances when I measure them. Two of them modify an environment and are not persistent; the third remembers one element more than the others and gives me a persistent data structure. For some reason, the last version has a performance that is about twice as bad as the other two.

I am stumped as to why.

You can reproduce the experiments with the code below—there might be a beer in it for you if you can explain to me why the functional version is so much slower.