Working on Functional Data Structures in R

I’ve gotten started on the next chapter in my Functional Data Structures in R book.

I’ve spent a lot of time playing around with lazy queues, but now I get to heaps. I have already implemented a few of these; you can see them in my ralgo package. But there are a lot of data structures I still want to implement for the next chapter.

I think I will also abandon the idea of making a Kindle version of this book. I simply need too much math in it, and the Kindle format doesn’t render it correctly. So, I might only sell it as a PDF or a paperback. If I can sell it to Apress, I will do that. Otherwise, I will look for another publisher.

I don’t think there is anything else on the market about data structures in R, and I think there should be, so I am having a lot of fun with this book.

Lazy queues

Today I implemented, in R, one of the data structures from C. Okasaki, Simple and efficient purely functional queues and deques, J. of Functional Programming, 5(4), 583-592, 1995, and wrote about it in the chapter I am working on for my next book.

This is a queue implementation based on lazy lists. It combines list concatenation and reversal to move elements from a “back” list to a “front” list, one step at a time as the queue is modified, through a “rotate” function. The rotations must be constructed as lazy operations, and in its simplest form the rotation function could look like this:

rot <- function(front, back, a) {
  if (is_nil(front)) cons(car(back), a)
  else {
    lazy_thunk <- function(lst) function() lst()
    lazy_thunk(cons(car(front), rot(cdr(front), cdr(back), cons(car(back), a))))
  }
}

The function does a combination of concatenation and reversal, which makes sense if you have read my two previous posts. The back list is guaranteed to be exactly one element longer than the front list when the function is called, so the base case of the recursion just prepends the single remaining element of the back list to the accumulator, while the recursive case puts the first element of the front list at the front of a list that contains the concatenation of the rest of the front list and the reversal of the back list. In other words, rot(front, back, a) computes the front list, followed by the reversed back list, followed by the accumulator; for example, rotating front = [1, 2] and back = [5, 4, 3] with an empty accumulator yields [1, 2, 3, 4, 5].

Because R doesn’t evaluate expressions lazily, we have to wrap the result in a thunk. And because R does evaluate function arguments lazily, but not in the way you might expect, the function as written doesn’t work. It was interesting (although also a little frustrating) to figure out why.

The three parameters to the function are passed as promises—unevaluated expressions—and the accumulator is updated by prepending an element to it through a cons call. Since this expression is passed along in the recursion unevaluated, each recursive call constructs a promise that wraps the previous one in yet another cons call, to be evaluated later. Once you get to the end of the recursion and actually need to access the accumulator, you have to evaluate this whole chain, which means evaluating a nested sequence of cons calls as long as the accumulator itself. This will exceed the stack size in R for long lists.

You can get around the problem in different ways. You can explicitly make sure that the cons call building the accumulator is evaluated immediately:

rot <- function(front, back, a) {
  if (is_nil(front)) cons(car(back), a)
  else {
    lazy_thunk <- function(lst) function() lst()
    tail <- cons(car(back), a)
    lazy_thunk(cons(car(front), rot(cdr(front), cdr(back), tail)))
  }
}

or you can just force the accumulator:

rot <- function(front, back, a) {
  force(a)
  if (is_nil(front)) cons(car(back), a)
  else {
    lazy_thunk <- function(lst) function() lst()
    lazy_thunk(cons(car(front), rot(cdr(front), cdr(back), cons(car(back), a))))
  }
}
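
As a quick sanity check, here is a hedged usage sketch. It assumes the lazy-list functions (nil, cons, car, cdr, is_nil) and vector_to_list from my previous posts are in scope, and it uses a small list_to_vector helper of my own, purely for inspecting the result:

list_to_vector <- function(lst) {
  # walk the lazy list and collect its elements in an ordinary vector
  v <- c()
  while (!is_nil(lst)) {
    v <- c(v, car(lst))
    lst <- cdr(lst)
  }
  v
}

front <- vector_to_list(1:3)
back <- vector_to_list(4:7)  # the invariant: back is one element longer than front
list_to_vector(rot(front, back, nil))  # 1 2 3 7 6 5 4 -- front followed by the reversed back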

Usually, it is a good idea to force all parameters that you return inside a closure. That is the conservative approach and probably what you should aim for whenever you can get away with it.
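
A minimal illustration of why, entirely outside the queue code (make_adder is just a throwaway name for this sketch):

make_adder <- function(n) function(x) x + n  # n is captured as an unforced promise
adders <- vector("list", 3)
for (i in 1:3) adders[[i]] <- make_adder(i)
adders[[1]](10)  # 13, not 11 -- the promise for n is only forced now, and i is 3 by then

make_adder <- function(n) { force(n); function(x) x + n }  # forcing n fixes its value
for (i in 1:3) adders[[i]] <- make_adder(i)
adders[[1]](10)  # 11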

Anyway, you have to wait a little bit to see how this all fits in with functional queues. I need to implement one more data structure before I’m finished with the current chapter.

More lazy lists

As a follow-up to my previous post, I want to look a bit more at my lazy lists in R.

The implementation I showed delays the evaluation of some expressions, but not as much as it could. At least not for list concatenation.

I had these implementations of concatenation and reversal:

reverse <- function(lst) {
  do_reverse <- function(lst) {
    result <- nil
    while (!is_nil(lst)) {
      result <- cons(car(lst), result)
      lst <- cdr(lst)
    }
    result
  }
  force(lst)
  lazy_thunk <- function(lst) {
    function() lst()
  }
  lazy_thunk(do_reverse(lst))
}

cat <- function(l1, l2) {
  do_cat <- function(l1, l2) {
    rev_l1 <- nil
    while (!is_nil(l1)) {
      rev_l1 <- cons(car(l1), rev_l1)
      l1 <- cdr(l1)
    }
    result <- l2
    while (!is_nil(rev_l1)) {
      result <- cons(car(rev_l1), result)
      rev_l1 <- cdr(rev_l1)
    }
    result
  }
  force(l1)
  force(l2)
  lazy_thunk <- function(lst) {
    function() lst()
  }
  lazy_thunk(do_cat(l1, l2))
}

They delay evaluation, but only until the first operation we perform on the result; at that point the entire delayed computation is carried out at once. For concatenation, we can actually delay evaluation a bit more, such that all operations, the concatenation itself as well as access to the concatenated list, can be done in constant time.

lazy_cat <- function(l1, l2) {
  force(l1)
  force(l2)
  first <- l1()
  if (is.null(first)) l2
  else {
    lazy_thunk <- function(lst) function() lst()
    lazy_thunk(cons(first$car, lazy_cat(first$cdr, l2)))
  }
}

Comparing the two versions with microbenchmark, using the lists l1 and l2 from my previous post:

microbenchmark(lst <- cat(l1, reverse(l2)), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # slow operation -- needs to copy l1
microbenchmark(lst <- lazy_cat(l1, l2), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # fast
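
The cost of the concatenation hasn’t disappeared, of course; it is just spread out over the traversal, one cons cell per step. A hedged check, reusing the length function from my previous post:

microbenchmark(length(lazy_cat(l1, l2)), times = 1) # linear in the combined length; the laziness spreads the cost over the walk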

We can use a similar trick for reversal, but we don’t gain much from it. We can implement lazy reversal like this:

lazy_reverse <- function(lst) {
  rev <- function(l, t) {
    force(l)
    force(t)
    first <- l()
    if (is.null(first)) t
    else {
      lazy_thunk <- function(lst) function() lst()
      lazy_thunk(rev(first$cdr, cons(first$car, t)))
    }
  }
  rev(lst, nil)
}

But the first time we access the head of a reversed list, we still have to go all the way down the recursion: we cannot get the first element of the reversed list without going to the end of the original list. So in this case, the imperative solution I had before is actually still better—plus, it won’t run out of stack space on deep recursions, which could easily happen with the lazy version.
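
To make the trade-off concrete, here is a hedged comparison, reusing l1 (10,000 elements) from the previous post; the exact limit depends on your stack size and R’s evaluation-depth settings:

microbenchmark(car(reverse(l1)), times = 1) # linear, but all the work happens in a loop
# car(lazy_reverse(l1)) # recurses one call per element and can exhaust the evaluation stack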

Lazy lists in R

Playing around with functional implementations of queues, I want to implement a version with O(1) amortised running time that also works as a persistent data structure, meaning that any earlier version of the queue can be used again. Persistence is, in general, a problem for amortised analysis: the analysis assumes a mix of cheap and expensive operations such that any sequence of n operations takes total time O(n), but if such a data structure is used persistently, there is nothing preventing you from repeating the expensive operations many times, breaking the complexity bound. For queues it can be achieved, however, by a combination of lazy evaluation and memoisation (C. Okasaki, Simple and efficient purely functional queues and deques, J. of Functional Programming, 5(4), 583-592, 1995).

So, I need to be able to delay evaluation of list operations. R doesn’t do lazy evaluation of expressions, though. It does, however, evaluate function arguments lazily, and this can be exploited to implement delayed evaluation through thunks.
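
To make the mechanism concrete, here is a minimal sketch (make_thunk is just a name for this illustration): the argument is passed as a promise, so the expression is only evaluated when the thunk is first called, and the result is then remembered.

make_thunk <- function(x) function() x
delayed <- make_thunk({ cat("evaluating\n"); 42 })  # nothing is evaluated yet
delayed()  # prints "evaluating" and returns 42
delayed()  # returns 42 without re-evaluating -- promises are only forced once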

My lazy lists will therefore be implemented as thunks with the invariant that they always evaluate to either NULL—for the empty list—or a list object with a head and a tail (car and cdr in Lisp terminology). Functions for constructing and accessing lists look like this:

nil <- function() NULL
cons <- function(car, cdr) {
  force(car)
  force(cdr)
  function() list(car = car, cdr = cdr)
}

is_nil <- function(lst) is.null(lst())
car <- function(lst) lst()$car
cdr <- function(lst) lst()$cdr
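
A quick sanity check of how these fit together (just an illustration, not part of the implementation):

lst <- cons(1, cons(2, nil))
car(lst)               # 1
car(cdr(lst))          # 2
is_nil(cdr(cdr(lst)))  # TRUE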

The force() calls in the cons function are needed because of how lazy evaluation of function arguments is handled in R. Note that forcing cdr does not evaluate the body of the thunk stored there; it only evaluates the promise down to the underlying function, so the tail of the list itself remains delayed.
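
To see what the force() calls buy us, here is a hedged sketch with a hypothetical variant of cons that skips them:

cons_noforce <- function(car, cdr) function() list(car = car, cdr = cdr)  # hypothetical: no force()
x <- 1
lst_bad <- cons_noforce(x, nil)
lst_good <- cons(x, nil)
x <- 2
car(lst_bad)   # 2 -- the promise for the head is only forced now, after x has changed
car(lst_good)  # 1 -- force() fixed the value when the cell was constructed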

Operations on lists can now be delayed by wrapping them in thunks. I can give these thunks an expression whose evaluation I want to delay. This will be an expression that evaluates to a list, and to make the thunk behave as if it were that list, I just need to evaluate the expression in the body of the thunk.

Reversing a list, and concatenating two lists, can thus be implemented like this:

reverse <- function(lst) {
  do_reverse <- function(lst) {
    result <- nil
    while (!is_nil(lst)) {
      result <- cons(car(lst), result)
      lst <- cdr(lst)
    }
    result
  }
  force(lst)
  lazy_thunk <- function(lst) {
    function() lst()
  }
  lazy_thunk(do_reverse(lst))
}

cat <- function(l1, l2) {
  do_cat <- function(l1, l2) {
    rev_l1 <- nil
    while (!is_nil(l1)) {
      rev_l1 <- cons(car(l1), rev_l1)
      l1 <- cdr(l1)
    }
    result <- l2
    while (!is_nil(rev_l1)) {
      result <- cons(car(rev_l1), result)
      rev_l1 <- cdr(rev_l1)
    }
    result
  }
  force(l1)
  force(l2)
  lazy_thunk <- function(lst) {
    function() lst()
  }
  lazy_thunk(do_cat(l1, l2))
}

Again, we need to use force() on the arguments we give to the functions before we use them in the thunk we create—also when we use them in expressions we give to the thunk—because they might refer to variables that change between the time we call the function and the time we access the thunk. Do not force the list you give to the thunk, though. That would defeat the purpose of making the thunk in the first place; we explicitly do not want that argument evaluated yet.

We can build some lists to test it all out:

vector_to_list <- function(v) {
  lst <- nil
  for (x in v) lst <- cons(x, lst)
  reverse(lst)
}

l1 <- vector_to_list(1:10000)
l2 <- vector_to_list(1:10000)

First, we can try to concatenate two lists. If you want to try this at home, install the microbenchmark package and paste the code into R.

library(microbenchmark)
microbenchmark(lst <- cat(l1, l2), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # slow operation -- needs to copy l1
microbenchmark(car(lst), times = 1) # fast operation
microbenchmark(lst <- cat(l1, reverse(l2)), times = 1) # fast operation
microbenchmark(car(lst), times = 1) # slow operation -- needs to copy l1
microbenchmark(car(lst), times = 1) # fast operation

The reversal of l2 is still pending at this point; we only pay for it once we walk past the elements copied from l1, for example when computing the length of the concatenated list:

length <- function(lst) {
  n <- 0
  while (!is_nil(lst)) {
    lst <- cdr(lst)
    n <- n + 1
  }
  n
}

microbenchmark(length(lst), times = 1) # slow operation -- needs to reverse l2
microbenchmark(length(lst), times = 1) # faster operation

This is slower still — we need to both copy l1 and reverse l2:

microbenchmark(length(cat(l1, l2)), times = 1)