Back over Easter Amir Salihefendic and I wrote a guide to Todoist, the todo list manager that his company makes. It took a little longer for us to finally make the guide public—Amir had some of his designers make a better cover than I had hacked together with my quite limited skills and had one of his guys copyedit it first—but now you can get it on Amazon and if you get it before June 4 you can even get it for free.
I’m reading Chris Okasaki’s Purely Functional Data Structures again while I work on my book on data structures in R. I read it a while back, but it is really eyeopening how much I missed when I just read it compared to now that I am actually implementing and writing about these data structures myself.
One thing I didn’t think much about when I read it the first time is how lazy evaluation and amortised analysis really work together—which is a bit surprising considering how much of the book is actually about that. Now, that I have had to think more about it myself in order to explain it in my own book, it finally clicks.
With amortised analysis you can think about saving up some “computation” by having cheap operations on a data structure that can then pay for more expensive operations later on, such that the average cost of operations is below a certain bound. Okasaki makes a distinction between using the “banker’s method” or the “physicist’s method”, but I never really thought much about that and I don’t think too much about it now either. I just think about saving up some “computation” in a data structure that I can use later.
One example of such a saving analysis is the functional queue. With functional lists you cannot really implement queues with pointer manipulation so the structure you have to work with is a single-linked lists where you can prepend elements. This makes it easy to implement stacks but hard to implement queues. What you can do to implement queue is to use two lists, a front queue, F, and a back queue, B. When you enqueue elements, you prepend them to the B list and when you dequeue them you take them from the front of the F list. Of course, at some point you will need to move elements from B to F to make this work, which requires a reversal of B.
No worries, though. If F is empty, you will just reverse B and call the result the new F. Amortised analysis tells you this will be fine and you will get constant time queue operations. The reasoning goes like this: every time you enqueue an element you pay one “computation” for doing that and you save one “computation” in the data structure. The length of B is the same as your total savings. When you reverse B to make the result the new F, the time is proportional to B so you have saved up for that. After you have reversed B your savings are depleted, but no worries. It is not a problem to have no savings; you just can’t go below zero.
This reasoning becomes a problem if you want to use the queue as a persistent data structure. Persistent here means that any earlier state of the data structure is potentially available to continue manipulations on.
If I have saved up enough “computation” for one reversal, and I now have an empty F list and a full B list, and I want to dequeue, then that is fine. The savings pay for the reversal of B. But with a persistent data structure, I could dequeue from the same “empty F, full B” queue several times. I only have savings enough to pay for the first one. The analysis breaks because I am using the same savings several times.
Future users of the queue can all access the savings, but the savings are really gone after the first use.
With lazy evaluation you get a different way to do amortised analysis. You can use debt instead of savings. You don’t think in terms of savings that can be used for expensive operations in the future; you think in terms of expensive operations—that you haven’t evaluated yet—as a debt that is in the data structure. Any future user of the data structure that wants to invoke the expensive operation must first have paid of the debt.
If, instead of reversing B whenever F is empty, we update F to be CAT(F,REV(B)) whenever the size of F equals the size of B, but do so lazily, then, to get to the expensive REV(B) operation, a future user will first have to had dequeued as many elements as it costs to reverse B.
The lazy REV(B) is there in the data structure as an expensive debt, but you cannot access it before you have paid off the debt by dequeuing all the cheap elements in F.
Instead of saving up for expensive operations in the future, we put the responsibility on future users to pay off the debt before they can access the expensive operation.
If we define concatenation as a lazy expression CAT(A,B) = CONS(CAR(A), CAT(CDR(A), B)) then we can concatenate in constant time, so the concatenation operation is cheap. There is not really any way of making the reversal operation cheaper than linear, but it is okay, we pay for it through the dequeueing of cheap operations.
You might object, now, that yes we can construct the queue this way, but nothing prevents us then from doing the concatenation and reversal and then dequeue until just before the reversal and leave the queue in this state. Then, we can restart from that queue and dequeue in many different future timelines.
That is a very good objection, and if we just had lazy evaluation this would be a problem. Of course, all future timelines from the point where we make the reversal will be on a timeline that has already paid off the debt, so in some sense we are okay, but we can’t have shared paying off of a debt and then all living happily on the result. It really only works if we remember the results of evaluating an expression so we don’t have to evaluate it again if we access it again later. If we remember the result of evaluated expressions, then only one of the future time lines will actually pay for it. All other timelines, although they are part of paying of the debt, will not see the expensive operation.
You cannot simply implement this using thunks as lazy evaluation. You do need memorisation as well.
I have implemented this in R that has some version of lazy evaluation, but only as parameter expressions. I could just implement the delayed expressions as thunks, but I have to wrap them in parameters passed to the creator of the thunk. So it is not quite this simple, but if you get it right, it is quite elegant.
You can now preorder Advanced Object-Oriented Programming in R: Statistical Programming for Data Science, Analysis, and Finance on Amazon.
I only just got through the technical reviews this week and it won’t be out until the autumn, but you can preorder it already.
I don’t actually write the book descriptions they put on Amazon, Apress does that, but this is what it says:
Learn how to write object-oriented programs in R and how to construct classes and class hierarchies in the three object-oriented systems available in R. This book gives an introduction to object-oriented programming in the R programming language and shows you how to use and apply R in an object-oriented manner. You will then be able to use this powerful programming style in your own statistical programming projects to write flexible and extendable software. After reading Advanced Object-Oriented Programming in R, you’ll come away with a practical project that you can reuse in your own analytics coding endeavors. You’ll then be able to visualize your data as objects that have state and then manipulate those objects with polymorphic or generic methods. Your projects will benefit from the high degree of flexibility provided by polymorphism, where the choice of concrete method to execute depends on the type of data being manipulated.
I’ve gotten started on the next chapter in my Functional Data Structures in R book.
I’ve spent a lot of time playing around with lazy queues, but now I get to heaps. I have already implemented a little there, you can see it in my ralgo package. But there are a lot of data structures I still want to implement for the next chapter.
I think I will also abandon the idea of making a Kindle version of this book. I simply need too much math in it, and the Kindle format doesn’t format it correctly. So, I might only sell it as a PDF or a paperback. If I can sell it to Apress, I will do that. Otherwise, I will look for another publisher.
I don’t think there is anything else on the marked about data structures in R, and I think there should be, so I am having a lot of fun with this book.
Today I implemented one of the data structures from C. Okasaki, Simple and efficient purely functional queues and deques, J. of Functional Programming, 5(4), 583-592, (1995) in R, and wrote about it in the chapter I am working on for my next book.
This is a queue implementation based on lazy lists that combine list concatenation and reversal, to move elements from a “back” list to a “front” list one step at a time as the queue is modified, through a “rotate” function. It requires that you construct the rotations as lazy operations and in its simplest form, the rotation function could look like this:
The function does a combination of concatenation and reversal, that makes sense if you read my two previous posts. The back list is guaranteed to be one element longer than the front list when the function is called, so the base case of the recursion just puts a single element from the back list into an accumulator, and the recursive call puts the first element of the front list at the front of a list that contains the concatenation of the rest of the front list and the reversal of the back list.
Because there is no lazy evaluation of expressions in R, we have to wrap the result in a thunk. Because there is lazy evaluation of expressions, but not the way you would expect, the function doesn’t work. And it was interesting (although also a little frustrating) to figure out why.
The three parameters to the function are passed as promises—unevaluated expressions—and the accumulator is updated by prepending an element to each through a cons call. If this expression is passed along in the recursion as an unevaluated expression, each construction constructs a thunk with another cons call that needs to be evaluated later. Once you get to the end of the recursion and actually do need to access the accumulator, you have to evaluate this expression—so you need to evaluate a list of calls to cons as long as the accumulator is. This will exceed the stack size in R for long lists.
You can get around the problem in different ways. You can explicitly make sure that the cons calls are evaluated immediately
or you can just force the accumulator:
Usually, it is a good idea to force all parameters if you return them in a closure, so that is the conservative approach and probably what you should aim for whenever you can get away with it.
Anyway, you have to wait a little bit to see how this all fits in with functional queues. I need to implement one more data structure before I’m finished with the current chapter.