Over the weekend I finished updating the manuscript for Functional Data Structures in R after the tech reviews and proof-reading from a friend of mine. It is now being processed by Apress and I should get the proofs shortly. After that, it goes to print. I was happy to see that there is already a cover done, and you can pre-order it on Amazon. Probably a good idea if you are interested in it, because the pre-order cost is lower than my other (and shorter) Advanced Statistical Programming books price tags.
The only thing I worry about a bit is the description on Amazon. It says that the book covers implementing data structures in C/C and how to wrap those.
It doesn’t. I do write that this is an option, and a valid one, if you want to write really efficient R packages, but the point I am trying to make in the book is that you can get most of the performance gains you need by using appropriate data structures rather than micro-optimisations, and all the data structures I present are implemented in pure R.
The publisher and I are fixing the description. It is a pure R book, and mostly about purely functional data structures, although there are some data structures using environments to modify states.
I just finished preparing slides for my new computational thinking class, and since I don’t want to start on anything new half an hour before I leave the office, I went on twitter and saw this interesting post. It’s a programming challenge described here, with an imperative solution, and with a functional solution described in the first post.
The imperative solution is rather involved and looks like this:
The functional solution is more elegant but also a bit slower. Broken into three functions, as it is in the blog post referenced above, it looks like this:
Well, if you want to write fast code, your first thought should be: can I come up with a simple algorithm. Later, you can worry about implementing it.
Here, we have a problem that involves figuring out if a wall at any given position has higher walls to the left and the right, and if so, what the smallest wall to the left or right are. So, very easily, we can break it down into collecting a list of the tallest wall seen to the left and to the right of each position. This is something we can compute with the accumulate and accumulate_right functions from purrr. After that, we can solve the problem by mapping over these two vectors and the vector of wall heights, and we have a three line solution to the challenge.
Not only is this solution much simpler, it is also a lot faster.
Update: Ok, if I had been smarter half an hour ago I would have seen that the pmap expression was overkill. A vector expression will work just fine and be faster.
If you are familiar with Python programming you will also be used to list comprehension. It is syntactic sugar you use to construct lists from other lists (or any structure you can iterate through in general). It isn’t actually an invention from Python — you have it in a lot of programming languages — but Python is probably the most used of these, so you are more likely to know it from there.
List comprehension gives you syntactic sugar to write expressions that combine filtering and mapping. You can construct a list by evaluating an expression for each element in another list and combine it with a predicate that must be true for the element you want to include in the result. Using it, you can write algorithms very succinctly. For example, a quick sort in Python can be implemented in a few lines if you use list comprehension:
R doesn’t have list comprehension, but it does have excellent support for meta-programming that will let you add it if you write your own domain specific language — which will be the topic for one of my upcoming books.
I wanted to implement it as a test for what I want to write about, so I played around with it a bit over the weekend, and came up with a solution that lets me, at least, implement quick sort as simply as I can in Python:
If it works, and it does, it works like this:
Here’s a solution I came up with. It isn’t perfect, but it is a starting point for playing around with it:
I use the rlang package for it, rather than working with raw quotes and evals, but you could just as easily do that. I just want to use the excuse of writing the next book to get more familiar with that package and tidy eval, so that is what I use.
This implementation is a bit more general than Python’s list comprehension — I can handle more than one list and more than one predicate. I Python, you might think you can handle more than one list, but you’d be wrong. You can write expressions that involves more than one, but those are just nested versions of list comprehension. You get an outer product of the lists rather than an inner product. I get do that as well by nesting my solution, but I can handle more.
Anyway, let me walk you through my solution.
The lcomp function takes an arbitrary number of arguments, the first of which must be an expression, the rest captured by the “…” special argument. I translate the first argument into a quoted expression using the enquo function. This works like the quote function, in a way, but it first substitute the parameter the user actually wrote into the expression and then quotes it. Similar to if you used the substitute function, but it handles environments correctly so we can evaluate the expression in the context where the user wrote it. The remaining elements, and there can be an arbitrary number of those, are quoted in a similar way using the quos function.
We set the expression aside for a little bit and consider the other arguments. Those will be the lists we should iterate over and the predicates for them.
The way I intend the function to be used, I will require that you name lists — see the “x = …” argument in the quick sort example — but that you provide the predicates as boolean expressions — the third argument in the calls in the quick sort example.
Named arguments will have a name in the quoted list of expressions, the others won’t, so I can use this to distinguish between lists provided to the function and the predicates over them.
I evaluate the lists in the context where the user defined them using the eval_tidy function and I just get the raw quoted predicates for the predicates using the UQE function.
After these statements, I have in “lists” the named values of the lists provided, and in “predicates” I have expressions with no environment associated. The latter might be a problem that I have to deal with at some point, but I’ll get back to that.
Anyway, I have the lists so I should be able to compute the expression in the first argument to lcomp in the context of each element in the lists. This is something that calls for mapping a function over the lists, but first we must create such a function.
It isn’t entirely trivial how we should evaluate the body of this function. Obviously, the body of the function must be the expression provided as the first argument to lcomp, but this expression could refer to both elements in the lists and variables in the calling context.
The way I chose to implement this is this:
The body of the function I set to be the raw expression that was provided to the lcomp function. This looses all information about the context in which the expression was provided, but I handle that by making that the enclosing environment of the function. Anything that isn’t provided as arguments to the function will be available through this environment, then. For the arguments, I just shove in the lists. This makes the function take the right arguments, and it will have the lists as defaults. The expression could be evaluated just from these default arguments if it is a vector expression and all the arguments are vectors — something we could make sure that the quick sort algorithm would provide — but it won’t work if we are combining lists and vectors and the expression is not vectorised, so I map the function over the lists instead.
Now I have all the values for the output list, but I have ignored the predicates. I did that because the predicates are a bit harder to handle.
The way I handle the predicates is similar to how I compute the values, though. I create functions that evaluate each predicate over the list, and then I “and” them together to get a boolean vector I can use to pick out the values I should return.
Getting the values where all predicates are satisfied is now only a question of returning the values where the “keep_index” vector is true:
This really isn’t an ideal solution. I compute values that I don’t need for the final result, and worse, I might end up raising all kinds of exceptions if some of the predicates are preventing me from evaluating an expression on illegal input.
Ideally, I should compute the predicates first and then use that to pick out the elements in the lists I should evaluate the expression on. But by now, there was a bottle of wine on the table and dinner was ready, so I put that aside for a later play session.
I plan to make list comprehension a chapter in my next R book, so keep an eye out if for it if you want to know what I come up with.
I will put up a draft of that book up on Gumroad as soon as I get something written. As always, if you buy it there early, you get it cheap, and you will get updates up until I finish it. I will try to raise enough to pay a copy editor to go through it before I try to sell it — for Functional Data Structures in R I have made exactly $12.75 which won’t pay for many pages of editing (it might pay for salt on the bread but not the bread), so it didn’t happen there, but if you help me out you will get a proof read book. If you send me comments and suggestions, though, I will try to get you a printed copy if I manage to sell the book. I only get ten copies, though, so I can only send a copy to those that really help.
There is definitely still some work to be done, but for now I need to put it aside for a few weeks. I’m hoping to get some feedback on it from some algorithmic people and then make a final version I can send to Apress if they decide to give me an offer.
In the mean time, I’ve started thinking about the next R book. I think I will write about embedded domain specific languages. I already have some ideas for what to include, but not yet enough for a full book, so some more thinking is required.
It will be a lot of meta-programming, but this time around I will base it on tidyeval instead of raw quotes and eval. Tidyeval, in the rlang package, provides a lot of great tools to design and implement domain specific languages, and it will be fun to play around with that.
If I can get my new combination of iA Writer and WordPress to play nice together, I will give you an example in a post very soon.