Thinking more about Domain Specific Languages in R

I got started on a new book, on domain-specific languages in R, a little while back and also got started on the first proper chapter. Then I had to leave it for a bit to work on another project, but while it’s been simmering I’ve started to have second thoughts about its structure.

What I had in mind when I started was another short book where I would explain a little bit of meta-programming and a little bit about operator overloading and show how you could combine this to write small embedded languages in R. However, I’m now halfway through the first chapter that was supposed to be about how to use classes and operator overloading to do a bit of this, and it turned out to be all about proper language parsing and syntax tree manipulation, so now I’m thinking that I might as well do a proper job of it.

I’m not entirely sure how to structure the book now, but I think I still want some of the easy stuff in there. Maybe start out with a chapter that shows the easy approach. I have a small example where I use delayed evaluation and some rewriting to optimise matrix multiplication, and I could use that as a motivating example. Then the next chapter could go into some theory about parsing and evaluating expressions, to motivate the rest of the book.

I had a distinction in mind between standard evaluation and non-standard evaluation, but I’m not sure that is the right way to split the book now. Instead, if I am going to treat the topic properly, I will make the distinction between parsing an embedded language and evaluating it. The evaluation part can then have some (simpler) standard evaluation parts and some (perhaps harder) non-standard evaluation parts.

I’ll be sitting on a train for an hour and a half today, so I have time to think about that.

Functional Data Structures in R is getting closer to publication

Over the weekend I finished updating the manuscript for Functional Data Structures in R after the tech reviews and proof-reading from a friend of mine. It is now being processed by Apress, and I should get the proofs shortly. After that, it goes to print. I was happy to see that there is already a cover done, and you can pre-order it on Amazon. Pre-ordering is probably a good idea if you are interested in it, because the pre-order price is lower than the price tags on my other (and shorter) Advanced Statistical Programming books.

The only thing I worry about a bit is the description on Amazon. It says that the book covers implementing data structures in C/C++ and how to wrap those.

It doesn’t. I do write that this is an option, and a valid one, if you want to write really efficient R packages, but the point I am trying to make in the book is that you can get most of the performance gains you need by using appropriate data structures rather than micro-optimisations, and all the data structures I present are implemented in pure R.

The publisher and I are fixing the description. It is a pure R book, and mostly about purely functional data structures, although there are a few data structures that use environments to modify state.

List comprehension in R

First of all, my apologies for showing code examples as screenshots in this post. I couldn’t get Medium and WordPress to show the code correctly. If you want to copy the actual code to play with it, you can get it here: https://github.com/mailund/dsl/blob/master/R/lcomp/lcomp.R

Right! To it!

If you are familiar with Python programming you will also be used to list comprehension. It is syntactic sugar you use to construct lists from other lists (or any structure you can iterate through in general). It isn’t actually an invention from Python — you have it in a lot of programming languages — but Python is probably the most used of these, so you are more likely to know it from there.

List comprehension gives you syntactic sugar to write expressions that combine filtering and mapping. You can construct a list by evaluating an expression for each element in another list and combine it with a predicate that must be true for the element you want to include in the result. Using it, you can write algorithms very succinctly. For example, a quick sort in Python can be implemented in a few lines if you use list comprehension:
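
    # a standard list-comprehension quick sort, more or less; the details are not important
    def qsort(lst):
        if len(lst) < 2:
            return lst                                  # empty or single-element lists are sorted
        pivot, rest = lst[0], lst[1:]
        smaller = [x for x in rest if x < pivot]        # filter and map in one expression
        larger = [x for x in rest if x >= pivot]
        return qsort(smaller) + [pivot] + qsort(larger)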

R doesn’t have list comprehension, but it does have excellent support for meta-programming that will let you add it if you write your own domain specific language — which will be the topic for one of my upcoming books.

I wanted to implement it as a test case for what I want to write about, so I played around with it a bit over the weekend and came up with a solution that at least lets me implement quick sort as simply as I can in Python:
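
    # roughly like this; the exact code is in the repository linked above,
    # and the variable names here are just my choice
    qsort <- function(lst) {
      if (length(lst) < 2) return(lst)   # the base case: already sorted
      pivot <- lst[[1]]
      rest <- lst[-1]
      # lcomp(expression, named lists, predicates); see the walk-through below
      c(qsort(lcomp(x, x = rest, x < pivot)),
        pivot,
        qsort(lcomp(x, x = rest, x >= pivot)))
    }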

If it works, and it does, it works like this:
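
    unlist(qsort(sample(1:10)))   # qsort builds a list, so unlist it to display the result
    ## [1]  1  2  3  4  5  6  7  8  9 10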

Here’s a solution I came up with. It isn’t perfect, but it is a starting point for playing around with it:
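
    # A sketch of the solution, written to match the walk-through below. The exact
    # code is in the repository linked at the top of this post and may differ in
    # details; in particular, I use rlang::get_expr() here where I talk about UQE below.
    library(rlang)

    lcomp <- function(expr, ...) {
      expr <- enquo(expr)   # the expression, quoted together with its environment
      rest <- quos(...)     # the remaining arguments, quoted the same way

      # named arguments are the lists to iterate over, unnamed ones are predicates
      lists <- Map(eval_tidy, rest[names(rest) != ""])
      predicates <- Map(get_expr, rest[names(rest) == ""])

      # a function whose formals are the lists, whose body is the raw expression,
      # and whose enclosing environment is where the expression was written
      f <- new_function(lists, get_expr(expr), get_env(expr))
      values <- do.call(Map, c(f, lists))

      # evaluate each predicate the same way and "and" the results together
      keep_index <- rep(TRUE, length(values))
      for (pred in predicates) {
        p <- new_function(lists, pred, get_env(expr))
        keep_index <- keep_index & unlist(do.call(Map, c(p, lists)))
      }

      values[keep_index]
    }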

I use the rlang package for it, rather than working with raw quotes and evals, but you could just as easily do that. I just want to use the excuse of writing the next book to get more familiar with that package and tidy eval, so that is what I use.

This implementation is a bit more general than Python’s list comprehension: I can handle more than one list and more than one predicate. In Python, you might think you can handle more than one list, but you’d be wrong. You can write expressions that involve more than one, but those are just nested list comprehensions. You get an outer product of the lists rather than an inner product. I can do that as well by nesting my solution, but I can also handle parallel lists and several predicates at once.
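
With the sketch above you can, for example, combine two parallel lists and two predicates in a single call:

    unlist(lcomp(x * y, x = 1:6, y = 6:1, x > 1, y > 1))   # pairs are taken in parallel
    ## [1] 10 12 12 10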

Anyway, let me walk you through my solution.

The lcomp function takes an arbitrary number of arguments, the first of which must be an expression, with the rest captured by the “…” special argument. I translate the first argument into a quoted expression using the enquo function. This works like the quote function, in a way, but it first substitutes the argument the user actually wrote into the expression and then quotes it. It is similar to using the substitute function, but it handles environments correctly, so we can evaluate the expression in the context where the user wrote it. The remaining arguments, and there can be an arbitrary number of those, are quoted in a similar way using the quos function.
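
The difference between quote, substitute, and enquo is easiest to see with a tiny example:

    f <- function(x) quote(x)          # quotes the literal symbol x, whatever the caller wrote
    g <- function(x) substitute(x)     # quotes the expression the caller actually wrote
    h <- function(x) rlang::enquo(x)   # like substitute, but also records the caller's environment
    f(a + b)   # x
    g(a + b)   # a + b
    h(a + b)   # a quosure wrapping a + b together with the environment it was written in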

We set the expression aside for a little bit and consider the other arguments. Those will be the lists we should iterate over and the predicates for them.

The way I intend the function to be used, you have to name the lists (see the “x = …” argument in the quick sort example), but you provide the predicates as unnamed boolean expressions (the third argument in the calls in the quick sort example).

Named arguments will have a name in the quoted list of expressions, the others won’t, so I can use this to distinguish between lists provided to the function and the predicates over them.

I evaluate the lists in the context where the user defined them, using the eval_tidy function, and for the predicates I just get the raw quoted expressions, using the UQE function.
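
In the sketch above, that split looks like this (get_expr pulls the raw expression out of a quosure, which is essentially what UQE did; UQE has since been retired from rlang):

    lists <- Map(eval_tidy, rest[names(rest) != ""])      # named arguments: evaluate to get the lists
    predicates <- Map(get_expr, rest[names(rest) == ""])  # unnamed arguments: keep as raw expressions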

After these statements, I have in “lists” the named values of the lists provided, and in “predicates” I have expressions with no environment associated. The latter might be a problem that I have to deal with at some point, but I’ll get back to that.

Anyway, I have the lists so I should be able to compute the expression in the first argument to lcomp in the context of each element in the lists. This is something that calls for mapping a function over the lists, but first we must create such a function.

It isn’t entirely trivial how we should evaluate the body of this function. Obviously, the body of the function must be the expression provided as the first argument to lcomp, but this expression could refer to both elements in the lists and variables in the calling context.

The way I chose to implement this is this:
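
    # the corresponding lines from the sketch above, with the arguments spelled out
    f <- new_function(args = lists,            # formals: the lists, which also become the defaults
                      body = get_expr(expr),   # body: the raw expression the user provided
                      env = get_env(expr))     # enclosure: the environment the expression came from
    values <- do.call(Map, c(f, lists))        # apply the function to each element of the lists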

The body of the function I set to be the raw expression that was provided to the lcomp function. This loses all information about the context in which the expression was provided, but I handle that by making that context the enclosing environment of the function. Anything that isn’t provided as an argument to the function will then be available through this environment. For the arguments, I just shove in the lists. This makes the function take the right arguments, and it will have the lists as defaults. The expression could be evaluated just from these default arguments if it is a vector expression and all the arguments are vectors (something we could make sure the quick sort algorithm provides), but it won’t work if we are combining lists and vectors or if the expression is not vectorised, so I map the function over the lists instead.

Now I have all the values for the output list, but I have ignored the predicates. I did that because the predicates are a bit harder to handle.

The way I handle the predicates is similar to how I compute the values, though. I create functions that evaluate each predicate over the list, and then I “and” them together to get a boolean vector I can use to pick out the values I should return.
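
In the sketch, that is the loop at the end of the function:

    keep_index <- rep(TRUE, length(values))
    for (pred in predicates) {
      p <- new_function(lists, pred, get_env(expr))   # same construction as for the values
      keep_index <- keep_index & unlist(do.call(Map, c(p, lists)))
    }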

Getting the values where all predicates are satisfied is now only a question of returning the values where the “keep_index” vector is true:
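
    values[keep_index]   # the last expression in lcomp, and thus its return value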

This really isn’t an ideal solution. I compute values that I don’t need for the final result, and worse, I might end up raising all kinds of exceptions if some of the predicates are there precisely to guard against evaluating the expression on illegal input.

Ideally, I should compute the predicates first and then use that to pick out the elements in the lists I should evaluate the expression on. But by now, there was a bottle of wine on the table and dinner was ready, so I put that aside for a later play session.

I plan to make list comprehension a chapter in my next R book, so keep an eye out for it if you want to know what I come up with.

I will put a draft of that book up on Gumroad as soon as I get something written. As always, if you buy it there early, you get it cheap, and you will get updates up until I finish it. I will try to raise enough to pay a copy editor to go through it before I try to sell it. For Functional Data Structures in R I have made exactly $12.75, which won’t pay for many pages of editing (it might pay for the salt on the bread but not the bread), so it didn’t happen there, but if you help me out you will get a proofread book. If you send me comments and suggestions, though, I will try to get you a printed copy if I manage to sell the book. I only get ten copies, though, so I can only send a copy to those who really help.

Update on Functional Data Structures in R

I’ve gone through the entire manuscript of Functional Data Structures in R now and edited it.

There is definitely still some work to be done, but for now I need to put it aside for a few weeks. I’m hoping to get some feedback on it from some algorithmic people and then make a final version I can send to Apress if they decide to give me an offer.

In the meantime, I’ve started thinking about the next R book. I think I will write about embedded domain specific languages. I already have some ideas for what to include, but not yet enough for a full book, so some more thinking is required.

It will be a lot of meta-programming, but this time around I will base it on tidyeval instead of raw quotes and eval. Tidyeval, in the rlang package, provides a lot of great tools to design and implement domain specific languages, and it will be fun to play around with that.

If I can get my new combination of iA Writer and WordPress to play nice together, I will give you an example in a post very soon.

I might have sold Functional Data Structures in R already

I got an email last night from Apress. I had sent them a few chapters of Functional Data Structures in R back in spring, but back then we agreed to talk about it again in the autumn. Considering that they have published four of my books this year already, it makes sense not to publish more right now and start competing with myself.

Anyway, I got this mail with a suggestion for the back cover description and the message that the editor who wrote me, Steve Anglin, would try to get approval for sending an offer as early as next week.

Thing is, I am not actually done with the book yet. I have a complete draft, but I have only proofread a few chapters, and the way I write, I have lots of mistakes in a first draft. I write fast and I am not too careful with checking what I have written in the first iteration—it helps me get some writing done quickly and I know I will need to go through the text later to edit it anyway.

I will go through the remaining chapters over the coming week and get them proofread, but I also want to consider if there is more I should add to the book. I had plans to get some of my computer science friends to read it and make suggestions, but if I am selling it soon I don’t know if they will be willing to read it in a hurry. I am sure it would make the book better, though, so I will try.

If you are into data structures and willing to give it a quick read-through, though, please let me know. I am not talking about proofreading or copy editing so much at this stage, but about checking that the data structures I describe are correct and up to date with the literature, and of course about making suggestions for more to add. If you are willing, I will get you one of my free copies in return when the book is published.

If you are not up to giving the book a quick read through right now, but are interested in it, you might want to consider getting it now at Gumroad. I will update the book there once I’m done with proofreading and again if I add more material, but after the book is sold I will take it down. Right now the price at Gumroad is $1.99 (but you can pay more if you want). After I’m done with proofreading it will go up to $2.99. I spend the money on writing software and subscriptions—such as a subscription to Grammarly that helps me a lot with the proofreading—so I don’t charge a lot. Once the publisher needs his cut, though, the price goes up. My current published books are priced in the range $19-$35 (and my cut is around the same as when I sold them myself), so if you are happy with a PDF version, you should consider getting it now.

Of course, for programming books like this, you might prefer printed versions. I know I do. So I am very happy that this one looks like it will be published in proper print before long.