Computer science, bioinformatics, genetics, and everything in between
Author: Thomas Mailund
My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.
I just finished preparing slides for my new computational thinking class, and since I don’t want to start on anything new half an hour before I leave the office, I went on twitter and saw this interesting post. It’s a programming challenge described here, with an imperative solution, and with a functional solution described in the first post.
The imperative solution is rather involved and looks like this:
The functional solution is more elegant but also a bit slower. Broken into three functions, as it is in the blog post referenced above, it looks like this:
Well, if you want to write fast code, your first thought should be: can I come up with a simple algorithm. Later, you can worry about implementing it.
Here, we have a problem that involves figuring out if a wall at any given position has higher walls to the left and the right, and if so, what the smallest wall to the left or right are. So, very easily, we can break it down into collecting a list of the tallest wall seen to the left and to the right of each position. This is something we can compute with the accumulate and accumulate_right functions from purrr. After that, we can solve the problem by mapping over these two vectors and the vector of wall heights, and we have a three line solution to the challenge.
Not only is this solution much simpler, it is also a lot faster.
Update: Ok, if I had been smarter half an hour ago I would have seen that the pmap expression was overkill. A vector expression will work just fine and be faster.
If you are familiar with Python programming you will also be used to list comprehension. It is syntactic sugar you use to construct lists from other lists (or any structure you can iterate through in general). It isn’t actually an invention from Python — you have it in a lot of programming languages — but Python is probably the most used of these, so you are more likely to know it from there.
List comprehension gives you syntactic sugar to write expressions that combine filtering and mapping. You can construct a list by evaluating an expression for each element in another list and combine it with a predicate that must be true for the element you want to include in the result. Using it, you can write algorithms very succinctly. For example, a quick sort in Python can be implemented in a few lines if you use list comprehension:
R doesn’t have list comprehension, but it does have excellent support for meta-programming that will let you add it if you write your own domain specific language — which will be the topic for one of my upcoming books.
I wanted to implement it as a test for what I want to write about, so I played around with it a bit over the weekend, and came up with a solution that lets me, at least, implement quick sort as simply as I can in Python:
If it works, and it does, it works like this:
Here’s a solution I came up with. It isn’t perfect, but it is a starting point for playing around with it:
I use the rlang package for it, rather than working with raw quotes and evals, but you could just as easily do that. I just want to use the excuse of writing the next book to get more familiar with that package and tidy eval, so that is what I use.
This implementation is a bit more general than Python’s list comprehension — I can handle more than one list and more than one predicate. I Python, you might think you can handle more than one list, but you’d be wrong. You can write expressions that involves more than one, but those are just nested versions of list comprehension. You get an outer product of the lists rather than an inner product. I get do that as well by nesting my solution, but I can handle more.
Anyway, let me walk you through my solution.
The lcomp function takes an arbitrary number of arguments, the first of which must be an expression, the rest captured by the “…” special argument. I translate the first argument into a quoted expression using the enquo function. This works like the quote function, in a way, but it first substitute the parameter the user actually wrote into the expression and then quotes it. Similar to if you used the substitute function, but it handles environments correctly so we can evaluate the expression in the context where the user wrote it. The remaining elements, and there can be an arbitrary number of those, are quoted in a similar way using the quos function.
We set the expression aside for a little bit and consider the other arguments. Those will be the lists we should iterate over and the predicates for them.
The way I intend the function to be used, I will require that you name lists — see the “x = …” argument in the quick sort example — but that you provide the predicates as boolean expressions — the third argument in the calls in the quick sort example.
Named arguments will have a name in the quoted list of expressions, the others won’t, so I can use this to distinguish between lists provided to the function and the predicates over them.
I evaluate the lists in the context where the user defined them using the eval_tidy function and I just get the raw quoted predicates for the predicates using the UQE function.
After these statements, I have in “lists” the named values of the lists provided, and in “predicates” I have expressions with no environment associated. The latter might be a problem that I have to deal with at some point, but I’ll get back to that.
Anyway, I have the lists so I should be able to compute the expression in the first argument to lcomp in the context of each element in the lists. This is something that calls for mapping a function over the lists, but first we must create such a function.
It isn’t entirely trivial how we should evaluate the body of this function. Obviously, the body of the function must be the expression provided as the first argument to lcomp, but this expression could refer to both elements in the lists and variables in the calling context.
The way I chose to implement this is this:
The body of the function I set to be the raw expression that was provided to the lcomp function. This looses all information about the context in which the expression was provided, but I handle that by making that the enclosing environment of the function. Anything that isn’t provided as arguments to the function will be available through this environment, then. For the arguments, I just shove in the lists. This makes the function take the right arguments, and it will have the lists as defaults. The expression could be evaluated just from these default arguments if it is a vector expression and all the arguments are vectors — something we could make sure that the quick sort algorithm would provide — but it won’t work if we are combining lists and vectors and the expression is not vectorised, so I map the function over the lists instead.
Now I have all the values for the output list, but I have ignored the predicates. I did that because the predicates are a bit harder to handle.
The way I handle the predicates is similar to how I compute the values, though. I create functions that evaluate each predicate over the list, and then I “and” them together to get a boolean vector I can use to pick out the values I should return.
Getting the values where all predicates are satisfied is now only a question of returning the values where the “keep_index” vector is true:
This really isn’t an ideal solution. I compute values that I don’t need for the final result, and worse, I might end up raising all kinds of exceptions if some of the predicates are preventing me from evaluating an expression on illegal input.
Ideally, I should compute the predicates first and then use that to pick out the elements in the lists I should evaluate the expression on. But by now, there was a bottle of wine on the table and dinner was ready, so I put that aside for a later play session.
I plan to make list comprehension a chapter in my next R book, so keep an eye out if for it if you want to know what I come up with.
I will put up a draft of that book up on Gumroad as soon as I get something written. As always, if you buy it there early, you get it cheap, and you will get updates up until I finish it. I will try to raise enough to pay a copy editor to go through it before I try to sell it — for Functional Data Structures in R I have made exactly $12.75 which won’t pay for many pages of editing (it might pay for salt on the bread but not the bread), so it didn’t happen there, but if you help me out you will get a proof read book. If you send me comments and suggestions, though, I will try to get you a printed copy if I manage to sell the book. I only get ten copies, though, so I can only send a copy to those that really help.
There is definitely still some work to be done, but for now I need to put it aside for a few weeks. I’m hoping to get some feedback on it from some algorithmic people and then make a final version I can send to Apress if they decide to give me an offer.
In the mean time, I’ve started thinking about the next R book. I think I will write about embedded domain specific languages. I already have some ideas for what to include, but not yet enough for a full book, so some more thinking is required.
It will be a lot of meta-programming, but this time around I will base it on tidyeval instead of raw quotes and eval. Tidyeval, in the rlang package, provides a lot of great tools to design and implement domain specific languages, and it will be fun to play around with that.
If I can get my new combination of iA Writer and WordPress to play nice together, I will give you an example in a post very soon.
I got an email last night from Apress. I had sent them a few chapters of Functional Data Structures in R back in spring, but back there we agreed to talk about it again in the autumn. Considering that they have published four of my books this year already, it makes sense not to publish more right now and start competing with myself.
Anyway, I got this mail with a suggestion for the back cover description and the message that the editor who wrote me, Steve Anglin, would try to get approval for sending an offer as early as next week.
Thing is, I am not actually done with the book yet. I have a complete draft, but I have only proofread a few chapters, and the way I write, I have lots of mistakes in a first draft. I write fast and I am not too careful with checking what I have written in the first iteration—it helps me get some writing done quickly and I know I will need to go through the text later to edit it anyway.
I will go through the remaining chapters over the coming week and get them proofread, but I also want to consider if there is more I should add to the book. I had plans to get some of my computer science friends to read it and make suggestions, but if I am selling it soon I don’t know if they will be willing to read it in a hurry. I am sure it would make the book better, though, so I will try.
If you are into data structures and willing to give it a quick read through, though, please let me know. I am not talking about proofreading or copy editing so much at this stage, but checking the data structures I describe are described correctly and that what I write is up to date with the literature, and of course make suggestions for more to add. If you are willing to, I will get you one of my free copies when the book is published in return.
If you are not up to giving the book a quick read through right now, but are interested in it, you might want to consider getting it now at Gumroad. I will update the book there once I’m done with proofreading and again if I add more material, but after the book is sold I will take it down. Right now the price at Gumroad is $1.99 (but you can pay more if you want). After I’m done with proofreading it will go up to $2.99. I spend the money on writing software and subscriptions—such as a subscription to Grammarly that helps me a lot with the proofreading—so I don’t charge a lot. Once the publisher needs his cut, though, the price goes up. My current published books are priced in the range $19-$35 (and my cut is around the same as when I sold them myself), so if you are happy with a PDF version, you should consider getting it now.
Of course, for programming books like this, you might prefer printed versions. I know I do. So I am very happy that this one looks like it will be published in proper print before long.
I have been using both Ulysses and iA Writer for a while now, and I have both applications on both my Macs and iOS devices. For my technical books, where I need to compile the text with Pandoc, I have used iA Writer—it is easier to use with plain Markdown files than Ulysses, I find—but with other books, I have used Ulysses.
Back when I started using the tools, you couldn’t embed images and such in iA Writer, but you can now (although with a different syntax than usual Markdown), and images are automatically uploaded to Medium and WordPress, something I find really useful and that Ulysses supported when I started using the two apps.
I like both apps, but I am not sure I will be using both if they start charging a subscription fee. Sure, since I have bought Ulysses earlier, I get a lifetime discount, and I am considering it, but on the other hand, I am mainly using iA Writer for my writing. In the most recent writing, I have really only used Ulysses for blogging. I can do that in iA Writer as well—in fact, I am writing this post in iA Writer.
If I switch to iA Writer completely, there are a few features I will miss:
The previews in Ulysses are nicer. You can create and install custom templates—although I don’t know exactly now that works—so I suppose it is possible to work around that.
iA Writer cannot export to ePub, so even for simpler books, I would have to go through Pandoc to make ebooks. A slightly annoying thing here is that Markdown and iA writer includes images and other files differently so there might have to be some processing to make this easy.
I will miss the word count tracking I can do in Ulysses, where I see a progress wheel that tells me how I am in relation to a goal. Those help me get roughly the correct word count that I have planned for. However, for all the books I pipe through Pandoc, I have used a spreadsheet for this anyway so it won’t be a major change.
I was just about to say that I would miss the way I can group sheets and folders and such in Ulysses, but I can see that I can make folders in the iCloud library, and that combined with including other files through their names, I don’t think this will be much of a problem. But I might think otherwise once I have to export to Markdown to create ePub files via Pandoc. I don’t know.
Anyway, I think my status, for now, is that I will try using iA Write exclusively for a while, and see how it feels. In the mean time, I still have my current Ulysses apps that I can use without a subscription, so I don’t have to switch right away, and I am not aware of any new features in Ulysses that makes that worthwhile yet.
If I find that I miss Ulysses, I will get the subscription, of course. I already have lots of subscriptions for writing anyway, with ReadCube, Bear and Grammarly all being more expensive than what Ulysses costs, but if I do most of my writing in iA Writer anyway, it will feel a bit silly to have a subscription to Ulysses just for blogging…