Dictating instead of writing today

I have done something to my back that makes it very painful to sit at the computer today. It’s probably because I’ve been sitting too long at the computer the last couple of days. I have had this problem before, but usually only if I’ve been using the mouse too much. That hasn’t been a problem lately, but I do spend a lot of hours at the computer these days compared to some weeks ago. Anyway, I have some lecture notes that I would like to finish this week, so I can’t really give it a complete rest. So this gives me an excuse to try dictating text instead of writing.

I am dictating this blog post on my phone. I am dictating in the Ulysses editor. That is the editor I use for blogging. For some reason, I cannot dictate in Ulysses on my iPad. On the iPad I can dictate to iA Writer, but not Ulysses. On the phone I can dictate in both.

I’m surprised at how well dictation is working. When I tried it some years ago it really didn’t understand my thick Danish accent. Now it seems to understand most of the words I’m saying; I do have to change a few of them after I’ve dictated, but mostly to get stuff right. I still prefer using the keyboard over the microphone, but dictation is a working solution while my back is painful.

Fun with non-standard evaluation and binding parameters

I am thinking about examples for something useful but simple that can be done with non-standard evaluation in R for when I, in the far future, get to that book. And preferably something I can find a use for myself so it isn’t just wasted on writing but also helps me in my programming.

One thing I miss in R is assigning to more than one variable in pattern matching, or just for tuples or lists for that matter. So that is something I could start playing with. So I did, and you can see the first result on GitHub.

Right now it is just equivalent to tuple/list assignment in Python where you can assign multiple values like this:

x, y, z = 1, 2, 3

and you can use this for vectors and lists. Nothing fancy with pattern matching at all.

The syntax is inspired by the boost implementation of this in C++ and you use a function bind to specify the variables you want to bind and then an infix operator %< -% to assign to the binding.

bind(x, y, z) %<-% 1:3

I briefly considered overwriting the assignment operator, <-, because I just know that I am likely to write

bind(x, y, z) <- 1:3

in such expressions, but sanity prevailed and I left the assignment operator the way it is. It breaks all kinds of things that I cannot fully understand if I mess with it. So therefore a new infix operator %<-%.

The implementation is pretty simple. I just collect the parameters and the enclosing environment in the bind function and then put values in the environment in the %<-% function.

bind <- function(...) {
  bindings <- eval(substitute(alist(...)))
  scope <- parent.frame()
  structure(list(bindings = bindings, scope = scope), class = "bindings")
}

.unpack < - function(x) UseMethod(".unpack")
.unpack.default <-  function(x) x
.unpack.list <- function(x) x[[1]]
<code>%<-%</code> <- function(bindings, value) {
  if (length(bindings$bindings) > length(value))
    stop("More variables than values to bind.")

for (i in seq_along(bindings$bindings)) {
    variable <- bindings$bindings[[i]]
    val <- .unpack(value[i])
    assign(as.character(variable), val, envir = bindings$scope)
  }
}

(Sorry for the formatting of the %<-% name here, I don’t know why the highlighter absolutely has to replace backticks with code-tags, but it does).

The unpack helper functions are just there so I get the value when I have a list and not a length-one list. Otherwise, there is nothing much to it.

For my next trick, I need to figure out how I can bind variables by accessing the names in a value list.

Trying out the Ulysses editor

For some reason iA Writer stays at ~50% CPU on my laptop, draining the battery and heating up the entire house, so I went looking for another editor to play with.

I have been looking at Ulysses for a while and found it a bit pricy, but what the hell, if it is a great editor it is worth it. So I am trying it out, and I am writing this post on the iPad version. For some reason I had trouble connecting with WordPress from the laptop.

Actually I can not connect to WordPress.com on either version of the app but I can connect to my own server on both. There is just a timeout when I try to publish from the laptop. I will look into that later.

I am pretty happy with it so far. There is a nice distraction free mode, reasonable word count and other statistics, and the way text is organized as sheets you can move around, although something you need to get used to, is pretty neat.

My only main concern is that you only have all the features if your files are imported into the application. You can edit plain Markdown files but you only get a limited number of the features then. Which means that it is better to import but for my writing that destroys the workflow I have with knitr and Pandoc. I have to remember to export explicitly before I can compile a document.

Now Ulysses actually have some nice style sheets for exporting so I guess it isn’t a problem for most, but I do need to run my documents through knitr for evaluating R code so I can’t work exclusively with Ulysses. I will see how it goes during the next week where I will try to do my writing in Ulysses.

I haven’t tested things like cross reference and citations yet so I don’t knowi what f it can do that. Otherwise I definitely need the Pandoc workflow and then it is annoying if it is necessary to explicitly export.

We will see how I like it after a week, I guess.

Adding units as types to numbers in R

This weekend I’ve been playing with an R package that assigns units to numerical data and checks unit consistency in arithmetic expressions.

It is something I’ve been wanting to write for a couple of weeks, after I had a chat with one of my PhD students that involved a lot of unit conversions. It is frighteningly easy to get the units wrong in an expression if you are not careful, but keeping track of units can easily be done completely automatically by a computer. Unlike my brain, the computer doesn’t have any problems keeping track of what units, which we can think of much like types, are associated with a value.

Such a package is certainly something I could find useful in my own work, when I am analysing coalescent models and need to scale between mutation rates, in years or generations, and coalescence rates, in 2N generations, where I never get it right the first time around. I could also see it as a nice example of something to include in my next book on object oriented programming in R. So looking into it has been on my todo-list for a little while.

Saturday morning I googled around and found that there was already such an R package on GitHub. I had been beaten by a couple of months. So I had a look at the existing package.

It assigns units to numerical values by setting an attribute that describes the units as a string and then overloads arithmetic operations so they check the equivalence of types when you add or subtract them, possibly by converting types to make them equal, and modifies the strings when you multiply or divide. All the manipulation of types is outsourced to the udunit2 package. Because the units are simply strings they are not simplified at the end of an operation.

This works fine, but isn’t a good example for me to use in a book, and it only works if you stick to units known by udunits2. So I dug into the code to see if I could write something based on it that I could use for my two purposes.

My version can be found here.

Representing expressions

The existing package creates unit holding numerical types using the standard pattern in R and has an as.units function for it. It takes a numeric type, a vector or matrix for example, and a string describing the units.

x < - as.units(1:4, "km/h")

The units are checked against the udunits2 package to make sure that they are valid units.

That check of the units won’t work if I want to allow user-defined units, so I had to get rid of that. If I allow arbitrary strings as units, though, a simple spelling mistake will introduce a new unit. Since expressions involving units types will be checked for unit consistency it is not something that will go unnoticed and cause subtle bugs, but it is still worth avoiding.

So instead of having arbitrary strings as units I decided to require that units are explicitly defined, but by being able to multiply plain numbers to units we can get a natural syntax for defining numerical values with associated units.

km <- make_unit("km")
h <- make_unit("h")
x <- 1:4 * km/h

I have pulled the units defined in udunits2 and put them in a list, ud_units, so you can also just evaluate expressions in the context of it and write

with(ud_units, {
     x <- 1:4 * km/h
})

To be able to manipulate units I have an explicit representation of the unit types. Objects of the units class are just numerical values with an attribute that points to a class of type symbolic_units and it is the latter that represents the actual unit type.

So far I have just implemented it as two lists, the nominator and denominator of units. Each are just represented as a sequence of unit dimensions. This isn’t quite as general as one could hope for, you cannot have non-integer numbers of a unit and you cannot have a log-transformation of a unit and such, but it is a start. I believe the design can be made more general by having the symbolic_units represented as expressions instead of an explicit fraction of unit symbols, but I will have to think a bit more about that.

But because I have a representation of units I can work with in R, I can simplify units so for example m/s/m will be reduced to type 1/s. Such a reduction is simple enough, you can just run through the nominator and denominator and catch when a symbol appears both places and then remove it from both nominator and denominator. Since m*m/m should have type m you can only remove a symbol the right number of times, but that is not hard to get right.

Unit conversion

Of course, you would also want to reduce something like km/m since that is essentially just 1000*m/m and that makes it a little more tricky. For each symbol pair you compare you should check if you can convert one to the other, not just whether they are the same symbol, and you need to keep track of conversion constants.

I didn’t come up with a nice way of doing it, so I just brute-forced it. I don’t expect unit lists to particularly long, so I don’t worry too much about runtime issues (and if that becomes a problem you can get a speedup just by removing the type in time-critical parts).

My implementation looks like this:

.simplify_units <- function(sym_units) {

# This is just a brute force implementation that takes each element in the
   # nominator and tries to find a value in the denominator that can be converted
   # to the same unit. If so, we pull out the conversion constant, get rid of
   # both terms, and move on. At the end we return a units object with the
   # conversion constant and the new symbolic units type. Converting units can then
   # be done as this <code>x <- as.numeric(x) * .simplify_units(units(x))</code>.

# Returning a units instead of a symbolic_units object is not idea, it means that
   # you cannot simply multiply or divide symbolic units together, you need to wrap
   # each pair-wise operator in units() but it is necessary when conversion constants
   # must be taken into account.

conversion_constant <- 1
   new_nominator <- sym_units$nominator
   new_denominator <- sym_units$denominator

delete_nom < - c()
   for (i in seq_along(new_nominator)) {
     for (j in seq_along(new_denominator)) {
       conversion <- .get_unit_conversion_constant(new_nominator[i], new_denominator[j])
       if (!is.na(conversion)) {
         conversion_constant <- conversion_constant * conversion
         delete_nom <- c(delete_nom, i)
         new_denominator <- new_denominator[-j]
         break
       }
     }
   }
   if (length(delete_nom) > 0)
     new_nominator < - new_nominator[-delete_nom]

as.units(conversion_constant, .symbolic_units(new_nominator, new_denominator))
}

The function .get_unit_conversion_constant function is responsible for checking if I can convert one unit into another. If so, I get the conversion constant, if not I get NA. This function is also used when explicitly changing a type in expressions like this:

x <- 1:4 m/s
units(x) <- km/h  # conversion from meters per seconds to km per hour

That conversion simply looks like this:

units<-.units <- function(x, value) {
   if (units(x) == value) # do nothing:
     return(x)

conversion_constant <- .get_conversion_constant(units(x), value)
   if (is.na(conversion_constant)) {
     stop(paste("cannot convert", units(x), "into", value))
   }

x <- conversion_constant * x
   attr(x, "units") <- value
   x
}

There is also some parameter checking at the beginning but I have cut that here to just show the relevant code. The .get_conversion_constant function checks if I can convert an entire units definition into another and it does that separately for nominator and denominator. If units always have their unit representation simplified that should give the right result. And getting the conversion constant is just a question of matching up unit names once more.

.get_conversion_constant_sequence <- function(s1, s2) {
   conversion_constant <- 1
   remaining_s2 <- s2
   for (i in seq_along(s1)) {
     for (j in seq_along(remaining_s2)) {
       convert <- .get_unit_conversion_constant(s1[i], remaining_s2[j])
       if (!is.na(convert)) {
         conversion_constant <- conversion_constant * convert
         remaining_s2 <- remaining_s2[-j]
         break
       }
     }
   }
   # if we make it through these loops and there are still units left in s2
   # then there are some we couldn't convert, and then we return NA
   if (length(remaining_s2) > 0)
     NA
   else
     conversion_constant
}

.get_conversion_constant <- function(u1, u2) {
   # if the expressions are well formed, and can be converted, we can convert
   # nominator and denominator independently. If either cannot be converted
   # then the function call returns NA which will also be returned (since NA and /)
   # will convert to NA.
   .get_conversion_constant_sequence(u1$nominator, u2$nominator) /
     .get_conversion_constant_sequence(u1$denominator, u2$denominator)
}

The remaining thing for me to figure out is how to implement that pair-of-units conversion in a general way. Right now I simply use udunits2 still, but for user-defined units that won’t work.

.get_unit_conversion_constant <- function(u1, u2) {
   # FIXME: Unit conversion only has limited support right now
   # I always just ask ud to convert units.
   su1 <- as.character(u1)
   su2 <- as.character(u2)

if (!udunits2::ud.are.convertible(su1, su2)) return(NA)
   ud.convert(1, su1, su2)
}

This, besides dealing with more complex unit expressions than fractions, is what I haven’t figured out how to do yet. Obviously I can do something with having a table of conversions, but I would need the transitive closure of such conversions and that could be a big table. I don’t expect it to be very bit, a typical program doesn’t deal with that many units, after all, but the user needs to be able to update it and I don’t want to compute the transitive closure every time I need to look something up, so I have to think a bit more about that. There would probably also need to be some consistency checking when a conversion constant is defined so you don’t get different values depending on how the transitive closure is computed.

I don’t expect it will be hard to program, I just haven’t come up with a nice solution yet. It is something to think about in the shower tomorrow.

Damn simple suffix array construction

As you may or may not know, suffix arrays are the key data structure for many string algorithms. Well, the key data structure is suffix trees but those take up way too much memory so we use suffix arrays instead because they can do the same things (with a little extra bookkeeping) while taking up a lot less memory.

We just finished our class on string algorithms this week and Storm (my co-teacher) and I were talking about simple suffix array constructions. There are many different algorithms for constructing suffix arrays. There are several algorithms for constructing suffix arrays in linear time — although in practise these are often slower than algorithms with worse big-O complexities — but we were just talking about very easy implementations of them.

In C, where strings are just pointers to to null-terminated byte buffers you can get a representation of all suffixes of a string — so all substrings that ends at the same point as the original string — by having a pointers into that string. So it is easy to represent the set of all suffixes as an array of pointers.

The suffix array is just an array that contains the rank (the order) of all suffixes, so if you can just soft such an array you have constructed the suffix array.

In C you can do that just by sorting the array of pointers.

The worst-case complexity of this is pretty bad — you have an O(n log n) complexity for sorting and each comparison could in principle take time n as well since you have to compare strings that on average n in length — but in practise you can compare strings a lot faster since they do not typically share long prefixes (so you can terminate the comparison quickly) and the n log n complexity doesn’t matter much when you are using an optimised search algorithm.

So you should be able to sort the suffixes in a few lines of code using qsort and strcmp.

You need a wrapper to make qsort work with strcmp, but other than that, it is really that simple.

Below is my implementation of that. It is a complete program that takes a string on the command line and outputs the suffix array for it.

I also made a version that reads a string from a file and on my laptop I can build a suffix array for a file that is around a hundred megabases in a minut or so (for a human chromosome) so it isn’t that bad in performance.

Here is my simple implementation:

#include <stdio .h>
#include <stdlib .h>
#include <string .h>

// Wrapper needed to make strcmp work with qsort
int strcmp_wrapper(char **x, char **y)
{
    return strcmp(*x, *y);
}

int main(int argc, const char * argv[])
{
    if (argc != 2) {
        printf("Usage: %s string\n", argv[0]);
        return 1;
    }

    const char *x = argv[1];
    size_t n = strlen(x);
    char *suffixes[n];

    // Collect all the suffixes in an array
    for (int i = 0; i < n; i++)
        suffixes[i] = (char*)x + i;
    printf("All suffixes:\n");
    for (int i = 0; i < n; i++)
        printf("Suffix %d: %s\n", i, suffixes[i]);
    printf("\n");

    // Sort the suffixes
    qsort(suffixes, n, sizeof(char<em>), (int(</em>)(const void*, const void*))&strcmp_wrapper);

    printf("Sorted suffixes:\n");
    for (int i = 0; i < n; i++)
        printf("Suffix %d: %s\n", i, suffixes[i]);
    printf("\n");

    // The actual suffix array should be indices and not the raw pointers so we need to do this
    int suffix_array[n];
    for (int i = 0; i < n; i++)
        suffix_array[i] = (int)((const char *)suffixes[i] - x);

    printf("Suffix array:\n");
    for (int i = 0; i < n; i++)
        printf("SA[%d] = %d\n", i, suffix_array[i]);
    printf("\n");
 
    return 0;
}