Adding units as types to numbers in R

This weekend I’ve been playing with an R package that assigns units to numerical data and checks unit consistency in arithmetic expressions.

It is something I’ve been wanting to write for a couple of weeks, after I had a chat with one of my PhD students that involved a lot of unit conversions. It is frighteningly easy to get the units wrong in an expression if you are not careful, but keeping track of units can easily be done completely automatically by a computer. Unlike my brain, the computer doesn’t have any problems keeping track of what units, which we can think of much like types, are associated with a value.

Such a package is certainly something I could find useful in my own work, when I am analysing coalescent models and need to scale between mutation rates, in years or generations, and coalescence rates, in 2N generations, where I never get it right the first time around. I could also see it as a nice example of something to include in my next book on object oriented programming in R. So looking into it has been on my todo-list for a little while.

Saturday morning I googled around and found that there was already such an R package on GitHub. I had been beaten by a couple of months. So I had a look at the existing package.

It assigns units to numerical values by setting an attribute that describes the units as a string and then overloads arithmetic operations so they check the equivalence of types when you add or subtract them, possibly by converting types to make them equal, and modifies the strings when you multiply or divide. All the manipulation of types is outsourced to the udunit2 package. Because the units are simply strings they are not simplified at the end of an operation.

This works fine, but isn’t a good example for me to use in a book, and it only works if you stick to units known by udunits2. So I dug into the code to see if I could write something based on it that I could use for my two purposes.

My version can be found here.

Representing expressions

The existing package creates unit holding numerical types using the standard pattern in R and has an as.units function for it. It takes a numeric type, a vector or matrix for example, and a string describing the units.

x < - as.units(1:4, "km/h")

The units are checked against the udunits2 package to make sure that they are valid units.

That check of the units won’t work if I want to allow user-defined units, so I had to get rid of that. If I allow arbitrary strings as units, though, a simple spelling mistake will introduce a new unit. Since expressions involving units types will be checked for unit consistency it is not something that will go unnoticed and cause subtle bugs, but it is still worth avoiding.

So instead of having arbitrary strings as units I decided to require that units are explicitly defined, but by being able to multiply plain numbers to units we can get a natural syntax for defining numerical values with associated units.

km <- make_unit("km")
h <- make_unit("h")
x <- 1:4 * km/h

I have pulled the units defined in udunits2 and put them in a list, ud_units, so you can also just evaluate expressions in the context of it and write

with(ud_units, {
     x <- 1:4 * km/h
})

To be able to manipulate units I have an explicit representation of the unit types. Objects of the units class are just numerical values with an attribute that points to a class of type symbolic_units and it is the latter that represents the actual unit type.

So far I have just implemented it as two lists, the nominator and denominator of units. Each are just represented as a sequence of unit dimensions. This isn’t quite as general as one could hope for, you cannot have non-integer numbers of a unit and you cannot have a log-transformation of a unit and such, but it is a start. I believe the design can be made more general by having the symbolic_units represented as expressions instead of an explicit fraction of unit symbols, but I will have to think a bit more about that.

But because I have a representation of units I can work with in R, I can simplify units so for example m/s/m will be reduced to type 1/s. Such a reduction is simple enough, you can just run through the nominator and denominator and catch when a symbol appears both places and then remove it from both nominator and denominator. Since m*m/m should have type m you can only remove a symbol the right number of times, but that is not hard to get right.

Unit conversion

Of course, you would also want to reduce something like km/m since that is essentially just 1000*m/m and that makes it a little more tricky. For each symbol pair you compare you should check if you can convert one to the other, not just whether they are the same symbol, and you need to keep track of conversion constants.

I didn’t come up with a nice way of doing it, so I just brute-forced it. I don’t expect unit lists to particularly long, so I don’t worry too much about runtime issues (and if that becomes a problem you can get a speedup just by removing the type in time-critical parts).

My implementation looks like this:

.simplify_units <- function(sym_units) {

# This is just a brute force implementation that takes each element in the
   # nominator and tries to find a value in the denominator that can be converted
   # to the same unit. If so, we pull out the conversion constant, get rid of
   # both terms, and move on. At the end we return a units object with the
   # conversion constant and the new symbolic units type. Converting units can then
   # be done as this <code>x <- as.numeric(x) * .simplify_units(units(x))</code>.

# Returning a units instead of a symbolic_units object is not idea, it means that
   # you cannot simply multiply or divide symbolic units together, you need to wrap
   # each pair-wise operator in units() but it is necessary when conversion constants
   # must be taken into account.

conversion_constant <- 1
   new_nominator <- sym_units$nominator
   new_denominator <- sym_units$denominator

delete_nom < - c()
   for (i in seq_along(new_nominator)) {
     for (j in seq_along(new_denominator)) {
       conversion <- .get_unit_conversion_constant(new_nominator[i], new_denominator[j])
       if (!is.na(conversion)) {
         conversion_constant <- conversion_constant * conversion
         delete_nom <- c(delete_nom, i)
         new_denominator <- new_denominator[-j]
         break
       }
     }
   }
   if (length(delete_nom) > 0)
     new_nominator < - new_nominator[-delete_nom]

as.units(conversion_constant, .symbolic_units(new_nominator, new_denominator))
}

The function .get_unit_conversion_constant function is responsible for checking if I can convert one unit into another. If so, I get the conversion constant, if not I get NA. This function is also used when explicitly changing a type in expressions like this:

x <- 1:4 m/s
units(x) <- km/h  # conversion from meters per seconds to km per hour

That conversion simply looks like this:

units<-.units <- function(x, value) {
   if (units(x) == value) # do nothing:
     return(x)

conversion_constant <- .get_conversion_constant(units(x), value)
   if (is.na(conversion_constant)) {
     stop(paste("cannot convert", units(x), "into", value))
   }

x <- conversion_constant * x
   attr(x, "units") <- value
   x
}

There is also some parameter checking at the beginning but I have cut that here to just show the relevant code. The .get_conversion_constant function checks if I can convert an entire units definition into another and it does that separately for nominator and denominator. If units always have their unit representation simplified that should give the right result. And getting the conversion constant is just a question of matching up unit names once more.

.get_conversion_constant_sequence <- function(s1, s2) {
   conversion_constant <- 1
   remaining_s2 <- s2
   for (i in seq_along(s1)) {
     for (j in seq_along(remaining_s2)) {
       convert <- .get_unit_conversion_constant(s1[i], remaining_s2[j])
       if (!is.na(convert)) {
         conversion_constant <- conversion_constant * convert
         remaining_s2 <- remaining_s2[-j]
         break
       }
     }
   }
   # if we make it through these loops and there are still units left in s2
   # then there are some we couldn't convert, and then we return NA
   if (length(remaining_s2) > 0)
     NA
   else
     conversion_constant
}

.get_conversion_constant <- function(u1, u2) {
   # if the expressions are well formed, and can be converted, we can convert
   # nominator and denominator independently. If either cannot be converted
   # then the function call returns NA which will also be returned (since NA and /)
   # will convert to NA.
   .get_conversion_constant_sequence(u1$nominator, u2$nominator) /
     .get_conversion_constant_sequence(u1$denominator, u2$denominator)
}

The remaining thing for me to figure out is how to implement that pair-of-units conversion in a general way. Right now I simply use udunits2 still, but for user-defined units that won’t work.

.get_unit_conversion_constant <- function(u1, u2) {
   # FIXME: Unit conversion only has limited support right now
   # I always just ask ud to convert units.
   su1 <- as.character(u1)
   su2 <- as.character(u2)

if (!udunits2::ud.are.convertible(su1, su2)) return(NA)
   ud.convert(1, su1, su2)
}

This, besides dealing with more complex unit expressions than fractions, is what I haven’t figured out how to do yet. Obviously I can do something with having a table of conversions, but I would need the transitive closure of such conversions and that could be a big table. I don’t expect it to be very bit, a typical program doesn’t deal with that many units, after all, but the user needs to be able to update it and I don’t want to compute the transitive closure every time I need to look something up, so I have to think a bit more about that. There would probably also need to be some consistency checking when a conversion constant is defined so you don’t get different values depending on how the transitive closure is computed.

I don’t expect it will be hard to program, I just haven’t come up with a nice solution yet. It is something to think about in the shower tomorrow.

Author: Thomas Mailund

My name is Thomas Mailund and I am a research associate professor at the Bioinformatics Research Center, Uni Aarhus. Before this I did a postdoc at the Dept of Statistics, Uni Oxford, and got my PhD from the Dept of Computer Science, Uni Aarhus.

Leave a Reply