Almost done revising Introduction to Data Science

I’ve gone through and revised all chapters except the last project. Proofreading it and fixing the grammar and typos will not take long, but there is a lot of math in it, and I want both a Kindle and an ePub version, so I need conditional compilation with different math for the Kindle version. That is going to take a little more than an hour to get done, and in about an hour I am off to a reception at Google, so I will let it rest for now.

I will get the revisions done by tomorrow and upload the book to my course home page, ready for next week’s teaching. It feels pretty good to finally be close to the end of editing. It is the least fun part of writing. I look forward to getting back to my other book, where I still have one more chapter to write before I need to go through and edit it.

Nationwide Genomic Study in Denmark Reveals Remarkable Population Homogeneity

Georgios Athanasiadis, Jade Y. Cheng, Bjarni J. Vilhjálmsson, Frank G. Jørgensen, Thomas D. Als, Stephanie Le Hellard, Thomas Espeseth, Patrick F. Sullivan, Christina M. Hultman, Peter C. Kjærgaard, Mikkel H. Schierup, Thomas Mailund

GENETICS Early online August 17, 2016; DOI: 10.1534/genetics.116.189241

Denmark has played a substantial role in the history of Northern Europe. Through a nationwide scientific outreach initiative, we collected genetic and anthropometrical data from ~800 high school students and used them to elucidate the genetic makeup of the Danish population, as well as to assess polygenic predictions of phenotypic traits in adolescents. We observed remarkable homogeneity across different geographic regions, although we could still detect weak signals of genetic structure reflecting the history of the country. Denmark presented genomic affinity with primarily neighboring countries with overall resemblance of decreasing weight from Britain, Sweden, Norway, Germany and France. A Polish admixture signal was detected in Zealand and Funen and our date estimates coincided with historical evidence of Wend settlements in the south of Denmark. We also observed considerably diverse demographic histories among Scandinavian countries, with Denmark having the smallest current effective population size compared to Norway and Sweden. Finally, we found that polygenic prediction of self-reported adolescent height in the population was remarkably accurate (R² = 0.639 ± 0.015). The high homogeneity of the Danish population could render population structure a lesser concern for the upcoming large-scale gene-mapping studies in the country.

More bindr

I extended my new bindr package a little today. If you recall, it lets you assign the return values of a function to local variables, so, for example, you can use

f <- function(x, y) list(x = x, y = y)
bind(a, b) %<-% f(2, 3)

to assign to local variables a and b the values returned by the function f.

What I implemented today was a syntax for using the variables returned by a function (if it returns a list or vector) in the expressions assigned to the local variables.

So now you can write code like this:

f <- function(x, y) list(x = x, y = y)
bind(a = 2 + x, b = a + 3 + y) %<-% f(2, 3)

You can combine positional arguments and expressions like these, but you probably shouldn’t. If you do, a positional variable gets the value at its own position in the argument list, regardless of any named expressions before it, so

f <- function(x, y) list(x = x, y = y)
bind(a = y, b) %<-% f(2, 3)

will assign the value of y, 3, to both a and b.
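To see why position alone decides the outcome, it helps to look at how the arguments appear when captured unevaluated. The helper below, capture_args, is a hypothetical stand-in and not part of bindr; it is only a sketch, under the assumption that bind collects its arguments in a similar way.

```r
# Hypothetical helper (not bindr's API): capture arguments without evaluating them.
capture_args <- function(...) eval(substitute(alist(...)))

# A mix of a named expression and a positional variable.
mixed <- capture_args(a = y, b)
names(mixed)          # "a" for the expression, "" for the positional variable
is.name(mixed[[2]])   # TRUE: the positional argument is a bare name

# With only positional variables there are no names at all.
positional <- capture_args(a, b)
is.null(names(positional))   # TRUE
```

The second argument, b, has an empty name and sits at index 2, so it picks up the second returned value, which here is y.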

The implementation is contained in the assignment operator and looks like this:

`%<-%` <- function(bindings, value) {
  if (length(bindings$bindings) > length(value))
    stop("More variables than values to bind.")
  var_names <- names(bindings$bindings)
  val_names <- names(value)
  has_names <- which(nchar(val_names) > 0)
  value_env <- list2env(as.list(value[has_names]),
                        parent = bindings$scope)
  for (i in seq_along(bindings$bindings)) {
    name <- var_names[i]
    if (length(var_names) == 0 || nchar(name) == 0) {
      # we don't have a name so the expression should be a name
      # and we are going for a positional value
      variable <- bindings$bindings[[i]]
      if (!is.name(variable))
        stop(paste0("Positional variables cannot be expressions ",
                    deparse(variable), "\n"))
      val <- .unpack(value[i])
      assign(as.character(variable), val, envir = bindings$scope)
    } else {
      # if we have a name we also have an expression and we evaluate that
      # in the environment of the value followed by the enclosing
      # environment and assign the result to the name.
      val <- eval(bindings$bindings[[i]], value_env)
      assign(name, val, envir = bindings$scope)
    }
  }
}

It is not terribly complicated. The only thing you need to keep track of is the environment in which to evaluate expressions. I make an environment from the named values on the right-hand side, set its parent to the calling frame (saved in the bind function call), and evaluate the expressions in it. Names that are not among the returned values are then looked up in the caller’s scope.
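The environment chaining can be tried out in isolation. This is a minimal sketch using only base R; the scope environment is a hypothetical stand-in for the calling frame that bind saves, and offset is a made-up variable used to show the fallback lookup.

```r
# Stand-in for the caller's frame; in the package this comes from bind().
scope <- new.env()
assign("offset", 10, envir = scope)

# The values "returned by the function", placed in an environment
# whose parent is the caller's scope.
value <- list(x = 2, y = 3)
value_env <- list2env(value, parent = scope)

# x and y are found in value_env; offset is found by following
# the parent link to scope.
result <- eval(quote(x + y + offset), value_env)

# The result is assigned in the caller's scope, not in value_env.
assign("a", result, envir = scope)
get("a", envir = scope)   # 15
```

Because the values live in their own environment, evaluating the expressions does not leak x and y into the caller’s scope; only the bound variables end up there.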

More revising…

I don’t have as much time to edit my Data Science book next week as I had foolishly thought. We have the WABI workshop next week, so I will spend some time there. I therefore tried to get through some more chapters today.

For the class that starts in about a week I only need the chapters up to, and including, the first project, and I will not have any problem getting those revised. I just would love to get it all done before I give the book to my students. It should still be possible even with WABI, though.

During these revisions I am also fixing the mathematical formulas so I can make a Kindle ebook. Not that I think people will want to read a textbook on a Kindle, but you can use it in the Kindle application on a laptop or a tablet, and it could be useful there. The Mobi format that Kindle uses, however, doesn’t support MathJax, so I cannot show more than the most basic formulas. I have therefore added some gpp preprocessing to have different formulas depending on the output format I generate.

The supervised learning chapter is the only chapter where I have needed this so far, but there are a couple of chapters in the last part of the book where I need it as well, in particular in the last project chapter.

Revising chapters

I have spent most of today revising the first three chapters of my Data Science book. It takes me almost as long to revise a chapter as it took me to write it, and I find it much harder to concentrate on it. I have already read the chapters and made corrections on printed copies, but it still takes me a long time to clean up the language and improve explanations and descriptions.

Because it requires a lot of concentration, it is also very hard for me not to procrastinate, so I am forcing myself to work in sprints of 25 minutes, after which I get to take a five-minute break to hang around on Twitter or go through emails. If I didn’t do it this way, I would be spending way more than 50% of my time doing anything but revising.

I am using Grammarly to check the revisions I make, and to catch errors I haven’t spotted myself when reading the hard copies (and there are many!). I am happy I sprang for a Premium subscription yesterday.

It is not entirely simple to use, though. I have to copy the text into it, edit, and then copy it back, which is mildly annoying. Worse, it seems that it cannot handle chapter-length texts that well. When it checks the text, it seems to catch many more issues at the beginning of the text and fewer towards the end. It rechecks from time to time, after I edit, and then new issues pop up that were there before but just not shown. When it does this rechecking, it forgets that I have told it to ignore certain issues. It keeps telling me that I am repeating words that are technical terms, chunks and functions for example, and I don’t want to change them into other words. By the time I get towards the end of a chapter, many such ignored issues from earlier in the text have reappeared, and I get the feeling that this means it is catching fewer issues at the end. It might just be my imagination, but that is the impression I get.

With long chapters I have also had it just stop highlighting issues. If I am halfway through a chapter, I know there are more issues to address, but it is not showing them. Then I have to reload the document, and they pop up, together with all the issues I have told it to ignore.

So I have started copying sections instead of chapters into the tool and editing those in isolation. That works reasonably well.

Tomorrow I will try to get through another three chapters, and then it should be easy enough to do the rest next week in time to have it all uploaded before classes start.