20 Jul

Slouching towards completion

I have finished the chapter on visualisation. I decided not to include ggvis after all. I think later I want to write a chapter on shiny together with ggvis, but without writing about dynamic documents I don’t have much reason to include ggvis just yet.

Current status

I have four chapters yet to write and two of these at least are chapters I don’t really know how to write.

I want to write something about dealing with large data sets. I don’t mean Big Data (I think that is a completely different topic and one that is way beyond the scope of my class and this book), but I want to say a bit about how to deal with data when you run into problems with size. I mostly do that for plotting where, say, scatterplots have too many points and you need to summarise the data instead. But maybe also something about using dplyr with SQL or using data.tables instead of data.frames. I still haven’t quite figured out what to put there.
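A minimal sketch of the summarising idea, assuming simulated data and a hypothetical helper called `bin_counts()` (neither comes from the book): rather than drawing every point, count the points that fall in each cell of a grid and plot the counts.

```r
# Simulated data standing in for a data set too large to scatterplot (assumption).
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- x + rnorm(n)

# Hypothetical helper: count the points in each cell of a bins-by-bins grid.
bin_counts <- function(x, y, bins = 50) {
  x_breaks <- seq(min(x), max(x), length.out = bins + 1)
  y_breaks <- seq(min(y), max(y), length.out = bins + 1)
  x_bin <- cut(x, x_breaks, include.lowest = TRUE)
  y_bin <- cut(y, y_breaks, include.lowest = TRUE)
  counts <- unclass(table(x_bin, y_bin))
  list(x_breaks = x_breaks, y_breaks = y_breaks, counts = counts)
}

binned <- bin_counts(x, y)

# Plot one value per grid cell instead of one symbol per observation.
x_mid <- head(binned$x_breaks, -1) + diff(binned$x_breaks) / 2
y_mid <- head(binned$y_breaks, -1) + diff(binned$y_breaks) / 2
image(x_mid, y_mid, log1p(binned$counts), xlab = "x", ylab = "y")
```

With ggplot2, `geom_bin2d()` or `geom_hex()` do this kind of binning as part of the plot, so a helper like the one above is mainly useful for seeing what happens underneath.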

For the next class I am teaching I also need to have a chapter on a data analysis project. In previous years the data analysis project has been the main topic of the class and every student has picked a data set to see what they can get out of it. I am still going to use that for the class, since I think you learn much more from analysing a new data set where you don’t know what you will find, but I want to have a chapter with an example. I need to pick a good data set I can use to illustrate the topics in the previous chapters. I am not too worried about that, though. That should be simple enough.

Then there is a chapter for the “programming” part of the book. I am thinking optimisation but that might change. I am not in a hurry to get it done, since that class is in the second half of autumn, but I would like to have it done earlier so I can focus on editing the book during autumn.

The fourth missing chapter is just the conclusions. I want to have some pointers to other books worth reading, but I will just update that as I think of books. No worries there.

Tomorrow I need to write on the “large data” chapter. I would be really grateful for ideas on what to put into it.

17 Jul

More chapters finished…

The writing is going slower than I had anticipated but I still think I should manage to finish the book before the end of the month. I have drafted chapters on writing reports in R Markdown and on using supervised and unsupervised learning algorithms.

Current progress

Of course, the time estimate is based on guesstimates of word count, and I really don’t know how many words go into a chapter until I am done with it. Besides, it isn’t so much the writing that takes time as figuring out what to include in a chapter.

For the chapters I have written so far I had a good idea about what to write. The remaining chapters are more problematic. I have an idea for what to write in the plotting chapter — basic graphics, ggplot2 and ggvis (but I need to get familiar with ggvis for that; I have only played with it a little and not really used it much) — but for the “big data” chapter I haven’t really thought it through. The chapter on profiling and optimisation I haven’t really thought much about either but I don’t really need it for the first class I am going to teach — it won’t be used until the second half of autumn, so I am not too worried about it.

What Project 1 should be I still have no idea.

Anyway, you can download the current version at Leanpub. If you don’t want to sign up there for it, send me an email and I will send you the book.

I have set both the minimum and the suggested price for the book on Leanpub to free, but two people have already decided to pay for it. I consider that a bit crazy since it isn’t even a half-finished book yet, but I am grateful. I just don’t know how to deal with PayPal so I don’t know what to do with the payment yet. When I figure it out I think I will use the money to figure out how to make a hardcopy of the book. I think that can be done with Lulu if I pay for a test print, so I will look into that later. For now, I have to focus on getting the remaining chapters written.

06 Jul

Getting my book

I got a lot of positive feedback after writing here that I am working on a book, and a lot of people want to see a few chapters, so I decided to share the book in its current, very early state.

I put it on Leanpub just because it is a site I know. At least it is a way for me to share the book. Now they do suggest that you pay a bunch of dollars to download it. Don’t do it. You can scale the price down to zero and I would recommend that you do. At this point it isn’t worth your money and in any case I don’t know how to deal with the tax agency if I got any money from the US. Please just scale the price down to zero and we will all just be happier.

I would really love to get feedback on the book but I need to finish the chapters I have planned so I hope you understand that if you write me comments I will ignore them until I have finished the chapters I have planned. After that I will get started with editing and then your comments would be really helpful.

06 Jul

Data manipulation in R

I just finished my chapter on data manipulation. It covers using dplyr and tidyr and how to import data using read.table and that family.

I have a section on the readr package but since it is not a package I use much myself I am not sure I treat it correctly. Are there people out there who use it frequently whom I can ask for advice?

05 Jul

Book project

Okay, I have decided to write a book on data science and R programming. I teach two classes on Data Science in Bioinformatics, although the “in Bioinformatics” part is there only because the classes are part of our bioinformatics Master’s program; they are not particularly aimed at bioinformatics but cover data science in general.

The first class focuses on data analysis and the second on developing software for statistics and machine learning. I have looked for good textbooks for these classes but never managed to find any that fitted my needs. Don’t get me wrong, there are a lot of very good books on R programming and on R data analysis, but the good books are usually focused on specific problem domains (for data analysis) or are either very basic introductions to programming or very technical treatments of advanced topics (for programming and package development).

So, for the data analysis class I have just been referring to package documentation and tutorials and for the programming class I wrote my own lecture notes.

This year I am teaching the same two classes again, but from next year our terms are changing. Where we now have four terms a year, two in the spring and two in the autumn, from next year we will have two terms. This means that my two data science classes will be merged into one, so I want to have a set of lecture notes to cover both classes.

So I have decided to write a textbook that will cover both data analysis and programming.

Book cover

That is a cover I made on Canva on a lazy afternoon. The picture is just my laptop in my kitchen on a rare sunny day. Making a cover for a book was pure procrastination but I had nothing better to do this weekend.

Content

Obviously I want to put into the book the topics I already teach so these are the chapters I plan to write. The first half is focused on data analysis and the second half on programming. That way I can use it this year where the data science classes are still split in two.

Data analysis

These are the chapters I have planned for the data analysis part. I am not entirely sure yet what goes into each, although I have written some text for some of them, but it is the general outline.

  • Introduction to R programming. In which we learn how to work with data and write data pipelines.
  • Reproducible analysis. In which we learn how to integrate documentation and analysis in a single document and how to use such documents to produce reproducible research.
  • Data manipulation. In which we learn how to import data, tidy it up, transform it, and compute summaries from it.
  • Visualizing and exploring data. In which we learn how to make plots for exploring data features and for presenting data features and analysis results.
  • Working with large data sets. In which we see how to deal with data with very large numbers of observations.
  • Supervised learning. In which we learn how to train models when we have data sets with known classes or regression values.
  • Unsupervised learning. In which we learn how to search for patterns we are not aware of in data.

After these chapters there is a project chapter going through steps of an analysis. Here the thought is that the project is described as the steps needed in an analysis but with the details left to the reader to try to implement the actual analysis pipeline.

The first chapter is just a brief tutorial to basic R programming. I want to teach pipelines using magrittr as the basic programming pattern. Piping data from one function to the next using the %>% operator is how I do my own programming these days and I think it is a better approach than individual statements with function calls and assigning to variables.
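To make the contrast concrete, here is a small sketch assuming the magrittr package is installed; the data and variable names are invented for illustration. It computes the same result once with intermediate variables and once as a pipeline.

```r
library(magrittr)  # provides the %>% operator

values <- c(1, 4, 9, 16)

# Individual statements with function calls and intermediate variables.
roots <- sqrt(values)
total <- sum(roots)
result_steps <- round(total, 1)

# The same computation as a pipeline: each result is passed on as the
# first argument of the next function.
result_pipe <- values %>%
  sqrt() %>%
  sum() %>%
  round(1)

# Both approaches give the same value.
identical(result_steps, result_pipe)
```

The pipeline reads top to bottom in the order the steps are applied and does not leave `roots` and `total` behind in the workspace.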

The second chapter is on using R Markdown to combine documentation and analysis. I haven’t quite figured out how much I want to write about there. The Markdown language, of course. The options you can add to code chunks. Using cached chunks — something I find immensely useful myself — and maybe how to cross-reference figures, but I am not sure how well that works for raw reports compiled in RStudio. I am writing the book using knitr and pandoc and use pandoc for cross-references, but I don’t know how much that really can work in a simple Markdown document. With the new R Notebook that will be in the next RStudio I might want to say something about that, but I don’t know if it will be out this year before I teach my class so I will probably leave it out until I revise the book later.

The next chapter concerns importing data and manipulating it. I am not sure how much I want to say about readr but I want to mention it. Otherwise it is read.table() and that class of functions. Obviously dplyr is a large part of the chapter and so is tidyr.

The visualization chapter will focus on ggplot2. I might spend some time on other graphical systems but ggplot2 will be the focus. It is the best system, in my opinion, for exploring data and it is quite powerful for making final plots as well. I probably have to write a little about how to combine plots from basic graphics with ggplot2 graphs just because that is often needed.

Now for dealing with large data sets I don’t really know what I want to write yet. The computing grid I have for my own analyses has nodes with large amounts of RAM so it is not something I worry about that often. I mostly have to worry about data sizes when plotting, so I have to write something about that, but for actually dealing with large data sets I don’t know what to write yet. Different table formats? Interacting with SQL? (Which works well with dplyr so that is a bonus). I have to figure it out. I would love to hear suggestions, though. How do people deal with large data sets in R?

Supervised and unsupervised learning is easy enough to write. There are lots of packages for various machine learning algorithms so I will cover a few, but mostly focus on the basic ideas using simple statistical models like linear regression and PCA and such.

Software development

For the software development chapters I already have notes from my previous classes. The topics I want to cover are these:

  • More R programming. In which we return to the basics of R programming and get a few more details than the tutorial in chapter 1.
  • Advanced R programming. In which we explore more advanced features of the R programming language, in particular functional programming.
  • Object oriented programming. In which we learn how R models object orientation and how we can use it to write more generic code.
  • Building an R package. In which we learn what the basic components of an R package are and how we can program our own.
  • Testing and checking. In which we learn techniques for testing our R code and for checking the consistency of our R packages.
  • Version control. In which we learn how to manage code under version control and how to collaborate using GitHub.
  • Profiling and optimizing. In which we learn how to identify hotspots of code where inefficient solutions are slowing us down and techniques for alleviating this.

After this there is a second project where we build an R package. I already have this chapter written and it involves implementing a package for Bayesian linear regression.

The first of these chapters just takes the tutorial from the very first chapter and explains more technical details. Like default parameters for functions and scopes of variables and such. And the various data types and how they work.

The second chapter will focus on functional programming and how to write functions that vectorize and work well with %>% pipelines. Anything else worth mentioning there?

The object orientation chapter will just be on how to write polymorphic functions and the S3 object system. I’ve never used S4 or RC myself so I don’t feel comfortable writing about that, but maybe I should read up on it and write a little bit. I haven’t decided.
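As an illustration of what a polymorphic function looks like in S3, here is a minimal sketch. The classes `circle` and `rectangle` and the generic `area()` are hypothetical names chosen for the example, not taken from the book.

```r
# Constructors for two hypothetical classes; an S3 object is just a value
# with a class attribute.
circle <- function(r) structure(list(r = r), class = c("circle", "shape"))
rectangle <- function(w, h) structure(list(w = w, h = h), class = c("rectangle", "shape"))

# The generic function dispatches on the class of its first argument.
area <- function(shape, ...) UseMethod("area")

# One method per class, named generic.class.
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Fallback for objects whose class has no method of its own.
area.default <- function(shape, ...) {
  stop("don't know how to compute the area of this object")
}

# The caller uses area() without caring which kind of shape it has.
shapes <- list(circle(1), rectangle(2, 3))
areas <- vapply(shapes, area, numeric(1))
print(areas)
```

Adding a new kind of shape only requires a new constructor and a new `area.<class>` method; code that calls `area()` stays unchanged.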

The next three chapters I think I have under control. It is pretty obvious what goes into them if you are familiar with writing R packages and using GitHub.

The last chapter is just a placeholder right now. I think I want to write about speeding up code but I am not sure. Most people really don’t need to — and if they need to speed up code they probably need to think about algorithms more than profiling in any case — but since I need 14 weeks for the classes I need a chapter here and speed might be the topic. If so, I will write about profiling and probably about Rcpp. Is anything else relevant here?
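A small sketch of the kind of measurement such a chapter might start from, using base R's `system.time()`. The two functions are hypothetical examples, and this is simple timing rather than full profiling (base R's sampling profiler is `Rprof()`). Actual timings depend on the machine, so none are shown.

```r
# Hypothetical example: the same computation written two ways.
sum_squares_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value^2
  }
  total
}

sum_squares_vectorised <- function(x) sum(x^2)

x <- runif(1e6)

# system.time() reports how long an expression takes to evaluate.
loop_time <- system.time(loop_result <- sum_squares_loop(x))
vec_time <- system.time(vec_result <- sum_squares_vectorised(x))

print(loop_time["elapsed"])
print(vec_time["elapsed"])

# Both versions should agree up to floating-point rounding.
all.equal(loop_result, vec_result)
```

Measuring first shows whether a piece of code is slow enough to be worth rewriting at all, whether by vectorising it, choosing a better algorithm, or moving it to C++ with Rcpp.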

There are a few topics I have left out but might want to add chapters on later. Like dealing with strings. The kinds of analyses I have worked on myself haven’t included a lot of string processing, and when they do I analyze the data in Python, but there are good packages for it so it might be worth looking at. I am also thinking about writing about bioinformatics (after all, the classes are in the bioinformatics program), but I am not familiar with Bioconductor so I don’t know what to write there. Suggestions will be very welcome.

Writing the book

Now, I have guesstimated that if I write 2000 words per day, which takes 2-3 hours, I will be done by the first of August. That gives me a few weeks to edit and such before the first class starts.

Progress

Here the yellow chapters are the chapters I already have a draft of and the red chapters are the chapters I haven’t finished writing (or even started on yet).

The time estimate is based on the size of the lecture notes I had from last year, but it is nothing but a guess. In any case, when the class starts I will be done by definition. I can always improve the book in the next iteration.

When can you see the book?

Now, I would love to share the book and get feedback but I just know that if I get comments on it right now I will spend more time changing the existing text than writing the remaining chapters.

As soon as all the chapters are yellow I will put the book online and start to collect comments. If anyone wants to look at it before then I am happy to send it to them on the condition that I do not get comments back until I have finished writing draft chapters for the entire book.

Wish me luck in the writing process.