Okay, I have decided to write a book on data science and R programming. I teach two classes on Data Science in Bioinformatics, although the “in Bioinformatics” part is only there because the classes are part of our bioinformatics Master’s program; they are not particularly aimed at bioinformatics but cover data science in general.
The first class focuses on data analysis and the second on developing software for statistics and machine learning. I have looked for good textbooks for these classes but never managed to find any that fit my needs. Don’t get me wrong, there are a lot of very good books on R programming and on R data analysis, but the good books are usually focused on specific problem domains (for data analysis), or are either very basic introductions to programming or very technical treatments of advanced topics (for programming and package development).
So, for the data analysis class I have just been referring to package documentation and tutorials and for the programming class I wrote my own lecture notes.
This year I am teaching the same two classes again, but from next year our terms are changing. Where we now have four terms a year, two in the spring and two in the autumn, from next year we will have two terms. This means that my two data science classes will be merged into one, so I want to have a set of lecture notes to cover both classes.
So I have decided to write a textbook that will cover both data analysis and programming.
That is a cover I made on Canva on a lazy afternoon. The picture is just my laptop in my kitchen on a rare sunny day. Making a cover for a book was pure procrastination, but I had nothing better to do this weekend.
Obviously I want the book to cover the topics I already teach, so these are the chapters I plan to write. The first half focuses on data analysis and the second half on programming. That way I can use it this year, while the data science classes are still split in two.
These are the chapters I have planned for the data analysis part. I am not entirely sure yet what goes into each, although I have written some text for some of them, but it is the general outline.
- Introduction to R programming. In which we learn how to work with data and write data pipelines.
- Reproducible analysis. In which we learn how to integrate documentation and analysis in a single document and how to use such documents to produce reproducible research.
- Data manipulation. In which we learn how to import data, tidy it up, transform it, and compute summaries from it.
- Visualizing and exploring data. In which we learn how to make plots for exploring data features and for presenting data features and analysis results.
- Working with large data sets. In which we see how to deal with data with very large numbers of observations.
- Supervised learning. In which we learn how to train models when we have data sets with known classes or regression values.
- Unsupervised learning. In which we learn how to search for patterns we are not aware of in data.
After these chapters there is a project chapter going through the steps of an analysis. The idea is to describe the steps an analysis needs but leave the details to the reader, who then has to implement the actual analysis pipeline.
The first chapter is just a brief tutorial on basic R programming. I want to teach pipelines using magrittr as the basic programming pattern. Piping data from one function to the next using the %>% operator is how I do my own programming these days, and I think it is a better approach than writing individual statements that call functions and assign the results to variables.
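As a rough sketch of the difference between the two styles, assuming the magrittr package is installed and using a small made-up data frame (the names `heights`, `group`, and `height` are purely illustrative):

```r
# Assumes the magrittr package is installed; data and names are made up.
library(magrittr)

heights <- data.frame(
  group  = c("a", "a", "b", "b"),
  height = c(1.62, 1.80, 1.75, 1.91)
)

# Individual statements, assigning each intermediate result to a variable.
tall <- subset(heights, height > 1.7)
tall_heights <- tall$height
mean_tall_statements <- mean(tall_heights)

# The same computation as a pipeline: the left-hand side becomes the
# first argument of the function call on the right-hand side.
mean_tall_pipeline <- heights %>%
  subset(height > 1.7) %>%
  with(height) %>%
  mean()
```

Both versions compute the same mean; the pipeline just avoids naming the intermediate results.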
The second chapter is on using R Markdown to combine documentation and analysis. I haven’t quite figured out how much I want to write there. The Markdown language, of course. The options you can add to code chunks. Using cached chunks (something I find immensely useful myself), and maybe how to cross-reference figures, but I am not sure how well that works for raw reports compiled in RStudio. I am writing the book using knitr and pandoc and use pandoc for cross-references, but I don’t know how much of that really works in a simple Markdown document. With the new R Notebook that will be in the next RStudio I might want to say something about that, but I don’t know if it will be out this year before I teach my class, so I will probably leave it out until I revise the book later.
The next chapter concerns importing data and manipulating it. I am not sure how much I want to say about readr but I want to mention it. Otherwise it is read.table() and that class of functions. Obviously dplyr is a large part of the chapter and so is tidyr.
The visualization chapter will focus on ggplot2. I might spend some time on other graphical systems but ggplot2 will be the focus. It is the best system, in my opinion, for exploring data and it is quite powerful for making final plots as well. I probably have to write a little about how to combine plots from base graphics with ggplot2 graphs, just because that is often needed.
Now for dealing with large data sets I don’t really know what I want to write yet. The computing grid I have for my own analyses has nodes with large amounts of RAM so it is not something I worry about that often. I mostly have to worry about data sizes when plotting, so I have to write something about that, but for actually dealing with large data sets I don’t know what to write yet. Different table formats? Interacting with SQL? (Which works well with dplyr so that is a bonus). I have to figure it out. I would love to hear suggestions, though. How do people deal with large data sets in R?
The supervised and unsupervised learning chapters are easy enough to write. There are lots of packages for various machine learning algorithms so I will cover a few, but I will mostly focus on the basic ideas using simple statistical models such as linear regression and PCA.
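To illustrate the two kinds of simple model mentioned here, this is a minimal sketch using only base R and the built-in `cars` and `iris` data sets. The choice of data sets and variable names is mine, for illustration only:

```r
# Supervised: linear regression of stopping distance on speed
# (the response values are known, so we can fit and then predict).
fit <- lm(dist ~ speed, data = cars)
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_speeds)

# Unsupervised: PCA on the four numeric iris measurements,
# deliberately ignoring the species labels.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Here `predicted` holds the fitted stopping distances for the two new speeds, and `variance_explained` gives the fraction of the total variance captured by each principal component.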
For the software development chapters I already have notes from my previous classes. The topics I want to cover are these:
- More R programming. In which we return to the basics of R programming and get a few more details than the tutorial in chapter 1.
- Advanced R programming. In which we explore more advanced features of the R programming language, in particular functional programming.
- Object oriented programming. In which we learn how R models object orientation and how we can use it to write more generic code.
- Building an R package. In which we learn what the basic components of an R package are and how we can program our own.
- Testing and checking. In which we learn techniques for testing our R code and for checking the consistency of our R packages.
- Version control. In which we learn how to manage code under version control and how to collaborate using GitHub.
- Profiling and optimizing. In which we learn how to identify hotspots of code where inefficient solutions are slowing us down and techniques for alleviating this.
After this there is a second project where we build an R package. I already have this chapter written and it involves implementing a package for Bayesian linear regression.
The first of these chapters just takes the tutorial from the very first chapter and explains more technical details. Like default parameters for functions and scopes of variables and such. And the various data types and how they work.
The second chapter will focus on functional programming and how to write functions that vectorize and work well with %>% pipelines. Anything else worth mentioning there?
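The kind of thing I have in mind could look like the following sketch, which uses only base R. All function names (`clamp_scalar`, `clamp`, `apply_twice`, `add_one`) are hypothetical examples, not taken from any package:

```r
# A scalar-minded function: `if` only looks at a single value,
# so this does not work directly on a whole vector.
clamp_scalar <- function(x, lower = 0, upper = 1) {
  if (x < lower) lower else if (x > upper) upper else x
}

# A vectorized version built from functions that already operate on
# whole vectors. The data is the first argument, which is what makes
# a function convenient to use on the right-hand side of %>%.
clamp <- function(x, lower = 0, upper = 1) pmin(pmax(x, lower), upper)

# A higher-order function: it takes a function and returns a new one.
apply_twice <- function(f) function(x) f(f(x))
add_one <- function(x) x + 1
add_two <- apply_twice(add_one)

values <- c(-0.5, 0.25, 1.5)
clamped <- clamp(values)
clamped_via_sapply <- sapply(values, clamp_scalar)
```

The scalar version has to be mapped over the vector with `sapply()`, while the vectorized version handles the whole vector in one call and would slot into a pipeline as `values %>% clamp()`.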
The object orientation chapter will just be on how to write polymorphic functions and the S3 object system. I’ve never used S4 or RC myself so I don’t feel comfortable writing about them, but maybe I should read up on them and write a little bit. I haven’t decided.
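For readers who have not seen S3, a polymorphic function can be sketched like this in base R. The `circle`, `rectangle`, and `shape` classes and the `area` and `describe` generics are made up for illustration:

```r
# Constructors that tag a list with a class attribute.
circle <- function(r) structure(list(r = r), class = c("circle", "shape"))
rectangle <- function(w, h) {
  structure(list(w = w, h = h), class = c("rectangle", "shape"))
}

# A generic function dispatches on the class of its first argument.
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h
area.default <- function(shape, ...) {
  stop("don't know how to compute the area of this object")
}

# A method defined for the shared "shape" class works for both subclasses.
describe <- function(shape, ...) UseMethod("describe")
describe.shape <- function(shape, ...) {
  paste(class(shape)[1], "with area", format(area(shape), digits = 3))
}

shapes <- list(circle(1), rectangle(2, 3))
areas <- sapply(shapes, area)
```

Calling `area()` on each element picks the method matching the object's class, so generic code such as `sapply(shapes, area)` does not need to know which kind of shape it is handling.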
The next three chapters I think I have under control. It is pretty obvious what goes into them if you are familiar with writing R packages and using GitHub.
The last chapter is just a placeholder right now. I think I want to write about speeding up code but I am not sure. Most people really don’t need to — and if they need to speed up code they probably need to think about algorithms more than profiling in any case — but since I need 14 weeks for the classes I need a chapter here and speed might be the topic. If so, I will write about profiling and probably about Rcpp. Is anything else relevant here?
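As a small sketch of the sort of measurement such a chapter could start from, here is a base R comparison of an explicit loop with a vectorized expression, timed with `system.time()`. The functions are hypothetical examples and the timings will vary from machine to machine:

```r
# Sum of squares with an explicit loop...
sum_squares_loop <- function(x) {
  total <- 0
  for (value in x) total <- total + value^2
  total
}

# ...and with vectorized arithmetic.
sum_squares_vec <- function(x) sum(x^2)

x <- runif(1e5)

# system.time() reports how long each expression takes to evaluate.
loop_time <- system.time(loop_result <- sum_squares_loop(x))
vec_time  <- system.time(vec_result <- sum_squares_vec(x))
```

Comparing `loop_time` and `vec_time` shows where the time goes for this one function; for finding hotspots in larger programs, base R also provides the sampling profiler `Rprof()`.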
There are a few topics I have left out but might want to add chapters on later. Like dealing with strings. The kinds of analyses I have worked on myself haven’t included a lot of string processing, and when they do I analyze the data in Python, but there are good packages for it so it might be worth looking at. I am also thinking about writing about bioinformatics (after all, the classes are in the bioinformatics program), but I am not familiar with Bioconductor so I don’t know what to write there. Suggestions will be very welcome.
Writing the book
Now, I have guesstimated that if I write 2000 words per day, which takes 2-3 hours, I will be done by the first of August. That gives me a few weeks to edit and such before the first class starts.
Here the yellow chapters are the chapters I already have a draft of and the red chapters are the chapters I haven’t finished writing (or even started on yet).
The time estimate is based on the size of the lecture notes I had from last year, but it is nothing but a guess. In any case, when the class starts I will be done by definition. I can always improve the book in the next iteration.
When can you see the book?
Now, I would love to share the book and get feedback, but I just know that if I get comments on it right now I will spend more time changing the existing text than writing new chapters.
As soon as all the chapters are yellow I will put the book online and start to collect comments. If anyone wants to look at it before then I am happy to send it to them on the condition that I do not get comments back until I have finished writing draft chapters for the entire book.
Wish me luck in the writing process.