Posts Tagged ‘software development’

QBlossoc

Monday, June 16th, 2008

I’ve just made a release of our association mapping tool, Blossoc.  The new release, nicknamed QBlossoc, adds full support for quantitative phenotypes.  We’ve had support for quantitative phenotypes for a while now, but this version is the “official” release for it, with tuned default options and such.

It is mainly the work of Søren Besenbacher, who’s been running simulation experiments for the last several months to figure out which scoring methods, and with which parameters, works best.

Although the quantitative traits method hasn’t been published yet — we only submitted the paper last week — it has already been used by Monica Ledur et al. to analyse the data set from the XII QTLMAS workshop in Uppsala.  There should be a paper coming out on that as well.  I don’t know its status, but Monica was kind enough to send the manuscript to me.

Programming for (non-computer) scientists

Friday, June 6th, 2008

A few days ago, bbgm wrote about Programming and Science Education and how people advocating that students use Excel (or similar) instead of “real” programming languages.  I quote:

As a computational scientist (and with both a physical and a life science background), that such arguments still happen is appalling. IMO, all scientists, especially those remotely connected to theory and/or computational science should be given the opportunity to learn some formal software engineering and computer science principles (for physical chemists, bioinformaticians, etc is should be mandatory to do some courses).

I guess I agree, but I haven’t made up my mind to which degree I agree.

How much programming do you really need?

You can do a lot with spreadsheets and they are a really powerful tool once you get used to them.  I have never bothered myself, but I have seen what my colleagues manage to do with them.  So they may be sufficient for what you need and then it makes sense to learn how to get the most of them.

That being said, it is not a substitute for a programming language, and when you need a proper programming language you need a lot of hacking to do the same tasks in a spreadsheet.

If you need a complex statistical analysis of your data, you are better off using a language/environment such as R than trying to do the analysis in Excel.  Sure, Excel can do a lot of statistics, but only the most common types of analysis. If you need more, you need something like R.

Depending on how much manipulation of your data — and how much mathematical modelling you need to implement yourself — R might not be enough.  If you need to combine a lot of output of various programs, do a few manipulations on the output, and analyse that, you probably want to consider a scripting language like Python or Perl.

For heavy duty scientific programming, you want MATLAB or Octave or such.  When you need the extra speed, and your language does not already have an optimised solution, C or C++ is useful (either for a full application or for a module in your scripting or statistical language — they usually will allow you to make extensions in C/C++ that you can then use as any other module).

How much programming you need to do in scientific work varies a lot and there is not really any reason to spend time learning something that you will never use.

Maybe a spreadsheet is all you will ever need.  Just make sure that you are doing the analysis you want to do, not the one your tools limits you to, and if you need more, then spend the time learning how to program!

Learning how to program

Remember, though, that learning to program a computer takes time.  It can take a lot of time before you do it well.

There is a lot more to it than learning the syntax of a programming language.  You need to know the right way to solve given problems in your language of choice — the optimal way in one programming language can be very different from the optimal way in another, depending on libraries, language features, underlying philosophy of the language, etc. — and in general you need to learn how to think like a programmer.

It might be a different kind of problem solving skills than what you are used to.

Most of all it takes practise.  This can be frustrating if you pick up a language to solve a specific problem that you are more interested in than learning how to program.

If you only very rarely need to program, don’t bother.  Find someone to help you out.  This is how we bioinformaticians get involved in various projects so although we complain about being over worked, we don’t really mind so much. It is not different from having to consult a statistician from time to time to make sure that you are analysing your data correctly.  If you do not need it that often, your time might be better spend on other tasks.

If you often need to solve programming problems, you should learn how to do it properly.  You get a better feeling for the problems and for the data when you work with it, and you don’t want to out-source that to other people.

If you decide you need programming in your research — and I think more and more sciences rely on IT so maybe you do before you know it — then practice practice practice.  Experiment with your programming language.  Read discussion fora and user groups.  Read other peoples code and see how they solve similar problems.

Just as it takes time to learn how to use a spreadsheet properly — and just learning all the features it offers — it takes time to learn how to program.  Usually more so, as programming languages are much more powerful.  Spend the time!  There is no fast way around this, there really isn’t.

GSL and C++

Tuesday, May 27th, 2008

Yesterday, in my machine learning class, I showed some example code for artificial neural networks. I am giving my students a project tomorrow where they have to do regression using neural networks, and I wanted to show them how to program these networks. The pictures we draw of networks, and the mathematics we do with them, looks less like the implementation than the other machine learning methods we cover in the course, so I thought a code demo would be in order.

The project

It’s a simple feed-forward neural network I want them to implement in their project, so essentially they just need to be able to calculate target values from predictor variables and the network weights and then minimise the error function (as a function of the network weights).

target = y(x,w)
training: argminw Σn(tny(xn,w))2
Using the back-propagation algorithm they can get the gradient of the error function with respect to w and use that in the optimisation.Now, I could ask them to use the gradient descent algorithm or similar for the optimization, but this is not a course on numerical methods (even simple ones) and I don’t want to give them the impression that numerical optimization is simpler than it really is, so I suggest to them that they use a library for the optimization.

Example implementations

For class I showed how simple neural networks could be implemented and trained in Python, using scipy to do the training. This works very fine for the demonstration, since I can interactively train the networks to various target functions and plot how well the network fits the target.

To show that the language is more or less arbitrary for the solution, I also implemented a neural network in C++, this time using GSL to minimize the error function.

In theory, there shouldn’t be much difference between the Python and the C++ solutions — and for the actual neural network implementation there isn’t — but using GSL is much more tedious than the Python approach. It really shouldn’t be.

The main problem is that GSL is a C library and not a C++ library, so the features in C++ that would make the C++ solution just as simple and intuitive as the Python solution are simply not supported. One thing is operator overloading. It is missing from C, so working with vectors is a bit annoying. Especially since, in C++, you are used to the STL vectors, and switching back and forth between std::vector<> and gsl_vector * is a bit annoying. It is not the worst problem, though. Not in the neural network application, at least (it might be for more linear algebra intensive applications).

The one thing that annoyed me the most was how I should call the minimization function. In Python it was reasonably easy using high-level functions, so I could curry my error function and create a function of just the weights. In GSL, I had to go through a void pointer. In C++, I would have used a functor to deal with this, but since GSL is C it doesn’t have templates, so I couldn’t do that.

I wasn’t happy with resource management either. GSL is clearly not designed with exceptions in mind — after all, they do not exist in C — and this means that you either have to wrap all GSL types in small specialized classes so they will be destroyed properly, or you need to be careful with calling the correct <type>_free() functions. The later is very hard to do exception safe.

C++ wrappers for GSL?

I googled around a bit for C++ wrappers for GSL, and it seems like there are a few around, but not really any I found particularly attractive. GSL for C++ doesn’t appear particularly active (but I could be wrong). GSLwrap seems like it only wraps the linear algebra.

Are there any good GSL wrappers around, that use the high level features of C++?  Or any obvious replacement?

Just overloading operators to provide linear algebra syntax for matrix manipulation is not the way to go, unless there is some pretty heavy template magic involved, I think.  The methods in BLAS are pretty efficient because they can do a lot of operations with a single call.  Doing the same using operators would be very inefficient.  That is why the efficient C++ solutions use template expressions.  I am not sure that there is an easy way to do this through GSL, although the template expressions could end up just calling the BLAS routines, of course.

What about high level functions (functors) for optimization?

Debugging fun

Saturday, May 10th, 2008

Yesterday I got a simulator from Garrett Hellenthal that I need to simulate some genetics data (combined “panel” and “tagSNP” data, if you want to know).

The program didn’t work for me. On the example data provided with the code, it complained about invalid input. Right! Naturally, I expected that Garrett had given me test data that was in some way outdated. It happens quie often that the test data isn’t updated when new features are added to a program, and that test examples no longer function the way they should. I make that mistake a lot, anyway.

So I got hold of Garrett, and yes the test example was incorrect and he sent me an updated test. It didn’t work either. We emailed back and forth a bit, and this time it was a bit of a puzzle. Everything worked fine at Garrett’s end, but the program just wouldn’t eat the input on my computer. I didn’t have the time to figure out why, yesterday, as I had a plane to catch, but I had another look at it today.

I first grep’ed for the error message I got: “CV frequencies must be between 0 and 0.50″. That line occurred a few places in the code, so I annotated the code with some output before each and that pinpointed this test as the problem:

if ((allelefreq[count] < epsilon[count]) || (allelefreq[count] > (0.50-epsilon[count]))) {
   printf("CV frequencies must be between 0 and 0.50\n");
   exit(1);
 }

Ok, so on Garrett’s machine, the test was false and on mine it was true, but the input files were the same. First I tested if the input was read in correctly (Garrett’s IO code is a bit of a mess, excuse me for saying it). allelefreq[count] was supposed to be 0.4 — and it was — and epsilon[count] should be 0.1 — and it was.

That’s when I spotted the problem. Do you see it?

Clearly 0.4 ≥ 0.1 and 0.4 ≤ 0.5-0.1 so the test should evaluate to false, but it doesn’t.

The thing is, floating point numbers aren’t real numbers (as you probably know). And for one thing, 0.1 cannot be represented in a finite number of bits (as a binary number) so it is approximated in the computer. So 0.5-0.1 is not 0.4 (although they could be represented by the same bit-patter as 0.4 is represented as). Whether 0.4 ≤ 0.5-0.1 or 0.4 > 0.5-0.1 on the computer depends on the way the numbers are represented.

Garrett was running his program on a 64 bit architecture (where the numbers were represented using 128 bits) and I on a 32 bit architecture (with 64 bits per number). That made all the difference.

Fun, eh?

Painting “go faster” stripes on HapCluster

Friday, March 21st, 2008

These days I’m working a lot on my HapCluster association mapping tool. I’m hoping to make a version that scales to genome wide studies.

My “back of an envelope” calculations based on an older version of HapCluster told me it should be possible to analyse a whole genome data set in about a CPU week, which is roughly the time it takes to phase it using the BEAGLE tool, which is the fastest I know about. Since HapCluster doesn’t need phasing, it would then be just as fast as analysing the data with Blossoc after phasing (which is much faster than using Blossoc on data that isn’t phased).

Unfortunately, some recent changes to HapCluster has made it much slower…

How did I get started with HapCluster?

HapCluster is a method developed by Ed Waldron, David Balding and John Whittaker, all from Imperial College, London (although Ed is now working for a pharma company, as far as I know). The method is described in the paper

Fine Mapping of Disease Genes via Haplotype Clustering.
E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)

I got involved with the method after attending a talk by Ed in Oxford (and discussing it with him in the pub afterwards — take that, recent Beer study!). He had implemented the method in R and I suggested to re-implement it in C++, mainly because I wanted to make a fair comparison with Blossoc — Blossoc was slightly better at localising disease variants, but a runtime comparison with an R implementation would be a bit unfair.

So anyway, he sent me the R code and I translated it into C++, but along the way got a few ideas to further improvements and started working with those. Later I met up with Ed and David a few times to continue the work. I’m going to visit David for about a week next month as well.

How does HapCluster work?

The method is quite simple, which is probably also why it is reasonably fast. It is based on clustering sequences, based on their local similarity around a possible disease locus, and then assign a likelihood to the clustering based on the way cases and controls are distributed in the clusters.

The disease locus, the clustering, and everything else, is explored using a Markov Chain Monte Carlo (MCMC) method. From this, the posterior probability for the location of the disease locus can be extracted, and association can be tested through the Bayes factor when comparing the HapCluster model with a null model that assumes that cases and controls are distributed independently from the genotype.

It is the latter, the Bayes factor, I plan to use in a genome wide setting. I will run the method in sliding windows along the genome, and for each window extract the Bayes factor.

How did I slow it down?

With the help of two students, I refactored the code to make it easier to extend the method. I had made an unholy mess of template programs to mix different likelihood methods and different types of data without paying a runtime penalty, but that code just turned out too hard to modify, so we re-wrote the code to use dynamic virtual function dispatching (taking care to limit how often we called virtual functions so we wouldn’t slow the code down too much).

It did slow the code down a little bit, but not enough to be a major problem. But then I added code to automatically detect convergence of the Markov chain, and to adjust the number of iterations of the chain to always get the same effective sample size. This is necessary if I want to deal with genome wide data, where I cannot manually examine the output of each run, but now I have a time problem. In some data sets, it takes ages for the diagnostics to accept that the chain has converged.

How do I speed it up?

It seems to me that I have two ways to speed up the method: Make each iteration in the Markov chain faster, or take fewer steps.

Speeding each step up is not as easy as it sounds. I am spending around 80% of the total runtime in the same function, which is the function that compares the local sequences. If I could just speed that function up, I would gain a lot, but the code there is pretty tight so I don’t see any obvious ways of doing it. I am playing with the idea of re-writing the data structure I have the sequences in, so I can exploit parallel instructions on the newer chips, but it is a major change to the tool that might not give me much in the long run, so I will keep that idea for a thesis student some day and not bother with it myself :)

Taking fewer steps is not immediately a good idea. I need the chain to have converged before I can start sampling from it, and I really do need to get the right number of effective samples to get a reasonable result. If the problem is that on some data, the chain takes longer to converge, there is really not much I can do about it.

But I have a suspicion that the chain really has converged and it is just my diagnostics that gets confused. The way I determine convergence is somewhat ad hoc as all such methods are. I sample the numerical parameters I have in the model and compare the first half of the samples with the second half of a chain run for a fixed number of iterations. If a t-test says they have different means, I assume that the chain didn’t converge, but if the t-test says they have the same mean, then I accept convergence.

Now, I have noticed that when the chain takes a larger number of iterations before the diagnostics accept convergence, then the number of samples needed to get a fixed effective number of samples is very high. This means that in the cases where the burn in takes a long time, I have a higher auto-correlation between samples.

If the chain moves slowly through the state space, then even if it has converged I will probably see different means between the first and second half of a run.

My idea now is to increase the thinning between samples, based on estimates of the effective samples size, during burn in. It means that I will increase the number of iterations in some cases, but if my theory is correct, I will not run into the problems where I have converged but my test keeps rejecting this.

Let’s see how it works…