Posts Tagged ‘useful tools’

Keynote talk on coalescent hidden Markov models and great ape speciation

Monday, June 9th, 2008

Wednesday this week I’m giving a talk at the Danish Society for Computer Science. I was asked to give the keynote talk at this years general assembly before they hand out this year’s Best Thesis Award.

I’m tired of talking about my main project — association mapping — so I decided to give a talk on our CoalHMM work (although I am only working on that a small percentage of my time).

The slides file is too big to upload to slideshare, but if my flash hack works you should see them below:

You can also get it as a PDF file, a PowerPoint file or a Keynote file, if you want.

The last couple of presentations I’ve made I have been using the Keynote program on my Mac. I’m really happy with the program. It makes it very easy to put together pretty presentations (if I say so myself) with very little work. Especially its handling of graphics impresses me. Compared to using OpenOffice, as I usually did before getting the Mac, there’s essentially no work involved in including graphics. Automatic masking and alpha channels beats modifying images in Gimp any day!

I couldn’t really figure out how to put presentations on my homepage, though. Keynote files are really directories, so you cannot just copy them to the homepage directory. I found out, though, that if a Mac downloads a zip’ed Keynote file it will just open it in Keynote, so I guess that is the way to do that.

Programming for (non-computer) scientists

Friday, June 6th, 2008

A few days ago, bbgm wrote about Programming and Science Education and how people advocating that students use Excel (or similar) instead of “real” programming languages.  I quote:

As a computational scientist (and with both a physical and a life science background), that such arguments still happen is appalling. IMO, all scientists, especially those remotely connected to theory and/or computational science should be given the opportunity to learn some formal software engineering and computer science principles (for physical chemists, bioinformaticians, etc is should be mandatory to do some courses).

I guess I agree, but I haven’t made up my mind to which degree I agree.

How much programming do you really need?

You can do a lot with spreadsheets and they are a really powerful tool once you get used to them.  I have never bothered myself, but I have seen what my colleagues manage to do with them.  So they may be sufficient for what you need and then it makes sense to learn how to get the most of them.

That being said, it is not a substitute for a programming language, and when you need a proper programming language you need a lot of hacking to do the same tasks in a spreadsheet.

If you need a complex statistical analysis of your data, you are better off using a language/environment such as R than trying to do the analysis in Excel.  Sure, Excel can do a lot of statistics, but only the most common types of analysis. If you need more, you need something like R.

Depending on how much manipulation of your data — and how much mathematical modelling you need to implement yourself — R might not be enough.  If you need to combine a lot of output of various programs, do a few manipulations on the output, and analyse that, you probably want to consider a scripting language like Python or Perl.

For heavy duty scientific programming, you want MATLAB or Octave or such.  When you need the extra speed, and your language does not already have an optimised solution, C or C++ is useful (either for a full application or for a module in your scripting or statistical language — they usually will allow you to make extensions in C/C++ that you can then use as any other module).

How much programming you need to do in scientific work varies a lot and there is not really any reason to spend time learning something that you will never use.

Maybe a spreadsheet is all you will ever need.  Just make sure that you are doing the analysis you want to do, not the one your tools limits you to, and if you need more, then spend the time learning how to program!

Learning how to program

Remember, though, that learning to program a computer takes time.  It can take a lot of time before you do it well.

There is a lot more to it than learning the syntax of a programming language.  You need to know the right way to solve given problems in your language of choice — the optimal way in one programming language can be very different from the optimal way in another, depending on libraries, language features, underlying philosophy of the language, etc. — and in general you need to learn how to think like a programmer.

It might be a different kind of problem solving skills than what you are used to.

Most of all it takes practise.  This can be frustrating if you pick up a language to solve a specific problem that you are more interested in than learning how to program.

If you only very rarely need to program, don’t bother.  Find someone to help you out.  This is how we bioinformaticians get involved in various projects so although we complain about being over worked, we don’t really mind so much. It is not different from having to consult a statistician from time to time to make sure that you are analysing your data correctly.  If you do not need it that often, your time might be better spend on other tasks.

If you often need to solve programming problems, you should learn how to do it properly.  You get a better feeling for the problems and for the data when you work with it, and you don’t want to out-source that to other people.

If you decide you need programming in your research — and I think more and more sciences rely on IT so maybe you do before you know it — then practice practice practice.  Experiment with your programming language.  Read discussion fora and user groups.  Read other peoples code and see how they solve similar problems.

Just as it takes time to learn how to use a spreadsheet properly — and just learning all the features it offers — it takes time to learn how to program.  Usually more so, as programming languages are much more powerful.  Spend the time!  There is no fast way around this, there really isn’t.

GSL and C++

Tuesday, May 27th, 2008

Yesterday, in my machine learning class, I showed some example code for artificial neural networks. I am giving my students a project tomorrow where they have to do regression using neural networks, and I wanted to show them how to program these networks. The pictures we draw of networks, and the mathematics we do with them, looks less like the implementation than the other machine learning methods we cover in the course, so I thought a code demo would be in order.

The project

It’s a simple feed-forward neural network I want them to implement in their project, so essentially they just need to be able to calculate target values from predictor variables and the network weights and then minimise the error function (as a function of the network weights).

target = y(x,w)
training: argminw Σn(tny(xn,w))2
Using the back-propagation algorithm they can get the gradient of the error function with respect to w and use that in the optimisation.Now, I could ask them to use the gradient descent algorithm or similar for the optimization, but this is not a course on numerical methods (even simple ones) and I don’t want to give them the impression that numerical optimization is simpler than it really is, so I suggest to them that they use a library for the optimization.

Example implementations

For class I showed how simple neural networks could be implemented and trained in Python, using scipy to do the training. This works very fine for the demonstration, since I can interactively train the networks to various target functions and plot how well the network fits the target.

To show that the language is more or less arbitrary for the solution, I also implemented a neural network in C++, this time using GSL to minimize the error function.

In theory, there shouldn’t be much difference between the Python and the C++ solutions — and for the actual neural network implementation there isn’t — but using GSL is much more tedious than the Python approach. It really shouldn’t be.

The main problem is that GSL is a C library and not a C++ library, so the features in C++ that would make the C++ solution just as simple and intuitive as the Python solution are simply not supported. One thing is operator overloading. It is missing from C, so working with vectors is a bit annoying. Especially since, in C++, you are used to the STL vectors, and switching back and forth between std::vector<> and gsl_vector * is a bit annoying. It is not the worst problem, though. Not in the neural network application, at least (it might be for more linear algebra intensive applications).

The one thing that annoyed me the most was how I should call the minimization function. In Python it was reasonably easy using high-level functions, so I could curry my error function and create a function of just the weights. In GSL, I had to go through a void pointer. In C++, I would have used a functor to deal with this, but since GSL is C it doesn’t have templates, so I couldn’t do that.

I wasn’t happy with resource management either. GSL is clearly not designed with exceptions in mind — after all, they do not exist in C — and this means that you either have to wrap all GSL types in small specialized classes so they will be destroyed properly, or you need to be careful with calling the correct <type>_free() functions. The later is very hard to do exception safe.

C++ wrappers for GSL?

I googled around a bit for C++ wrappers for GSL, and it seems like there are a few around, but not really any I found particularly attractive. GSL for C++ doesn’t appear particularly active (but I could be wrong). GSLwrap seems like it only wraps the linear algebra.

Are there any good GSL wrappers around, that use the high level features of C++?  Or any obvious replacement?

Just overloading operators to provide linear algebra syntax for matrix manipulation is not the way to go, unless there is some pretty heavy template magic involved, I think.  The methods in BLAS are pretty efficient because they can do a lot of operations with a single call.  Doing the same using operators would be very inefficient.  That is why the efficient C++ solutions use template expressions.  I am not sure that there is an easy way to do this through GSL, although the template expressions could end up just calling the BLAS routines, of course.

What about high level functions (functors) for optimization?

Maxima, a computer algebra system

Saturday, March 22nd, 2008

The last couple of days, I have been learning to use Maxima, a computer algebra system. I haven’t really used a symbolic computer algebra system for years. I used Maple the first two years when I studied math, but after that, I never really felt that I needed it. When getting my computer science degree, that was true, but now that I am doing bioinformatics I do a lot more math than I used to.

wxMaxima screenshotA couple of colleagues use Mathematica, but I am too cheap for that, so I went looking for an open source alternative. I did that a few years ago, but didn’t find anything I like (I even tried Maxima back then, but wasn’t impressed at the time). Then, earlier this week, one of the post docs at BiRC was using wxMaxima, a GUI wrapping the Maxima system, and it looked really nice, so I decided to give it another try. This time, I’m more impressed. I’m not really sure if it is because my maths skills have improved, or just that the GUI makes it much nicer to work with, though.

Now, I don’t know Mathematica and it has been ages since I used Maple, so I don’t know how it compares with those. I would imagine that its feature set is limited compared to those, but I do not really need that complicated mathematics, so I doubt it is something that will be a problem for me.

I don’t imagine that I will ever do any serious programming in Maxima, though. I will probably only use it to solve a few equations here and there and leave the serious data analysis to R and my numerical computation tasks to Octave or C++ with GSL.

Major association mapping software release

Thursday, February 14th, 2008

Today I am releasing new versions of about half my association software. It’s been a while since I released new versions of any of these tools, and in the mean time they’ve been more and more integrated making it harder to release them independently. Now, since we needed to use them all up in Iceland last we visited DeCODE — myself in December and three other from my group in January — we needed to get all the software synchronized anyway, so I wanted to take that opportunity to make a major release.

I had planned to make the release close to New Year, so I code-named it the New Year release. I think I should re-name it to the Chinese New Year release. That is close enough that I can defend it.

Of course, it is even closer to the next planned release — the Happy Birthday release — that was supposed to come out tomorrow (at my birthday, of course). That release is likely to be delayed a couple more weeks, though, but I am sticking to the code name.

By the way, you can see the road-map for that here.

The software release consists of the following:

SNPFile logoSNPFile — a library and API for manipulating large SNP datasets with associated meta-data, such as marker names, marker locations, individuals’ phenotypes, etc. in an I/O efficient binary file format. Version 2.0 adds a completely new serialization framework for storing meta-data. The previous one — based on Boost serialization — wasn’t binary compatible across platforms, the new one is. We also add a Python module for manipulation of SNPFiles, version 1.0 of that.

SMA logoSMA — tools for single marker association tests. Currently there are three tools, two for case/control data and one for quantitative traits. Version 1.2 extends the tools with options for doing both genotype and allelic (additive) tests.

Blossoc logoBlossoc — BLOck aSSOCiation. Blossoc is a linkage disequilibrium association mapping tool that attempts to build (perfect) genealogies for each site in the input and score these according to non-random clustering of affected individuals, and judge high-scoring areas as likely candidates for containing disease affecting variation. Building the local genealogy trees is based on a number of heuristics that are not guaranteed to build true trees, but have the advantage of more sophisticated methods of being extremely fast. Blossoc can therefore handle much larger datasets than more sophisticated tools, but at the cost of sacrificing some accuracy. Version 1.3 adds methods for scanning for quantitative traits and is tightly integrated with SNPFile.

HapCluster logoHapCluster — a Bayesian Markov-chain Monte Carlo (MCMC) method for fine-scale linkage-disequilibrium mapping, described in details in:

Fine Mapping of Disease Genes via Haplotype Clustering. E.R.B. Waldron, J.C. Whitaker and D.J. Balding. Genetic Epidemiology. 30: 170–179. (2006)

a tool I develop in collaboration with David Balding’s group at Imperial College London. Version 2.2 is basically just integration with SNPFile 2.0. The next major development of HapCluster is what I have planned for the Happy Birthday release.