Posts Tagged ‘software’

GSL and C++

Tuesday, May 27th, 2008

Yesterday, in my machine learning class, I showed some example code for artificial neural networks. I am giving my students a project tomorrow where they have to do regression using neural networks, and I wanted to show them how to program these networks. The pictures we draw of networks, and the mathematics we do with them, looks less like the implementation than the other machine learning methods we cover in the course, so I thought a code demo would be in order.

The project

It’s a simple feed-forward neural network I want them to implement in their project, so essentially they just need to be able to calculate target values from predictor variables and the network weights and then minimise the error function (as a function of the network weights).

target = y(x,w)
training: argminw Σn(tny(xn,w))2
Using the back-propagation algorithm they can get the gradient of the error function with respect to w and use that in the optimisation.Now, I could ask them to use the gradient descent algorithm or similar for the optimization, but this is not a course on numerical methods (even simple ones) and I don’t want to give them the impression that numerical optimization is simpler than it really is, so I suggest to them that they use a library for the optimization.

Example implementations

For class I showed how simple neural networks could be implemented and trained in Python, using scipy to do the training. This works very fine for the demonstration, since I can interactively train the networks to various target functions and plot how well the network fits the target.

To show that the language is more or less arbitrary for the solution, I also implemented a neural network in C++, this time using GSL to minimize the error function.

In theory, there shouldn’t be much difference between the Python and the C++ solutions — and for the actual neural network implementation there isn’t — but using GSL is much more tedious than the Python approach. It really shouldn’t be.

The main problem is that GSL is a C library and not a C++ library, so the features in C++ that would make the C++ solution just as simple and intuitive as the Python solution are simply not supported. One thing is operator overloading. It is missing from C, so working with vectors is a bit annoying. Especially since, in C++, you are used to the STL vectors, and switching back and forth between std::vector<> and gsl_vector * is a bit annoying. It is not the worst problem, though. Not in the neural network application, at least (it might be for more linear algebra intensive applications).

The one thing that annoyed me the most was how I should call the minimization function. In Python it was reasonably easy using high-level functions, so I could curry my error function and create a function of just the weights. In GSL, I had to go through a void pointer. In C++, I would have used a functor to deal with this, but since GSL is C it doesn’t have templates, so I couldn’t do that.

I wasn’t happy with resource management either. GSL is clearly not designed with exceptions in mind — after all, they do not exist in C — and this means that you either have to wrap all GSL types in small specialized classes so they will be destroyed properly, or you need to be careful with calling the correct <type>_free() functions. The later is very hard to do exception safe.

C++ wrappers for GSL?

I googled around a bit for C++ wrappers for GSL, and it seems like there are a few around, but not really any I found particularly attractive. GSL for C++ doesn’t appear particularly active (but I could be wrong). GSLwrap seems like it only wraps the linear algebra.

Are there any good GSL wrappers around, that use the high level features of C++?  Or any obvious replacement?

Just overloading operators to provide linear algebra syntax for matrix manipulation is not the way to go, unless there is some pretty heavy template magic involved, I think.  The methods in BLAS are pretty efficient because they can do a lot of operations with a single call.  Doing the same using operators would be very inefficient.  That is why the efficient C++ solutions use template expressions.  I am not sure that there is an easy way to do this through GSL, although the template expressions could end up just calling the BLAS routines, of course.

What about high level functions (functors) for optimization?

Software decay and software repositories

Thursday, May 22nd, 2008

bbgm suggests:

In essence this expands on the issue that I have been raising lately; that academics should use code repositories like Google Code, Sourceforge or Github. That not only moves some of the issues with code maintenance infrastructure and utilities out onto the cloud, it also brings in the ability of a bigger user base, ability to access mode more easily, etc.

Will this solve the problem of URL decay mentioned in the latest issue of Bioinformatics?

URL decay in MEDLINE — a 4-year follow-up study

Jonathan D. Wren Bioinformatics 2008 24(11):1381-1385; doi:10.1093/bioinformatics/btn127

Abstract

Motivation: Internet-based electronic resources, as given by Uniform Resource Locators (URLs), are being increasingly used in scientific publications but are also becoming inaccessible in a time-dependant manner, a phenomenon documented across disciplines. Initial reports brought attention to the problem, spawning methods of effectively preserving URL content while some journals adopted policies regarding URL publication and begun storing supplementary information on journal websites. Thus, a reexamination of URL growth and decay in the literature is merited to see if the problem has grown or been mitigated by any of these changes.

Results: After the 2003 study, three follow-up studies were conducted in 2004, 2005 and 2007. Unfortunately, no significant change was found in the rate of URL decay among any of the studies. However, only 5% of URLs cited more than twice have decayed versus 20% of URLs cited once or twice. The most common types of lost content were computer programs (43%), followed by scholarly content (38%) and databases (19%). Compared to URLs still available, no lost content type was significantly over- or underrepresented. Searching for 30 of these websites using Google, 11 (37%) were found relocated to different URLs.

Conclusions: URL decay continues unabated, but URLs published by organizations tend to be more stable. Repeated citation of URLs suggests calculation of an electronic impact factor (eIF) would be an objective, quantitative way to measure the impact of Internet-based resources on scientific research.

It certainly seems like we are loosing our data and programs, so some larger repositories might be the way to go…

Debugging fun

Saturday, May 10th, 2008

Yesterday I got a simulator from Garrett Hellenthal that I need to simulate some genetics data (combined “panel” and “tagSNP” data, if you want to know).

The program didn’t work for me. On the example data provided with the code, it complained about invalid input. Right! Naturally, I expected that Garrett had given me test data that was in some way outdated. It happens quie often that the test data isn’t updated when new features are added to a program, and that test examples no longer function the way they should. I make that mistake a lot, anyway.

So I got hold of Garrett, and yes the test example was incorrect and he sent me an updated test. It didn’t work either. We emailed back and forth a bit, and this time it was a bit of a puzzle. Everything worked fine at Garrett’s end, but the program just wouldn’t eat the input on my computer. I didn’t have the time to figure out why, yesterday, as I had a plane to catch, but I had another look at it today.

I first grep’ed for the error message I got: “CV frequencies must be between 0 and 0.50″. That line occurred a few places in the code, so I annotated the code with some output before each and that pinpointed this test as the problem:

if ((allelefreq[count] < epsilon[count]) || (allelefreq[count] > (0.50-epsilon[count]))) {
   printf("CV frequencies must be between 0 and 0.50\n");
   exit(1);
 }

Ok, so on Garrett’s machine, the test was false and on mine it was true, but the input files were the same. First I tested if the input was read in correctly (Garrett’s IO code is a bit of a mess, excuse me for saying it). allelefreq[count] was supposed to be 0.4 — and it was — and epsilon[count] should be 0.1 — and it was.

That’s when I spotted the problem. Do you see it?

Clearly 0.4 ≥ 0.1 and 0.4 ≤ 0.5-0.1 so the test should evaluate to false, but it doesn’t.

The thing is, floating point numbers aren’t real numbers (as you probably know). And for one thing, 0.1 cannot be represented in a finite number of bits (as a binary number) so it is approximated in the computer. So 0.5-0.1 is not 0.4 (although they could be represented by the same bit-patter as 0.4 is represented as). Whether 0.4 ≤ 0.5-0.1 or 0.4 > 0.5-0.1 on the computer depends on the way the numbers are represented.

Garrett was running his program on a 64 bit architecture (where the numbers were represented using 128 bits) and I on a 32 bit architecture (with 64 bits per number). That made all the difference.

Fun, eh?

We need faster machines…

Tuesday, April 22nd, 2008

I missed this post earlier, but it is a short little story of things to come … well, it is already happening at Washington University where they are at the forefront of genome sequencing, but the rest of us will feel it soon enough.

With new sequencing technologies, we are creating massive data sets, and the IT infrastructure is becoming a bottleneck.  In my own work — association mapping and genetics — we have already been feeling this for a while, ’cause we are using quite time consuming analysis methods — but with the vast increase in data even the simplest analysis could be a problem.

Apparently, 1600 cores is not enough at WashU.

Computer scientists are becoming more important for biology, I guess, and it is time for those algorithmics guys to get cracking.

Preview of CLC Genomics Workbench

Thursday, April 10th, 2008

I’m not planning on turning my blog into a commercial for CLC Bio (with whom I have no affiliation at all, trust me), but I’d like to show this video showing a preview of their new software, to be released shortly:

It’s not something I think I will be using myself — I do mainly theoretical work and on the rare occasions where I have to analyse “real data” I usually need custom software anyway — but the presenter, Mikkel Nygaard Ravn, is an old friend from my computer science days. I haven’t seen him for ages so it was fun watching him on youtube and I just had to link to it.