Archive for September 19th, 2009

Profiling with Shark

Saturday, September 19th, 2009

I have absolutely no experience with profiling on a Mac.  I’ve used gprof and valgrind a lot on Linux, but now that I’ve started developing on Mac I need to learn how to profile here as well.

A bit of googling tells me that there are two nice tools for this, Shark and Instruments.  I have both installed and decided to try out Shark first, since that looked a bit easier to use.  I am also going to try out Instruments later, but my experience with Shark was pretty good.

It is a sampling-based profiler, so to use it you just start your application and then start sampling.  It samples everything running on your computer, but if your program is doing a significant amount of work it is easy to find in the resulting performance profile, and you can then filter out everything else.

I actually have something I need to profile having to do with file IO, but the data I need for that is on another machine that is now busy with actual computations, so for my experiments with Shark I just tried out our CoalHMM tool on the example data distributed with the code.

I started the tool, then started the sampling, and 30 seconds later I got this profile:

[Image: Performance profile]

It is pretty clear from it that there is a hotspot worth looking at (in the Bio++ NumCalc library), and looking at the code Shark nicely shows where it is:

[Image: Hotspot in the code]

It even gives hints as to what the problem could be and how to fix it.  Neat!

The hotspot doesn’t surprise me much.  The application is a hidden Markov model, and I fully expected that most of the time was spent in the Forward algorithm.  The solution doesn’t surprise me either – and we are already working on an SSE improvement.  Still, with profiling you can never be sure, so it is nice to be confirmed.
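To illustrate why the Forward algorithm tends to dominate the running time of a hidden Markov model, here is a minimal sketch in Python. It is not the CoalHMM or Bio++ code; the function name `forward_loglik` and the parameter layout (`init`, `trans`, `emit` as plain lists) are assumptions made for this example. The point is the innermost loop: for every observation it does a multiply-add for every pair of states, which is exactly the kind of numerical kernel a profiler flags and that SIMD instructions can speed up.

```python
import math

def forward_loglik(init, trans, emit, observations):
    """Log-likelihood of `observations` under a toy HMM (scaled Forward algorithm).

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of emitting symbol o in state i
    """
    n = len(init)

    # First column of the forward table, rescaled to sum to one.
    alpha = [init[i] * emit[i][observations[0]] for i in range(n)]
    scale = sum(alpha)
    alpha = [a / scale for a in alpha]
    loglik = math.log(scale)

    for obs in observations[1:]:
        new_alpha = [0.0] * n
        for j in range(n):
            # Hot inner loop: one multiply-add per (i, j) pair of states.
            total = 0.0
            for i in range(n):
                total += alpha[i] * trans[i][j]
            new_alpha[j] = total * emit[j][obs]

        # Rescale each column to avoid underflow on long sequences.
        scale = sum(new_alpha)
        alpha = [a / scale for a in new_alpha]
        loglik += math.log(scale)

    return loglik
```

With n states and T observations the inner statement runs roughly n × n × T times, so in a sketch like this nearly all the time is spent there.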

I also tried the simple fix of enabling auto-vectorization (-ftree-vectorize) and compared that solution to the one before (something Shark also makes easy).

[Image: Profile comparison]

It gives a very modest improvement, but I guess it isn’t that easy for the compiler to automatically insert SIMD instructions in code like this… I expect more from our hand-coded version, where we currently get two- to four-fold improvements, depending on whether we use single (float) or double precision.
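The two- to four-fold range is what one would expect from the width of an SSE register, which is 128 bits. The small calculation below shows how many values fit in one register for each precision. Reading these lane counts as a rough upper bound on the speedup is an assumption of this sketch, not a measurement from the post.

```python
SSE_REGISTER_BITS = 128  # width of one SSE (XMM) register

def simd_lanes(value_bits):
    """Number of values of the given width that fit in one SSE register."""
    return SSE_REGISTER_BITS // value_bits

print("float  (32-bit):", simd_lanes(32), "values per instruction")
print("double (64-bit):", simd_lanes(64), "values per instruction")
```

This prints 4 for float and 2 for double, matching the ideal four-fold and two-fold gains.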


This limits the usefulness of Xgrid a bit…

Saturday, September 19th, 2009

Ok, I noticed this yesterday but figured it was a configuration issue that I could deal with.  When I run jobs on Xgrid, it runs one job per CPU and not one per core, which for my current use means that I only have half the CPU power compared to manual distribution of jobs.

I read the documentation, and it is supposed to run one job per core, but something is wrong on Snow Leopard and this is apparently a known issue.

I hope this gets fixed before I have a real need for the grid.
