Archive for May, 2009

Updating BiRC’s web pages

Wednesday, May 20th, 2009

At AU they are trying to move all the various departments’ web pages to a common TYPO3 CMS with a uniform look and feel.

We resisted a bit at BiRC, mainly because of the price the IT department at the Dept. of Computer Science wanted for hosting it.  We were quite happy with our own Skeletonz CMS and were hosting it for free.

Anyway, the Faculty of Science moved the web group from computer science to work directly under the faculty, and with that move made hosting free for all groups under the faculty, so we decided to try it out.

Yesterday I moved some of our existing pages to the new CMS.  I find TYPO3 a lot harder to work with than Skeletonz, but I’m sure I can get the hang of it eventually.  One thing I really like about it is that it is easy to share web elements between pages, so you only need to edit any particular element once and it will be updated on all the pages referring to it.  I use that extensively on the page describing our various projects.

Another selling point is that the CMS will soon be integrated with PU:RE, our publication management system.  We have to report all our publications – and various other activities – to that system, so being able to extract information from that database will reduce the work in maintaining the web pages significantly.  The plugin for that integration should be released shortly, and after that, I think we are ready for the move.


Last week in the blogs

Monday, May 18th, 2009

Another week has gone by.  I didn’t post much myself, but luckily a lot of other people did, and here is my list of favorites.

Biology

Computing/programming

Copyright and patent rights

Genetics

Physics

Statistics


Saturday morning physics

Thursday, May 14th, 2009

Here at AU we have something called the “physics show”.  We also have a “chemistry show” and I think even a “computer science show”.  Essentially, it is a group of students who give these “shows”, demonstrating various physics phenomena (or chemistry or whatnot) by doing experiments on stage and explaining the underlying theory.

It is mainly aimed at high school students and they go out to the schools in Denmark and do these shows to get students interested in science.

I think it is a great idea, and I have seen the show several times and always enjoyed it.

There was even a TV version of it, running late nights or early mornings on an obscure channel, that I watched with interest.  Partly because I know most of the people doing the show, but also because I love physics (I am just not good enough at it for it to be more than a hobby so I stick to bioinformatics).

The TV show ended years ago, but now on iTunes I found something even better to watch when I’m too lazy to do any work myself.

Saturday morning physics from the Uni of Michigan.

It is nothing like the physics show here, but a series of lectures on various topics (not all of them physics, though).

I would love to see more of this: lectures on iTunes.


Now how exactly was it I did that?

Thursday, May 14th, 2009

RRResearch has some thoughts about keeping records of computer work:

When I do benchwork I consistently keep pretty good notes.  I write down everything I do as I do it, on numbered and dated sheets of paper that go into looseleaf binders, organized by experiment.

But I don’t seem to be able to apply these good record-keeping habits when I’m working with computers.  Instead everything I do feels ‘exploratory’, as if everything I do is just a preliminary check to see what effect a modification will have, before I do something worth writing down.

I recognise this all too well.

It is not so much a problem when I do some exploratory data analysis.  I will have my R log to see what I actually did, and if I find an interesting pattern I know what I found and I don’t really need the history of how I got there so much.

When writing programs I don’t have the problem either.  There I have source control and bug trackers to help me.

My problem is with scripts.

I write a small script to format my data into something I can analyse.  Run a program or two on the data. Write another script to re-format the data.  A small script to pull out relevant data.  Look at that.  Then I need to just check a few things, and that is easy as another little script.

Very soon I have ten to twenty small scripts of five to ten lines each. None of them are really worth putting in version control or cleaning up or anything, ’cause it was all just exploratory anyway, but if I come back to the data a few weeks later, I have no way of reproducing what I did.

It is really horrible.

Ideally, once I know what I want to do with the data, I should clean up the pipeline, put it under version control and document it, but by then I am already done with the data analysis so I rarely bother.

Until I have to do it all again a few weeks or months later on some new data.

At that point I should really clean up the pipeline, but most likely I need to do something slightly different.  Not drastically different, but a few of the steps should be modified anyway, and depending on the results I need a few more scripts and it just spirals out of control.

I don’t really know how to solve this; I only know that what I am doing is quite sub-optimal.


Widespread genomic signatures of natural selection in hominid evolution

Tuesday, May 12th, 2009

Friday last week, PLoS Genetics published a paper I’ve been waiting to read for a few weeks, since I saw a reference to it in a draft of a review paper I got by email (that paper I’ll tell you all about when it comes out).

The PLoS Genetics paper is this:

Widespread Genomic Signatures of Natural Selection in Hominid Evolution

Graham McVicker, David Gordon, Colleen Davis, and Phil Green

Selection acting on genomic functional elements can be detected by its indirect effects on population diversity at linked neutral sites. To illuminate the selective forces that shaped hominid evolution, we analyzed the genomic distributions of human polymorphisms and sequence differences among five primate species relative to the locations of conserved sequence features. Neutral sequence diversity in human and ancestral hominid populations is substantially reduced near such features, resulting in a surprisingly large genome average diversity reduction due to selection of 19–26% on the autosomes and 12–40% on the X chromosome. The overall trends are broadly consistent with “background selection” or hitchhiking in ancestral populations acting to remove deleterious variants. Average selection is much stronger on exonic (both protein-coding and untranslated) conserved features than non-exonic features. Long term selection, rather than complex speciation scenarios, explains the large intragenomic variation in human/chimpanzee divergence. Our analyses reveal a dominant role for selection in shaping genomic diversity and divergence patterns, clarify hominid evolution, and provide a baseline for investigating specific selective events.

The reason I’ve been waiting for the paper is that it concerns something I am very interested in myself, and something we are working on in our CoalHMM group here at BiRC: detecting selection by detecting variation in effective population size along the genome.

Effective population size

Okay, the concept “effective population size” is a strange beast.  It doesn’t really have anything to do with population size, except in an idealised mathematical model, but is a single parameter that absorbs the effects of various factors such as demography and selection.

There’s a nice introduction to it in this John Hawks post: Did humans face extinction 70,000 years ago?

As described there, one way of looking at the effective population size is to define it from the average coalescence time of two random individuals in a population.  If we look at it that way, it is clear that selection will affect the effective population size.
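To make the coalescence-time definition concrete, here is a minimal simulation sketch.  It assumes an idealised haploid Wright-Fisher population of constant size (an assumption made for illustration, not something taken from the post linked above), in which two lineages pick the same parent in any given generation with probability 1/n.  The function names and default values are hypothetical.

```python
import random


def pairwise_coalescence_time(n, rng):
    """Generations until two lineages share a parent in an idealised
    haploid Wright-Fisher population of constant size n (assumption)."""
    generations = 1
    # Each generation back, both lineages pick a parent uniformly at random;
    # they coalesce when they happen to pick the same one (probability 1/n).
    while rng.randrange(n) != rng.randrange(n):
        generations += 1
    return generations


def estimate_effective_size(n, replicates=5000, seed=1):
    """Estimate the effective size as the mean pairwise coalescence time."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(n, rng) for _ in range(replicates))
    return total / replicates
```

In this neutral, constant-size setting the estimate should land close to n itself; anything that makes two lineages find a common ancestor sooner, selection included, shows up as a smaller value.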

A site under selection, if it gets fixed, will do so much faster than a site that is neutral.  A neutral site that gets fixed does so (on average) in time linear in the effective population size, while a site under selection does so in logarithmic time (regardless of whether it is positive or negative selection, surprisingly, but of course if it is negative selection the probability of it getting fixed is smaller).
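The linear-versus-logarithmic contrast can be illustrated with textbook diffusion approximations.  The sketch below assumes a diploid population, where a new neutral mutation that fixes takes about 4N generations on average, and a new mutation under strong additive selection with coefficient s takes roughly 2 ln(2N)/s generations.  Both formulas are standard approximations plugged in here for illustration, not numbers from the paper, and the function names are hypothetical.

```python
import math


def neutral_fixation_time(n):
    """Approximate mean fixation time (generations) of a new neutral
    mutation, conditional on fixation, in a diploid population of size n."""
    return 4.0 * n


def selected_fixation_time(n, s):
    """Approximate fixation time (generations) of a new mutation under
    strong additive selection with coefficient s (assumes 2*n*s >> 1)."""
    return 2.0 * math.log(2.0 * n) / s


def compare(sizes, s=0.01):
    """Return (n, neutral_time, selected_time) for each population size."""
    return [(n, neutral_fixation_time(n), selected_fixation_time(n, s))
            for n in sizes]
```

For a fixed s, doubling the population size doubles the neutral estimate but only adds a constant, 2 ln(2)/s generations, to the selected one.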

If we consider a site where mutations occur that are selected against, but these are not fixed, we still see a reduction in the time between two random individuals but for a different reason: those ancestors that were selected against do not have descendants in the present population, so the number of possible ancestors of two random individuals is smaller and when we trace their ancestry back in time, they will find a common ancestor faster.

So in any case, if a site is under selection, we expect the mean time back to a common ancestor — the effective population size — to be reduced.

To muddy the waters a little bit: effective population size also affects selection since selection is stronger if the population size is large but that is a complication best left for another day…

Recombination

Recombination has an effect on this as well.

A site under selection will have a smaller effective population size, but so will nearby sites.  The reason for this is that neighbouring nucleotides are likely to have the same most recent common ancestor, and thus the same divergence, with a probability that depends on the recombination distance between them.

Consequently, we expect the effective population size to decrease as we move towards a site under selection, and increase again as we move away from it.
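A minimal sketch of this dip-and-recover pattern, assuming a classic background selection approximation for a single selected site: B = exp(-u s / (s + r (1 - s))^2).  Here B is the factor by which the effective population size is scaled at a neutral site, u is the deleterious mutation rate at the selected site, s is the selection coefficient against heterozygotes, and r is the recombination rate between the two sites.  The formula and the default parameter values are assumptions chosen for illustration; they are not the fitted model or the estimates from the paper.

```python
import math


def background_selection_factor(r, u=0.001, s=0.01):
    """Relative effective population size B at a neutral site that sits at
    recombination distance r from one site under purifying selection.

    Uses the approximation B = exp(-u*s / (s + r*(1 - s))**2); u and s are
    illustrative values, not estimates from any data set.
    """
    return math.exp(-u * s / (s + r * (1.0 - s)) ** 2)


def profile(distances, u=0.001, s=0.01):
    """B at each recombination distance in `distances`."""
    return [background_selection_factor(r, u, s) for r in distances]
```

At r = 0 the factor is exp(-u/s), and it climbs back towards 1 as r grows, which is the shape described above.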

It is this kind of pattern that McVicker et al. analyse in this paper.

Results

First they identify conserved genomic regions.  These are the regions that are probably under selection, since selection is one of the forces that will conserve sequences.

They do this by running a phylo-HMM on an alignment of mammals (excluding those they will analyse later on to avoid biasing the results).

They then split the genome into two classes: those nucleotides within the 10% of the genome closest to a conserved region, and the 50% furthest away.  In these two classes they look at the level of polymorphism in humans, the divergence between human and chimp, and the number of informative sites supporting a grouping of human with gorilla — with chimp as an outgroup — and those grouping chimp with gorilla — with human as an outgroup.  The latter are signs of deep coalescence resulting in incomplete lineage sorting, and signs of a large effective population size in the human/chimp ancestor.
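The counting of informative sites can be sketched as follows.  This only illustrates the idea, assuming a five-species alignment column stored as a dict with the hypothetical keys below (the two outgroup species are my assumption), and using the definition from the paper's figure legend: an HG (or CG) site is one where human and gorilla (or chimp and gorilla) share a nucleotide that differs from the other three species.  HC sites, which support the species tree, are counted the same way for comparison.  The filtering steps of the real analysis are not reproduced here.

```python
from collections import Counter

# Species keys are hypothetical; the two outgroups are assumptions.
SPECIES = ("human", "chimp", "gorilla", "orangutan", "macaque")

PAIRS = {"HC": ("human", "chimp"),
         "HG": ("human", "gorilla"),
         "CG": ("chimp", "gorilla")}


def classify_site(column):
    """Classify one alignment column as 'HC', 'HG', 'CG' or None.

    A site supports a pair when those two species share one nucleotide and
    the other three species all share a different one (a strict reading of
    the definition; looser readings are possible).
    """
    for label, pair in PAIRS.items():
        pair_bases = {column[sp] for sp in pair}
        rest_bases = {column[sp] for sp in SPECIES if sp not in pair}
        if (len(pair_bases) == 1 and len(rest_bases) == 1
                and pair_bases != rest_bases):
            return label
    return None


def count_sites(columns):
    """Count informative sites of each type over many columns."""
    labels = (classify_site(column) for column in columns)
    return Counter(label for label in labels if label is not None)
```

The interesting quantity is then how the HG and CG counts compare with the HC count, since a larger share of HG and CG sites points to more incomplete lineage sorting.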

For all measures, they find that the effective population size seems to be reduced for the 10% closest to conserved regions compared to the 50% farthest away.

Since the measures are essentially all just measures of conservation, really, that isn’t in itself much of an argument.  All it says is that there is a correlation of conservation-ness along the genome.  To compensate for this, they then normalise with the divergence to macaque and to dog.  If it is just a reduction in substitution rate that is correlated, then normalising this way — assuming that the substitution rate doesn’t change dramatically along the genome and along the phylogeny — will alleviate the effect from just the substitution rate.
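The near/far comparison and the normalisation step can be sketched like this.  It assumes each neutral site comes with a distance to the nearest conserved segment, a diversity (or divergence) value, and an outgroup divergence value used as a proxy for the local substitution rate.  The data layout and function names are hypothetical, and this is not the authors' actual pipeline.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def near_far_ratio(sites, near_frac=0.10, far_frac=0.50, normalise=False):
    """Ratio of mean diversity in the nearest `near_frac` of sites to the
    mean in the farthest `far_frac`.

    `sites` is a list of (distance, diversity, outgroup_divergence) tuples.
    With normalise=True each group's mean diversity is divided by its mean
    outgroup divergence before the ratio is taken.
    """
    ordered = sorted(sites, key=lambda site: site[0])
    n_near = max(1, int(len(ordered) * near_frac))
    n_far = max(1, int(len(ordered) * far_frac))
    near, far = ordered[:n_near], ordered[-n_far:]

    def level(group):
        diversity = mean(d for _, d, _ in group)
        if normalise:
            diversity /= mean(o for _, _, o in group)
        return diversity

    return level(near) / level(far)
```

A ratio that stays below 1 with normalise=True is what we would expect if something more than a locally lower substitution rate is reducing diversity near conserved regions.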

After normalising, the signal is still there: the polymorphism and divergence are still reduced close to conserved regions.

Again, this doesn’t prove that selection is the cause of this pattern, but the pattern certainly matches what we would expect to see if it was selection that caused it.  The normalisation should eliminate, or at least reduce, effects that are just caused by the substitution rate, so unless we invoke some more exotic explanation for conservation and the patterns along the genome, selection is a valid conclusion.

(A) Ratios calculated using the 10% of neutral sites which are nearest to and the 50% of neutral sites farthest away from conserved segments or exons. (B) The same ratios as (A) but normalized by human/macaque (H/M) divergence to account for mutation rate variation or undetected sites under purifying selection. The distance to the nearest conserved segment or exon was determined using four different measures: physical distance, pedigree-based recombination distance [26], polymorphism-based finescale recombination distance [25] and the background selection parameter, B. B (described in the main text) is not technically a distance measure but incorporates information about the recombination rate and local density of conserved segments. Autosomal human nucleotide diversity was calculated from gene-centric SeattleSNPs PGA/EGP [20], whole-genome Perlegen [19] data, and HapMap phase II data [67]. Divergence was estimated using autosomal human/chimp (H/C), human/macaque (H/M), or human/dog (H/D) genome sequence data. HG and CG sites (where human and gorilla or chimp and gorilla share a nucleotide that differs from the other three species) were calculated using a smaller set of 5-species autosomal data. Repetitive regions were omitted from the Perlegen and HapMap analyses; additional filtering steps are described in the methods. Whiskers are 95% confidence intervals.

Now that selection is concluded to be a plausible explanation for the pattern, they fit the data to a model that explains the variation by background selection. This model shows that selection is stronger near conserved regions than farther away, consistent with the assumption that the pattern is caused by selection.

Consequences

So what does all this tell us?

For one thing, it tells us that selection is a force we really should keep in mind when analysing genomes.  Yes, yes, we probably already knew that, but the neutrality assumption is so strong in genome analysis that we rarely consider non-neutrality except for the obligatory dN/dS tests on genes.  For anything that is not a gene, we usually analyse the sequences assuming neutrality.  It is a good null model, but completely ignoring selection when analysing genomic sequences should be reconsidered.

I know, I am exaggerating a bit here, ’cause people are not just blindly assuming neutrality, but it is a strong null assumption and we really do not like to invoke selection unless there is strong evidence against neutrality.

Another consequence is for sequence divergence.

We estimate species divergence (time of speciation events) from sequence divergence.  More often than not we equate sequence divergence with species divergence, but really we shouldn’t.  Even under neutrality this isn’t true, since the coalescence process of sequences is such that the sequences are further apart than the species, but under neutrality at least this pattern is random along the genome.

There is still some correlation along the sequence of divergence time, under a neutral coalescence model, but at least this correlation drops off rapidly with (recombination) distance and it is not correlated with other genomic features (except in the sense that the substitution rate depends on these features).

With selection working its magic on a genome scale, the patterns of sequence divergence get a lot more interesting.

All of this is not really a new insight.  People working with e.g. Drosophila have known this for decades, but it has been ignored in more papers than I care to mention, and perhaps it is time we stop doing this.


McVicker, G., Gordon, D., Davis, C., & Green, P. (2009). Widespread Genomic Signatures of Natural Selection in Hominid Evolution. PLoS Genetics, 5(5). DOI: 10.1371/journal.pgen.1000471
