Posts Tagged ‘coalHMM’

Insignificant, but significant, differences

Friday, May 1st, 2009

I have this HMM that lets me estimate speciation times and effective population sizes.  Part of the CoalHMM work we are doing at BiRC, but a completely new model that Julien and I just implemented over Easter.

I am using it to analyse the two orangutan subspecies and there is a slight problem that I am checking to see if it affects the results.

In simulations we have seen that the estimates can be biased, but in general if we have enough states in the HMM the bias shrinks (we can vary the states, that corresponds to coalescence times, so the HMM is really a class of HMMs with different number of states).

For my analysis, I have tried running the HMM on chromosome 22 with different numbers of states to see how it affects the parameter estimates:

The estimates are independent estimates for 1Mbp "chunks" along the genome, and clearly there is a consistent difference in the estimates when varying the number of states.  You can test this with a paired t-test or just look at plots of the differences in the estimates.

So the difference is statistical significant, and except for the recombination rate it seems that increasing the number of states changes the estimates in the same direction.

Compared to the variance in the estimates along the chromosome, though, the differences between using different number of states is tiny.  Completely insignificant, just not statistical insignificant.

Not sure what to make of that, except that the number of states probably does not have much affect on the final analysis results.

--

121-137=-16

Crazy busy week

Friday, April 17th, 2009

Oh man, this week has been too busy even for my taste.

I really was supposed to be working on a project that I am going to London to finish in a week, but a genome project got in the way.

We are involved in a few genome sequencing and analysis project at BiRC where we use our CoalHMM methodology to infer speciation times and ancestral effective population sizes.

The method uses changes in genealogy topology to infer this, but essentially ignores the variation in coalescence times along the genomes and just uses the mean sequence divergence.

Back when I was writing about how the coalescence process leads to such variation in divergence, see:

I also developed an approximation to the coalescence with recombination process that lets us estimate the distribution of coalescence times and fragment lengths.  I'll write about how it works later, I promise, but it takes a bit explaining so I need to find a day where I'm not over worked, so it will be some upcoming vacation or something...

Anyway, because we can compute these distributions we can also construct a hidden Markov model that uses them in its transition matrix, and over Easter, Julien implemented this HMM.  Our initial simulation-based validation is very promising, so now we are analysing a few genomes.

The method is relatively fast, so we can analyse a genome in about a CPU week on our Linux grid.  Still, because we are adding a new method and completely new results very late in a project, we need to move fast to get the results into the paper.  Since I am away in the UK at the end of the month, and Julien is away for a week of well deserved holiday and then job hunting in France, we need some convinsing results this week already.  Thus a very busy week.

That is why the blog has been very quiet lately.  No excuse, of course, but an explanation.

Ok, off I go again.  I need to finish making slides for a talk I'm giving this afternoon (on genomic analysis of abes and the issues mentioned above...)

--

107-130=-23

Damn my lousy programming skills

Thursday, April 2nd, 2009

I just found out that the experiments I used to figure out the feasibility of analysing the CoalHMM runs through a database were buggy.  I noticed that the query time once we started filling in actual data in our database were very inconsistent with the results I got earlier, so I started reading through the prototype code and comparing it with the actual database code.

It turns out that I had a bug in the prototype that speeded the query time up a lot, but gave the wrong results.  The actual code, that was properly tested, doesn't have the bug, but the query time is much slower.

I guess I have to do the feasibility analysis all over again...

--

92-114=-22

Building a database of CoalHMM results

Wednesday, March 25th, 2009

In our CoalHMM work, we analyse whole genome alignments, and for each column in the alignment we essentially get the probability of being in different gene trees.  We want to look at the correlation of various genomic features (like gene density, conserved region, etc.) with the gene tree distribution.

To do this, we are building a MySQL database, so we can easily pick out the results for the alignment columns with various features.

One problem here is that we easily have 1-2 G of columns, and each column corresponds to different positions in the genomes we consider.  Plus, they are not contiguous in the genomes, even if they are in the aligment, because of various structural changes.

Since the genomic features are in genome coordinates (chromosomes plus positions on the chromosomes) we need a mapping from genomic coordinates to alignment columns.  With gigs of rows for both kinds of coordinate systems.

So we have a table with the probabilities for each column in the alignment, plus a table for each species mapping genomic coordinates to alignment columns.  The trick is now to have fast queries on the genomic coordinates to get the alignment columns.

I've been playing around with this, this afternoon.

Query times without indexing

With just plain tables, with no indexing on the columns, we have the worst case behaviour.  Before trying out any optimisations, I just wanted to figure out how slow queries would be.

So I tried out different alignment sizes, varying from 100K to 1M, and distributed these on chromosomes (of equal size) varying in size from 10K to 100K.

I then timed the queries for random chunks of 1K contiguous positions.  The results are plotted below:

The time doesn't seem to depend much on the chromosome sizes, but depends significantly on the alignment size.

To get a feeling for the query time if I moved to a 2G alignment, I simply fitted a line to the query time (the dashed line in the top plot) and extrapolated from that.  The time came out as ~1400 seconds, which is clearly too slow for any serious analysis.

Query times with simple indexing

Clearly some indexing of columns is necessary, so I tried out the simple solution of putting an index on the chromosomes and on the chromosomal positions.

Results below:

This, not so surprisingly, improved the query time a lot, but extrapolating from the regression time still gives me ~12 seconds query times if the alignment is 2Gbp.

Query times with spatial queries

So I pulled out the old chestnut of spatial queries, and represented the genomic coordinates as two dimensional points (the chromosome on one axis and the chromosomal position on the other).

This time, the results came out as:

The regression line this time show no significant dependency on the alignment size (P=0.6), so if this is true I can extrapolate the query time for the 2Gbp alignment just from the mean query time from the smaller alignments.  Even if there is some dependency, I can still be luck and have a query time of fractions of a second.

Sweet.

Remaining problems

The only annoying thing with this solution is that I have to construct geometric structures as strings in scripts, like this:

def _region2poly(c,start,end):
    return 'GeomFromText("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))")' \
            % (c-.1, start-.1,
               c+.1, start-.1,
               c+.1,end+.1,
               c-.1,end+.1,
               c-.1,start-.1)

This means that some of the logic in the database is not actually stored in the database, but in scripts accessing.  Especially annoying since I plan to access it from both Python and R (and possibly C++ where I'm thinking about writing a Qt application for interactive viewing of results).

That means a lot of duplicated code, which is never a good idea, and is bound to bite me in the bum at some later time.

Is this something that can be done with MySQL procedures?  I have never used those, and am really a novice when it comes to databases, so I really have no idea...

--

84-104=-20

On segment lengths, going back in time, in the coalescence process. Part 2: The ancestry of two species

Friday, February 27th, 2009

Previously I wrote about the coalescence process when tracking a single lineage back in time.  Today I will write a bit about the process when dealing with two sequences, and the consequences when comparing the genomes of two species.

I will start out with considering the plain old coalescence process, when we consider two sequences back in time, and at the end move to the process when we consider sequences from two different species.

The ancestry of two sequences

In Wiuf and Hein (1999) they considered the coalescence process with recombination of \(k\) sequences, and how it distributes segments on different most recent common ancestors (MRCAs).  Here, I will just consider \(k=2\).

If you simulate the coalescence process for two sequences, you will -- just as for the single sequence -- see sequences break up, due to recombination events, and merge again, due to coalescence events.

We can then ask: When do two nucleotides, one for each initial sequence, reach their MRCA?  How many MRCAs do we eventually end up with? How many nucleotides share the same MRCA?  How long are the contiguous segments on an MRCA? etc.

Below I have plotted the coalescence times, where each nucleotide pair from the two sequences, reach their MRCA.  The simulations were run with \(N_e=10000\), sequence length \(L=100K\) and per nucleotide recombination rate \(r=10^{-8}\).

You cannot see this from the plot, but when you see two segments with the same \(y\) value, they are actually sitting on the same MRCA.  That doesn't mean that the segments on the same MRCA are contiguous.  Just to the left of the middle of the plot you see two long segments with the same MRCA, but with a short segment between them that coalesced slightly earlier.

You might notice that the smaller the coalescence time, the longer the segments.  This is generally the case; the longer you go back in time, the greater that chance that a recombination event breaks up a segment.  The broken up segments will, of course, coalesce again, but initially the coalescence events are relatively rate because there are few lineages.  The recombination events are also rare, per nucleotide, but since we are tracking a lot of nucleotides back in time, they still occur quite frequently.

Once recombination has broken the initial sequences into many fragments, the coalescence rate is higher and we see a lot of recombinations, but by now the fragments are already much shorter.

The long fragments that coalesce early are of course still part of the recombination/coalescence process, but since today we only track them up to their MRCA we don't see that in the plot.

Wiuf and Hein considered, among other things, the number of MRCAs, the expected length between recombination breakpoints (whether they break up segments on the MRCA or not; it is still possible for two segments to re-coalesce and end up on the same contiguous MRCA segment), and the expected lengths of the contiguous segments on the MRCA.

The expected number of segments

The expected number of segments carrying ancestral, \(S_k\) material was already derived by Griffiths and Marjoram in 1997, and is given by:

\[E[S_k(\rho)] = 1 +\rho\left( 1 - \frac{2}{k(k+1)}\right)\]

where \(k\) is the number of initial lineages (two in the simulation above), and \(\rho=4N_e(L-1)r\) is the scaled recombination rate.

This clearly grows both with the number of lineages \(k\) and the recombination rate \(\rho\), but towards an equilibrium for \(k\) and unbounded for \(\rho\).

The expected number of MRCAs

Wiuf and Hein didn't derive an expression for the expected number of MRCAs, but used simulations to compare the expected number of segments, \(E[S_k(\rho)]\) to the expected number of MRCAs.  See the figure on the right.

Since segments can share MRCA, it is not surprising that we tend to see fewer MRCAs than segments, but notice that the ratio does not tend towards zero but a constant that is higher for small \(k\) and lower for larger \(k\), with the difference decreasing as \(k\) increases.  So the number of initial lineages does not have a strong effect on the number of MRCAs, but the recombination rate clearly does.

When the scaled recombination rate increases (either because the actual recombination rate increases or because we consider longer sequences), the number of MRCAs increases proportionally with the number of segments (and with \(\rho\)).

This might trouble you a bit, since for sufficiently long sequences, you can end up with far more MRCAs than you have \(N_e\).  Not that it is a problem that you have more individuals than \(N_e\) -- after all, the effective population size is just a parameter of our model and is usually quite a bit smaller than the actual population size -- but an important assumption in the coalescence process is that you are modelling the ancestry of a sample much smaller than \(N_e\).  With recombination, even for a small initial \(k\) you can end up with too many lineages for this assumption to be true.

Not to worry.  As we showed in Davies et al. (2007) this doesn't actually affect local properties like segment lengths (but does, of course, have a large effect on number of MRCAs).

The expected length between recombination points and the expected segment length

Wiuf and Hein derived expressions for the expected length between recombination points:

\[E[r_k\,|\,B_k(0)=1] = \frac{1}{2\sum_{i=1}^{k-1}1/i}\]

and the expected segment lengths at MRCAs:

\[E[s_k\,|\,A_k(0)=1] = \frac{1}{2-4/k(k+1)}\]

sorry for the reuse of \(r\) here, but this is the notation from Wiuf and Hein (1999).

Don't worry too much about the \(B_k(0)=1\) and \(A_k(0)=1\), that is just their way of saying that they are measuring the length to the next recombination point / break point from the a given recombination point / break point.

Notice that both lengths are independent of the recombination rate.  This can't be true, of course, and it isn't.  The reason that the expressions are independent of the recombination rate is just that the lengths are measured in units of \(\rho/2\), so really the expressions should be

\[E[r_k\,|\,B_k(0)=1] = \frac{1}{2\sum_{i=1}^{k-1}1/i}/\frac{\rho}{2}\]

and

\[E[s_k\,|\,A_k(0)=1] = \frac{1}{2-4/k(k+1)}/\frac{\rho}{2}\]

Below I compare the expected segment lengths, \(E[s_k\,|\,A_k(0)=1]\) (in red) with the mean segment lengths from simulations (in blue).

I am not sure exactly why the fit is so bad for the small recombination rates, but this could be an artefact from the simulation setup since 1) the simulations work with discrete nucleotides while the mathematics works with continuous segments and 2) there is an upper bound to the possible segment lengths in the simulations which means that really long segments are not possible.

The relative error (average-expected)/expected approaches zero as the recombination rate increases.

The ancestry of two species

If we consider two sequences from different species, the situation changes a bit from the above.

If we start out with two sequences in different species, the segments cannot coalesce into MRCAs initially, but still fragment due to the process running within the species branches, as I explained in my previous post.

This means that the two sequences enters the process described above, not as contiguous segments but as several fragments.  The number of fragments and the length of fragments depend on how far back in time the speciation event occurred.

Below I plot single simulations of the process where the speciation event is 0, 1, 5, and 10 times \(2N_e\) generations back in time.

In general, we see that the number of fragments increases and the mean fragment length decreases as we push the speciation event back in time, with the effect reaching an equilibrium as the process in the single species approaches equilibrium.

The expression from Wiuf and Hein (1999) will thus tend to overestimate the mean segment length in this setting.

The plots below compare the predicted mean length, using the formula above, and the average segment length for different speciation times:

(the title is wrong on those, the actual speciation times are 0 generations ago, 2Ne generations ago, 6Ne generations ago, and 10Ne generations ago -- but I don't have time to fix that right now as I am late for teaching...)

The comparison is summarised below:

I do not have any expressions for the expected segment lenths in this setting (but I hope that Wiuf can help me work those out).

In my next post in the series, however, I will use a two nucleotide CTMC to approximate some properties about the system, similarly to what I did with the single species case.

References

  1. Carsten Wiuf, Jotun Hein (1999). The ancestry of a sample of sequences subject to recombination (PDF) Genetics, 151, 1217-1228
  2. J. L. Davies, F. Simancik, R. Lyngso, T. Mailund, J. Hein (2007). On Recombination-Induced Multiple and Simultaneous Coalescent Events Genetics, 177 (4), 2151-2160 DOI: 10.1534/genetics.107.071126

--

58-77=-19