Archive for the ‘Work’ Category

GOTO: HELL

Saturday, April 13th, 2013

So, about a year, maybe two, a professor from History of Ideas wanted to create an interdiciplinary centre for human evolution at AU. It wasn't funded but during the grant writing phase I wrote a bit for it and thought it would be a great idea.

It wasn't funded, as I said, but when looking at the plans for the project I realised that funding wasn't necessary for most of what was intended for the centre.

Essentially the idea was to get people at my university together to talk human evolution. Anyone who is interested in human evolution and human prehistory. I am 100% behind that.

At BiRC we think we are on the beat about human evolution but there are so many aspects of it that we don't think about just because we don't know about it. It is the unknown unknowns.

It would be really great to get together people from different groups to chat. If nothing else just to chat.

There's a guy at the university hospital who's written books (that I have really enjoyed reading) about human evolution. Why don't I chat with him often? Because I don't have the opportunity!

So when funding wasn't given I emailed Peter Kjaergaard (the guy who applied) and asked why we didn't just do the project anyway without funding.

Turns out there is no reason, so we started the project without funding. We just email people we think are interested and meet for lunches or such before or after seminars we think will be interesting.

Peter then had the great idea of holding our own lecture series. Informal meetings where we have lunch and talk about what one of our own is doing. We called it Human Evolution Lunch Lectures (although he apparently thought the acronym was too HELL'ish so on the homepage they are now called Human Evolution Lunch Seminars). That has to change because as everyone knows you go to hell if you believe in evolution.

We've only had one of those, but it was great. A talk about the evolution of our fascination of horror movies. It turns out that the humanities have something to add to our entertainment and I for one am happier to fund it now.

Besides the HELL part we have just gotten together once a month to have lunch and chat. I didn't know, I really didn't know, how much relevant work is done at my own university! Departments I have never heard about are doing work very related to what I am doing, just in very different fields of research.

I don't know how to describe what we are doing, but I am very happy that we are doing it!

Efficiently handling largish genetic data sets

Sunday, April 7th, 2013

Okay, so I'm working on this dataset containing called variants from about 20 sequenced animals from two species and maybe a pair of hybrids. It's not an enormous data set, but there is about 12 million markers and when I load it into R without being clever about it, it doesn't fit into RAM on my desktop machine.

That tells me two things: 1) I need to get a new desktop computer (hell, the laptop I'm writing this blogpost on has more memory than my desktop computer) and 2) I need to be more clever in how I manage data (because the next data set I have in my pipeline has hundreds of individuals from several species and a lot more variable sites; upgrading the computer will not save me there).

There is nothing of what I'm doing right now that I can't simply do by reading line by line through a VCF file, and that is how I've solved my problems so far. It is just pretty time consuming and I was hoping I could come up with something that would be easier and faster.

Speed is important up to a point. It would be nice to be able to work with the data interactively, but as long as I can handle my scans in a few hours on our computer cluster it is fast enough. Ease of use is more important, because while I can deal with waiting a day for a computation to finish I find it much harder to set aside a day for programming the scripts for the computation.

With R data.frames I find it pretty easy to pull out most information for an easy computation. With the foreach and plyr packages it is even easier. It isn't that hard doing it in Python either, although there is usually a bit more IO code for parsing up files and writing the output to files so I can load it into R afterward.

I know there are packages for handling large data sets in R, but since I would need to come up with a new solution for what I'm doing anyway I thought I might try out a few things and maybe come up with a solution that is easy to work with from both Python and R (since I often use both to manipulate the data I'm analysing).

Using an SQL database for genetic data

The data I have might be large when I have to keep it in RAM but it isn't large compared to most databases, so I figured I could try an SQL solution. SQLite would be my preferred solution since it doesn't require a server (and I don't have access to my server from my computing grid) but I have played around with both SQLite and MySQL this weekend.

I'm pretty sure I'm doing something wrong, though, 'cause my code is running very slow. Slower than if I just scanned through flat text files at least.

I've set up the database like this (here in MySQL syntax):

-- Drop the tables, we assume we are starting from scratch
DROP TABLE IF EXISTS DogDerivedCounts;
DROP TABLE IF EXISTS Genotypes;
DROP TABLE IF EXISTS Dog;
DROP TABLE IF EXISTS Markers;

-- Create the three main tables, reading the data in from files
CREATE TABLE Markers (
  marker_id INTEGER NOT NULL PRIMARY KEY,
  scaffold TEXT NOT NULL,
  POSITION INTEGER NOT NULL,
  ref_allele TEXT NOT NULL,
  alt_allele TEXT NOT NULL,
  qual_score FLOAT NOT NULL)
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Markers.txt' INTO TABLE Markers;

CREATE TABLE Dog (
  marker_id INTEGER NOT NULL PRIMARY KEY,
  chromosome TEXT NOT NULL,
  POSITION INTEGER NOT NULL,
  ref_allele TEXT NOT NULL,
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Dog.txt' INTO TABLE Dog;

CREATE TABLE Genotypes (
  marker_id INTEGER NOT NULL,
  individual TEXT NOT NULL,
  ref_read_count INT,
  alt_read_count INT,
  called_ref_count INT,
  call_quality NUMERIC,
  PRIMARY KEY(marker_id, individual(5)),
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Genotypes.txt' INTO TABLE Genotypes;

-- Now create the orientation using dog as the ancestral allele...

-- Table of derived allele counts when oriented with respect to dog
CREATE TABLE DogDerivedCounts(
  marker_id INTEGER NOT NULL,
  individual TEXT NOT NULL,
  derived_count INTEGER,
  PRIMARY KEY(marker_id, individual(5)),
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;

-- Insert those where the reference allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, Genotypes.called_ref_count
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.ref_allele = Dog.ref_allele);

-- Insert those where the alternative allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, (2 - Genotypes.called_ref_count)
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.alt_allele = Dog.ref_allele);

The idea is that I have a table with marker information, another table where I store out group information (here a Dog, but I have other out groups I treat the same way), and then a table that provides the genotype for each individual. The latter might be the problem for my code, because of course it is harder to pick out an individual from a long list of all genotypes rather than scan through all the genotypes for that one individual, so maybe I'm better off having a genotypes table per individual, I don't know...

Anyway, now the hope was that I could write SQL expressions that would be easy to write and could be reused in both Python and R.

SELECT expressions are a lot like list comprehension in Python and R's foreach and they are pretty easy to write. Joins give me an easy way of zipping lists together to combine information. It costs me a lot in running time, though, because I have no experience in optimising SQL.

To compute ABBA/BABA patterns for D statistics and similar I wrote the code below that first filters genotypes based on read depth and call quality and then count the site patterns.

DROP TABLE IF EXISTS tmp1;
DROP TABLE IF EXISTS tmp2;
DROP TABLE IF EXISTS tmp3;

-- Filter the derived counts based on read depth and quality threshold
CREATE TEMPORARY TABLE tmp1 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "PB7"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

CREATE TEMPORARY TABLE tmp2 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "ABC1"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

CREATE TEMPORARY TABLE tmp3 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "GRZ"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

-- Now collect the ABBA/BABA sites by counting the site patterns
SELECT DISTINCT Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count,
(2 - tmp1.derived_count) * tmp2.derived_count * tmp3.derived_count AS ABBA,
tmp1.derived_count * (2 - tmp2.derived_count) * tmp3.derived_count AS BABA,
COUNT(*)
FROM tmp1
INNER JOIN tmp2 ON (tmp1.marker_id = tmp2.marker_id)
INNER JOIN tmp3 ON (tmp1.marker_id = tmp3.marker_id)
INNER JOIN Dog  ON (tmp1.marker_id = Dog.marker_id)
GROUP BY Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count;

Having to build three temporary tables with the filtered information might actually be a stupid idea and the reason it is so slow, but I don't know...

Alternatives

I feel that I should be able to make an efficient solution with SQL, if only I knew more about database design, but perhaps I am misguided. This data has the added structure that I know that I always have information about every marker (ok, it might be missing, but then I can store it as such) and if I just keep it in lists in the right order I should be able to pull out summaries a lot faster than having to join the data together.

I have some ideas for how to code that up as a little library based on lists of the same length and queries based on list comprehension, but there must be solutions like that out there that I just don't know about.

A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species

Friday, December 21st, 2012

We have just published a new paper in PLoS Genetics:


A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species

We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.

It is mainly a methods paper; we could have done a lot more interesting stuff with the data analysis but since the main message of the paper, in our view, is the new inference model we really only did proof-of-concept analysis. So don't expect exciting new biological insight (well, there is one thing that's a bit exciting, but I'll get back to that).

It's the modelling that is the main result.

A bit of background for CoalHMMs

We've been working with inferring parameters of speciation - split times, effective population sizes, amount of incomplete lineage sorting - using the sequential Markov coalescence model for several years now. We just haven't called our models SMC models, mainly because we didn't think in terms of actually modelling the coalescence process in our first paper in 2007. There we just had a hidden Markov model and figured out how to get coalescence parameters out of its transition matrix.

We did parameterise the model directly from the coalescence process in 2009 - in the process figuring out that some of the symmetries we thought we needed the first model shouldn't be there, while others should. I still prefer to distinguish between the population genetics model, SMC, and the inference model model, CoalHMM (or SMCHMM maybe, but that is harder to pronounce).

SMC (and SMC' for that matter) assumes that there is at most one recombination between any two neighbouring genealogies.  This is mathematically correct in those models since they work with continuous space. No two recombinations can ever hit the same point there.

It is also perfectly sensible to assume that only one recombination occurs between any pair of nucleotides when having discrete space, in most cases, since the recombination rate is very small compared to the coalescence rate.

We were comparing genomes from different species, though, where the divergence between two genomes can be several millions of years, and there it turns out that you introduce a large bias in estimating the recombination rate if you don't take this into account. Although the largest problem with the model from 2009 was not actually multiple recombinations but not allowing "back-coalescence" from a lineage into itself (which is the difference between the SMC and SMC' models).

Not realising that SMC' would probably solved most of the problems we had, and probably also because I didn't want to have to derive the equations needed to use it, we wanted a model that didn't make any assumptions about the number of recombinations and where lineages would coalesce. We of course need the Markov assumption to make any inference on a genome scale feasible, but aside from that we didn't want any other assumptions.

I was playing around with this for several months until, in a discussion with Carsten Wiuf, I had a Eureka moment. Once you accept that you have the Markov assumption, you need only worry about pairs of neighbouring nucleotides, and although the state space of all possible ancestral configurations is large it is not impossible to compute and manipulate.

You can build the state space for ancestral configurations, it is finite if you only have two nucleotides (it is as long as you have discrete space, really), and this is just one big continuous time Markov chain.  You can easily compute the probability of any run of this CTMC, and if you discretise time you get a finite number of runs (or equivalence classes of runs, really, but who cares?).

As a side note I should say that if I had know the literature better I should have known that this is the way to do it. Simonsen and Churchill did it already back in 1997 and Slatkin and Pollack something very similar to what we implemented in 2006. I just didn't know about this until long after we published our method. Anyway...

Getting a transition matrix for the HMM from this CTMC is relatively straightforward when you only look at two genomes, and we published a model doing this last year. It is very similar to PSMC in that it uses discrete time intervals as hidden states and makes inference in the coalescence process by building the transition matrix from a model of the coalescence process.

Explicitly modelling the ARG for two nucleotides, however, makes it very easy to model any kind of demographic scenarios. If you can model it with the coalescence process, you can also model it with the coalescence process limited to two nucleotides, and then use this framework to translate that into an HMM for inference.

In theory, at least. There are some practical challenges in automatically building the CTMC from a demographic scenarios - and you definitely do not want to do that by hand! - and in picking out which runs of the CTMC are equivalent and how they are mapped into the transition matrix of the HMM.

None of it is really that complicated in hindsight. There were just a bunch if ideas that were only simple once we worked them out. Like using a coloured Petri net to specify the models or using strongly connected component graphs to get the transition matrix ... none of it is stuff you really brute force your way to the solution, but it is simple enough once you have the idea.  We published how we did it in a paper earlier this year.

We've had the framework coded up for almost a year now, but it has taken surprisingly long to bugfix and validate. I'm always surprised at how long it takes from having something that works in theory until you've validated how it works in various scenarios where your assumptions are likely to be violated and we did a lot of that for this paper. You can read all about it in the supplemental material.

An isolation-with-migration CoalHMM

The model we present in this new paper is probably the simplest extension you can make in the framework to the model we published last year.

The model from last year (well, it is a bit older than that really, but it was a companion for the orangutan genome paper and so wasn't published until that paper came out last year) is a simple isolation model.  Given one genome from each of two species, it uses the distribution of "time to the most recent common ancestor" (TMRCA) to infer the ancestral effective population size and the time since the speciation.

In the new model, we add to that model a period of gene flow after the initial split in two populations until it eventually stops. This means that we work with three CTMCs in the model. One from the present day back to when gene flow is possible (looking backwards in time, as we do for coalescence models, confusing as that can be), then another where gene flow is possible, and then the final one where we have one single panmictic population.

Connecting these three CTMCs the right way, we get a single (non-homogeneous) CTMC, and from the SCC graph of this we get the transition matrix for the HMM.

Where the model really gets its information for inference is the distribution of TMRCA both in time and in space.  The distribution for the TMRCA is very different in an isolation model and an isolation-with-migrath model. For the time distribution Rosenberg and Feldman wrote a very nice review in 2002, and we looked a bit at it in a paper last year. Dutheil and Hobolth (co-authors on most of the CoalHMM papers, including the new one) also discusses it a bit in a chapter from this year. But it is not just the time distribution that is different between the two models, also how the TRMCA is distributed along a sequence alignment, and our model uses both the temporal and spatial distribution in its inference.

Using the model, we can infer when the ancestral population initially split up, how long gene flow still occurred, and how much gene flow there was. There are some symmetries in the model that makes the direction of gene flow non-identifiable, so we only estimate symmetric gene flow, though.

Comparing models with and without gene flow, we can also determine which fits the data best, and in that way determine if there was a clean allopatric split, or if there was a period of gene flow.

We cannot, though, determine if the gene flow was continuous over that time interval, or if it was a series of isolations and admixture events ... whether we can get to that with the framework we have is something we are currently working on. It looks a bit like we can see different levels of gene flow in different time intervals with it, but results are very preliminary still.

Speciation in the Great Apes

As you probably know, there are two species in three of the four great ape species; humans are the only lonely ones.  For Pan there are common chimpanzees and bonobos, for the gorillas the eastern and western gorillas are different species, and for orang-utans the Bornean and the Summatran orang-utans are different species.

From earlier studies we already have a good idea about whether those speciations were allopatric or had gene flow. There are a very few results seeing very limited gene flow between bonobos and chimpanzees, but most results show that there was a clean split, which is also the result we got in the bonobo genome paper. For the orang-utans we showed in the orang-utan genome paper we saw gene flow between the two orang-utans, and for the gorillas we saw the same in the gorilla genome paper.

So we sort of knew what to expect and used these three speciations as a proof-of-concept analysis, and got the results we expected there: for bonobos and chimpanzees, the clean split model is preferred, and for the other two the migration model is preferred.

Our estimates of the initial population split and end of gene flow are also in good concordance with estimates from the literature, where we have them.

One thing that might be a bit odd is the timing of the bonobo-chimpanzee split. Even using a lower mutation rate than we normally would (for very good reasons), the split is recent compared to the formation of the Congo River (from what I can find out about that, it seems very hard to figure out). The formation of the Congo River has been the argument for why that particular split was instant rather than having a long period of gene flow, but that doesn't quite fit what we see - or what earlier estimates have seen either, for that matter.

The split between humans and chimpanzees

Not surprisingly we were asked in reviews to also look at the human and chimpanzee split. As a race, we are quite self absorbed so that split is of course more interesting than what was going on in the other great apes...

Looking at this split, our gene flow model is preferred over the clean split, and we actually see quite a long period with limited gene flow.

Now whether that surprises you or not, I don't know, but it doesn't seem that far fetched to me, considering that we have seen gene flow in most of the other great ape splits and we now know that there were a at least two or three admixture events between modern humans and archaic humans.

Statistically, the simpler model should be the null model, but my prior is now preferring gene flow over a clean split.

Admixture between protohumans and protochimpanzees was proposed by Patterson et al. in their "complex speciation" scenario in 2006. Most papers that have looked at this scenario since have preferred a clean split, and the evidence for the complex speciation scenario can be explained in other ways.  The large variation in TMRCA can be explained just by a large effective population size and the difference between X and autosomes by sperm competition or just selection being different on X as we think it is in central chimpanzees.

We cannot really see if the gene flow we estimate between chimpanzees and humans is any kind of complex scenario with long isolation followed by admixture - we need to model and test that explicitly - but we do see gene flow, and the time between the initial split and the end of gene flow aren't far from the initial split and then the admixture proposed by Patterson et al.


Blind reviews? For or against?

Friday, November 2nd, 2012

I've been away for a long time. RSI really keeps me away from blogging and I'm not planning to get back to blogging full time, but I really miss it so I hope to get back to writing a bit here and there...

Anyway, there is one thing that's been on my mind for a while, and I'd like input on it.

Should reviews be open or blind?

I've discussed this with colleagues for a while and I see all the points for or against, but really haven't made up my mind. Ideally I think that reviews are such an important part of the publication process that we cannot ignore it, but I see a lot of good arguments for blind reviews as well.

First some personal background: about five-ten years ago, I decided to accept all papers I was asked to review. I figured that it was 1) important to do my duty for the scientific community to review papers and 2) I would learn a lot from reviewing as many papers as I could get my hands on.

Okay, that was a very bad idea. The more papers I reviewed the more papers I was asked to review, and I ended up reviewing tens of papers a month. That means that my reviews ended up crappy since I simply did not have the time to give the papers any serious thought. I could spend two-three hours on a review max, and that is just not enough.

I felt really bad the first time I started refusing reviews, but I simply had to.

Recently I have limited the number of papers I can handle as an editor as well.  I just cannot handle too many papers and give them a fair judgment.

I feel bad about this, but is a question of survival. I cannot handle as many papers as I think I should, so I have to limit it.

This brings me back to the title of this post. Should reviews be blind or not?

When I was overwhelmed by papers I dropped signing my reviews. I didn't feel comfortable with the quality of my reviews and I am sorry to say that it meant that I could do the reviews but not be honest and stand behind them.

I can't really live with that. I don't want to give people criticism that I cannot be man enough to stand behind, so I have seriously limited the number of papers I accept to review and now I sign my reviews again.

This now leads me to a serious question: how does me signing the reviews affect my reviews?

I would like to say that my reviews have always been of high quality, but obviously they haven't been of equal quality since I now handle fewer manuscripts when I sign them. I obviously feel more reason to write better reviews when I sign them.

Having to sign my reviews clearly, for me at least, means that I am much more careful about the report I write.

Now I never write a report unless I've been thinking about the paper for a couple of days. And I think my reviews are much better for it.

That sounds like I am all in favour for open reviews (and generally I am) but I also see the problems with this.

At the pub in Cambridge my last visit there we discussed this. It is easy to sign a positive review but you can get in trouble when you sign negative reports, so why bother?

If open reviews means anything it means that you always sign reviews even if they are negative, and that can be a problem since you will make enemies when you are giving negative reviews.

I really don't know.

I know that my reviews are better now that I sign them, but I am also much more selective on the papers I want to review, so that puts a limit on what I will review. I still spend about a day a week on reviewing but I am much more selective on what papers I accept and I only accept papers I am likely to be positive about.

Even if we don't require reviews to be open, would we make better reviews if reviews were published together with papers, perhaps as supplemental material?  I don't know

I would love your thoughts.

Biology of Genomes

Sunday, May 15th, 2011

I just got back from the Biology of Genomes meeting at Cold Spring Harbor Labs in NY.  It was my second time there, but a conference I feel I should attend every year.  It is really great for my line of work, and it is great to talk to so many people working on genome analysis and to see so many friends from former and current projects.

It is a bit tough as well. The program goes from 9am to 11pm every day, which is tough when you are jetlagged.

If it wasn't for my RSI problems I would have blogged from it, but I am still keeping the blogging at an absolute minimum.

Anyway, if you are into sequencing, comparative genomics or sequencing for disease mapping, this is definitely a must conference to attend.