GOTO: HELL

So, about a year ago, maybe two, a professor from History of Ideas wanted to create an interdisciplinary centre for human evolution at AU. It wasn’t funded, but during the grant-writing phase I wrote a bit for it and thought it would be a great idea.

It wasn’t funded, as I said, but when looking at the plans for the project I realised that funding wasn’t necessary for most of what was intended for the centre.

Essentially, the idea was to get people at my university together to talk about human evolution. Anyone who is interested in human evolution and human prehistory. I am 100% behind that.

At BiRC we like to think we are on top of human evolution, but there are so many aspects of it that we don’t think about simply because we don’t know about them. It is the unknown unknowns.

It would be really great to get people from different groups together to chat. If nothing else, just to chat.

There’s a guy at the university hospital who’s written books (that I have really enjoyed reading) about human evolution. Why don’t I chat with him more often? Because I don’t have the opportunity!

So when the funding didn’t come through, I emailed Peter Kjaergaard (the guy who applied) and asked why we didn’t just do the project anyway, without funding.

It turns out there was no reason, so we started the project without funding. We just email people we think are interested and meet for lunch and such before or after seminars we think will be interesting.

Peter then had the great idea of holding our own lecture series: informal meetings where we have lunch and hear about what one of our own is doing. We called it Human Evolution Lunch Lectures (although he apparently thought the acronym was too HELL’ish, so on the homepage they are now called Human Evolution Lunch Seminars). That has to change, because, as everyone knows, you go to hell if you believe in evolution.

We’ve only had one of those, but it was great: a talk about the evolution of our fascination with horror movies. It turns out that the humanities have something to add to our entertainment, and I for one am happier to fund them now.

Besides the HELL part, we have just gotten together once a month to have lunch and chat. I didn’t know, I really didn’t know, how much relevant work is done at my own university! Departments I had never heard of are doing work closely related to mine, just in very different fields of research.

I don’t know how to describe what we are doing, but I am very happy that we are doing it!

Efficiently handling largish genetic data sets

Okay, so I’m working on this dataset containing called variants from about 20 sequenced animals from two species, plus maybe a pair of hybrids. It’s not an enormous data set, but there are about 12 million markers, and when I load it into R without being clever about it, it doesn’t fit into the RAM on my desktop machine.

That tells me two things: 1) I need to get a new desktop computer (hell, the laptop I’m writing this blog post on has more memory than my desktop machine) and 2) I need to be more clever about how I manage data (because the next data set in my pipeline has hundreds of individuals from several species and a lot more variable sites; upgrading the computer will not save me there).

There is nothing I’m doing right now that I can’t do simply by reading through a VCF file line by line, and that is how I’ve solved my problems so far. It is just pretty time-consuming, and I was hoping to come up with something easier and faster.
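Just to make concrete what kind of scan I mean, here is a minimal Python sketch (not my actual analysis code), assuming an uncompressed VCF and a made-up file name:

# Minimal sketch of a line-by-line VCF scan. Field layout follows the VCF
# spec: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one
# column per sample. The file name is made up.
from collections import Counter

def scan_vcf(path):
    samples = []
    with open(path) as vcf:
        for line in vcf:
            if line.startswith("##"):        # meta-information lines
                continue
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):    # header line holds the sample names
                samples = fields[9:]
                continue
            chrom, pos, _id, ref, alt, qual = fields[:6]
            genotypes = dict(zip(samples, fields[9:]))
            yield chrom, int(pos), ref, alt, genotypes

# Example: count biallelic SNPs per scaffold.
snps_per_scaffold = Counter(
    chrom for chrom, pos, ref, alt, gts in scan_vcf("variants.vcf")
    if len(ref) == 1 and len(alt) == 1
)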

Speed is important, but only up to a point. It would be nice to be able to work with the data interactively, but as long as I can run my scans in a few hours on our computer cluster, it is fast enough. Ease of use matters more, because while I can deal with waiting a day for a computation to finish, I find it much harder to set aside a day for programming the scripts for that computation.

With R data.frames I find it pretty easy to pull out the information I need for a quick computation, and with the foreach and plyr packages it is even easier. It isn’t that hard to do in Python either, although there is usually a bit more IO code for parsing input files and writing the output to files so I can load it into R afterwards.
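To show the kind of IO boilerplate I mean, here is a made-up little example: read a tab-separated table of derived counts, aggregate it, and write a file I can pull into R with read.table afterwards (the file and column names are invented for the example).

# Made-up example of the typical Python IO boilerplate: parse a
# tab-separated input file, aggregate, and write a table for R.
import csv
from collections import defaultdict

totals = defaultdict(int)
with open("derived_counts.txt") as infile:
    for row in csv.DictReader(infile, delimiter="\t"):
        totals[(row["individual"], row["scaffold"])] += int(row["derived_count"])

with open("summary_for_R.txt", "w") as outfile:
    outfile.write("individual\tscaffold\ttotal_derived\n")
    for (individual, scaffold), total in sorted(totals.items()):
        outfile.write(f"{individual}\t{scaffold}\t{total}\n")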

I know there are packages for handling large data sets in R, but since I would need a new approach for what I’m doing anyway, I thought I might try out a few things and maybe end up with a solution that is easy to work with from both Python and R (since I often use both to manipulate the data I’m analysing).

Using an SQL database for genetic data

The data I have might be large when I have to keep it in RAM, but it isn’t large compared to most databases, so I figured I could try an SQL solution. SQLite would be my preferred choice since it doesn’t require a server (and I don’t have access to my server from the computing grid), but I have played around with both SQLite and MySQL this weekend.

I’m pretty sure I’m doing something wrong, though, ’cause my code is running very slowly. Slower than if I just scanned through flat text files, at least.

I’ve set up the database like this (here in MySQL syntax):

-- Drop the tables, we assume we are starting from scratch
DROP TABLE IF EXISTS DogDerivedCounts;
DROP TABLE IF EXISTS Genotypes;
DROP TABLE IF EXISTS Dog;
DROP TABLE IF EXISTS Markers;

-- Create the three main tables, reading the data in from files
CREATE TABLE Markers (
  marker_id INTEGER NOT NULL PRIMARY KEY,
  scaffold TEXT NOT NULL,
  position INTEGER NOT NULL,
  ref_allele TEXT NOT NULL,
  alt_allele TEXT NOT NULL,
  qual_score FLOAT NOT NULL)
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Markers.txt' INTO TABLE Markers;

CREATE TABLE Dog (
  marker_id INTEGER NOT NULL PRIMARY KEY,
  chromosome TEXT NOT NULL,
  position INTEGER NOT NULL,
  ref_allele TEXT NOT NULL,
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Dog.txt' INTO TABLE Dog;

CREATE TABLE Genotypes (
  marker_id INTEGER NOT NULL,
  individual TEXT NOT NULL,
  ref_read_count INT,
  alt_read_count INT,
  called_ref_count INT,
  call_quality NUMERIC,
  PRIMARY KEY(marker_id, individual(5)),
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Genotypes.txt' INTO TABLE Genotypes;

-- Now create the orientation using dog as the ancestral allele...

-- Table of derived allele counts when oriented with respect to dog
CREATE TABLE DogDerivedCounts(
  marker_id INTEGER NOT NULL,
  individual TEXT NOT NULL,
  derived_count INTEGER,
  PRIMARY KEY(marker_id, individual(5)),
  FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;

-- Insert those where the reference allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, Genotypes.called_ref_count
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.ref_allele = Dog.ref_allele);

-- Insert those where the alternative allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, (2 - Genotypes.called_ref_count)
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.alt_allele = Dog.ref_allele);

The idea is that I have a table with marker information, another table where I store outgroup information (here the dog, but I have other outgroups I treat the same way), and then a table that provides the genotype for each individual. The latter might be the problem for my code, because it is of course harder to pick out one individual’s genotypes from one long list of everyone’s genotypes than to scan through a list containing only that individual, so maybe I’m better off having one genotype table per individual. I don’t know…

Anyway, the hope was that I could write SQL queries that would be easy to express and could be reused in both Python and R.
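As a sketch of what that could look like on the Python side, using the built-in sqlite3 module (the database file name is made up, and the same query string could be sent from R through RSQLite):

# Sketch of reusing a plain SQL query from Python via the built-in sqlite3
# module; the database file name is made up.
import sqlite3

QUERY = """
SELECT individual, COUNT(*) AS n_sites, SUM(derived_count) AS n_derived
FROM DogDerivedCounts
GROUP BY individual;
"""

with sqlite3.connect("genotypes.db") as conn:
    for individual, n_sites, n_derived in conn.execute(QUERY):
        print(individual, n_sites, n_derived)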

SELECT expressions are a lot like list comprehensions in Python or R’s foreach, and they are pretty easy to write. Joins give me an easy way of zipping lists together to combine information. It costs me a lot in running time, though, because I have no experience in optimising SQL.
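One general thing I probably should try, though I haven’t benchmarked it yet, is adding indexes that match the lookups I actually do: the primary keys above start with marker_id, but my filters pick out rows by individual first, so those lookups probably cannot use the primary keys at all. A sketch of what I mean, here run against an SQLite version of the database from Python (the file name is made up):

# Hedged sketch: add indexes that lead with the individual column, since
# that is what my WHERE clauses filter on. The database file name is made
# up; in MySQL the CREATE INDEX statements would need prefix lengths on
# the TEXT column, as in the table definitions above.
import sqlite3

with sqlite3.connect("genotypes.db") as conn:
    conn.executescript("""
    CREATE INDEX IF NOT EXISTS idx_genotypes_by_individual
        ON Genotypes(individual, marker_id);
    CREATE INDEX IF NOT EXISTS idx_derived_counts_by_individual
        ON DogDerivedCounts(individual, marker_id);
    """)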

To compute ABBA/BABA patterns for D statistics and similar, I wrote the code below, which first filters genotypes based on read depth and call quality and then counts the site patterns.

DROP TABLE IF EXISTS tmp1;
DROP TABLE IF EXISTS tmp2;
DROP TABLE IF EXISTS tmp3;

-- Filter the derived counts based on read depth and quality threshold
CREATE TEMPORARY TABLE tmp1 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "PB7"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

CREATE TEMPORARY TABLE tmp2 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "ABC1"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

CREATE TEMPORARY TABLE tmp3 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "GRZ"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);

-- Now collect the ABBA/BABA sites by counting the site patterns
SELECT DISTINCT Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count,
(2 - tmp1.derived_count) * tmp2.derived_count * tmp3.derived_count AS ABBA,
tmp1.derived_count * (2 - tmp2.derived_count) * tmp3.derived_count AS BABA,
COUNT(*)
FROM tmp1
INNER JOIN tmp2 ON (tmp1.marker_id = tmp2.marker_id)
INNER JOIN tmp3 ON (tmp1.marker_id = tmp3.marker_id)
INNER JOIN Dog  ON (tmp1.marker_id = Dog.marker_id)
GROUP BY Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count;

Having to build three temporary tables with the filtered information might actually be a stupid idea and the reason it is so slow, but I don’t know…
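One alternative I haven’t benchmarked would be to drop the temporary tables and let the database do the per-individual filtering inside a single query, joining the derived counts once per individual. As far as I can tell it should give the same site-pattern counts as the version above; whether it is actually faster I honestly don’t know. Here it is as a sketch, run from Python against an SQLite version of the database (the file name is made up):

# Hedged sketch: the same ABBA/BABA site-pattern count as above, but as a
# single query with the per-individual filtering inlined instead of going
# through temporary tables. The individuals and thresholds are the ones
# from the temp-table version; the database file name is made up.
import sqlite3

ABBA_BABA_QUERY = """
SELECT Dog.chromosome,
       d1.derived_count, d2.derived_count, d3.derived_count,
       (2 - d1.derived_count) * d2.derived_count * d3.derived_count AS ABBA,
       d1.derived_count * (2 - d2.derived_count) * d3.derived_count AS BABA,
       COUNT(*)
FROM DogDerivedCounts AS d1
JOIN DogDerivedCounts AS d2 ON d1.marker_id = d2.marker_id
JOIN DogDerivedCounts AS d3 ON d1.marker_id = d3.marker_id
JOIN Genotypes AS g1 ON d1.marker_id = g1.marker_id AND d1.individual = g1.individual
JOIN Genotypes AS g2 ON d2.marker_id = g2.marker_id AND d2.individual = g2.individual
JOIN Genotypes AS g3 ON d3.marker_id = g3.marker_id AND d3.individual = g3.individual
JOIN Dog ON d1.marker_id = Dog.marker_id
WHERE d1.individual = 'PB7' AND d2.individual = 'ABC1' AND d3.individual = 'GRZ'
  AND g1.ref_read_count + g1.alt_read_count >= 10 AND g1.call_quality >= 30
  AND g2.ref_read_count + g2.alt_read_count >= 10 AND g2.call_quality >= 30
  AND g3.ref_read_count + g3.alt_read_count >= 10 AND g3.call_quality >= 30
GROUP BY Dog.chromosome, d1.derived_count, d2.derived_count, d3.derived_count;
"""

with sqlite3.connect("genotypes.db") as conn:
    for row in conn.execute(ABBA_BABA_QUERY):
        print(*row)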

Alternatives

I feel that I should be able to build an efficient solution with SQL if only I knew more about database design, but perhaps I am misguided. My data has the added structure that I always have information about every marker (OK, it might be missing, but then I can store it as explicitly missing), so if I just keep the data in lists in the same marker order, I should be able to pull out summaries a lot faster than by joining tables together.

I have some ideas for how to code that up as a little library based on lists of the same length and queries built from list comprehensions, but there must already be solutions like that out there that I just don’t know about.
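For what it is worth, here is a rough Python sketch of the kind of thing I have in mind, with made-up class and column names: every column is a list with one entry per marker, stored in marker order, so combining columns is just a zip and a query is essentially a list comprehension.

# Rough sketch of the list-based idea: every column is a list with one
# entry per marker, in marker order, and missing data is stored explicitly
# as None. Class, column names, and the example data are made up.
class MarkerTable:
    def __init__(self, **columns):
        lengths = {len(column) for column in columns.values()}
        assert len(lengths) == 1, "all columns must have the same length"
        self.columns = columns

    def select(self, *names, where=lambda row: True):
        """Yield tuples of the named columns for rows passing the predicate."""
        keys = list(self.columns)
        for values in zip(*self.columns.values()):
            row = dict(zip(keys, values))
            if where(row):
                yield tuple(row[name] for name in names)

# Because the lists are all in the same marker order, combining columns is
# just a zip; no join is needed. A "query" is then a list comprehension:
table = MarkerTable(
    marker_id=[1, 2, 3, 4],
    derived_count=[0, 2, 1, None],
    read_depth=[12, 8, 30, 15],
)
filtered = [dc for (dc,) in table.select(
    "derived_count",
    where=lambda row: row["read_depth"] >= 10 and row["derived_count"] is not None,
)]
print(filtered)  # [0, 1]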