Efficiently handling largish genetic data sets
Sunday, April 7th, 2013Okay, so I'm working on this dataset containing called variants from about 20 sequenced animals from two species and maybe a pair of hybrids. It's not an enormous data set, but there is about 12 million markers and when I load it into R without being clever about it, it doesn't fit into RAM on my desktop machine.
That tells me two things: 1) I need to get a new desktop computer (hell, the laptop I'm writing this blogpost on has more memory than my desktop computer) and 2) I need to be more clever in how I manage data (because the next data set I have in my pipeline has hundreds of individuals from several species and a lot more variable sites; upgrading the computer will not save me there).
There is nothing of what I'm doing right now that I can't simply do by reading line by line through a VCF file, and that is how I've solved my problems so far. It is just pretty time consuming and I was hoping I could come up with something that would be easier and faster.
Speed is important up to a point. It would be nice to be able to work with the data interactively, but as long as I can handle my scans in a few hours on our computer cluster it is fast enough. Ease of use is more important, because while I can deal with waiting a day for a computation to finish I find it much harder to set aside a day for programming the scripts for the computation.
With R data.frames I find it pretty easy to pull out most information for an easy computation. With the foreach and plyr packages it is even easier. It isn't that hard doing it in Python either, although there is usually a bit more IO code for parsing up files and writing the output to files so I can load it into R afterward.
I know there are packages for handling large data sets in R, but since I would need to come up with a new solution for what I'm doing anyway I thought I might try out a few things and maybe come up with a solution that is easy to work with from both Python and R (since I often use both to manipulate the data I'm analysing).
Using an SQL database for genetic data
The data I have might be large when I have to keep it in RAM but it isn't large compared to most databases, so I figured I could try an SQL solution. SQLite would be my preferred solution since it doesn't require a server (and I don't have access to my server from my computing grid) but I have played around with both SQLite and MySQL this weekend.
I'm pretty sure I'm doing something wrong, though, 'cause my code is running very slow. Slower than if I just scanned through flat text files at least.
I've set up the database like this (here in MySQL syntax):
DROP TABLE IF EXISTS DogDerivedCounts;
DROP TABLE IF EXISTS Genotypes;
DROP TABLE IF EXISTS Dog;
DROP TABLE IF EXISTS Markers;
-- Create the three main tables, reading the data in from files
CREATE TABLE Markers (
marker_id INTEGER NOT NULL PRIMARY KEY,
scaffold TEXT NOT NULL,
POSITION INTEGER NOT NULL,
ref_allele TEXT NOT NULL,
alt_allele TEXT NOT NULL,
qual_score FLOAT NOT NULL)
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Markers.txt' INTO TABLE Markers;
CREATE TABLE Dog (
marker_id INTEGER NOT NULL PRIMARY KEY,
chromosome TEXT NOT NULL,
POSITION INTEGER NOT NULL,
ref_allele TEXT NOT NULL,
FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Dog.txt' INTO TABLE Dog;
CREATE TABLE Genotypes (
marker_id INTEGER NOT NULL,
individual TEXT NOT NULL,
ref_read_count INT,
alt_read_count INT,
called_ref_count INT,
call_quality NUMERIC,
PRIMARY KEY(marker_id, individual(5)),
FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
LOAD DATA LOCAL INFILE 'Genotypes.txt' INTO TABLE Genotypes;
-- Now create the orientation using dog as the ancestral allele...
-- Table of derived allele counts when oriented with respect to dog
CREATE TABLE DogDerivedCounts(
marker_id INTEGER NOT NULL,
individual TEXT NOT NULL,
derived_count INTEGER,
PRIMARY KEY(marker_id, individual(5)),
FOREIGN KEY(marker_id) REFERENCES Markers(marker_id))
ENGINE=InnoDB;
-- Insert those where the reference allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, Genotypes.called_ref_count
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.ref_allele = Dog.ref_allele);
-- Insert those where the alternative allele is the dog allele
INSERT INTO DogDerivedCounts
SELECT Dog.marker_id, Genotypes.individual, (2 - Genotypes.called_ref_count)
FROM Dog INNER JOIN Markers ON(Dog.marker_id = Markers.marker_id)
INNER JOIN Genotypes ON(Dog.marker_id = Genotypes.marker_id)
WHERE (Markers.alt_allele = Dog.ref_allele);
The idea is that I have a table with marker information, another table where I store out group information (here a Dog, but I have other out groups I treat the same way), and then a table that provides the genotype for each individual. The latter might be the problem for my code, because of course it is harder to pick out an individual from a long list of all genotypes rather than scan through all the genotypes for that one individual, so maybe I'm better off having a genotypes table per individual, I don't know...
Anyway, now the hope was that I could write SQL expressions that would be easy to write and could be reused in both Python and R.
SELECT expressions are a lot like list comprehension in Python and R's foreach and they are pretty easy to write. Joins give me an easy way of zipping lists together to combine information. It costs me a lot in running time, though, because I have no experience in optimising SQL.
To compute ABBA/BABA patterns for D statistics and similar I wrote the code below that first filters genotypes based on read depth and call quality and then count the site patterns.
DROP TABLE IF EXISTS tmp2;
DROP TABLE IF EXISTS tmp3;
-- Filter the derived counts based on read depth and quality threshold
CREATE TEMPORARY TABLE tmp1 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "PB7"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);
CREATE TEMPORARY TABLE tmp2 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "ABC1"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);
CREATE TEMPORARY TABLE tmp3 AS
SELECT ddc.* FROM DogDerivedCounts AS ddc
INNER JOIN Genotypes AS gt
ON (ddc.marker_id = gt.marker_id AND ddc.individual = gt.individual)
WHERE (ddc.individual = "GRZ"
AND (gt.ref_read_count + gt.alt_read_count) >= 10
AND gt.call_quality >= 30);
-- Now collect the ABBA/BABA sites by counting the site patterns
SELECT DISTINCT Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count,
(2 - tmp1.derived_count) * tmp2.derived_count * tmp3.derived_count AS ABBA,
tmp1.derived_count * (2 - tmp2.derived_count) * tmp3.derived_count AS BABA,
COUNT(*)
FROM tmp1
INNER JOIN tmp2 ON (tmp1.marker_id = tmp2.marker_id)
INNER JOIN tmp3 ON (tmp1.marker_id = tmp3.marker_id)
INNER JOIN Dog ON (tmp1.marker_id = Dog.marker_id)
GROUP BY Dog.chromosome, tmp1.derived_count, tmp2.derived_count, tmp3.derived_count;
Having to build three temporary tables with the filtered information might actually be a stupid idea and the reason it is so slow, but I don't know...
Alternatives
I feel that I should be able to make an efficient solution with SQL, if only I knew more about database design, but perhaps I am misguided. This data has the added structure that I know that I always have information about every marker (ok, it might be missing, but then I can store it as such) and if I just keep it in lists in the right order I should be able to pull out summaries a lot faster than having to join the data together.
I have some ideas for how to code that up as a little library based on lists of the same length and queries based on list comprehension, but there must be solutions like that out there that I just don't know about.




