Archive for February, 2008

Estimating local ancestry

Tuesday, February 26th, 2008


When two populations A and B meet and start to mix, the resulting population will, for many generations at least, be a mix of the two original populations. This does not mean that each individual belongs to one of the original populations, with the mixed population made up of such “original population” A or B individuals. At least not after a few generations. Instead, each individual will be a mix of population A and B.

For several generations following the merge of populations A and B, mutations will not change the genes in the mixed population much. It takes a long time for mutations to accumulate. Each gene in the population will essentially be unchanged compared to the gene in one (or both) of the ancestral populations. Here, by gene, I simply mean a “chunk” of DNA, not necessarily a functional bit, so don’t read too much into it. In any case, the genes will not change much, but will look like genes from A or from B, where of course genes from A can differ significantly from genes from B.

Recombination will shuffle the genes from A and B around, however. If a “mainly A” individual mates with a “mainly B” individual, the offspring will inherit both A and B genes in some combination. As you scan along a chromosome, the ancestral population will change back and forth between A genes and B genes.

Based only on samples of present-day chromosomes, can we infer the local ancestry, i.e. which chunks of each chromosome came from A and which came from B? In this month’s issue of the American Journal of Human Genetics, there is a paper that addresses this exact problem:

Estimating Local Ancestry in Admixed Populations
Sankararaman et al.
The American Journal of Human Genetics 82(2) 290-303

Abstract

Large-scale genotyping of SNPs has shown a great promise in identifying markers that could be linked to diseases. One of the major obstacles involved in performing these studies is that the underlying population substructure could produce spurious associations. Population substructure can be caused by the presence of two distinct subpopulations or a single pool of admixed individuals. In this work, we focus on the latter, which is significantly harder to detect in practice. New advances in this research direction are expected to play a key role in identifying loci that are different among different populations and are still associated with a disease. We evaluated current methods for inference of population substructure in such cases and show that they might be quite inaccurate even in relatively simple scenarios. We therefore introduce a new method, LAMP (Local Ancestry in adMixed Populations), which infers the ancestry of each individual at every single-nucleotide polymorphism (SNP). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Our empirical results show that LAMP is significantly more accurate and more efficient than existing methods for inferring locus-specific ancestries, enabling it to handle large-scale datasets. We further show that LAMP can be used to estimate the individual admixture of each individual. Our experimental evaluation indicates that this extension yields a considerably more accurate estimate of individual admixture than state-of-the-art methods such as STRUCTURE or EIGENSTRAT, which are frequently used for the correction of population stratification in association studies.

Inferring local ancestry

The method in this paper makes a few simplifying assumptions that make it computationally efficient to run.

They assume that the samples considered are from a mix of populations that contributed to the sample in known frequencies — e.g. that population A contributed with 80% and B with 20% — a known number of generations ago. The method is not that sensitive to knowing exactly the fractions of populations or the number of generations — and using other methods you can infer these parameters anyway — but assuming that you know these parameters helps in the mathematics of the method.

A more important assumption is that you can split the chromosomes into sliding windows where recombinations do not occur inside the windows. This is obviously not correct, but it helps the method a lot and is not as silly as it sounds.

If you can split the chromosomes into sliding windows of a fixed length and then infer which chromosomes belong to which ancestral population, the inference problem is much easier to solve. If you then slide this window along the chromosomes and assign populations to the chromosomes in each window, each nucleotide will belong to different populations depending on the window considered.

The same nucleotide will belong to different populations depending on the window. Is this a problem? Yes and no.

For the windows that overlap a given nucleotide, the method takes a vote, and the majority decides which population the nucleotide “really” belongs to. That way you get a unique population per nucleotide.

This is a pretty good idea. This way you get the fast computation of the ancestral population inference and on average you assign the right population to the nucleotides.

You will not be able to accurately infer the break-points where the population changes between populations, but for most applications that is not that important in the first place. You want to assign the nucleotide to the right population on average, and this is what you achieve this way.
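The voting step described above can be sketched in a few lines of Python. This is only an illustration of the majority vote, not LAMP itself: the per-window ancestry calls are assumed to be given already (in the paper they come from the method's own inference within each window), and the function name, the window coordinates and the labels are made up for the example.

```python
from collections import Counter


def majority_vote(window_calls, n_snps):
    """Combine per-window ancestry calls into one call per SNP.

    window_calls: list of (start, end, label) tuples, with `end` exclusive.
                  Each tuple says "this chromosome looked like `label`
                  in the window covering SNPs start..end-1".
    n_snps:       number of SNPs on the chromosome.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, label in window_calls:
        for pos in range(start, end):
            votes[pos][label] += 1
    # None marks SNPs that no window covered.
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Hypothetical calls for four overlapping windows of six SNPs each.
calls = [(0, 6, "A"), (2, 8, "A"), (4, 10, "B"), (6, 12, "B")]
per_snp = majority_vote(calls, 12)
print("".join(per_snp))
```

With these made-up calls, SNPs 4 and 5 are covered by two “A” windows and one “B” window and end up as A, while SNPs 6 and 7 go the other way. The switch from A to B lands somewhere in the overlap, which is exactly why the break-points are only approximate.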

Relevance for association mapping

The motivation for the paper is association mapping, where you compare the frequency of alleles between cases and controls for a given disease, looking for markers where the frequency is different between cases and controls. Such markers are potential candidates for disease genes: if one allele is more frequent in cases than in controls, maybe it is more frequent because it increases the risk of the disease.

If your samples are from a population that is a mix of different ancestral populations, there is a high risk of biases. There is a bias in the sampling: if the ancestral populations are sampled in different ratios for cases than controls, you will pick up differences in cases and controls just because of that.

There are obvious situations where this can happen. If you sample cancer patients from an expensive clinic (rich white Americans) and the controls from the ER (with a higher ratio of African Americans), for example, you get a different ratio of individuals of African descent and European descent in cases and controls.

If you do not correct for this, you are mapping “ancestral population” genes instead of disease genes.
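A small worked example shows how large this effect can be. The numbers below are invented for illustration and are not from the paper: a marker with allele frequency 0.2 in population A and 0.6 in population B, no effect on the disease at all, and cases and controls drawn with different proportions of A ancestry.

```python
def pooled_allele_frequency(fraction_a, freq_a, freq_b):
    """Expected allele frequency in a sample that is a mix of A and B.

    fraction_a: proportion of the sample with ancestry A (the rest is B).
    freq_a, freq_b: allele frequency of the marker in A and in B.
    """
    return fraction_a * freq_a + (1.0 - fraction_a) * freq_b


# Hypothetical marker that has nothing to do with the disease.
FREQ_A, FREQ_B = 0.2, 0.6

# Hypothetical sampling: cases are 90% A, controls are 50% A.
cases = pooled_allele_frequency(0.9, FREQ_A, FREQ_B)
controls = pooled_allele_frequency(0.5, FREQ_A, FREQ_B)

print(f"cases:    {cases:.2f}")
print(f"controls: {controls:.2f}")
```

The allele frequency comes out at 0.24 in cases and 0.40 in controls, a difference that an association test would happily report, even though it only reflects who was sampled where.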

Comparison to CoalHMMs

I didn’t actually read the paper with association mapping in mind, although the problem is extremely relevant for such studies. I do association mapping in Icelanders, though, so it is not that important for my own work.

I read it for a meeting in our “coalescent hidden Markov model” group.

With our CoalHMMs, we try to learn about speciation events. When there is lineage sorting in the speciation, as for example between humans, chimps and gorillas, the nearest neighbour species of a chromosome changes along the chromosome: in some regions humans are more closely related to chimps, in others more closely related to gorillas.

This setting is different than the population mixing problem. For one thing, we are not dealing with different ancestral populations mixing, but rather populations splitting up to become separate species. Still, scanning along the chromosomes and inferring which phylogeny each nucleotide belongs to is similar to the problem here.


SANKARARAMAN, S., SRIDHAR, S., KIMMEL, G., HALPERIN, E. (2008). Estimating Local Ancestry in Admixed Populations. The American Journal of Human Genetics, 82(2), 290-303. DOI: 10.1016/j.ajhg.2007.09.022

Today’s pick on The DNA Network

Tuesday, February 26th, 2008

Ok, I admit, the title is misleading, ’cause I haven’t actually looked through all the posts and picked my favourites, but I did scan it briefly and found two interesting posts that I’d like to share here.

The first is a fun story from Bitesize Bio about UV light mutating (and destroying) DNA. I don’t know what to expect here (I have never been allowed inside a lab), but it seems that UV light is a lot more damaging than expected (at least than expected by Nick, the author). So if it is common to visualise DNA using UV light, are we damaging what we should be studying?

The second is a paper review on Genetic Future on the genetics of metabolic diseases — more related to my own research areas. I haven’t read the paper myself — yet — so I am just describing the review here.

The study is on genetic variation in metabolic genes and their correlation with environmental factors, and argues for selection, rather than simple genetic drift. An interesting read.

Google cluster computing

Tuesday, February 26th, 2008

Google, together with the National Science Foundation (NSF; “National” here means the US), and possibly IBM as well (it isn’t quite clear from the press release), will provide cluster computing to researchers.

This YouTube video describes a Google + IBM project that now looks like it’s only a pilot for a larger one:

 

Yeah! Let’s have more of that, but remember to make it easy to use for scientists. Integrate cluster computers with the desktop!

The video mentions integration with Eclipse — does anyone know more about this?

Thoughts on peer-reviewing

Sunday, February 24th, 2008

I am reviewing a paper for PLoS One today. At PLoS One, reviewers have the option to 1) waive anonymity to the paper’s authors, and 2) allow their review (or sound bites from it) to be published together with the paper, so readers of the paper can see what the reviewers thought.

This got me thinking about reviews in general.

Anonymous reviewers?

I realize that there are good reasons to keep reviewers or authors anonymous during the reviewing process.

I’ve never actually reviewed a paper where the author was not known, but there are good arguments for keeping the authors unknown to the reviewer. Like it or not, we do have prior ideas about our peers and the quality of their work, and that is likely to influence our reviews. We are more likely to believe the results of authors with proven track records than those of people whose papers we have rejected on several occasions.

That being said, I do not think it is a major problem to know the authors.

What about anonymous reviewers, then?

In my experience — which admittedly is limited to computer science, biology and bioinformatics — the common case is that the reviewer is kept anonymous from the authors, and that he has the possibility to write comments that will be shown to the authors and in addition to write comments that will only be seen by the editor(s).

I’ve never used the “private” comments when reviewing, and I have never written a review that I wouldn’t put my name on. Not that all my reviews are positive — far from it, some would say — but any honest review should not be something you would be ashamed of admitting to have written.

Of course, there are reasons for anonymity here. A bad review hurts, especially if you feel that it is unfair, and hurt feelings can affect your judgment down the line. “If I get a bad review from you, then you will get a bad review from me”.

How much of an issue this is, I don’t know. I should hope it is a minor issue and that reviewers are more objective than that. If not, they shouldn’t accept the review; it is clearly a case of conflicting interests: revenge vs science.

In any case, let’s be honest, reviewers are less anonymous than you would think. Even when they are not named, their suggestions often give you a reasonably good idea of who they are, so why not just name them?

I think there are very good reasons to name both authors and reviewers in the reviewing process. I have only on one occasion known a reviewer on one of my papers (not because the policy was to name the reviewers, but because he explicitly signed the review), and being able to discuss the results with the reviewer was helpful.

This shouldn’t in itself be an argument for disclosing the reviewers — I can easily imagine being spammed/flamed in order to change my mind on a review — I’m just saying that there are some benefits from knowing who your reviewers are.

PLoS’s policy of letting the reviewer decide whether to disclose his name or not is a good idea, in my view. As a sign of good faith and honesty in the review, I would say a reviewer should in general choose to disclose his name, but on rare occasions where there are good reasons not to, he should be allowed to withhold it (and the editor should probably consider these reasons and judge whether they conflict with the objectivity of the review).

Of course, should a reviewer choose to remain anonymous, that should be respected. The journal should not disclose confidential information under any circumstances (under the law) — and this is the point where I refer you to the editorial in the current issue of Science that addresses exactly this issue.

Public reviews?

So much for anonymous reviewers. What about the actual reviews?

Lately I find that I often search for online reviews of papers I read. Reading blog posts about a paper is no substitute for reading the actual paper — that goes without saying — but I find it very helpful to read what other people think about a paper and what other papers they refer to. It is like one global journal club discussion.

I would love to see more of this.

I realize that I am comparing oranges and apples here. Reviewing a paper with the intent of judging whether it is publishable and suggesting ways to improve it is a very different situation from commenting on an already published paper.

The quality control that is peer-reviewing cannot be substituted by the blogosphere. That would turn the whole process into a popularity contest.

Still, once a paper is published, why not publish the reviews together with it, so the reader can learn of the concerns or suggestions of the reviewers? Maybe there are good reasons to keep parts of reviews confidential, but that could be left to the editor’s discretion.

In any case, I’d love to see more public discussion on published research.

I am doing my part in this, small as it is. I probably review five times as many papers on my blog as I do for journals. I submit my reviews to Research Blogging and have recently joined CiteULike, where I keep a list of the papers I read.

CiteULike, by the way, is a great place to find related literature. Try searching through the lists that overlap your own and you will find lots of papers worth reading.

A Web 2.0 journal club?

PLoS One has a comments section with blog trackbacks on each paper. That is a great way to get other opinions on a paper and to discuss a paper, but it is only one journal. For other journals, you need a bit of googling to find discussions.

Wouldn’t it be cool with a website that would aggregate paper reviews, discussions and related literature? A mix of Medline (and similar) for related literature, combined with CiteULike and Research Blogging for the “social networking” component.

More on worldwide and genomewide variation…

Saturday, February 23rd, 2008

Just to finish the trilogy (the three papers examining genome-wide polymorphism in this week’s Nature and Science), I should mention Li et al.’s Science paper, which covers essentially the same ground as the Jakobsson et al. paper I just reviewed.

Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation

Li et al.

Abstract

Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.

The results do not differ that much from Jakobsson et al. but the analysis is different.

First, they use a maximum likelihood method to cluster the sampled individuals into K unknown “ancestral clusters” and consider the clusterings obtained with different values of K. For increasing K, the individuals cluster into smaller and smaller groupings, indicating their relatedness compared to the whole sample.

Once K is high enough (K=7), the populations mainly cluster together, with most populations being derived from a single cluster but with some populations (Middle Eastern and South/Central Asian) being a mix of the ancestral clusters.

They then construct a maximum likelihood phylogeny for the populations and find that it fits nicely with the Out of Africa model.

Considering haplotype heterozygosity, they observe that heterozygosity decreases with distance from East Africa, similar to what Jakobsson et al. report.
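For readers unfamiliar with the statistic: haplotype heterozygosity is the probability that two haplotypes drawn at random from a population differ. A minimal sketch of the standard sample estimate follows. The haplotype counts are invented, and I am not claiming this is the exact estimator or windowing used by Li et al.

```python
def haplotype_heterozygosity(counts):
    """Estimate the probability that two sampled haplotypes differ.

    counts: number of times each distinct haplotype was observed.
    Uses 1 - sum(p_i^2), with the usual n / (n - 1) correction
    for a finite sample of n haplotypes.
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    homozygosity = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - homozygosity)


# Hypothetical samples of ten haplotypes from two populations.
diverse = haplotype_heterozygosity([3, 3, 2, 2])  # many haplotypes
bottlenecked = haplotype_heterozygosity([9, 1])   # one dominates

print(f"diverse:      {diverse:.3f}")
print(f"bottlenecked: {bottlenecked:.3f}")
```

Under a serial founder model, each new founding group carries only a subset of the haplotypes of its source, so populations further along the chain look more like the second sample than the first.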


Li, J.Z., Absher, D.M., Tang, H., Southwick, A.M., Casto, A.M., Ramachandran, S., Cann, H.M., Barsh, G.S., Feldman, M., Cavalli-Sforza, L.L., Myers, R.M. (2008). Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science, 319(5866), 1100-1104. DOI: 10.1126/science.1153717