Fast admixture analysis and population tree estimation for SNP and NGS data

Jade Yu Cheng, Thomas Mailund, and Rasmus Nielsen

Bioinformatics (2017)

Motivation: Structure methods are widely used population genetic methods for classifying individuals in a sample fractionally into discrete ancestry components.
Contribution: We introduce a new optimization algorithm for the classical STRUCTURE model in a maximum likelihood framework. Using analyses of real data we show that the new method finds solutions with higher likelihoods than the state-of-the-art method in the same computational time. The optimization algorithm is also applicable to models based on genotype likelihoods, which can account for the uncertainty in genotype-calling associated with Next Generation Sequencing (NGS) data. We also present a new method for estimating population trees from ancestry components using a Gaussian approximation. Using coalescence simulations of diverging populations, we explore the adequacy of the STRUCTURE-style models and the Gaussian assumption for identifying ancestry components correctly and for inferring the correct tree. In most cases, ancestry components are inferred correctly, although sample sizes and times since admixture can influence the results. We show that the popular Gaussian approximation tends to perform poorly under extreme divergence scenarios, e.g. with very long branch lengths, but the topologies of the population trees are accurately inferred in all scenarios explored. The new methods are implemented together with appropriate visualization tools in the software package Ohana.

Automatic differentiation in R

I’ve been working on a small R package that does automatic differentiation. It takes a function that computes an arithmetic expression as input and outputs a function that computes the derivative of the expression. You can check it out on GitHub.

I got inspired to write it a few weeks ago when one of our PhD students gave a talk on automatic differentiation. I didn’t attend the talk, but remembered playing around with it as a meta-program in C++ templates ages ago. Now that I am writing a book on meta-programming in R, I thought it would be a cool example to include there—and I have included it in the chapter I just finished. I gave it to a student as a project, but I am not patient enough to let someone else program it, so I have also done it myself.

It is actually a nice exercise to do. Differentiation is pretty simple to program. You just follow the rules you learned in calculus for the arithmetic operations and apply the chain rule for function calls. Nothing complicated there. To make it a meta-program in R, though, you need to know how to work with expressions and how to inspect functions to correctly apply the chain rule. While this is not particularly hard, the example is a great way to visit the various corners of working with expressions.
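To give a flavour of the idea, here is a minimal sketch of differentiating an R expression by walking its call tree. It is not the code from the package: the names `d`, `d_call`, and `derivative` are made up for illustration, only a handful of operators and functions are handled, the exponent in `^` is assumed to be a constant, and the function body must be a single expression without curly braces.

```r
# Differentiate the expression `expr` with respect to the variable named `var`.
# All names here (d, d_call, derivative) are hypothetical, for illustration only.
d <- function(expr, var) {
  if (is.numeric(expr)) {
    0                                                  # constants vanish
  } else if (is.name(expr)) {
    if (identical(as.character(expr), var)) 1 else 0   # dx/dx = 1, dy/dx = 0
  } else if (is.call(expr)) {
    d_call(expr, var)
  } else {
    stop("unsupported expression: ", deparse(expr))
  }
}

# Handle calls: arithmetic rules for operators, chain rule for functions.
d_call <- function(expr, var) {
  op <- as.character(expr[[1]])
  f <- expr[[2]]
  df <- d(f, var)
  if (length(expr) == 2) {
    # One argument: parentheses, unary minus, or a known function (chain rule)
    return(switch(op,
      "("   = df,
      "-"   = bquote(-.(df)),
      "exp" = bquote(exp(.(f)) * .(df)),
      "sin" = bquote(cos(.(f)) * .(df)),
      "cos" = bquote(-sin(.(f)) * .(df)),
      "log" = bquote(.(df) / .(f)),
      stop("unknown function: ", op)))
  }
  g <- expr[[3]]
  dg <- d(g, var)
  switch(op,
    "+" = bquote(.(df) + .(dg)),
    "-" = bquote(.(df) - .(dg)),
    "*" = bquote(.(df) * .(g) + .(f) * .(dg)),                  # product rule
    "/" = bquote((.(df) * .(g) - .(f) * .(dg)) / .(g)^2),       # quotient rule
    "^" = bquote(.(g) * .(f)^(.(g) - 1) * .(df)),               # constant exponent only
    stop("unknown operator: ", op))
}

# Take a function and return a function computing its derivative.
# By default we differentiate with respect to the first formal argument.
derivative <- function(fn, var = names(formals(fn))[1]) {
  dfn <- fn
  body(dfn) <- d(body(fn), var)
  dfn
}

# Usage: build the derivative of x^2 * sin(x) and evaluate it at a point.
f <- function(x) x^2 * sin(x)
df <- derivative(f)
print(body(df))
print(df(1.5))
```

The result is not simplified, so the body of `df` carries around factors like `* 1`. It evaluates to the right number, but a real implementation would want to tidy up the expression as it builds it.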

Unless I think up something else to add, I think the meta-programming book will be done after one more chapter. After that, I will take a short break from the R books. I will get back to them in a few weeks, I imagine, but I have a few other projects to focus on before then, including proof-reading my data science book. The proofs should arrive next week, and then I have a week to get through them before the book goes to the printer.

I haven’t decided yet what the next R book should be. I’m thinking either functional data structures and algorithms or embedded domain-specific languages. Let me know what you think.

admixturegraph: An R Package for Admixture Graph Manipulation and Fitting

K Leppälä, SV Nielsen, and T Mailund

Preprint at Bioinformatics

Admixture graphs generalise phylogenetic trees by allowing genetic lineages to merge as well as split. In this paper we present the R package admixturegraph containing tools for building and visualising admixture graphs, for fitting graph parameters to genetic data, for visualising goodness of fit, and for evaluating the relative goodness of fit between different graphs.

Evidence that the rate of strong selective sweeps increases with population size in the great apes

K Nam, K Munch, T Mailund, A Nater, MP Greminger, M Krützen, T Marquès-Bonet, and MH Schierup

New paper out in PNAS

The rate of genomic adaptation is determined by the rate of environmental change, the availability of beneficial mutations, and the efficiency of positive selection. The relative importance of these factors has been actively discussed. We address the questions using whole genome sequences of great apes, which have very different population sizes whereas their genomic architectures are highly similar. We infer that the impact of selection on the genomic diversity of a species increases with the effective population size, most likely due to the differential influx rate of beneficial mutations. This explanation is, among other possibilities, expected if adaptive evolution is limited by the waiting time for new favorable mutations in great apes.

Extreme genomic erosion after recurrent demographic bottlenecks in the highly endangered Iberian lynx

Genomic studies of endangered species provide insights into their evolution and demographic history, reveal patterns of genomic erosion that might limit their viability, and offer tools for their effective conservation. The Iberian lynx (Lynx pardinus) is the most endangered felid and a unique example of a species on the brink of extinction.

We generate the first annotated draft of the Iberian lynx genome and carry out genome-based analyses of lynx demography, evolution, and population genetics. We identify a series of severe population bottlenecks in the history of the Iberian lynx that predate its known demographic decline during the 20th century and have greatly impacted its genome evolution. We observe drastically reduced rates of weak-to-strong substitutions associated with GC-biased gene conversion and increased rates of fixation of transposable elements. We also find multiple signatures of genetic erosion in the two remnant Iberian lynx populations, including a high frequency of potentially deleterious variants and substitutions, as well as the lowest genome-wide genetic diversity reported so far in any species.

The genomic features observed in the Iberian lynx genome may hamper short- and long-term viability through reduced fitness and adaptive potential. The knowledge and resources developed in this study will boost the research on felid evolution and conservation genomics and will benefit the ongoing conservation and management of this emblematic species.

Federico Abascal†, André Corvelo†, Fernando Cruz†, José L. Villanueva-Cañas, Anna Vlasova, Marina Marcet-Houben, Begoña Martínez-Cruz, Jade Yu Cheng, Pablo Prieto, Víctor Quesada, Javier Quilez, Gang Li, Francisca García, Miriam Rubio-Camarillo, Leonor Frias, Paolo Ribeca, Salvador Capella-Gutiérrez, José M. Rodríguez, Francisco Câmara, Ernesto Lowy, Luca Cozzuto, Ionas Erb, Michael L. Tress, Jose L. Rodriguez-Ales, Jorge Ruiz-Orera, Ferran Reverter, Mireia Casas-Marce, Laura Soriano, Javier R. Arango, Sophia Derdak, Beatriz Galán, Julie Blanc, Marta Gut, Belen Lorente-Galdos, Marta Andrés-Nieto, Carlos López-Otín, Alfonso Valencia, Ivo Gut, José L. García, Roderic Guigó, William J. Murphy, Aurora Ruiz-Herrera, Tomas Marques-Bonet, Guglielmo Roma, Cedric Notredame, Thomas Mailund, M. Mar Albà, Toni Gabaldón, Tyler Alioto and José A. Godoy

Genome Biology 2016 17:251
DOI: 10.1186/s13059-016-1090-1