Day two of APBC: Afternoon session

We skipped the first half of the afternoon session.  Neither of the two tracks seemed particularly interesting to us, so Søren took the time to prepare slides for his presentation tomorrow, and I spent it reading some of the papers from the morning session.

In the second half of the session, we went to a track on association mapping and genomic variation.

Unfortunately, it was held in the main conference hall, which for some reason they keep insanely hot, so if you are still running on GMT+1 (and thus wake up at 5am local time) it is almost guaranteed to put you to sleep.

We managed to hear the first two presentations, but then we left.  Too bad; I would really have liked to hear the other two as well, but I just cannot stay awake in there.

The two presentations we did manage to hear were

Copy-number-variation and copy-number-alteration region detection by cumulative plots W Li, A Lee and PK Gregersen

on a new type of plot that should make it easier to identify copy-number variation from SNP genotype data from a single diploid individual, and

Identifying disease associations via genome-wide association studies W Huang, P Wang, Z Liu and L Zhang

on looking for genetic commonalities between diseases by clustering regions of SNPs with (marginally) significant association with the diseases.


Day two of APBC: Morning session

For the morning paper presentation session, I attended the sequence assembly track.

The papers here all concerned the new algorithmic problems you need to tackle to handle next-generation sequencing technologies, with vastly more data and much shorter reads.

Parallel short sequence assembly of transcriptomes BG Jackson, PS Schnable and S Aluru

The first presentation was about a distributed graph algorithm for de novo assembly.

Graph algorithms are a nice approach to sequence assembly, but they are potentially very time and memory expensive.  The method here distributes both the memory usage and the computation across multiple CPUs, alleviating this problem.

Finding optimal threshold for correction error reads in DNA assembling FYL Chin, HCM Leung, W-L Li and S-M Yiu

The second presentation was on error correction.

With NGS you get a very high number of reads, but a few percent of the nucleotides in the reads are called incorrectly.  This is corrected for by requiring that each K-mer (for a given K) occur at least M times (for some threshold M) before it is believed to be correct.

The problem addressed here was how to choose M for a given data set.  The approach was to model the sequences as generated by a stochastic process, estimate the expected number of false positives and false negatives for each M, and then pick the M that minimises the sum of the two.
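I haven't read the paper yet, so the details of their stochastic model are beyond me, but the general idea can be sketched with a simple two-component Poisson mixture (my own illustration, not the authors' model): erroneous K-mers occur only a few times each, correct K-mers occur at roughly the coverage depth, and we scan thresholds M for the one minimising expected false positives plus false negatives.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)

def choose_threshold(error_mean, true_mean, error_frac, max_m=60):
    """Pick the K-mer count threshold M minimising expected FP + FN.

    An erroneous K-mer that is kept (count >= M) is a false positive;
    a correct K-mer that is dropped (count < M) is a false negative.
    All parameters here are illustrative assumptions, not values from
    the paper.
    """
    best_m, best_cost = None, float("inf")
    for m in range(1, max_m):
        # P(error K-mer has count >= m): kept by mistake.
        fp = error_frac * (1 - sum(poisson_pmf(error_mean, k) for k in range(m)))
        # P(correct K-mer has count < m): dropped by mistake.
        fn = (1 - error_frac) * sum(poisson_pmf(true_mean, k) for k in range(m))
        if fp + fn < best_cost:
            best_m, best_cost = m, fp + fn
    return best_m

# Errors are rare (mean ~1 occurrence); correct K-mers follow coverage (~20x).
print(choose_threshold(error_mean=1.0, true_mean=20.0, error_frac=0.1))
```

With those made-up parameters the two error types balance at a threshold of a handful of occurrences; the point of the paper, as I understood it, is to derive this M from the data rather than guess it.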

Crystallizing short-read assembly around seeds MS Hossain, N Azimi and S Skiena

The third presentation was on a new de novo assembly algorithm tailored to the paired-end reads you get from the SOLiD platform.

The first half of the presentation, though, was an overview of various platforms, so I’ll need to read the paper before I have any idea about the specifics of the algorithm.

Short read DNA fragment anchoring algorithm W Wang, P Zhang and X Liu

The last presentation was not on de novo assembly but on assembly against a reference genome, and concerned finding anchors (sub-strings of a larger string that approximately match a query string).

This time around I didn’t get any of the details.  Perhaps because it was getting close to lunch and I was fading out…


Day two of APBC: Invited talks

The morning session today started with two invited talks:

Bailin Hao: Independent verification of 16S rRNA based prokaryotic phylogeny by composition vector approach

The first one was a bit strange.  The presentation consisted of reading the slides aloud and the topic was never really completely clear to me.

They had, apparently, constructed a phylogeny for prokaryotes using a new method called CVTree — essentially neighbour-joining but with a distance measure based on K-mer statistics, and thus parameter- and alignment-free — and compared that with the tree-of-life 16S rRNA phylogeny and the taxonomy.
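As I understand it, the composition-vector idea boils down to comparing K-mer frequency vectors between whole genomes. A minimal sketch of such an alignment-free distance (leaving out the Markov-background subtraction that the actual CVTree method performs) might look like:

```python
from collections import Counter
from math import sqrt

def kmer_freqs(seq, k):
    """Normalised K-mer frequency vector of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def composition_distance(seq_a, seq_b, k=3):
    """Distance in [0, 1] from the cosine similarity of two K-mer
    frequency vectors.  Alignment-free: no positional information is
    used, only K-mer composition.  (Simplified illustration; CVTree
    additionally subtracts a Markov-model background from each vector.)
    """
    fa, fb = kmer_freqs(seq_a, k), kmer_freqs(seq_b, k)
    dot = sum(fa.get(m, 0.0) * fb.get(m, 0.0) for m in set(fa) | set(fb))
    na = sqrt(sum(v * v for v in fa.values()))
    nb = sqrt(sum(v * v for v in fb.values()))
    return (1 - dot / (na * nb)) / 2
```

A matrix of such pairwise distances can then be fed straight into neighbour-joining, which is what makes the approach attractive for whole-genome phylogenies where alignments are impractical.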

The talk was mainly bashing the taxonomy for having classes that are not monophyletic, but there wasn’t much on actual comparisons of the tree hierarchies/phylogenies.

Pavel Pevzner: Genome rearrangements: from biological problems to combinatorial algorithms

The second talk, on the other hand, was absolutely great.

Pevzner first warned us that he would give a more “computer science” talk than most of the talks so far, and would prove three theorems (one of which would be wrong).

He took three controversial biological problems

  1. Do rodents and primates group, with carnivores as an outgroup, or do primates and carnivores group, with rodents as an outgroup?
  2. Does the mammalian genome contain rearrangement hotspots, or are rearrangements randomly distributed?
  3. Was there a whole-genome duplication in yeast?

and he reduced these questions to combinatorial problems that can be tested:

  1. Ancestral genome reconstruction
  2. Breakpoint re-use analysis
  3. The genome halving problem

The first problem is essentially a matter of constructing a parsimony tree based on rearrangement events, and he showed that this favoured the ((carnivore,primate),rodent) topology.

The second problem was a matter of putting a lower bound on the number of rearrangement events between human and mouse and showing that this lower bound was greater than the observed number of breakpoints, which means that some breakpoints must have been reused.
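The classic illustration of this kind of argument (not necessarily the exact bound Pevzner used) is breakpoint counting: a single reversal can remove at most two breakpoints, so the reversal distance between two gene orders is at least half the number of breakpoints. A small sketch, with a made-up gene order standing in for one genome expressed in the coordinates of the other:

```python
def breakpoints(perm):
    """Breakpoints of a permutation of 1..n relative to the identity:
    adjacent positions (including the 0 and n+1 borders) whose values
    are not consecutive integers."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)

# Hypothetical gene order of genome B in genome A's coordinates.
order = [3, 5, 2, 4, 1]
b = breakpoints(order)
# Each reversal removes at most two breakpoints, so any reversal
# scenario between the two genomes needs at least ceil(b / 2) steps.
print(f"{b} breakpoints => at least {(b + 1) // 2} reversals")
```

If a lower bound computed along these lines exceeds the number of distinct breakpoint regions actually observed, some regions must have been broken more than once — which is the re-use argument for rearrangement hotspots.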

Time ran out before he could talk about the third problem, so I don’t know what the results are there.

I would have preferred it if he had been given more time, ’cause it was really interesting.

The incorrect theorem, by the way, was in a proof for the number of steps needed in transposition sorting, which complicated the results on the second problem.
