Framework for whole genome sequence analysis
Friday, October 10th, 2008I’m working on a project right now where I am analysing a whole genome alignment of human, chimp, orangutan and macaque.
I’ve downloaded the alignment from the UCSC genome browser, it is in the MAF format, and I use the bx-python module from Penn State for the analysis. That is a very nice module, by the way.
For scanning through the alignment and collecting various statistics, MAF+bx-python is a fine combination, but now I have some interesting regions I want to analyse in more depth, and then the combination is not quite good enough.
The MAF files are essentially sequential chunks of alignments, and to find a particular region in the genome, I need to scan through a large file until I find the region.
I have solved this problem by extracting the regions and put them in a number of different files, that I can then easier access. But wouldn’t it be nice if I could just seek() to the relevant position? In any of the species coordinates just to make it even nicer?
The bx-python module lets me extract sub-alignments, so
s = a.slice(begin,end)
extracts the sub-alignment of a beginning at column begin and ending at end-1.
(I can also get a sub-set of species, if you are wondering, but the method is slightly different for that).
I would love to be able to get a sub-alignment based on species coordinates instead of column numbers, so
s = a.slice(begin,end,coordinates='hg18')
would give me the alignment that maches the region [begin,end[ in humans (hg18 assembly). With a proper data representation, it should also give me this without scanning linearly through the file…
Does anyone know about a multiple alignment representation that lets me look up slices in files without scanning through all the alignment leading up to that location?
Surely genome browsers do not scan linearly through genomes to display their information — it would never be fast enough…
What do people use?