Framework for whole genome sequence analysis
I’m working on a project right now where I am analysing a whole genome alignment of human, chimp, orangutan and macaque.
I’ve downloaded the alignment from the UCSC genome browser, it is in the MAF format, and I use the bx-python module from Penn State for the analysis. That is a very nice module, by the way.
For scanning through the alignment and collecting various statistics, MAF+bx-python is a fine combination, but now I have some interesting regions I want to analyse in more depth, and then the combination is not quite good enough.
The MAF files are essentially sequential chunks of alignments, and to find a particular region in the genome, I need to scan through a large file until I find the region.
I have solved this problem by extracting the regions and put them in a number of different files, that I can then easier access. But wouldn’t it be nice if I could just seek() to the relevant position? In any of the species coordinates just to make it even nicer?
The bx-python module lets me extract sub-alignments, so
s = a.slice(begin,end)
extracts the sub-alignment of a beginning at column begin and ending at end-1.
(I can also get a sub-set of species, if you are wondering, but the method is slightly different for that).
I would love to be able to get a sub-alignment based on species coordinates instead of column numbers, so
s = a.slice(begin,end,coordinates='hg18')
would give me the alignment that maches the region [begin,end[ in humans (hg18 assembly). With a proper data representation, it should also give me this without scanning linearly through the file…
Does anyone know about a multiple alignment representation that lets me look up slices in files without scanning through all the alignment leading up to that location?
Surely genome browsers do not scan linearly through genomes to display their information — it would never be fast enough…
What do people use?
October 10th, 2008 at 2:30 pm
I think pygr does what you want. See the tutorial:
http://bioinfo.mbi.ucla.edu/pygr_0_7_b3/seq-align.html#SECTION000110000000000000000
October 10th, 2008 at 2:37 pm
Thanks for the link, Ryan, yes that looks like just the thing … I’ll check it out!
October 10th, 2008 at 6:03 pm
I believe bx-python can do exactly what you want. You simply need to index the alignments using “maf_interval_index.py” which creates an on disk interval tree (kinda sorta) which can be used to find all the alignment block offsets for given genomic region. There is even a script to extract blocks based on this (“maf_extract_ranges_indexed.py”).
You can even index and efficiently seek into MAFs that have been compressed using bzip2 or lzo. See:
http://bx-python.trac.bx.psu.edu/wiki/Example%3A%20Bzip2%20with%20indexed%20MAFs
Once you have an alignment block, you can also slice by coordinate and not column using the bx classes. The appropriate method is “slice_by_component”, and it allows you to slice by any component in the alignment.
October 10th, 2008 at 6:16 pm
Thanks a lot, James!
October 10th, 2008 at 8:23 pm
I can suggest a very efficient way of searching for features in alignments.
First, I’d recommend storing your data in a RDBMS (i prefer Postgres)
We generally think of our genomic data as mapping to a number line (being the chromosome, contig, assembly…).
A member of an alignment will at a minimum have some ID, a target sequence it aligns to(chromosome), and start/end coordinates on this target sequence. If you want to find features which span some region of interest, the direct query would be something like:
select * from alignments where chromosome = ‘chr3′, start_pos >= x1, end_pos <= x2.
Even indexed well, this is a slow query when the table gets large.
However, you can also represent your features (read alignments, conservation scores, exons, introns…) as 2d geometric features.
http://www.postgresql.org/docs/8.1/interactive/datatype-geometric.html
The above query can then become:
select * from alignments where feature_map && box’(x1,y1,x2,y2)’
Where x1 and x2 are the start and end positions, and y1 and y2 are the given depth for the chromosome on the Y axis (chr3 may have the values y1=3, y2=3.1).
!!Be careful not to use the same depth for y1 and y2, as some RDBMS’s will convert your box to a line and cause strange behavior!!
‘feature_map’ holds the box data type
This transformation allows you to use some more advanced indexing to VERY rapidly do range queries, and also offers very efficient ways to query for partial overlaps, nearest neighbors, and some other snazzy options.
These are things the above range query can’t handle easily, or at all.
http://www.postgresql.org/docs/8.1/interactive/functions-geometry.html
The last bit to know, is you will want to build an R-tree index on the geometric data column.
I hope this was clear. Using this trick enabled me to build this tool:
http://awabi.cbio.mskcc.org/cgi-bin/major/mut_char_ex.pl
which carries out range queries on tables with 100′s of millions of rows of features, and for an arbitrary query takes ~.5sec to process. The original implementation used the three value range query and took 10+ seconds.
note- Oracle also implements these geometric data types in its ‘Spatial’ package.
I actually find the Postgres implementation easier to use though.
note2- UCSC uses a bin’ing strategy to accomplish these kinds of queries, but I don’t find that as intuitive.
John Major
October 11th, 2008 at 8:17 am
Thanks John, I wasn’t even aware that I could use SQL for this — I had given up on it, really, because I didn’t expect it to be able to handle geometric data. I would essentially just be using your first example, and that is indeed very slow. I figured I would explicitly need interval trees or something … in my defence, my data base experience is very limited.
PS. Sorry if you had problems posting, but the number of links was caught by the spam filter and I didn’t see the post until this morning where I ‘un-spammed’ it.
October 13th, 2008 at 3:43 pm
cool idea to use sql’s spatial features for genomics! I’ve never seen that before.
January 22nd, 2009 at 6:28 pm
[...] do this, I am going to take John Major’s advice and use an SQL database. We have a MySQL database rather than a PostgreSQL database, so I [...]