Archive for October 7th, 2008

A short introduction to the Human Genome

Tuesday, October 7th, 2008

At BiRC we have a small study group where we are reading Michael Lynch's book, The Origins of Genome Architecture.  We take turns presenting a chapter, and last time it was my turn.  The chapter (chapter 3) is on the human genome.

Below are my notes.  I've tried to translate basepairs and percentages into meters, 'cause it helps me visualise the relative sizes.  Even though it still boggles my mind a bit that the human genome is about 2m if you stretch it out.  Well, that can't be helped.  Three billion is a lot, even if it is three billion very tiny things...

The human genome

The human genome is about 3 Gbp, which is about 2 m if you stretch it out.  Of this, about 1%, or 2 cm, is coding.  That doesn't mean that only 1% is genes, 'cause most of a gene (even a protein coding gene) is not actually coding.  We'll get to that later.
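The percent-to-meters translation I use throughout these notes is just a multiplication.  Here is a minimal sketch, assuming the rough figure of 2 m for the whole stretched-out genome; the constant and the function name are made up for illustration.

```python
# Assumption: the whole genome stretches to roughly 2 m, as in the notes.
GENOME_LENGTH_M = 2.0


def fraction_to_meters(fraction, genome_length_m=GENOME_LENGTH_M):
    """Translate a fraction of the genome (0..1) into a length in meters."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return fraction * genome_length_m


# The coding 1% of the genome, expressed in centimeters.
coding_cm = fraction_to_meters(0.01) * 100
```

With these numbers the coding 1% comes out at about 2 cm.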

There is nothing unusual about the human genome.  Not compared to other multicellular eukaryotes, at least.  It is not the largest or the smallest and it does not have an unusual number of genes.  In fact, it is common as mud.

The genes are grouped into families, where the genes in a family are phylogenetically related.  Some people have argued that the number of genes in a family tells you something about the importance of the family, but the distribution of family sizes could just as easily be explained by a stochastic birth/death process, at least if the gene families have different birth/death ratios.  (This raises the question of why they should have that, but the chapter doesn't say and I don't know.)
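To get a feel for how a stochastic birth/death process alone can produce families of very different sizes, here is a toy simulation.  It is only a sketch under my own assumptions (discrete time steps, every gene independently duplicating or dying with a fixed probability per step, each family starting from one gene); it is not the model from the chapter, and the function name and rates are made up.

```python
import random


def simulate_family_size(birth_rate, death_rate, steps, rng):
    """Follow one gene family, starting from a single gene, for `steps` steps.

    In each step every gene duplicates with probability `birth_rate` and is
    lost with probability `death_rate`.  Returns the final family size.
    """
    size = 1
    for _ in range(steps):
        births = sum(rng.random() < birth_rate for _ in range(size))
        deaths = sum(rng.random() < death_rate for _ in range(size))
        size = max(size + births - deaths, 0)
        if size == 0:
            break  # the family has gone extinct
    return size


rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical rates: equal birth and death probabilities for every family.
sizes = [simulate_family_size(0.05, 0.05, 50, rng) for _ in range(200)]
```

Even with identical rates for all families, `sizes` typically ends up with many extinct or single-gene families and a few larger ones, so family size by itself says little about importance.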

Introns and exons

The protein coding genes consist of introns and exons.  The exons are what is left after the transcript is spliced and the introns are what is thrown out.

A personal comment: remember that exon is not synonymous with coding.  I've seen people confused about this (and made the mistake myself in a script or two).  The coding sequence starts somewhere along the sequence of exons and stops somewhere before the exons end. The coding sequence is a sub-sequence of the exons.
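For scripts, the distinction can be sketched like this: clip every exon to the coding interval and keep what is left.  The half-open coordinates, the function name and the example gene are all assumptions of mine, and a real annotation format will have its own conventions.

```python
def coding_parts(exons, cds_start, cds_end):
    """Clip each exon to the coding interval [cds_start, cds_end).

    `exons` is a list of half-open (start, end) intervals on the genome.
    Exons (or parts of exons) outside the coding interval are untranslated
    and are dropped from the result.
    """
    parts = []
    for start, end in exons:
        clipped_start = max(start, cds_start)
        clipped_end = min(end, cds_end)
        if clipped_start < clipped_end:
            parts.append((clipped_start, clipped_end))
    return parts


# Hypothetical gene with three exons; the coding sequence starts inside the
# first exon and stops inside the last one.
exons = [(0, 100), (500, 650), (900, 1200)]
cds = coding_parts(exons, cds_start=60, cds_end=1000)

exon_length = sum(end - start for start, end in exons)
coding_length = sum(end - start for start, end in cds)
```

Here the exons add up to 550 bp but only 290 bp of that is coding, which is exactly the difference that bit me.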

Back to the chapter...

The average exon length is about 0.15 kbp while the average intron length is about 4.66 kbp.  So introns are on average about 30 times as long as exons.  If we imagine (not to scale this time) that the exons are 1 cm long, then we have about 30 cm of introns between them.

That is a lot of intron...

Since complex organisms don't seem to have dramatically more coding genes than simpler organisms, alternative splicing has been suggested to explain the complexity.

On average we have 2.6 different splice products per gene.  As far as we know, there is less alternative splicing in C. elegans and D. melanogaster.  But then we know a lot more about alternative splicing in humans and mice than we do in any other organism, so we are bound to have seen more rare splice variants here, and the extra splicing we are seeing might easily be an ascertainment bias.

We don't really know how much of the alternative splicing is functionally important and how much is random "noise".

Regulatory DNA

How much regulatory DNA do we have?  Based on conserved regions in the genome (which probably gives a very conservative estimate) it is a few percent of the genome.

It seems, however, that the fraction of the genome that is conserved increases with organism complexity, so perhaps complexity is all in regulation?

Everything and the kitchen sink seems to be transcribed, and most of it differentially expressed, but how much of this is spurious we do not know.

We do know about different non-coding RNA genes, such as miRNA.  Although miRNA is only found in multi-cellular organisms, RNA interference is found in all domains of life.

Even accounting for regulatory DNA, 95% of the genome -- 1.9 m -- has no function that we know of.  Know of being the key word here, of course.  Don't call it junk before you know that it is really doing nothing...

Mobile elements

There are about 100 times as many mobile elements as there are coding genes, and 75% -- 1.5 m -- of the genome is a product of some mobile element or other.

(I have a lot of notes about the different kinds, but quite frankly I find it a bit dull, so I am not going to mention it here...)

Pseudogenes

There are about half as many pseudogenes as coding genes.

There are various mechanisms for introducing pseudogenes in the genome, including re-insertion of cDNA, tandem duplications and just plain inactivation of a gene.

The first two cases are likely to introduce pseudogenes "dead on arrival".  A re-inserted cDNA won't have the regulatory mechanisms to get transcribed, and a duplication is likely to disrupt it as well.

Pseudogenes seem to have a half-life of ~37 million years (the half-life of the time it takes until we no longer recognize it as a copy of another gene) compared to a half-life of ~7.5 million years for active duplicated genes.  This seems to indicate that there isn't much selection working against dead-on-arrival genes, compared to duplicating an active gene and thus potentially doubling its product.
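To see what those two half-lives mean in practice, here is a small sketch that assumes simple exponential decay, which is my reading of "half-life" and not something the chapter spells out.  The function name is made up.

```python
def fraction_remaining(time_myr, half_life_myr):
    """Fraction of copies still recognizable after `time_myr` million years,
    assuming exponential decay with the given half-life (also in Myr)."""
    if half_life_myr <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (time_myr / half_life_myr)


# After 37 million years, using the two half-lives from the notes:
pseudogenes_left = fraction_remaining(37, 37)    # one half-life has passed
duplicates_left = fraction_remaining(37, 7.5)    # almost five half-lives
```

Under that assumption, half of the pseudogenes are still recognizable after 37 million years, while only a few percent of the active duplicates are.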

Human evolution

The chapter closes with an analysis of the human lineage and our evolution.  (Can we finally find something that makes us special?)

It seems that we have seen an increase in substitution rate on our line, but Lynch argues that what we are seeing is not so much adaptive selection but rather a reduced negative selection.

There can be different explanations for this.  A reduced population size could have reduced the effects of selection.  A change in behaviour could have changed our fitness landscape, so traits that would normally be under strong selection against change are now free to vary.

Damn stupid bug

Tuesday, October 7th, 2008

The last week I've been running scripts to analyse a whole genome alignment, and I got some weird results.  In some parts of the alignment, I was seeing way more sequence divergence than there should be.

Now, manually inspecting 2Gbp of alignment is not an option, so I wrote a program to plot various summaries of chunks of the alignment.

Still, it is a lot of alignment, so scanning the genome with the plots didn't work either, although the plots clearly showed that something was wrong.

I've been scratching my head about this for a while, and it wasn't until I was heading to bed yesterday that it hit me: parts of the sequences are repeat-masked, so the case varies in the alignment. I consider A != a so I am seeing more variation than is really there!

I feel so stupid now.  If only I had taken a look at the actual sequences, this would have been obvious!  But I didn't, I wrote a program to summarise the data instead.

Stupid stupid stupid...
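For the record, the fix amounts to ignoring case when comparing alignment columns.  This is a minimal sketch and not my actual script: the function name is made up, the two rows are assumed to be plain strings of equal length, and gap characters are compared like any other character.

```python
def count_mismatches(seq_a, seq_b, ignore_case=True):
    """Count the columns where two aligned sequences differ.

    With ignore_case=True, soft-masked (lower-case) repeats are compared as
    if they were upper-case, so 'a' and 'A' count as the same nucleotide.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if ignore_case:
        seq_a = seq_a.upper()
        seq_b = seq_b.upper()
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical alignment rows; the second half of the first row is masked.
row_1 = "ACGTacgt"
row_2 = "ACGTACGA"

buggy = count_mismatches(row_1, row_2, ignore_case=False)
fixed = count_mismatches(row_1, row_2)
```

The case-sensitive comparison counts every masked column as a difference, while the case-insensitive one only counts the single real substitution.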