IDENTIFICATION OF GENES IN GENOMIC DATA: TOO MUCH OF A GOOD THING
RELEVANT NUMBERS • E. coli genome • Four million base pairs • 4000 genes • Human genome • Three billion base pairs • 25,000 genes
GENE STRUCTURES • E. coli - WYSIWYG • Human - Life is complicated • Exons and Introns • Alternative splicing • Alternative starts and stops
MICROBIAL WORK • Not trivial but more direct than human • Many approaches • GRAIL a first attempt • Web based searches • Markov models most common
General Markov Model • Probabilistic model • Uses adjacent base(s) to predict current base • Order of model depends on number of preceding bases examined • Sum the log-probabilities for each base (equivalent to multiplying the probabilities) • High score wins (recognized as gene)
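The scoring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular gene finder: the function names `train_markov` and `log_score`, the add-one smoothing, and the uniform fallback for unseen contexts are all assumptions made for the example. Input is assumed to contain only A, C, G and T.

```python
import math
from collections import defaultdict

def train_markov(sequences, k):
    """Estimate P(base | previous k bases) from training sequences (add-one smoothing)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + 4  # one pseudocount each for A, C, G, T
        model[context] = {b: (nxt.get(b, 0) + 1) / total for b in "ACGT"}
    return model

def log_score(seq, model, k):
    """Sum the log-probability of each base given its k predecessors."""
    score = 0.0
    for i in range(k, len(seq)):
        probs = model.get(seq[i - k:i])
        p = probs[seq[i]] if probs else 0.25  # unseen context: assume uniform
        score += math.log(p)
    return score

# Toy usage: a sequence resembling the training data scores higher.
model = train_markov(["ATGATGATGATG"], k=2)
print(log_score("ATGATG", model, 2), log_score("CCCCCC", model, 2))
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow on long sequences.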
Second Order Probabilities P(i | i-2, i-1) • 64 possibilities • kth order needs 4^(k+1) probabilities • DNA actually needs six models (six reading frames), so six times all probabilities • Need known genes to determine real probabilities (train the model)
Example • 5th order model needs 24,576 probabilities • 4096 hexamers x 6 frames • Do these occur frequently enough in identified coding regions to give good probabilities?
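The parameter count and the six reading frames can be checked with a short Python sketch. The helper names `n_probabilities`, `reverse_complement` and `six_frames` are hypothetical, chosen only for this illustration.

```python
def n_probabilities(k, frames=1):
    """A kth-order model conditions on k bases and predicts 1: 4**(k+1) entries per frame."""
    return frames * 4 ** (k + 1)

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[b] for b in reversed(seq))

def six_frames(seq):
    """Three offsets on the forward strand plus three on the reverse strand."""
    rc = reverse_complement(seq)
    return [strand[offset:] for strand in (seq, rc) for offset in range(3)]

print(n_probabilities(2))            # second order, one frame
print(n_probabilities(2, frames=6))  # second order, all six frames
print(six_frames("ATGCCC"))
```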
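One way to approach the question on this slide is to count how often each hexamer actually appears in the training set. The sketch below is an assumption-laden illustration: `hexamer_coverage` and the `min_count` cutoff are hypothetical, and for brevity it pools all frames instead of counting each reading frame separately.

```python
from collections import Counter
from itertools import product

def hexamer_coverage(coding_seqs, min_count=10):
    """Return (number of hexamers seen fewer than min_count times, total hexamers)."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    all_hexamers = ("".join(p) for p in product("ACGT", repeat=6))
    sparse = [h for h in all_hexamers if counts[h] < min_count]
    return len(sparse), 4 ** 6

# Toy usage: a 9-base training set covers almost none of the 4096 hexamers.
print(hexamer_coverage(["ATGATGATG"], min_count=1))
```

A large sparse count suggests the training set is too small for a reliable 5th order model, which is the motivation for falling back to shorter contexts.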
Scan sequence and determine scores for each region based on probabilities • Scores above threshold declared genes because they are like previously identified genes • Implies a hidden relationship
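The scan-and-threshold step might look like the following Python sketch. The function `scan`, its window size, step and threshold are illustrative assumptions; `score_fn` stands for whatever region score is in use (for example a Markov log-score), and the toy usage substitutes a simple G count only to keep the example self-contained.

```python
def scan(seq, score_fn, window=96, step=3, threshold=0.0):
    """Slide a fixed window along seq and keep regions scoring above threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        region = seq[start:start + window]
        score = score_fn(region)
        if score > threshold:
            hits.append((start, start + window, score))
    return hits

# Toy usage with a stand-in score (number of G's in the window).
print(scan("AAAAGGGGAAAA", lambda s: s.count("G"), window=4, step=4, threshold=3))
```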
GLIMMER and GLIMMERM • Use interpolated Markov models • If oligomers are available, will score up to 8th order • M version works for small eukaryotes • Plasmodium falciparum • Arabidopsis thaliana • Adds information about splice sites
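The idea behind an interpolated Markov model can be sketched as follows: trust a long context when it has been seen often, and otherwise blend in the estimate from a shorter context. This is a simplified illustration and not GLIMMER's actual weighting scheme; the names `train_counts` and `imm_prob`, the linear weight, and the `min_count` parameter are assumptions for the example.

```python
from collections import Counter

def train_counts(seqs, max_order):
    """Count (context, next base) pairs for every context length 0..max_order."""
    counts = Counter()
    for seq in seqs:
        for k in range(max_order + 1):
            for i in range(k, len(seq)):
                counts[(seq[i - k:i], seq[i])] += 1
    return counts

def imm_prob(context, base, counts, min_count=400):
    """Blend the estimate for this context with the estimate for the shorter one."""
    ctx_total = sum(counts[(context, b)] for b in "ACGT")
    if context == "":
        return (counts[("", base)] + 1) / (ctx_total + 4)  # smoothed base frequency
    lower = imm_prob(context[1:], base, counts, min_count)
    if ctx_total == 0:
        return lower  # context never seen: rely entirely on the shorter one
    weight = min(1.0, ctx_total / min_count)  # more observations, more trust
    return weight * counts[(context, base)] / ctx_total + (1 - weight) * lower

counts = train_counts(["ATGATGATGATG"], max_order=2)
print(imm_prob("AT", "G", counts, min_count=8))
```

Because each level is a convex combination of two probability distributions, the blended values for A, C, G and T still sum to 1.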
GENSCAN • Combines multiple approaches • 5th order Markov model • Staden weight matrix to model cis sites • Poly A site • TATA and INR • Cap site • Translation termination sites • Maximal Dependence Decomposition • Donor and acceptor splice site modeling
A Tale of Two Genomes • Private genome – Celera • Public genome – Rest of the world • DRAFT ONLY (except chromosomes 21 and 22)
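A weight matrix for a short signal site can be illustrated with the Python sketch below. It is a generic position weight matrix with log-odds scores against a uniform background, not GENSCAN's own implementation; the function names, the pseudocount, and the three TATA-like training sites are made-up toy values.

```python
import math

def build_weight_matrix(sites, pseudocount=1.0):
    """Per-position log-odds of each base versus a uniform 0.25 background."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({
            b: math.log(((column.count(b) + pseudocount) / total) / 0.25)
            for b in "ACGT"
        })
    return matrix

def score_site(candidate, matrix):
    """Sum the per-position log-odds for a candidate of the same length."""
    return sum(col[base] for base, col in zip(candidate, matrix))

# Toy aligned sites (hypothetical data for illustration only).
matrix = build_weight_matrix(["TATAAA", "TATATA", "TATAAA"])
print(score_site("TATAAA", matrix), score_site("GCGCGC", matrix))
```

A weight matrix treats positions as independent; Maximal Dependence Decomposition exists to capture dependencies between positions that this simple form ignores.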
Celera • Otto-RefSeq • Gold standard – full length, curated cDNAs • Otto-Homology • ESTs • cDNAs • Mouse-human genomic similarities • Known proteins
Celera • De novo • GENSCAN • GRAIL • FGENESH • Manual curation
Celera results • De novo prediction of 76,400 genes (58,000 appeared to be new) • 21,350 supported by some other evidence • Otto homologies identify 17,764 other genes • Total is 39,114 (21,350 + 17,764)
Public • De novo using Ensembl • Merge predictions with predictions from GENIE • Merge results with known genes in databases • Eliminate bacterial sequences