
IDENTIFICATION OF GENES IN GENOMIC DATA


lars-barton

Presentation Transcript


  1. IDENTIFICATION OF GENES IN GENOMIC DATA: TOO MUCH OF A GOOD THING

  2. RELEVANT NUMBERS • E. coli genome • Four million base pairs • 4000 genes • Human genome • Three billion base pairs • 25,000 genes
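To make the contrast in these figures concrete, the short sketch below divides genome size by gene count for each organism. It uses only the rounded numbers quoted on the slide; the function name `bp_per_gene` is an illustrative choice, not something from the presentation.

```python
# Genome sizes and gene counts as quoted on the slide (rounded figures).
genomes = {
    "E. coli": {"base_pairs": 4_000_000, "genes": 4_000},
    "Human": {"base_pairs": 3_000_000_000, "genes": 25_000},
}


def bp_per_gene(base_pairs: int, genes: int) -> float:
    """Average stretch of genome (in base pairs) per gene."""
    return base_pairs / genes


for name, g in genomes.items():
    density = bp_per_gene(g["base_pairs"], g["genes"])
    print(f"{name}: {density:,.0f} bp of genome per gene")
```

On these numbers, E. coli has roughly one gene per thousand base pairs, while the human genome has roughly one per 120,000, which is one way to see why gene finding is so much harder in human sequence.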

  3. GENE STRUCTURES • E. coli - WYSIWYG • Human - Life is complicated • Exons and Introns • Alternative splicing • Alternative starts and stops

  4. HUMAN GENES

  5. MICROBIAL WORK • Not trivial but more direct than human • Many approaches • GRAIL a first attempt • Web based searches • Markov most common

  6. One-State Markov Model

  7. First Order Probabilities P(i | i-1)

  8. General Markov Model • Probabilistic model • Uses adjacent base(s) to predict current base • Order of model depends on number of preceding bases examined • Sum the log probabilities for each base (equivalent to multiplying the probabilities) • High score wins (recognized as gene)

  9. First Order Probabilities P(i | i-1)

  10. Second Order Probabilities P(i | i-2, i-1) • 64 possibilities • kth order needs 4^(k+1) probabilities • DNA actually needs six models (six reading frames), so six times all probabilities • Need known genes to determine real probabilities (train the model)

  11. Example • 5th order model needs 24,576 probabilities (4096 hexamers x 6 frames) • Do these occur frequently enough in identified coding regions to give good probabilities?
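The counting rule from the previous two slides can be checked with a few lines of Python. This is a minimal sketch; the function name and its default arguments (six frames, four bases) are assumptions made for illustration.

```python
def markov_parameter_count(order: int, frames: int = 6, alphabet: int = 4) -> int:
    """Number of conditional probabilities for a k-th order model.

    Each of the alphabet**order contexts needs one probability per base,
    giving alphabet**(order + 1) values per reading frame.
    """
    return frames * alphabet ** (order + 1)


for k in (1, 2, 5):
    per_frame = markov_parameter_count(k, frames=1)
    total = markov_parameter_count(k)
    print(f"order {k}: {per_frame} per frame, {total} across six frames")
```

For order 2 this gives the 64 possibilities per frame mentioned above, and for order 5 it gives 4096 per frame and 24,576 in total.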

  12. Scan sequence and determine scores for each region based on probabilities • Scores above threshold declared genes because they are like previously identified genes • Implies a hidden relationship
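The train-then-scan idea can be sketched as follows. This is a simplified illustration, not the method of any specific program: it uses a single first-order model rather than six frame-specific ones, scores each window as a log-odds sum against a uniform background, and the window size, step, threshold, and pseudocount are all hypothetical values.

```python
import math

BASES = "ACGT"


def train_first_order(sequences, pseudocount=1.0):
    """Estimate P(current | previous) from known coding sequences."""
    counts = {p: {b: pseudocount for b in BASES} for p in BASES}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            if prev in counts and cur in counts[prev]:
                counts[prev][cur] += 1
    model = {}
    for p in BASES:
        total = sum(counts[p].values())
        model[p] = {b: counts[p][b] / total for b in BASES}
    return model


def log_odds_score(seq, model, background=0.25):
    """Sum of log2(P(cur | prev) / background) over the sequence."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        if prev in model and cur in model[prev]:
            score += math.log2(model[prev][cur] / background)
    return score


def scan(seq, model, window=12, step=3, threshold=0.0):
    """Return (start, score) for every window scoring above the threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        score = log_odds_score(seq[start:start + window], model)
        if score > threshold:
            hits.append((start, score))
    return hits
```

A window scores highly only when its base-to-base transitions resemble those in the training genes, which is the sense in which high-scoring regions "are like previously identified genes".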

  13. GLIMMER and GLIMMERM • Use interpolated Markov models • If oligomers are available, will score up to 8th order • M version works for small eukaryotes • Plasmodium falciparum • Arabidopsis thaliana • Adds information about splice sites
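The idea behind interpolation is to blend predictions from several model orders, trusting a longer context only as far as the training data supports it. The sketch below is a deliberately simplified version: the weight for each context is just its count divided by a hypothetical cutoff (`full_weight_at`), which is an assumption for illustration and not the statistical test GLIMMER itself applies.

```python
from collections import Counter

BASES = "ACGT"


def count_kmers(sequences, max_order):
    """Count every oligomer of length 1 .. max_order + 1 in the training set."""
    counts = Counter()
    for seq in sequences:
        for k in range(1, max_order + 2):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


def imm_probability(context, base, counts, full_weight_at=400):
    """Interpolated estimate of P(base | context), built up from order 0."""
    # Order 0: plain base frequency with a pseudocount of one per base.
    total = sum(counts[b] for b in BASES)
    prob = (counts[base] + 1) / (total + len(BASES))
    # Mix in each longer context, weighted by how often it was observed.
    for k in range(1, len(context) + 1):
        ctx = context[-k:]
        ctx_count = sum(counts[ctx + b] for b in BASES)
        if ctx_count == 0:
            break  # unseen context: keep the lower-order estimate
        p_k = counts[ctx + base] / ctx_count
        weight = min(1.0, ctx_count / full_weight_at)
        prob = weight * p_k + (1.0 - weight) * prob
    return prob
```

With this scheme a rarely seen oligomer contributes little, so a high-order model can be used where data are plentiful without producing unreliable estimates where they are sparse.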

  14. GENSCAN • Combines multiple approaches • 5th order Markov model • Staden weight matrix to model cis sites • Poly A site • TATA and INR • CAP site • Translation termination sites • Maximal Dependence Decomposition • Donor and acceptor splice site modeling
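A weight matrix scores a short signal site position by position, treating the positions as independent. The sketch below builds a log-odds matrix from aligned example sites and slides it along a sequence. The function names, the pseudocount, the uniform background, and the example sites in the usage note are all made up for illustration; they are not taken from GENSCAN or from real promoter data.

```python
import math

BASES = "ACGT"


def build_weight_matrix(aligned_sites, pseudocount=1.0, background=0.25):
    """One column of log2-odds scores per position of the aligned sites."""
    length = len(aligned_sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudocount * len(BASES)
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def score_site(candidate, matrix):
    """Sum the per-position scores for a candidate of the matrix width."""
    return sum(col[base] for base, col in zip(candidate, matrix))


def best_site(sequence, matrix):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    scored = [
        (score_site(sequence[i:i + width], matrix), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

For example, training on the invented sites `["TATAAA", "TATAAA", "TATATA", "TATAAG"]` and calling `best_site("GCGCTATAAAGCGC", matrix)` picks out the window starting at index 4. Maximal Dependence Decomposition, listed on the same slide, exists because splice donor sites violate the independence assumption this kind of matrix makes.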

  15. A Tale of Two Genomes • Private genome – Celera • Public genome – Rest of the world • DRAFT ONLY (except chromosomes 21 and 22)

  16. Celera • Otto-Refseq • Gold standard – full length, curated cDNAs • Otto-Homology • ESTs • cDNAs • Mouse-human genomic similarities • Known proteins

  17. Celera • De novo • GENSCAN • GRAIL • FGENESH • Manual curation

  18. Celera results • De novo prediction of 76,400 genes (58,000 appeared to be new) • 21,350 supported by some other evidence • Otto homologies identify 17,764 other genes • Total is 39,114

  19. Public • De novo using Ensembl • Merge predictions with predictions from GENIE • Merge results with known genes in databases • Eliminate bacterial sequences

  20. Public Integrated Gene Index (IGI)
