Short comparison GASP ‘99 - EGASP ‘05
Martin Reese (mreese@omicia.com)
Omicia Inc.
5980 Horton Street
Emeryville, CA 94602
The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster
Martin G. Reese (mgreese@lbl.gov)
Nomi L. Harris (nlharris@lbl.gov)
George Hartzell (hartzell@cs.berkeley.edu)
Suzanna E. Lewis (suzi@fruitfly.berkeley.edu)
Later added: Josep Abril
Drosophila Genome Center
Department of Molecular and Cell Biology
539 Life Sciences Addition
University of California, Berkeley
The genome annotation experiment “GASP” 1999
• Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA (EGASP: 44 separate regions)
• Open to everybody, announced on several mailing lists
• Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods
• “CASP”-like
• 12 participating groups (EGASP: at least 20 groups)
Goals of the experiment
• Compare and contrast various genome annotation methods
• Objective assessment of the state of the art in gene finding and functional site prediction
• Identify outstanding problems in computational methods for the annotation process
Adh contig
• 2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best-studied genomic regions
• From chromosome 2L (34D-36A)
• Ashburner et al. (to appear in Genetics)
• 222 gene annotations (as of July 22, 1999) (EGASP: ~450 genes)
• 375,585 bases are coding (12.95%) (EGASP: ENCODE region, 30 Mb)
• We chose the Adh region because it was thought to be typical: a representative test bed for evaluating annotation techniques
Adh paper (to appear in Genetics) URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
Submissions
• “MAGPIE” Team: T. Gaasterland et al.
• Computational Genomics Group, The Sanger Centre: V. Solovyev
• University of Erlangen: U. Ohler
• Genome Annotation Group, The Sanger Centre: E. Birney
• Oak Ridge National Laboratory “GRAIL”: R. Mural et al.
• CBS, Technical University of Denmark “HMMGene”: A. Krogh
• Georgia Institute of Technology “GeneMark.hmm”: M. Borodovsky
• IMIM, Spain “GeneID”: Roderic Guigó et al.
• Fred Hutchinson Cancer Center “BLOCKS”: Henikoff & Henikoff
• GSF, Neuherberg, Germany: M. Scherf
• Mount Sinai School of Medicine: Gary Benson
• UCB/UC Santa Cruz/Neomorphic “Genie”: M. Reese and D. Kulp
Measuring success
• By nucleotide
  • Sensitivity/Specificity (Sn/Sp)
• By exon
  • Sn/Sp
  • Missed exons (ME), wrong exons (WE)
• By gene
  • Sn/Sp
  • Missed genes (MG), wrong genes (WG)
  • Average overlap statistics
• Based on Burset and Guigó (1996), “Evaluation of gene structure prediction programs”. Genomics, 34(3), 353-367.
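As a minimal sketch of the nucleotide-level measures, the snippet below computes Sn and Sp from lists of coding intervals. It follows the gene-finding convention of Burset and Guigó, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP), so "specificity" here is the fraction of predicted coding bases that are truly coding. The function name, the 1-based inclusive coordinates, and the toy intervals are assumptions for illustration, not taken from the slides.

```python
def nucleotide_sn_sp(actual, predicted):
    """Nucleotide-level Sn/Sp for lists of (start, end) coding intervals.

    Coordinates are assumed to be 1-based and inclusive (an illustrative choice).
    """
    # Expand intervals into sets of coding positions (fine for small examples).
    actual_bases = {pos for start, end in actual for pos in range(start, end + 1)}
    predicted_bases = {pos for start, end in predicted for pos in range(start, end + 1)}

    tp = len(actual_bases & predicted_bases)   # coding bases predicted as coding
    fn = len(actual_bases - predicted_bases)   # coding bases that were missed
    fp = len(predicted_bases - actual_bases)   # non-coding bases predicted as coding

    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Hypothetical toy data: two annotated exons and two predicted exons.
actual = [(1, 100), (201, 300)]
predicted = [(11, 100), (201, 310)]
sn, sp = nucleotide_sn_sp(actual, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Expanding intervals into position sets keeps the sketch short; for megabase-scale regions such as the Adh contig, an interval-arithmetic version would be more memory-efficient.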
Definition: “Joined” and “split” genes

$$JG = \frac{\#\,\text{actual genes that overlap predicted genes}}{\#\,\text{predicted genes that overlap one or more actual genes}}$$

$$SG = \frac{\#\,\text{predicted genes that overlap actual genes}}{\#\,\text{actual genes that overlap one or more predicted genes}}$$

• JG > 1: tendency to join multiple actual genes into one prediction
• SG > 1: tendency to split actual genes into separate gene predictions
Inspired by Hayes and Guigó (1999), unpublished.
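The joined/split ratios can be sketched in a few lines. This example assumes that each numerator counts overlapping (actual, predicted) pairs, that is, the overlaps summed gene by gene; that is one reading of the slide's wording, chosen because it lets JG and SG exceed 1 independently. Under a literal count of distinct genes, SG would simply equal 1/JG. The function names and the gene coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual_genes, predicted_genes):
    """Return (JG, SG) for lists of (start, end) gene spans.

    Assumption: numerators count overlapping (actual, predicted) pairs.
    """
    pairs = sum(1 for a in actual_genes for p in predicted_genes if overlaps(a, p))
    predicted_hit = sum(
        1 for p in predicted_genes if any(overlaps(a, p) for a in actual_genes)
    )
    actual_hit = sum(
        1 for a in actual_genes if any(overlaps(a, p) for p in predicted_genes)
    )
    jg = pairs / predicted_hit if predicted_hit else None
    sg = pairs / actual_hit if actual_hit else None
    return jg, sg


# Hypothetical example: one prediction joins two actual genes,
# two predictions split a third, and one prediction hits nothing.
actual_genes = [(100, 500), (600, 900), (2000, 3000)]
predicted_genes = [(150, 850), (2000, 2400), (2600, 3000), (5000, 5500)]
jg, sg = joined_split(actual_genes, predicted_genes)
print(f"JG = {jg:.2f}, SG = {sg:.2f}")
```

In this toy case both ratios come out above 1, reflecting one join and one split in the same prediction set.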
Results: Base level
• Sensitivity
  • Low variability among predictors
  • ~95% coverage of the proteome
• Specificity
  • ~90%
• Programs that are more like Genscan (used for original annotation) might do better?
EGASP ‘05: Sn 93% “9_101_1”, Sp 92% “20_79_1”
Results: Exon level
• Higher variability among predictors
• Up to ~75% sensitivity (both exon boundaries correct)
• 55% specificity
  • Low specificity because partial exon overlaps do not count
• Missing exons below 5%
• Many wrong exons (~20%)
EGASP ‘05: Sn 89.8% “14_87_3”, Sp 88% “20_78_3”
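To make the exon-level scoring concrete, here is a small sketch in which an exon counts as correct only when both boundaries match exactly, while missed exons (ME) and wrong exons (WE) are those with no overlap at all on the other side, following the Burset and Guigó definitions. A partially overlapping prediction therefore hurts Sn and Sp without being counted as missed or wrong. The function names and coordinates are illustrative assumptions, and both input lists are assumed to be non-empty.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level(actual, predicted):
    """Return (Sn, Sp, ME, WE) for non-empty lists of (start, end) exons."""
    actual_set, predicted_set = set(actual), set(predicted)

    # Only exons with both boundaries correct count as true positives.
    exact = len(actual_set & predicted_set)
    sn = exact / len(actual_set)
    sp = exact / len(predicted_set)

    # Missed exons: actual exons not overlapped by any prediction.
    missed = sum(
        1 for a in actual_set if not any(overlaps(a, p) for p in predicted_set)
    )
    # Wrong exons: predicted exons not overlapping any actual exon.
    wrong = sum(
        1 for p in predicted_set if not any(overlaps(p, a) for a in actual_set)
    )
    return sn, sp, missed / len(actual_set), wrong / len(predicted_set)


# Hypothetical example: one exact match, one partial overlap, one wrong exon.
actual = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (900, 950)]
sn, sp, me, we = exon_level(actual, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}, ME = {me:.2f}, WE = {we:.2f}")
```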
Results: Gene level
• 60% of actual genes predicted completely correctly
• Specificity only 30-40%
• 5-10% missed genes (comparable to Sanger Centre)
• 40% wrong genes; many short genes overpredicted (possibly not annotated in Standard 3)
• Splitting genes is a bigger problem than joining genes
EGASP ‘05: Sn 71% “36_46_1”, Sp 66% “34_55_3”
Discussion
• Good predictive improvements
• “Expression” improves predictions
• “Gene finding” tools became “automatic annotation” tools
• Gene sensitivity/specificity at roughly 70% is excellent
• No correct answer/real gold standard (like CASP)
• Superb community
Open questions
• How many protein-coding genes/loci were missed?
• How many total human protein-coding loci are there? (Drosophila: <14,500)
• How many array-detected transcripts are there, and what is their function (coding or non-coding)?
• Can we get an exhaustive alternative splicing “gold standard”?