
Proteogenomics: Refining and Improving Genome Annotation

Proteogenomics: Refining and Improving Genome Annotation. Samuel H Payne, J Craig Venter Institute. State of Genome Annotation: most prokaryotic genomes are auto-annotated. Sequence and function are inferred with comparative genomics; validation is sparse.


Presentation Transcript


  1. Proteogenomics: Refining and Improving Genome Annotation • Samuel H Payne, J Craig Venter Institute

  2. State of Genome Annotation • Most prokaryotic genomes are auto-annotated. • Sequence and function are inferred with comparative genomics; validation is sparse. • Difficulties with novel or HGT genes • Mature protein features • localization • PTM, cleavage (Salzberg 2007)

  3. Diversity or Confusion

  4. Proteomics • Input: protein sample • Output: list of peptides

  5. Proteogenomics • Definition: using proteomics data to do genome annotation • Goals: • Find all coding regions of the genome, annotated and unannotated • Submit improved annotation to NCBI • Identify “mature protein” features
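
One common way to look for coding regions "annotated and unannotated" is to map identified peptides against a six-frame translation of the genome rather than against the annotated protein set. The sketch below is a minimal illustration of that idea, not the pipeline used in the talk: the function names (`translate`, `find_peptide`) are hypothetical, the standard genetic code is assumed, and exact string matching stands in for a real peptide search engine.

```python
# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from position 0; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_peptide(genome, peptide):
    """Locate a peptide in all six reading frames.

    Returns (strand, start) tuples, where start is the 0-based
    forward-strand coordinate of the leftmost nucleotide of the hit.
    """
    n = len(genome)
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            idx = protein.find(peptide)
            while idx != -1:
                start = frame + 3 * idx
                if strand == "-":
                    # Convert reverse-complement coordinates to forward ones.
                    start = n - (start + 3 * len(peptide))
                hits.append((strand, start))
                idx = protein.find(peptide, idx + 1)
    return hits
```

A hit that falls outside every annotated gene is a candidate unannotated coding region; a hit inside a gene but in a different frame or strand suggests an annotation error.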

  6. Proteogenomics Protocol • Data sources • Yersinia pestis - Pieper et al., 2008, 2009 • Bacillus anthracis - PRC/NIAID

  7. Correcting Errors • Unannotated genes • Both known and totally novel

  8. Correcting Errors • Unannotated genes • Both known and totally novel

  9. Correcting Errors • Start site assignment
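
A peptide that maps in frame but upstream of the annotated start is direct evidence that the start site is too far downstream. The following is a rough sketch of how such a correction could be proposed, assuming a forward-strand gene and the common bacterial start codons ATG, GTG and TTG. The function `revise_start` and its "nearest upstream start codon" rule are illustrative assumptions, not the method from the talk.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of common bacterial starts
STOP_CODONS = {"TAA", "TAG", "TGA"}


def revise_start(genome, annotated_start, peptide_start):
    """Propose a start site consistent with an observed peptide.

    Coordinates are 0-based positions on the forward strand.
    Returns the annotated start if the peptide already lies inside the
    gene, the nearest in-frame start codon at or upstream of the peptide
    otherwise, or None if a stop codon is reached first.
    """
    if (annotated_start - peptide_start) % 3 != 0:
        raise ValueError("peptide is not in frame with the annotated start")
    if peptide_start >= annotated_start:
        return annotated_start

    pos = peptide_start
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon in STOP_CODONS:
            return None  # open reading frame ends before any start codon
        if codon in START_CODONS:
            # Simplification: an internal GTG/TTG/ATG codon at the peptide's
            # first residue would also be accepted here.
            return pos
        pos -= 3
    return None
```

Choosing the nearest upstream start codon is the most conservative extension; a real pipeline would likely also weigh ribosome binding sites and N-terminal peptide evidence.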

  10. Exceptions to Rules • Multi-ORF genes: self splicing, frame shift

  11. Exceptions to Rules • Non-canonical start codons • infC – ATT (Sacerdot 1982, Payne 2010) in enterobacteria; ATA in Shewanella (Gupta 2007) • Deinococcus (Baudet 2009) suggests new non-standard starts

  12. Overlaps/Wrong Frames

  13. Pseudo?genes • 5 peptides (with splicing) map to a transposable element gene. Sequence aligns to an Arabidopsis Ulp1 (Castellana 2008) • Expression of an ABC transporter N-terminus that is missing critical motif elements.

  14. Signal Peptide • N-terminal motif, target protein for export • 1983 Perlman & Halvorson • Early basic residue, hydrophobic patch, AxB motif • A = [I,V,L,A,G,S], B = [A,G,S]
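
The three features listed on this slide can be turned into a simple screen. The sketch below checks for an early basic residue, a hydrophobic patch, and then the first AxB site after the patch, using the residue classes given above. The window sizes (`n_region`, `patch_len`, `max_site`), the hydrophobic residue set, and the function name `predicted_cleavage` are assumptions chosen for illustration; they are not taken from the talk or from Perlman & Halvorson.

```python
import re

# AxB motif from the slide: A = [I,V,L,A,G,S], x = any residue, B = [A,G,S].
# The lookahead lets overlapping candidate sites be found.
AXB = re.compile(r"(?=([IVLAGS].[AGS]))")
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue set


def predicted_cleavage(protein, n_region=8, patch_len=8, max_site=40):
    """Return the index of the first mature residue, or None.

    Requires (1) K or R within the first n_region residues,
    (2) a run of patch_len consecutive hydrophobic residues,
    (3) an AxB motif after that run and before max_site.
    """
    if not any(aa in "KR" for aa in protein[:n_region]):
        return None

    patch_end = None
    run = 0
    for i, aa in enumerate(protein[:max_site]):
        run = run + 1 if aa in HYDROPHOBIC else 0
        if run >= patch_len:
            patch_end = i + 1
            break
    if patch_end is None:
        return None

    match = next(AXB.finditer(protein, patch_end, max_site), None)
    if match is None:
        return None
    return match.start() + 3  # cleavage occurs immediately after B
```

In a proteogenomic setting the prediction can be checked against data: a peptide whose first residue sits exactly at the returned index, with no tryptic site before it, supports the cleavage.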

  15. Profile of an Exported Protein • Early basic residue, hydrophobic patch, motif

  16. Future • Rinse and repeat • 30 proteomes in 3 years • Stable, robust pipeline for general use • Hosted at TeraGrid

  17. When Gene Predictors Fail • Are GC extremes difficult? • 50% (Y. pestis) – 4 missed • 30s (B. anthracis, L. interrogans) – 4, 20 missed • 60s (D. vulgaris, D. radiodurans) – 55, 225 missed

  18. Are They Strange? • Relative GC – does it fail on genes with different GC from others?
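
Both questions, genome-wide GC extremes and per-gene relative GC, reduce to the same small calculation. A minimal sketch follows; the function names are hypothetical, and "relative GC" is assumed here to mean the difference between a gene's GC fraction and the genome's.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def relative_gc(gene_seq, genome_seq):
    """Gene GC minus genome GC; positive means the gene is GC-richer."""
    return gc_fraction(gene_seq) - gc_fraction(genome_seq)
```

Comparing the distribution of `relative_gc` for missed genes against that for all annotated genes would show whether the predictor fails preferentially on compositionally atypical genes, such as recent horizontal transfers.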

  19. Are They All Short?

  20. We See What We Know • Proximity to Model Organism • Yersinia/Bacillus errors: 4/4 • ‘Remote species’ errors: 20, 55, >200

  21. We See What We Know • Hypothetical vs. Named • Compare novel genes to observed proteome • Hypergeometric test, with the null probability taken from the observed proteome
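
One reading of this test: if N proteins are observed in the proteome and K of them are annotated as hypothetical, then drawing the n novel genes at random from that background gives a hypergeometric distribution for the number k that are hypothetical. The sketch below computes the upper-tail probability under that assumed setup. The function name and the example numbers in the comment are illustrative only, not data from the talk.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: proteins in the observed proteome (population size)
    K: observed proteins annotated as hypothetical (successes in population)
    n: novel genes found by proteogenomics (sample size)
    k: novel genes that are hypothetical (successes in sample)
    """
    upper = min(K, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


# Example with made-up numbers: 10 observed proteins, 4 hypothetical,
# 3 novel genes of which 2 are hypothetical.
p_value = hypergeom_upper_tail(2, N=10, K=4, n=3)
```

A small p-value would indicate that novel genes are hypothetical more often than the observed proteome predicts, consistent with the slide's point that annotation favours what is already known.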

  22. Expressed Protein Resource • Protein Sequences • >30 M sequences • nr, uniprot • JCVI metagenomics • JGI genomes • 40,000 clusters • Cross referenced with proteomics, for validated proteins

  23. Acknowledgements • Eli Venter • Shih-Ting Huang, Rembert Pieper • Granger Sutton • Dick Smith, PNNL • NSF
