Proteogenomics: Refining and Improving Genome Annotation • Samuel H Payne • J Craig Venter Institute
State of Genome Annotation • Most prokaryotic genomes are auto-annotated. • Sequence and function are inferred with comparative genomics; validation is sparse. • Difficulties with novel or horizontally transferred (HGT) genes • Mature protein features • localization • post-translational modification (PTM), cleavage (Salzberg 2007)
Proteomics • Input: protein sample • Output: list of peptides
Proteogenomics • Definition: using proteomics data to do genome annotation • Goals: • Find all coding regions of the genome, annotated and unannotated • Submit improved annotation to NCBI • Identify “mature protein” features
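Finding coding regions "annotated and unannotated" implies searching peptides against the genome itself, not just against predicted proteins. A minimal sketch of that idea, assuming a six-frame translation with the standard genetic code and exact string matching; real pipelines score spectra against the translated database, and the function names here are illustrative, not taken from the talk:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(genome):
    """Return {(strand, offset): protein string} for all six reading frames."""
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def map_peptide(peptide, genome):
    """List (strand, frame offset, residue index) for every exact peptide match."""
    hits = []
    for (strand, offset), protein in six_frame(genome).items():
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, offset, pos))
            pos = protein.find(peptide, pos + 1)
    return hits
```

A peptide that maps only outside annotated genes is a candidate for a missed coding region; one that maps upstream of an annotated start in the same frame suggests a start site correction.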
Proteogenomics Protocol • Data sources • Yersinia pestis: Pieper et al., 2008, 2009 • Bacillus anthracis: PRC/NIAID
Correcting Errors • Unannotated genes • Both known and totally novel
Correcting Errors • Start site assignment
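One way peptide evidence can correct a start site: if a peptide maps in frame upstream of the annotated start, the true start must lie at or before the peptide. The sketch below is an illustration under stated assumptions, not the method used in the talk: forward strand only, 0-based nucleotide coordinates, and a hypothetical start codon set of ATG, GTG and TTG (non-canonical starts would be missed by this set).

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set; extend to test non-canonical starts


def upstream_start_candidates(genome, annotated_start, peptide_start):
    """Walk upstream from the annotated start, codon by codon, until a stop.

    Returns positions of start codons that lie at or before the peptide,
    nearest first, so each candidate would place the peptide inside the gene.
    """
    if (annotated_start - peptide_start) % 3 != 0:
        raise ValueError("peptide is not in frame with the annotated start")
    candidates = []
    pos = annotated_start - 3
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon in STOPS:
            break  # the open reading frame cannot extend past an in-frame stop
        if codon in STARTS and pos <= peptide_start:
            candidates.append(pos)
        pos -= 3
    return candidates
```

An empty result means no assumed start codon covers the peptide before the upstream stop, which is itself worth flagging for manual review.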
Exceptions to Rules • Multi-ORF genes: self-splicing, frameshift
Exceptions to Rules • Non-canonical start codons • infC – ATT (Sacerdot 1982, Payne 2010) in enterobacteria; ATA in Shewanella (Gupta 2007) • Deinococcus (Baudet 2009) suggests new non-standard starts
Pseudo?genes • 5 peptides (with splicing) map to a transposable element gene whose sequence aligns to an Arabidopsis Ulp1 (Castellana 2008) • Expression of an ABC transporter N-terminus that is missing critical motif elements
Signal Peptide • N-terminal motif, target protein for export • 1983 Perlman & Halvorson • Early basic residue, hydrophobic patch, AxB motif • A = [I,V,L,A,G,S], B = [A,G,S]
Profile of an Exported Protein • Early basic residue, hydrophobic patch, motif
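The profile on the two slides above (early basic residue, hydrophobic patch, AxB motif with A = [I,V,L,A,G,S] and B = [A,G,S]) can be turned into a rough scan. This is a sketch only: the window sizes, the hydrophobic alphabet and the patch threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# AxB motif from the slide; the lookahead lets overlapping motifs all be reported.
AXB = re.compile(r"(?=([IVLAGS].[AGS]))")
HYDROPHOBIC = set("AVLIMFW")  # assumed hydrophobic alphabet


def has_hydrophobic_patch(region, patch_len=7, min_hydrophobic=6):
    """True if any window of patch_len residues is mostly hydrophobic."""
    for i in range(len(region) - patch_len + 1):
        window = region[i:i + patch_len]
        if sum(r in HYDROPHOBIC for r in window) >= min_hydrophobic:
            return True
    return False


def signal_peptide_sites(protein, basic_window=8, max_len=40):
    """Return candidate signal peptide lengths (cleavage right after residue B).

    Requires a K or R in the first basic_window residues, then a hydrophobic
    patch between that residue and the AxB motif.
    """
    basic = [i for i, r in enumerate(protein[:basic_window]) if r in "KR"]
    if not basic:
        return []
    first_basic = basic[0]
    sites = []
    for m in AXB.finditer(protein[:max_len]):
        a_pos = m.start()
        if a_pos > first_basic and has_hydrophobic_patch(protein[first_basic + 1:a_pos]):
            sites.append(a_pos + 3)
    return sites
```

Because AxB is a short, permissive motif, the scan usually returns several candidate sites; an observed peptide whose N-terminus begins exactly at one of them is the kind of evidence that singles out the real cleavage site.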
Future • Rinse and repeat • 30 proteomes in 3 years • Stable, robust pipeline for general use • Hosted at TeraGrid
When Gene Predictors Fail • Are GC extremes difficult? • 50% GC (Y. pestis): 4 missed • 30s (B. anthracis, L. interrogans): 4 and 20 missed • 60s (D. vulgaris, D. radiodurans): 55 and 225 missed
Are They Strange? • Relative GC – does it fail on genes with different GC from others?
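The "relative GC" question compares each gene's GC fraction with that of the genome as a whole. A minimal sketch, assuming relative GC is defined as a simple difference (the talk does not give the exact formula) and ignoring ambiguous bases:

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        raise ValueError("no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def relative_gc(gene, genome):
    """Gene GC minus genome GC; positive means the gene is GC-richer."""
    return gc_content(gene) - gc_content(genome)
```

Comparing the distribution of `relative_gc` for missed genes against that for correctly predicted genes would show whether the predictor fails preferentially on compositionally atypical genes.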
We See What We Know • Proximity to Model Organism • Yersinia/Bacillus errors: 4/4 • ‘Remote species’ errors: 20, 55, >200
We See What We Know • Hypothetical vs. Named • Compare novel genes to observed proteome • Hypergeometric test, with the null probability taken from the observed proteome
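The hypergeometric comparison can be sketched as an upper-tail test. The mapping of symbols to the slide is an assumption: N is the number of proteins in the observed proteome, K the number of those labeled hypothetical, n the number of novel genes, and k the number of novel genes that are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items without replacement from N, K of them marked.

    math.comb returns 0 for impossible draws, so no extra bounds checks
    are needed inside the sum.
    """
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total
```

A small tail probability would indicate that novel genes are hypothetical more often than expected from the observed proteome, consistent with predictors favoring what is already known.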
Expressed Protein Resource • Protein Sequences • >30 M sequences • nr, uniprot • JCVI metagenomics • JGI genomes • 40,000 clusters • Cross referenced with proteomics, for validated proteins
Acknowledgements • Eli Venter • Shih-Ting Huang, Rembert Pieper • Granger Sutton • Dick Smith, PNNL • NSF