Improving the Sensitivity of Peptide Identification for Genome Annotation. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.
Why Tandem Mass Spectrometry? • MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. • Key concepts: • Spectrum acquisition is unbiased • Direct observation of amino-acid sequence • Sensitive to small sequence variations
Mass Spectrometer • [Diagram: Sample → Ionizer → Mass Analyzer → Detector] • Ionizer: MALDI, Electrospray Ionization (ESI) • Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion Trap • Detector: Electron Multiplier (EM)
Sample Preparation for MS/MS • Enzymatic digest and fractionation
Tandem Mass Spectrometry (MS/MS) • Precursor selection • Precursor selection + collision-induced dissociation (CID) → MS/MS
Unannotated Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications
Splice Isoform Anomaly • Human erythroleukemia K562 cell-line • Depth of coverage study • Resing et al. Anal. Chem. 2004. • Peptide Atlas A8_IP • SULT1A2 gene: • Sulfotransferase family, cytosolic, 1A • 2 ESTs, 1 mRNA • mRNA from lung, small-cell carcinoma sample • Single (significant) peptide identification • Five agreeing search engines • PepArML FDR < 1%. • All source engines have non-significant E-values
Translation start-site correction • Halobacterium sp. NRC-1 • Extremely halophilic archaeon, insoluble membrane and soluble cytoplasmic proteins • Goo, et al. MCP 2003. • GdhA1 gene: • Glutamate dehydrogenase A1 • Multiple significant peptide identifications • Observed start is consistent with Glimmer 3.0 prediction(s)
Halobacterium sp. NRC-1 ORF: GdhA1 • K-score E-value vs PepArML @ 10% FDR • Many peptides inconsistent with annotated translation start site of NP_279651
Phyloproteomics • Tandem mass-spectra of proteins (top-down) • High-accuracy instrument (Orbitrap, UMD Core) • Proteins from unsequenced bacteria matching identical proteins in related organisms • Demonstration using Y. rohdei.
Phyloproteomics • [Figure: phylogenetic trees built from protein sequence and from 16S-rRNA sequence, using phylogeny.fr "One-Click" mode]
Phyloproteomics • Recent extension to highly homologous proteins in related organisms • Merely require N- and/or C-terminus in common • Broadens applicability considerably • Phyloproteomic trees for E. herbicola and Enterobacter cloacae, neither sequenced. • New paradigm for phylogenetic analysis?
Lost peptide identifications • Missing from the sequence database • Search engine strengths, weaknesses, quirks • Poor score or statistical significance • Thorough search takes too long
Searching under the street-light… • Tandem mass spectrometry doesn’t discriminate against novel peptides... ...but protein sequence databases do! • Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!
Peptide Sequence Databases • All amino-acid 30-mers, no redundancy • From ESTs, Proteins, mRNAs • 30-40 fold size, search time reduction • Formatted as a FASTA sequence database • One entry per gene/cluster.
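The "all 30-mers, no redundancy" idea can be illustrated with a minimal Python sketch. It only shows the redundancy-elimination step: overlapping evidence sequences (ESTs, mRNAs, proteins) for one gene/cluster contribute each distinct k-mer once. The function name `unique_kmers` and the toy sequences are hypothetical, and the sketch does not attempt to reproduce how the actual database packs the k-mers into FASTA entries.

```python
def unique_kmers(sequences, k=30):
    """Collect every distinct length-k amino-acid substring.

    `sequences` is any iterable of amino-acid strings, e.g. the translated
    evidence for one gene/cluster. Sequences shorter than k contribute nothing.
    """
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


if __name__ == "__main__":
    # Toy example with k=5: two overlapping fragments share three 5-mers.
    evidence = ["ABCDEFG", "BCDEFGH"]
    total = sum(max(len(s) - 5 + 1, 0) for s in evidence)
    distinct = unique_kmers(evidence, k=5)
    print(f"{total} k-mer occurrences, {len(distinct)} distinct")
```

Because heavily overlapping ESTs repeat the same k-mers many times, the distinct set is typically far smaller than the raw evidence, which is the source of the size and search-time reduction claimed on the slide.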
We can observe evidence for… • Known coding SNPs • Unannotated coding mutations • Alternate splicing isoforms • Alternate/Incorrect translation start-sites • Microexons • Alternate/Incorrect translation frames …though it must be treated thoughtfully.
PeptideMapper Web Service I’m Feeling Lucky
PeptideMapper Web Service • Suffix-tree index on peptide sequence database • Fast peptide to gene/cluster mapping • “Compression” makes this feasible • Peptide alignment with cluster evidence • Amino-acid or nucleotide; exact & near-exact • Genomic-loci mapping via • UCSC “known-gene” transcripts, and • Predetermined, embedded genomic coordinates
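As a rough illustration of fast peptide-to-gene/cluster mapping, the Python sketch below indexes a small sequence database and looks up exact peptide matches by binary search. It uses a naively built suffix array rather than a suffix tree, which is a simpler stand-in and not the PeptideMapper implementation; the names `build_index` and `find_clusters`, the `$` separator, and the toy sequences are all assumptions made for the example.

```python
from bisect import bisect_right


def build_index(entries):
    """Index a dict of cluster_id -> amino-acid sequence.

    Sequences are concatenated with '$' separators (a character that never
    occurs in a peptide), and all suffixes are sorted. The naive sort is fine
    for a toy example but far too slow for a real database.
    """
    ids = list(entries)
    starts, parts, offset = [], [], 0
    for cid in ids:
        starts.append(offset)
        parts.append(entries[cid] + "$")
        offset += len(entries[cid]) + 1
    text = "".join(parts)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, starts, ids


def find_clusters(peptide, index):
    """Return the set of cluster ids whose sequence contains `peptide` exactly."""
    text, sa, starts, ids = index
    m = len(peptide)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose length-m prefix is >= peptide.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < peptide:
            lo = mid + 1
        else:
            hi = mid
    hits = set()
    # All matching suffixes are contiguous in the sorted order.
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == peptide:
        entry = bisect_right(starts, sa[lo]) - 1  # entry that owns this position
        hits.add(ids[entry])
        lo += 1
    return hits


if __name__ == "__main__":
    index = build_index({"geneA": "MKTAYIAKQR", "geneB": "GGTAYIAKLL"})
    print(find_clusters("TAYIAK", index))
```

Each lookup costs a number of string comparisons logarithmic in the database size, plus one step per hit; near-exact matching and genomic-loci mapping are outside the scope of this sketch.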
Comparison of search engine results • No single score is comprehensive • Search engines disagree • Many spectra lack confident peptide assignment • [Venn diagram: overlap of SEQUEST, Mascot, and X! Tandem identifications; region labels 28%, 14%, 14%, 38%, 1%, 3%, 2%] Searle et al. JPR 7(1), 2008
Combining search engine results – harder than it looks! • Consensus boosts confidence, but... • How to assess statistical significance? • Gain specificity, but lose sensitivity! • Incorrect identifications are correlated too! • How to handle weak identifications? • Consensus vs disagreement vs abstention • Threshold at some significance? • We apply unsupervised machine-learning.... • Lots of related work unified in a single framework.
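The consensus / disagreement / abstention distinction can be made concrete with a small Python sketch that tallies, for one spectrum, how many engines agree on the most-reported peptide, how many report something else, and how many report nothing. This is not the PepArML algorithm; it only shows the kind of per-spectrum summary that a combiner might start from. The function name `tally_votes` and the example peptides are hypothetical.

```python
from collections import Counter


def tally_votes(assignments):
    """Summarise one spectrum's identifications across search engines.

    `assignments` maps engine name -> peptide string, or None when the engine
    reported no identification (abstention).
    Returns (top_peptide, n_agree, n_disagree, n_abstain).
    """
    votes = Counter(p for p in assignments.values() if p is not None)
    n_abstain = sum(1 for p in assignments.values() if p is None)
    if not votes:
        return None, 0, 0, n_abstain
    top_peptide, n_agree = votes.most_common(1)[0]
    n_disagree = sum(votes.values()) - n_agree
    return top_peptide, n_agree, n_disagree, n_abstain


if __name__ == "__main__":
    spectrum = {
        "Mascot": "PEPTIDEK",
        "X!Tandem": "PEPTIDEK",
        "OMSSA": "PEPTLDEK",
        "MyriMatch": None,
    }
    print(tally_votes(spectrum))
```

A plain vote count like this ignores the point made above that incorrect identifications are correlated across engines, which is one reason simple consensus thresholds are hard to calibrate.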
Running many search engines Search engine configuration can be difficult: • Correct spectral format • Search parameter files and command-line • Pre-processed sequence databases. • Tracking spectrum identifiers • Extracting peptide identifications, especially modifications and protein identifiers
Peptide Identification Meta-Search • Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch • Automatic decoy searches • Automatic spectrum-file "chunking" • Automatic scheduling • Serial, Multi-Processor, Cluster, Grid
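Decoy searches are commonly used to estimate a false discovery rate (FDR) for a score threshold. The Python sketch below shows one widely used variant, assuming reversed-sequence decoys and the simple estimate (decoy hits) / (target hits) above the threshold. The slides do not say which decoy construction or FDR formula the meta-search uses, so both are assumptions here, as are the `DECOY_` prefix and the function names.

```python
def reversed_decoys(proteins, prefix="DECOY_"):
    """Build a decoy database by reversing each protein sequence.

    `proteins` maps accession -> sequence; the prefix marks decoy accessions.
    """
    return {prefix + name: seq[::-1] for name, seq in proteins.items()}


def estimate_fdr(psms, threshold):
    """Estimate FDR among target identifications scoring >= threshold.

    `psms` is a list of (score, is_decoy) pairs, one per spectrum, where a
    higher score means a better match. Returns decoys / targets, or 0.0 when
    no target passes the threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    print(reversed_decoys({"P1": "MKTAY"}))
    psms = [(50, False), (45, False), (40, True),
            (35, False), (30, False), (20, True)]
    for cutoff in (45, 30):
        print(cutoff, estimate_fdr(psms, cutoff))
```

Lowering the threshold admits more target hits but also more decoy hits, so the estimated FDR rises; a meta-search tool can run the decoy search automatically so that every engine's results carry a comparable estimate.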
PepArML Meta-Search Engine • [Diagram] Single, simple search request • Heterogeneous compute resources: Edwards Lab (scheduler & 48+ CPUs), NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs) • Engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch • Secure communication • Scales easily to 250+ simultaneous searches
PepArML Meta-Search Engine • [Diagram] Single, simple search request • Heterogeneous compute resources: Edwards Lab (scheduler & 80+ CPUs), NSF TeraGrid (1000+ CPUs) • Engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch • Secure communication • Scales easily to 250+ simultaneous searches
PepArML Meta-Search Engine • [Diagram] Simple search request • Heterogeneous compute resources: Edwards Lab (scheduler & 48+ CPUs), NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs) • Secure communication
Peptide Identification Grid-Enabled Meta-Search • Access to high-performance computing resources for the proteomics community • NSF TeraGrid Community Portal • University/Institute HPC clusters • Individual lab compute resources • Contribute cycles to the community and get access to others’ cycles in return. • Centralized scheduler • Compute capacity can still be exclusive, or prioritized. • Compute client plays well with HPC grid schedulers.
PepArML Meta-Search Engine • [Diagram] Compute resources: Edwards Lab (scheduler & 80+ CPUs), NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs) • Engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch; X!Tandem, KScore, OMSSA
PepArML Meta-Search Engine • [Diagram] Compute resources: Edwards Lab (scheduler & 80+ CPUs), NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), UMD Proteomics Core (scheduler & 2 CPUs) • Engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch; X!Tandem, KScore, OMSSA
Conclusions • Improve the scope and sensitivity of peptide identification for genome annotation, using: • Exhaustive peptide sequence databases • Machine learning for combining search engine results • Meta-search tools to maximize consensus • Grid computing for thorough search • http://edwardslab.bmcb.georgetown.edu
Acknowledgements • Dr. Catherine Fenselau & students • University of Maryland Biochemistry • Dr. Yan Wang • University of Maryland Proteomics Core • Dr. Art Delcher • University of Maryland CBCB • Dr. Chau-Wen Tseng & Dr. Xue Wu • University of Maryland Computer Science • Funding: NIH/NCI