
Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling

Yiming Yang (1,2), Abhay Harpale (1) and Subramanian Ganaphathy (1). 1: Language Technologies Institute; 2: Machine Learning Department; School of Computer Science, Carnegie Mellon University.

Presentation Transcript


  1. Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling
     Yiming Yang (1,2), Abhay Harpale (1) and Subramanian Ganaphathy (1)
     1: Language Technologies Institute
     2: Machine Learning Department
     School of Computer Science, Carnegie Mellon University

  2. Outline
     • Motivation & Background
     • Two probabilistic approaches
     • Experiments
     @Yiming Yang, ECML 2009, Sept 8

  3. Motivation
     • Proteins are important bio-markers for diseases, drug toxicity, therapeutic outcomes, etc.
     • Statistical approaches have been developed for protein identification in computational proteomics
     • Interdisciplinary research comparing these solutions with methods that have succeeded on similar problems in information retrieval (IR) has been rare
     • We address this research gap by
       • Analyzing a major limitation of popular approaches in protein ID
       • Proposing a new solution (Language Modeling for IR)

  4. The Protein ID Problem
     • Tandem mass (MS-MS) spectra are produced by applying a chemical process to an input sample (e.g., blood)
     • A sample typically consists of multiple proteins.
     • The process segments each protein into many (hundreds of) pieces, called peptides.
     • Peptides are further decomposed into ionized segments.
     • The MS-MS spectrum of a peptide is a series of spikes.
     • Each spike marks the mass/charge (m/z) ratio of an ionized segment of the peptide.

  5. The Protein ID Problem (cont’d)
     • Protein identification requires a mapping from empirical (MS-MS) spectra to protein sequences in a DB
     • There are many protein sequence databases
       • SwissProt, for example, contains 280,000+ sequences
     • Each protein is defined as a sequence of amino-acid letters
     • Peptides in each protein are specified using cleaving rules
     • Each peptide has an amino-acid sequence and a corresponding theoretical (“expected”) spectrum
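
The "cleaving rules" on slide 5 can be illustrated with a minimal in-silico digestion sketch. The slide does not say which rule was used; the code below assumes the common trypsin rule (cleave after K or R unless the next residue is P), and the function name `tryptic_peptides` and the toy sequence are hypothetical.

```python
def tryptic_peptides(protein):
    """Split an amino-acid string into peptides.

    Assumed rule (not stated on the slide): trypsin cleaves after
    K or R, except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep whatever remains after the last cleavage site.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(tryptic_peptides("AKRPGKM"))
```

Real pipelines usually also allow missed cleavages and filter peptides by length or mass; those refinements are omitted here.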

  6. [Diagram: "Theoretical Spectra of peptides in a DB" and "Empirical Spectra of peptides in a sample", connected by "Mapping" and "Matching"]
     • Fourier Transformation
     • Probabilistic Models
     • Heuristic Rules

  7. [Diagram labels: Theoretical Spectra; Empirical Spectra; Mapping; Matching; Words in L1; Words in L2; Matched Words (in L2); Doc Retrieval; Matched Documents (in L2)]

  8. Outline
     • Motivation & Background
     • Two probabilistic approaches
     • Experiments

  9. A Popular Approach in Protein ID (ProteinProphet by Nesvizhskii et al., 2003)
     • Given the peptides predicted from the MS-MS spectra, the probability of each candidate protein is estimated by combining the peptide-level probabilities
       -- estimates the probability for a Boolean OR logic
       -- typically produces many false positives
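
The formula on slide 9 did not survive in the transcript. A minimal sketch of a probabilistic Boolean OR, assuming independent peptide-level evidence, is the noisy-OR form 1 - prod(1 - p_i): the protein is "present" if at least one of its peptides is correctly identified. The function name `prob_or` and the example probabilities are hypothetical, and ProteinProphet's additional adjustments are not modeled.

```python
import math


def prob_or(peptide_probs):
    """Probability that at least one peptide is correct,
    assuming the peptide identifications are independent."""
    return 1.0 - math.prod(1.0 - p for p in peptide_probs)


if __name__ == "__main__":
    # One confident peptide versus ten weak ones (made-up numbers).
    print(prob_or([0.9]))
    print(prob_or([0.2] * 10))
```

Ten weak peptides at 0.2 each already push the protein score to about 0.89, which illustrates why an OR-style combination tends to produce false positives for proteins with many candidate peptides.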

  10. A Popular Approach in IR
     • Language Models (Ponte 1998; Lafferty & Zhai, 2001; …)
     • Query (q) is represented using a bag of words
     • Document (d) is represented using a bag of words
     • Documents are ranked by the KL-divergence of the two word distributions (θq and θd)
       -- its cross-entropy term, H(θq || θd), is a “soft” measure for the Boolean AND logic
       -- its remaining term, the entropy of θq, does not affect doc ranking
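
The equation on slide 10 is missing from the transcript. The standard decomposition of the KL-divergence, written here with the slide's symbols θq and θd and with p(w | θ) assumed to denote the probability of word w under a model, shows why only the cross-entropy term matters for ranking:

```latex
\begin{align*}
\mathrm{KL}(\theta_q \,\|\, \theta_d)
  &= \sum_{w} p(w \mid \theta_q)\,\log\frac{p(w \mid \theta_q)}{p(w \mid \theta_d)} \\
  &= \underbrace{-\sum_{w} p(w \mid \theta_q)\,\log p(w \mid \theta_d)}_{\text{cross entropy } H(\theta_q \,\|\, \theta_d)}
     \;-\; \underbrace{\Bigl(-\sum_{w} p(w \mid \theta_q)\,\log p(w \mid \theta_q)\Bigr)}_{\text{entropy of } \theta_q,\ \text{identical for every document}}
\end{align*}
```

Because the second term depends only on the query, ranking documents by increasing KL-divergence is the same as ranking them by increasing cross entropy. Any query word with a small p(w | θd) contributes a large penalty, which is what makes the score behave like a soft AND over the query words.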

  11. LM for Protein ID
     • Query language model for predicted peptides
     • Document language model for each protein sequence
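
The model definitions on slide 11 are not preserved in the transcript, so the following is only a sketch of how a query model and per-protein document models could be combined. It assumes peptides play the role of words, the query model normalizes the peptide scores, and each protein model is smoothed with a background model by linear interpolation (Jelinek-Mercer smoothing). All names, the smoothing choice, and the toy data are hypothetical rather than the authors' settings.

```python
import math
from collections import Counter


def query_model(peptide_scores):
    """Normalize predicted-peptide scores into a distribution theta_q."""
    total = sum(peptide_scores.values())
    return {pep: score / total for pep, score in peptide_scores.items()}


def document_model(protein_peptides, background, lam=0.5):
    """Return p(peptide | theta_d), smoothed with a background model."""
    counts = Counter(protein_peptides)
    n = sum(counts.values())

    def prob(pep):
        p_doc = counts[pep] / n if n else 0.0
        # Small floor keeps the logarithm finite for unseen peptides.
        return (1.0 - lam) * p_doc + lam * background.get(pep, 1e-9)

    return prob


def rank_proteins(peptide_scores, proteins, lam=0.5):
    """Rank proteins by negative cross entropy (higher is better)."""
    all_peptides = [p for peps in proteins.values() for p in peps]
    bg_counts = Counter(all_peptides)
    background = {p: c / len(all_peptides) for p, c in bg_counts.items()}
    theta_q = query_model(peptide_scores)
    scores = {}
    for name, peps in proteins.items():
        theta_d = document_model(peps, background, lam)
        scores[name] = sum(pq * math.log(theta_d(pep))
                           for pep, pq in theta_q.items())
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    proteins = {"P1": ["AK", "GR", "LK"], "P2": ["AK", "MR"], "P3": ["TTK"]}
    query = {"AK": 0.9, "GR": 0.8, "LK": 0.7}
    print(rank_proteins(query, proteins))
```

In this toy setup the protein that contains all three query peptides ranks first, while a protein that matches only one of them is penalized for the two it lacks, in contrast to an OR-style combination.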

  12. Outline
     • Motivation & Background
     • Two probabilistic approaches
     • Experiments

  13. Data Sets
     • PPK (Purvine et al., 2003)
       • 2995 empirical spectra from a mixture of 35 proteins
       • 4535 protein sequences (325,812 unique peptides)
     • Mark12
       • 9380 empirical spectra from a mixture of 12 proteins
       • 50,012 protein sequences (5,149,302 unique peptides) randomly sampled from the SwissProt database
     • Sigma49
       • 12,498 empirical spectra from a mixture of 49 proteins
       • 50,049 protein sequences (2,571,642 unique peptides) randomly sampled from the SwissProt database

  14. Systems
     • Prob-AND
       • Our proposed method
     • Prob-OR
       • Nesvizhskii’s method, our own implementation
     • Conventional Vector Space Model (TFIDF-cosine)
       • Supported by the Lemur search engine (Callan, 2002)
     • X!Tandem
       • Popular software (available online) for protein/peptide ID
     • All systems except X!Tandem used SEQUEST to predict a set of peptides (as the “query”).
     • Each system produces a ranked list of proteins per query.

  15. Metrics
     • Mean Average Precision (MAP)
       • Standard metric in IR for evaluating ranked lists
     • Evaluate each ranked list from the top to each position where a true positive document is retrieved
       • Recall = TP/(TP + FN)
       • Precision = TP/(TP + FP)
       • TP = # of true positives, TN = # of true negatives
       • FP = # of false positives, FN = # of false negatives
     • Average the precision scores in recall intervals among 0%, 10%, 20%, …, 100% (“11-pt AVGP”)
     • Compute the mean of AVGP across all intervals and for all queries
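
The metric on slide 15 can be sketched as follows. The slide does not say how precision is assigned to each recall level; this sketch assumes the usual IR convention of interpolated precision (the highest precision observed at any recall at or above the level). The function names and toy ranked list are hypothetical.

```python
def eleven_point_avgp(ranked, relevant):
    """11-point interpolated average precision for one ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    # (recall, precision) at each rank where a true positive appears.
    points = []
    true_positives = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            true_positives += 1
            points.append((true_positives / len(relevant),
                           true_positives / rank))
    levels = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


def mean_avgp(runs):
    """Mean of 11-pt AVGP over (ranked_list, relevant_set) pairs."""
    return sum(eleven_point_avgp(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy query: two true proteins, retrieved at ranks 1 and 3.
    print(eleven_point_avgp(["a", "x", "b"], {"a", "b"}))
```

A true positive that is never retrieved contributes zero precision at the recall levels it would have covered, so the metric penalizes low recall as well as low precision.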

  16. Main Results

  17. Statistical Significance Tests on Proportions

  18. Summary
     • The first interdisciplinary investigation/evaluation of state-of-the-art IR methods (LM and VSM) in protein identification
     • Prob-AND (LM) is a better criterion than Prob-OR for combining peptide-level evidence, improving precision significantly in the high-recall regions.
     • Understanding the nature of proteomic data/problems is hard for researchers with different backgrounds (IR or ML), but the outcome is and will be rewarding.

  19. Future Research
     • Finding the “best” protein mixture (Arnold et al., PSB 2007)
     • Instead of predicting each protein independently
     • Reduces to solving the minimum set cover problem (NP-hard)
     • Reformulated as finding the most likely protein mixture (Li et al., 2008)
     • Greedy approximation strategies
     • Using Gibbs sampling (local maxima, efficiency issues)
     • Better results than ProteinProphet (Prob-OR) on Sigma49
     • Comparative evaluation (with LM, VSM, etc.) would be informative
     • Scalability for high-recall predictions from very large protein databases?
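
Slide 19 mentions the minimum set cover problem and greedy approximation strategies. As a rough illustration only, here is the textbook greedy heuristic for set cover applied to proteins and observed peptides. It is not the algorithm of Arnold et al. or Li et al., and the function name and toy data are hypothetical.

```python
def greedy_protein_cover(observed_peptides, protein_peptides):
    """Pick proteins greedily until the observed peptides are covered.

    Returns the chosen proteins (in selection order) and any peptides
    that no protein in the database can explain.
    """
    uncovered = set(observed_peptides)
    chosen = []
    while uncovered:
        # Protein explaining the most still-uncovered peptides.
        best = max(protein_peptides,
                   key=lambda name: len(uncovered & set(protein_peptides[name])))
        gain = uncovered & set(protein_peptides[best])
        if not gain:
            break  # remaining peptides match no protein
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


if __name__ == "__main__":
    database = {"P1": {"a", "b", "c"}, "P2": {"c", "d"}, "P3": {"a"}}
    print(greedy_protein_cover({"a", "b", "c", "d"}, database))
```

The greedy choice does not guarantee a minimum cover, only an approximation, which is the usual trade-off for an NP-hard problem.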

  20. Thanks!
