Probabilistic Language Modeling for Protein Identification from Tandem Mass Spectra

Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling
Yiming Yang (1,2), Abhay Harpale (1) and Subramanian Ganaphathy (1)
1: Language Technologies Institute; 2: Machine Learning Department
School of Computer Science, Carnegie Mellon University

Outline
• Motivation & Background
• Two probabilistic approaches
• Experiments
@Yiming Yang, ECML 2009, Sept 8

Motivation
• Proteins are important biomarkers for diseases, drug toxicity, therapeutic outcomes, etc.
• Statistical approaches have been developed for protein identification in computational proteomics.
• Interdisciplinary research comparing current solutions with successful methods for similar problems in IR (information retrieval) has been rare.
• We address this research gap by
  • Analyzing a major limitation of popular approaches in protein ID
  • Proposing a new solution (Language Modeling for IR)

The Protein ID Problem
• Tandem mass (MS-MS) spectra are produced by applying a chemical process to an input sample (e.g., blood).
• A sample typically consists of multiple proteins.
• The process segments each protein into many (hundreds of) pieces, called peptides.
• Peptides are further decomposed into ionized segments.
• The MS-MS spectrum of a peptide is a series of spikes.
• Each spike corresponds to the mass/charge (m/z) ratio of an ionized segment of the peptide.

The Protein ID Problem (cont’d)
• Protein identification requires a mapping from empirical (MS-MS) spectra to protein sequences in a DB
• There are many protein sequence databases
  • SwissProt, for example, contains 280,000+ sequences
• Each protein is defined as a sequence of amino-acid letters
• Peptides in each protein are specified using cleaving rules
• Each peptide has an amino-acid sequence and a corresponding theoretical (“expected”) spectrum
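To make "cleaving rules" concrete, here is a minimal Python sketch of in-silico digestion. The slides do not say which rule was used; the sketch assumes the common trypsin rule (cleave after K or R unless the next residue is P). The function name `digest` and the toy sequence are hypothetical.

```python
def digest(protein, min_len=1):
    """Split a protein sequence into peptides using the trypsin rule (an assumption)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        # Cleave after K or R unless the following residue is P.
        cleave = residue in "KR" and not at_end and protein[i + 1] != "P"
        if cleave or at_end:
            peptide = protein[start:i + 1]
            if len(peptide) >= min_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


toy_protein = "MKTAYIAKQRPLVKG"  # hypothetical sequence, not taken from any database
print(digest(toy_protein))
```

With `min_len=1` the peptides concatenate back to the original sequence. Real pipelines usually also allow missed cleavages and filter by peptide length or mass, which this sketch leaves out.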

[Figure: theoretical spectra of peptides in a DB and empirical spectra of peptides in a sample, connected by the steps labeled "Mapping" and "Matching". Techniques listed:]
• Fourier Transformation
• Probabilistic Models
• Heuristic Rules

[Figure: the spectra pipeline (Theoretical Spectra, Empirical Spectra, Mapping, Matching) shown alongside a retrieval analogy: Words in L1, Words in L2, Matched Words (in L2), Doc Retrieval, Matched Documents (in L2)]

A Popular Approach in Protein ID (ProteinProphet by Nesvizhskii et al., 2003)
• Given the predicted peptides based on MS-MS spectra, a probability is estimated for each candidate protein
  • The estimate is the probability for a Boolean OR logic over the predicted peptides
  • It typically produces many false positives
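A Boolean OR over independent peptide-level probabilities is commonly written as P(protein) = 1 − ∏ (1 − p_i). The Python sketch below assumes that form and ignores any further adjustments ProteinProphet applies; the function name and the probabilities are hypothetical.

```python
import math


def prob_or(peptide_probs):
    """Probability that at least one peptide is correct, assuming independence."""
    return 1.0 - math.prod(1.0 - p for p in peptide_probs)


# Hypothetical peptide-level probabilities for two candidate proteins.
one_strong_hit = [0.85]
many_weak_hits = [0.3] * 5

print(prob_or(one_strong_hit))
print(prob_or(many_weak_hits))
```

Five weak peptide matches at 0.3 each already push the protein score to roughly 0.83. That is one way to see why an OR-style combination can rank many false positives highly.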

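A Popular Approach in IR
• Language Models (Ponte 1998; Lafferty & Zhai, 2001; …)
• Query (q) is represented using a bag of words
• Document (d) is represented using a bag of words
• KL-divergence of the two word distributions ($\theta_q$ and $\theta_d$) is
  $$KL(\theta_q \,\|\, \theta_d) = \sum_w P(w \mid \theta_q) \log P(w \mid \theta_q) \;-\; \sum_w P(w \mid \theta_q) \log P(w \mid \theta_d)$$
  • The first term depends only on the query, so it does not affect doc ranking
  • The second term is the cross entropy $H(\theta_q \,\|\, \theta_d)$, a “soft” measure for the Boolean AND logic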
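Ranking by cross entropy can be sketched in a few lines of Python. The smoothing scheme (linear interpolation with a collection model, weight `lam`), the probability floor, and the toy documents are all assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over a bag of words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cross_entropy(query_model, doc_model, background, lam=0.5):
    """H(q || d), with the document model smoothed by a background model."""
    h = 0.0
    for w, p_q in query_model.items():
        # Floor of 1e-9 (an assumption) keeps log() defined for unseen words.
        p_d = lam * doc_model.get(w, 0.0) + (1.0 - lam) * background.get(w, 1e-9)
        h -= p_q * math.log(p_d)
    return h


# Toy collection: d2 repeats one query word but never mentions the other.
docs = {
    "d1": "protein spectra peptide peptide".split(),
    "d2": "protein protein protein database".split(),
}
background = unigram_model([w for tokens in docs.values() for w in tokens])
q_model = unigram_model("protein peptide".split())

# Lower cross entropy means a better match.
ranking = sorted(
    docs, key=lambda d: cross_entropy(q_model, unigram_model(docs[d]), background)
)
print(ranking)
```

Document `d2` matches one query word very well but misses the other, and the log of a small smoothed probability dominates its score. That penalty for any unmatched query word is the "soft AND" behavior.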

LM for Protein ID
• Query language model for predicted peptides
• Document language model for each protein sequence

Data Sets
• PPK (Purvine et al., 2003)
  • 2995 empirical spectra from a mixture of 35 proteins
  • 4535 protein sequences (325,812 unique peptides)
• Mark12
  • 9380 empirical spectra from a mixture of 12 proteins
  • 50,012 protein sequences (5,149,302 unique peptides) randomly sampled from the SwissProt database
• Sigma49
  • 12,498 empirical spectra from a mixture of 49 proteins
  • 50,049 protein sequences (2,571,642 unique peptides) randomly sampled from the SwissProt database

Systems
• Prob-AND
  • Our proposed method
• Prob-OR
  • Nesvizhskii’s method, our own implementation
• Conventional Vector Space Model (TFIDF-cosine)
  • Supported by the Lemur search engine (Callan, 2002)
• X!Tandem
  • A popular software tool (available online) for protein/peptide ID
• All the systems, except X!Tandem, used SEQUEST to predict a set of peptides (as the “query”).
• Each system produces a ranked list of proteins per query.

Metrics
• Mean Average Precision (MAP)
  • Standard metric in IR for evaluating ranked lists
• Evaluate each ranked list from the top to each position where a true positive document is retrieved
  • Recall = TP/(TP + FN)
  • Precision = TP/(TP + FP)
  • TP = # of true positives, TN = # of true negatives
  • FP = # of false positives, FN = # of false negatives
• Average the precision scores at the recall levels 0%, 10%, 20%, …, 100% (“11-pt AVGP”)
• Compute the mean of the AVGP scores over all queries
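The metric can be sketched in Python as follows. The sketch assumes the usual interpolation convention (precision at a recall level is the best precision at any recall at or above that level), which the slide does not state; the function names and the toy ranked list are hypothetical.

```python
def eleven_point_avgp(ranked, relevant):
    """11-point interpolated average precision for one ranked list."""
    relevant = set(relevant)
    points = []  # (recall, precision) at each rank where a true positive appears
    tp = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            tp += 1
            points.append((tp / len(relevant), tp / rank))
    levels = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%
    interpolated = []
    for level in levels:
        # Best precision at any recall >= level; 0 if that recall is never reached.
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / len(levels)


def mean_avgp(runs):
    """Mean of 11-pt AVGP over (ranked_list, relevant_set) pairs, one per query."""
    return sum(eleven_point_avgp(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy query: two true proteins, retrieved at ranks 1 and 3.
print(eleven_point_avgp(["P1", "X", "P2", "Y"], {"P1", "P2"}))
```

In the toy query, precision is 1.0 up to 50% recall and 2/3 beyond it, so the 11-point average is (6 × 1.0 + 5 × 2/3) / 11 = 28/33.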

Main Results

Statistical Significance Tests on Proportions

Summary
• The first interdisciplinary investigation/evaluation of state-of-the-art IR methods (LM and VSM) in protein identification
• Prob-AND (LM) is a better criterion than prob-OR for combining peptide-level evidence, improving precision significantly in the high-recall regions.
• Understanding the nature of proteomic data/problems is hard for researchers with different backgrounds (IR or ML), but the outcome is and will be rewarding.

Future Research
• Finding the “best” protein mixture (Arnold et al., PSB 2007)
  • Instead of predicting each protein independently
  • Reduces to solving the minimum set cover problem (NP-hard)
• Reformulated as finding the most likely protein mixture (Li et al., 2008)
  • Greedy approximation strategies
  • Using Gibbs sampling (local maxima, efficiency issues)
  • Better results than ProteinProphet (prob-OR) on Sigma49
• Comparative evaluation (with LM, VSM, etc.) would be informative
• Scalability for high-recall predictions from very large protein databases?

Thanks!
