
PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search




  1. PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search Laxman Yetukuri T-61.6070: Modeling of Proteomics Data

  2. Outline • Motivation • Basics: MS and MS/MS for Protein Identification • Computational Framework of Database Search • Scoring Algorithms • PepHMM • MOWSE • Results • Summary

  3. Motivation • Proteomics studies are dynamic and context sensitive • Speed and accuracy of omics-driven methods • High-throughput MS-based approaches • Real analysis starts with protein identification • Protein identification is challenging • The heart of a protein identification algorithm is its scoring function

  4. Protein Identification Is Challenging • Sample contamination • Imperfect fragmentation • Post-translational modifications • Low signal-to-noise ratio • Machine errors

  5. Basics: MS and MS/MS for Protein Identification [Figure: workflow with the steps trypsin digest, liquid chromatography, mass spectrometry, precursor selection + collision-induced dissociation (CID), MS/MS]

  6. Computational Problem [Figure; source: Nesvizhskii and Aebersold, Drug Discovery Today, 2004, 9, 173-181]

  7. Peptide Fragmentation: b & y ions [Figure: peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-, with cleavage sites labeled by the complementary ion pairs b_i / y_{n-i} and b_{i+1} / y_{n-i-1}]

  8. Peptide Fragmentation: b & y ions • Peptide: S G F L E E D E L K • b ions (left to right): 88 145 292 405 534 663 778 907 1020 1166 • y ions (left to right): 1166 1080 1022 875 762 633 504 389 260 147 [Figure: MS/MS spectrum, % intensity (0 to 100) against m/z (250 to 1000), with labeled peaks b3 to b9 and y2 to y9]
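The b and y series on this slide can be approximately reproduced from residue masses. The following is a minimal sketch, not the presenter's code: it assumes singly charged ions and standard monoisotopic residue masses, and the function names `b_ions` and `y_ions` are introduced here for illustration. Because the slide mixes rounding and truncation, the computed values agree with it only to within about 1 Da.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of the charge-carrying proton
WATER = 18.01056   # H2O retained by the C-terminal (y) fragment

def b_ions(peptide):
    """Singly charged b-ion m/z values b1 .. b(n-1), N-terminal prefixes."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses

def y_ions(peptide):
    """Singly charged y-ion m/z values y1 .. y(n-1), C-terminal suffixes."""
    masses, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total + WATER + PROTON)
    return masses

peptide = "SGFLEEDELK"
b = b_ions(peptide)
y = y_ions(peptide)
print([round(m, 2) for m in b])
print([round(m, 2) for m in y])
```

The last entry in each row of the slide (1166) corresponds to the full protonated peptide rather than to a fragment, so the sketch stops at b9 and y9.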

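  9. Peptide Fragmentation with other ions [Figure: peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-, with cleavage sites labeled by the N-terminal ions a_i, b_i, c_i, b_{i+1} and the C-terminal ions x_{n-i}, y_{n-i}, z_{n-i}, y_{n-i-1}]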

  10. Peptide Identification Two main methods for tandem MS: • De novo interpretation • Sequence database search

  11. De Novo Interpretation [Figure: MS/MS spectrum, % intensity (0 to 100) against m/z (250 to 1000), with amino acid letters (S, G, F, L, E, D, K) annotated between peaks]

  12. Sequence Database Search • Widely used approach • Compares peptides from a protein sequence database with experimental spectra • A scoring function summarises the comparison and is critical for any search engine • Each peptide is scored against the spectrum • Cross-correlation (SEQUEST) • MOWSE scoring and its extensions (MASCOT) • Probabilistic scoring systems (OMSSA, OLAV, ProbID, ...) • PepHMM is an HMM-based probabilistic scoring function

  13. Computational Framework for PepHMM • MSDB-based peptide extraction • Hypothetical spectrum generation (ion types: b, y, y-H2O, b-H2O, b2+ and y2+) • Computing probabilistic scores: initial classification (match, missing or noise); compute PepHMM scores (discussed later); compute Z-score; compute E-score

  14. Contents of PepHMM Model • PepHMM combines information on the correlation among the ions, peak intensity and match tolerance • Input: sets of matches, missing ions and noise peaks • Model is based on b and y ions • Each match is associated with an observation (T, I), where T is the match tolerance and I the peak intensity • Observation state = observed (T, I) • Hidden state = true assignment of the observations
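The initial classification and the Z-score step can be sketched as follows. This is an illustrative sketch only: the tolerance default of 0.5 Da, the nearest-peak matching rule, and the function names `classify_peaks` and `z_score` are assumptions introduced here, and the Z-score shown is the generic standardisation of one candidate's score against all candidate scores, which may differ in detail from what PepHMM does.

```python
from statistics import mean, pstdev

def classify_peaks(theoretical_mz, observed_peaks, tol=0.5):
    """Split a peptide/spectrum comparison into matches, missing ions and noise.

    theoretical_mz: predicted fragment m/z values for the candidate peptide
    observed_peaks: list of (mz, intensity) tuples from the spectrum
    tol: match tolerance in Da (hypothetical default)
    """
    matches, missing, used = [], [], set()
    for t in theoretical_mz:
        best = None
        for idx, (mz, _) in enumerate(observed_peaks):
            if idx in used or abs(mz - t) > tol:
                continue
            if best is None or abs(mz - t) < abs(observed_peaks[best][0] - t):
                best = idx  # closest unused peak inside the tolerance window
        if best is None:
            missing.append(t)              # predicted ion with no observed peak
        else:
            used.add(best)
            mz, inten = observed_peaks[best]
            matches.append((t, mz - t, inten))  # (ion m/z, T, I)
    noise = [p for i, p in enumerate(observed_peaks) if i not in used]
    return matches, missing, noise

def z_score(score, all_scores):
    """Standardise one candidate's score against all candidate scores."""
    sd = pstdev(all_scores)
    return (score - mean(all_scores)) / sd if sd > 0 else 0.0

matches, missing, noise = classify_peaks(
    [88.04, 145.06, 292.13],
    [(145.1, 30.0), (292.0, 80.0), (500.0, 5.0)],
)
```

Each match carries the (T, I) pair that the model later treats as an observation; unmatched predicted ions and unmatched observed peaks form the missing and noise sets.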

  15. Model Structure Four possible assignments corresponding to four hidden states

  16. Model Computation • Goal: find the highest-scoring peptide in the database • A path through the HMM represents one configuration of hidden states, and each path has an associated probability

  17. Model Computation (continued) • Consider all possible paths • Forward algorithm: the total probability of all possible paths from the first position to state v at position i

  18. Emission Probabilities • Probability of observing (Tb, Ib) and (Ty, Iy) for state 1 at position i [Figure legend: normal distribution; exponential distribution]

  19. MOWSE Scoring System • The MOWSE algorithm is implemented in the MASCOT software • m_{i,j} denotes the elements of the MOWSE frequency matrix
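The forward recursion described on this slide can be sketched for a generic HMM. This is the textbook algorithm, not PepHMM's exact model: the argument layout (`init`, `trans`, `emit`) and the toy numbers are assumptions made for illustration, with four states chosen only to mirror the four hidden states mentioned earlier.

```python
def forward(init, trans, emit):
    """Probability of the observations summed over all hidden-state paths.

    init[v]:     probability of starting in state v
    trans[u][v]: transition probability from state u to state v
    emit[i][v]:  probability of the observation at position i given state v
    """
    n_states = len(init)
    # f[v] = total probability of all paths ending in state v at position i
    f = [init[v] * emit[0][v] for v in range(n_states)]
    for i in range(1, len(emit)):
        f = [
            emit[i][v] * sum(f[u] * trans[u][v] for u in range(n_states))
            for v in range(n_states)
        ]
    return sum(f)

# Toy example with four hidden states and three positions.
init = [0.25] * 4
trans = [[0.25] * 4 for _ in range(4)]
emit = [[0.4, 0.3, 0.2, 0.1]] * 3
total = forward(init, trans, emit)
```

For long peptides the products become very small, so a practical implementation would typically work with log probabilities or rescale `f` at each position.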
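One way to read this slide is that the match tolerance T is modeled with a normal distribution and the peak intensity I with an exponential distribution. Under that reading, and under the additional assumption that the four quantities are independent given the state, the emission probability for the state in which both the b and y ions are observed could be sketched as below. The parameter values (`mu`, `sigma`, `lam`) are hypothetical placeholders, not estimates from the presentation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def exponential_pdf(x, lam):
    """Density of an exponential distribution with rate lam (zero for x < 0)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def emission_both_observed(tb, ib, ty, iy, mu=0.0, sigma=0.2, lam=0.05):
    """Emission density for (Tb, Ib) and (Ty, Iy), assuming independence.

    tb, ty: match tolerances of the b and y ion (observed minus predicted m/z)
    ib, iy: intensities of the matched b and y peaks
    mu, sigma, lam: hypothetical parameters; in practice they would be estimated
    """
    return (normal_pdf(tb, mu, sigma) * exponential_pdf(ib, lam)
            * normal_pdf(ty, mu, sigma) * exponential_pdf(iy, lam))

close = emission_both_observed(0.05, 40.0, -0.02, 60.0)
far = emission_both_observed(0.45, 40.0, -0.02, 60.0)
```

With these placeholder parameters, a b ion matched 0.45 Da away from its predicted position contributes a much smaller density than one matched 0.05 Da away, which is the behaviour a tolerance model is meant to capture.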
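For reference, the MOWSE score is commonly written in the form below. Here $M_{\text{prot}}$ stands for the molecular weight of the candidate protein and the product runs over its matched peptides, each contributing the frequency-matrix element $m_{i,j}$ for its mass bin; the symbol $M_{\text{prot}}$ and the index $n$ are notation assumed here.

```latex
\text{Score} = \frac{50000}{M_{\text{prot}} \times \prod_{n} m_{i,j}}
```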

  20. Data Sets • ISB data set: • A, B mixtures of 18 different proteins with modifications/relative amounts • Analysed using SEQUEST and other in-house software • Data set is curated • Final data set with charge 2+ for trypsin digestion contains 857 spectra • 5-fold cross validation by random selection • Training set: 687 spectra • Testing set: 170 spectra • EM algorithm is used for estimating parameters

  21. Results: Distributions of Ions [Figure panels: noise; b and y ions; match tolerance; parameter estimates]

  22. Comparative Studies • Data set selection was repeated 10 times to draw both the training and the test set • Parameter estimates are similar across the groups • A prediction is considered correct if the correct peptide has the highest score

  23. Independent Data Set • A.Y’s Lab: an independent data set used for comparison with other tools such as SEQUEST and MASCOT • Size of data set = 20,980 spectra
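The random training/testing split quoted on this slide (687 and 170 out of 857 spectra) can be sketched as follows. The function name `random_split`, the fixed seed, and the integer placeholder IDs are assumptions for illustration; the actual selection procedure is not described beyond "random selection".

```python
import random

def random_split(spectrum_ids, n_test, seed=0):
    """Shuffle the IDs and hold out n_test of them as the testing set."""
    ids = list(spectrum_ids)
    random.Random(seed).shuffle(ids)   # seeded for reproducibility
    return ids[n_test:], ids[:n_test]  # (training, testing)

spectra = list(range(857))  # placeholder IDs for the 857 curated spectra
train_ids, test_ids = random_split(spectra, n_test=170)
print(len(train_ids), len(test_ids))
```

Repeating the call with different seeds gives the kind of repeated random selection that a comparative study would average over.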

  24. False/True Positive Rates

  25. Summary • Developed a probabilistic scoring function called PepHMM for improving protein identification • PepHMM outperforms other tools such as MASCOT, with a low false positive rate (always?) • Can it handle ion types other than b and y ions? • Needs to handle post-translational modifications
