PepArML: A model-free, result-combining peptide identification arbiter via machine learning

Xue Wu, Chau-Wen Tseng, Nathan Edwards. University of Maryland, College Park, and Georgetown University Medical Center.


Presentation Transcript


  1. PepArML: A model-free, result-combining peptide identification arbiter via machine learning Xue Wu, Chau-Wen Tseng, Nathan Edwards University of Maryland, College Park, and Georgetown University Medical Center

  2. Comparison of Search Engines [Venn diagram of peptide identifications by SEQUEST, Mascot, and X! Tandem; region labels: 28%, 14%, 14%, 38%, 1%, 3%, 2%] • No single score is comprehensive • Search engines disagree • Many spectra lack confident peptide assignment • Many spectra lack any peptide assignment Searle et al. JPR 7(1), 2008

  3. Black-box Techniques • Significance re-estimation • Target-Decoy search • Bimodal distribution fit • Supervised machine learning • Train predictors on synthetic datasets • Select and/or create (many) good features • Result combiners • Incorrect peptide IDs unlikely to match • Significance re-estimation • Independence and/or supervised model
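
The target-decoy idea listed on this slide can be illustrated with a minimal sketch. It assumes separate target and decoy searches and the simple decoys/targets estimator; the slides do not say which estimator PepArML uses, and the function name `target_decoy_fdr` and the toy scores are hypothetical.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    psms: iterable of (score, is_decoy) pairs, where a higher score means a
    better peptide-spectrum match. Assumption: decoy hits above the threshold
    approximate the number of incorrect target hits above it.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return min(1.0, decoys / targets)


# Toy data (hypothetical): five target hits and two decoy hits.
psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False),
        (7.5, False), (6.0, True), (5.5, False)]

print(target_decoy_fdr(psms, 7.0))  # 0.25: one decoy against four targets
```

Lowering the threshold accepts more spectra but also more decoys, which is the trade-off the FDR curves on the later slides plot.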

  4. PepArML • Unified machine learning result combiner • Significance re-estimation too! • Model-free feature use and result combination • Use agreement and features if useful • Unsupervised training procedure • No loss of classification performance

  5. PepArML Overview [Diagram: X!Tandem, Mascot, OMSSA, and other search engines connected to PepArML]

  6. PepArML Overview [Diagram: as on slide 5, with a feature extraction step; X!Tandem, Mascot, OMSSA, Other, PepArML]

  7. Dataset Construction [Table: columns X!Tandem, Mascot, OMSSA; label column T F T …… T]
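
One way to picture the table on this slide is as one row per (spectrum, peptide) pair, with a column per search engine and a true/false label. The sketch below assumes that layout; the column names, the `n_engines` agreement count, and the toy scores are hypothetical and are not taken from PepArML itself.

```python
ENGINES = ("tandem", "mascot", "omssa")


def build_rows(results, true_peptides=None):
    """Merge per-engine search results into one feature row per identification.

    results: {engine: {(spectrum_id, peptide): score}}
    true_peptides: optional set of peptides known to be correct (used to label).
    Engines that did not report a given pair get None for their score.
    """
    keys = sorted(set().union(*(results.get(e, {}).keys() for e in ENGINES)))
    rows = []
    for spectrum_id, peptide in keys:
        row = {"spectrum": spectrum_id, "peptide": peptide}
        for e in ENGINES:
            row[e + "_score"] = results.get(e, {}).get((spectrum_id, peptide))
        # Agreement feature: how many engines reported this identification.
        row["n_engines"] = sum(row[e + "_score"] is not None for e in ENGINES)
        if true_peptides is not None:
            row["label"] = peptide in true_peptides
        rows.append(row)
    return rows


# Toy search results (hypothetical scores).
results = {
    "tandem": {("s1", "PEPTIDEK"): 1e-4, ("s2", "AAAK"): 0.3},
    "mascot": {("s1", "PEPTIDEK"): 45.0},
    "omssa": {("s1", "PEPTIDEK"): 1e-3, ("s2", "CCCK"): 0.5},
}

for row in build_rows(results, true_peptides={"PEPTIDEK"}):
    print(row)
```

Keeping missing scores as None rather than dropping the row lets a learner use engine agreement, or the lack of it, as a feature.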

  8. Dataset Construction • Calibrant 8 Protein Mix (C8) • 4594 MS/MS spectra (LTQ) • 618 (11.2%) true positives • Sashimi 17mix_test2 (S17) • 1389 MS/MS spectra (Q-TOF) • 354 (25.4%) true positives • AURUM 1.0 (364 Proteins) • 7508 MS/MS spectra (MALDI-TOF-TOF) • 3775 (50.3%) true positives

  9. PepArML Machine Learning • Machine learning (generally) helps single search engines • PepArML result-combiner (C-TMO) improves on single search engines • Sometimes combining two search engines works as well, or better, than three

  10. PepArML vs Search Engines (C8)

  11. True vs. Est. FDR (C-TMO, C8)

  12. PepArML vs Search Engines (C8)

  13. PepArML Pairs vs PepArML (C8)

  14. Sensitivity Comparison

  15. Feature Evaluation [Panels: Tandem, Mascot, OMSSA]

  16. Application to Real Data • How well do these models generalize? • Different instruments • Spectral characteristics change scores • Search parameters • Different parameters change score values • Supervised learning requires • (Synthetic) experimental data from every instrument • Search results from available search engines • Training/models for all parameters x search engine sets x instruments

  17. Model Generalization [Panels: Train S17 / Score S17; Train C8 / Score S17]

  18. Rescuing Machine Learning • Train a new machine learning model for every dataset! • Generalization not required • No predetermined search engines, parameters, instruments, features • Perhaps we can “guess” the true proteins • Most proteins not in doubt • Machine learning can tolerate imperfect labels
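
The slides do not spell out the per-dataset training loop, so the following is only one plausible reading of "guess the true proteins, train, repeat". The loop structure, the toy threshold learner, and the names `iterate_labels`, `train_threshold`, and `proteins_with_two_hits` are all hypothetical stand-ins, not the PepArML implementation.

```python
from collections import Counter


def iterate_labels(psms, initial_proteins, train, select_proteins, max_rounds=10):
    """Alternate between labelling by guessed proteins and retraining.

    train(psms, labels) must return a callable model(psm) -> bool.
    select_proteins(confident_psms) must return the new set of guessed proteins.
    Stops when the guessed protein set no longer changes.
    """
    proteins = set(initial_proteins)
    model, rounds = None, 0
    for rounds in range(1, max_rounds + 1):
        labels = [p["protein"] in proteins for p in psms]  # imperfect labels
        model = train(psms, labels)
        confident = [p for p in psms if model(p)]
        new_proteins = set(select_proteins(confident))
        if new_proteins == proteins:
            break
        proteins = new_proteins
    return model, proteins, rounds


def train_threshold(psms, labels):
    """Toy learner: cut halfway between the mean positive and negative scores."""
    pos = [p["score"] for p, y in zip(psms, labels) if y]
    neg = [p["score"] for p, y in zip(psms, labels) if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda p: p["score"] >= cut


def proteins_with_two_hits(confident):
    """Toy selection rule: keep proteins with at least two confident matches."""
    counts = Counter(p["protein"] for p in confident)
    return {prot for prot, n in counts.items() if n >= 2}


# Toy data (hypothetical); the initial guess wrongly includes P4 and misses P2.
psms = [
    {"protein": "P1", "score": 9.0}, {"protein": "P1", "score": 8.0},
    {"protein": "P2", "score": 7.5}, {"protein": "P2", "score": 7.0},
    {"protein": "P3", "score": 2.0}, {"protein": "P3", "score": 1.0},
    {"protein": "P4", "score": 8.5}, {"protein": "P4", "score": 1.5},
]
model, proteins, rounds = iterate_labels(
    psms, {"P1", "P4"}, train_threshold, proteins_with_two_hits)
print(sorted(proteins), rounds)  # ['P1', 'P2'] 2
```

In this toy run the wrong initial guess is corrected after one round, which is the sense in which a learner can tolerate imperfect labels.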

  19. Unsupervised Learning

  20. Unsupervised Learning (S17)

  21. Unsupervised Learning (S17)

  22. Protein Selection Heuristic • Modeled on typical protein identification criteria • High confidence peptide IDs • At least 2 non-overlapping peptides • At least 10% sequence coverage • Robust, fast convergence • Easily enforce additional constraints
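
The two numeric criteria on this slide (at least 2 non-overlapping peptides, at least 10% sequence coverage) can be checked as sketched below. This assumes the peptides passed in are already the high-confidence ones and, as a simplification, maps each peptide to its first occurrence in the protein; the function name `select_protein` and the toy sequences are hypothetical.

```python
def select_protein(protein_seq, confident_peptides, min_peptides=2, min_coverage=0.10):
    """Return True if the protein passes both selection criteria.

    protein_seq: protein sequence as a string.
    confident_peptides: peptide strings already judged high confidence.
    """
    # Locate each distinct peptide as a half-open span [start, end).
    spans = []
    for pep in set(confident_peptides):
        start = protein_seq.find(pep)  # simplification: first occurrence only
        if pep and start >= 0:
            spans.append((start, start + len(pep)))

    # Sequence coverage: fraction of residues covered by at least one peptide.
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    coverage = len(covered) / len(protein_seq) if protein_seq else 0.0

    # Largest set of mutually non-overlapping spans (greedy by earliest end).
    non_overlapping, last_end = 0, 0
    for start, end in sorted(spans, key=lambda s: s[1]):
        if start >= last_end:
            non_overlapping += 1
            last_end = end

    return non_overlapping >= min_peptides and coverage >= min_coverage


# Toy sequences (letters are placeholders, not real amino-acid sequences).
protein = "ABCDEFGHIJKLMNOPQRST"
print(select_protein(protein, ["ABC", "FGH"]))    # True: 2 disjoint peptides, 30% coverage
print(select_protein(protein, ["ABCD", "CDEF"]))  # False: the two peptides overlap
```

Further constraints, as the slide suggests, would be extra conditions in the final return expression.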

  23. What about real data? Dr. Rado Goldman (LCCC, GUMC) • Proteolytic serum peptides from clinical hepatocellular carcinoma samples • ~200 MALDI MS/MS spectra (TOF-TOF) • PepArML for non-specific search of IPI-Human • Increase in confidence & sensitivity • Observation of “ragged” proteolytic trimming

  24. Protein Identification Example [Figure legend: M, T, O, *]

  25. Future Directions • Apply to more experimental datasets • Integrate • novel features • new search engines, spectral matching • multiple searches with varied parameters, sequence databases • Construct meta-search engine • FDR by bimodal fit instead of decoys • Release as open source • http://peparml.sourceforge.org

  26. http://PepArML.SourceForge.Net

  27. Acknowledgements • Xue Wu* & Dr. Chau-Wen Tseng • Computer Science, University of Maryland, College Park • Dr. Brian Balgley, Dr. Paul Rudnick • Calibrant Biosystems & NIST • Dr. Rado Goldman, Dr. Yanming An • Department of Oncology, Georgetown University Medical Center • Kam Ho To • Biochemistry Masters student, Georgetown University • Funding: NIH/NCI CPTAC

  28. PepArML vs Search Engines (S17)

  29. PepArML vs Search Engines (S17)

  30. PepArML Pairs vs PepArML (C8)

  31. PepArML Pairs vs PepArML (S17)

  32. PepArML Pairs vs PepArML (S17)

  33. Unsupervised Learning (C8)

  34. Unsupervised Learning (C8)
