PepArML: A model-free, result-combining peptide identification arbiter via machine learning

Xue Wu, Chau-Wen Tseng, Nathan Edwards
University of Maryland, College Park, and Georgetown University Medical Center

Comparison of Search Engines
[Figure: Venn diagram of spectra identified by SEQUEST, Mascot, and X! Tandem; region values 28%, 14%, 14%, 38%, 1%, 3%, 2%]
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
• Many spectra lack any peptide assignment
Searle et al. JPR 7(1), 2008

Black-box Techniques
• Significance re-estimation
  • Target-Decoy search
  • Bimodal distribution fit
• Supervised machine learning
  • Train predictors on synthetic datasets
  • Select and/or create (many) good features
• Result combiners
  • Incorrect peptide IDs unlikely to match
  • Significance re-estimation
  • Independence and/or supervised model
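The target-decoy approach to significance re-estimation listed above can be illustrated with a minimal Python sketch. It assumes separate target and decoy searches of equal database size, so the number of decoy hits above a score threshold estimates the number of false target hits. The function names and the (score, is_decoy) input layout are hypothetical and are not taken from PepArML.

```python
from typing import List, Optional, Tuple

def decoy_fdr_curve(psms: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Estimate FDR at every score threshold.

    psms: (score, is_decoy) pairs, where a higher score is better.
    Returns (score, estimated_fdr) pairs ordered from best to worst score.
    Assumption: FDR is estimated as decoy hits / target hits at the threshold.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    curve = []
    targets = decoys = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # With no target hits yet, report the worst case.
        fdr = decoys / targets if targets else 1.0
        curve.append((score, min(fdr, 1.0)))
    return curve

def score_cutoff(psms: List[Tuple[float, bool]], max_fdr: float) -> Optional[float]:
    """Lowest score whose estimated FDR is still at or below max_fdr."""
    cutoff = None
    for score, fdr in decoy_fdr_curve(psms):
        if fdr <= max_fdr:
            cutoff = score
    return cutoff
```

Calling `score_cutoff(psms, 0.05)` returns the most permissive score threshold that keeps the estimated FDR at or below 5%, or `None` if no threshold qualifies.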

PepArML
• Unified machine learning result combiner
  • Significance re-estimation too!
• Model-free feature use and result combination
  • Use agreement and features if useful
• Unsupervised training procedure
  • No loss of classification performance

PepArML Overview
[Diagram: search results from X!Tandem, Mascot, OMSSA, and other engines pass through feature extraction into PepArML]

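Dataset Construction
[Diagram: features from X!Tandem, Mascot, and OMSSA are joined into one row per peptide identification, each labeled true (T) or false (F)]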
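A minimal Python sketch of how such a combined dataset might be assembled is shown here. It assumes each search engine reports hits as dictionaries with `spectrum`, `peptide`, `protein`, and `score` keys, and that an identification is labeled true when its protein belongs to the known protein mix. The field names, the `n_engines` agreement feature, and the labeling rule are illustrative assumptions, not the PepArML feature set.

```python
from collections import defaultdict

# Hypothetical engine keys used for the per-engine score columns.
ENGINES = ("tandem", "mascot", "omssa")

def build_rows(results, true_proteins):
    """Merge per-engine hits into one row per (spectrum, peptide) pair.

    results: dict mapping engine name to a list of hit dicts.
    true_proteins: set of protein identifiers known to be in the sample.
    """
    merged = defaultdict(dict)   # (spectrum, peptide) -> {engine: score}
    proteins = {}                # (spectrum, peptide) -> protein
    for engine, hits in results.items():
        for hit in hits:
            key = (hit["spectrum"], hit["peptide"])
            merged[key][engine] = hit["score"]
            proteins[key] = hit["protein"]

    rows = []
    for key in sorted(merged):
        scores = merged[key]
        row = {"spectrum": key[0], "peptide": key[1]}
        for engine in ENGINES:
            # None marks an engine that did not report this identification.
            row[engine + "_score"] = scores.get(engine)
        # Agreement feature: how many engines proposed this peptide.
        row["n_engines"] = sum(1 for engine in ENGINES if engine in scores)
        # Assumed labeling rule: true if the protein is in the known mix.
        row["label"] = proteins[key] in true_proteins
        rows.append(row)
    return rows
```

Keeping missing scores as `None` rather than zero lets a downstream learner distinguish "not reported" from "reported with a low score".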

Dataset Construction
• Calibrant 8 Protein Mix (C8)
  • 4594 MS/MS spectra (LTQ)
  • 618 (11.2%) true positives
• Sashimi 17mix_test2 (S17)
  • 1389 MS/MS spectra (Q-TOF)
  • 354 (25.4%) true positives
• AURUM 1.0 (364 Proteins)
  • 7508 MS/MS spectra (MALDI-TOF-TOF)
  • 3775 (50.3%) true positives

PepArML Machine Learning
• Machine learning (generally) helps single search engines
• PepArML result-combiner (C-TMO) improves on single search engines
• Sometimes combining two search engines works as well as, or better than, combining three

PepArML vs Search Engines (C8)

True vs. Est. FDR (C-TMO, C8)


PepArML Pairs vs PepArML (C8)

Sensitivity Comparison

Feature Evaluation
[Figure panels: Tandem, Mascot, OMSSA]

Application to Real Data
• How well do these models generalize?
• Different instruments
  • Spectral characteristics change scores
• Search parameters
  • Different parameters change score values
• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search engine sets × instruments

Model Generalization
[Figure panels: Train S17 / Score S17; Train C8 / Score S17]

Rescuing Machine Learning
• Train a new machine learning model for every dataset!
  • Generalization not required
  • No predetermined search engines, parameters, instruments, features
• Perhaps we can “guess” the true proteins
  • Most proteins not in doubt
  • Machine learning can tolerate imperfect labels
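The idea of guessing the true proteins and retraining on each dataset can be sketched as an iterative loop in Python. This is an illustration under stated assumptions, not the PepArML implementation: `train` and `select_proteins` are caller-supplied callables, and `toy_train` and `toy_select` are deliberately simple stand-ins for a real classifier and a real protein selection rule.

```python
def iterate_labels(psms, initial_conf, train, select_proteins, max_rounds=20):
    """Alternate between guessing proteins and retraining until stable.

    psms: list of dicts, each with at least a 'protein' key.
    initial_conf: starting confidence per PSM (e.g. a rescaled engine score).
    train(psms, labels) -> scorer, where scorer(psm) -> confidence in [0, 1].
    select_proteins(psms, confidences) -> set of protein identifiers.
    """
    conf = list(initial_conf)
    proteins = select_proteins(psms, conf)
    for _ in range(max_rounds):
        # Imperfect labels: a PSM counts as true if its protein was selected.
        labels = [p["protein"] in proteins for p in psms]
        scorer = train(psms, labels)
        conf = [scorer(p) for p in psms]
        new_proteins = select_proteins(psms, conf)
        if new_proteins == proteins:
            break  # protein set is stable, so the labels will not change
        proteins = new_proteins
    return conf, proteins

def toy_train(psms, labels):
    """Toy classifier: threshold halfway between the class mean scores."""
    pos = [p["score"] for p, y in zip(psms, labels) if y]
    neg = [p["score"] for p, y in zip(psms, labels) if not y]
    if not pos or not neg:
        return lambda p: 1.0 if pos else 0.0
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda p: 1.0 if p["score"] >= cut else 0.0

def toy_select(psms, conf):
    """Toy rule: keep proteins with at least two confident PSMs."""
    counts = {}
    for p, c in zip(psms, conf):
        if c >= 0.5:
            counts[p["protein"]] = counts.get(p["protein"], 0) + 1
    return {protein for protein, n in counts.items() if n >= 2}
```

Because the classifier is retrained on the dataset at hand, no model has to transfer across instruments or search parameters; the loop only needs the guessed labels to be mostly right.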

Unsupervised Learning

Unsupervised Learning (S17)

Protein Selection Heuristic
• Modeled on typical protein identification criteria
  • High confidence peptide IDs
  • At least 2 non-overlapping peptides
  • At least 10% sequence coverage
• Robust, fast convergence
• Easily enforce additional constraints
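The three criteria above can be sketched in Python as follows. The 2-peptide and 10% coverage thresholds come from the slide; the confidence cutoff of 0.9, the input layout, and the use of the first sequence match to locate each peptide are assumptions made for illustration only.

```python
def select_proteins(protein_seqs, peptide_ids,
                    min_conf=0.9, min_peptides=2, min_coverage=0.10):
    """Return proteins supported by enough confident peptide evidence.

    protein_seqs: dict mapping protein id to its amino acid sequence.
    peptide_ids: iterable of (protein id, peptide sequence, confidence).
    min_conf is an assumed cutoff for "high confidence".
    """
    spans = {}  # protein -> set of (start, end) half-open intervals
    for protein, peptide, conf in peptide_ids:
        seq = protein_seqs.get(protein)
        if seq is None or conf < min_conf:
            continue
        start = seq.find(peptide)  # assumption: first occurrence only
        if start >= 0:
            spans.setdefault(protein, set()).add((start, start + len(peptide)))

    selected = set()
    for protein, intervals in spans.items():
        # Greedy interval scheduling: the largest set of non-overlapping
        # peptides is found by taking intervals in order of end position.
        count, last_end = 0, -1
        for start, end in sorted(intervals, key=lambda s: s[1]):
            if start >= last_end:
                count += 1
                last_end = end
        # Sequence coverage: fraction of residues inside any peptide.
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        coverage = len(covered) / len(protein_seqs[protein])
        if count >= min_peptides and coverage >= min_coverage:
            selected.add(protein)
    return selected
```

Additional constraints, such as a minimum peptide length, could be added as further conditions in the final `if` without changing the rest of the function.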

What about real data?
• Dr. Rado Goldman (LCCC, GUMC)
  • Proteolytic serum peptides from clinical hepatocellular carcinoma samples
  • ~200 MALDI MS/MS spectra (TOF-TOF)
• PepArML for non-specific search of IPI-Human
  • Increase in confidence & sensitivity
  • Observation of “ragged” proteolytic trimming

Protein Identification Example
[Figure labels: M, T, O, *]

Future Directions
• Apply to more experimental datasets
• Integrate
  • novel features
  • new search engines, spectral matching
  • multiple searches with varied parameters, sequence databases
• Construct meta-search engine
• FDR by bimodal fit instead of decoys
• Release as open source
  • http://peparml.sourceforge.net

http://PepArML.SourceForge.Net

Acknowledgements
• Xue Wu* & Dr. Chau-Wen Tseng
  • Computer Science, University of Maryland, College Park
• Dr. Brian Balgley, Dr. Paul Rudnick
  • Calibrant Biosystems & NIST
• Dr. Rado Goldman, Dr. Yanming An
  • Department of Oncology, Georgetown University Medical Center
• Kam Ho To
  • Biochemistry Masters student, Georgetown University
• Funding: NIH/NCI CPTAC

PepArML vs Search Engines (S17)


PepArML Pairs vs PepArML (S17)

Unsupervised Learning (C8)
