PepArML: A model-free, result-combining peptide identification arbiter via machine learning
Xue Wu, Chau-Wen Tseng, Nathan Edwards
University of Maryland, College Park, and Georgetown University Medical Center
Comparison of Search Engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
• Many spectra lack any peptide assignment
[Figure: three-way overlap of SEQUEST, Mascot, and X! Tandem identifications; region labels 28%, 14%, 14%, 38%, 1%, 3%, 2%]
Searle et al. JPR 7(1), 2008
Black-box Techniques
• Significance re-estimation
  • Target-decoy search
  • Bimodal distribution fit
• Supervised machine learning
  • Train predictors on synthetic datasets
  • Select and/or create (many) good features
• Result combiners
  • Incorrect peptide IDs unlikely to match
  • Significance re-estimation
  • Independence and/or supervised model
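The target-decoy idea named on this slide can be sketched as follows. This is a minimal illustration, not PepArML's code: it assumes a separate decoy search, assumes that a higher score means a better match, and uses the simple decoys/targets estimator. The function name `decoy_fdr` and the toy scores are hypothetical.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among target matches scoring at or above threshold.

    psms: iterable of (score, is_decoy) pairs (assumed layout).
    Estimator (an assumption): decoy count / target count, capped at 1.
    """
    psms = list(psms)
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return min(1.0, decoys / targets)


# Toy peptide-spectrum matches: (score, is_decoy)
psms = [(9.1, False), (8.7, False), (7.9, False), (7.5, True),
        (6.8, False), (6.1, True), (5.4, False), (4.9, True)]

for cutoff in (8.0, 7.0, 4.0):
    print(cutoff, round(decoy_fdr(psms, cutoff), 3))
```

Lowering the cutoff accepts more target matches but also more decoys, so the estimated error rate rises.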
PepArML
• Unified machine learning result combiner
  • Significance re-estimation too!
• Model-free feature use and result combination
  • Use agreement and features if useful
• Unsupervised training procedure
  • No loss of classification performance
PepArML Overview
[Diagram: X!Tandem, Mascot, OMSSA, and other search engines; feature extraction; PepArML]
Dataset Construction
[Diagram: per-spectrum results from X!Tandem, Mascot, and OMSSA, with a column of T/F labels]
Dataset Construction
• Calibrant 8 Protein Mix (C8)
  • 4594 MS/MS spectra (LTQ)
  • 618 (11.2%) true positives
• Sashimi 17mix_test2 (S17)
  • 1389 MS/MS spectra (Q-TOF)
  • 354 (25.4%) true positives
• AURUM 1.0 (364 Proteins)
  • 7508 MS/MS spectra (MALDI-TOF-TOF)
  • 3775 (50.3%) true positives
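One way to picture the combined dataset is one row per (spectrum, peptide) candidate, holding each engine's score, an agreement count, and a T/F label. The sketch below is an assumption about the layout, not the authors' format: the names `build_rows` and `n_agree`, the input dictionaries, and the rule of labeling by membership in a known peptide set are all hypothetical.

```python
ENGINES = ("tandem", "mascot", "omssa")


def build_rows(results, true_peptides):
    """Merge per-engine results into labeled rows.

    results: {engine: {spectrum_id: (peptide, score)}} (assumed layout).
    true_peptides: peptides from the known proteins (assumed labeling rule).
    """
    # Collect every (spectrum, peptide) pair proposed by any engine.
    candidates = set()
    for engine in ENGINES:
        for spectrum_id, (peptide, _score) in results.get(engine, {}).items():
            candidates.add((spectrum_id, peptide))

    rows = []
    for spectrum_id, peptide in sorted(candidates):
        row = {"spectrum": spectrum_id, "peptide": peptide}
        agree = 0
        for engine in ENGINES:
            hit = results.get(engine, {}).get(spectrum_id)
            if hit is not None and hit[0] == peptide:
                row[engine + "_score"] = hit[1]
                agree += 1
            else:
                row[engine + "_score"] = None  # engine did not report this peptide
        row["n_agree"] = agree
        row["label"] = peptide in true_peptides
        rows.append(row)
    return rows


results = {
    "tandem": {"s1": ("PEPTIDEK", 3.2), "s2": ("AAAK", 1.1)},
    "mascot": {"s1": ("PEPTIDEK", 41.0)},
    "omssa": {"s1": ("PEPTIDEK", 2.5), "s2": ("CCCR", 0.9)},
}
rows = build_rows(results, true_peptides={"PEPTIDEK"})
for row in rows:
    print(row)
```

Keeping a missing score as `None` rather than zero leaves it to the learner to decide how to treat an engine that reported nothing for a spectrum.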
PepArML Machine Learning
• Machine learning (generally) helps single search engines
• PepArML result-combiner (C-TMO) improves on single search engines
• Sometimes combining two search engines works as well as, or better than, combining three
Feature Evaluation
[Figure: feature evaluation panels for Tandem, Mascot, and OMSSA]
Application to Real Data
• How well do these models generalize?
  • Different instruments
    • Spectral characteristics change scores
  • Search parameters
    • Different parameters change score values
• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search engine sets × instruments
Model Generalization
[Figure panels: Train S17 / Score S17; Train C8 / Score S17]
Rescuing Machine Learning
• Train a new machine learning model for every dataset!
  • Generalization not required
  • No predetermined search engines, parameters, instruments, features
• Perhaps we can “guess” the true proteins
  • Most proteins not in doubt
  • Machine learning can tolerate imperfect labels
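A per-dataset training loop along these lines might look like the sketch below. It is one reading of the slide, not the published procedure: guess the true proteins, label peptide IDs by that guess, train, rescore, and repeat until the protein set stops changing. The nearest-centroid "classifier", the two-hit protein rule, and all names (`iterate_labels`, `train_centroid`, `select_proteins`) are hypothetical stand-ins.

```python
import math


def train_centroid(features, labels):
    """Toy classifier: confidence from relative distance to class centroids."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = [f for f, y in zip(features, labels) if y]
    neg = [f for f, y in zip(features, labels) if not y]
    if not pos or not neg:
        const = 1.0 if pos else 0.0  # degenerate labels: constant confidence
        return lambda f: const
    c_pos, c_neg = centroid(pos), centroid(neg)

    def confidence(f):
        d_pos, d_neg = math.dist(f, c_pos), math.dist(f, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total
    return confidence


def select_proteins(psms, confidences, min_conf=0.5, min_hits=2):
    """Guess the true proteins: at least min_hits confident peptide IDs."""
    counts = {}
    for psm, conf in zip(psms, confidences):
        if conf >= min_conf:
            counts[psm["protein"]] = counts.get(psm["protein"], 0) + 1
    return {prot for prot, n in counts.items() if n >= min_hits}


def iterate_labels(psms, max_rounds=20):
    """Alternate protein guessing and model training until the guess is stable."""
    confidences = [psm["initial_score"] for psm in psms]
    proteins = None
    for _ in range(max_rounds):
        guess = select_proteins(psms, confidences)
        if guess == proteins:
            break  # converged: same protein set as the previous round
        proteins = guess
        labels = [psm["protein"] in proteins for psm in psms]  # imperfect labels
        model = train_centroid([psm["features"] for psm in psms], labels)
        confidences = [model(psm["features"]) for psm in psms]
    return proteins, confidences


toy = [("A", 0.9), ("A", 0.8), ("A", 0.4), ("B", 0.7),
       ("B", 0.2), ("C", 0.1), ("C", 0.3)]
psms = [{"protein": p, "features": [s], "initial_score": s} for p, s in toy]
proteins, confidences = iterate_labels(psms)
print(proteins)
```

In the toy data the high-scoring match to protein B is labeled false during training yet still receives a high confidence, which is the kind of label noise the slide suggests a learner can absorb.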
Protein Selection Heuristic
• Modeled on typical protein identification criteria
  • High confidence peptide IDs
  • At least 2 non-overlapping peptides
  • At least 10% sequence coverage
• Robust, fast convergence
• Easily enforce additional constraints
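The two numeric criteria above can be checked with a few lines of code. This is a sketch under stated assumptions: peptide positions are given as 0-based half-open `(start, end)` residue ranges for IDs already judged high-confidence, and the function name `passes_heuristic` is hypothetical.

```python
def passes_heuristic(protein_length, peptide_spans,
                     min_peptides=2, min_coverage=0.10):
    """Check the 2-peptide and 10%-coverage criteria for one protein.

    peptide_spans: (start, end) half-open residue ranges of confident
    peptide IDs on this protein (assumed input format).
    """
    # Largest set of mutually non-overlapping peptides: greedy by end position.
    non_overlapping = 0
    last_end = 0
    for start, end in sorted(set(peptide_spans), key=lambda span: span[1]):
        if start >= last_end:
            non_overlapping += 1
            last_end = end

    # Sequence coverage: fraction of residues covered by any peptide.
    covered = set()
    for start, end in peptide_spans:
        covered.update(range(start, end))
    coverage = len(covered) / protein_length

    return non_overlapping >= min_peptides and coverage >= min_coverage


print(passes_heuristic(200, [(10, 20), (50, 62)]))   # two separate peptides, 11% coverage
print(passes_heuristic(200, [(10, 20), (15, 27)]))   # the two peptides overlap
print(passes_heuristic(400, [(10, 20), (50, 62)]))   # coverage only 5.5%
```

Further constraints, as the last bullet suggests, could be added as extra conditions in the final return statement.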
What about real data?
Dr. Rado Goldman (LCCC, GUMC)
• Proteolytic serum peptides from clinical hepatocellular carcinoma samples
• ~200 MALDI MS/MS spectra (TOF-TOF)
PepArML for non-specific search of IPI-Human
• Increase in confidence & sensitivity
• Observation of “ragged” proteolytic trimming
Protein Identification Example
[Figure: protein identification example; legend entries M, T, O, *]
Future Directions
• Apply to more experimental datasets
• Integrate
  • novel features
  • new search engines, spectral matching
  • multiple searches with varied parameters, sequence databases
• Construct meta-search engine
• FDR by bimodal fit instead of decoys
• Release as open source
  • http://peparml.sourceforge.org
Acknowledgements
• Xue Wu* & Dr. Chau-Wen Tseng
  • Computer Science, University of Maryland, College Park
• Dr. Brian Balgley, Dr. Paul Rudnick
  • Calibrant Biosystems & NIST
• Dr. Rado Goldman, Dr. Yanming An
  • Department of Oncology, Georgetown University Medical Center
• Kam Ho To
  • Biochemistry Masters student, Georgetown University
• Funding: NIH/NCI CPTAC