
Search Engine Result Combining

Search Engine Result Combining. Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center. Peptide Identification Results. Search engines provide an answer for every spectrum... Can we figure out which ones to believe? Why is this hard?


Presentation Transcript


  1. Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

  2. Peptide Identification Results • Search engines provide an answer for every spectrum... • Can we figure out which ones to believe? • Why is this hard? • Hard to determine “good” scores • Significance estimates are unreliable • Need more ids from weak spectra • Each search engine has its strengths... and weaknesses • Search engines give different answers

  3. Mascot Search Results

  4. Translation start-site correction • Halobacterium sp. NRC-1 • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins • Goo, et al. MCP 2003. • GdhA1 gene: • Glutamate dehydrogenase A1 • Multiple significant peptide identifications • Observed start is consistent with Glimmer 3.0 prediction(s)

  5. Halobacterium sp. NRC-1 ORF: GdhA1 • K-score E-value vs PepArML @ 10% FDR • Many peptides inconsistent with annotated translation start site of NP_279651

  6. Translation start-site correction

  7. Search engine scores are inconsistent! (Figure: Tandem vs. Mascot)

  8. Common Algorithmic Framework – Different Results • Pre-process experimental spectra • Charge state, cleaning, binning • Filter peptide candidates • Decide which PSMs to evaluate • Score peptide-spectrum match • Fragmentation modeling, dot product • Rank peptides per spectrum • Retain statistics per spectrum • Estimate E-values • Apply empirical or theoretical model
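The "binning" and "dot product" steps named in slide 8 can be illustrated with a minimal sketch. This is not any particular engine's scoring function: the 1.0 m/z bin width, the function names, and the cosine normalization are all assumptions chosen for illustration.

```python
from collections import defaultdict
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width bins.

    bin_width is an illustrative assumption, not a value from the slides.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def normalized_dot_product(observed, theoretical):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    shared_bins = set(observed) & set(theoretical)
    dot = sum(observed[b] * theoretical[b] for b in shared_bins)
    norm = sqrt(sum(v * v for v in observed.values())) * sqrt(
        sum(v * v for v in theoretical.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical peaks: an observed spectrum and a predicted fragment spectrum.
observed = bin_spectrum([(100.2, 10.0), (200.7, 5.0), (300.1, 1.0)])
predicted = bin_spectrum([(100.4, 1.0), (200.9, 1.0)])
score = normalized_dot_product(observed, predicted)
```

Even in this toy version, the choice of bin width and normalization changes the score, which is one reason engines that share the same framework can still rank peptides differently.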

  9. Comparison of search engines • No single score is comprehensive • Search engines disagree • Many spectra lack confident peptide assignment (Figure: overlap among OMSSA, Mascot, and X!Tandem; region labels 10%, 4%, 2%, 69%, 9%, 5%, 2%)

  10. Lots of techniques out there • Treat search engines as black-boxes • Generate PSMs + scores, features • Apply supervised machine learning to results • Use multiple match metrics • Combine/refine using multiple search engines • Agreement suggests correctness • Use empirical significance estimates • “Decoy” databases (FDR)
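The empirical significance estimate from "decoy" databases can be sketched as follows. This assumes one common convention (separate target and decoy searches of comparable size, higher score is better, FDR estimated as decoy hits divided by target hits above a threshold); the slides do not say which convention is used, and the function name is hypothetical.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR at a score threshold from target and decoy PSMs.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Assumed estimator: (# decoy hits) / (# target hits) at or above threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical scores, for illustration only.
psms = [(50, False), (45, False), (40, True), (38, False), (30, False), (20, True)]
fdr_at_35 = decoy_fdr(psms, 35)  # 1 decoy hit vs. 3 target hits
```

Other conventions exist (for example, searching a concatenated target+decoy database), and they give different estimates for the same data.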

  11. Machine Learning • Use of multiple metrics of PSM quality: • Precursor delta, trypsin digest features, etc. • Requires "training" with examples • Different examples will change the result • Generalization is always the question • Scores can be hard to "understand" • Difficult to establish statistical significance • PeptideProphet's discriminant function • Weighted linear combination of features
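A "weighted linear combination of features" can be written in a few lines. The feature names, weights, and bias below are made up for illustration; they are not the coefficients of any published discriminant function.

```python
def discriminant_score(features, weights, bias=0.0):
    """Weighted linear combination of PSM features.

    features, weights: dicts keyed by feature name (hypothetical names).
    """
    return bias + sum(weights[name] * features[name] for name in weights)


# Illustrative weights only: reward a high engine score and tryptic termini,
# penalize a large precursor mass error.
weights = {"engine_score": 1.0, "precursor_delta": -2.0, "tryptic_termini": 0.5}
psm = {"engine_score": 4.0, "precursor_delta": 0.5, "tryptic_termini": 2}
score = discriminant_score(psm, weights, bias=-1.0)
```

In a trained model the weights come from labeled examples, which is why a different training set can change the result.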

  12. Combine / Merge Results Threshold peptide-spectrum matches from each of two search engines • PSMs agree → boost specificity • PSMs from one → boost sensitivity • PSMs disagree → ????? • Sometimes agreement is "lost" due to threshold... • How much should agreement increase our confidence? • Scores easy to "understand" • Difficult to establish statistical significance • How to generalize to more engines?
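The three cases in slide 12 (agree, one engine only, disagree) can be made concrete with a small sketch. It assumes each engine's results have already been thresholded and reduced to one peptide per spectrum identifier; the function and key names are hypothetical.

```python
def merge_psms(engine_a, engine_b):
    """Partition spectra by how two engines' thresholded PSMs relate.

    engine_a, engine_b: dicts mapping spectrum id -> peptide sequence,
    containing only PSMs that passed each engine's own threshold.
    """
    result = {"agree": {}, "single": {}, "disagree": {}}
    for spectrum_id in set(engine_a) | set(engine_b):
        a = engine_a.get(spectrum_id)
        b = engine_b.get(spectrum_id)
        if a is not None and b is not None:
            if a == b:
                result["agree"][spectrum_id] = a  # boosts specificity
            else:
                result["disagree"][spectrum_id] = (a, b)  # unresolved
        else:
            # Identified by only one engine: boosts sensitivity.
            result["single"][spectrum_id] = a if a is not None else b
    return result


# Hypothetical results for three spectra.
tandem = {"scan1": "PEPTIDEK", "scan2": "SAMPLER"}
mascot = {"scan1": "PEPTIDEK", "scan2": "SIMPLER", "scan3": "ANOTHERK"}
merged = merge_psms(tandem, mascot)
```

Note that a spectrum lands in "single" rather than "agree" whenever one engine's matching PSM fell just below its threshold, which is the "lost" agreement the slide mentions.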

  13. Consensus and Meta-Search • Multiple witnesses increase confidence • As long as they are independent • Example: Getting the story straight • Independent "random" hits unlikely to agree • Agreement is an indication of biased sampling • Example: loaded dice • Meta-search is relatively easy • Merging and re-ranking is hard • Example: Booking a flight to Denver! • Scores and E-values are not comparable • How to choose the best answer? • Example: Best E-value favors Tandem!

  14. Searching for Consensus Search engine quirks can destroy consensus • Initial methionine loss as tryptic peptide • Charge state enumeration or guessing • X!Tandem's refinement mode • Pyro-Gln, Pyro-Glu modifications • Difficulty tracking spectrum identifiers • Precursor mass tolerance (Da vs ppm) Decoy searches must be identical!
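The "Da vs ppm" point deserves a worked example, because the same nominal tolerance means different things at different masses. The sketch below uses the standard definition of parts per million; the function names and the example masses are illustrative.

```python
def ppm_to_da(mass, ppm):
    """Convert a relative tolerance in ppm to an absolute tolerance in Da."""
    return mass * ppm / 1e6


def within_tolerance(observed, theoretical, tolerance, unit="ppm"):
    """Check a precursor mass match under an absolute (Da) or relative (ppm) window."""
    if unit == "ppm":
        limit = ppm_to_da(theoretical, tolerance)
    elif unit == "da":
        limit = tolerance
    else:
        raise ValueError(f"unknown tolerance unit: {unit!r}")
    return abs(observed - theoretical) <= limit


# A 0.03 Da error on a 2000 Da precursor is 15 ppm: it fails a 10 ppm window
# but passes a 0.05 Da window, so two engines configured "equivalently" in
# different units can accept different candidate peptides.
fails_ppm = within_tolerance(2000.03, 2000.0, 10, unit="ppm")
passes_da = within_tolerance(2000.03, 2000.0, 0.05, unit="da")
```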

  15. Configuring for Consensus Search engine configuration can be difficult: • Correct spectral format • Search parameter files and command-line • Pre-processed sequence databases • Tracking spectrum identifiers • Extracting peptide identifications, especially modifications and protein identifiers

  16. Peptide Identification Meta-Search • Simple unified search interface for: Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT • Automatic decoy searches • Automatic spectrum file "chunking" • Automatic scheduling • Serial, Multi-Processor, Cluster, Grid
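Spectrum file "chunking" can be sketched generically: split the list of spectra into fixed-size pieces so that each piece can be scheduled as an independent search job. The chunk size and function name are assumptions; the slides do not describe how the actual tool splits files.

```python
def chunk_spectra(spectra, chunk_size):
    """Split a list of spectra into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [spectra[i:i + chunk_size] for i in range(0, len(spectra), chunk_size)]


# Hypothetical spectrum identifiers; each chunk would become one search job.
jobs = chunk_spectra(["scan1", "scan2", "scan3", "scan4", "scan5"], 2)
```

Whatever the splitting scheme, each spectrum must keep a stable identifier across chunks so that results from different engines can be matched up afterwards.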

  17. Peptide Identification Grid-Enabled Meta-Search X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core). NSF TeraGrid 1000+ CPUs Heterogeneous compute resources X!Tandem, KScore, OMSSA. Secure communication Edwards Lab Scheduler & 80+ CPUs Scales easily to 250+ simultaneous searches X!Tandem, KScore, OMSSA. Single, simple search request UMIACS 250+ CPUs

  18. PepArML • Peptide identification arbiter by machine learning • Unifies these ideas within a model-free, combining machine learning framework • Unsupervised training procedure

  19. PepArML Overview (Diagram labels: X!Tandem, Mascot, OMSSA, Other; Feature extraction; PepArML)

  20. Dataset Construction X!Tandem Mascot OMSSA T F T …… T

  21. Voting Heuristic Combiner • Choose PSM with most votes • Break ties using FDR • Select PSM with min. FDR of tied votes • How to apply this to a decoy database? • Lots of possibilities – all imperfect • Now using: 100*#votes – min. decoy hits
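The heuristic "100*#votes – min. decoy hits" can be sketched as below. The slide does not define "decoy hits" precisely; this sketch assumes each engine reports, for its PSM, the number of decoy hits scoring at least as well in that engine, and that the minimum is taken over the engines voting for the same peptide. Names and numbers are hypothetical.

```python
from collections import defaultdict


def vote_combine(candidates):
    """Pick one peptide for a spectrum using 100 * #votes - min decoy hits.

    candidates: list of (engine, peptide, decoy_hits) tuples for one spectrum,
    where decoy_hits is the assumed per-engine decoy count described above.
    """
    decoy_hits_by_peptide = defaultdict(list)
    for _engine, peptide, decoy_hits in candidates:
        decoy_hits_by_peptide[peptide].append(decoy_hits)
    scores = {
        peptide: 100 * len(hits) - min(hits)
        for peptide, hits in decoy_hits_by_peptide.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Two engines agree on one peptide; a third proposes a different one.
candidates = [
    ("X!Tandem", "PEPTIDEK", 3),
    ("Mascot", "PEPTIDEK", 7),
    ("OMSSA", "PEPTLDEK", 0),
]
best_peptide, best_score = vote_combine(candidates)
```

The factor of 100 makes the vote count dominate, so the decoy term only breaks ties between peptides with the same number of votes (as long as decoy counts stay below 100 in this sketch).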

  22. Supervised Learning

  23. Feature Evaluation

  24. Application to Real Data • How well do these models generalize? • Different instruments • Spectral characteristics change scores • Search parameters • Different parameters change score values • Supervised learning requires • (Synthetic) experimental data from every instrument • Search results from available search engines • Training/models for all parameters × search engine sets × instruments

  25. Model Generalization

  26. Unsupervised Learning

  27. Unsupervised Learning Performance

  28. Unsupervised Learning Convergence

  29. Peptide Atlas A8_IP – LTQ

  30. OMICS 17 Protein Mix – LCQ

  31. Feature Selection (InfoGain)
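For readers unfamiliar with the term, information gain is the reduction in label entropy obtained by splitting on a feature. The sketch below uses the standard definition for discrete features; continuous PSM features (scores, mass deltas) would first need to be discretized, and nothing here reflects how the slide's results were actually computed.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(feature_values, labels):
    """Information gain of a discrete feature with respect to the labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical labels (True = correct PSM) and two toy binary features.
labels = [True, True, False, False]
informative = info_gain([1, 1, 0, 0], labels)    # feature separates the classes
uninformative = info_gain([1, 0, 1, 0], labels)  # feature tells us nothing
```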

  32. Conclusions • Combining search results from multiple engines can be very powerful • Boost both sensitivity and specificity • Running multiple search engines is hard • Statistical significance is hard • Use empirical FDR estimates...but be careful...lots of subtleties • Consensus is powerful, but fragile • Search engine quirks can destroy it • "Witnesses" are not independent
