Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center
Peptide Identification Results
• Search engines provide an answer for every spectrum...
• Can we figure out which ones to believe?
• Why is this hard?
  • Hard to determine “good” scores
  • Significance estimates are unreliable
  • Need more IDs from weak spectra
• Each search engine has its strengths... and weaknesses
• Search engines give different answers
Translation start-site correction
• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon; insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
  • Multiple significant peptide identifications
  • Observed start is consistent with Glimmer 3.0 prediction(s)
Halobacterium sp. NRC-1 ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651
Search engine scores are inconsistent!
[Figure: scatter plot comparing X!Tandem and Mascot scores for the same peptide-spectrum matches]
Common Algorithmic Framework – Different Results
• Pre-process experimental spectra
  • Charge state, cleaning, binning
• Filter peptide candidates
  • Decide which PSMs to evaluate
• Score peptide-spectrum match
  • Fragmentation modeling, dot product
• Rank peptides per spectrum
  • Retain statistics per spectrum
• Estimate E-values
  • Apply empirical or theoretical model
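The shared skeleton is easy to caricature in code. Below is a toy, self-contained Python sketch of steps 1–4 (binning, crude dot-product scoring against b/y fragment bins, per-spectrum ranking); every name and the scoring model are illustrative stand-ins, not any real engine's implementation, and the residue masses are approximate monoisotopic values:

```python
# Toy illustration of the shared search pipeline; all names and the scoring
# model are stand-ins, not any real engine's implementation.

# Approximate monoisotopic residue masses for a few amino acids.
MASSES = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
          'V': 99.06841, 'L': 113.08406, 'K': 128.09496, 'R': 156.10111}
PROTON, WATER = 1.00728, 18.01056

def preprocess(peaks, bin_width=1.0005):
    # Step 1: "cleaning, binning" -- keep the strongest intensity per m/z bin.
    binned = {}
    for mz, intensity in peaks:
        b = int(mz / bin_width)
        binned[b] = max(binned.get(b, 0.0), intensity)
    return binned

def b_y_ions(peptide):
    # Theoretical singly charged b/y fragment m/z values.
    total = sum(MASSES[a] for a in peptide) + WATER
    frags, prefix = [], 0.0
    for a in peptide[:-1]:
        prefix += MASSES[a]
        frags.append(prefix + PROTON)          # b ion
        frags.append(total - prefix + PROTON)  # y ion
    return frags

def score(peptide, binned, bin_width=1.0005):
    # Step 3: crude dot-product-style score -- summed intensity at matched bins.
    return sum(binned.get(int(mz / bin_width), 0.0) for mz in b_y_ions(peptide))

def search(peaks, candidates):
    binned = preprocess(peaks)                            # step 1
    scored = [(p, score(p, binned)) for p in candidates]  # steps 2-3
    scored.sort(key=lambda ps: ps[1], reverse=True)       # step 4: rank
    return scored  # step 5 (E-values) would fit a model to these scores
```

Every engine fills in these steps differently (peak cleaning, candidate filtering, fragmentation model, E-value model), which is exactly why the same spectrum can receive different answers.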
Comparison of search engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
[Figure: Venn diagram of confident identifications from OMSSA, Mascot, and X!Tandem; region percentages 69%, 10%, 9%, 5%, 4%, 2%, 2%]
Lots of techniques out there
• Treat search engines as black-boxes
  • Generate PSMs + scores, features
• Apply supervised machine learning to results
  • Use multiple match metrics
• Combine/refine using multiple search engines
  • Agreement suggests correctness
• Use empirical significance estimates
  • “Decoy” databases (FDR)
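On the last point, the empirical estimate is simple to state. A minimal sketch, assuming one best hit per spectrum from a concatenated target-decoy search (real tools refine this with q-values and decoy/target ratio corrections):

```python
def fdr_threshold(psms, fdr=0.10):
    """psms: list of (score, is_decoy) pairs, best hit per spectrum,
    higher score = better. Returns the lowest score threshold whose
    estimated FDR (decoy hits / accepted hits) stays <= fdr."""
    accepted_score, decoys = None, 0
    for i, (score, is_decoy) in enumerate(
            sorted(psms, key=lambda p: p[0], reverse=True), start=1):
        decoys += is_decoy
        # Decoy hits above the cutoff estimate false target hits above it.
        if decoys / i <= fdr:
            accepted_score = score
    return accepted_score
```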
Machine Learning
• Use of multiple metrics of PSM quality:
  • Precursor delta, trypsin digest features, etc.
• Requires "training" with examples
  • Different examples will change the result
  • Generalization is always the question
• Scores can be hard to "understand"
  • Difficult to establish statistical significance
• PeptideProphet's discriminant function
  • Weighted linear combination of features
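For a concrete (and deliberately oversimplified) picture of such a discriminant, here is a sketch in the spirit of PeptideProphet's weighted linear combination; the feature names and weights are invented for illustration, not PeptideProphet's actual coefficients:

```python
# Illustrative linear discriminant over PSM features; the features,
# weights, and offset below are made up for demonstration only.
FEATURES = ["xcorr", "delta_cn", "precursor_delta_ppm", "missed_cleavages"]
WEIGHTS  = [ 8.0,     6.0,       -0.05,                 -0.5]
OFFSET   = -4.0

def discriminant(psm):
    """psm: dict mapping feature name -> value; higher output = better PSM."""
    return OFFSET + sum(w * psm[f] for f, w in zip(FEATURES, WEIGHTS))

print(discriminant({"xcorr": 0.6, "delta_cn": 0.3,
                    "precursor_delta_ppm": 5.0, "missed_cleavages": 0}))
```

The difficulty the slide names follows directly: such a combined score has no natural units, so its statistical significance must be modeled separately.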
Combine / Merge Results
Threshold peptide-spectrum matches from each of two search engines:
• PSMs agree → boost specificity
• PSMs from one → boost sensitivity
• PSMs disagree → ?????
• Sometimes agreement is "lost" due to threshold...
• How much should agreement increase our confidence?
• Scores easy to "understand"
  • Difficult to establish statistical significance
• How to generalize to more engines?
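A minimal sketch of the combine/merge bookkeeping for two engines, assuming each engine's thresholded results are given as a spectrum-ID-to-peptide mapping:

```python
def combine(engine_a, engine_b):
    """Partition thresholded PSMs from two engines by agreement.
    Inputs map spectrum_id -> peptide for PSMs passing each threshold."""
    agree, a_only, b_only, disagree = {}, {}, {}, {}
    for spec in engine_a.keys() | engine_b.keys():
        pa, pb = engine_a.get(spec), engine_b.get(spec)
        if pa is not None and pa == pb:
            agree[spec] = pa           # both engines: boost specificity
        elif pb is None:
            a_only[spec] = pa          # one engine only: boost sensitivity
        elif pa is None:
            b_only[spec] = pb
        else:
            disagree[spec] = (pa, pb)  # conflict: ?????
    return agree, a_only, b_only, disagree
```

The "agreement lost at the threshold" problem is visible here: a correct PSM that passes one engine's cutoff but narrowly misses the other's lands in a_only rather than agree.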
Consensus and Meta-Search
• Multiple witnesses increase confidence
  • As long as they are independent
  • Example: Getting the story straight
• Independent "random" hits unlikely to agree
  • Agreement is indication of biased sampling
  • Example: loaded dice
• Meta-search is relatively easy
  • Merging and re-ranking is hard
  • Example: Booking a flight to Denver!
• Scores and E-values are not comparable
  • How to choose the best answer?
  • Example: Best E-value favors Tandem!
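The "loaded dice" intuition is worth making concrete: if two engines were truly independent and each picked uniformly at random among the candidate peptides for a spectrum, they would agree with probability 1/n. A back-of-envelope sketch (the candidate count here is an assumed round number, not from the slides):

```python
# If two independent engines each picked uniformly among n candidates,
# chance agreement per spectrum would be 1/n -- vanishingly small, so
# observed agreement signals shared (hopefully correct) bias.
n_candidates = 10_000  # assumed candidates per spectrum, for illustration
p_chance = 1 / n_candidates
print(f"chance agreement per spectrum: {p_chance:.4%}")  # 0.0100%
```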
Searching for Consensus
Search engine quirks can destroy consensus:
• Initial methionine loss as tryptic peptide
• Charge state enumeration or guessing
• X!Tandem's refinement mode
• Pyro-Gln, Pyro-Glu modifications
• Difficulty tracking spectrum identifiers
• Precursor mass tolerance (Da vs ppm)
Decoy searches must be identical!
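In practice this means normalizing PSMs before comparing them across engines. A hedged sketch of the kind of cleanup involved; the rules below are illustrative examples, not a complete catalog of engine quirks:

```python
import re

def normalize_peptide(pep):
    """Illustrative cleanup so peptide strings compare across engines."""
    pep = pep.upper()
    pep = re.sub(r"^[A-Z\-]?\.|\.[A-Z\-]?$", "", pep)  # strip flanks: K.PEPTIDER.S
    pep = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", pep)     # strip mod annotations
    return re.sub(r"[^A-Z]", "", pep)                  # drop residual punctuation

def normalize_spectrum_id(title):
    # Engines truncate or rewrite spectrum titles; using a stable prefix
    # (here, the first whitespace-delimited token) is one pragmatic key.
    return title.strip().split(None, 1)[0]
```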
Configuring for Consensus
Search engine configuration can be difficult:
• Correct spectral format
• Search parameter files and command-line options
• Pre-processed sequence databases
• Tracking spectrum identifiers
• Extracting peptide identifications, especially modifications and protein identifiers
Peptide Identification Meta-Search
• Simple unified search interface for: Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, Multi-Processor, Cluster, Grid
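Of these, spectrum file "chunking" is the easiest to illustrate. A minimal sketch for MGF input, assuming well-formed BEGIN IONS / END IONS blocks; the chunk size and output naming are arbitrary choices, not the tool's actual behavior:

```python
def chunk_mgf(path, spectra_per_chunk=1000):
    """Split an MGF file into chunks of at most spectra_per_chunk spectra."""
    chunk, count, index = [], 0, 0
    with open(path) as f:
        block, inside = [], False
        for line in f:
            if line.startswith("BEGIN IONS"):
                inside, block = True, [line]
            elif inside:
                block.append(line)
                if line.startswith("END IONS"):
                    inside = False
                    chunk.extend(block)
                    count += 1
                    if count == spectra_per_chunk:
                        # Flush a full chunk to its own searchable file.
                        with open(f"{path}.chunk{index}.mgf", "w") as g:
                            g.writelines(chunk)
                        chunk, count, index = [], 0, index + 1
        if chunk:  # flush any remaining spectra
            with open(f"{path}.chunk{index}.mgf", "w") as g:
                g.writelines(chunk)
```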
Peptide Identification Grid-Enabled Meta-Search
[Architecture diagram: a single, simple search request goes to the Edwards Lab scheduler (80+ CPUs; X!Tandem, KScore, OMSSA, MyriMatch, Mascot on 1 core), which dispatches over secure communication to heterogeneous compute resources: NSF TeraGrid (1000+ CPUs; X!Tandem, KScore, OMSSA) and UMIACS (250+ CPUs; X!Tandem, KScore, OMSSA). Scales easily to 250+ simultaneous searches.]
PepArML
• Peptide identification arbiter by machine learning
• Unifies these ideas within a model-free machine-learning combining framework
• Unsupervised training procedure
PepArML Overview
[Workflow diagram: spectra are searched by X!Tandem, Mascot, OMSSA, and other engines; feature extraction feeds their PSMs to the PepArML combiner]
Dataset Construction
[Diagram: X!Tandem, Mascot, and OMSSA results on training spectra, labeled true (T) or false (F), assembled into the training dataset]
Voting Heuristic Combiner
• Choose PSM with most votes
• Break ties using FDR
  • Select PSM with min. FDR of tied votes
• How to apply this to a decoy database?
  • Lots of possibilities – all imperfect
  • Now using: 100*#votes – min. decoy hits
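A sketch of this heuristic as stated on the slide; the data layout, and my readings of "min. FDR of tied votes" and "100*#votes – min. decoy hits", are assumptions about the slide's shorthand:

```python
from collections import Counter

def vote(engine_psms, decoy_hits):
    """engine_psms: {engine: (peptide, fdr)} for one spectrum;
    decoy_hits: {peptide: decoy-database hit count}."""
    votes = Counter(pep for pep, _ in engine_psms.values())
    best_fdr = {}
    for pep, fdr in engine_psms.values():
        best_fdr[pep] = min(fdr, best_fdr.get(pep, float("inf")))
    # Primary criterion: most votes; tie-break: minimum FDR among tied PSMs.
    peptide = min(votes, key=lambda p: (-votes[p], best_fdr[p]))
    # Combined score usable against the decoy database, per the slide.
    score = 100 * votes[peptide] - decoy_hits.get(peptide, 0)
    return peptide, score
```

For example, with two engines reporting PEPTIDER and one reporting PEPTIDEK, PEPTIDER wins with score 200 minus its decoy hits.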
Application to Real Data
• How well do these models generalize?
  • Different instruments
    • Spectral characteristics change scores
  • Search parameters
    • Different parameters change score values
• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters x search engine sets x instruments
Conclusions
• Combining search results from multiple engines can be very powerful
  • Boost both sensitivity and specificity
• Running multiple search engines is hard
• Statistical significance is hard
  • Use empirical FDR estimates... but be careful... lots of subtleties
• Consensus is powerful, but fragile
  • Search engine quirks can destroy it
  • "Witnesses" are not independent