Improving Sensitivity by Combining Results from Multiple Search Methodologies

Improving Sensitivity by Combining Results from Multiple Search Methodologies Brian C. Searle Proteome Software Inc. Portland, OR Brian.Searle@ProteomeSoftware.com

Control Experiments Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. In this study, two different datasets were analyzed using quadrapole-time of flight (Q-TOF) and ion trap tandem mass spectrometers. These pie charts show the percent of the spectra identified by SEQUEST for a mixture of ten purified proteins combined into a known control mixture. IDed by SEQUEST Q-TOF Unknown Spectra IDed by SEQUEST Ion Trap Unknown Spectra

Biological Samples For actual biological samples, the results are similar. SEQUEST matched a small fraction of the spectra to peptides in the database. In this case the samples are soluble human lens proteins separated using two-dimensional liquid chromatography. How can we uncover the information hidden in the spectra that SEQUEST misses? Will another program like Mascot or X!Tandem do better? IDed by SEQUEST Q-TOF Unknown Spectra IDed by SEQUEST Ion Trap Unknown Spectra

Search Engine Overlap SEQUEST Each search engine identifies about the same number of spectra, But the overlap is surprisingly small. Different search engines match different spectra. 9% 22% 4% 34% X!Tandem Mascot 19% 7% 5%

Why Overlap Small considers intensities SEQUEST The reason that they identify different spectra is because each program has different strengths. 9% 22% 4% 34% X!Tandem Mascot 19% 7% 5% probability based scoring semi-tryptic, no neutral loss fragments

Scatter Plot of SEQUEST vs Tandem Each point on this plot corresponds to one spectrum from the control sample. The x-axis shows that spectrum’s SEQUEST score and the y-axis its X!Tandem score. Spectra identified as one of the 10 known proteins in this control sample are plotted in blue. All other identifications are plotted in red. A plot of Mascot against X!Tandem looks similar. X!Tandem –log(E-Value) Score SEQUEST Descriminant Score (Peptide Prophet)

Search Thresholds The current practice is to specify a threshold for SEQUEST (vertical green line) and accept all identifications above the line and reject all those below the line. Similarly for X!Tandem, a threshold is chosen (horizontal green line) that separates the good from the bad. X!Tandem –log(E-Value) Score SEQUEST Descriminant Score (Peptide Prophet)

Missed identifications below one threshold only Using only SEQUEST’s cutoff, many good identifications are missed. Picking up these “missed IDs” is an easy way to find more peptides without adding many false positive identifications. This idea, done in a more statistically rigorous way, is how Proteome Software’s Scaffold program enhances SEQUEST and Mascot to mine more information from the MS/MS spectra. Missed IDs X!Tandem –log(E-Value) Score Missed IDs SEQUEST Descriminant Score (Peptide Prophet)

Scaffold Analysis Framework • Combine SEQUEST or Mascot with X!Tandem • Use clustering and noise filters to remove uninteresting spectra • Export interesting, unidentified spectra for future analysis The focus of this presentation, the combining of SEQUEST or Mascot with X!Tandem, fits into the broader framework provided by Scaffold. Search Wider Drill Deeper Remove Junk Focus Efforts Combine Database Searching IDs Cluster Spectra to Previously IDs Report Interesting, Unidentified Spectra Filter Electronic Noise For All Spectra

Scaffold Workflow (part 1) Peptide Prophet* Protein Prophet* Get SEQUEST IDs Calculate SEQUEST Probability Calculate Combined Peptide Probability Get Mascot IDs Calculate Mascot Probability For Each Spectrum Calculate Protein Probabilities Get X!Tandem IDs Calculate X!Tandem Probability Scaffold Merger Scaffold uses another algorithm by Nesvizskii to combine peptide probabilities. … Scaffold uses Nesvizhskii’s algorithm to convert SEQUEST and Mascot scores to peptide probabilities *Nesvizhskii, A. I. et al, Anal. Chem.2003, 75, 4646-4658

Scaffold Merger Principle • The scheme that Scaffold uses to merge the probability estimates from SEQUEST or Mascot with X!Tandem is based upon applying Bayesian statistics to combine: • The probability of identifying a spectrum with • The probability of agreement between search methods. • The probability of agreement between search methods is calculated analagously to the way Nesvizhskii’s ProteinProphet algorithm calculates the probability of the number of sibling peptides. Calculate Combined Peptide Probability Scaffold Merger *Nesvizhskii, A. I. et al, Anal. Chem.2003, 75, 4646-4658

SEQUEST IDs of peptides Peptide 1 Get SEQUEST Identification p=85% p=76% Get Mascot Identification Peptide 2 For Each Spectrum Get X!Tandem Identification p=54% Peptide 3 For a given spectrum SEQUEST matches to a set of peptides. SEQUEST ranks some peptide matches as better, that is, they have a higher probability of being correct.

Mascot IDs of peptides Peptide 1 Get SEQUEST Identification Peptide 4 p=27% Get Mascot Identification Peptide 2 p=81% For Each Spectrum Peptide 5 Get X!Tandem Identification p=35% Peptide 3 If Mascot searched with the same spectrum, it will also match to a set of peptides. Again some matches are better, that is, have a higher probability of being correct.

Agreement between searches Peptide 1 Peptide 7 Get SEQUEST Identification Peptide 4 p=76% Get Mascot Identification Peptide 2 Peptide 8 p=81% For Each Spectrum p=56% Peptide 5 Get X!Tandem Identification Peptide 3 Peptide 6 Sometimes several of the search engines will match the spectrum to the same peptide. From seeing how often this happens, Scaffold calculates the probability of agreement between search engines.

Agreement score Peptide 1 Peptide 7 Get SEQUEST Identification Peptide 4 p=76% Get Mascot Identification Peptide 2 Peptide 8 p=81% For Each Spectrum p=56% Peptide 5 Get X!Tandem Identification Peptide 3 Peptide 6 Using the probabilities given by each search engine and the probability of them agreeing, a better peptide ID is made

1 0.8 0.6 Combination Model Tandem Model SEQUEST Model Mascot Model Ideal 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 Accuracy of the Probability Combining Model (Ideal=45º) The combined probability of SEQUEST, Mascot and X!Tandem was calculated for a the set of spectra from the 10 protein control sample. In this control sample counting the fraction of true identifications at every score gives the “actual probability”. The calculated probability for the combination model (blue line on the graph) matches the actual probability quite well. Mascot X!Tandem Calculated Probability Scaffold SEQUEST Actual Probability

QTOF Spectra Correctly Identified by SEQUEST as Control Proteins Identified By SEQUEST (40%) Lets see how much difference this makes. For high quality QTOF data on the control proteins, SEQUEST alone identified 40% of the spectra. Unknown Spectra (60%)

QTOF Spectra Correctly Identified by SEQUEST and Tandem as Control Proteins Identified By Scaffold (60%) When Scaffold, using the algorithm just described, combined the search by SEQUENT with a search with X!Tandem, the number of spectra identified increased to 60%. Unknown Spectra (40%)

QTOF Spectra Correctly Identified by Scaffold as Control Proteins Identified By Scaffold (73%) When Scaffold added in the results of several other search programs, the number of spectra identified increased further. Unknown Spectra (27%)

Workflow (part 2 - cluster) Scaffold starts with the algorithm described so far (the box labeled Scaffold Merger) and continues with an algorithm that sweeps up spectra that are similar to spectra confidently identified as belonging to a protein. These spectra are uninteresting because they are redundant – they don’t identify any new peptides or PMT’s. Protein Prophet Find Spectra Similar to Previously Identified Report Interesting, Unidentified Spectra Calculate Combined Probability Calculate Protein Probabilities Filter Electronic Noise Scaffold Merger Scaffold Cluster

Workflow (part 2 - filter) Next Scaffold filters out low quality spectra. These spectra are either very noisy, or belong to peptides that didn’t fragment. These low quality spectra are not worth trying to investigate further. Protein Prophet Find Spectra Similar to Previously Identified Report Interesting, Unidentified Spectra Calculate Combined Probability Calculate Protein Probabilities Filter Electronic Noise The good quality spectra left over are worth investigating further. They are likely to be interesting new proteins or modifications. Scaffold Merger Scaffold Cluster

QTOF Spectra not yet identified Identified By Scaffold (73%) Let’s look again at the control sample analyzed by the QTOF. The red section of this chart shows the spectra not yet identified. How many of these are worth further investigation? Unknown Spectra (27%)

QTOF Spectra worth investigating further Identified By Scaffold (74%) Scaffold identifies The redundant spectra (labeled “Add Clustering”), The noise (labeled “No Signal”), The “No Fragmentation” spectra. Only the “Unknown Spectra” are worth a further look. Unknown Spectra (5%) Not Interesting (21%)

Ion Trap Control with Scaffold IDed by SEQUEST IDed by Scaffold Ion Trap Unknown Spectra Unknown Spectra 65% total comprehension IDed by SEQUEST The same control sample was also run on an ion trap. Compared to SEQUEST alone, Scaffold identifies a much larger fraction of the ion trap spectra. This shows that Scaffold’s scheme for identifying more spectra and focusing attention for further investigation only on good quality spectra works for both QTOFs and ion traps.

QTOF Biological Samples with Scaffold IDed by SEQUEST Alone IDed by SEQUEST IDed by Scaffold Q-TOF Unknown Spectra 79% total comprehension The analysis of the QTOF spectra from our soluble lens protein samples demonstrates that Scaffold enhances the understanding of real biological samples as well as control samples.

Ion Trap Biological Samples with Scaffold IDed by SEQUEST Alone IDed by SEQUEST IDed by Scaffold Ion Trap Unknown Spectra 75% total comprehension This improvement in the analysis is also true for the lens samples run on an ion trap.

More spectra for every protein Scaffold finds more spectra for essentially every protein in the sample. The samples shown here are from an ion trap analysis of saliva. The red bars show the number of spectra matched with 95% confidence by SEQUEST to each of the 120 proteins in the saliva sample. The blue bars show the additional spectra identified for each protein by Scaffold. In this case Scaffold combined the searches by SEQUEST and X!Tandem. Number of High Quality (95%) Spectrum Matches Proteins (sorted by # spectrum matches)

Scaffold Increases Number of Protein Found in Saliva By focusing on the portion of this graph with fewer spectra matching each protein, we can easily see there are many proteins with only blue bars. These proteins were not identified by SEQUEST. They were identified (95% confidence) by Scaffold. Scaffold found 43% more proteins in this saliva sample than SEQUEST alone did. Number of High Quality (95%) Spectrum Matches Proteins (sorted by # spectrum matches)

Acknowledgements • Proteome Software Inc. • Mark Turner • James Brundege • Ashley McCormack • Oregon Health & Sciences University • Srinivasa Nagalla (Control Dataset) • Larry David (Lens Dataset) • Phil Wilmarth (Saliva Dataset)

Improving Sensitivity by Combining Results from Multiple Search Methodologies

Improving Sensitivity by Combining Results from Multiple Search Methodologies

Presentation Transcript

Improving Sensitivity by Combining Results from Multiple Search Methodologies

Your Search Returned 0 Results: Improving Digital Library Search Tools

Multiple Chemical Sensitivity

Improving Search

Combining Multiple References

Intent Mining from Search Results

Chapter 4 Search Methodologies

Combining Multiple Classifiers

Combining options for commitments: results from modelling exercises

Improving Web Search Results Using Affinity Graph

Improving Web Search Results Using Affinity Graph

Combining prevalence estimates from multiple sources

Multiple Chemical Sensitivity

Combining Resources: Taxonomy Extraction from Multiple Dictionaries

Combining Multiple Representations on the TRECVID Search Task

*Search results from clinicaltrials

Artificial Intelligence Search Methodologies

Search Engine Result Combining

Artificial Intelligence Search Methodologies

Search Methodologies

Search Methodologies