Improving Sensitivity by Combining Results from Multiple Search Methodologies

Brian C. Searle
Proteome Software Inc., Portland, OR
Brian.Searle@ProteomeSoftware.com
MBI workshop on Computational Proteomics and Mass Spectrometry (January 11-14, 2005)

The Analytical Challenge
[Figure: unknown spectra for biological samples and control experiments, on Q-TOF and IonTrap instruments]

The Analytical Challenge
• Why can you only interpret half as much MS/MS data in experiments you actually care about?
• What is going on with the remaining 90% of spectra that go unidentified?

The OpenSea Approach
De Novo Sequence: YD[Cc]DD[220]GADHFTY[200]R
OpenSea Alignment: Crystallin, S (CRBS_HUMAN)
GRRYD(Cc)D(Cc)( D )(Cc)AD(FH)TY( LS )RCNS
|| | | X X X || | || | |
YD(Cc)D(D )([220])(G )AD(HF)TY([200])R

de novo Sequence: YD[Cc]DD[220]GADHFTY[200]R
Residue masses (AMU): 163-115-160-115-115-220-57-71-…
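The conversion from a de novo sequence to a chain of masses can be sketched as follows. This is a minimal illustration, not OpenSea's code: it assumes nominal (integer) residue masses, treats `[Cc]` as carbamidomethylated cysteine (160), and treats a bracketed number such as `[220]` as an uninterpreted mass gap. The names `RESIDUE_MASS`, `TOKEN`, and `denovo_to_masses` are hypothetical.

```python
import re

# Nominal (integer) residue masses in AMU.
# "Cc" is assumed to mean carbamidomethylated cysteine (103 + 57).
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
    "Cc": 160,
}

# A token is either a bracketed item ([Cc], [220]) or a single residue letter.
TOKEN = re.compile(r"\[([^\]]+)\]|([A-Z])")


def denovo_to_masses(sequence):
    """Turn a de novo sequence string into a list of nominal masses."""
    masses = []
    for bracket, letter in TOKEN.findall(sequence):
        if letter:
            masses.append(RESIDUE_MASS[letter])
        elif bracket.isdigit():
            masses.append(int(bracket))  # unexplained mass gap, kept as-is
        else:
            masses.append(RESIDUE_MASS[bracket])  # named modified residue
    return masses


masses = denovo_to_masses("YD[Cc]DD[220]GADHFTY[200]R")
print("-".join(str(m) for m in masses))
```

The first eight values reproduce the chain shown on the slide (163-115-160-115-115-220-57-71).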

de novo Sequence … YD[Cc]DD[220]GADHFTY[200]R
Residue masses (AMU): 163-115-160-115-115-220-57-71-…
Database Sequence … G-57 R-156 R-156 Y-163 D-115 C-160 D-115 C-160 D-115 C-160 A-71

Auto-Interpretation of OpenSea Results
OpenSea Alignment:
GRRYD(Cc)D(Cc)( D )(Cc)AD(FH)TY( LS )RCNS
|| | | X X X || | || | |
YD(Cc)D(D )([220])(G )AD(HF)TY([200])R
+14 AMU on either cysteine or -43 AMU on aspartic acid…
Modification lookup table suggests methylation of cysteine!
Auto-Interpretation:
GRRYD(Cc)D( CmDCc )AD(FH)TY( LS )RCNS
|| | | : || | || | |
YD(Cc)D(D[220]G)AD(HF)TY([200])R
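The lookup step can be illustrated with a short sketch. It assumes nominal masses and a small hypothetical table of mass shifts (`MOD_TABLE`); the real OpenSea table is not shown in the slides. The mismatched stretch is D[220]G on the de novo side (115 + 220 + 57 = 392) against cysteine, aspartic acid, cysteine on the database side, where one cysteine is taken here without its carbamidomethyl group.

```python
# Hypothetical table of nominal mass shifts (AMU) -> modification name.
MOD_TABLE = {
    14: "methylation",
    16: "oxidation",
    42: "acetylation",
    57: "carbamidomethylation",
    80: "phosphorylation",
}

# Nominal residue masses needed for this example; "Cc" = carbamidomethyl cysteine.
RESIDUE_MASS = {"C": 103, "Cc": 160, "D": 115, "G": 57}


def explain_gap(denovo_masses, database_residues):
    """Return (mass difference, suggested modification or None) for a
    mismatched stretch of the alignment."""
    database_total = sum(RESIDUE_MASS[r] for r in database_residues)
    delta = sum(denovo_masses) - database_total
    return delta, MOD_TABLE.get(delta)


# De novo stretch D[220]G versus database stretch C-D-Cc.
print(explain_gap([115, 220, 57], ["C", "D", "Cc"]))
```

Under these assumptions the difference is +14 AMU, which the table maps to methylation, matching the CmDCc interpretation on the slide (117 + 115 + 160 = 392).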

Spectrum Identification Overlap Between Search Methods
[Venn diagram: SEQUEST, X!Tandem, and OpenSea. Region values: 6%, 17%, 7%, 41%, 10%, 10%, 9%. Annotation: PTMs, polymorphisms]

Spectrum Identification Overlap Between Search Methods
[Venn diagram: SEQUEST, X!Tandem, and OpenSea. Region values: 6%, 17%, 7%, 41%, 10%, 10%, 9%. Annotations: neutral losses, semi-tryptic, no ladder]

Scaffold Data Compiler
• Combine SEQUEST, Mascot, X!Tandem, and OpenSea results
• Utilize spectrum clustering and noise filters to remove uninteresting spectra
• Export interesting, unidentified spectra for further analysis
[Workflow diagram, for all spectra. Stage labels: Search Wider, Drill Deeper, Remove Junk, Focus Efforts. Steps: Combine Database Searching IDs; Cluster Spectra to Previous IDs; Filter Electronic Noise; Report Interesting, Unidentified Spectra]

Combining SEQUEST and X!Tandem Scores
[Scatter plot. Axes: X!Tandem -log(E-Value) score; SEQUEST discriminant score (Peptide Prophet, ISB)]

Peptide Prophet (ISB)
[Figure: score distributions of incorrect IDs and correct IDs, with p=50% marked between them]
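The idea behind the figure is a two-component mixture: the probability that an ID is correct is the weighted density of the correct distribution divided by the sum of both weighted densities. A minimal sketch follows. The Gaussian shapes, their parameters, and the function names are placeholders chosen for illustration, not the ISB model's fitted distributions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, prior_correct,
                      correct=(3.0, 1.0), incorrect=(-1.0, 1.0)):
    """Bayes posterior that an ID with this discriminant score is correct.

    `correct` and `incorrect` are (mean, sd) pairs for the two score
    distributions; the defaults are illustrative placeholders.
    """
    pos = prior_correct * normal_pdf(score, *correct)
    neg = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return pos / (pos + neg)


for s in (-2.0, 1.0, 4.0):
    print(s, round(posterior_correct(s, prior_correct=0.5), 4))
```

With equal priors and equal widths, the two weighted curves cross midway between the means (score 1.0 here), which is where the posterior equals 50%.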

Protein Prophet (ISB)
[Diagram: network linking Peptides 1-7 to Proteins 1-8]

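[Figure: normalized distributions over NSP bin number for incorrect IDs, p(NSP|-), and correct IDs, p(NSP|+), with the log ratio log p(NSP|+)/p(NSP|-)]
For each spectrum, IDs with high NSP have their probability raised, and IDs with low NSP have it lowered.
Correct IDs have higher NSP values.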
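The adjustment implied by the two distributions is a Bayes update of the peptide probability with the likelihood of its NSP bin. A minimal sketch, assuming the bin likelihoods are already estimated (the values below are made up for illustration, and `adjust_for_nsp` is a hypothetical name):

```python
def adjust_for_nsp(p, p_nsp_given_correct, p_nsp_given_incorrect):
    """Update probability p using the likelihood of the ID's NSP bin
    under the correct (+) and incorrect (-) distributions."""
    numerator = p * p_nsp_given_correct
    return numerator / (numerator + (1.0 - p) * p_nsp_given_incorrect)


# Made-up likelihoods for a low, middle, and high NSP bin: (p(NSP|+), p(NSP|-)).
bins = {"low": (0.1, 0.6), "middle": (0.3, 0.3), "high": (0.6, 0.1)}

for name, (pos, neg) in bins.items():
    print(name, round(adjust_for_nsp(0.5, pos, neg), 3))
```

An ID in a bin where the log ratio is positive moves up, one in a bin where it is negative moves down, and a bin with equal likelihoods leaves the probability unchanged.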

[Workflow diagram, for each spectrum]
Peptide Prophet:
• Get SEQUEST IDs, calculate SEQUEST probability
• Get Mascot IDs, calculate Mascot probability
• Get X!Tandem IDs, calculate X!Tandem probability
• Get OpenSea IDs, calculate OpenSea probability
Scaffold Merge Prophet: calculate combined peptide probability
Protein Prophet: calculate protein probabilities
…

[Diagram: for each spectrum, the SEQUEST, Mascot, X!Tandem, and OpenSea identifications are linked to Peptides 1-3. Probabilities shown: p=85%, p=76%, p=54%]

[Diagram: for each spectrum, the SEQUEST, Mascot, X!Tandem, and OpenSea identifications are linked to Peptides 1-5. Probabilities shown: p=27%, p=81%, p=35%]

[Diagram: for each spectrum, the SEQUEST, Mascot, X!Tandem, and OpenSea identifications are linked to Peptides 1-8]

Protein Prophet’s NSP value (number of sibling peptides) becomes… Merge Prophet’s number of sibling programs

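[Figure: normalized distributions over NSP bin number for incorrect IDs, p(NSP|-), and correct IDs, p(NSP|+), with the log ratio log p(NSP|+)/p(NSP|-)]
For each spectrum, IDs with high NSP have their probability raised, and IDs with low NSP have it lowered.
Correct IDs have higher NSP values.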
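One way to picture the "number of sibling programs" is by analogy with Protein Prophet's NSP: for each program's ID of a spectrum, add up the probabilities of the other programs that report the same peptide. The slides do not say whether Merge Prophet sums probabilities or simply counts agreeing programs, so the sketch below is an assumption, and the peptide strings and probabilities are invented.

```python
def sibling_programs(ids):
    """ids: list of (program, peptide, probability) for one spectrum.

    Returns {program: summed probability of the other programs that
    identified the same peptide}. Summing probabilities (rather than
    counting programs) is an assumption made for this sketch.
    """
    result = {}
    for program, peptide, _ in ids:
        result[program] = sum(
            prob
            for other, other_peptide, prob in ids
            if other != program and other_peptide == peptide
        )
    return result


# Invented example: two programs agree, one reports a different peptide.
spectrum_ids = [
    ("SEQUEST", "PEPTIDEK", 0.85),
    ("Mascot", "PEPTIDEK", 0.76),
    ("X!Tandem", "OTHERPEPR", 0.54),
]
print(sibling_programs(spectrum_ids))
```

The resulting value would then be binned and used in the same kind of likelihood-ratio update that Protein Prophet applies to NSP.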

Accuracy of the Probability Combining Model
[Plot: calculated probability vs. actual probability for SEQUEST, Mascot, X!Tandem, and the combination]
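A plot of calculated against actual probability can be produced from a control experiment, where it is known which IDs are correct. The sketch below bins IDs by calculated probability and reports the observed fraction correct per bin. The function name, bin count, and data are illustrative assumptions, not taken from the talk.

```python
def calibration(calculated, is_correct, n_bins=5):
    """Return (bin_low, bin_high, observed fraction correct) for each
    non-empty bin of calculated probability."""
    bins = [[0, 0] for _ in range(n_bins)]  # [count, number correct]
    for p, ok in zip(calculated, is_correct):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[i][0] += 1
        bins[i][1] += int(ok)
    return [
        (i / n_bins, (i + 1) / n_bins, correct / count)
        for i, (count, correct) in enumerate(bins)
        if count
    ]


# Invented data: calculated probabilities and known correctness.
probs = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
truth = [False, False, True, True, True, False]
for low, high, observed in calibration(probs, truth):
    print(f"{low:.1f}-{high:.1f}: observed {observed:.2f}")
```

A well-calibrated model gives observed fractions that fall inside each bin's range, which corresponds to points lying on the diagonal of the plot.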

Percentage of QTOF Spectra Correctly Identified as Control Proteins
Identified by SEQUEST: 40%. Unknown spectra: 60%.

Percentage of QTOF Spectra Correctly Identified as Control Proteins
Identified by Scaffold: 60%. Unknown spectra: 40%.

[Workflow diagram]
• Scaffold Merge Prophet: calculate combined probability
• Protein Prophet: calculate protein probabilities
• Scaffold Cluster Prophet: find spectra similar to previously identified
• Filter electronic noise
• Report interesting, unidentified spectra

Cluster Prophet Principle
If an unidentified spectrum is 95% similar to a correctly identified spectrum… it is also considered to be identified.

Rank-Based Cluster Similarity Score
[Figure: similarity score distributions of incorrect IDs and correct IDs, with p=50% marked between them]
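The slides do not define the rank-based score, so the following is only one plausible shape for it: keep the most intense peaks, replace each intensity by a weight derived from its rank, and compare two spectra by the normalized dot product of those weights. All names, the bin width, and the `top_n` value are assumptions; only the 95% threshold comes from the principle stated above.

```python
import math


def rank_vector(peaks, top_n=20, bin_width=1.0):
    """peaks: list of (mz, intensity). Returns {mz_bin: rank weight},
    where the most intense peak gets weight top_n, the next top_n - 1, ..."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    vector = {}
    for rank, (mz, _) in enumerate(top):
        mz_bin = int(round(mz / bin_width))
        vector[mz_bin] = max(vector.get(mz_bin, 0), top_n - rank)
    return vector


def rank_similarity(a, b, top_n=20):
    """Normalized dot product of the rank-weight vectors (0.0 to 1.0)."""
    va, vb = rank_vector(a, top_n), rank_vector(b, top_n)
    dot = sum(w * vb.get(k, 0) for k, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0


def inherits_id(unidentified, identified, threshold=0.95):
    """Apply the Cluster Prophet principle with a similarity threshold."""
    return rank_similarity(unidentified, identified) >= threshold


identified = [(175.1, 100.0), (262.2, 40.0), (375.3, 80.0)]
rescaled = [(mz, 0.5 * inten) for mz, inten in identified]
print(inherits_id(rescaled, identified))
```

Because only ranks are used, rescaling all intensities leaves the score unchanged, which is one reason a rank-based measure is attractive when comparing spectra acquired at different signal levels.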

MS/MS Spectrum Filter
• Dynamic range filter removes spectra from peptides with poor/no fragmentation
• Signal to noise filter removes electronic noise
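The two filters can be sketched as simple intensity statistics. The definitions used here (dynamic range as the largest over the smallest positive intensity, noise level as the median intensity) and both thresholds are assumptions made for illustration; the talk does not give Scaffold's actual definitions.

```python
import statistics


def passes_filters(intensities, min_dynamic_range=10.0,
                   min_signal_to_noise=5.0):
    """Return True if a spectrum looks worth keeping.

    Hypothetical definitions:
      dynamic range   = max intensity / smallest positive intensity
      signal to noise = max intensity / median intensity
    """
    if len(intensities) < 2:
        return False
    noise = statistics.median(intensities)
    if noise <= 0:
        return False
    top = max(intensities)
    smallest = min(i for i in intensities if i > 0)
    dynamic_range = top / smallest
    signal_to_noise = top / noise
    return (dynamic_range >= min_dynamic_range
            and signal_to_noise >= min_signal_to_noise)


flat_noise = [5, 6, 5, 6, 5, 6]            # no peak stands out
fragmented = [1, 2, 1, 3, 2, 100, 1, 50]   # clear fragment peaks
print(passes_filters(flat_noise), passes_filters(fragmented))
```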

Percentage of QTOF Spectra Correctly Identified as Control Proteins
Identified by Scaffold: 74%. Unknown spectra: 5%. Not interesting: 21%.

Percentage of 2D-LC QTOF Spectra Correctly Identified as Lens Proteins
Identified by Scaffold: 48%. Unknown spectra: 21%. Not interesting: 31%.

The Analytical Challenge
[Figure: spectra IDed by SEQUEST vs. unknown spectra, for biological samples and control experiments, on Q-TOF and IonTrap instruments]

The Analytical Challenge
[Figure: spectra IDed by Scaffold vs. unknown spectra, for biological samples and control experiments, on Q-TOF and IonTrap instruments]
Q-TOF, control experiments: 85% more IDs, 95% comprehension
Q-TOF, biological samples: 336% more IDs, 79% comprehension
IonTrap, control experiments: 48% more IDs, 65% comprehension
IonTrap, biological samples: 227% more IDs, 75% comprehension
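The Q-TOF figures can be checked against the pie-chart percentages given earlier. The sketch below assumes that "more IDs" is the relative gain over SEQUEST alone and that "comprehension" is the identified share plus the share filtered as not interesting; both readings are inferences that happen to reproduce the slide's numbers, not definitions stated in the talk.

```python
def percent_more_ids(before_pct, after_pct):
    """Relative gain in identified spectra, in percent, rounded."""
    return round(100.0 * (after_pct - before_pct) / before_pct)


def comprehension(identified_pct, not_interesting_pct):
    """Share of spectra that are accounted for (assumed definition)."""
    return identified_pct + not_interesting_pct


# Q-TOF control experiment: SEQUEST 40% -> Scaffold 74%, 21% not interesting.
print(percent_more_ids(40, 74), comprehension(74, 21))

# 2D-LC Q-TOF lens sample: Scaffold 48%, 31% not interesting.
print(comprehension(48, 31))
```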

Conclusions
• Using Scaffold technologies, you can drill deeper and search wider using multiple database searching approaches and MS/MS spectrum clustering
• Scaffold and implementations of Peptide/Protein Prophet were written in platform-independent Java
• Scaffold will be available at ASMS 2005

Acknowledgements
OpenSea Team (OHSU): Srinivasa Nagalla, Surendra Dasari, Ashok Reddy, Larry David, Phil Wilmarth, Ashley McCormack
Contact: nagallas@ohsu.edu
Scaffold Team (Proteome Software Inc.): Mark Turner, James Brundege
Contact: Brian.Searle@ProteomeSoftware.com
