Statistical Significance for Peptide Identification by Tandem Mass Spectrometry. Nathan Edwards, Center for Bioinformatics and Computational Biology, University of Maryland, College Park. High Quality Peptide Identification: E-value < 10^-8.
Peptide Identification • Peptide fragmentation by CID is poorly understood • MS/MS spectra represent incomplete information about amino-acid sequence • I/L, K/Q, GG/N, … • Correct identifications don’t come with a certificate!
Peptide Identification • High-throughput workflows demand we analyze all spectra, all the time. • Spectra may not contain enough information to be interpreted correctly • …bad static on a cell phone • Peptides may not match our assumptions • …it’s all Greek to me • “Don’t know” is an acceptable answer!
Peptide Identification • Rank the best peptide identifications • Is the top ranked peptide correct?
Peptide Identification • Incorrect peptide has best score • Correct peptide is missing? • Potential for incorrect conclusion • What score ensures no incorrect peptides? • Correct peptide has weak score • Insufficient fragmentation, poor score • Potential for weakened conclusion • What score ensures we find all correct peptides?
Statistical Significance • Can’t prove particular identifications are right or wrong... • ...need to know fragmentation in advance! • A minimal standard for identification scores... • ...better than guessing. • p-value, E-value, statistical significance
Probability Concepts Throwing darts • One at a time • Blindfolded • Uniform distribution? • Independent? • Identically distributed? Pr [ Dart hits 20 ] = 0.05
Probability Concepts Throwing darts • One at a time • Blindfolded • Three darts Pr [Hitting 20 3 times] = 0.05 * 0.05 * 0.05 Pr [Hit 20 at least twice] = 0.007125 + 0.000125
Probability Concepts Throwing darts • One at a time • Blindfolded • 100 darts Pr [Hitting 20 3 times] = 0.139575 Pr [Hit 20 at least twice] = 0.9629188
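The dart probabilities on the last two slides follow from the binomial distribution, assuming each throw is independent and hits 20 with probability 0.05. A minimal sketch that reproduces them (the function names are illustrative, not from the slides):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws, each hitting with probability p]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_tail(k, n, p):
    """Pr[at least k hits in n throws]: one minus the probability of fewer than k hits."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))

p20 = 0.05  # Pr[Dart hits 20], as stated on the slides

three_of_3 = binom_pmf(3, 3, p20)            # 0.05 * 0.05 * 0.05
at_least_2_of_3 = binom_tail(2, 3, p20)      # 0.007125 + 0.000125
three_of_100 = binom_pmf(3, 100, p20)        # exactly three 20's in 100 darts
at_least_2_of_100 = binom_tail(2, 100, p20)  # two or more 20's in 100 darts

print(three_of_3, at_least_2_of_3, three_of_100, at_least_2_of_100)
```

With only three darts, two or more 20's is a rare event; with 100 darts it is nearly certain. The same score means something very different depending on how many darts were thrown.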
Match Score • Dartboard represents the mass range of the spectrum • Peaks of a spectrum are “slices” • Width of slice corresponds to mass tolerance • Darts represent • random masses • masses of fragments of a random peptide • masses of peptides of a random protein • masses of biomarkers from a random class • How many darts do we get to throw?
Match Score • What is the probability that we match at least 5 peaks? [Figure: spectrum, % intensity (0 to 100) against m/z (250 to 1000); labeled values 270, 330, 550, 580, 755, 870]
Match Score • Pr [ Match ≥ s peaks ] is the upper tail of Binomial( n , p ) ≈ Poisson( p n ), for small p and large n • p is the probability of a random mass / peak match • n is the number of darts (fragments in our answer)
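A rough sketch of how the binomial tail and its Poisson approximation compare. All numbers are hypothetical, and the way p is derived here (total width of the peak "slices" divided by the mass range) is an assumption based on the dartboard analogy, not a formula given on the slides:

```python
from math import comb, exp, factorial

def binomial_tail(s, n, p):
    """Pr[at least s matches] when each of n darts matches with probability p."""
    return 1.0 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s))

def poisson_tail(s, lam):
    """Pr[at least s matches] under a Poisson distribution with mean lam = p * n."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))

# Hypothetical spectrum and match settings (assumptions for illustration only).
n_peaks = 6          # peaks ("slices") in the spectrum
tolerance = 0.5      # mass tolerance; each slice is 2 * tolerance wide
mass_range = 1000.0  # width of the dartboard
p = n_peaks * 2 * tolerance / mass_range  # chance one random mass lands in a slice
n = 20               # darts: fragments of the candidate peptide
s = 2                # matches observed

exact = binomial_tail(s, n, p)
approx = poisson_tail(s, p * n)
print(f"p = {p:.4f}, binomial tail = {exact:.6f}, Poisson tail = {approx:.6f}")
```

Computing a tail as one minus a sum loses precision when the tail is extremely small, so a production implementation would sum the upper tail directly or work in log space.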
Match Score Theoretical distribution • Used by OMSSA • Proposed, in various forms, by many. • Probability of random mass / peak match • IID (independent, identically distributed) • Based on match tolerance
Match Score Theoretical distribution assumptions • Each dart is independent • Peaks are not “related” • Each dart is identically distributed • Chance of random mass / peak match is the same for all peaks
Tournament Size [Figure: four panels (100, 1000, 10000, and 100000 people), each showing the number of 20’s hit with 100 darts]
Number of Trials • Tournament size == number of trials • Number of peptides tried • Related to sequence database size • Probability that a random match score is ≥ s • 1 – Pr [ all match scores < s ] • 1 – Pr [ match score < s ]^Trials (*) • Assumes IID! • Expect value • E = Trials * Pr [ match ≥ s ] • Corresponds to Bonferroni bound on (*)
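A small sketch of the two quantities on this slide: the probability that at least one of the trials reaches score s, which is one minus the single-trial probability of staying below s raised to the number of trials, and the expect value E. The single-trial probability is a hypothetical number, and the trial counts mirror the tournament sizes:

```python
def prob_any_random_match(p_single, trials):
    """1 - Pr[match score < s]^Trials, assuming the trials are IID."""
    return 1.0 - (1.0 - p_single) ** trials

def expect_value(p_single, trials):
    """E = Trials * Pr[match >= s], the Bonferroni upper bound on the probability above."""
    return trials * p_single

p_single = 1e-6  # hypothetical Pr[match >= s] for a single random peptide
results = {}
for trials in (100, 1000, 10000, 100000):
    results[trials] = (prob_any_random_match(p_single, trials),
                       expect_value(p_single, trials))
    print(trials, results[trials])
```

For small E the two numbers nearly coincide. As the number of trials grows, E keeps rising and can exceed 1, while the probability stays below 1.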
Better Random Models • Comparison with completely random model isn’t really fair • Match scores for real spectra with real peptides obey rules • Even incorrect peptides match with non-random structure!
Better Random Models • Want to generate random fragment masses (darts) that behave more like the real thing: • Some fragments are more likely than others • Some fragments depend on others • Theoretical models can only incorporate this structure to a limited extent.
Better Random Models • Generate random peptides • Real looking fragment masses • No theoretical model! • Must use empirical distribution • Usually require they have the correct precursor mass • Score function can model anything we like!
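With no theoretical model, significance comes straight from the empirical distribution. A minimal sketch, assuming the random peptides (with the correct precursor mass) have already been generated and scored against the spectrum; the scores below are made up for illustration:

```python
def empirical_p_value(observed_score, random_scores):
    """Fraction of random-peptide scores that are at least as good as the observed score."""
    if not random_scores:
        raise ValueError("need at least one random score")
    hits = sum(1 for score in random_scores if score >= observed_score)
    return hits / len(random_scores)

# Hypothetical scores of random peptides against one spectrum.
random_scores = [3, 5, 4, 7, 2, 6, 5, 3, 4, 9]

print(empirical_p_value(7, random_scores))   # two of ten random scores reach 7
print(empirical_p_value(12, random_scores))  # none reach 12, so the estimate is 0
```

An estimate of 0 only means that no random peptide scored that well in this sample. The smallest nonzero p-value this approach can report is one over the number of random peptides scored.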
Better Random Models Fenyo & Beavis, Anal. Chem., 2003
Better Random Models • Truly random peptides don’t look much like real peptides • Just use peptides from the sequence database! • Caveats: • Correct peptide (non-random) may be included • Peptides are not independent • Reverse sequence avoids only the first problem
Extrapolating from the Empirical Distribution • Often, the empirical shape is consistent with a theoretical model Geer et al., J. Proteome Research, 2004 Fenyo & Beavis, Anal. Chem., 2003
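One way to extrapolate beyond the observed random scores is to fit a simple curve to the empirical survival function and read it off at the score of interest. The sketch below assumes, purely for illustration, that the log of the number of random scores ≥ s falls off linearly in s; it is not taken from either cited paper, and the data are constructed so that the assumption holds exactly:

```python
from math import log10

def survival_counts(scores):
    """For each distinct score s, the number of scores >= s."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(s, n - i) for i, s in enumerate(ordered)
            if i == 0 or s != ordered[i - 1]]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def extrapolated_expect(observed_score, scores):
    """Expected number of random scores >= observed_score, from a log-linear fit."""
    points = survival_counts(scores)
    slope, intercept = fit_line([s for s, _ in points],
                                [log10(c) for _, c in points])
    return 10 ** (slope * observed_score + intercept)

# Hypothetical random scores: the count of scores >= s is 1000, 100, 10, 1 for s = 0..3.
scores = [0] * 900 + [1] * 90 + [2] * 9 + [3]
print(extrapolated_expect(6, scores))  # well beyond the best random score of 3
```

In practice the fit would usually be restricted to the high-scoring tail, and the result is only as good as the assumed shape.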
False Positive Rate Estimation • Each spectrum is a chance to be right, wrong, or inconclusive. • How many decisions are wrong? • Given identification criteria: • SEQUEST Xcorr, E-value, Score, etc., plus... • ...threshold • Use “decoy” sequences • random, reverse, cross-species • Identifications must be incorrect!
False Positive Rate Estimation • # FP in real search = # hits in decoy search • Need same size database, or rate conversion • FP Rate: # decoy hits / # real hits • FP Rate: 2 x # decoy hits / (# real hits + # decoy hits)
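Both false positive rate formulas are easy to compute once hits above a threshold have been counted. In this sketch the first formula is applied to separate real and decoy searches of same-size databases. The second reads naturally as applying to a single search of a combined real + decoy database, but that reading is an interpretation and the slide does not say so. Scores and threshold are hypothetical:

```python
def count_hits(scores, threshold):
    """Number of identifications whose score meets the threshold."""
    return sum(1 for score in scores if score >= threshold)

def fp_rate_separate(decoy_hits, real_hits):
    """# decoy hits / # real hits."""
    if real_hits == 0:
        raise ValueError("no real hits at this threshold")
    return decoy_hits / real_hits

def fp_rate_combined(decoy_hits, real_hits):
    """2 x # decoy hits / (# real hits + # decoy hits)."""
    total = real_hits + decoy_hits
    if total == 0:
        raise ValueError("no hits at this threshold")
    return 2 * decoy_hits / total

# Hypothetical best scores per spectrum from a real search and a decoy search.
real_scores = [9.1, 7.4, 3.2, 8.8, 2.5, 6.9, 7.7, 1.9, 5.5, 8.1]
decoy_scores = [2.1, 3.0, 1.4, 6.2, 2.8, 1.1, 3.3, 2.2, 0.9, 2.6]
threshold = 6.0

real_hits = count_hits(real_scores, threshold)
decoy_hits = count_hits(decoy_scores, threshold)
print(real_hits, decoy_hits, fp_rate_separate(decoy_hits, real_hits))
```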
False Positive Rate Estimation • A form of statistical significance • In “theory”, E-value and a FP rate are the same. • Search engine independent • Easy to implement • Assumes a single threshold for all spectra • Spectrum/Peptide Identification scores are not iid!... • ...but E-values, in principle, are.
Peptide Prophet • From the Institute for Systems Biology • Keller et al., Anal. Chem. 2002 • Re-analysis of SEQUEST results • Spectra are trials • Assumes that many of the spectra are not correctly identified
Peptide Prophet Keller et al., Anal. Chem. 2002 Distribution of spectral scores in the results
Peptide Prophet • Assumes a bimodal distribution of scores, with a particular shape • Ignores database size • …but it is included implicitly • Like empirical distribution for peptide sampling, can be applied to any score function • Can be applied to any search engine’s results
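To make the bimodal idea concrete, here is a sketch of turning a two-component score distribution into a probability that an identification is correct, using Bayes' rule. The Gaussian shapes and every parameter value are assumptions chosen for illustration; they are not Peptide Prophet's actual model or fitted values, and the fitting step itself is omitted:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def prob_correct(score, frac_correct, correct, incorrect):
    """Pr[correct | score] for a mixture of a 'correct' and an 'incorrect' component.

    correct and incorrect are (mean, sd) pairs; frac_correct is the mixing weight
    of the correct component.
    """
    a = frac_correct * normal_pdf(score, *correct)
    b = (1 - frac_correct) * normal_pdf(score, *incorrect)
    return a / (a + b)

# Hypothetical mixture: most spectra are not correctly identified.
frac_correct = 0.3
correct = (4.0, 1.0)    # (mean, sd) of scores for correct identifications
incorrect = (0.0, 1.0)  # (mean, sd) of scores for incorrect identifications

for score in (0.0, 2.0, 4.0, 6.0):
    print(score, round(prob_correct(score, frac_correct, correct, incorrect), 4))
```

Database size never appears explicitly. It enters only through where the incorrect component sits and how large its share of the mixture is.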
Peptide Prophet • Caveats • Are spectra scores sampled from the same distribution? • Are there enough correct identifications for second peak? • Are spectra independent observations? • Are distributions appropriately shaped? • Huge improvement over raw SEQUEST results
Peptides to Proteins Nesvizhskii et al., Anal. Chem. 2003
Peptides to Proteins • A peptide sequence may occur in many different protein sequences • Variants, paralogues, protein families • Separation, digestion and ionization are not well understood • Proteins in sequence database are extremely non-random, and very dependent
Publication Guidelines • Computational parameters • Spectral processing • Sequence database • Search program • Statistical analysis • Number of peptides per protein • Each peptide sequence counts once! • Multiple forms of the same peptide count once!
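The counting rule can be sketched as follows: collapse identifications to distinct peptide sequences per protein, so that different charge states or modified forms of one peptide count once. The record layout and all values are hypothetical:

```python
from collections import defaultdict

def peptides_per_protein(identifications):
    """Count distinct peptide sequences per protein.

    Each identification is a (protein, sequence, charge, modification) tuple;
    charge and modification are ignored so multiple forms count once.
    """
    sequences = defaultdict(set)
    for protein, sequence, _charge, _modification in identifications:
        sequences[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in sequences.items()}

# Hypothetical identifications.
identifications = [
    ("PROT_A", "LVNELTEFAK", 2, None),
    ("PROT_A", "LVNELTEFAK", 3, None),         # same peptide, different charge
    ("PROT_A", "AEFVEVTK", 2, None),
    ("PROT_B", "MDGSVLR", 2, None),
    ("PROT_B", "MDGSVLR", 2, "oxidation(M)"),  # same peptide, modified form
]

counts = peptides_per_protein(identifications)
print(counts)
```

Here PROT_B ends up as a single-peptide protein despite two identifications.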
Publication Guidelines • Single-peptide proteins must be explicitly justified by • Peptide sequence • N and C terminal amino-acids • Precursor mass and charge • Peptide Scores • Multiple forms of the peptide counted once! • Biological conclusions based on single-peptide proteins must show the spectrum
Publication Guidelines • More stringent requirements for PMF data analysis • Similar to that for tandem mass spectra • Management of protein redundancy • Peptides identified from a different species? • Spectra submission encouraged
Summary • Could guessing be as effective as a search? • More guesses improve the best guess • Better guessers help us be more discriminating • Peptides to proteins is not as simple as it seems • Publication guidelines reflect sound statistical principles.