Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards, Center for Bioinformatics and Computational Biology, University of Maryland, College Park

High quality peptide identification: E-value < 10^-8

Moderate quality peptide identification: E-value < 10^-3

Peptide Identification • Peptide fragmentation by CID is poorly understood • MS/MS spectra represent incomplete information about amino-acid sequence • I/L, K/Q, GG/N, … • Correct identifications don’t come with a certificate!
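The ambiguous pairs listed above (I/L, K/Q, GG/N) can be checked against residue masses. The sketch below uses standard monoisotopic residue masses; the helper names `residue_sum` and `mass_gap` are illustrative and not part of the talk.

```python
# Monoisotopic residue masses in daltons (standard values, six decimal places).
RESIDUE_MASS = {
    "G": 57.021464,
    "N": 114.042927,
    "I": 113.084064,
    "L": 113.084064,
    "K": 128.094963,
    "Q": 128.058578,
}

def residue_sum(seq):
    """Total residue mass of a short amino-acid string."""
    return sum(RESIDUE_MASS[aa] for aa in seq)

def mass_gap(a, b):
    """Absolute mass difference between two residue strings."""
    return abs(residue_sum(a) - residue_sum(b))

for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
    print(f"{a} vs {b}: {mass_gap(a, b):.6f} Da")
```

I and L are isomers, GG and N have the same elemental composition, and K and Q differ by roughly 0.036 Da, which is smaller than many fragment match tolerances.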

Peptide Identification • High-throughput workflows demand we analyze all spectra, all the time. • Spectra may not contain enough information to be interpreted correctly • …bad static on a cell phone • Peptides may not match our assumptions • …it’s all Greek to me • “Don’t know” is an acceptable answer!

Peptide Identification • Rank the best peptide identifications • Is the top ranked peptide correct?

Peptide Identification • Incorrect peptide has best score • Correct peptide is missing? • Potential for incorrect conclusion • What score ensures no incorrect peptides? • Correct peptide has weak score • Insufficient fragmentation, poor score • Potential for weakened conclusion • What score ensures we find all correct peptides?

Statistical Significance • Can’t prove particular identifications are right or wrong... • ...need to know fragmentation in advance! • A minimal standard for identification scores... • ...better than guessing. • p-value, E-value, statistical significance

Pin the tail on the donkey…

Probability Concepts Throwing darts • One at a time • Blindfolded • Uniform distribution? • Independent? • Identically distributed? • Pr [ Dart hits 20 ] = 0.05

Probability Concepts Throwing darts • One at a time • Blindfolded • Three darts • Pr [ Hit 20 all 3 times ] = 0.05 * 0.05 * 0.05 = 0.000125 • Pr [ Hit 20 at least twice ] = 0.007125 + 0.000125 = 0.00725

Probability Concepts

Probability Concepts Throwing darts • One at a time • Blindfolded • 100 darts • Pr [ Hit 20 exactly 3 times ] = 0.139575 • Pr [ Hit 20 at least twice ] = 0.9629188
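The dart probabilities on these slides follow the binomial distribution, assuming independent, identically distributed throws with a 0.05 chance of hitting 20. A minimal sketch for checking them (the function names are illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws, each hitting with probability p]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_tail(k, n, p):
    """Pr[at least k hits in n throws], computed as 1 minus the lower tail."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))

p = 0.05  # chance a blindfolded dart lands in the 20 slice
print("3 darts, all three hit 20:   ", binom_pmf(3, 3, p))
print("3 darts, at least two hit 20:", binom_tail(2, 3, p))
print("100 darts, exactly three hit:", binom_pmf(3, 100, p))
print("100 darts, at least two hit: ", binom_tail(2, 100, p))
```

With more darts, an outcome that was very unlikely for three throws becomes close to certain, which is the point of the comparison.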

Probability Concepts

Match Score • Dartboard represents the mass range of the spectrum • Peaks of a spectrum are “slices” • Width of slice corresponds to mass tolerance • Darts represent • random masses • masses of fragments of a random peptide • masses of peptides of a random protein • masses of biomarkers from a random class • How many darts do we get to throw?

Match Score • What is the probability that we match at least 5 peaks? [Figure: spectrum, % intensity (0 to 100) versus m/z (250 to 1000), with peaks labeled 270, 330, 550, 580, 755, 870]

Match Score • Pr [ Match ≥ s peaks ] = Binomial( p , n ) ≈ Poisson( p n ), for small p and large n • p is the probability of a random mass / peak match • n is the number of darts (fragments in our answer)
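A sketch of the match-score calculation under the dartboard model. The way p is derived here (fraction of the mass range covered by non-overlapping tolerance windows around the peaks) is an assumption made for illustration, as are the tolerance, mass range, and number of darts; the count is of darts that land on a peak, which is a simplification of "peaks matched".

```python
from math import comb, exp, factorial

def match_probability(num_peaks, tolerance, mass_range):
    """Chance that one uniformly random mass lands within +/- tolerance of a peak.
    Assumes the peak windows do not overlap (illustrative simplification)."""
    return min(1.0, num_peaks * 2.0 * tolerance / mass_range)

def binomial_tail(s, n, p):
    """Pr[at least s of n darts hit], exact binomial."""
    return 1.0 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s))

def poisson_tail(s, lam):
    """Pr[at least s hits] under the Poisson approximation with mean lam = p * n."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))

# Hypothetical numbers: 6 peaks, +/- 2 Da tolerance, 1000 Da range, 50 darts.
p = match_probability(num_peaks=6, tolerance=2.0, mass_range=1000.0)
n = 50
print("binomial Pr[match >= 5]:", binomial_tail(5, n, p))
print("Poisson  Pr[match >= 5]:", poisson_tail(5, n * p))
```

The two tails agree closely when p is small and n is large, which is the regime the slide refers to.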

Match Score Theoretical distribution • Used by OMSSA • Proposed, in various forms, by many. • Probability of random mass / peak match • IID (independent, identically distributed) • Based on match tolerance

Match Score Theoretical distribution assumptions • Each dart is independent • Peaks are not “related” • Each dart is identically distributed • Chance of random mass / peak match is the same for all peaks

Tournament Size [Figure: 100 darts, number of 20’s; panels for tournaments of 100, 1000, 10000, and 100000 people]

Number of Trials • Tournament size == number of trials • Number of peptides tried • Related to sequence database size • Probability that a random match score is ≥ s • 1 – Pr [ all match scores < s ] • 1 – Pr [ match score < s ]^Trials (*) • Assumes IID! • Expect value • E = Trials * Pr [ match ≥ s ] • Corresponds to Bonferroni bound on (*)
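The relationship between the exact probability (*) and the expect value can be sketched as follows, assuming IID trials as the slide does. The function names and the example numbers are illustrative.

```python
from math import expm1, log1p

def prob_any_match(p_single, trials):
    """(*): 1 - Pr[match score < s]^trials, i.e. the chance that at least one of
    `trials` IID random peptides scores >= s. Uses log1p/expm1 for small p_single."""
    return -expm1(trials * log1p(-p_single))

def expect_value(p_single, trials):
    """E = trials * Pr[match >= s]: the expected number of random matches scoring >= s."""
    return trials * p_single

# Hypothetical example: a 1-in-a-million random match, 10,000 candidate peptides.
p_single, trials = 1e-6, 10_000
print("Pr[any random match >= s]:", prob_any_match(p_single, trials))
print("E-value:                  ", expect_value(p_single, trials))
```

The E-value is never smaller than (*), and the two are nearly equal when the E-value is small, which is why E serves as a convenient upper bound.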

Better Dart Throwers

Better Random Models • Comparison with completely random model isn’t really fair • Match scores for real spectra with real peptides obey rules • Even incorrect peptides match with non-random structure!

Better Random Models • Want to generate random fragment masses (darts) that behave more like the real thing: • Some fragments are more likely than others • Some fragments depend on others • Theoretical models can only incorporate this structure to a limited extent.

Better Random Models • Generate random peptides • Real looking fragment masses • No theoretical model! • Must use empirical distribution • Usually require they have the correct precursor mass • Score function can model anything we like!
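When no theoretical model is available, significance can be read off the empirical distribution of scores from random peptides. A minimal sketch, assuming the random-peptide scores have already been computed by whatever score function is in use; the add-one correction is a common convention chosen here, not something stated in the talk, and the scores are made up.

```python
def empirical_p_value(random_scores, observed_score):
    """Fraction of random-peptide scores at least as good as the observed score.
    The +1 in numerator and denominator keeps the estimate above zero."""
    at_least = sum(1 for s in random_scores if s >= observed_score)
    return (1 + at_least) / (1 + len(random_scores))

# Made-up scores for random peptides with the correct precursor mass.
random_scores = [3.1, 4.0, 2.2, 5.7, 3.3, 4.8, 2.9, 3.6, 4.1, 3.0]
print(empirical_p_value(random_scores, observed_score=5.0))
print(empirical_p_value(random_scores, observed_score=9.0))
```

The smallest p-value this can report is 1 / (N + 1) for N random peptides, so very small p-values require either many samples or extrapolation from a fitted distribution.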

Better Random Models Fenyo & Beavis, Anal. Chem., 2003

Better Random Models • Truly random peptides don’t look much like real peptides • Just use peptides from the sequence database! • Caveats: • Correct peptide (non-random) may be included • Peptides are not independent • Reverse sequence avoids only the first problem

Extrapolating from the Empirical Distribution • Often, the empirical shape is consistent with a theoretical model Geer et al., J. Proteome Research, 2004 Fenyo & Beavis, Anal. Chem., 2003

False Positive Rate Estimation • Each spectrum is a chance to be right, wrong, or inconclusive. • How many decisions are wrong? • Given identification criteria: • SEQUEST Xcorr, E-value, Score, etc., plus... • ...threshold • Use “decoy” sequences • random, reverse, cross-species • Identifications must be incorrect!

False Positive Rate Estimation • # FP in real search = # hits in decoy search • Need same size database, or rate conversion • FP Rate: # decoy hits / # real hits • FP Rate: 2 x # decoy hits / (# real hits + # decoy hits)
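Both false positive rate estimates are simple ratios of hit counts above a fixed threshold. In the sketch below, reading the first formula as separate real and decoy searches and the second as one search of a combined real-plus-decoy database is an interpretation, and the `size_ratio` parameter standing in for the "rate conversion" is hypothetical.

```python
def fp_rate_separate(decoy_hits, real_hits, size_ratio=1.0):
    """# decoy hits / # real hits, for separate real and decoy searches.
    size_ratio = real database size / decoy database size (1.0 if equal)."""
    if real_hits == 0:
        raise ValueError("no real hits above threshold")
    return decoy_hits * size_ratio / real_hits

def fp_rate_combined(decoy_hits, real_hits):
    """2 x # decoy hits / (# real hits + # decoy hits), for a combined search,
    assuming each decoy hit implies one unseen false hit among the real hits."""
    total = real_hits + decoy_hits
    if total == 0:
        raise ValueError("no hits above threshold")
    return 2 * decoy_hits / total

# Hypothetical counts at one score threshold.
print(fp_rate_separate(decoy_hits=10, real_hits=1000))
print(fp_rate_combined(decoy_hits=10, real_hits=990))
```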

False Positive Rate Estimation • A form of statistical significance • In “theory”, E-value and a FP rate are the same. • Search engine independent • Easy to implement • Assumes a single threshold for all spectra • Spectrum/Peptide Identification scores are not iid!... • ...but E-values, in principle, are.

Peptide Prophet • From the Institute for Systems Biology • Keller et al., Anal. Chem. 2002 • Re-analysis of SEQUEST results • Spectra are trials • Assumes that many of the spectra are not correctly identified

Peptide Prophet Keller et al., Anal. Chem. 2002 Distribution of spectral scores in the results

Peptide Prophet • Assumes a bimodal distribution of scores, with a particular shape • Ignores database size • …but it is included implicitly • Like empirical distribution for peptide sampling, can be applied to any score function • Can be applied to any search engine’s results
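The idea of a bimodal score distribution can be illustrated with a two-component mixture: given the score density of incorrect identifications, the score density of correct ones, and the fraction of spectra correctly identified, Bayes' rule gives the probability that a particular score is correct. The sketch below uses two normal components with made-up parameters purely for illustration; it is not the component shapes or the fitting procedure used by Peptide Prophet.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return exp(-0.5 * z * z) / (sd * sqrt(2.0 * pi))

def prob_correct(score, frac_correct, correct=(3.0, 1.0), incorrect=(0.0, 1.0)):
    """Posterior probability that an identification with this score is correct.
    `correct` and `incorrect` are hypothetical (mean, sd) pairs for each mode."""
    c = frac_correct * normal_pdf(score, *correct)
    i = (1.0 - frac_correct) * normal_pdf(score, *incorrect)
    return c / (c + i)

for score in (0.0, 1.5, 3.0, 4.5):
    print(score, round(prob_correct(score, frac_correct=0.3), 4))
```

In practice the mixture parameters would be estimated from the observed scores of all spectra in the experiment, which is why the method depends on there being enough correct identifications to form a second mode.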

Peptide Prophet • Caveats • Are spectra scores sampled from the same distribution? • Are there enough correct identifications to form a second peak? • Are spectra independent observations? • Are distributions appropriately shaped? • Huge improvement over raw SEQUEST results

Peptides to Proteins Nesvizhskii et al., Anal. Chem. 2003

Peptides to Proteins

Peptides to Proteins • A peptide sequence may occur in many different protein sequences • Variants, paralogues, protein families • Separation, digestion and ionization are not well understood • Proteins in sequence database are extremely non-random, and very dependent

Publication Guidelines

Publication Guidelines • Computational parameters • Spectral processing • Sequence database • Search program • Statistical analysis • Number of peptides per protein • Each peptide sequence counts once! • Multiple forms of the same peptide count once!
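The counting rule (each peptide sequence counts once, regardless of charge state or modified form) can be sketched as below. The tuple layout, protein names, and peptide sequences are hypothetical.

```python
from collections import defaultdict

def distinct_peptides_per_protein(identifications):
    """Count distinct peptide sequences per protein.
    Each identification is (protein, sequence, charge, modification);
    charge and modification are ignored so multiple forms count once."""
    peptides = defaultdict(set)
    for protein, sequence, _charge, _modification in identifications:
        peptides[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in peptides.items()}

# Hypothetical identifications: PROT_A has one peptide seen in two charge states
# plus a second peptide; PROT_B has one peptide seen modified and unmodified.
ids = [
    ("PROT_A", "LVNELTEFAK", 2, ""),
    ("PROT_A", "LVNELTEFAK", 3, ""),
    ("PROT_A", "AEFVEVTK", 2, ""),
    ("PROT_B", "YLYEIAR", 2, "modified"),
    ("PROT_B", "YLYEIAR", 2, ""),
]
print(distinct_peptides_per_protein(ids))
```

Under this rule PROT_B is a single-peptide protein even though it has two identified spectra.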

Publication Guidelines • Single-peptide proteins must be explicitly justified by • Peptide sequence • N and C terminal amino-acids • Precursor mass and charge • Peptide Scores • Multiple forms of the peptide counted once! • Biological conclusions based on single-peptide proteins must show the spectrum

Publication Guidelines • More stringent requirements for PMF data analysis • Similar to that for tandem mass spectra • Management of protein redundancy • Peptides identified from a different species? • Spectra submission encouraged

Summary • Could guessing be as effective as a search? • More guesses improves the best guess • Better guessers help us be more discriminating • Peptide to proteins is not as simple as it seems • Publication guidelines reflect sound statistical principles.
