Deriving statistical models for predicting MS/MS product ion intensities

Terry Speed & Frédéric Schütz
Division of Genetics & Bioinformatics, The Walter and Eliza Hall Institute of Medical Research
In collaboration with the Joint ProteomicS Laboratory (WEHI/LICR)

Introduction
• Proteomics is critical to our understanding of cellular biological processes
• Mass Spectrometry (MS) has emerged as a key platform in proteomics for the high-throughput identification of proteins
• Sophisticated algorithms, such as Mascot or Sequest, exist for database searching of MS/MS data
• Major bottleneck: results must often be manually validated
• More robust algorithms are needed before the identification of MS/MS data can be fully automated

What is a Mass Spectrometer?
“An analytical device that determines the molecular weight of chemical compounds by separating molecular ions according to their mass-to-charge ratio (m/z)”
[Figure: ionisation, separation by m/z, then detection. A mixture of three compounds (molecular weight 600 Da, abundance 50 %; 400 Da, abundance 20 %; 300 Da, abundance 30 %) gives peaks at m/z 601, 401 and 301.]

[Figure: the same schematic when some molecules carry two charges. The spectrum shows an additional peak at m/z 201 alongside the peaks at m/z 301, 401 and 601.]
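The peak positions in the schematic follow from the mass-to-charge definition. The sketch below assumes positive-ion mode in which each charge is an added proton; the function name `mz` and the proton mass constant are illustrative and not taken from the slides.

```python
PROTON_MASS = 1.00728  # Da, approximate mass of a proton (assumed charge carrier)


def mz(molecular_weight: float, charge: int) -> float:
    """m/z of a molecule that has picked up `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (molecular_weight + charge * PROTON_MASS) / charge


# Molecular weights from the schematic; the 400 Da compound is also shown doubly charged.
for mw, z in [(600, 1), (400, 1), (300, 1), (400, 2)]:
    print(f"MW {mw} Da, {z}+ -> m/z {mz(mw, z):.1f}")
```

Under this assumption a doubly charged 400 Da molecule appears near m/z 201, which is why multiply charged ions add peaks at lower m/z.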

Tandem MS (MS/MS)
To gain structural information about the detected masses:
[Figure: one product is selected → collision with a gas → second MS: separation & detection]
• Different molecules of the same substance can split in different ways.
• In each molecule, only the pieces that retain one of the charges will be observed and present in the spectrum; the others are discarded.

How to use MS for protein identification
Peptide mass fingerprinting
[Figure: workflow: sample proteins → 2D gel → excise → digest → MS (m/z spectrum)]
• The exact protein needs to be in the database
• Works only with single protein fragmentations
Example: peaks at m/z 333, 336, 406, 448, 462, 889
The only protein in the database that would produce these peaks is MALK|CGIR|GGSRPFLR|ATSK|ASR|SDD
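The fingerprint example can be checked with a small in-silico digest. This is a sketch under assumptions the slide does not state: the enzyme is trypsin with the usual rule (cleave after K or R, but not before P), peaks are singly protonated monoisotopic masses, and the slide's values are the integer part of those masses. The function names are illustrative.

```python
import math
import re

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def mh_plus(peptide: str) -> float:
    """Monoisotopic m/z of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


protein = "MALKCGIRGGSRPFLRATSKASRSDD"
peptides = tryptic_digest(protein)
peaks = sorted(math.floor(mh_plus(p)) for p in peptides)
print(peptides)  # ['MALK', 'CGIR', 'GGSRPFLR', 'ATSK', 'ASR', 'SDD']
print(peaks)     # [333, 336, 406, 448, 462, 889]
```

The R-P bond inside GGSRPFLR is left intact by the proline rule, which is why that peptide stays in one piece.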

Tandem MS for protein identification
[Figure: workflow: 2-D gel (or 1-D gel) → in-gel digest (trypsin) → capillary column RP-HPLC (on-line; 60 min gradient) → MS analysis (ESI ion trap; positive ions form as the solvent evaporates from the original droplet) → MS data → CID of the most intense ion → MS/MS data]
CID = Collision Induced Dissociation

Example MS/MS spectrum
Tryptic fragment: Val-Phe-Gly-Lxx-Lxx-Asp-Glu-Asp-Lys (cleavage sites b2 to b8 and y2 to y8 marked)
[Figure: relative abundance (0 to 100) vs m/z (150 to 950). Labelled peaks: a2 219.0; b2 247.0; b3 304.0; b4 417.2; b5 530.2; b6 645.3; b7 774.4; b8 889.4; y2 262.1; y3 391.1; y4 506.2; y5 619.2; y6 732.2; y7 789.3; y8 936.4]
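The b and y labels in this spectrum can be related to the sequence with a short calculation. The sketch assumes singly charged fragments and monoisotopic masses, and treats Lxx as Leu (Leu and Ile have the same mass); `b_y_ions` is an illustrative helper, not something from the slides.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_y_ions(peptide: str):
    """Return ({i: m/z of b_i}, {i: m/z of y_i}) for singly charged ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i holds the first i residues; y_i holds the last i residues plus water.
    b = {i: sum(masses[:i]) + PROTON for i in range(1, n)}
    y = {i: sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b, y


b, y = b_y_ions("VFGLLDEDK")
for i in sorted(b):
    print(f"b{i} {b[i]:7.2f}    y{i} {y[i]:7.2f}")
```

The computed values fall within a few tenths of a dalton of the labelled peaks, for example b2 near 247.1 and y3 near 391.2.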

Interpretation of MS/MS data
• Direct interpretation ("de novo sequencing")
  • spectrum must be of good quality
  • the only identification method if the spectrum is not in the database
  • can give useful information (partial sequence) for database search
• General approach for database searching:
  • extract from the database all peptides that have the same mass as the precursor ion of the uninterpreted spectrum
  • compare each of them to the uninterpreted spectrum
  • select the peptide that is most likely to have produced the observed data
• MASCOT:
  • simple probabilistic model
  • calculate the probability that a peptide could have produced the given spectrum by chance

Interpretation of MS/MS data
• SEQUEST:
  • generate a predicted spectrum for each potential peptide using a simple fragmentation model (all b and y ions have the same intensity; possible losses from b and y have a lower intensity)
  • compute a "cross-correlation" score and find the best-matching peptide
  • since this operation is very time-consuming, a simpler preliminary score is used to find the 500 peptides in the database that are most likely to be the correct identification
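The general database-search approach can be sketched in a few lines. This is a toy illustration, not SEQUEST or MASCOT: the score is simply the fraction of equal-intensity predicted b and y peaks that are matched, standing in for the cross-correlation score. The candidate list, tolerances and function names are all hypothetical.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def predicted_peaks(peptide):
    """Singly charged b and y ions; the simple model gives them all equal intensity."""
    m = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(m[:i]) + PROTON for i in range(1, len(m))]
    y = [sum(m[i:]) + WATER + PROTON for i in range(1, len(m))]
    return sorted(b + y)


def shared_peak_score(observed_mz, peptide, tol=0.5):
    """Fraction of predicted peaks that have an observed peak within `tol`."""
    predicted = predicted_peaks(peptide)
    hits = sum(any(abs(o - t) <= tol for o in observed_mz) for t in predicted)
    return hits / len(predicted)


def search(observed_mz, precursor_mass, database, mass_tol=1.0):
    """Keep candidates with the precursor mass, then rank them by score."""
    candidates = [p for p in database
                  if abs(peptide_mass(p) - precursor_mass) <= mass_tol]
    return sorted(candidates,
                  key=lambda p: shared_peak_score(observed_mz, p),
                  reverse=True)


# Toy run: the "observed" spectrum is the ideal spectrum of VLSIGDGIAR.
database = ["LVSIGDGIAR", "VLSGIDGIAR", "VLSIGDGIAR", "MALKCGIR"]
observed = predicted_peaks("VLSIGDGIAR")
ranked = search(observed, peptide_mass("VLSIGDGIAR"), database)
print(ranked)
```

A real spectrum rarely contains every predicted ion at equal intensity, which is the limitation that motivates intensity prediction.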

An unusual spectrum
[Figure: MS/MS spectrum of VLSIGDGIAR, relative abundance (0 to 100) vs m/z (300 to 1000), with the y4 peak labelled]
• SEQUEST: correct sequence is not in the top 10 scoring peptides
• MASCOT: correct sequence is the 2nd scoring peptide

Intermediate conclusions
• All current MS/MS database search algorithms use a simplified fragmentation model: "peptides fragment in a uniform manner under low-energy collision induced dissociation (CID) conditions"
• This approach works well for identifying most peptides
• Several peptides exhibit fragment ions that differ greatly from this simple model
• Those peptides often yield low or insignificant scores, thus preventing a positive identification
• A better understanding of the fragmentation of peptides in the gas phase is required to build more robust search engines.

How does a peptide fragment?
• Peptides usually fragment at their amide (= peptide) bond, producing b and y ions
• ‘mobile proton’ hypothesis: cleavage is initiated by migration of the charge from the initial site of protonation
  • aRginine (a very basic residue) can sequester a proton
  • Other basic residues (Lys, His) can hinder proton mobility
• If no mobile proton is available:
  • peptide will usually fragment poorly
  • other fragmentation mechanisms take precedence
    • cleavage at Asp-Xaa = cD (which we saw two slides back)
    • cleavage at Glu-Xaa = cE
[Figure: the peptide VLSIGDGIAR]

Fragmentation example
Peptide: VFIMDNCEELIPEYLNFIR (ox on M, Pe on C)
[Figure: MS/MS spectrum with labelled ions b10, b11, y4, y5, y6, y8, y9]
• nP cleavage
• cP cleavage
Pe (pyridylethyl cysteine) = loss from C; ox = metox (methionine sulfoxide) = loss from M

Fragmentation example, II
Peptide: RVFIMDNCEELIPEYLNFIR (ox on M, Pe on C)
[Figure: MS/MS spectrum with labelled ions b6, y6, y8, y10, y11, y14, the label MDNCE, and losses -CH3SOH, -Pe and -(CH3SOH + Pe)]
• metox loss
• Pe loss
• cD cleavage
• cE cleavage
• nP cleavage
Difficult to interpret due to N- and C-terminal aRginines.

Factors influencing fragmentation
• Some factors have been known for a long time:
  • Xaa-Pro (nP) cleavage usually enhanced
  • Asp-Xaa (cD) cleavage enhanced when no mobile proton is available
• Several recent attempts to improve this knowledge
  • Concentrated only on small subsets of data
  • Breci et al.
    • database of 168 Pro-containing peptides
    • analyse fragmentation at the Xaa-Pro (nP) bond
    • most abundant ions observed when Xaa is Val, His, Asp, Ile and Leu
  • Tabb et al.
    • determined whether residues are more likely to cleave on their N-terminal rather than their C-terminal side
  • Huang et al.
    • analysis of 505 doubly-charged tryptic peptides
    • cleavage at Asp-Xaa (cD) is more prominent for peptides that also contain an internal histidine residue

Find factors influencing fragmentation
• Data:
  • about 11,000 spectra from an Ion-Trap mass spectrometer
  • identified using SEQUEST
  • manually validated to ensure correct identification
  • 5,500 unique sequences
• Preliminary calculations: Cleavage Intensity Ratios (CIR)
  CIR < 1: cleavage reduced
  CIR = 1: average cleavage
  CIR > 1: cleavage enhanced
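The slide does not spell out how a CIR is computed. A minimal sketch, assuming the ratio is the intensity at one bond divided by the average intensity over all bonds of the same peptide (which would make 1 the "average" value, as in the legend above); the function name and the example intensities are made up.

```python
def cleavage_intensity_ratios(bond_intensities):
    """Divide each bond's summed fragment intensity by the peptide's mean.

    Assumed definition: CIR = intensity at bond / mean intensity over all bonds.
    """
    if not bond_intensities:
        raise ValueError("need at least one bond intensity")
    mean = sum(bond_intensities) / len(bond_intensities)
    if mean == 0:
        raise ValueError("spectrum has no fragment intensity")
    return [x / mean for x in bond_intensities]


# Made-up intensities for a peptide with three bonds; the last one dominates.
print(cleavage_intensity_ratios([1.0, 1.0, 4.0]))  # [0.5, 0.5, 2.0]
```

With this definition the ratios of one peptide always average to 1, so values from peptides of different overall intensity can be pooled.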

Quantifying the Asp-Xaa (cD) bond cleavage
‘Relative Proton Mobility’ Scale:
• If number of Arg residues ≥ number of charges: Non-mobile
• If number of Arg, Lys & His < number of charges: Mobile
• Otherwise: Partially-mobile
Entries: average CIR (# peptides), stratified by # basic residues. Columns of the original table: Non-Mobile, Partially-Mobile, Mobile.
1+: K1 2.37 (238); R1 5.10 (126); no Mobile entry (-)
2+: K1 0.81 (358); K2 1.66 (276); R2 4.96 (92); R1 1.04 (316); K1R1 2.06 (301); H1K1 0.88 (54); H1K1R1 1.63 (79)
3+: K1R1 0.91 (37); R3 3.63 (12); K1R2 2.51 (21); R2 1.31 (24); H1K2R1 1.94 (23); H1K1R2 2.71 (10)
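The ‘relative proton mobility’ rule translates directly into code. The sketch below follows the three conditions as stated on the slide; the function name and the returned labels are illustrative.

```python
def proton_mobility(peptide: str, charge: int) -> str:
    """Classify a peptide ion on the 'relative proton mobility' scale."""
    n_arg = peptide.count("R")
    n_basic = n_arg + peptide.count("K") + peptide.count("H")
    if n_arg >= charge:
        return "non-mobile"        # every charge can be sequestered by an Arg
    if n_basic < charge:
        return "mobile"            # fewer basic residues than charges
    return "partially-mobile"


for seq, z in [("LEGLTDEINFLR", 1), ("RAELEAK", 2), ("AAAK", 2)]:
    print(seq, f"{z}+", proton_mobility(seq, z))
```

Together with the charge state (1+, 2+ or 3+), this label gives the nine strata used later when fitting the models.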

Influence on scoring
• Already known: the charge state has an influence on search scores
• Proton mobility also influences search scores
[Figure: the dashed line marks the currently accepted cut-off; spectra scoring below it are not identified without manual intervention]

Find factors influencing fragmentation, II
• Data categorized into 9 different strata, according to
  • charge state (1+, 2+ or 3+)
  • ‘relative proton mobility’ scale
• Each spectrum was individually normalised

Find factors influencing fragmentation, III
• Intensity at cleavage Xaa-Yaa is modeled by:
  log(intensity of the cleavage) = baseline cleavage intensity
    + increase/decrease due to the residue cleaved on its C-terminal side (Xaa)
    + increase/decrease due to the residue cleaved on its N-terminal side (Yaa)
    + β1 (pos) + β2 (pos²) + β3 log2(peptide length)
• where
  • intensity of the cleavage = sum of intensities of all ions (b, y, etc.) produced by cleavage at this bond
  • baseline cleavage intensity = average cleavage intensity if no factor has a special effect on fragmentation
  • increase/decrease = indicator variables
  • pos = relative position of the cleavage inside the peptide (0..1)
  • log2(peptide length) = accounts for the lower intensity, due to the normalisation process, of a given cleavage when it occurs in a longer peptide
  • β1, β2, β3 = regression coefficients
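To make the structure of the model concrete, the sketch below evaluates it for one cleavage. The coefficient values are invented purely for illustration and are not estimates from the talk; `predict_log_intensity` and the dictionary layout are assumptions.

```python
import math


def predict_log_intensity(xaa, yaa, pos, length, coef):
    """Evaluate the linear model for a cleavage between residues Xaa and Yaa.

    Residues missing from the effect tables are treated as having no effect.
    """
    return (coef["baseline"]
            + coef["c_side"].get(xaa, 0.0)   # effect of cleaving C-terminal to Xaa
            + coef["n_side"].get(yaa, 0.0)   # effect of cleaving N-terminal to Yaa
            + coef["pos"] * pos
            + coef["pos2"] * pos ** 2
            + coef["log2_length"] * math.log2(length))


# Hypothetical coefficients, chosen only to show the mechanics.
coef = {
    "baseline": 1.0,
    "c_side": {"D": 0.5},      # e.g. enhanced Asp-Xaa (cD) cleavage
    "n_side": {"P": 0.75},     # e.g. enhanced Xaa-Pro (nP) cleavage
    "pos": 0.0,
    "pos2": 0.0,
    "log2_length": -0.25,
}
print(predict_log_intensity("D", "P", pos=0.5, length=8, coef=coef))  # 1.5
```

Because the residue effects enter as indicator variables, each residue contributes a fixed shift on the log scale, independent of its neighbour.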

Find factors influencing fragmentation, IV
• Linear regression is performed to estimate the effect of each of these variables on the fragmentation process
• Variable selection: ensure that only variables that have a real effect on the fragmentation process are retained
  • for each "side" (C or N), the factor that is the closest to the average intensity is removed from the model. In other words, one of the residues of each side is selected as the reference, the residue that "does nothing"
  • backward selection is then performed to remove all variables that are not significantly different from 0 (at the 1% level)
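The regression step can be sketched with ordinary least squares on a design matrix of indicator variables. Everything below is illustrative: the data are synthetic and noise-free, the "true" coefficients are invented, and the backward selection at the 1% level is not implemented. Residues without an indicator column act as the reference level.

```python
import math
import random


def design_row(xaa, yaa, pos, length, c_levels, n_levels):
    """Intercept, indicators for non-reference residues, pos, pos^2, log2(length)."""
    return ([1.0]
            + [1.0 if xaa == r else 0.0 for r in c_levels]
            + [1.0 if yaa == r else 0.0 for r in n_levels]
            + [pos, pos ** 2, math.log2(length)])


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic cleavages generated from invented coefficients:
# [baseline, D on C-side, P on N-side, pos, pos^2, log2(length)]
true_beta = [1.0, 0.8, 1.2, 0.5, -0.5, -0.3]
random.seed(0)
X, y = [], []
for _ in range(200):
    row = design_row(random.choice("ADGP"), random.choice("ADGP"),
                     random.random(), random.randint(6, 20), ["D"], ["P"])
    X.append(row)
    y.append(sum(b * v for b, v in zip(true_beta, row)))

beta = ols(X, y)
print([round(b, 3) for b in beta])
```

On real spectra the response would be the log of the normalised cleavage intensity, and one such fit would be run per stratum.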

How to find factors influencing frag
• The regression was always significant (i.e. at least one factor was significant)
• In practice:
  • the pos and log(length) terms were always retained
  • in each regression, several residues were selected

Factors influencing fragmentation

Predicting ion intensities
• Use the same kind of linear model as before
• Fit separate models for the different types of ions that we want to predict
• Currently, only b and y ions are predicted
• Influence of residues and positional factors are taken into account for the prediction
• This (and everything before) is valid only on an Ion-Trap mass spectrometer

Prediction example: LEGLTDEINFLR, 1+
• ‘non-mobile’ peptide, which usually gives bad scores
• correlation between observed and linear model (LM) predicted spectrum: 0.97
[Figure: three panels: observed spectrum, SEQUEST prediction, prediction with LM]

Testing our predictions
• Predictions were tested on a set of 283 peptides not used for fitting the model
• correlation between predicted and observed spectrum:
  • median: 0.73
  • interquartile range: 0.27
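The evaluation can be sketched as a Pearson correlation per peptide followed by a summary over the test set. This assumes the predicted and observed intensities are aligned on the same list of ions; the numbers below are made up and the function names are illustrative.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long intensity vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarise(correlations):
    """Median and interquartile range of per-peptide correlations."""
    q1, _, q3 = statistics.quantiles(correlations, n=4)
    return statistics.median(correlations), q3 - q1


# Made-up intensities for one peptide, aligned ion by ion.
observed = [10.0, 40.0, 25.0, 5.0]
predicted = [12.0, 35.0, 30.0, 8.0]
print(round(pearson(observed, predicted), 2))

# Made-up correlations for a small test set.
median, iqr = summarise([0.1, 0.2, 0.3, 0.4, 0.5])
print(median, round(iqr, 2))
```

The interquartile range depends slightly on the quantile convention; `statistics.quantiles` uses its default "exclusive" method here.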

Testing our predictions, II
• Worst scoring peptide (correlation = -0.19): RAELEAK, doubly-charged
• Explanation
  • Most peptides in the training set are tryptic peptides
  • Proton will usually sit at the C-terminus of the peptide (K)
  • Under this assumption, y-ions are usually more intense than b-ions
  • Because of the miscleavage, the proton actually sits at the N-terminus
  • Consequently, b-ions are more intense than y-ions
  • The model performs badly
• Charge localisation should be taken into account

Ongoing work
• More known effects (e.g. charge localisation) must be taken into account in the model, plus some interactions
• Other effects, still unknown, also have an influence on the fragmentation, and should be looked for
• Predict other ion series (neutral losses, etc.)
• Test if the predictions can help discriminate between correct and incorrect identifications
• Build a new search algorithm that takes into account these predictive models

Conclusions
• Prediction of spectra is becoming feasible
• Better search algorithms are expected
• The ‘relative proton mobility’ scale helps the interpretation of database search scores
• Optimized thresholds can be used for different subsets of the data
• It should improve the sensitivity and specificity of the identification process
• These are important steps towards fully automated identification of peptide MS/MS data

Acknowledgments
• Bioinformatics, WEHI
  • Frédéric Schütz
• Dept. of Chemistry, Melbourne University
  • Richard O’Hair
• JPSL, Ludwig Institute, Melbourne
  • Eugene Kapp
  • James Eddes
  • Gavin Reid
  • Lisa Connolly
  • David Frecklington
  • Robert Moritz
  • Richard Simpson
Part of this work will appear in Analytical Chemistry
