INF380 – Proteomics Chapter 9 – Identification and characterization by MS/MS

• The MS/MS identification problem can be formulated as follows:
  • Given a set of MS/MS spectra R={R1,...,Rn},
  • resulting from the peptides P={P1,...,Pn} (generally with known m/z),
  • and a set of protein database sequences D={D1,...,Dm},
  • find (identify) which of the sequences in D the peptides come from, if any.
• For each peptide we have four alternatives:
  • The peptide comes from a protein whose sequence is in D.
  • The peptide comes from a protein homologous to a sequence in D.
  • The peptide comes from a "unique" unknown protein, with no homologue in D.
  • Each of the cases above may include modifications: the spectrum may come from a modified peptide.
• In principle, the search is performed by comparing an experimental spectrum to each segment in the database and identifying the segment(s) that match the spectrum best.
• As there may be millions of segments in a sequence database, some form of filtering should be used. The precursor mass is one such filter; others are described in the following chapters.
INF380 - Proteomics-9

Three different approaches
• One way of classifying the approaches to protein identification is by how the basic comparison (comparing an experimental spectrum R to a segment S) is performed.
• We divide the methods into three approaches:
  • Spectral (compare spectra): construct a theoretical spectrum T of the segment, and compare the experimental spectrum to the theoretical one.
  • Sequential (compare sequences): perform de novo sequencing of the spectrum to obtain a derived sequence V, then compare the derived sequence to the segment.
  • Threading (compare spectrum to segment): either the spectrum is "threaded" onto the segment, or the reverse. This approach includes a few special methods. It will become clear when learning these methods, however, that they have much in common with either the spectral or the sequential ones.
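The spectral approach can be illustrated with a minimal sketch. It assumes singly charged b- and y-ions only, monoisotopic residue masses, and a simple shared-peak count as the score; the function names and the tolerance of 0.5 are illustrative choices, not part of any particular search program.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def theoretical_spectrum(segment):
    """Return sorted m/z values of the singly charged b- and y-ions of a segment."""
    masses = [RESIDUE_MASS[a] for a in segment]
    total = sum(masses)
    peaks = []
    prefix = 0.0
    for m in masses[:-1]:  # b1..b(n-1) and the complementary y-ions
        prefix += m
        peaks.append(prefix + PROTON)                  # b-ion
        peaks.append(total - prefix + WATER + PROTON)  # y-ion
    return sorted(peaks)


def shared_peak_count(experimental, theoretical, tol=0.5):
    """Count experimental peaks that lie within tol of some theoretical peak."""
    return sum(any(abs(e - t) <= tol for t in theoretical) for e in experimental)


def best_segment(experimental, segments, tol=0.5):
    """Return the segment whose theoretical spectrum T shares most peaks with R."""
    return max(
        segments,
        key=lambda s: shared_peak_count(experimental, theoretical_spectrum(s), tol),
    )


# Usage: a noise-free "experimental" spectrum simulated from LICDVTR.
R = theoretical_spectrum("LICDVTR")
print(best_segment(R, ["LICDVTK", "RTVDCIL", "LICDVTR"]))
```

Real scoring schemes also weight peak intensities and account for noise peaks; the shared-peak count is used here only because it is the simplest possible comparison.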

Three different approaches
• For all approaches, scoring the matches is essential, and for MS/MS comparison the scoring scheme depends strongly on the matching approach.
• The problem can now be formulated as:
  • given a set of MS/MS spectra R={R1,...,Rn}
  • and a set of segments (theoretical peptides) S={S1,...,Sm},
  • find the segment(s) that best match each spectrum.
• We can consider two types of (spectrum, segment) comparison:
  • Straight comparison means finding the segment(s) in S that best match a given spectrum when no transformations are considered.
  • Transformed comparison means finding the segment(s) that best match a given spectrum when at most k operations have been performed on the peptide from which the spectrum originates. Operations are all types of modifications and mutations (substitutions, insertions and deletions).
• The simplest case is when k=1 and the mass of the transformation is known.
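The simplest transformed comparison (k=1, known mass) can be sketched as a straight comparison followed by one extra comparison per residue position. The sketch below assumes singly charged b- and y-ions, a shared-peak count as the score, and a mass table restricted to the residues of the example peptide; all names are hypothetical.

```python
# Monoisotopic residue masses in Da; only the residues used in the example.
RESIDUE_MASS = {
    "L": 113.08406, "I": 113.08406, "C": 103.00919, "D": 115.02694,
    "V": 99.06841, "T": 101.04768, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728
PHOSPHO = 79.96633  # mass added by one phosphorylation


def fragment_ions(masses):
    """Singly charged b- and y-ion m/z values for a list of residue masses."""
    total = sum(masses)
    peaks, prefix = [], 0.0
    for m in masses[:-1]:
        prefix += m
        peaks.append(prefix + PROTON)
        peaks.append(total - prefix + WATER + PROTON)
    return sorted(peaks)


def score(experimental, masses, tol=0.5):
    """Shared-peak count between the spectrum and the ions of the given masses."""
    theoretical = fragment_ions(masses)
    return sum(any(abs(e - t) <= tol for t in theoretical) for e in experimental)


def best_single_transformation(experimental, segment, delta, tol=0.5):
    """Return (score, position): position is None for the straight comparison,
    otherwise the 1-based residue carrying the mass shift delta."""
    base = [RESIDUE_MASS[a] for a in segment]
    results = [(score(experimental, base, tol), None)]  # k = 0
    for i in range(len(base)):                          # k = 1
        shifted = base.copy()
        shifted[i] += delta
        results.append((score(experimental, shifted, tol), i + 1))
    return max(results, key=lambda r: r[0])


# Usage: simulate a spectrum of LICDVTR phosphorylated at residue 4.
modified = [RESIDUE_MASS[a] for a in "LICDVTR"]
modified[3] += PHOSPHO
print(best_single_transformation(fragment_ions(modified), "LICDVTR", PHOSPHO))
```

For larger k the number of shifted variants grows combinatorially, which is one reason the transformed comparison is harder than the straight one.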

Effect of operations (modifications - mutations) on spectra
• A modification at residue i means that there is a mass shift in the b-series for ions b_i to b_{n-1}, and in the y-series for ions y_{n-i+1} to y_{n-1}, where n is the number of residues in the peptide.
• Consider a complete spectrum R_P of peptide P. Complete means that all ions of the considered fragment types are produced.
• Consider also another spectrum R_PM, where P is modified at residue i.
• Then the b-ions of R_P become equal to the b-ions of R_PM when b_i,...,b_{n-1} are shifted by a distance corresponding to the mass of the modification.
• A modification leaves the number of peaks unchanged.
• A substitution has the same effect as a modification, since it changes the mass of one residue.
• Insertions and deletions also shift peaks, but in addition they change the number of peaks.
• This means that spectra can be compared with modifications/mutations taken into account by considering shifts of some of the peaks, and also the removal or insertion of peaks.

Effect of operations (modifications - mutations) on spectra
• Consider the spectrum in the figure from the unmodified peptide LICDVTR.
• Assume a phosphorylation of D; then the peaks for b5 and y5 are shifted 80 units to the right, as shown in b).
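The shift rule can be checked numerically on LICDVTR. The sketch below assumes singly charged b- and y-ions and monoisotopic masses, and takes D to be residue 4; the helper names are hypothetical. Under the rule stated earlier (b_i to b_{n-1} and y_{n-i+1} to y_{n-1}), the shifted ions are b4 to b6 and y4 to y6, which includes the b5 and y5 peaks mentioned above.

```python
# Monoisotopic residue masses in Da; only the residues of LICDVTR.
RESIDUE_MASS = {
    "L": 113.08406, "I": 113.08406, "C": 103.00919, "D": 115.02694,
    "V": 99.06841, "T": 101.04768, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728
PHOSPHO = 79.96633  # roughly 80 Da


def ion_ladder(masses):
    """Return {'b1': m/z, ..., 'y1': m/z, ...} for singly charged b- and y-ions."""
    n = len(masses)
    ions = {}
    for k in range(1, n):
        ions[f"b{k}"] = sum(masses[:k]) + PROTON
        ions[f"y{k}"] = sum(masses[n - k:]) + WATER + PROTON
    return ions


def shifted_ions(peptide, position, delta):
    """Names of the ions whose m/z changes when residue `position` (1-based)
    gains the mass `delta`."""
    base = [RESIDUE_MASS[a] for a in peptide]
    modified = base.copy()
    modified[position - 1] += delta
    before, after = ion_ladder(base), ion_ladder(modified)
    return sorted(name for name in before if abs(after[name] - before[name]) > 1e-6)


print(shifted_ions("LICDVTR", 4, PHOSPHO))

# A deletion changes the number of peaks as well: one b-ion and one y-ion fewer.
full = ion_ladder([RESIDUE_MASS[a] for a in "LICDVTR"])
deleted = ion_ladder([RESIDUE_MASS[a] for a in "LICVTR"])
print(len(full), len(deleted))
```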

Comparison including modifications
• Most identification methods that take modifications/mutations into account restrict the search to a few specified types and ignore the others.
• The search can then be performed as (several) straight comparisons, by "changing" the segments accordingly.
• The possibility of methylation of cysteines can, for example, be handled by comparing the spectrum to segments using the ordinary mass for cysteine, and also to segments where one or several of the cysteines have increased masses.
• Such straightforward searching is inefficient, however, and more intelligent search methods have been developed.
• In recent years, a couple of methods have been developed that are able to identify unanticipated modifications.
• This is called blind PTM identification (PTM = post-translational modification): the methods operate in blind mode, without any modification being specified before the search.
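Why the straightforward approach is inefficient can be seen by enumerating the "changed" segments. The sketch below is a hypothetical helper, assuming a single modification type with a known mass increase (here methylation, about 14.016 Da); it lists every choice of modified cysteines, each of which would require its own straight comparison.

```python
from itertools import combinations

METHYL = 14.01565  # mass added by one methylation (CH2)


def modified_variants(segment, residue, delta, max_mods=None):
    """Yield (modified_positions, extra_mass) for every way of modifying
    up to max_mods occurrences of `residue` in `segment` (positions 1-based)."""
    sites = [i + 1 for i, a in enumerate(segment) if a == residue]
    limit = len(sites) if max_mods is None else min(max_mods, len(sites))
    for k in range(limit + 1):
        for chosen in combinations(sites, k):
            yield chosen, k * delta


# Usage: a segment with two cysteines gives 2^2 = 4 variants to compare.
for positions, extra in modified_variants("ACDCK", "C", METHYL):
    print(positions, round(extra, 5))
```

With c modifiable residues in a segment there are 2^c variants for one modification type alone, so limiting the number of modifications per segment (the `max_mods` parameter here) is a common way to keep the search tractable.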

Filtering and organization of the database
• In principle, a spectrum is compared to all segments in the database used, which in practice would imply too many comparisons.
• Therefore some form of filtering is necessary, meaning that only some of the segments are compared to the spectrum.
• Such filtering involves a trade-off between two goals: not filtering out the correct segment, and filtering out as many of the incorrect ones as possible.
• The most common filtering techniques use the precursor mass and/or the protease used for digestion. The filter can allow for modifications/mutations: either specific mass deviations are given, or a maximum deviation is specified.
• Thus a set of masses, or an interval of masses, is given, and only segments satisfying these mass constraints are extracted for further comparison. This is called mass filtering, and it constrains which modifications can be considered.
• Another filtering technique is to try to extract short amino acid sequences from the spectrum (typically of length two to four), and to process only segments containing one of these short sequences. This is called sequence filtering, and it can be combined with mass filtering.
• Example
  • Suppose the neutral mass of a peptide is found to be 706 Da, and the MS/MS spectrum produced has peaks at {129, 246, 276, 345, 363, 432, 533, 579, 636}.
  • If we expect that one phosphorylation may have occurred, we can consider segments with calculated peptide masses of 626 Da (706 - 80, the modified case) and 706 Da (the unmodified case).
  • Considering mass differences between the peaks, we see that 345-246=99, which is the residue mass of V, and 432-345=87, which is the residue mass of S. It is then reasonable to believe that the peptide contains the subsequence VS (assuming consecutive b-ions) or SV (assuming y-ions).
• To speed up the filtering, the database can be indexed, both for mass filtering and for sequence filtering.
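The example can be turned into a small sketch of combined mass and sequence filtering. It assumes nominal (integer) residue masses, exact peak differences, and four hypothetical candidate segments invented for illustration; none of the function names come from an existing tool.

```python
# Nominal (integer) residue masses in Da.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
WATER = 18


def segment_mass(segment):
    """Nominal neutral mass of a segment: residue masses plus one water."""
    return sum(NOMINAL[a] for a in segment) + WATER


def mass_filter(segments, allowed_masses, tol=0):
    """Keep segments whose mass is within tol of one of the allowed masses."""
    return [s for s in segments
            if any(abs(segment_mass(s) - m) <= tol for m in allowed_masses)]


def two_residue_tags(peaks):
    """All two-letter tags spelled by three peaks whose consecutive
    differences both equal residue masses."""
    by_mass = {}
    for aa, m in NOMINAL.items():
        by_mass.setdefault(m, []).append(aa)
    peak_set = set(peaks)
    tags = set()
    for p in peaks:
        for m1, first in by_mass.items():
            if p + m1 not in peak_set:
                continue
            for m2, second in by_mass.items():
                if p + m1 + m2 in peak_set:
                    tags.update(a + b for a in first for b in second)
    return tags


def sequence_filter(segments, tags):
    """Keep segments containing a tag, read either forwards (b-ions)
    or backwards (y-ions)."""
    wanted = set(tags) | {t[::-1] for t in tags}
    return [s for s in segments if any(t in s for t in wanted)]


peaks = [129, 246, 276, 345, 363, 432, 533, 579, 636]
candidates = ["TNVSMR", "HESVR", "VTNSMR", "GAVSK"]  # hypothetical segments

print("VS" in two_residue_tags(peaks))
kept = mass_filter(candidates, [706, 706 - 80])
print(kept)
print(sequence_filter(kept, ["VS"]))
```

Note that the peak list yields more two-residue tags than VS alone (for example, 276-129=147 matches F), so a real sequence filter would have to decide which tags to trust. Indexing the database by segment mass and by short subsequences would replace the linear scans used here.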

Scoring and statistical significance
• Each search program uses some scheme for scoring the matching segments it finds, and the highest-scoring segment is presumed to be the correct one.
• Many scoring schemes have been proposed, and the general discussion in earlier chapters concerning scoring schemes and statistical significance also applies here.
