Proteomics via Mass Spectrometry (a bioinformatics perspective) Vineet Bafna www.cse.ucsd.edu/~vbafna Bafna
Nobel Citation, 2002
Proteomics via MS: enzymatic digestion (trypsin) + fractionation. Q: Is that sufficient to identify peptides?
Peptide MS
• Instrument software usually detects peaks and computes features (peak, area, m/z, …).
[Figure: spectrum, x-axis m/z]
Single Stage MS [Figure: single-stage mass spectrometry]
MS versus Micro-array
[Figure: micro-array sample -> cDNA; MS sample -> protein/peptide?]
• Unlike a micro-array experiment, peptide identification is not trivial at the end of an MS experiment!
• Identification is an important part of pre-processing.
MS based proteomics
• Identification
  • Identify all the proteins in the proteome, specific organelles, specific pathways, complexes, …
• Quantitation
  • Is a protein differentially expressed in certain conditions?
• Others
  • Protein 3D structure, protein-protein interactions, …
We will consider an informatics-centered perspective.
Protein Identification
• The preferred mode is tandem mass spectrometry of peptides.
• Is identifying peptides sufficient?
  • Rough probability for co-occurrence of a 15-aa peptide?
• With higher-accuracy instruments, it may be possible to identify intact proteins as well.
Tandem MS of peptides [Figure: ionized parent peptide undergoing secondary fragmentation]
The peptide backbone
The peptide backbone breaks to form fragments with characteristic masses.
N-terminus  H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH  C-terminus
(AA residues i-1, i, i+1, with side chains R_{i-1}, R_i, R_{i+1})
Ionization
[Figure: the same backbone with a proton (H+) attached, giving the ionized parent peptide]
Fragment ion generation
[Figure: the ionized backbone after a break, giving an ionized peptide fragment]
Tandem MS for Peptide ID
Residue:  S     G     F     L     E     E     D     E     L     K
b ions:   88    145   292   405   534   663   778   907   1020  1166
y ions:   1166  1080  1022  875   762   633   504   389   260   147
[Figure: tandem mass spectrum; x-axis m/z (0 to 1000), y-axis % intensity (0 to 100); [M+2H]2+ marked]
Peak Assignment
[Figure: the same spectrum and ion ladder, with peaks labeled b3 to b9 and y2 to y9]
• Peak assignment implies sequence (residue tag) reconstruction!
Ion types, and offsets
• P = prefix residue mass
• S = suffix residue mass
• b-ions = P+1
  • (NH2-CHR-CO-..-NH-CHR-CO(+))
• y-ions = S+19
  • (NH3(+)-CHR-CO-..NH-CHR-COOH)
• a-ions = P-27, and so on.
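The offsets above can be turned into a small ion-ladder calculator. This is a minimal sketch, assuming nominal (integer) residue masses; the table `RESIDUE_MASS` and the function names are illustrative and not part of the slides.

```python
# Nominal (integer) residue masses; an assumption for illustration.
# I/L and K/Q are indistinguishable at this resolution.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def prefix_masses(peptide):
    """Running sums of residue masses: P for every prefix of the peptide."""
    total, out = 0, []
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out

def b_ions(peptide):
    """b = P + 1 for each prefix."""
    return [p + 1 for p in prefix_masses(peptide)]

def y_ions(peptide):
    """y = S + 19 for each suffix, shortest suffix first."""
    return [s + 19 for s in prefix_masses(peptide[::-1])]

def a_ions(peptide):
    """a = P - 27 for each prefix."""
    return [p - 27 for p in prefix_masses(peptide)]

print(b_ions("SGEK"))  # [88, 145, 274, 402]
print(y_ions("SGEK"))  # [147, 276, 333, 420]
```

For the peptide SGEK this reproduces the b- and y-ion values used in the de novo example later in the deck.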
MS Quiz:
• Why aren’t all tandem MS peaks of the same intensity?
• Do the intensities for a peptide vary from spectrum to spectrum?
Database Searching for peptide ID
• For every peptide from a database (e.g. …SARLSQETFSDLWKLLPENNVLSPLP….):
  • Reject it if it has the wrong mass, else:
  • Generate a hypothetical spectrum
  • Compute a correlation between the hypothetical and experimental spectra
• Choose the best
• Database searching is very powerful and is the de facto standard for MS.
  • Sequest, Mascot, Inspect, and many others
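The search loop can be sketched in a few lines. This is an illustrative toy, not how Sequest or Mascot score: it assumes nominal integer masses, a hypothetical spectrum made only of b- and y-ions, and a shared-peak count standing in for the correlation. All names are hypothetical.

```python
# Nominal residue masses; an assumption for illustration.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def residue_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide)

def hypothetical_spectrum(peptide):
    """Set of b (P+1) and y (S+19) ion masses for the peptide."""
    peaks, prefix = set(), 0
    total = residue_mass(peptide)
    for aa in peptide:
        prefix += RESIDUE_MASS[aa]
        peaks.add(prefix + 1)                                # b-ion of this prefix
        peaks.add(total - prefix + RESIDUE_MASS[aa] + 19)    # y-ion of the suffix starting at aa
    return peaks

def search(observed_peaks, parent_mass, database, tol=1):
    """Return (score, peptide) for the best-scoring candidate, or None."""
    observed = set(observed_peaks)
    best = None
    for peptide in database:
        if abs(residue_mass(peptide) - parent_mass) > tol:
            continue  # wrong mass: reject
        score = len(hypothetical_spectrum(peptide) & observed)  # shared-peak count
        if best is None or score > best[0]:
            best = (score, peptide)
    return best

print(search([88, 145, 147, 276], 401, ["AAAA", "GSEK", "KEGS", "SGEK"]))
```

Real tools add intensity models, mass tolerances per peak, and a validation step on top of this skeleton.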
So what’s new?
• The identification picture is very simplistic. Only 20-30% of spectra are conclusively identified.
• Many reasons:
  • Spectra are noisy.
  • Databases are incomplete. Sometimes, we need to do a de novo interpretation.
  • Post-translational modifications.
  • Instrument performance is critical.
• The algorithms for identification must be sensitive to these issues.
• We present a systematic look at identification software.
Modules for Peptide Id
[Figure: modules D (interpretation), I/F (indexing/filtering, fed by the Db), S (scoring), V (validation)]
• Interpretation (D)
  • Input: spectrum
  • Output: all that can be extracted from the spectrum (peptides/tags/parent mass/charge)
• Indexing/Filtering (I/F)
  • Input: Db (set of peptides)
  • Output: pre-processing of the database, peptide subset
• Scoring (S)
  • Input: peptide set, spectrum
  • Output: ranked list of scores
• Validation (V)
  • Significance of the top hit
De novo interpretation of mass spectra
• The so-called de novo algorithms focus exclusively on the D module.
• There is no database (I/F).
• Limited scoring and validation
• Important when no database exists!
• Also important for db search
De Novo Interpretation: Example
Residue:    S     G     E     K
b-ions:  0    88    145   274   402
y-ions:  420  333   276   147   0
[Figure: spectrum, x-axis M/Z (100 to 500), with peaks labeled b1, b2, y1, y2]
The simplest case
• Suppose only (and all) the prefix ions were visible. Would identification be easy?
  [Figure: b-ion ladder 0 -S-> 88 -G-> 145 -E-> 274 -K-> 402]
• We have two problems:
  • There is a mix of b and y ions. Separating them is critical!
  • There are other ions besides b and y, including neutral losses, noise and so on. We need to account for them.
• Separating b- and y-ions is solved using a combinatorial formulation (forbidden pairs).
• Separating b,y from all others is solved using a statistical approach.
• Together, they form the basis for a de novo sequencer.
De Novo Interpretation: Example
• Ion offsets: b = P+1; y = S+19 = M-P+19
[Figure: the SGEK spectrum with b-ions 88, 145, 274, 402 and y-ions 420, 333, 276, 147; peaks b1, b2, y1, y2 labeled]
Computing possible prefixes
• We know the parent mass M=401.
• Consider a mass value 88.
• Assume that it is a b-ion, or a y-ion.
• If b-ion, it corresponds to a prefix of the peptide with residue mass 88-1 = 87.
• If y-ion, y = M-P+19.
  • Therefore the prefix has mass P = M-y+19 = 401-88+19 = 332.
• Compute all possible Prefix Residue Masses (PRM) for all ions.
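The two interpretations of each peak can be computed mechanically. A minimal sketch, assuming integer masses and only the b and y offsets from the slides; the function name `putative_prms` is illustrative.

```python
def putative_prms(peaks, parent_mass):
    """Map each peak to its (b-interpretation, y-interpretation) prefix masses.

    b-ion: P = peak - 1
    y-ion: P = M - peak + 19
    """
    return {peak: (peak - 1, parent_mass - peak + 19) for peak in peaks}

for peak, (as_b, as_y) in putative_prms([88, 145, 147, 276], 401).items():
    print(peak, as_b, as_y)
# 88 87 332
# 145 144 275
# 147 146 273
# 276 275 144
```

Only one value per peak can be correct, which is what the forbidden-pairs formulation later enforces.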
Putative Prefix Masses
• Only a subset of the prefix masses are correct.
• The correct mass values form a ladder of amino-acid residues.
Prefix masses for each peak (M=401):
Peak   if b-ion   if y-ion
88     87         332
145    144        275
147    146        273
276    275        144
Correct ladder: 0 -S-> 87 -G-> 144 -E-> 273 -K-> 401
Spectral Graph
• Each prefix residue mass (PRM) corresponds to a node.
• Two nodes are connected by an edge if the mass difference is a residue mass.
  • Example: nodes 87 and 144 are joined by an edge labeled G.
Spectral Graph
[Figure: spectral graph on the mass axis with nodes 0, 87, 144, 146, 273, 275, 332, 401 and edge labels S, G, E, K]
• Each peak, when assigned to a prefix/suffix ion type, generates a unique prefix residue mass.
• Spectral graph:
  • Each node u defines a putative prefix residue mass M(u).
  • (u,v) is in E if M(v)-M(u) is the residue mass of an a.a. (tag) or 0.
• Paths in the spectral graph correspond to an interpretation.
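The graph for the running example is small enough to build by brute force. This is a sketch under assumptions: nominal integer residue masses, exact mass matching with no tolerance, and duplicate PRMs merged into one node (so the "or 0" edges are not needed). The names `RESIDUE_BY_MASS` and `spectral_graph` are illustrative.

```python
# Nominal residue masses mapped to one-letter codes; an assumption for
# illustration. I/L and K/Q share a nominal mass.
RESIDUE_BY_MASS = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L/I", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def spectral_graph(prms, parent_mass):
    """Return (nodes, edges); edges maps (u, v) to the residue label of v - u."""
    nodes = sorted(set(prms) | {0, parent_mass})
    edges = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            label = RESIDUE_BY_MASS.get(v - u)
            if label is not None:
                edges[(u, v)] = label
    return nodes, edges

nodes, edges = spectral_graph([87, 332, 144, 275, 146, 273], 401)
for (u, v), label in sorted(edges.items()):
    print(u, "->", v, label)
```

The path 0 -> 87 -> 144 -> 273 -> 401 spells S, G, E, K/Q, but the graph also contains spurious edges such as 87 -> 273 (W), which is why a scoring objective and the forbidden-pairs constraint are needed.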
Re-defining de novo interpretation
[Figure: spectral graph for the SGEK example, e.g. edge 87 -G-> 144]
• Find a subset of nodes in the spectral graph s.t.
  • 0, M are included
  • Each peak contributes at most one node (interpretation) (*)
  • Each adjacent pair (when sorted by mass) is connected by an edge (valid residue mass)
  • An appropriate objective function (ex: the number of peaks interpreted) is maximized
Two problems
[Figure: spectral graph for the SGEK example]
• Too many nodes.
  • A. Only a small fraction correspond to b/y ions (leading to true PRMs).
  • B. Even if the b/y ions were correctly predicted, each peak generates multiple possibilities, only one of which is correct. We need to find a path that uses each peak only once (algorithmic problem).
• In general, the forbidden pairs problem is NP-hard.
However,..
• The b,y ions have a special non-interleaving property.
• Consider pairs (b1,y1), (b2,y2).
  • Note that b1+y1 = b2+y2.
  • If b1 < b2, then y1 > y2.
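The equal-sum claim follows directly from the ion offsets b = P+1 and y = M-P+19 given earlier. A short derivation, where $P_i$ (an illustrative symbol) is the prefix residue mass at which the backbone breaks to give the complementary pair $(b_i, y_i)$:

```latex
\begin{align*}
b_i + y_i &= (P_i + 1) + (M - P_i + 19) = M + 20, \\
b_1 < b_2 \;&\Longrightarrow\; y_1 = M + 20 - b_1 \;>\; M + 20 - b_2 = y_2 .
\end{align*}
```

In the SGEK example, 88 + 333 = 145 + 276 = 421 = M + 20 with M = 401.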
Non-Intersecting Forbidden pairs
[Figure: forbidden pairs on the mass axis (0 to 400), e.g. 87 and 332, for the SGEK example]
• If we consider only b,y ions, ‘forbidden’ node pairs are non-intersecting.
• The de novo problem can be solved efficiently using a dynamic programming technique.
The forbidden pairs method
• There may be many paths that avoid forbidden pairs.
• We choose a path that maximizes an objective function.
  • EX: the number of peaks interpreted
• Here we assume a scoring function that gives a score to each PRM. The score captures the likelihood that the PRM is correct.
The forbidden pairs method
[Figure: nodes on the mass axis (0 to 400) with u and f(u) marked]
• Sort the PRMs according to increasing mass values.
• For each node u, f(u) represents its forbidden pair.
• Let m(u) denote the mass value of the PRM.
D.P. for forbidden pairs
[Figure: nodes on the mass axis (0 to 400) with u and v marked]
• Consider all pairs u,v with m[u] <= M/2, m[v] > M/2.
• Define S(u,v) as the best score of a forbidden-pair path from 0->u, v->M.
• Is it sufficient to compute S(u,v) for all u,v?
D.P. for forbidden pairs
• Note that the best interpretation is given by max { S(u,v) : (u,v) ∈ E }.
D.P. for forbidden pairs
[Figure: mass axis (0 to 400) with nodes u, v, w and f(u) marked]
• Denote the forbidden pair of node v by f(v).
  • What is f(f(v))?
• Note that we have one of two cases.
  • Either u < f(v) (and f(u) > v)
  • Or, u > f(v) (and f(u) < v)
• Case 1 (u < f(v)):
  • Extend v, do not touch f(u)
The complete algorithm
for all u /* increasing mass values from 0 to M/2 */
  for all v /* decreasing mass values from M to M/2 */
    if (u > f[v])
      S[u,v] = …   /* extend u, do not touch f(v) */
    else if (u < f[v])
      S[u,v] = …   /* extend v, do not touch f(u) */
    if (u,v) ∈ E
      /* maxI is the score of the best interpretation */
      maxI = max {maxI, S[u,v]}
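One way to fill in the loop is sketched below. The update rules are an assumption consistent with "extend v, do not touch f(u)", not a transcription of the slide. Other assumptions: nominal integer masses, exact matching, a score of 1 per PRM used, and forbidden partners identified as PRMs summing to `pair_sum`. With the offsets b = P+1 and y = M-P+19, the two PRMs of one peak sum to M+18, so the sketch mirrors around that value rather than M. All names are illustrative.

```python
# Nominal residue masses; an assumption for illustration.
RESIDUE_MASSES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186}
NEG = float("-inf")

def is_edge(a, b):
    """Spectral-graph edge: the mass difference is a residue mass."""
    return (b - a) in RESIDUE_MASSES

def best_interpretation(prms, M, pair_sum):
    """Max number of PRMs on a path 0 -> M that uses no forbidden pair.

    S[u, v] is the best score of a left path 0 -> u plus a right path
    v -> M that together contain no forbidden pair.
    """
    left = [0] + sorted(p for p in prms if p < pair_sum / 2)                    # increasing
    right = [M] + sorted((p for p in prms if p > pair_sum / 2), reverse=True)   # decreasing
    S = {}
    max_i = NEG
    for u in left:
        for v in right:
            if u == 0 and v == M:
                S[u, v] = 0                      # empty paths
            elif u + v == pair_sum:
                S[u, v] = NEG                    # u and v are a forbidden pair
            elif u > pair_sum - v:
                # u > f(v): u was added last; f(v) is excluded because S[f(v), v] is -inf
                S[u, v] = 1 + max((S[w, v] for w in left
                                   if w < u and is_edge(w, u)), default=NEG)
            else:
                # u < f(v): v was added last; f(u) is excluded because S[u, f(u)] is -inf
                S[u, v] = 1 + max((S[u, w] for w in right
                                   if w > v and is_edge(v, w)), default=NEG)
            if is_edge(u, v):                    # the two half-paths can be joined
                max_i = max(max_i, S[u, v])
    return max_i

print(best_interpretation({87, 144, 146, 273, 275, 332}, 401, 419))  # 3
```

On the running example the optimum of 3 corresponds to the path 0, 87, 144, 273, 401. Because forbidden pairs are nested, the node added last never conflicts with anything already on the opposite half-path, which is what keeps the table two-dimensional.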
De Novo: Second issue
• Given only b,y ions, a forbidden pairs path will solve the problem.
• However, recall that there are MANY other ion types.
  • Typical length of peptide: 15
  • Typical # peaks? 50-150?
  • #b/y ions?
  • Most ions are “Other”: a ions, neutral losses, isotopic peaks, …
De novo: Weighting nodes in Spectrum Graph
• Factors determining if the ion is b or y:
  • Intensity
  • Support ions
    • b- and y-ions are the most likely ions to lose water/ammonia
  • Isotopic peaks
Offset frequency function
• b- and y-ions show offsets due to neutral losses.
De novo: Weighting nodes
• A probabilistic network to model support ions (Pepnovo)
De Novo Interpretation Summary
• The main challenges are to separate b/y ions from everything else (weighting nodes) and to separate the prefix ions from the suffix ions (forbidden pairs).
• As always, the abstract idea must be supplemented with many details.
  • Noise peaks, incomplete fragmentation
  • A PRM is first scored on its likelihood of being correct, and the forbidden pairs method is applied subsequently.
Db search versus de novo interpretation
[Figure: pipeline with Db (55M peptides), Filter, Score, Validation, and De novo]
• Traditional db search has only the scoring module.
• De novo is useful when the peptide is not in the database, but it is not as accurate. It can be thought of as a database search over a much larger database.
• PT modifications change the picture.
Filtering
[Figure: pipeline with Db (55M peptides), Filter, candidate peptides (700), extension, Score, Validation, and De novo]
• Db indexing/filtering is a key mechanism for reducing the search space.
Filtering
• Define a filter as a computational tool that rapidly screens a database, removing much of it but retaining the true peptide.
• Can you suggest commonly used filters?
  • Parent mass
  • Trypsin-digested peptides
Parent Mass filter
• Sort all peptides in the database by their parent mass.
• Search only the peptides that are within some mass tolerance.
• The filter does not work when you have modifications.
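Sorting once and then binary-searching the mass window is a natural way to implement this filter. A minimal sketch, assuming nominal integer residue masses and an in-memory peptide list; `build_index` and `candidates` are illustrative names.

```python
import bisect

# Nominal residue masses; an assumption for illustration.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def residue_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide)

def build_index(peptides):
    """Sort peptides by mass once; return parallel lists (masses, peptides)."""
    pairs = sorted((residue_mass(p), p) for p in peptides)
    return [m for m, _ in pairs], [p for _, p in pairs]

def candidates(index, parent_mass, tol):
    """Peptides whose mass lies within parent_mass +/- tol, via binary search."""
    masses, peptides = index
    lo = bisect.bisect_left(masses, parent_mass - tol)
    hi = bisect.bisect_right(masses, parent_mass + tol)
    return peptides[lo:hi]

index = build_index(["SGEK", "AAAA", "KEGS", "SGFLEEDELK", "GSEK"])
print(candidates(index, 401, 1))  # ['GSEK', 'KEGS', 'SGEK']
```

Each query costs two binary searches plus the size of the window, instead of a scan over the whole database. A modified peptide has a shifted mass, so it falls outside the window of its unmodified entry.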
The dynamic nature of the proteome
• The proteome of the cell is changing.
• Various extra-cellular and other signals activate pathways of proteins.
• A key mechanism of protein activation is PT modification.
• These pathways may lead to other genes being switched on or off.
• Mass spectrometry is key to probing the proteome.
Db search for putatively modified peptides.
• Ex: YFDSTDYNMAK (oxidation, phosphorylation?)
  • 2^5 = 32 possibilities, with 2 types of modifications!
• For each peptide, generate all mods.
• Score each modification.
• Is parent mass still a good filter?
• In contrast, de novo search space does not change significantly.
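The count of 32 can be reproduced by enumerating subsets of modifiable sites. This sketch assumes phosphorylation on S, T and Y and oxidation on M, with nominal mass shifts of +80 and +16; the site rules and the names `MOD_SITES`, `MOD_SHIFT` and `modified_variants` are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

# Assumed site rules and nominal mass shifts, for illustration only.
MOD_SITES = {"phosphorylation": set("STY"), "oxidation": set("M")}
MOD_SHIFT = {"phosphorylation": 80, "oxidation": 16}

def modified_variants(peptide):
    """Yield every subset of modifiable sites as a tuple of (position, mod)."""
    sites = [(i, mod)
             for i, aa in enumerate(peptide)
             for mod, targets in MOD_SITES.items()
             if aa in targets]
    for k in range(len(sites) + 1):
        yield from combinations(sites, k)

variants = list(modified_variants("YFDSTDYNMAK"))
print(len(variants))  # 32 = 2^5 (sites Y, S, T, Y, M)

# Distinct parent-mass shifts the filter would have to allow for.
shifts = sorted({sum(MOD_SHIFT[mod] for _, mod in v) for v in variants})
print(shifts)
```

Under these assumptions one database entry expands into 32 candidates spread over 10 distinct parent masses, which illustrates why a single-mass filter stops being sufficient.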