This text discusses the use of mass spectrometry in functional genomics and proteomics, focusing on the identification and quantitation of proteins. It covers the basics of mass spectrometry, enzymatic digestion, peptide identification, and database searching. The concept of de novo interpretation of mass spectra is also introduced.
Nobel Citation 2002 Bafna
Peptide MS
• Instrument software usually detects peaks and computes their features (peak, area, m/z, …).
[Figure: spectrum; x-axis m/z]
MS versus micro-array
[Figure: sample → cDNA (micro-array) versus sample → protein/peptide? (MS)]
• Unlike a micro-array experiment, peptide identification is not trivial at the end of an MS experiment!
• Identification is an important part of pre-processing.
MS-based proteomics
• Identification
  • Identify all the proteins in the proteome, specific organelles, specific pathways, complexes, …
• Quantitation
  • Is a protein differentially expressed in certain conditions?
• Others
  • Protein 3D structure, protein-protein interactions, …
We will consider an informatics-centered perspective.
Proteomics via MS
• Enzymatic digestion (trypsin) + fractionation
• Q: Is it sufficient to identify peptides?
Protein identification
• Is identifying peptides sufficient?
• What is the rough probability that a given 15-aa peptide also occurs by chance elsewhere?
• With higher-accuracy instruments, it may be possible to identify intact proteins as well.
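A back-of-the-envelope answer to the 15-aa question can be sketched as follows. The sketch assumes, as a simplification that is not on the slide, that residues are independent and uniformly distributed over the 20 standard amino acids; `PROTEOME_RESIDUES` is an illustrative order of magnitude, not a measured value.

```python
# Rough chance that one specific 15-residue peptide appears by accident.
# Assumptions (not from the slides): independent, uniform residues over
# 20 amino acids; PROTEOME_RESIDUES is an illustrative proteome size.
ALPHABET_SIZE = 20
PEPTIDE_LENGTH = 15
PROTEOME_RESIDUES = 10**7

# Probability of a match at one fixed start position.
p_match = ALPHABET_SIZE ** -PEPTIDE_LENGTH
# Expected number of chance matches over all start positions.
expected_hits = p_match * PROTEOME_RESIDUES

print(f"P(match at one position) ~ {p_match:.2e}")
print(f"Expected chance occurrences ~ {expected_hits:.2e}")
```

Under these assumptions a 15-residue peptide is effectively unique, which suggests why identifying peptides is often enough to pin down the protein. Real sequences are neither uniform nor independent, so the number is only a rough guide.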
The peptide backbone
• The peptide backbone breaks to form fragments with characteristic masses.
• Amino-acid mass?
• Residue mass? (amino-acid mass − water)
[Figure: H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH, running from the N-terminus to the C-terminus; the side chains Ri-1, Ri, Ri+1 belong to AA residues i-1, i, i+1]
Ionization
[Figure: the backbone H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH carrying an added proton (H+): the ionized parent peptide]
• The peptide backbone breaks to form fragments with characteristic masses.
Mass spectrometry
[Figure: mass spectrum; x-axis m/z]
Tandem MS of peptides
[Figure: the ionized parent peptide undergoes secondary fragmentation]
Tandem MS: Ionization
[Figure: the ionized parent peptide, H+ added to the backbone H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH]
• The peptide backbone breaks to form fragments with characteristic masses.
Fragment ion generation
• The peptide backbone breaks to form fragments with characteristic masses.
[Figure: an ionized peptide fragment produced by a break in the backbone H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH]
Ionization basics
• Residue: amino acid minus water
• Prefix residue mass (PRM) at a break = sum of the residue masses before the break
• Charged ions can be represented as offsets from PRMs
Tandem MS for peptide ID
b ions:    88  145  292  405  534  663  778  907 1020 1166
residue:    S    G    F    L    E    E    D    E    L    K
y ions:  1166 1080 1022  875  762  633  504  389  260  147
[Figure: spectrum, % intensity (0 to 100) against m/z (250 to 1000), with the [M+2H]2+ peak marked]
Peak assignment
• Peak assignment implies sequence (residue tag) reconstruction!
[Figure: the SGFLEEDELK spectrum from the previous slide, % intensity against m/z, with peaks labeled b3 to b9 and y2 to y9 and the [M+2H]2+ peak marked]
Database searching for peptide ID
• For every peptide from a database
  • Reject it if it has the wrong mass, else:
  • Generate a hypothetical spectrum
  • Compute a correlation between the hypothetical and the observed (experimental) spectrum
• Choose the best-scoring peptide
• Database searching is very powerful and is the de facto standard for MS.
• Sequest, Mascot, Inspect, and many others
[Figure: database sequence …SARLSQETFSDLWKLLPENNVLSPLP….]
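The search loop above can be sketched in a few lines. This is a minimal illustration, not how Sequest, Mascot, or Inspect score spectra: it assumes integer (nominal) residue masses, exact peak matching, the b = P + 1 and y = S + 19 offsets used in these slides, and a shared-peak count in place of a correlation. The names `RESIDUE_MASS`, `theoretical_spectrum`, `search`, and the four-peptide database are hypothetical.

```python
# Nominal (integer) residue masses for a few amino acids. Real tools use
# accurate masses and a mass tolerance; this is a simplification.
RESIDUE_MASS = {"G": 57, "S": 87, "K": 128, "E": 129, "F": 147}

def theoretical_spectrum(peptide):
    """b- and y-ion masses using the offsets b = P + 1 and y = S + 19."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    prefixes = [sum(masses[:i]) for i in range(1, len(masses) + 1)]
    b_ions = {p + 1 for p in prefixes}
    # The suffix after a prefix of mass p has mass total - p.
    y_ions = {total - p + 19 for p in [0] + prefixes[:-1]}
    return b_ions | y_ions

def search(database, observed_peaks, parent_mass):
    """Return (peptide, shared-peak count) for the best candidate, or None."""
    observed = set(observed_peaks)
    best = None
    for peptide in database:
        if sum(RESIDUE_MASS[aa] for aa in peptide) != parent_mass:
            continue  # reject: wrong parent mass
        score = len(theoretical_spectrum(peptide) & observed)
        if best is None or score > best[1]:
            best = (peptide, score)
    return best

# Toy database; "SGFK" has the wrong mass and is filtered out.
database = ["SGEK", "GSEK", "SEGK", "SGFK"]
print(search(database, [88, 145, 147, 276], parent_mass=401))
```

The observed peaks are the four labeled peaks of the SGEK example used later in the slides; the three same-mass candidates differ only in how many of those peaks they explain.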
Modules for peptide ID
[Figure: the modules D, S, V, and I/F, with the database Db]
• Interpretation (D)
  • Input: spectrum
  • Output: all that can be extracted from the spectrum (peptides/tags/parent mass/charge)
• Indexing/Filtering (I/F)
  • Input: Db (set of peptides)
  • Output: pre-processing of the database, peptide subset
• Scoring (S)
  • Input: peptide set, spectrum
  • Output: ranked list of scores
• Validation (V)
  • Significance of the top hit
De novo interpretation of mass spectra
[Figure: the modules D, S, V, and I/F]
• The so-called de novo algorithms focus exclusively on the D module.
• There is no database (I/F).
• Scoring and validation are limited.
• Important when no database exists!
• Also important for database search.
De novo interpretation: example
b-ions:    0   88  145  274  402
residue:       S    G    E    K
y-ions:  420  333  276  147    0
[Figure: spectrum over m/z 100 to 500 with peaks labeled b1, b2, y1, y2]
The simplest case
• Suppose only (and all) the prefix ions were visible. Would identification be easy?
b-ions:    0   88  145  274  402
residue:       S    G    E    K
y-ions:  420  333  276  147    0
• We have two problems:
  • There is a mix of b- and y-ions. Separating them is critical!
  • There are other ions besides b and y, including neutral losses, noise, and so on. We need to account for them.
• Separating b-ions from y-ions is solved using a combinatorial formulation (forbidden pairs).
• Separating b- and y-ions from all others is solved using a statistical approach.
• Together, they form the basis for a de novo sequencer.
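De novo interpretation: example (ion offsets)
b-ions:    0   88  145  274  402
residue:       S    G    E    K
y-ions:  420  333  276  147    0
• Ion offsets: b = P + 1 and y = S + 19 = M − P + 19, where P is the prefix residue mass, S the suffix residue mass, and M the parent mass.
[Figure: spectrum over m/z 100 to 500 with peaks labeled b1, b2, y1, y2]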
Computing possible prefixes
• We know the parent mass M = 401.
• Consider a peak with mass value 88.
• Assume that it is either a b-ion or a y-ion.
  • If it is a b-ion, it corresponds to a prefix of the peptide with residue mass 88 − 1 = 87.
  • If it is a y-ion, then y = M − P + 19, so the prefix has mass P = M − y + 19 = 401 − 88 + 19 = 332.
• Compute all possible prefix residue masses (PRMs) for all ions.
Putative prefix masses (M = 401)
• Only a subset of the prefix masses are correct.
• The correct mass values form a ladder of amino-acid residues.
Peak   PRM if b-ion   PRM if y-ion
88     87             332
145    144            275
147    146            273
276    275            144
Correct ladder: 0 -S- 87 -G- 144 -E- 273 -K- 401
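The table of putative prefix masses follows directly from the two offsets. A minimal sketch, assuming integer masses; the function name `candidate_prms` is hypothetical:

```python
def candidate_prms(peaks, parent_mass):
    """Return (peak, PRM if b-ion, PRM if y-ion) for every peak.

    Uses the slide's offsets: b = P + 1 and y = M - P + 19.
    """
    return [(peak, peak - 1, parent_mass - peak + 19) for peak in peaks]

for row in candidate_prms([88, 145, 147, 276], parent_mass=401):
    print(row)
```

Each peak yields two candidate PRMs, of which at most one can be correct; that pair is what the later slides call a forbidden pair.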
Spectral graph
• Each prefix residue mass (PRM) corresponds to a node.
• Two nodes are connected by an edge if the mass difference is a residue mass.
• Example: 87 and 144 are joined by an edge labeled G (144 − 87 = 57).
Spectral graph
[Figure: spectral graph on the PRMs 0, 87, 144, 146, 273, 275, 332, 401, with the path S, G, E, K]
• Each peak, when assigned to a prefix or suffix ion type, generates a unique prefix residue mass.
• Spectral graph:
  • Each node u defines a putative prefix residue mass M(u).
  • (u,v) is in E if M(v) − M(u) is the residue mass of an amino acid (tag) or 0.
• Paths in the spectral graph correspond to an interpretation.
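Building the spectral graph for the running example can be sketched as below. The sketch assumes nominal integer residue masses and exact differences, and it omits the zero-difference edges mentioned on the slide; `RESIDUE_BY_MASS` and `spectral_graph` are hypothetical names.

```python
from itertools import combinations

# Nominal residue masses. L/I (113) and K/Q (128) share a nominal mass,
# so only one letter of each pair is listed here.
RESIDUE_BY_MASS = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def spectral_graph(prms):
    """Return edges (u, v, residue) for every PRM pair one residue apart."""
    nodes = sorted(set(prms))
    return [
        (u, v, RESIDUE_BY_MASS[v - u])
        for u, v in combinations(nodes, 2)  # u < v because nodes is sorted
        if v - u in RESIDUE_BY_MASS
    ]

edges = spectral_graph([0, 87, 144, 146, 273, 275, 332, 401])
for edge in edges:
    print(edge)
```

Besides the true ladder S, G, E, K, the graph contains spurious edges such as 87 to 273 (186, the nominal mass of W, which equals G + E), which is why a path has to be selected rather than read off directly.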
Re-defining de novo interpretation
[Figure: spectral graph on the PRMs 0, 87, 144, 146, 273, 275, 332, 401, with the path S, G, E, K and the edge 87 -G- 144]
• Find a subset of nodes in the spectral graph such that
  • 0 and M are included
  • Each peak contributes at most one node (interpretation) (*)
  • Each adjacent pair (when sorted by mass) is connected by an edge (valid residue mass)
  • An appropriate objective function (e.g., the number of peaks interpreted) is maximized
Two problems
[Figure: spectral graph on the PRMs 0, 87, 144, 146, 273, 275, 332, 401, with the path S, G, E, K]
• Too many nodes.
  • A. Only a small fraction correspond to b/y ions (leading to true PRMs).
  • B. Even if the b/y ions were correctly predicted, each peak generates multiple candidate PRMs, only one of which is correct. We need to find a path that uses each peak only once (an algorithmic problem).
• In general, the forbidden pairs problem is NP-hard.
However, …
• The b- and y-ions have a special non-interleaving property.
• Consider pairs (b1,y1), (b2,y2), where each pair comes from one break of the backbone.
• Note that b1 + y1 = b2 + y2.
• If b1 < b2, then y1 > y2.
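The identity b1 + y1 = b2 + y2 follows from the ion offsets used in these slides (b = P + 1, y = M − P + 19). A short worked derivation, where the index i for the break position is notation added here:

```latex
\begin{align*}
b_i &= P_i + 1, \qquad y_i = M - P_i + 19 \\
b_i + y_i &= M + 20 \quad \text{for every break } i \\
b_1 < b_2 &\implies y_1 = M + 20 - b_1 \;>\; M + 20 - b_2 = y_2
\end{align*}
```

For the SGEK example with M = 401, the complementary pairs are 88 + 333 = 421 and 145 + 276 = 421.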
Non-intersecting forbidden pairs
[Figure: nodes on a mass axis from 0 to 400, including 87 and 332, with the path S, G, E, K]
• If we consider only b- and y-ions, the 'forbidden' node pairs are non-intersecting.
• The de novo problem can then be solved efficiently using dynamic programming.
The forbidden pairs method
• There may be many paths that avoid forbidden pairs.
• We choose a path that maximizes an objective function.
  • Example: the number of peaks interpreted
• Here we assume a function that assigns a score to each PRM. The score captures the likelihood that the PRM is correct.
The forbidden pairs method
[Figure: nodes on a mass axis from 0 to 400, including 87 and 332, with a node u and its forbidden pair f(u) marked]
• Sort the PRMs by increasing mass.
• For each node u, f(u) denotes its forbidden pair.
• Let m(u) denote the mass value of the PRM.
D.P. for forbidden pairs
[Figure: nodes on a mass axis from 0 to 400, including 87 and 332, with nodes u and v marked]
• Consider all pairs u, v with m[u] <= M/2 and m[v] > M/2.
• Define S(u,v) as the best score of a pair of paths, 0 -> u and v -> M, that together avoid forbidden pairs.
• Is it sufficient to compute S(u,v) for all u, v?
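D.P. for forbidden pairs
[Figure: nodes on a mass axis from 0 to 400, including 87 and 332, with nodes u and v marked]
• Note that the best interpretation is given by $\max_{(u,v) \in E} S(u,v)$, that is, the best S(u,v) over all pairs joined by an edge.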
D.P. for forbidden pairs
[Figure: mass axis from 0 to 400 with nodes u, f(u), v, and w marked]
• Denote the forbidden pair of node v by f(v).
• What is f(f(v))?
• Note that we have one of two cases:
  • Either u < f(v) (and f(u) > v)
  • Or u > f(v) (and f(u) < v)
• Case 1:
  • Extend v, do not touch f(u)
The complete algorithm
for all u    /* increasing mass values from 0 to M/2 */
  for all v    /* decreasing mass values from M to M/2 */
    if (u > f[v])
      [update of S[u,v] not recoverable from the slide text]
    else if (u < f[v])
      [update of S[u,v] not recoverable from the slide text]
    if (u,v) ∈ E
      /* maxI is the score of the best interpretation */
      maxI = max {maxI, S[u,v]}
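Because the two update rules did not survive in the slide text, the following is a hedged reconstruction in the spirit of the forbidden-pairs dynamic program, not necessarily the author's exact recurrence. It assumes nominal integer masses, the offsets b = P + 1 and y = M − P + 19 (so the two PRMs of one peak sum to M + 18), and a default score of 1 per PRM used. `RESIDUE_BY_MASS`, `_best`, and `de_novo` are hypothetical names.

```python
# Nominal residue masses (L/I and K/Q share a mass; one letter of each is listed).
RESIDUE_BY_MASS = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def _best(states):
    """Highest-scoring state among those that are not None, else None."""
    states = [s for s in states if s is not None]
    return max(states, key=lambda s: s[0]) if states else None

def de_novo(peaks, parent_mass, score=lambda prm: 1):
    """Best residue ladder 0..M that uses at most one PRM per forbidden pair."""
    M = parent_mass
    pair_sum = M + 18  # (p - 1) + (M - p + 19) for any peak p
    prms = {q for p in peaks for q in (p - 1, M - p + 19) if 0 < q < M}
    left = [0] + sorted(q for q in prms if 2 * q <= pair_sum)
    right = [M] + sorted((q for q in prms if 2 * q > pair_sum), reverse=True)

    # S[u, v] = (score, left path 0..u, right path v..M), or None if impossible.
    S = {}
    for u in left:          # increasing mass
        for v in right:     # decreasing mass
            S[u, v] = None
            if (u, v) == (0, M):
                S[u, v] = (0, (0,), (M,))
            elif u + v > pair_sum:
                # u > f(v): u is the innermost node, so it was added last.
                prev = _best(S[w, v] for w in left
                             if w < u and u - w in RESIDUE_BY_MASS)
                if prev is not None:
                    S[u, v] = (prev[0] + score(u), prev[1] + (u,), prev[2])
            elif u + v < pair_sum:
                # u < f(v): v is the innermost node, so it was added last.
                prev = _best(S[u, w] for w in right
                             if w > v and w - v in RESIDUE_BY_MASS)
                if prev is not None:
                    S[u, v] = (prev[0] + score(v), prev[1], (v,) + prev[2])
            # u + v == pair_sum is a forbidden pair and stays None.

    # Close the two half-paths with one final edge (u, v).
    closing = _best(S[u, v] for u in left for v in right
                    if v - u in RESIDUE_BY_MASS)
    if closing is None:
        return None
    ladder = closing[1] + closing[2]
    peptide = "".join(RESIDUE_BY_MASS[b - a] for a, b in zip(ladder, ladder[1:]))
    return peptide, closing[0]

print(de_novo([88, 145, 147, 276], parent_mass=401))
```

The sketch relies on the nesting of forbidden pairs: whichever of u and v lies closer to the middle cannot be the partner of any node already on the other half-path, so only the single state S[f(v), v] has to be excluded. Passing a different `score` function is where a likelihood-based PRM score would plug in.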
De novo: second issue
• Given only b- and y-ions, a forbidden pairs path will solve the problem.
• However, recall that there are MANY other ion types.
  • Typical length of a peptide: 15
  • Typical number of peaks? 50-150?
  • Number of b/y ions?
• Most ions are "other": a-ions, neutral losses, isotopic peaks, …
De novo: weighting nodes in the spectrum graph
• Factors determining whether an ion is a b- or y-ion:
  • Intensity
  • Support ions
    • b- and y-ions are the most likely ions to lose water/ammonia
  • Isotopic peaks
Offset frequency function
• b- and y-ions show offsets due to neutral losses.
A simple example of a Bayesian scoring model
• Classify all peaks as absent (X), low-intensity, or high-intensity.
• Suppose we see the following supporting peaks for a mass value m:
  • Low b
  • High y
  • Absent a
  • Low y-18
• We are interested in Pr(m | supporting peaks).
• Through a Bayesian inversion, we say that
  • Pr(m | supporting peaks) ∝ Pr(supporting peaks | m) Pr(m)
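The inversion can be made concrete with a small numeric sketch. Every probability below is invented for illustration and is not taken from the slides or from any annotated data set; the sketch also assumes the supporting peaks are conditionally independent, which is a simplification. The names `P_GIVEN_TRUE`, `P_GIVEN_FALSE`, and `posterior` are hypothetical.

```python
from math import prod

# HYPOTHETICAL tables, invented for illustration only.
# Pr(intensity class of a supporting ion | m is a true PRM):
P_GIVEN_TRUE = {
    "b":    {"absent": 0.2, "low": 0.3, "high": 0.5},
    "y":    {"absent": 0.2, "low": 0.3, "high": 0.5},
    "a":    {"absent": 0.6, "low": 0.3, "high": 0.1},
    "y-18": {"absent": 0.5, "low": 0.3, "high": 0.2},
}
# Pr(intensity class | m is not a true PRM), same for every ion type here:
P_GIVEN_FALSE = {"absent": 0.88, "low": 0.10, "high": 0.02}

def posterior(observed, prior):
    """Pr(m is true | supporting peaks), assuming independent peaks."""
    like_true = prod(P_GIVEN_TRUE[ion][cls] for ion, cls in observed.items())
    like_false = prod(P_GIVEN_FALSE[cls] for cls in observed.values())
    evidence = like_true * prior + like_false * (1 - prior)
    return like_true * prior / evidence

# The observation from the slide: low b, high y, absent a, low y-18.
observed = {"b": "low", "y": "high", "a": "absent", "y-18": "low"}
print(round(posterior(observed, prior=0.1), 3))
```

With these made-up tables the supporting peaks raise the probability well above the prior. A network that models dependencies between ion types would replace the independence assumption made here.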
A simple example of a Bayesian scoring model
[Figure: model over the ion types b, a, y, and y-H2O]
BN scoring
• Create tables of joint occurrences of ions using previously annotated spectra.
• Use these to score the Bayesian network (BN).
Weighting nodes
• A probabilistic network to model support ions (PepNovo)
De novo interpretation: summary
• The main challenge is to separate b/y ions from everything else (weighting nodes) and to separate the prefix ions from the suffix ions (forbidden pairs).
• As always, the abstract idea must be supplemented with many details.
  • Noise peaks, incomplete fragmentation
• A PRM is first scored on its likelihood of being correct, and the forbidden pairs method is applied subsequently.
A quick survey of other modules
[Figure: the modules D, S, V, and I/F, with the database Db]
• The D module is the only module used in de novo algorithms.
• In database search, most of the attention is devoted to scoring and validation.
• A variant of sequence alignment can be used for scoring, along with statistical inference.
A brief overview of scoring with PTMs
• Each node corresponds to the match of a peak with a sequence prefix, and is associated with a score.
• The score of each node is defined through statistical learning on spectral peaks.
• The score of a peptide is the score of its best-scoring path, and can be computed using DP.
• Shifts correspond to post-translational modifications.
• Score features:
  • Peak intensity
  • Residue composition
  • Support from neutral losses
The I/F module
[Figure: the modules D, S, V, and I/F]
• De novo tags can be used to index the database. Only the peptides that match the tags are selected for scoring.
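Tag-based filtering can be sketched as a simple substring test. This is a toy illustration: production tools typically also use the masses flanking the tag and an index rather than a linear scan. The function `filter_by_tags` and the three-peptide list are hypothetical.

```python
def filter_by_tags(peptides, tags):
    """Keep only the peptides that contain at least one de novo tag."""
    return [p for p in peptides if any(tag in p for tag in tags)]

# Illustrative peptide list; only candidates containing the tag survive.
peptides = ["SGEK", "LSQETFSDLWK", "GSEK"]
print(filter_by_tags(peptides, tags=["GE"]))
```

Only the surviving peptides would then be passed on to the scoring module.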