Week 4: Functional Genomics via Mass Spectrometry

Week 4: Functional Genomics via Mass Spectrometry Bafna

Peptide MS
• Instrument software usually detects peaks and computes features (peak, area, m/z, …).
[Figure: spectrum; x-axis: m/z]

Nobel Citation 2002 Bafna

MS-based proteomics
• Identification
  • Identify all the proteins in the proteome, specific organelles, specific pathways, complexes, …
• Quantitation
  • Is a protein differentially expressed in certain conditions?
• Others
  • Protein 3D structure, protein-protein interactions, …
We will consider an informatics-centered perspective.

MS versus microarray
[Figure: microarray: sample → cDNA; MS: sample → protein/peptide?]
• Unlike microarrays, peptide identification is not trivial at the end of the MS experiment!
• Identification is an important part of pre-processing.

Proteomics via MS
• Enzymatic digestion (trypsin) + fractionation
• Q: Is it sufficient to identify peptides?

Protein Identification
• Is identifying peptides sufficient?
• Rough probability for co-occurrence of a 15-aa peptide?
• With higher-accuracy instruments, it may be possible to identify intact proteins as well.
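As a rough, back-of-the-envelope answer to the question above, one can assume that residues are drawn independently and uniformly from the 20 amino acids (a simplification; real proteomes are neither uniform nor independent). The database size below is a hypothetical round number chosen only for illustration.

```python
# Assumption: residues are i.i.d. and uniform over the 20 amino acids.
ALPHABET_SIZE = 20
PEPTIDE_LENGTH = 15

# Probability that a fixed 15-mer matches at one given database position.
p_match = ALPHABET_SIZE ** -PEPTIDE_LENGTH

# Hypothetical database size in residues (illustrative, not a measured value).
db_residues = 10_000_000
expected_chance_hits = db_residues * p_match

print(f"p_match = {p_match:.2e}")
print(f"expected chance occurrences = {expected_chance_hits:.2e}")
```

Under these assumptions a chance second occurrence of a 15-residue peptide is very unlikely, which is why a confidently identified peptide usually points to a small set of proteins.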

The peptide backbone
• The peptide backbone breaks to form fragments with characteristic masses.
• Amino-acid mass?
• Residue mass? (amino-acid mass minus water)
N-terminus  H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH  C-terminus
(AA residue i-1, AA residue i, AA residue i+1)

Ionization
• The peptide backbone breaks to form fragments with characteristic masses.
Ionized parent peptide (H+):
N-terminus  H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH  C-terminus
(AA residue i-1, AA residue i, AA residue i+1)

Mass Spectrometry
[Figure: mass spectrum; x-axis: m/z]

Tandem MS of peptides
[Figure: secondary fragmentation of the ionized parent peptide]

Tandem MS: Ionization
• The peptide backbone breaks to form fragments with characteristic masses.
Ionized parent peptide (H+):
N-terminus  H…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH  C-terminus
(AA residue i-1, AA residue i, AA residue i+1)

Fragment ion generation
• The peptide backbone breaks to form fragments with characteristic masses.
Ionized peptide fragment (H+):
N-terminus  H…-HN-CH(Ri-1)-CO  |  NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH  C-terminus
(AA residue i-1, AA residue i, AA residue i+1)

Ionization basics
• Residue: amino acid minus water
• Prefix residue mass (PRM) at a break = sum of the residue masses of the prefix
• Charged ions can be represented as offsets from PRMs
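The definitions above can be sketched in a few lines. This is a minimal illustration, assuming nominal (integer) residue masses and the nominal offsets b = P + 1 and y = M − P + 19 that these slides use in the de novo example; the function names are hypothetical.

```python
# Nominal (integer) residue masses in Da; only a subset of amino acids is listed.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101,
                "L": 113, "D": 115, "K": 128, "E": 129, "F": 147}

def prefix_residue_masses(peptide):
    """Cumulative residue mass after each residue (the PRMs)."""
    masses, total = [], 0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses

def b_ions(peptide):
    """b-ion = prefix residue mass + 1 (nominal offset)."""
    return [p + 1 for p in prefix_residue_masses(peptide)]

def y_ions(peptide):
    """y-ion = M - P + 19, listed from the longest suffix to the shortest."""
    prms = prefix_residue_masses(peptide)
    parent = prms[-1]
    return [parent - p + 19 for p in [0] + prms[:-1]]

print(b_ions("SGEK"))  # b-ion ladder for the small example peptide
print(y_ions("SGEK"))  # matching y-ion ladder
```

With integer masses the values for longer peptides can differ by about 1 Da from spectra annotated with accurate masses, so treat the numbers as approximate.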

Tandem MS for Peptide ID
Residue   S     G     F     L     E     E     D     E     L     K
b ions    88    145   292   405   534   663   778   907   1020  1166
y ions    1166  1080  1022  875   762   633   504   389   260   147
[Figure: tandem mass spectrum; y-axis: % intensity (0 to 100); x-axis: m/z (250 to 1000); [M+2H]2+ peak marked]

Peak Assignment
• Peak assignment implies sequence (residue tag) reconstruction!
Residue   S     G     F     L     E     E     D     E     L     K
b ions    88    145   292   405   534   663   778   907   1020  1166
y ions    1166  1080  1022  875   762   633   504   389   260   147
[Figure: tandem mass spectrum with peaks labeled b3 to b9 and y2 to y9 and the [M+2H]2+ peak; y-axis: % intensity (0 to 100); x-axis: m/z (250 to 1000)]

Database Searching for peptide ID
• For every peptide from a database:
  • Reject it if it has the wrong mass; else:
  • Generate a hypothetical spectrum
  • Compute a correlation between the hypothetical and experimental spectra
• Choose the best
• Database searching is very powerful and is the de facto standard for MS.
• Sequest, Mascot, Inspect, and many others
Example database sequence: …SARLSQETFSDLWKLLPENNVLSPLP….
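The loop above can be sketched as follows. This is a toy illustration, not how Sequest, Mascot, or Inspect work internally: it assumes nominal integer masses, only b- and y-ions, and it swaps the correlation score for a simple shared-peak count. All function names are hypothetical.

```python
# Nominal (integer) residue masses in Da; only a subset of amino acids is listed.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101,
                "L": 113, "D": 115, "K": 128, "E": 129, "F": 147}

def theoretical_spectrum(peptide):
    """Hypothetical spectrum: nominal b- and y-ion masses only."""
    prms, total = [], 0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        prms.append(total)
    parent = prms[-1]
    b = [p + 1 for p in prms]
    y = [parent - p + 19 for p in [0] + prms[:-1]]
    return sorted(set(b + y))

def shared_peak_count(observed, theoretical, tol=0.5):
    """Stand-in for a correlation score: observed peaks explained by the peptide."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

def search(observed, parent_mass, database, tol=0.5):
    """Return (best peptide, score), or None if no candidate has the right mass."""
    best = None
    for peptide in database:
        mass = sum(RESIDUE_MASS[aa] for aa in peptide)
        if abs(mass - parent_mass) > tol:   # reject: wrong parent mass
            continue
        score = shared_peak_count(observed, theoretical_spectrum(peptide), tol)
        if best is None or score > best[1]:
            best = (peptide, score)
    return best

# Toy database and spectrum; the peptides are made up for illustration.
print(search([88, 145, 147, 276, 333], 401, ["AAAA", "GSEK", "SGKE", "SGEK"]))
```

Real search engines add intensity-aware scoring and a validation step on the top hit; the sketch only shows the filter, generate, score, choose structure.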

Modules for Peptide ID
[Diagram: modules D, I/F, S, V; the database (Db) feeds I/F]
• Interpretation (D)
  • Input: spectrum
  • Output: all that can be extracted from the spectrum (peptides/tags/parent mass/charge)
• Indexing/Filtering (I/F)
  • Input: Db (set of peptides)
  • Output: pre-processing of the database, peptide subset
• Scoring (S)
  • Input: peptide set, spectrum
  • Output: ranked list of scores
• Validation (V)
  • Significance of the top hit

De novo interpretation of mass spectra
• The so-called de novo algorithms focus exclusively on the D module.
• There is no database (I/F).
• Limited scoring and validation
• Important when no database exists!
• Also important for database search

De Novo Interpretation: Example
Residue        S     G     E     K
b-ions   0     88    145   274   402
y-ions   420   333   276   147   0
[Figure: spectrum; x-axis: m/z (100 to 500); peaks labeled b1, b2, y1, y2]

The simplest case
• Suppose only (and all) the prefix ions were visible. Would identification be easy?
• We have two problems:
  • There is a mix of b and y ions. Separating them is critical!
  • There are other ions besides b and y, including neutral losses, noise, and so on. We need to account for them.
Prefix ladder for S G E K: S 88, G 145, E 274, K 402 (b-ions: 0 88 145 274 402; y-ions: 420 333 276 147 0)

• Separating b- and y-ions is solved using a combinatorial formulation (forbidden pairs).
• Separating b- and y-ions from all others is solved using a statistical approach.
• Together, they form the basis for a de novo sequencer.

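De Novo Interpretation: Example
Residue        S     G     E     K
b-ions   0     88    145   274   402
y-ions   420   333   276   147   0
Ion offsets (P = prefix residue mass, S = suffix residue mass, M = parent residue mass):
• b = P + 1
• y = S + 19 = M - P + 19
[Figure: spectrum; x-axis: m/z (100 to 500); peaks labeled b1, b2, y1, y2]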

Computing possible prefixes
• We know the parent mass M = 401.
• Consider a mass value 88.
• Assume that it is a b-ion, or a y-ion.
• If it is a b-ion, it corresponds to a prefix of the peptide with residue mass 88 - 1 = 87.
• If it is a y-ion, y = M - P + 19.
  • Therefore the prefix has mass P = M - y + 19 = 401 - 88 + 19 = 332.
• Compute all possible prefix residue masses (PRMs) for all ions.

Putative Prefix Masses
• Only a subset of the prefix masses are correct.
• The correct mass values form a ladder of amino-acid residues.
Prefix masses for M = 401:
Peak   PRM if b   PRM if y
88     87         332
145    144        275
147    146        273
276    275        144
Correct ladder: 0 -S- 87 -G- 144 -E- 273 -K- 401
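The putative prefix masses for this example can be reproduced with a short helper. This is a minimal sketch assuming the nominal offsets b = P + 1 and y = M − P + 19 from the slides; the function name is hypothetical.

```python
def putative_prms(peaks, parent_mass):
    """Map each peak to its two candidate PRMs: (if b-ion, if y-ion)."""
    table = {}
    for z in peaks:
        prm_if_b = z - 1                   # b = P + 1      =>  P = b - 1
        prm_if_y = parent_mass - z + 19    # y = M - P + 19 =>  P = M - y + 19
        table[z] = (prm_if_b, prm_if_y)
    return table

for peak, (as_b, as_y) in putative_prms([88, 145, 147, 276], 401).items():
    print(peak, as_b, as_y)
```

Note that the two candidates for any peak always sum to M + 18, which is what makes the forbidden pairs nest symmetrically around the middle of the mass range.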

Spectral Graph
• Each prefix residue mass (PRM) corresponds to a node.
• Two nodes are connected by an edge if the mass difference is a residue mass.
Example edge: 87 -G- 144

Spectral Graph
[Figure: spectral graph on nodes 0, 87, 144, 146, 273, 275, 332, 401; the path 0, 87, 144, 273, 401 spells S G E K]
• Each peak, when assigned to a prefix/suffix ion type, generates a unique prefix residue mass.
• Spectral graph:
  • Each node u defines a putative prefix residue mass M(u).
  • (u,v) is in E if M(v) - M(u) is the residue mass of an amino acid (tag) or 0.
• Paths in the spectral graph correspond to an interpretation.
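A spectral graph for the running example can be built as below. This is a sketch under stated assumptions: nominal integer residue masses, only b/y interpretations of each peak, and no zero-mass edges; `spectral_graph` and `MASS_TO_AA` are hypothetical names.

```python
# Nominal residue masses -> one-letter code. I/L and K/Q share a nominal mass,
# so only one letter of each pair is shown.
MASS_TO_AA = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
              113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
              137: "H", 147: "F", 156: "R", 163: "Y", 186: "W"}

def spectral_graph(peaks, parent_mass):
    """Return (sorted nodes, edges) where edges are (u, v, residue) triples."""
    nodes = {0, parent_mass}
    for z in peaks:
        for prm in (z - 1, parent_mass - z + 19):   # b- and y-interpretation
            if 0 < prm < parent_mass:
                nodes.add(prm)
    nodes = sorted(nodes)
    edges = [(u, v, MASS_TO_AA[v - u])
             for i, u in enumerate(nodes)
             for v in nodes[i + 1:]
             if (v - u) in MASS_TO_AA]
    return nodes, edges

nodes, edges = spectral_graph([88, 145, 147, 276], 401)
print(nodes)
for u, v, aa in edges:
    print(f"{u} -{aa}-> {v}")
```

The graph also contains edges that are not part of the true ladder, which is why a path-selection step is still needed.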

Re-defining de novo interpretation
[Figure: spectral graph on nodes 0, 87, 144, 146, 273, 275, 332, 401; example edge 87 -G- 144]
• Find a subset of nodes in the spectral graph such that:
  • 0 and M are included
  • Each peak contributes at most one node (interpretation) (*)
  • Each adjacent pair (when sorted by mass) is connected by an edge (valid residue mass)
  • An appropriate objective function (e.g., the number of peaks interpreted) is maximized

Two problems
[Figure: spectral graph on nodes 0, 87, 144, 146, 273, 275, 332, 401; path S G E K]
• Too many nodes.
  • A. Only a small fraction correspond to b/y ions (leading to true PRMs).
  • B. Even if the b/y ions were correctly predicted, each peak generates multiple possibilities, only one of which is correct. We need to find a path that uses each peak only once (algorithmic problem).
• In general, the forbidden pairs problem is NP-hard.

However, …
• The b,y ions have a special non-interleaving property.
• Consider pairs (b1,y1), (b2,y2).
• Note that b1 + y1 = b2 + y2 (both equal M + 20 under the offsets b = P + 1, y = M - P + 19).
• If b1 < b2, then y1 > y2.

Non-Intersecting Forbidden pairs
[Figure: nodes on a mass axis (0 to 400) with nested forbidden pairs such as (87, 332); path S G E K]
• If we consider only b,y ions, 'forbidden' node pairs are non-intersecting.
• The de novo problem can be solved efficiently using a dynamic programming technique.

The forbidden pairs method
• There may be many paths that avoid forbidden pairs.
• We choose a path that maximizes an objective function.
  • Example: the number of peaks interpreted
• Here we assume a scoring function that assigns a score to each PRM. The score captures the likelihood that the PRM is correct.

The forbidden pairs method
[Figure: nodes on a mass axis (0 to 400); a node u and its forbidden partner f(u), e.g. 87 and 332]
• Sort the PRMs according to increasing mass values.
• For each node u, f(u) represents its forbidden pair.
• Let m(u) denote the mass value of the PRM.

D.P. for forbidden pairs
[Figure: nodes on a mass axis (0 to 400) with u in the lower half and v in the upper half]
• Consider all pairs u,v with m[u] <= M/2 and m[v] > M/2.
• Define S(u,v) as the best score of a forbidden-pair path from 0->u, v->M.
• Is it sufficient to compute S(u,v) for all u,v?

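D.P. for forbidden pairs
[Figure: nodes on a mass axis (0 to 400) with u in the lower half and v in the upper half]
• Note that the best interpretation is given by the maximum of S(u,v) over all pairs with (u,v) in E, i.e. pairs whose two half-paths can be joined by a single valid edge.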

D.P. for forbidden pairs
[Figure: nodes 0, u, f(u), w, v on a mass axis (0 to 400)]
• Denote the forbidden pair of node v by f(v).
• What is f(f(v))?
• Note that we have one of two cases:
  • Either u < f(v) (and f(u) > v)
  • Or u > f(v) (and f(u) < v)
• Case 1 (u < f(v)):
  • Extend v, do not touch f(u)

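The complete algorithm
/* score(x) is the score of PRM x; S[0,M] = 0; S[u,v] is undefined when u = f[v] */
for all u   /* increasing mass values from 0 to M/2 */
  for all v   /* decreasing mass values from M to M/2 */
    if (u > f[v])        /* u was added last; do not touch f[v] */
      S[u,v] = max over w < u with (w,u) in E, w != f[v], of S[w,v] + score(u)
    else if (u < f[v])   /* v was added last; do not touch f[u] */
      S[u,v] = max over w > v with (v,w) in E, w != f[u], of S[u,w] + score(v)
    if (u,v) in E
      /* maxI is the score of the best interpretation */
      maxI = max {maxI, S[u,v]}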
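A runnable sketch of a forbidden-pairs dynamic program is given below. It is an illustration under stated assumptions, not necessarily the exact recurrence from the lecture: masses are nominal integers, each peak yields PRMs z − 1 and M − z + 19 (so partners sum to M + 18), every PRM scores 1, and the state stores whole paths for easy traceback. All names are hypothetical.

```python
# Nominal residue masses -> letter (I/L and K/Q are indistinguishable here).
MASS_TO_AA = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
              113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
              137: "H", 147: "F", 156: "R", 163: "Y", 186: "W"}

def forbidden_pairs_denovo(peaks, M):
    """Return (score, node path, sequence), or None if no path exists."""
    prms = set()
    for z in peaks:
        prms.update(p for p in (z - 1, M - z + 19) if 0 < p < M)

    def partner(p):
        return M + 18 - p          # the two PRMs of one peak sum to M + 18

    mid = (M + 18) / 2
    left = sorted({0} | {p for p in prms if p <= mid})                 # increasing
    right = sorted({M} | {p for p in prms if p > mid}, reverse=True)   # decreasing

    # S[(u, v)] = (score, prefix path 0..u, suffix path v..M)
    S = {(0, M): (0, (0,), (M,))}
    for u in left:
        for v in right:
            if (u, v) == (0, M) or u == partner(v):
                continue                      # base case, or a forbidden pair
            cands = []
            if u > partner(v):                # u is innermost: it was added last
                for w in left:
                    if w < u and (u - w) in MASS_TO_AA and (w, v) in S:
                        s, lp, rp = S[(w, v)]
                        cands.append((s + 1, lp + (u,), rp))
            else:                             # v is innermost: it was added last
                for w in right:
                    if w > v and (w - v) in MASS_TO_AA and (u, w) in S:
                        s, lp, rp = S[(u, w)]
                        cands.append((s + 1, lp, (v,) + rp))
            if cands:
                S[(u, v)] = max(cands)

    # Join the two half-paths with one valid edge across the middle.
    finals = [val for (u, v), val in S.items() if (v - u) in MASS_TO_AA]
    if not finals:
        return None
    score, lp, rp = max(finals)
    path = lp + rp
    seq = "".join(MASS_TO_AA[b - a] for a, b in zip(path, path[1:]))
    return score, path, seq

print(forbidden_pairs_denovo([88, 145, 147, 276], 401))
```

Because the node that is deeper toward the middle is always the one added last, its forbidden partner cannot already lie on the other half-path, so each state only needs to exclude a single partner node.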

De Novo: Second issue
• Given only b,y ions, a forbidden pairs path will solve the problem.
• However, recall that there are MANY other ion types.
  • Typical length of peptide: 15
  • Typical # peaks? 50-150?
  • # b/y ions?
• Most ions are "Other"
  • a ions, neutral losses, isotopic peaks, …

De novo: Weighting nodes in Spectrum Graph
• Factors determining if the ion is b or y:
  • Intensity
  • Support ions
    • b- and y-ions are the most likely ions to lose water/ammonia
  • Isotopic peaks

Offset frequency function
• b- and y-ions show offsets due to neutral losses.

A simple example of a Bayesian scoring model
• Classify all peaks as absent (X), low-intensity, or high-intensity.
• Suppose we see the following supporting peaks for a mass value m:
  • Low b
  • High y
  • Absent a
  • Low y-18
• We are interested in Pr(m | supporting peaks).
• Through a Bayesian inversion, we say that
  • Pr(m | supporting peaks) ∝ Pr(supporting peaks | m) Pr(m)
Bafna/Ideker
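The inversion above can be illustrated with a toy log-odds score. This sketch makes two loud assumptions: the probability tables are HYPOTHETICAL numbers invented for illustration (in practice they would be learned from annotated spectra), and the ion types are treated as conditionally independent, which is simpler than a full Bayesian network.

```python
import math

# HYPOTHETICAL tables: Pr(intensity class of a supporting ion | m is a true PRM).
P_GIVEN_TRUE = {
    "b":    {"absent": 0.20, "low": 0.30, "high": 0.50},
    "y":    {"absent": 0.15, "low": 0.25, "high": 0.60},
    "a":    {"absent": 0.60, "low": 0.30, "high": 0.10},
    "y-18": {"absent": 0.50, "low": 0.35, "high": 0.15},
}
# HYPOTHETICAL background: Pr(intensity class | m is a random mass).
P_GIVEN_RANDOM = {"absent": 0.80, "low": 0.15, "high": 0.05}

def log_odds(observation, prior_true=0.1):
    """log [Pr(obs | true) Pr(true)] - log [Pr(obs | random) Pr(random)]."""
    score = math.log(prior_true / (1.0 - prior_true))
    for ion, level in observation.items():
        score += math.log(P_GIVEN_TRUE[ion][level] / P_GIVEN_RANDOM[level])
    return score

slide_example = {"b": "low", "y": "high", "a": "absent", "y-18": "low"}
nothing_seen = {ion: "absent" for ion in P_GIVEN_TRUE}
print(log_odds(slide_example), log_odds(nothing_seen))
```

A positive log-odds favors the hypothesis that m is a true prefix residue mass; such a value could serve as the node score used by the forbidden pairs step.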

A simple example of a Bayesian scoring model
[Figure: network relating the ion types b, a, y, and y-H2O]

BN scoring
• Create tables of joint occurrences of ions using previously annotated spectra.
• Use these to score the BN.

Weighting nodes • A probabilistic network to model support ions (Pepnovo)

De Novo Interpretation Summary
• The main challenges are to separate b/y ions from everything else (weighting nodes) and to separate the prefix ions from the suffix ions (forbidden pairs).
• As always, the abstract idea must be supplemented with many details.
  • Noise peaks, incomplete fragmentation
• A PRM is first scored on its likelihood of being correct, and the forbidden pairs method is applied subsequently.

A quick survey of other modules
[Diagram: modules D, I/F, S, V; the database (Db) feeds I/F]
• The D module is the only module used in de novo algorithms.
• In database search, most of the attention is devoted to scoring and validation.
• A variant of sequence alignment can be used for scoring, along with statistical inference.

A brief overview of scoring with PTMs
• Each node corresponds to the match of a peak with a sequence prefix, and is associated with a score.
• The score of each node is defined through statistical learning on spectral peaks.
• The score of a peptide is the score of its best-scoring path, and can be computed using DP.
• Shifts correspond to post-translational modifications.
• Score features:
  • Peak intensity
  • Residue composition
  • Support from neutral losses

The I/F module
[Diagram: modules D, I/F, S, V]
• De novo tags can be used to index the database. Only the peptides that match the tags are selected for scoring.
