De Novo Peptide Sequencing: Algorithms & Applications

Algorithmic Problems in Peptide Sequencing

De Novo Sequencing for Peptide Identification
Outline
• Basics of Proteomics
• Roles and Anatomy of Proteins
• Tandem Mass Spectrometry
• Algorithms for Peptide Identification
• De Novo Sequencing
• An Algorithm for Perfect Spectra
• Peptide Identification in Real World
• Discussions

Briefings
• We mainly focus on the following result:
  • Ting Chen, Ming-Yang Kao, Matthew Tepel, John Rush and George Church, A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry, Journal of Computational Biology, 8(3): 325-337, 2001.
  • Its preliminary version also appears in The 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), pages 389-398, 2000.
• One of the most-cited algorithm articles in the computational proteomics community.

Outline
• Basics of Proteomics
• Roles and Anatomy of Proteins
• Tandem Mass Spectrometry
• Algorithms for Peptide Identification
• De Novo Sequencing
• An Algorithm for Perfect Spectra
• An Improved Version
• Peptide Identification in Real World
• Discussions

Anatomy of Protein Molecules
[Figure: chemical structures of a neutral peptide (the stable state in nature) and of a residue of the peptide (the basic building block), each with side chain Rx.]

Proteins and Peptides
[Figure: trypsin + H2O cleaves a protein chain (side chains R1 to R5) into peptides at arginine (R) or lysine (K). Rectangles stand for amino acid residues.]
• K: 146.19 (amino acid), 128.17 (residue)
• R: 174.13 (amino acid), 156.11 (residue)
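The digestion step on this slide can be mimicked in silico. The following is a minimal sketch, assuming the simplified rule that trypsin cuts after every K or R (real trypsin has exceptions, for example before proline, and can miss cleavage sites). The function name `tryptic_digest` and the input sequence are made up for illustration.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            # Cut on the C-terminal side of lysine or arginine.
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever is left after the last cut is the C-terminal peptide.
        peptides.append(protein[start:])
    return peptides


print(tryptic_digest("LGERAAKWQ"))  # ['LGER', 'AAK', 'WQ']
```

Each resulting peptide, such as LGER, is what the mass spectrometer later fragments.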

Amino Acid Molecules
• Please visit http://www.ionsource.com/ for more information.


Tandem Mass Spectrometry
• Mass spectrometers measure the mass of charged ions.
• A mass spectrometer has 3 major components: ionizer, mass analyzer, and detector.
[Figure: sample, ionizer, mass analyzer, detector. Adapted from Nathan Edwards’ slides.]

Proteomics via Mass Spectrometers
[Figure: workflow with enzymatic digest and fractionation, first-stage MS, precursor selection and dissociation, and MS/MS. Adapted from Nathan Edwards’ slides.]


Peptide Identification
• Given:
  • An MS/MS spectrum (m/z, intensity, possibly along with its retention time)
  • The precursor mass
• Output:
  • The amino-acid sequence of the peptide
• Imagine a deck of cards that you can cut many times, each time obtaining the sum of the upper or lower half.

Peptide Fragmentation Mechanism
[Figure: the peptide L-G-E-R drawn from N-terminus to C-terminus. b-ions are the fragments on the N-terminus side and y-ions the fragments on the C-terminus side; both are plotted along the m/z axis.]

Peaks in a Spectrum
• Peptide: L – G – E – R
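To make the peak positions concrete, here is a small sketch that lists the singly charged b- and y-ion values for L-G-E-R. It assumes nominal (integer) residue masses, b-ion = prefix residues + 1, and y-ion = suffix residues + 19 (water plus a proton). `RESIDUE_MASS` and `fragment_ions` are illustrative names, not taken from the slides.

```python
# Nominal (integer) residue masses for the four residues used here.
RESIDUE_MASS = {"G": 57, "L": 113, "E": 129, "R": 156}


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of nominal m/z values, charge +1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i: the first i residues plus one proton.
    b_ions = [sum(masses[:i]) + 1 for i in range(1, n)]
    # y_i: the last i residues plus water (18) plus one proton (1).
    y_ions = [sum(masses[n - i:]) + 19 for i in range(1, n)]
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("LGER")
print(b_ions)  # [114, 171, 300]  -> L+, LG+, LGE+
print(y_ions)  # [175, 304, 361]  -> R+, ER+, GER+
```

With these conventions every complementary pair sums to the peptide mass plus 2, for example 114 + 361 = 475 = 473 + 2.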

Manual De Novo Sequencing

De Novo Sequencing
• De novo: Latin for "from the beginning".
• Database search tools match against known peptides.
• Problem definition:
  • Given a spectrum (a set of real intervals)
  • and a mass value M,
  • compute a sequence P (a set of real numbers in a specific order)
  • s.t. m(P) = M and the matching score is maximized.
  • m(P) is the sum of the residue masses.

De Novo Sequencing: An Ideal Case
• An ideal tandem mass spectrum is noise-free and contains only b- and y-ions, and every mass peak has the same height.
• The task is to find paths connecting two endpoints on a directed acyclic graph.
• The problem is: how do we construct the ion ladder?

Ion Ladders in an Ideal Case
[Figure: the y1, y2, y3 ion ladder of the peptide L-G-E-R along the m/z axis.]
• Based on an ideal ion ladder, we can determine the sequence by concatenating prefixes (or suffixes) in order.
• However, we cannot determine the ion type of a peak before identifying it.
• Example: we are given only L+, ER+, LGE+, R+.

NC-Spectrum Model
• We generate a (superset of a) ladder of ions.
• A trick: even if we cannot determine the ion types, we know that an ion is either a b-ion or a y-ion.
• Assume that we want to generate the b-ion ladder.
  • If a peak is a b-ion, add the peak value to the list.
  • If a peak is a y-ion, add the complementary b-ion value to the list.
• This phase doubles the number of peaks.

NC-Spectrum Model
[Figure: observed peaks P1 to P4 and artificial peaks Q1 to Q4 on the axis from 0 to m, mirrored around m/2 and labeled with the fragments L, LG, LGE, GER, ER, R.]
• For the peptide sequence LGER, we construct all possible b-ions with respect to the current spectrum.
• {P1, Q3, P4} and {P2, P3, Q1} are both complete ladders.
• Pi: observed peaks; Qi: artificial peaks.

NC-Spectrum Model
• Given a peak list {P1, P2, P3, …, Pk}
• The coordinates of all points along the line, for each peak Pi:
  • Pi - 1
  • Qi = M - Pi + 1 (why?)
• We still have to add two endpoints:
  • 0
  • M - 18
• Slide annotation: "Since the ion loses a hydrogen", (M - (Pi - 1)) - 1
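One way to answer the "(why?)": the deck later states Mass(b-ion) + Mass(y-ion) = PrecursorMass + 2. If a peak P is a y-ion, its complementary b-ion is M + 2 - P, and subtracting the proton gives the coordinate M - P + 1. The sketch below builds the coordinate list under the assumptions that M is the neutral peptide mass (residues plus 18), that masses are nominal integers, and that the observed peaks are L+, R+, LGE+, ER+ of LGER. The name `nc_coordinates` is hypothetical.

```python
def nc_coordinates(peaks, M):
    """Return the sorted NC-spectrum coordinates for a peak list and mass M."""
    coords = {0, M - 18}          # the two endpoints
    for p in peaks:
        coords.add(p - 1)         # coordinate if the peak is a b-ion
        coords.add(M - p + 1)     # coordinate if the peak is a y-ion
    return sorted(coords)


# LGER: residues 113 + 57 + 129 + 156 = 455, so M = 455 + 18 = 473.
peaks = [114, 175, 300, 304]      # L+, R+, LGE+, ER+
print(nc_coordinates(peaks, 473))
# [0, 113, 170, 174, 299, 303, 360, 455]
```

Only 8 distinct coordinates appear instead of 2k + 2 = 10, because 175 and 300 are a complementary pair and generate the same two points. The true prefix masses 0, 113, 170, 299, 455 are all present.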

NC-Spectrum Model: A Summary
• We are given k peaks.
• Now we have at most 2k+2 vertices.
• Two vertices are adjacent if their coordinates differ by the weight of some amino acid.
• The spectrum graph can be constructed in O(n^2) time. (Why?)
• De novo sequencing then searches for a good path (or paths) from coordinate 0 to M - 18.
• Such a path is not necessarily an ion ladder, though.
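The quadratic bound follows from testing every pair of vertices once. Below is a minimal sketch, assuming nominal integer residue masses and exact matching (real spectra would need a mass tolerance). `NOMINAL_MASSES` and `build_spectrum_graph` are illustrative names.

```python
from itertools import combinations

# Distinct nominal residue masses of the 20 amino acids
# (L/I share 113, K/Q share 128).
NOMINAL_MASSES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186}


def build_spectrum_graph(coords):
    """Adjacency lists: u -> v when v - u equals some residue mass."""
    vertices = sorted(set(coords))
    graph = {v: [] for v in vertices}
    # All n(n-1)/2 ordered pairs with u < v, hence O(n^2) work.
    for u, v in combinations(vertices, 2):
        if v - u in NOMINAL_MASSES:
            graph[u].append(v)
    return graph


graph = build_spectrum_graph([0, 113, 170, 174, 299, 303, 360, 455])
for u, targets in graph.items():
    print(u, "->", targets)
```

Since every edge points from a smaller to a larger coordinate, the graph is acyclic by construction.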

Dynamic Programming Strategy
• Dynamic programming can solve this problem efficiently.
• Uni-directional (forward) DP does not work, since it could produce a solution containing both candidates for the same peak.
[Figure: observed peaks P1 to P4 and artificial peaks Q1 to Q4 on the axis from 0 to m, mirrored around m/2.]

Dynamic Programming Strategy (Cont’d)
• Dynamic programming can solve this problem efficiently using a different encoding scheme.
• We approach the middle part from both ends.
[Figure: observed peaks P1 to P4 and artificial peaks Q1 to Q4 on the axis from 0 to m, mirrored around m/2.]

Dynamic Programming Strategy (Cont’d)
• Mass(b-ion) + Mass(y-ion) = PrecursorMass + 2
• These b-ion candidates are nested pairs in the spectrum graph.
[Figure: nested pairs of vertices around m/2 on the axis from 0 to m.]

Relabeling the Vertices
• To encode the spectrum graph by the nested pairs, we need to relabel the vertices.
• {0 = x0, x1, x2, …, xk, yk, …, y2, y1, y0 = m}
• xi and yi are both generated from the same peak.
• We go one level further in each iteration.
[Figure: x0 … xk, yk … y0 on the axis from 0 to m, nested around m/2.]

How Dynamic Programming Works
• We design a |V|×|V| matrix M for representing partial path candidates.
• M(i, j) = 1 iff [x0, xi] and [yj, y0] can occur simultaneously in a legal path.
• For every s with 1 ≤ s ≤ max(i, j), exactly one of xs and ys occurs in the determined partial path.
[Figure: a left partial path ending at xi and a right partial path starting at yj, with the middle part still undetermined.]

How Dynamic Programming Works (Cont’d)
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]
• M(0,0) = 1: x0 y0
• M(0,1) = 1: x0 y1 y0
• M(1,0) = 1: x0 x1 y0

How Dynamic Programming Works (Cont’d)
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]
• M(0,1) = 1: x0 y1 y0
• M(1,0) = 1: x0 x1 y0
• M(2,0) = 0: x0 x1 x2 y0
  • M(1,0) = 1, but we cannot reach x2 from x0 or x1.
• M(2,1) = 1: x0 x2 y1 y0
  • M(0,1) = 1, and we can reach x2 from x0.

How Dynamic Programming Works (Cont’d)
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]
• M(0,1) = 1: x0 y1 y0
• M(1,0) = 1: x0 x1 y0
• M(0,2) = 0: x0 y2 y1 y0
  • M(0,1) = 1, but we cannot reach y2 from y0 or y1.
• M(1,2) = 1: x0 x1 y2 y0
  • M(1,0) = 1, and we can reach y2 from y0.

Dynamic Programming: Preview
• In the i-th iteration, we determine and record all possible (partial) paths in [0, xi] and [yi, m].
[Figure: a partial path x0 … xi-1 on the left and yt … y0 on the right, with t < i-1; the next vertex to add is xi or yi?]

Dynamic Programming: Preview (Cont’d)
Path extension
• How can we reach yi?
• To calculate M(xj, yi) for all j < i:
  • For every j < i, check whether yi is adjacent to yt and M(xj, yt) = 1 for some t < i.
  • If so, M(xj, yi) = 1. Otherwise, it is 0.
[Figure: the right partial path yt … y0 extended by yi, while the left partial path x0 … xj stays fixed.]

Dynamic Programming: Preview (Cont’d)
Path extension
• Similarly, how can we reach xi?
• To calculate M(xi, yj) for all j < i:
  • For every j < i, check whether xi is adjacent to xt and M(xt, yj) = 1 for some t < i.
  • If so, define M(xi, yj) = 1.
[Figure: the left partial path x0 … xt extended by xi, while the right partial path yj … y0 stays fixed.]

Dynamic Programming
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: Initialization
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: 1st Iteration
• We then compute M(1,0) and M(0,1).
• Check the arcs (x0, x1) and (y1, y0).
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: Recursion (a)
For j = 2 to k
  For i = 0 to j-2
    (a) If M(i, j-1) = 1 and edge(Xi, Xj) = 1, then M(j, j-1) = 1.
• Can we adjust the leftmost endpoint to xj?
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: Recursion (b)
For j = 2 to k
  For i = 0 to j-2
    (b) If M(i, j-1) = 1 and edge(Yj, Yj-1) = 1, then M(i, j) = 1.
• Can we adjust the rightmost endpoint to yj?
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: Recursion (c)
For j = 2 to k
  For i = 0 to j-2
    (c) If M(j-1, i) = 1 and edge(Xj-1, Xj) = 1, then M(j, i) = 1.
• Can we adjust the leftmost endpoint to xj?
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: Recursion (d)
For j = 2 to k
  For i = 0 to j-2
    (d) If M(j-1, i) = 1 and edge(Yj, Yi) = 1, then M(j-1, j) = 1.
• Can we adjust the rightmost endpoint to yj?
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]
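The four recursion rules (a) to (d) can be put together as one loop. The sketch below is a direct transcription under some assumptions: nominal integer masses, exact edge matching, and a toy input where the observed peaks of LGER are L+ (114), ER+ (304) and LGE+ (300), so that no two peaks are complementary. `compute_M`, `is_edge` and the relabeled lists `xs`, `ys` are illustrative names.

```python
NOMINAL_MASSES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186}


def is_edge(u, v):
    """True when coordinate v is exactly one residue mass above u."""
    return (v - u) in NOMINAL_MASSES


def compute_M(xs, ys):
    """xs = [x0..xk] ascending, ys = [y0..yk] descending (ys[0] = y0)."""
    k = len(xs) - 1
    M = [[0] * (k + 1) for _ in range(k + 1)]
    M[0][0] = 1                                   # initialization
    if k >= 1:                                    # 1st iteration
        M[1][0] = int(is_edge(xs[0], xs[1]))
        M[0][1] = int(is_edge(ys[1], ys[0]))
    for j in range(2, k + 1):
        for i in range(j - 1):                    # i = 0 .. j-2
            if M[i][j - 1] and is_edge(xs[i], xs[j]):        # rule (a)
                M[j][j - 1] = 1
            if M[i][j - 1] and is_edge(ys[j], ys[j - 1]):    # rule (b)
                M[i][j] = 1
            if M[j - 1][i] and is_edge(xs[j - 1], xs[j]):    # rule (c)
                M[j][i] = 1
            if M[j - 1][i] and is_edge(ys[j], ys[i]):        # rule (d)
                M[j - 1][j] = 1
    return M


# Peptide mass 473; each peak P gives the pair (P - 1, 473 - P + 1).
xs = [0, 113, 170, 174]      # x0, x1, x2, x3
ys = [455, 360, 303, 299]    # y0, y1, y2, y3
for row in compute_M(xs, ys):
    print(row)
```

On this input the entries set to 1 are M(0,0), M(1,0), M(2,0) and M(2,3): the left path grows x0, x1, x2, and the right path then takes y3 directly from y0.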

Dynamic Programming (Cont’d)
• Now for j = 3
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming (Cont’d)
• Now for j = 4
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]

Dynamic Programming: Constructing the Answer
• Legal path: start the search from the outermost regions (the last row/column):
  • [x4, y4] -> [x3, y3] -> [x2, y2] -> [x1, y1]
• We backtrack through M to find each edge of the feasible solution.
[Figure: vertices x0, x1, x2, x3, x4, y4, y3, y2, y1, y0 on the axis from 0 to m, with m/2 in the middle.]
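Backtracking can be sketched as follows. This is a variant, not the slide's exact procedure: instead of scanning M afterwards, it stores one back-pointer per reachable state (i, j) while applying the same four recursion rules, then closes the path with an edge from xi to yj where max(i, j) = k. It assumes nominal integer masses and exact matching; 113 is reported as L (it could equally be I) and 128 as K (or Q). `sequence_ideal`, `edge` and `MASS_TO_AA` are illustrative names.

```python
MASS_TO_AA = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T",
              103: "C", 113: "L", 114: "N", 115: "D", 128: "K", 129: "E",
              131: "M", 137: "H", 147: "F", 156: "R", 163: "Y", 186: "W"}


def edge(u, v):
    return (v - u) in MASS_TO_AA


def sequence_ideal(xs, ys):
    """Return one peptide string for an ideal spectrum, or None."""
    k = len(xs) - 1
    parent = {(0, 0): None}      # state (i, j) is a key iff M(i, j) = 1

    def relax(src, dst, ok):
        # Mark dst reachable from src and remember where it came from.
        if ok and src in parent and dst not in parent:
            parent[dst] = src

    if k >= 1:
        relax((0, 0), (1, 0), edge(xs[0], xs[1]))
        relax((0, 0), (0, 1), edge(ys[1], ys[0]))
    for j in range(2, k + 1):
        for i in range(j - 1):
            relax((i, j - 1), (j, j - 1), edge(xs[i], xs[j]))      # (a)
            relax((i, j - 1), (i, j), edge(ys[j], ys[j - 1]))      # (b)
            relax((j - 1, i), (j, i), edge(xs[j - 1], xs[j]))      # (c)
            relax((j - 1, i), (j - 1, j), edge(ys[j], ys[i]))      # (d)

    for state in parent:
        i, j = state
        # Outermost level reached, and the two halves join in the middle.
        if max(i, j) == k and edge(xs[i], ys[j]):
            coords = set()
            while state is not None:             # follow back-pointers
                coords.update((xs[state[0]], ys[state[1]]))
                state = parent[state]
            path = sorted(coords)
            return "".join(MASS_TO_AA[b - a] for a, b in zip(path, path[1:]))
    return None


print(sequence_ideal([0, 113, 170, 174], [455, 360, 303, 299]))  # LGER
```

The example input corresponds to the observed peaks L+ (114), ER+ (304) and LGE+ (300) of a peptide with mass 473.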

Dynamic Programming: Review
• Chen et al. create a new NC-spectrum graph G = (V, E), where |V| = 2k+2 and k is the number of mass peaks (ions).
• Given the NC-spectrum graph, we can solve the ideal de novo peptide sequencing problem in O(|V|^2) time and O(|V|^2) space.
  • M construction: O(|V|^2) time
  • Constructing a feasible solution: O(|V|) time
• Therefore we find a feasible solution in O(|V|^2) time and O(|V|^2) space.

Noise in Real Spectra
• The de novo strategy is too fragile to handle frequent errors.
• False negative peaks
  • Missing ions will break the path. The algorithms may find wrong paths by concatenating two partial paths.
• False positive peaks
  • The main critique of the de novo strategy
• Peak value is not the ion mass
  • Peak values represent the mass-over-charge (m/z) value of ions.
  • It relies on the vendor. (Applied Biosystems)
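The last point matters before any graph is built: a peak at a given m/z must be converted to a mass using its charge. A minimal sketch, assuming positive-mode ions that carry one extra proton per charge; the function names are illustrative.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def neutral_mass(mz, charge):
    """Neutral mass of an ion observed at m/z with the given charge."""
    return charge * (mz - PROTON_MASS)


def mz_from_mass(mass, charge):
    """m/z at which a molecule of this neutral mass appears."""
    return (mass + charge * PROTON_MASS) / charge


# The same peptide (neutral mass 473.0 Da) shows up at different m/z values.
for z in (1, 2, 3):
    print(z, round(mz_from_mass(473.0, z), 4))
```

Treating a doubly charged peak as singly charged would place the vertex at roughly half its true coordinate, which is one way real spectra break the ideal model.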

False Positives in Real Spectra
• Different types of ions
  • a-x, b-y, c-z
  • Internal fragments/immonium ions
• Neutral losses
  • Neutral loss of water (~18 Da)
  • Neutral loss of ammonia (~17 Da)
• PTM (like adding new letters)
  • Phosphorylation, glycopeptides
• Isotopes
• Unpurified samples

Database Search Tools
• MASCOT: http://www.matrixscience.com/
• The de facto identification tool

Database Search Tools (Cont’d)
• Brian Searle of Proteome Software informs us:
