An Algorithmic Approach to Peptide Sequencing via Tandem Mass Spectrometry Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A.
Collaborators of This Project • University of Southern California: Ting Chen • Harvard Medical School: George M. Church, John Rush, Matthew Tepel
Perspectives A key goal of bioinformatics: to study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. • Genome (DNA): the entire set of materials in the chromosomes. • Transcriptome (RNA): the entire set of gene transcripts. • Proteome (protein): the entire set of proteins (this talk’s focus).
Proteomics • Proteome: all proteins encoded within a genome • about half a million distinct proteins (temporal, spatial, modifications) • ~30,000 human genes • mRNA and protein expression may not correlate • Proteomics: the study of protein expression by biological systems • relative abundance and stability; post-translational modifications • fluctuations in response to the environment and altered cellular needs • correlations between protein expression and disease state • protein-protein interactions, protein complexes • Technologies: • 2D gel electrophoresis • mass spectrometry (this talk’s focus) • yeast two-hybrid system • protein chips
A Key Step of Proteomics • How to sequence proteins? • How to sequence protein peptides? (this talk’s focus)
Outline of This Talk • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions
Outline of This Talk (1) • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions
Protein Identification: HPLC-MS-MS [Figure: workflow from Proteins to Peptides to One Peptide to B-ions / Y-ions, ending in a Tandem Mass Spectrum; spectrum axis: Mass/Charge]
Peptide Fragmentation and Ionization [Figure: a peptide fragmenting into a B-ion and a Y-ion] Complementary: Mass(B-ion) + Mass(Y-ion) = Mass(peptide) + 4H + O
B-ions and Y-ions Fragmentation
Tandem Mass Spectrum [Figure: abundance (%) versus mass/charge; peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225]
Protein Database Search Find the peptide sequences in a protein database that optimally fit the spectrum. • It fails if the target peptide sequence is not in the database. • It fails if some amino acid carries an unknown modification. • It is very slow because it must search the entire database. • E.g., SEQUEST (Yates, Univ. of Washington).
De Novo Peptide Sequencing Problem • Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide. • Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P. • Example: for a peptide mass of 429.212 Daltons and observed peaks at 274.112 and 361.121, P = SWR, with Mass(P) = 429.212 and Ions(P) = {88.033, 175.113, 274.112, 361.121, 430.213, 448.225}. [Figure: spectrum showing the two observed peaks; abundance (%) versus mass/charge]
Tandem Mass Spectrum Peptide mass: 429.212 Daltons. [Figure: abundance (%) versus mass/charge; peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225]
Feature 1 All B-ions form a forward mass ladder. [Figure: the spectrum for peptide mass 429.212 Daltons with the b-ion peaks labeled b1 = 88.033, b2 = 274.112, b3 = 430.213; the ladder starts at mass 1 and its steps are labeled with the residues of the peptide]
Feature 2 All Y-ions form a reverse mass ladder. [Figure: the spectrum for peptide mass 429.212 Daltons with the y-ion peaks labeled y1 = 175.113, y2 = 361.121, y3 = 448.225; the ladder starts at mass 19 and reads the residues of the peptide in reverse]
Basic Difficulty #1 It is unknown whether an ion is a B-ion or a Y-ion. [Figure: the spectrum for peptide mass 429.212 Daltons with unlabeled peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225]
Basic Difficulty #2 There are missing ions. [Figure: the spectrum for peptide mass 429.212 Daltons with only two observed peaks, at 274.112 and 361.121, labeled Ion 1 and Ion 2]
Feature 3 (to Our Rescue) Complementary ion pairs: b1/y2 and b2/y1. [Figure: the spectrum for peptide mass 429.212 Daltons with y1 = 175.113, y2 = 361.121, y3 = 448.225 and b1 = 88.033, b2 = 274.112, b3 = 430.213 labeled]
Outline of This Talk (2) • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions
Formulating the Computational Problem • T = an alphabet of 20 characters a1, a2, …, a20. • Two special characters: alpha and beta. • The mass of alpha is 1, the mass of beta is 19, and the mass of ai is mi. • A peptide sequence is x1, x2, x3, …, xn-1, xn, where each xi is from T. • A b-ion is x0, x1, x2, …, xi for some 1 <= i <= n, where x0 = alpha. • A y-ion is xi, …, xn-2, xn-1, xn, xn+1 for some 1 <= i <= n, where xn+1 = beta.
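The definitions above can be turned into a small calculation. The following is a minimal sketch, not code from the talk: the function names are hypothetical, the residue table covers only the three amino acids of the running example (monoisotopic masses, which sum to the 429.212 Da used on the slides), and alpha and beta are taken as exactly 1 and 19 as in the formulation.

```python
# Monoisotopic residue masses (Daltons) for the running example only.
# A full table would cover all 20 characters of the alphabet T.
RESIDUE_MASS = {"S": 87.032, "W": 186.079, "R": 156.101}

ALPHA = 1.0   # mass of the special character alpha (x0 of every b-ion)
BETA = 19.0   # mass of the special character beta (x_{n+1} of every y-ion)


def peptide_mass(peptide):
    """Sum of the residue masses of x1..xn."""
    return sum(RESIDUE_MASS[x] for x in peptide)


def b_ion_masses(peptide):
    """Masses of alpha, x1, ..., xi for i = 1..n (a forward ladder)."""
    masses, running = [], ALPHA
    for x in peptide:
        running += RESIDUE_MASS[x]
        masses.append(running)
    return masses


def y_ion_masses(peptide):
    """Masses of xi, ..., xn, beta for i = n..1 (a reverse ladder)."""
    masses, running = [], BETA
    for x in reversed(peptide):
        running += RESIDUE_MASS[x]
        masses.append(running)
    return masses
```

For the peptide SWR this gives b-ion masses of roughly 88.03, 274.11, 430.21 and y-ion masses of roughly 175.10, 361.18, 448.21. Each complementary pair (for example b1 and y2) sums to the peptide mass plus 20, that is, alpha plus beta.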
De Novo Peptide Sequencing Problem • Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide. • Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P. • Example: for a peptide mass of 429.212 Daltons and observed peaks at 274.112 and 361.121, P = SWR, with Mass(P) = 429.212 and Ions(P) = {88.033, 175.113, 274.112, 361.121, 430.213, 448.225}. [Figure: spectrum showing the two observed peaks; abundance (%) versus mass/charge]
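The two output conditions can be checked mechanically for any candidate peptide. The sketch below is illustrative only: `is_valid_answer` is a hypothetical name, the residue table holds just the monoisotopic masses of S, W, and R, and the matching tolerance `tol` is an assumption, since the problem statement itself asks for exact equality.

```python
# Monoisotopic residue masses (Daltons) for the running example only.
RESIDUE_MASS = {"S": 87.032, "W": 186.079, "R": 156.101}
ALPHA, BETA = 1.0, 19.0  # masses of the special characters alpha and beta


def ion_masses(peptide):
    """All b-ion and y-ion masses of a peptide under the formal model."""
    ions, prefix = [], 0.0
    total = sum(RESIDUE_MASS[x] for x in peptide)
    for x in peptide:
        suffix = total - prefix          # mass of x_i..x_n before adding x_i
        ions.append(suffix + BETA)       # y-ion starting at this residue
        prefix += RESIDUE_MASS[x]
        ions.append(prefix + ALPHA)      # b-ion ending at this residue
    return ions


def is_valid_answer(peptide, W, S, tol=0.1):
    """Check (1) mass(P) = W and (2) every mass in S is an ion mass of P.

    `tol` is an assumed measurement tolerance in Daltons.
    """
    total = sum(RESIDUE_MASS[x] for x in peptide)
    if abs(total - W) > tol:
        return False
    ions = ion_masses(peptide)
    return all(any(abs(m - ion) <= tol for ion in ions) for m in S)
```

With W = 429.212 and S = {274.112, 361.121}, the candidate SWR passes both conditions, while WSR has the right total mass but no ion near 361.121.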
Outline of This Talk (3) • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions
Basic Computing Scheme • Input: peptide mass W and tandem mass spectrum S. • Build the NC-spectrum graph. • Find feasible paths to order the masses in S, identifying all the b-ions and y-ions consistent with S. • Convert feasible paths into legal peptide sequences.
NC-Spectrum Graph: Nodes (1) [Figure: a mass axis with node N0 at 0 and node C0 at 429.22, the mass of this peptide]
NC-Spectrum Graph: Nodes (2) Ion #1 (274.11) • Assumption 1: if Ion 1 is a y-ion, add C1 (a b-ion node) at 174.11. • Assumption 2: if Ion 1 is a b-ion, add N1 (a b-ion node) at 273.11. • mass(N1) + mass(C1) = mass(P) + 18. [Figure: a mass axis with N0 (0), C1 (174.11), N1 (273.11), and C0 (429.22, the mass of this peptide)]
NC-Spectrum Graph: Nodes (3) Ion #2 (88.10) adds N2 at 87.10 and C2 at 360.12, with mass(N2) + mass(C2) = mass(P) + 18. [Figure: a mass axis with N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), and C0 (429.22)]
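The node positions on these slides follow a simple pattern: an ion of mass m placed as a b-ion sits at m - 1, and placed as a y-ion it sits at W - (m - 19), so each pair sums to W + 18. A minimal sketch of that construction follows. The function name `nc_nodes` and the tuple layout are assumptions for illustration, and the offsets 1 and 19 are the alpha and beta masses from the formulation.

```python
def nc_nodes(W, ions):
    """Build the NC-spectrum graph nodes for peptide mass W.

    `ions` lists the observed ion masses; ion j (1-based) contributes
    N_j (ion read as a b-ion) and C_j (ion read as a y-ion).
    Returns (label, position) tuples sorted by position.
    """
    nodes = [("N0", 0.0), ("C0", W)]
    for j, m in enumerate(ions, start=1):
        nodes.append((f"N{j}", m - 1.0))          # prefix mass if ion j is a b-ion
        nodes.append((f"C{j}", W - (m - 19.0)))   # prefix mass if ion j is a y-ion
    return sorted(nodes, key=lambda node: node[1])


if __name__ == "__main__":
    for label, position in nc_nodes(429.22, [274.11, 88.10]):
        print(f"{label}: {position:.2f}")
```

For W = 429.22 and ions 274.11 and 88.10, the sorted labels come out as N0, N2, C1, N1, C2, C0, matching the left-to-right order on the slide.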
NC-Spectrum Graph: Edges (1) An edge labeled S is added; Mass(S) = 87.08. [Figure: nodes N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22)]
NC-Spectrum Graph: Edges (2) An edge labeled W is added; Mass(W) = 186.21, Mass(S) = 87.08. [Figure: same nodes, with edges S and W]
NC-Spectrum Graph: Edges (3) An edge labeled S+W is added; Mass(S+W) = 273.29. [Figure: same nodes, with edges S, W, and S+W]
NC-Spectrum Graph: Edges (4) An edge labeled R is added; Mass(R) = 156.19. [Figure: same nodes, with edges S, W, S+W, and R]
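The edge slides suggest the rule that two nodes are joined when their mass difference matches an amino acid mass. The sketch below illustrates that rule under stated assumptions: `nc_edges` is a hypothetical name, only single-residue edges are generated (the slides also draw a combined S+W edge), the residue table holds just the three masses printed on the slides, and the tolerance of 0.25 Da is chosen loosely so that the slides' rounded node positions and residue masses still match.

```python
# Residue masses as printed on the edge slides (Daltons).
SLIDE_RESIDUE_MASS = {"S": 87.08, "W": 186.21, "R": 156.19}


def nc_edges(nodes, residue_mass, tol=0.25):
    """Return (from_label, to_label, residue) for every matching node pair.

    `nodes` must be (label, position) tuples sorted by position, so every
    edge points from a lower mass to a higher mass.
    """
    edges = []
    for i, (u, pos_u) in enumerate(nodes):
        for v, pos_v in nodes[i + 1:]:
            gap = pos_v - pos_u
            for residue, mass in residue_mass.items():
                if abs(gap - mass) <= tol:
                    edges.append((u, v, residue))
    return edges


if __name__ == "__main__":
    nodes = [("N0", 0.0), ("N2", 87.10), ("C1", 174.11),
             ("N1", 273.11), ("C2", 360.12), ("C0", 429.22)]
    for edge in nc_edges(nodes, SLIDE_RESIDUE_MASS):
        print(edge)
```

On the six example nodes this yields the S, W, and R edges along N0, N2, N1, C0, plus a few extra edges through C1 and C2 that lead nowhere useful. The double loop is quadratic in the number of nodes, which is fine for a sketch but is not a claim about the talk's running time.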
NC-Spectrum Graph [Figure: the complete graph on nodes N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22)]
NC-Spectrum Graph: Paths = Sequences [Figure: a path through b-ion nodes with edges labeled S, W, R over nodes N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22)]
NC-Spectrum Graph: A Feasible Path (1) Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj). [Figure: a feasible path through b-ion nodes with edges labeled S, W, R]
NC-Spectrum Graph: A Feasible Path (2) [Figure: a feasible path that mixes b-ion and y-ion nodes, with edges labeled S, S, and GVV]
NC-Spectrum Graph: Not A Feasible Path (1) Not a feasible path: it misses ion #2. [Figure: the offending path on the same nodes]
NC-Spectrum Graph: Not A Feasible Path (2) Not a feasible path: it repeats ion #1. [Figure: the offending path on the same nodes]
NC-Spectrum Graph: Not A Feasible Path (3) Not a feasible path: it misses ion #2 and repeats ion #1. [Figure: the offending path on the same nodes]
Reformulating the De Novo Peptide Sequencing Problem Input: an NC-spectrum graph G. Output: a feasible path from N0 to C0.
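To make the reformulated problem concrete, here is a brute-force sketch that enumerates paths from N0 to C0 and keeps those that use exactly one of Nj or Cj for every ion j. It is an illustration only and is not the algorithm developed in the talk: it can take exponential time, the function name `feasible_paths` is hypothetical, and the node labels are assumed to follow the "N" or "C" plus ion number convention of the slides.

```python
def feasible_paths(labels, edges):
    """Enumerate residue strings of all feasible paths from N0 to C0.

    `labels` are node labels such as "N0", "N2", "C1"; `edges` are
    (from_label, to_label, residue) tuples pointing toward higher mass,
    so the graph is acyclic and the search terminates.
    """
    successors = {}
    for u, v, residue in edges:
        successors.setdefault(u, []).append((v, residue))
    ion_ids = {label[1:] for label in labels} - {"0"}
    found = []

    def dfs(node, used, residues):
        if node == "C0":
            if used == ion_ids:          # every ion pair visited once
                found.append("".join(residues))
            return
        for nxt, residue in successors.get(node, []):
            ion = nxt[1:]
            if ion != "0" and ion in used:
                continue                 # would visit both Nj and Cj
            next_used = used if ion == "0" else used | {ion}
            dfs(nxt, next_used, residues + [residue])

    dfs("N0", frozenset(), [])
    return found


if __name__ == "__main__":
    labels = ["N0", "N2", "C1", "N1", "C2", "C0"]
    edges = [("N0", "N2", "S"), ("N2", "C1", "S"), ("N2", "N1", "W"),
             ("C1", "C2", "W"), ("N1", "C2", "S"), ("N1", "C0", "R")]
    print(feasible_paths(labels, edges))
```

With the single-residue edges of the running example, the only feasible path is N0, N2, N1, C0, which spells SWR. Paths that enter C1 or C2 either dead-end or would reuse an ion pair.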
Observations • A longest path does not always go through exactly one of each pair of nodes. • Finding a feasible path is NP-hard if the spectrum graph is a general directed graph.
Basic Algorithm • Input: a peptide mass W and a tandem mass spectrum S. • Output: a feasible peptide sequence. • Steps: • Compute the nodes of the NC-spectrum graph G. • Compute the edges of G. • Compute a feasible path P in G. • Convert P into a feasible sequence.
Basic Algorithm (1) • Input: a peptide mass W and a tandem mass spectrum S. • Output: a feasible peptide sequence. • Steps: • Compute the nodes of the NC-spectrum graph G. • Compute the edges of G. • Compute a feasible path P in G. • Convert P into a feasible sequence.
Compute the Nodes of the NC-Spectrum Graph Step 1. Compute the nodes and place them in increasing order of mass. [Figure: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22)] Step 2. Rename the nodes from left to right as X0, …, Xk, Yk, …, Y0. [Figure: X0 (0), X1 (87.10), X2 (174.11), Y2 (273.11), Y1 (360.12), Y0 (429.22)] Observation: Xi and Yi form a complementary pair of nodes, that is, the N node and the C node of the same ion. Running time: O(k), where k is the number of masses in the spectrum.
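One way to reach the stated O(k) bound, assuming the spectrum masses arrive already sorted, is to note that the N positions increase with the ion mass while the C positions decrease, so the two lists can be merged in a single pass. The sketch below works under that assumption; the function names are hypothetical, and it further assumes every node position lies strictly between 0 and W.

```python
def sorted_node_positions(W, ions_sorted):
    """Merge N and C node positions in linear time.

    `ions_sorted` must be in increasing order. N_j = m_j - 1 is then
    increasing, and C_j = W - m_j + 19 is increasing once reversed.
    """
    n_side = [m - 1.0 for m in ions_sorted]
    c_side = [W - m + 19.0 for m in reversed(ions_sorted)]
    merged, i, j = [0.0], 0, 0                      # start with N0
    while i < len(n_side) or j < len(c_side):
        take_n = j == len(c_side) or (i < len(n_side) and n_side[i] <= c_side[j])
        if take_n:
            merged.append(n_side[i])
            i += 1
        else:
            merged.append(c_side[j])
            j += 1
    merged.append(W)                                # end with C0
    return merged


def rename(merged):
    """Split the sorted positions into X0..Xk and Y0..Yk (Y read right to left)."""
    k = len(merged) // 2 - 1
    xs = merged[:k + 1]
    ys = merged[::-1][:k + 1]
    return xs, ys


if __name__ == "__main__":
    xs, ys = rename(sorted_node_positions(429.22, [88.10, 274.11]))
    for i, (x, y) in enumerate(zip(xs, ys)):
        print(f"X{i} = {x:.2f}, Y{i} = {y:.2f}, sum = {x + y:.2f}")
```

For the running example, Xi + Yi equals W + 18 for every i >= 1, which is the complementary-pair observation on the slide. X0 and Y0 are N0 and C0 and sum to W.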
Basic Algorithm (2) • Input: a peptide mass W and a tandem mass spectrum S. • Output: a feasible peptide sequence. • Steps: • Compute the nodes of the NC-spectrum graph G. • Compute the edges of G. inverse of each other • Compute a feasible path P in G. • Convert P into a feasible sequence.