An Algorithmic Approach to Peptide Sequencing via Tandem Mass Spectrometry

Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A. Collaborators of this project: University of Southern California (Ting Chen); Harvard Medical School.

Presentation Transcript


  1. An Algorithmic Approach to Peptide Sequencing via Tandem Mass Spectrometry Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A.

  2. Collaborators of This Project • University of Southern California • Ting Chen • Harvard Medical School • George M. Church • John Rush • Matthew Tepel

  3. Perspectives. Genome (DNA), Transcriptome (RNA), Proteome (Protein). A key goal of bioinformatics: to study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. • Genome: the entire set of materials in the chromosomes. • Transcriptome: the entire set of gene transcripts. • Proteome: the entire set of proteins.

  4. Perspectives. Genome (DNA), Transcriptome (RNA), Proteome (Protein). A key goal of bioinformatics: to study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. • Genome: the entire set of materials in the chromosomes. • Transcriptome: the entire set of gene transcripts. • Proteome: the entire set of proteins (this talk's focus).

  5. Proteomics • Proteome: all proteins encoded within a genome • about half a million distinct proteins (temporal, spatial, modifications) • ~30,000 human genes • mRNA and protein expression may not correlate • Proteomics: study of protein expression by biological systems • relative abundance and stability; post-translational modifications • fluctuations as a response to environment and altered cellular needs • correlations between protein expression and disease state • protein-protein interactions, protein complexes • Technologies: • 2D gel electrophoresis • mass spectrometry (this talk's focus) • yeast two-hybrid system • protein chips

  6. A Key Step of Proteomics • How to sequence proteins? • How to sequence protein peptides? (this talk’s focus)

  7. Outline of This Talk • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions

  8. Outline of This Talk (1) • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions

  9. Protein Identification: HPLC-MS-MS [Figure labels: Proteins; Peptides; One Peptide; B-ions / Y-ions; Mass/Charge; Tandem Mass Spectrum]

  11. Peptide Fragmentation and Ionization [Figure labels: B-ion, Y-ion] Complementary: Mass(B-ion) + Mass(Y-ion) = Mass(peptide) + 4H + O
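
The complementarity relation on slide 11 can be checked numerically. The sketch below is an illustration, not the speaker's code. It assumes the nominal offsets used later in the talk (1 for a b-ion, 19 for a y-ion, so 4H + O = 20), takes Mass(peptide) to be the sum of residue masses, and uses standard monoisotopic residue masses because the talk's own mass table did not survive in this transcript. All function names are hypothetical.

```python
# Standard monoisotopic residue masses in daltons (assumed; not taken from the talk).
RESIDUE_MASS = {"S": 87.03203, "W": 186.07931, "R": 156.10111}
B_OFFSET = 1.0    # nominal mass added to a prefix to form a b-ion (assumption)
Y_OFFSET = 19.0   # nominal mass added to a suffix to form a y-ion (assumption)


def residue_sum(residues):
    """Total residue mass of a string of one-letter amino acid codes."""
    return sum(RESIDUE_MASS[a] for a in residues)


def b_ion(peptide, i):
    """Mass of the b-ion made of the first i residues."""
    return B_OFFSET + residue_sum(peptide[:i])


def y_ion(peptide, i):
    """Mass of the y-ion made of the last i residues."""
    return Y_OFFSET + residue_sum(peptide[len(peptide) - i:])


def complement_gap(peptide, i):
    """b_i + y_(n-i) minus the residue mass of the peptide; expected to be 4H + O = 20."""
    n = len(peptide)
    return b_ion(peptide, i) + y_ion(peptide, n - i) - residue_sum(peptide)


if __name__ == "__main__":
    for i in (1, 2):
        print(i, round(complement_gap("SWR", i), 6))
```

For the example peptide SWR, both complementary pairs (b1 with y2, and b2 with y1) give the same gap under these assumptions.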

  12. B-ions and Y-ions Fragmentation

  13. Tandem Mass Spectrum [Figure: abundance (up to 100%) versus mass/charge, with labeled peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225]

  14. Raw Tandem Mass Spectrum

  15. Prediction from Raw Tandem Mass Spectrum

  16. Protein Database Search Find the peptide sequences in a protein database that optimally fit the spectrum. • It does not work if the target peptide sequence is not in the database. • It does not work if there is an unknown modification at some amino acid. • It is very slow because it must search the entire database. • E.g., SEQUEST (Yates, Univ. of Washington).

  17. De Novo Peptide Sequencing Problem • Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide. • Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P. • Example: peptide mass 429.212 daltons; P = SWR, Mass(P) = 429.212, Ions(P) = {88.033, 175.113, 274.112, 361.121, 430.213, 448.225}. [Figure: abundance versus mass/charge, with peaks at 274.112 and 361.121]

  18. Tandem Mass Spectrum [Figure: abundance (up to 100%) versus mass/charge, with labeled peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225] Peptide mass: 429.212 daltons.

  19. Amino Acid Mass Table

  20. Feature 1 All b-ions form a forward mass ladder. [Figure: spectrum for the peptide of mass 429.212 daltons, with b1, b2, b3 marked at 88.033, 274.112, and 430.213; the ladder starts at mass 1]

  21. Feature 2 All y-ions form a reverse mass ladder. [Figure: spectrum for the peptide of mass 429.212 daltons, with y1, y2, y3 marked at 175.113, 361.121, and 448.225; the ladder starts at mass 19]

  22. Basic Difficulty #1 It is unknown whether an ion is a b-ion or a y-ion. [Figure: spectrum with unlabeled peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225; peptide mass 429.212 daltons]

  23. Basic Difficulty #2 There are missing ions. [Figure: spectrum with only two peaks, at 274.112 and 361.121, labeled Ion 1 and Ion 2; peptide mass 429.212 daltons]

  24. Feature 3 (to Our Rescue) Complementary ion pairs: b1/y2 and b2/y1. [Figure: spectrum with y1, y2, y3 at 175.113, 361.121, and 448.225 and b1, b2, b3 at 88.033, 274.112, and 430.213; peptide mass 429.212 daltons]

  25. Outline of This Talk (2) • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions

  26. Formulating the Computational Problem • T = an alphabet of 20 characters a1, a2, …, a20. • Two special characters: alpha and beta. • The mass of alpha = 1, the mass of beta = 19, and the mass of ai is mi. • A peptide sequence is x1, x2, x3, …, xn-1, xn, where each xi is from T. • A b-ion is x0, x1, x2, …, xi for some 1 <= i <= n, where x0 = alpha. • A y-ion is xi, …, xn-2, xn-1, xn, xn+1 for some 1 <= i <= n, where xn+1 = beta.

  27. De Novo Peptide Sequencing Problem • Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide. • Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P. • Example: peptide mass 429.212 daltons; P = SWR, Mass(P) = 429.212, Ions(P) = {88.033, 175.113, 274.112, 361.121, 430.213, 448.225}. [Figure: abundance versus mass/charge, with peaks at 274.112 and 361.121]
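
The two output conditions can be turned into a small checker. This is a sketch under stated assumptions, not the talk's algorithm: it only verifies a candidate peptide and does not search for one. The residue masses are standard monoisotopic values (the talk's table is not in this transcript), the tolerance `tol` is a hypothetical parameter added because measured masses are never exact, and the names `ion_masses` and `is_solution` are made up for illustration.

```python
# Standard monoisotopic residue masses in daltons (assumed; not taken from the talk).
RESIDUE_MASS = {"S": 87.03203, "W": 186.07931, "R": 156.10111}
ALPHA = 1.0   # mass of the special character alpha (from the formulation)
BETA = 19.0   # mass of the special character beta (from the formulation)


def ion_masses(peptide):
    """All b-ion and y-ion masses of a peptide, for fragment lengths 1..n."""
    n = len(peptide)
    masses = []
    for i in range(1, n + 1):
        prefix = sum(RESIDUE_MASS[a] for a in peptide[:i])
        suffix = sum(RESIDUE_MASS[a] for a in peptide[n - i:])
        masses.append(ALPHA + prefix)   # b-ion: alpha followed by x1..xi
        masses.append(BETA + suffix)    # y-ion: last i residues followed by beta
    return sorted(masses)


def is_solution(peptide, W, S, tol=0.01):
    """Check (1) mass(P) = W and (2) every mass in S is an ion mass of P, within tol."""
    total = sum(RESIDUE_MASS[a] for a in peptide)
    if abs(total - W) > tol:
        return False
    ions = ion_masses(peptide)
    return all(any(abs(s - m) <= tol for m in ions) for s in S)


if __name__ == "__main__":
    print(is_solution("SWR", 429.212, [88.032, 274.111]))
    print(is_solution("SRW", 429.212, [274.111]))
```

Condition (2) only asks for S to be a subset of the ion masses, so a spectrum with missing ions can still be accepted.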

  28. Amino Acid Mass Table

  29. Outline of This Talk (3) • Problem Formulation (Biology) • Problem Formulation (Computer Science) • Basic Computational Techniques • Improved Computational Complexity and More Robust Algorithms • Conclusions

  30. Basic Computing Scheme • Input: peptide mass W and tandem mass spectrum S. • Build the NC-spectrum graph. • Find feasible paths to order the masses in S, identifying all the b-ions and y-ions consistent with S. • Convert feasible paths into legal peptide sequences.

  31. NC-Spectrum Graph: Nodes (1) Nodes: N0 (0) and C0 (429.22, the mass of this peptide).

  32. NC-Spectrum Graph: Nodes (2) Ion #1 (274.11). Assumption 1: if Ion 1 is a y-ion, C1 is a b-ion node. Assumption 2: if Ion 1 is a b-ion, N1 is a b-ion node. Nodes in mass order: N0 (0), C1 (174.11), N1 (273.11), C0 (429.22, the mass of this peptide). mass(N1) + mass(C1) = mass(P) + 18.

  33. NC-Spectrum Graph: Nodes (3) Ion #2 (88.10). Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). mass(N2) + mass(C2) = mass(P) + 18.
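
The node values on these slides follow from two subtractions per ion. The sketch below reproduces them; it is inferred from the numbers shown (N = ion mass minus 1, C = W + 19 minus ion mass, so that N + C = W + 18) rather than copied from the speaker's code, and the function name `nc_nodes` is hypothetical.

```python
def nc_nodes(W, ions):
    """Build the NC-spectrum graph nodes for peptide mass W and a list of ion masses.

    Ion j contributes two candidate nodes:
      Nj = m - 1        (prefix mass if ion j is a b-ion)
      Cj = W + 19 - m   (prefix mass of the complementary b-ion if ion j is a y-ion)
    Returns (name, mass) pairs sorted by increasing mass.
    """
    nodes = [("N0", 0.0), ("C0", W)]
    for j, m in enumerate(ions, start=1):
        nodes.append((f"N{j}", m - 1.0))
        nodes.append((f"C{j}", W + 19.0 - m))
    return sorted(nodes, key=lambda node: node[1])


if __name__ == "__main__":
    for name, mass in nc_nodes(429.22, [274.11, 88.10]):
        print(name, round(mass, 2))
```

With W = 429.22 and ions 274.11 and 88.10, the sorted node names come out in the order N0, N2, C1, N1, C2, C0, matching the slides.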

  34. NC-Spectrum Graph: Edges (1) Mass(S) = 87.08. Edge label shown: S. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22).

  35. NC-Spectrum Graph: Edges (2) Mass(W) = 186.21; Mass(S) = 87.08. Edge labels shown: W, S. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22).

  36. NC-Spectrum Graph: Edges (3) Mass(W) = 186.21; Mass(S) = 87.08; Mass(S+W) = 273.29. Edge labels shown: W, S, S+W. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22).

  37. NC-Spectrum Graph: Edges (4) Mass(W) = 186.21; Mass(R) = 156.19; Mass(S) = 87.08; Mass(S+W) = 273.29. Edge labels shown: W, R, S, S+W. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22).
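
One way to read these edge slides: an edge joins a lighter node to a heavier one when the mass gap matches one amino acid or a short combination such as S+W. The sketch below illustrates that reading and is not the talk's implementation. The residue masses are standard monoisotopic values (the slides quote slightly different figures), and the tolerance `tol`, the cap `max_len` on residues per edge, and the function name `build_edges` are all hypothetical.

```python
from itertools import combinations_with_replacement

# Standard monoisotopic residue masses in daltons (assumed; a real table has 20 entries).
RESIDUE_MASS = {"S": 87.03203, "W": 186.07931, "R": 156.10111}


def build_edges(nodes, residue_mass, tol=0.1, max_len=2):
    """Return {(u, v): [labels]} for node pairs whose mass gap matches residues.

    nodes must be (name, mass) pairs sorted by increasing mass.
    A label is one residue ("S") or an unordered combination ("S+W").
    """
    candidates = []
    for r in range(1, max_len + 1):
        for combo in combinations_with_replacement(sorted(residue_mass), r):
            candidates.append(("+".join(combo), sum(residue_mass[a] for a in combo)))

    edges = {}
    for i, (u, mass_u) in enumerate(nodes):
        for v, mass_v in nodes[i + 1:]:
            gap = mass_v - mass_u
            labels = [lab for lab, m in candidates if abs(gap - m) <= tol]
            if labels:
                edges[(u, v)] = labels
    return edges


if __name__ == "__main__":
    nodes = [("N0", 0.0), ("N2", 87.10), ("C1", 174.11),
             ("N1", 273.11), ("C2", 360.12), ("C0", 429.22)]
    for pair, labels in build_edges(nodes, RESIDUE_MASS).items():
        print(pair, labels)
```

Because every pair of nodes is compared, this sketch takes time quadratic in the number of nodes; it makes no attempt at the improved complexity promised in the outline.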

  38. NC-Spectrum Graph Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22).

  39. NC-Spectrum Graph: Paths = Sequences Path labeled b-ions, with edge labels W, R, S. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22).

  40. NC-Spectrum Graph: A Feasible Path (1) A feasible path, labeled b-ions, with edge labels W, R, S. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).

  41. NC-Spectrum Graph: A Feasible Path (2) A feasible path; figure labels: y-ions, b-ions; edge labels S, S, GVV. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).

  42. NC-Spectrum Graph: Not a Feasible Path (1) Not a feasible path: (1) it misses ion #2. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).

  43. NC-Spectrum Graph: Not a Feasible Path (2) Not a feasible path: (2) it repeats ion #1. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).

  44. NC-Spectrum Graph: Not a Feasible Path (3) Not a feasible path: (1) it misses ion #2, and (2) it repeats ion #1. Nodes in mass order: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).

  45. Reformulating the De Novo Peptide Sequencing Problem Input: an NC-spectrum graph G. Output: a feasible path from N0 to C0.
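
To make the feasibility condition concrete, the sketch below enumerates feasible paths by exhaustive depth-first search. It is a brute-force illustration that can take exponential time, not the efficient algorithm the talk develops. It assumes node names of the form Nj/Cj, nodes listed in increasing mass order, and edges given as a set of (lighter, heavier) name pairs; the edge set in the example is a hypothetical one for the six-node graph, and `feasible_paths` is a made-up name.

```python
def feasible_paths(names, edges):
    """List every path N0 -> C0 that uses exactly one of Nj, Cj for each ion j >= 1.

    names: node names in increasing mass order, e.g. ["N0", "N2", ..., "C0"].
    edges: set of (u, v) pairs with u lighter than v.
    """
    ions = {name[1:] for name in names if name[1:] != "0"}
    found = []

    def dfs(path, seen):
        u = path[-1]
        if u == "C0":
            if seen == ions:          # every ion visited exactly once
                found.append(list(path))
            return
        for v in names[names.index(u) + 1:]:
            if (u, v) not in edges:
                continue
            j = v[1:]
            if j != "0" and j in seen:
                continue              # would repeat ion j
            path.append(v)
            dfs(path, seen if j == "0" else seen | {j})
            path.pop()

    dfs(["N0"], frozenset())
    return found


if __name__ == "__main__":
    names = ["N0", "N2", "C1", "N1", "C2", "C0"]
    # Hypothetical edge set for illustration only.
    edges = {("N0", "N2"), ("N2", "N1"), ("N1", "C0"), ("N2", "C1"),
             ("C1", "C0"), ("N0", "N1"), ("N1", "C2")}
    for p in feasible_paths(names, edges):
        print(" -> ".join(p))
```

In this example the path N0, N1, C0 is rejected because it misses ion 2, and any path through both N2 and C2 is pruned because it would repeat ion 2.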

  46. Observations • A longest path does not always go through exactly one of each pair of nodes. • It is an NP-hard problem if the spectrum graph is a general directed graph.

  47. Basic Algorithm • Input: a peptide mass W and a tandem mass spectrum S. • Output: a feasible peptide sequence. • Steps: • Compute the nodes of the NC-spectrum graph G. • Compute the edges of G. • Compute a feasible path P in G. • Convert P into a feasible sequence.

  48. Basic Algorithm (1) • Input: a peptide mass W and a tandem mass spectrum S. • Output: a feasible peptide sequence. • Steps: • Compute the nodes of the NC-spectrum graph G. • Compute the edges of G. • Compute a feasible path P in G. • Convert P into a feasible sequence.

  49. Compute the Nodes of the NC-Spectrum Graph Step 1. Compute the nodes and place them in increasing order of mass: N0 (0), N2 (87.10), C1 (174.11), N1 (273.11), C2 (360.12), C0 (429.22). Step 2. Rename the nodes from left to right as X0, …, Xk, Yk, …, Y0: X0 (0), X1 (87.10), X2 (174.11), Y2 (273.11), Y1 (360.12), Y0 (429.22). Observation: Xi and Yi form a complementary pair of nodes Ni and Ci for ion i. Running time: O(k), where k = number of masses in the spectrum.
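
The renaming works because the two nodes of every ion sum to the same constant (mass(P) + 18 in the earlier node slides), so in sorted order the pairs nest symmetrically around the middle. A minimal sketch of Step 2, assuming the node masses are already sorted and with `rename_nodes` as a hypothetical name:

```python
def rename_nodes(sorted_masses):
    """Split 2k+2 sorted node masses into X0..Xk (left half) and Y0..Yk (right half, reversed).

    After renaming, Xi and Yi are the two nodes of the same ion for i >= 1,
    and X0, Y0 are the endpoints N0 and C0.
    """
    half = len(sorted_masses) // 2
    X = list(sorted_masses[:half])
    Y = list(reversed(sorted_masses[half:]))
    return X, Y


if __name__ == "__main__":
    X, Y = rename_nodes([0.0, 87.10, 174.11, 273.11, 360.12, 429.22])
    for i, (x, y) in enumerate(zip(X, Y)):
        print(f"X{i} = {x}, Y{i} = {y}, sum = {round(x + y, 2)}")
```

For i >= 1 the printed sums agree with each other, which is the complementary-pair observation on the slide.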

  50. Basic Algorithm (2) • Input: a peptide mass W and a tandem mass spectrum S. • Output: a feasible peptide sequence. • Steps: • Compute the nodes of the NC-spectrum graph G. • Compute the edges of G. inverse of each other • Compute a feasible path P in G. • Convert P into a feasible sequence.
