
Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification

Nathan Edwards, Center for Bioinformatics and Computational Biology


Presentation Transcript


  1. Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification. Nathan Edwards, Center for Bioinformatics and Computational Biology

  2. Mass Spectrometer: Sample → Ionizer (MALDI, Electrospray Ionization (ESI)) → Mass Analyzer (Time-Of-Flight (TOF), Quadrupole, Ion-Trap) → Detector (Electron Multiplier (EM))

  3. Mass Spectrometer (MALDI-TOF). Figure labels: UV (337 nm); analyte/matrix; backing plate (grounded); source (length = s, pulse voltage); extraction grid (source voltage -Vs); field-free drift zone (Ed = 0, length = D); detector grid (-Vs); microchannel plate detector

  4. Mass is fundamental

  5. Enzymatic Digest and Fractionation Sample Preparation for MS/MS

  6. Single Stage MS (figure label: MS)

  7. Tandem Mass Spectrometry (MS/MS): Precursor selection

  8. Tandem Mass Spectrometry (MS/MS): Precursor selection + collision induced dissociation (CID), then MS/MS

  9. Peptide Fragmentation (figure: peptide backbone -HN-CH-CO-NH-CH-CO-NH- with side chains R', Ri, Ri+1, R"; cleavage sites labeled bi / yn-i and bi+1 / yn-i-1)

  10. Peptide Fragmentation: MS/MS spectrum of peptide SGFLEEDELK (figure: % Intensity, 0-100, vs. m/z, 0-1000)

  11. Peptide Fragmentation (peptide SGFLEEDELK)
      b ions:    88  145  292  405  534  663  778  907 1020 1166
      residue:    S    G    F    L    E    E    D    E    L    K
      y ions:  1166 1080 1022  875  762  633  504  389  260  147
      Spectrum (% Intensity vs. m/z, 0-1000), labeled peaks: b3, b4, b5, b6, b7, b8, b9, y2, y3, y4, y5, y6, y7, y8, y9
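
The b- and y-ion ladders above can be reproduced approximately with a short calculation. This is a minimal sketch, assuming singly charged ions and standard monoisotopic residue masses (the slide does not say which mass table or rounding it used, so some values, such as y8 and y9, can differ by about 1 Da). The function names are illustrative, and the sketch covers b1-b9 and y1-y9 only.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
# Assumption: the slide does not state which mass table it used.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..b(n-1): N-terminal prefix + proton."""
    total, ladder = PROTON, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..y(n-1): C-terminal suffix + water + proton."""
    total, ladder = WATER + PROTON, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


print([round(m) for m in b_ions("SGFLEEDELK")])
# [88, 145, 292, 405, 534, 663, 778, 907, 1020]
```

The differences between neighboring peaks in either ladder are residue masses, which is what lets a search engine match a spectrum against a candidate peptide sequence.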

  12. MS/MS Search Engines • Fail when peptides are missing from sequence database • Protein sequence databases serve many masters • Full length protein sequences not needed for MS/MS • Explicit variant enumeration is needed for MS/MS • Much peptide sequence information is lost, inaccessible, or not integrated • Protein isoforms, sequence variants, SNPs, alternate splice forms, ESTs • Some peptides are more interesting than others • Protein identification is only part of the story

  13. Human Sequences • Number of Human Genes is believed to be between 20,000 and 25,000

  14. DNA to Protein Sequence Derived from http://online.itp.ucsb.edu/online/infobio01/burge

  15. UCSC Genome Browser

  16. Genomic Peptide Sequences • Many putative peptide sequences never become “protein” sequences • Genomic DNA, RefSeq mRNA, ESTs • SNP/Polymorphism databases • Variant records in SwissProt • Genomic annotation seeks “full length” genes and proteins

  17. Genomic Peptide Sequences • Genomic DNA • Exons & introns, 6 frames, large (3Gb → 6Gb) • Refseq mRNA • No introns, 3 frames, small (36Mb → 36Mb) • Most protein sequences already represented in sequence databases • ESTs • No introns, 6 frames, large (3Gb → 6Gb) • Used by gene & alternative splicing pipelines • Highly redundant, nucleotide error rate ~ 1%
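
The size arrows on this slide are consistent with simple frame arithmetic. This is a sketch under the assumption that the left-hand figure counts nucleotides and the right-hand figure counts translated amino-acid residues (the slide does not say so explicitly). The function name is illustrative.

```python
def translated_residues(nucleotides, frames):
    """Residues produced by translating `nucleotides` bases in `frames` reading frames.

    Each frame yields roughly one residue per three bases.
    """
    return nucleotides * frames // 3


print(translated_residues(3_000_000_000, 6))  # genomic DNA or ESTs, 6 frames
print(translated_residues(36_000_000, 3))     # RefSeq mRNA, 3 frames
```

Six frames double the residue count relative to the base count, while three frames leave it unchanged, matching 3Gb → 6Gb and 36Mb → 36Mb.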

  18. “Novel” Peptide

  19. Novel peptide

  20. EST Peptides • 6 frame translation • Ambiguous base enumeration (up to a point) • Break at non-amino-acids • stop codons + X • Discard AA sequence < 50 AA long • Result: ~ 3 Gb • Not as simple as it sounds!
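
The steps on this slide (6 frame translation, break at stop codons and X, discard short fragments) can be sketched as follows. This is an illustrative sketch, not the pipeline used in the talk: it assumes the standard genetic code, translates any codon containing a non-ACGT base as X instead of enumerating ambiguous bases, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; anything not in the table becomes X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_fragments(dna, min_len=50):
    """Translate all 6 frames, break at stop codons and X, drop short pieces."""
    dna = dna.upper()
    fragments = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            protein = translate(strand[offset:])
            for piece in protein.replace("X", "*").split("*"):
                if len(piece) >= min_len:
                    fragments.append(piece)
    return fragments


# Tiny demonstration with a lowered length threshold.
print(six_frame_fragments("ATGGCCTAAGGG", min_len=2))
# ['MA', 'WPK', 'GLR', 'PLGH', 'LRP']
```

The default `min_len=50` mirrors the slide's cutoff; the demonstration lowers it only so that a 12-base input produces output.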

  21. EST Peptides • Lots of ambiguous bases • >gi|272208|gb|M61958.1| TGCACAACCAAGTTTTGTGACTACGGGAAGGCTCCCGGGGCAGAGGAGTACGCTCAACAAGATGTGTTAAAGAAATCTTACTCCAAGGCCTTCACGCTGACCATCTCTGCCCTCTTTGTGACACCCAAGACGACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAACTNCAGTTGTNGCCGAGTGATGTGGACAAGCTGTCACCCACTGACA

  22. Codon Table

  23. EST Peptides • Frame 1 translation: CTTKFCDYGKAPGAEEYAQQDVLKKSYSKAFTLTISALFVTPKTTGA[QPRL]VELSEQQLQL[S*LW]PSDVDKLSPTD[IKMNSRT]
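
The bracketed positions in this translation come from codons that contain an ambiguous base. This is a minimal sketch of that enumeration, assuming the standard genetic code and IUPAC ambiguity codes, and assuming that a trailing partial codon is padded with N (which is what the final bracket on the slide suggests). It prints each residue set in sorted order, so the brackets contain the same residues as the slide but not necessarily in the same order. The cap implied by "up to a point" on the earlier slide is not implemented.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def codon_residues(codon):
    """All residues (or '*') that any expansion of an ambiguous codon can encode."""
    codon = codon.ljust(3, "N")  # assumption: pad a trailing partial codon with N
    return {CODON_TABLE["".join(c)] for c in product(*(IUPAC[b] for b in codon))}


def translate_ambiguous(dna):
    """Frame 1 translation; ambiguous positions are written as [sorted residues]."""
    out = []
    for i in range(0, len(dna), 3):
        residues = codon_residues(dna[i:i + 3])
        if len(residues) == 1:
            out.append(next(iter(residues)))
        else:
            out.append("[" + "".join(sorted(residues)) + "]")
    return "".join(out)


# EST sequence from the earlier slide (gi|272208|gb|M61958.1).
EST = ("TGCACAACCAAGTTTTGTGACTACGGGAAGGCTCCCGGGGCAGAGGAGTACGCTCAACAAGATGTG"
       "TTAAAGAAATCTTACTCCAAGGCCTTCACGCTGACCATCTCTGCCCTCTTTGTGACACCCAAGACG"
       "ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAACTNCAGTTGTNGCCGAGTGATGTGGACAAGCTG"
       "TCACCCACTGACA")

print(translate_ambiguous(EST))
```

Note that the codon CTN stays a plain L because all four expansions encode leucine, while TNG includes a stop codon and would therefore also be a break point.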

  24. Correcting EST Sequence • Align ESTs to genome • Use aligned genomic sequence • Must get splice sites right! • 6 frame translation • Break at non-amino-acids • stop codons + X • Discard AA sequence < 50 AA long • Result: ~ 1 Gb

  25. Genomic Coding Sequence • Use Genscan to predict exons • Use very low probability threshold • Alternative exons option • No need for translation (35Mb)

  26. Figure labels: Gene model (exons 1-5; 4* = Exon 4 w/ SNP); Exon “Pair” Enumeration; Exon Pairs & Paths; 3-Frame Translation; Peptide Sequence; 30 AA C3 Compression

  27. Peptide Candidates • Parent ion • Typically < 3000 Da • Tryptic Peptides • Cut at K or R • Search engines • Don’t handle > 4+ well • Long peptides don’t fragment well • # of distinct 30-mers upper bounds total peptide content
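
The candidate rules on this slide (cut at K or R, parent ion typically under 3000 Da) can be sketched as follows. This is an illustrative sketch, assuming cleavage after every K or R with no missed cleavages and no proline exception, and neutral monoisotopic masses; the function names and the `max_mass` parameter are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Split after every K or R; a trailing piece without K/R is kept."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", protein)


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def candidates(protein, max_mass=3000.0):
    """Tryptic peptides light enough to be plausible parent ions."""
    return [p for p in tryptic_peptides(protein) if peptide_mass(p) < max_mass]


print(candidates("SGFLEEDELKAAR"))
# ['SGFLEEDELK', 'AAR']
```

With an average residue mass of a little over 100 Da, a 3000 Da limit keeps peptides to roughly 30 residues or fewer, which is the motivation for bounding peptide content by distinct 30-mers.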

  28. Sequence Database Compression Construct sequence database that is • Complete • All 30-mers are present • Correct • No other 30-mers are present • Compact • No 30-mer is present more than once
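
The three properties can be checked mechanically for any proposed compressed database. This is a minimal sketch with illustrative function names; it uses the three example sequences from the SBH-graph slides and assumes k = 4 for that toy example (the value is inferred from the example, not stated on the slides, which use 30-mers for real data).

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer occurrence across a list of sequences."""
    return Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))


def check_c3(original, compressed, k):
    """Report whether `compressed` is complete, correct, and compact w.r.t. `original`."""
    orig = kmer_counts(original, k)
    comp = kmer_counts(compressed, k)
    return {
        "complete": set(orig) <= set(comp),              # every original k-mer is present
        "correct": set(comp) <= set(orig),               # no new k-mers were introduced
        "compact": all(n == 1 for n in comp.values()),   # no k-mer appears twice
    }


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(check_c3(original, ["ACDEFGEFGI", "DEFACG"], k=4))
print(check_c3(original, original, k=4))
```

The first call checks the two-sequence database shown later in the deck; the second shows that the uncompressed input is complete and correct but not compact, because 4-mers such as ACDE occur more than once.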

  29. SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

  30. Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

  31. Sequence Databases & CSBH-graphs • Sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

  32. Sequence Databases & CSBH-graphs • All k-mers represented by an edge have the same count (figure: edge counts 1, 2, 2, 1, 2)

  33. Sequence Databases & CSBH-graphs • Complete • All edges are on some path • Correct • Output path sequence only • Compact • No edge is used more than once • C3 Path Set uses all edges exactly once.

  34. Sequence Databases & CSBH-graphs • Use each edge exactly once ACDEFGEFGI, DEFACG
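
A path set that uses each edge exactly once can be built with a simple greedy walk over the k-mer graph. This is a sketch only, not the algorithm from the talk: it treats each distinct k-mer as an edge between (k-1)-mer nodes, assumes k = 4 for the toy example, and makes no attempt to minimize the number or total length of paths, so it can return more paths than the two shown on the slide. The function name is illustrative.

```python
from collections import defaultdict


def c3_paths(sequences, k):
    """Greedy sketch: cover every distinct k-mer (edge) exactly once with paths."""
    edges = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    out_edges = defaultdict(list)   # (k-1)-mer node -> unused outgoing k-mers
    balance = defaultdict(int)      # out-degree minus in-degree per node
    for edge in sorted(edges):
        out_edges[edge[:-1]].append(edge)
        balance[edge[:-1]] += 1
        balance[edge[1:]] -= 1

    def walk(node):
        """Follow unused edges from `node` until stuck; return the spelled sequence."""
        path = node
        while out_edges[node]:
            edge = out_edges[node].pop()
            path += edge[-1]
            node = edge[1:]
        return path

    paths = []
    # Start from nodes with more outgoing than incoming edges.
    for node in sorted(n for n, b in balance.items() if b > 0):
        for _ in range(balance[node]):
            paths.append(walk(node))
    # Sweep up whatever remains, for example cycles not reached above.
    for node in sorted(out_edges):
        while out_edges[node]:
            paths.append(walk(node))
    return paths


print(c3_paths(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
```

Splicing leftover cycles into existing paths, as in an Eulerian-trail construction, would shorten the output further; the greedy version is kept here because it is easy to verify that every edge is used once.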

  35. Sequence Databases & CSBH-graphs • All k-mers that occur at least twice (figure: edge counts 1, 2, 2, 1, 2; resulting sequence ACDEFGI)
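
Keeping only the k-mers that occur at least twice is a counting filter. This is a minimal sketch with an illustrative function name, again assuming k = 4 for the toy example sequences; for them, the surviving 4-mers are exactly those of ACDEFGI.

```python
from collections import Counter


def repeated_kmers(sequences, k, min_count=2):
    """Return the k-mers that occur at least `min_count` times across all sequences."""
    counts = Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}


print(sorted(repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)))
# ['ACDE', 'CDEF', 'DEFG', 'EFGI']
```

For noisy inputs such as ESTs, a filter of this kind tends to discard k-mers created by isolated sequencing errors, since an erroneous k-mer is less likely to be observed twice.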

  36. Relative Search Time (chart comparing sequence databases SP, UP, IPI-H, SP-VS, UP-VS)

  37. More Sensitive Peptide ID • Significances, p-values, Expect values • Normalize for number of trials • Blast: • Size of sequence database • Mascot etc.: • Number of peptides scored against each spectrum • Redundant peptide sequences increase the number of trials, artificially. • Trials are not independent! • Less redundancy results in a better significance estimate
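
The effect of redundancy on significance can be illustrated with the common approximation that an Expect value is the per-trial p-value multiplied by the number of trials. This is a sketch under that assumption; the slides do not give a formula, and the function name and numbers are illustrative only.

```python
def expect_value(p_value, trials):
    """Approximate expected number of chance hits at least this good (E = p * N)."""
    return p_value * trials


p = 1e-6
print(expect_value(p, 1_000))   # 1,000 distinct candidate peptides
print(expect_value(p, 2_000))   # same peptides, each present twice in the database
```

Listing every candidate twice doubles the reported Expect value even though no new hypotheses were tested, which is why a less redundant database gives a better significance estimate.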

  38. More Sensitive Peptide ID

  39. Human Peptide Sequences • EST enumeration • 30-mers must occur at least twice • EST corrections • Genscan exons • Uncompressed size: ~ 4.5Gb • Compressed size: ~ 263Mb

  40. Infrastructure • X!Tandem open source search engine • Configured to search aggressive peptide enumeration (human) • Web interface for browsing results • Integrated with condor • Results stored in MySQL database • Over 3 million publicly available MS/MS spectra from human samples

  41. “Novel” Peptide

  42. “Novel” Peptide

  43. “Novel” Peptide

  44. Ongoing work • Integrate SNPs and exon pairs • Get (lots) more spectra! • Solve the reverse mapping problem • Where did this peptide come from? • What protein does this peptide represent?

  45. Thanks • Informatics Research @ ABI & Celera • Ross Lippert, Clark Mobarry, Bjarni Halldorsson • UMIACS @ University of Maryland, CP • V.S. Subrahmanian, Fritz McCall, Doan Pham • Fenselau Lab @ UM, CP • CS @ University of Maryland, CP • Chau-Wen Tseng, Xue Wu
