Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification

Nathan Edwards, Center for Bioinformatics and Computational Biology

Mass Spectrometer • Sample • Ionizer: MALDI, Electrospray Ionization (ESI) • Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap • Detector: Electron Multiplier (EM)

Mass Spectrometer (MALDI-TOF) [Figure: MALDI-TOF schematic. Labels: UV (337 nm), analyte/matrix on backing plate (grounded), source region (length = s) with pulse voltage, extraction grid (source voltage -Vs), detector grid (-Vs), field-free drift zone (length = D, Ed = 0), microchannel plate detector.]

Mass is fundamental

Enzymatic Digest and Fractionation Sample Preparation for MS/MS

Single Stage MS

Tandem Mass Spectrometry (MS/MS) • Precursor selection

Tandem Mass Spectrometry (MS/MS) • Precursor selection + collision-induced dissociation (CID)

Peptide Fragmentation [Figure: peptide backbone -HN-CH-CO-NH-CH-CO-NH- with side chains R’, R_i, R_{i+1}, R”; cleavage sites labeled b_i / y_{n-i} and b_{i+1} / y_{n-i-1}.]

Peptide Fragmentation • Peptide: SGFLEEDELK [Figure: MS/MS spectrum, % Intensity (0 to 100) vs. m/z (0 to 1000).]

Peptide Fragmentation
b ions:   88  145  292  405  534  663  778  907 1020 1166
residue:   S    G    F    L    E    E    D    E    L    K
y ions: 1166 1080 1022  875  762  633  504  389  260  147
[Figure: MS/MS spectrum, % Intensity (0 to 100) vs. m/z (0 to 1000), with labeled peaks b3 to b9 and y2 to y9.]
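The b- and y-ion ladder for SGFLEEDELK can be approximated with a few lines of Python. This is a minimal sketch, assuming singly charged ions and standard monoisotopic residue masses; the slide does not say which mass type it used, so the computed values agree with the slide only to within about 1 Da. The function names are illustrative. The slide's final b value (1166) corresponds to the full protonated peptide, which this sketch reports as the last y-type sum rather than as a b ion.

```python
# Monoisotopic residue masses in Da (standard values; an assumption, the slide
# does not state whether it used monoisotopic or average masses).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def b_ions(peptide):
    """Singly charged b1..b(n-1): N-terminal prefix residue masses plus a proton."""
    total, ions = PROTON, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(total)
    return ions


def y_ions(peptide):
    """Singly charged y1..y(n-1): C-terminal suffix residue masses plus water and a proton."""
    total, ions = WATER + PROTON, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ions.append(total)
    return ions


if __name__ == "__main__":
    for name, ions in (("b", b_ions("SGFLEEDELK")), ("y", y_ions("SGFLEEDELK"))):
        print(name, [f"{mz:.2f}" for mz in ions])
```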

MS/MS Search Engines • Fail when peptides are missing from sequence database • Protein sequence databases serve many masters • Full length protein sequences not needed for MS/MS • Explicit variant enumeration is needed for MS/MS • Much peptide sequence information is lost, inaccessible, or not integrated: protein isoforms, sequence variants, SNPs, alternate splice forms, ESTs • Some peptides are more interesting than others • Protein identification is only part of the story

Human Sequences • Number of Human Genes is believed to be between 20,000 and 25,000

DNA to Protein Sequence Derived from http://online.itp.ucsb.edu/online/infobio01/burge

UCSC Genome Browser

Genomic Peptide Sequences • Many putative peptide sequences never become “protein” sequences • Genomic DNA • RefSeq mRNA, ESTs • SNP/Polymorphism databases • Variant records in SwissProt • Genomic annotation seeks “full length” genes and proteins

Genomic Peptide Sequences • Genomic DNA: exons & introns, 6 frames, large (3Gb → 6Gb) • RefSeq mRNA: no introns, 3 frames, small (36Mb → 36Mb); most protein sequences already represented in sequence databases • ESTs: no introns, 6 frames, large (3Gb → 6Gb); used by gene & alternative splicing pipelines; highly redundant, nucleotide error rate ~ 1%

“Novel” Peptide

Novel peptide

EST Peptides • 6-frame translation • Ambiguous base enumeration (up to a point) • Break at non-amino-acids (stop codons + X) • Discard AA sequences < 50 AA long • Result: ~ 3 Gb • Not as simple as it sounds!
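The translate, break, and discard steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used for the talk: it uses the standard genetic code, marks any codon containing a non-ACGT base as X instead of enumerating it, and the function names and the `min_len` parameter are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate one reading frame; codons with non-ACGT bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(dna, min_len=50):
    """Translate all six frames, break at stop codons and X, drop short pieces."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    peptides = []
    for strand in (dna, reverse):
        for offset in range(3):
            for piece in re.split(r"[*X]", translate(strand[offset:])):
                if len(piece) >= min_len:
                    peptides.append(piece)
    return peptides


if __name__ == "__main__":
    # Toy input with a tiny threshold so that something survives the length filter.
    print(six_frame_peptides("ATGGCCTAAGGG", min_len=2))
```

With the slide's threshold of 50 amino acids, a short toy sequence like the one above yields nothing, which is the point of the length filter.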

EST Peptides • Lots of ambiguous bases • >gi|272208|gb|M61958.1| TGCACAACCAAGTTTTGTGACTACGGGAAGGCTCCCGGGGCAGAGGAGTACGCTCAACAAGATGTGTTAAAGAAATCTTACTCCAAGGCCTTCACGCTGACCATCTCTGCCCTCTTTGTGACACCCAAGACGACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAACTNCAGTTGTNGCCGAGTGATGTGGACAAGCTGTCACCCACTGACA

Codon Table

EST Peptides • Frame 1 translation: CTTKFCDYGKAPGAEEYAQQDVLKKSYSKAFTLTISALFVTPKTTGA[QPRL]VELSEQQLQL[S*LW]PSDVDKLSPTD[IKMNSRT]
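The bracketed sets in the frame 1 translation are the amino acids compatible with an ambiguous codon. A minimal Python sketch of that enumeration follows, assuming the standard genetic code and IUPAC nucleotide codes. Padding a trailing partial codon with N is an inference from the final [IKMNSRT] on the slide, and the function names are hypothetical. Within a bracket this sketch sorts the letters, so the order differs from the slide.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def codon_amino_acids(codon):
    """Set of amino acids (or '*') that an ambiguous codon could encode."""
    codon = codon.upper().ljust(3, "N")  # assumption: pad a partial codon with N
    return {CODON_TABLE["".join(c)] for c in product(*(IUPAC[b] for b in codon))}


def translate_ambiguous(dna):
    """Translate frame 1, writing ambiguous positions as [..] sets."""
    out = []
    for i in range(0, len(dna), 3):
        aas = codon_amino_acids(dna[i:i + 3])
        out.append(next(iter(aas)) if len(aas) == 1
                   else "[" + "".join(sorted(aas)) + "]")
    return "".join(out)


if __name__ == "__main__":
    print(translate_ambiguous("GCCCNGGTG"))
```

Turning these sets into concrete peptides multiplies the number of sequences with each ambiguous position, which is presumably why the enumeration is only done "up to a point".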

Correcting EST Sequence • Align ESTs to genome • Use aligned genomic sequence • Must get splice sites right! • 6-frame translation • Break at non-amino-acids (stop codons + X) • Discard AA sequences < 50 AA long • Result: ~ 1 Gb

Genomic Coding Sequence • Use Genscan to predict exons • Use very low probability threshold • Alternative exons option • No need for translation (35Mb)

Exon Pairs & Paths [Figure: gene model with exons 1 to 5 plus 4* (exon 4 w/ SNP). Panel labels: Exon “Pair” Enumeration, 3-Frame Translation, 30 AA, C3 Compression, Peptide Sequence.]

Peptide Candidates • Parent ion: typically < 3000 Da • Tryptic peptides: cut at K or R • Search engines: don’t handle charge states > 4+ well • Long peptides don’t fragment well • # of distinct 30-mers upper bounds total peptide content
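The candidate constraints above can be illustrated with a small Python sketch. It rests on several assumptions: cleavage after every K or R with no proline exception and no missed cleavages, standard monoisotopic residue masses, and the 3000 Da limit applied to the neutral peptide mass. The function names and the toy protein string are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_peptides(protein):
    """Cut after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(protein, max_mass=3000.0):
    """Tryptic peptides light enough to be plausible parent ions."""
    return [p for p in tryptic_peptides(protein) if peptide_mass(p) < max_mass]


if __name__ == "__main__":
    print(candidates("ACDKEFGRHIKLMN"))
```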

Sequence Database Compression • Construct a sequence database that is: • Complete: all 30-mers are present • Correct: no other 30-mers are present • Compact: no 30-mer is present more than once
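The three properties can be checked mechanically by comparing k-mer sets. The Python sketch below uses the toy sequences from the SBH-graph slides (ACDEFGI, ACDEFACG, DEFGEFGI, compressed to ACDEFGEFGI and DEFACG) with k = 4 instead of 30. The value k = 4 is an assumption that happens to make the toy example work, and `c3_report` is a hypothetical name.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer occurrence across all sequences."""
    return Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))


def c3_report(original, compressed, k):
    """Check the complete / correct / compact properties of a compressed database."""
    want = set(kmer_counts(original, k))
    got = kmer_counts(compressed, k)
    return {
        "complete": want <= set(got),                     # every original k-mer is present
        "correct": set(got) <= want,                      # no k-mer was invented
        "compact": all(n == 1 for n in got.values()),     # no k-mer is repeated
    }


if __name__ == "__main__":
    original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    print(c3_report(original, ["ACDEFGEFGI", "DEFACG"], k=4))
    print(c3_report(original, original, k=4))  # the input itself is not compact
```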

SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs • Sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs • All k-mers represented by an edge have the same count [Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2.]

Sequence Databases & CSBH-graphs • Complete: all edges are on some path • Correct: output path sequence only • Compact: no edge is used more than once • C3 Path Set uses all edges exactly once.

Sequence Databases & CSBH-graphs • Use each edge exactly once: ACDEFGEFGI, DEFACG
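One way to find paths that use each edge exactly once is an Eulerian-trail decomposition of the uncompressed k-mer graph. The sketch below is not claimed to be the method behind the talk. It treats each distinct k-mer as an edge between (k-1)-mers, adds dummy edges so every node is balanced, runs Hierholzer's algorithm, and cuts the resulting circuits at the dummy edges. As before, k = 4 for the toy sequences is an assumption, and all names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def spell(path):
    """Merge consecutive overlapping k-mers back into one sequence."""
    return path[0] + "".join(km[-1] for km in path[1:])


def c3_path_set(sequences, k):
    # Each distinct k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    edges = sorted({km for s in sequences for km in kmers(s, k)})
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for km in edges:
        out_edges[km[:-1]].append((km, km[1:]))
        balance[km[:-1]] += 1
        balance[km[1:]] -= 1
    # Dummy edges (label None) from path ends back to path starts balance every node.
    starts = [n for n in sorted(balance) for _ in range(max(balance[n], 0))]
    ends = [n for n in sorted(balance) for _ in range(max(-balance[n], 0))]
    for end, start in zip(ends, starts):
        out_edges[end].append((None, start))

    paths = []
    for root in sorted(out_edges):
        if not out_edges[root]:
            continue  # edges already consumed by an earlier circuit
        # Hierholzer's algorithm, iterative: stack of (node, label of arriving edge).
        stack, circuit = [(root, None)], []
        while stack:
            node = stack[-1][0]
            if out_edges[node]:
                label, nxt = out_edges[node].pop()
                stack.append((nxt, label))
            else:
                circuit.append(stack.pop())
        labels = [label for _, label in reversed(circuit)][1:]  # drop root placeholder
        if None in labels:
            # Rotate the closed circuit so it ends on a dummy edge, then cut at dummies.
            cut = labels.index(None) + 1
            labels = labels[cut:] + labels[:cut]
        current = []
        for label in labels:
            if label is None:
                paths.append(spell(current))
                current = []
            else:
                current.append(label)
        if current:  # a pure cycle with no dummy edge
            paths.append(spell(current))
    return paths


if __name__ == "__main__":
    print(c3_path_set(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
```

The number of dummy edges equals the number of paths that start at nodes with excess out-degree, so this construction does not emit more paths than the graph's imbalance requires, apart from one extra path for each isolated cycle.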

Sequence Databases & CSBH-graphs • All k-mers that occur at least twice: ACDEFGI [Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2.]
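Restricting to k-mers seen at least twice amounts to a count filter. A minimal Python sketch on the toy sequences follows, again assuming k = 4 and assuming that every occurrence counts, including repeats within one sequence. The slide does not say whether repeats inside a single EST should count. The function name is hypothetical.

```python
from collections import Counter


def repeated_kmers(sequences, k, min_count=2):
    """Set of k-mers that occur at least min_count times across all sequences."""
    counts = Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))
    return {km for km, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    kept = repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
    print(sorted(kept))
```

For the toy input the surviving 4-mers are ACDE, CDEF, DEFG, and EFGI, which chain together into the single sequence ACDEFGI shown on the slide.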

Relative Search Time [Figure: relative search time for SP, UP, IPI-H, SP-VS, UP-VS.]

More Sensitive Peptide ID • Significances, p-values, Expect values: normalize for number of trials • BLAST: size of sequence database • Mascot etc.: number of peptides scored against each spectrum • Redundant peptide sequences artificially increase the number of trials • Trials are not independent! • Less redundancy results in a better significance estimate
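One common reading of "normalize for number of trials", offered here as an assumption and not as the scoring rule of any particular search engine: if a match has p-value p against a single random candidate and N candidates are scored, the expected number of chance matches at least that good is roughly N times p. The numbers below are hypothetical and only illustrate how duplicated candidates inflate the estimate.

```latex
E \approx N \cdot p

% Hypothetical example: p = 10^{-6}, and every distinct peptide appears twice.
% Redundant database:    N = 2 \times 10^{5} \;\Rightarrow\; E \approx 0.2
% Compressed database:   N = 1 \times 10^{5} \;\Rightarrow\; E \approx 0.1
```

The duplicated candidates are not independent trials, so the larger E overstates how likely a chance match is.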

More Sensitive Peptide ID

Human Peptide Sequences • EST enumeration (30-mers must occur at least twice) • EST corrections • Genscan exons • Uncompressed size: ~ 4.5Gb • Compressed size: ~ 263Mb

Infrastructure • X!Tandem open source search engine • Configured to search aggressive peptide enumeration (human) • Web interface for browsing results • Integrated with Condor • Results stored in MySQL database • Over 3 million publicly available MS/MS spectra from human samples

“Novel” Peptide

Ongoing work • Integrate SNPs and exon pairs • Get (lots) more spectra! • Solve the reverse mapping problem • Where did this peptide come from? • What protein does this peptide represent?

Thanks • Informatics Research @ ABI & Celera • Ross Lippert, Clark Mobarry, Bjarni Halldorsson • UMIACS @ University of Maryland, CP • V.S. Subrahmanian, Fritz McCall, Doan Pham • Fenselau Lab @ UM, CP • CS @ University of Maryland, CP • Chau-Wen Tseng, Xue Wu
