
Improving the Sensitivity of Peptide Identification for Genome Annotation

Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.


Presentation Transcript


  1. Improving the Sensitivity of Peptide Identification for Genome Annotation • Nathan Edwards • Department of Biochemistry and Molecular & Cellular Biology • Georgetown University Medical Center

  2. Why Tandem Mass Spectrometry? • MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. • Key concepts: • Spectrum acquisition is unbiased • Direct observation of amino-acid sequence • Sensitive to small sequence variations

  3. Mass Spectrometer • Sample • Ionizer: MALDI, Electrospray Ionization (ESI) • Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion-Trap • Detector: Electron Multiplier (EM)

  4. Mass Spectrum

  5. Mass is fundamental

  6. Sample Preparation for MS/MS • Enzymatic Digest and Fractionation

  7. Single Stage MS

  8. Tandem Mass Spectrometry (MS/MS) • Precursor selection

  9. Tandem Mass Spectrometry (MS/MS) • Precursor selection + collision-induced dissociation (CID)

  10. Unannotated Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

  11. Unannotated Splice Isoform

  12. Unannotated Splice Isoform

  13. Splice Isoform Anomaly • Human erythroleukemia K562 cell-line • Depth of coverage study • Resing et al. Anal. Chem. 2004. • Peptide Atlas A8_IP • SULT1A2 gene: • Sulfotransferase family, cytosolic, 1A • 2 ESTs, 1 mRNA • mRNA from lung, small-cell carcinoma sample • Single (significant) peptide identification • Five agreeing search engines • PepArML FDR < 1%. • All source engines have non-significant E-values

  14. Splice Isoform Anomaly

  15. Splice Isoform Anomaly

  16. Translation start-site correction • Halobacterium sp. NRC-1 • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins • Goo, et al. MCP 2003. • GdhA1 gene: • Glutamate dehydrogenase A1 • Multiple significant peptide identifications • Observed start is consistent with Glimmer 3.0 prediction(s)

  17. Halobacterium sp. NRC-1 ORF: GdhA1 • K-score E-value vs PepArML @ 10% FDR • Many peptides inconsistent with annotated translation start site of NP_279651

  18. Translation start-site correction

  19. Phyloproteomics • Tandem mass-spectra of proteins (top-down) • High-accuracy instrument (Orbitrap, UMD Core) • Proteins from unsequenced bacteria matching identical proteins in related organisms • Demonstration using Y. rohdei.

  20. Phyloproteomics

  21. Phyloproteomics • Protein Sequence • 16S-rRNA Sequence • phylogeny.fr – "One-Click"

  22. Shared "Biomarker" Proteins

  23. Phyloproteomics • Recent extension to highly homologous proteins in related organisms • Merely require N- and/or C-terminus in common • Broadens applicability considerably • Phyloproteomic trees for E. herbicola and E. cloacae, neither sequenced. • New paradigm for phylogenetic analysis?

  24. Lost peptide identifications • Missing from the sequence database • Search engine strengths, weaknesses, quirks • Poor score or statistical significance • Thorough search takes too long

  25. Searching under the street-light… • Tandem mass spectrometry doesn’t discriminate against novel peptides... but protein sequence databases do! • Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

  26. Peptide Sequence Databases • All amino-acid 30-mers, no redundancy • From ESTs, Proteins, mRNAs • 30-40 fold reduction in size and search time • Formatted as a FASTA sequence database • One entry per gene/cluster.
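A minimal sketch of the redundancy-removal idea on this slide, not the talk's actual construction: within each gene/cluster, a sequence is kept only if it contributes at least one 30-mer that has not been seen yet, and the survivors are written as FASTA. The greedy longest-first order, the function names, and the `cluster.n` record naming are assumptions made for illustration. In particular, this sketch emits one record per retained sequence rather than one entry per gene/cluster.

```python
from typing import Dict, List, Set

K = 30  # peptide window length, taken from the slide


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reduce_cluster(seqs: List[str], k: int = K) -> List[str]:
    """Greedily keep only the sequences that add a k-mer not yet covered.

    Assumption: longest-first greedy order; this is a heuristic, not a
    minimal cover.
    """
    seen: Set[str] = set()
    kept: List[str] = []
    for seq in sorted(seqs, key=len, reverse=True):
        new = kmers(seq, k) - seen
        if new:
            kept.append(seq)
            seen |= new
    return kept


def to_fasta(clusters: Dict[str, List[str]], k: int = K) -> str:
    """Write retained sequences as FASTA records named '<cluster>.<n>' (hypothetical naming)."""
    lines: List[str] = []
    for cid in sorted(clusters):
        for n, seq in enumerate(reduce_cluster(clusters[cid], k), start=1):
            lines.append(f">{cid}.{n}")
            lines.append(seq)
    return "\n".join(lines) + "\n"
```

For a quick check with toy data, a small window such as `k=3` makes the effect visible: a sequence whose 3-mers are all contained in an earlier, longer sequence is dropped.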

  27. We can observe evidence for… • Known coding SNPs • Unannotated coding mutations • Alternate splicing isoforms • Alternate/Incorrect translation start-sites • Microexons • Alternate/Incorrect translation frames …though it must be treated thoughtfully.

  28. PeptideMapper Web Service I’m Feeling Lucky

  29. PeptideMapper Web Service I’m Feeling Lucky

  30. PeptideMapper Web Service I’m Feeling Lucky

  31. PeptideMapper Web Service • Suffix-tree index on peptide sequence database • Fast peptide to gene/cluster mapping • “Compression” makes this feasible • Peptide alignment with cluster evidence • Amino-acid or nucleotide; exact & near-exact • Genomic-loci mapping via • UCSC “known-gene” transcripts, and • Predetermined, embedded genomic coordinates
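The exact peptide-to-cluster lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the PeptideMapper implementation: a sorted list of suffixes searched with binary search stands in for the suffix tree, the class and method names are hypothetical, and near-exact matching and genomic-loci mapping are not covered.

```python
import bisect
from typing import Dict, List, Set, Tuple


class PeptideIndex:
    """Toy exact-match index: sorted suffixes plus binary search.

    Assumption: a naive suffix list is used in place of a suffix tree; it
    answers the same exact-substring query but uses far more memory.
    """

    def __init__(self, clusters: Dict[str, str]) -> None:
        suffixes: List[Tuple[str, str]] = []
        for cid, seq in clusters.items():
            for i in range(len(seq)):
                suffixes.append((seq[i:], cid))
        suffixes.sort()
        self._keys = [s for s, _ in suffixes]
        self._ids = [c for _, c in suffixes]

    def lookup(self, peptide: str) -> Set[str]:
        """Return the ids of all clusters whose sequence contains peptide."""
        hits: Set[str] = set()
        # Every suffix that starts with the peptide sorts at or after it,
        # and all such suffixes are contiguous in the sorted list.
        pos = bisect.bisect_left(self._keys, peptide)
        while pos < len(self._keys) and self._keys[pos].startswith(peptide):
            hits.add(self._ids[pos])
            pos += 1
        return hits
```

Building the index once and answering many peptide queries against it is the point of the design; the cluster ids and sequences passed in would come from the peptide sequence database.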

  32. Comparison of search engine results • No single score is comprehensive • Search engines disagree • Many spectra lack confident peptide assignment • Figure: overlap of SEQUEST, Mascot, and X! Tandem identifications (region percentages: 28%, 14%, 14%, 38%, 1%, 3%, 2%) • Searle et al. JPR 7(1), 2008

  33. Combining search engine results – harder than it looks! • Consensus boosts confidence, but... • How to assess statistical significance? • Gain specificity, but lose sensitivity! • Incorrect identifications are correlated too! • How to handle weak identifications? • Consensus vs disagreement vs abstention • Threshold at some significance? • We apply unsupervised machine-learning.... • Lots of related work unified in a single framework.
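To make the consensus vs disagreement vs abstention distinction concrete, here is a small tally for one spectrum. It is a plain vote count, offered only as a baseline illustration; it is not the unsupervised machine-learning combiner the slide refers to, and the function name and input layout (engine name mapped to a peptide, or `None` when the engine reports nothing) are assumptions.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def tally(ids: Dict[str, Optional[str]]) -> Tuple[Optional[str], int, int, int]:
    """Summarize one spectrum's identifications across search engines.

    Returns (top peptide, engines agreeing with it, engines reporting a
    different peptide, engines abstaining).
    """
    votes = Counter(p for p in ids.values() if p is not None)
    abstain = sum(1 for p in ids.values() if p is None)
    if not votes:
        return None, 0, 0, abstain
    # Ties are broken by peptide string only to keep the result deterministic.
    peptide, agree = max(votes.items(), key=lambda kv: (kv[1], kv[0]))
    disagree = sum(votes.values()) - agree
    return peptide, agree, disagree, abstain
```

A vote count like this shows why thresholding on agreement alone gains specificity but loses sensitivity: a spectrum identified confidently by a single engine scores no better than a weak, isolated guess.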

  34. Supervised Learning

  35. Unsupervised Learning

  36. Peptide Atlas A8_IP LTQ Dataset

  37. Running many search engines Search engine configuration can be difficult: • Correct spectral format • Search parameter files and command-line • Pre-processed sequence databases. • Tracking spectrum identifiers • Extracting peptide identifications, especially modifications and protein identifiers

  38. Peptide Identification Meta-Search • Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch • Automatic decoy searches • Automatic spectrum file "chunking" • Automatic scheduling • Serial, Multi-Processor, Cluster, Grid
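One common use of decoy searches is to estimate the false discovery rate at a score threshold as the ratio of decoy hits to target hits above it. The sketch below shows that estimate under stated assumptions: higher scores are better, each identification is a hypothetical `(score, is_decoy)` pair, and target and decoy hits are pooled in one list. The slides do not say which estimator the meta-search tools use.

```python
from typing import List, Tuple


def decoy_fdr(psms: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR at a score threshold as (#decoy hits) / (#target hits).

    Assumptions: higher score is better; returns 0.0 when no target hit
    passes the threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Scanning thresholds from strict to permissive and keeping the most permissive one whose estimate stays under a target, such as the 1% and 10% FDR levels quoted on earlier slides, gives a score cutoff.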

  39. PepArML Meta-Search Engine • X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core) • NSF TeraGrid 1000+ CPUs • Heterogeneous compute resources • X!Tandem, KScore, OMSSA, MyriMatch • Secure communication • Edwards Lab Scheduler & 48+ CPUs • Scales easily to 250+ simultaneous searches • Single, simple search request • UMIACS 250+ CPUs

  40. PepArML Meta-Search Engine • X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core) • NSF TeraGrid 1000+ CPUs • Heterogeneous compute resources • X!Tandem, KScore, OMSSA, MyriMatch • Secure communication • Edwards Lab Scheduler & 80+ CPUs • Scales easily to 250+ simultaneous searches • Single, simple search request

  41. PepArML Meta-Search Engine • Heterogeneous compute resources • NSF TeraGrid 1000+ CPUs • Edwards Lab Scheduler & 48+ CPUs • Secure communication • Simple search request • UMIACS 250+ CPUs

  42. PepArML Meta-Search Engine • Heterogeneous compute resources • NSF TeraGrid 1000+ CPUs • Edwards Lab Scheduler & 48+ CPUs • Secure communication • Simple search request • UMIACS 250+ CPUs

  43. Peptide Identification Grid-Enabled Meta-Search • Access to high-performance computing resources for the proteomics community • NSF TeraGrid Community Portal • University/Institute HPC clusters • Individual lab compute resources • Contribute cycles to the community and get access to others’ cycles in return. • Centralized scheduler • Compute capacity can still be exclusive, or prioritized. • Compute client plays well with HPC grid schedulers.

  44. PepArML Meta-Search Engine X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core). Edwards Lab Scheduler & 80+ CPUs NSF TeraGrid 1000+ CPUs X!Tandem, KScore, OMSSA, MyriMatch. X!Tandem, KScore, OMSSA. UMIACS 250+ CPUs

  45. PepArML Meta-Search Engine X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core). Edwards Lab Scheduler & 80+ CPUs NSF TeraGrid 1000+ CPUs X!Tandem, KScore, OMSSA, MyriMatch. X!Tandem, KScore, OMSSA. UMIACS 250+ CPUs

  46. PepArML Meta-Search Engine X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core). Edwards Lab Scheduler & 80+ CPUs NSF TeraGrid 1000+ CPUs X!Tandem, KScore, OMSSA, MyriMatch. X!Tandem, KScore, OMSSA. UMIACS 250+ CPUs UMD Proteomics Core Scheduler & 2 CPUs

  47. PepArML Meta-Search Engine X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core). Edwards Lab Scheduler & 80+ CPUs NSF TeraGrid 1000+ CPUs X!Tandem, KScore, OMSSA, MyriMatch. X!Tandem, KScore, OMSSA. UMIACS 250+ CPUs UMD Proteomics Core Scheduler & 2 CPUs

  48. Conclusions • Improve the scope and sensitivity of peptide identification for genome annotation, using • Exhaustive peptide sequence databases • Machine-learning for combining search engine results • Meta-search tools to maximize consensus • Grid-computing for thorough search • http://edwardslab.bmcb.georgetown.edu

  49. Acknowledgements • Dr. Catherine Fenselau & students • University of Maryland Biochemistry • Dr. Yan Wang • University of Maryland Proteomics Core • Dr. Art Delcher • University of Maryland CBCB • Dr. Chau-Wen Tseng & Dr. Xue Wu • University of Maryland Computer Science • Funding: NIH/NCI
