
Improving the Sensitivity of Peptide Identification

by Meta-Search, Grid-Computing, and Machine-Learning. Nathan Edwards, Georgetown University Medical Center.



  1. Improving the Sensitivity of Peptide Identification by Meta-Search, Grid-Computing, and Machine-Learning. Nathan Edwards, Georgetown University Medical Center

  2. Searching under the street-light… • Tandem mass spectrometry doesn’t discriminate against novel peptides… but protein sequence databases do! • Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

  3. Lost peptide identifications • Missing from the sequence database • Search engine strengths, weaknesses, quirks • Poor score or statistical significance • Thorough search takes too long

  4. Lost peptide identifications
     • Missing from the sequence database
        • Build exhaustive peptide sequence databases
        • Build evidence for unannotated proteins and protein isoforms
     • Search engine strengths, weaknesses, quirks
        • Use multiple search engines and combine results
     • Poor score or statistical significance
        • Use search-engine consensus to boost confidence
        • Use machine-learning to distinguish true from false
     • Thorough search takes too long
        • Harness the power of heterogeneous computational grids

  5. Unannotated Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • Peptide Atlas raftflow, raftapr, raftaug • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

  6. Unannotated Splice Isoform

  7. Unannotated Splice Isoform

  8. Splice Isoform Anomaly • Human erythroleukemia K562 cell-line • Depth of coverage study • Resing et al. Anal. Chem. 2004. • Peptide Atlas A8_IP • SULT1A2 gene: • Sulfotransferase family, cytosolic, 1A • 2 ESTs, 1 mRNA • mRNA from lung, small-cell carcinoma sample • Single (significant) peptide identification • Five agreeing search engines • PepArML FDR < 1%. • All source engines have non-significant E-values

  9. Splice Isoform Anomaly

  10. Splice Isoform Anomaly

  11. Peptide Sequence Databases
      All amino-acid sequences of at most 30 amino-acids from:
      • IPI and all IPI constituent protein sequences
         • IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
      • SwissProt variants, conflicts, splices, and annotated signal peptide truncations
      • GenBank and RefSeq mRNA sequence
         • 3 frame translation
      • GenBank EST and HTC sequences
         • 6 frame translation and found in at least 2 sequences
      Grouped by Gene/UniGene cluster and compressed.
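The EST rule on this slide (six-frame translation, keep peptides found in at least 2 sequences) can be illustrated with a minimal Python sketch. The function names, the `min_support` parameter, and the choice to count each peptide at most once per sequence are assumptions made for illustration. The slides do not describe how the actual database build is implemented, and a real build over GenBank would need a far more compact data structure than a set of substrings.

```python
from collections import defaultdict

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


def peptides(protein, max_len=30):
    """All substrings of at most max_len residues that avoid stops and 'X'."""
    found = set()
    for orf in protein.split("*"):
        for start in range(len(orf)):
            for end in range(start + 1, min(start + max_len, len(orf)) + 1):
                pep = orf[start:end]
                if "X" in pep:
                    break  # every longer substring from this start also has X
                found.add(pep)
    return found


def supported_peptides(est_sequences, min_support=2, max_len=30):
    """Peptides seen in at least min_support distinct nucleotide sequences."""
    support = defaultdict(int)
    for dna in est_sequences:
        seen = set()
        for frame in six_frames(dna):
            seen |= peptides(frame, max_len)
        for pep in seen:  # count each peptide once per sequence
            support[pep] += 1
    return {pep for pep, n in support.items() if n >= min_support}


ests = ["ATGGCCAAGTAA", "ATGGCCAAGTGA", "ATGTTT"]  # toy sequences
kept = supported_peptides(ests)
```

In this toy input the peptide `MAK` is translated from two of the three sequences and is kept, while `MF` appears in only one and is dropped.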

  12. Peptide Sequence Databases • Formatted as a FASTA sequence database • Easy integration with search engines. • One entry per gene/cluster. • Automated rebuild every few months.

  13. Peptide evidence, in context • Statistically significant identified peptides can be misleading… • Isobaric amino-acid/PTM substitutions • Unsubstantiated peptide termini • Few b-ions or y-ions suggest “random” mass match • Single amino-acids on upstream or downstream exons • Peptides in 5’ UTR with no upstream Met • Need tools to quickly check the corroborating (genomic, transcript, SNP) evidence

  14. PeptideMapper Web Service (screenshot callouts) • Counts: by gene and evidence (EST, mRNA, Protein) • Sequences: accessions by gene, UniProt variants, nucleotide sequence & link to BLAT alignment • Genomic Loci: one-click projection onto the UCSC genome browser • Peptides with cSNPs too!

  15. PeptideMapper Web Service I’m Feeling Lucky

  16. PeptideMapper Web Service I’m Feeling Lucky

  17. Combining search engine results – harder than it looks! • Consensus boosts confidence, but… • How to assess statistical significance? • Gain specificity, but lose sensitivity! • Incorrect identifications are correlated too! • How to handle weak identifications? • Consensus vs disagreement vs abstention • Threshold at some significance? • We apply unsupervised machine-learning… • Lots of related work unified in a single framework.
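The distinction between consensus, disagreement, and abstention can be made concrete with a small tally. The sketch below is only a bookkeeping illustration and is not PepArML's machine-learning method, which the slides do not detail. The `consensus` function, the input layout, and the engine keys are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(results):
    """Tally agreement per spectrum.

    results maps engine name -> {spectrum_id: peptide or None}; None or a
    missing spectrum id means the engine abstained on that spectrum.
    """
    votes = defaultdict(Counter)
    for ids in results.values():
        for spectrum, peptide in ids.items():
            if peptide is not None:
                votes[spectrum][peptide] += 1

    summary = {}
    for spectrum, counter in votes.items():
        peptide, agree = counter.most_common(1)[0]
        voted = sum(counter.values())
        summary[spectrum] = {
            "peptide": peptide,                 # most frequently reported peptide
            "agree": agree,                     # engines reporting that peptide
            "disagree": voted - agree,          # engines reporting something else
            "abstain": len(results) - voted,    # engines reporting nothing
        }
    return summary


results = {  # toy identifications, not real data
    "mascot": {"s1": "PEPTIDEK", "s2": "AAAK"},
    "xtandem": {"s1": "PEPTIDEK", "s2": None},
    "omssa": {"s1": "PEPTLDEK"},
}
tally = consensus(results)
```

Counts like these only describe agreement; as the slide notes, incorrect identifications are correlated across engines, so a raw vote count is not a significance estimate on its own.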

  18. PepArML – Peptide identification Arbiter by Machine-Learning

  19. Peptide Atlas A8_IP LTQ Dataset

  20. Peptide Atlas Halobacterium Dataset

  21. Running many search engines
      Search engine configuration can be difficult:
      • Correct spectral format
      • Search parameter files and command-line
      • Pre-processed sequence databases
      • Tracking spectrum identifiers
      • Extracting peptide identifications, especially modifications and protein identifiers

  22. Peptide Identification Meta-Search Parameters
      • Instrument: Precursor Tolerance, Fragment Tolerance, Max. Charge
      • Sequence Database: Target and # of Decoys
      • Modification: Fixed/Variable, Amino-Acids, Position, Delta
      • Proteolytic Agent: Motif
      • Peptide Candidates: Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks
      • Search Engines: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch
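The "Target and # of Decoys" parameter suggests a target-decoy estimate of the false discovery rate. The slides do not say how the FDR figures are computed, so the sketch below shows the commonly used estimate under stated assumptions: each decoy database is the same size as the target, and higher scores are better. The function name and the `(score, is_decoy)` input layout are hypothetical.

```python
def estimate_fdr(psms, threshold, n_decoy_dbs=1):
    """Estimate FDR among target matches scoring at or above threshold.

    psms is an iterable of (score, is_decoy) pairs. Decoy hits are averaged
    over the number of decoy databases searched, then divided by target hits.
    """
    psms = list(psms)
    targets = sum(1 for score, is_decoy in psms
                  if not is_decoy and score >= threshold)
    decoys = sum(1 for score, is_decoy in psms
                 if is_decoy and score >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing can be a false discovery
    return (decoys / n_decoy_dbs) / targets


# Toy scores: 4 target and 2 decoy matches reach the threshold of 30.
psms = [(50, False), (45, False), (40, False), (35, False),
        (42, True), (30, True), (20, False)]
fdr = estimate_fdr(psms, threshold=30, n_decoy_dbs=2)
```

With two decoy databases, the two decoy hits count as one expected false match among four accepted targets.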

  23. Peptide Identification Meta-Search • Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch • Automatic decoy searches • Automatic spectrum file "chunking" • Automatic scheduling: Serial, Multi-Processor, Cluster, Grid
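Spectrum file "chunking" splits a large spectrum file into pieces that can be scheduled as separate search jobs. The sketch below assumes the spectra are in Mascot Generic Format (MGF), where each spectrum ends with an `END IONS` line. The slides do not name the file format or the chunk size, and `chunk_mgf` is a hypothetical helper.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Yield lists of lines holding at most spectra_per_chunk spectra each."""
    chunk, count = [], 0
    for line in lines:
        chunk.append(line)
        if line.strip() == "END IONS":  # one complete spectrum collected
            count += 1
            if count == spectra_per_chunk:
                yield chunk
                chunk, count = [], 0
    if count:  # final, partially filled chunk
        yield chunk


# Build a toy five-spectrum file in memory and split it into chunks of two.
lines = []
for i in range(5):
    lines += ["BEGIN IONS", f"TITLE=s{i}", "100.0 1.0", "END IONS"]
chunks = list(chunk_mgf(lines, 2))
sizes = [c.count("BEGIN IONS") for c in chunks]
```

Because spectra are never split across chunks, the `TITLE` of each spectrum travels with it, which helps with the spectrum-identifier tracking mentioned on slide 21.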

  24. PepArML Meta-Search Engine (diagram) • Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (scheduler & 48+ CPUs) • Search engine sets shown on the compute resources: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch; X!Tandem, KScore, OMSSA • Secure communication • Single, simple search request • Scales easily to 250+ simultaneous searches

  25. PepArML Meta-Search Engine (diagram) • Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (scheduler & 48+ CPUs) • Search engine sets shown on the compute resources: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch; X!Tandem, KScore, OMSSA • Secure communication • Single, simple search request • Scales easily to 250+ simultaneous searches

  26. PepArML Meta-Search Engine (diagram) • Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (scheduler & 48+ CPUs) • Secure communication • Simple search request

  27. PepArML Meta-Search Engine (diagram) • Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab (scheduler & 48+ CPUs) • Secure communication • Simple search request

  28. Peptide Atlas A8_IP LTQ Dataset • Tryptic search of Human ESTs using PepSeqDB • 107084 spectra (145 files) searched ~ 26 times: • Target + 2 decoys, 5 engines, 1+ vs 2+/3+ charge • 8685 search jobs • 25.7 days of CPU time. • 5211 TeraGrid TKO jobs < 2 hours • Using 143 different machines • Total elapsed time < 26 hours • Bottleneck: Mascot license (1 core, 4 CPUs)
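The figures on this slide imply a few derived quantities that help put the grid run in perspective. The short calculation below uses only the numbers stated on the slide; the variable names are illustrative, and because the elapsed time is given as an upper bound (< 26 hours), the parallelism value is a lower bound.

```python
cpu_hours = 25.7 * 24          # 25.7 days of CPU time
elapsed_hours = 26             # total elapsed time was under 26 hours
jobs = 8685                    # search jobs
spectra = 107084               # spectra in the dataset
searches_per_spectrum = 26     # each spectrum searched about 26 times

# Average number of CPUs kept busy over the run (lower bound).
effective_parallelism = cpu_hours / elapsed_hours

# Average CPU time per search job, in minutes.
avg_job_minutes = cpu_hours * 60 / jobs

# Approximate total number of spectrum-level searches.
spectrum_searches = spectra * searches_per_spectrum

print(f"{effective_parallelism:.1f} CPUs busy on average")
print(f"{avg_job_minutes:.1f} CPU-minutes per job")
print(f"{spectrum_searches} spectrum searches")
```

On these numbers the run kept roughly 24 CPUs busy on average, with jobs averaging a little over 4 CPU-minutes each, across about 2.8 million spectrum searches.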

  29. PepArML Meta-Search Engine • Access to high-performance computing resources for the proteomics community • NSF TeraGrid Community Portal • University/Institute HPC clusters • Individual lab compute resources • Contribute cycles to the community and get access to others’ cycles in return. • Centralized scheduler • Compute capacity can still be exclusive, or prioritized. • Compute client plays well with HPC grid schedulers.

  30. Conclusions Improve sensitivity of peptide identification, using • Exhaustive peptide sequence databases • Machine-learning for combining • Meta-search tools to maximize consensus • Grid-computing for thorough search Tools & cycles available to the community... http://edwardslab.bmcb.georgetown.edu

  31. Acknowledgements • Dr. Catherine Fenselau • University of Maryland Biochemistry • Dr. Rado Goldman • Georgetown University Medical Center • Dr. Chau-Wen Tseng & Dr. Xue Wu • University of Maryland Computer Science • Funding: NIH/NCI
