
Big Data & the CPTAC Data Portal

Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum. Georgetown University & ESAC. Nathan Edwards, Georgetown University Medical Center. NCI: Clinical Proteomic Tumor Analysis Consortium (CPTAC).


Presentation Transcript


  1. Big Data & the CPTAC Data Portal
     Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum
     Georgetown University & ESAC
     Nathan Edwards, Georgetown University Medical Center

  2. NCI: CPTAC
     • Clinical Proteomic Tumor Analysis Consortium (CPTAC)
     • Comprehensive study of genomically characterized (TCGA) cancer biospecimens by bottom-up mass-spectrometry-based proteomics workflows
     • Follows Clinical Proteomics Technology Assessment Consortium (CPTAC Phase I)

  3. NCI: CPTAC

  4. CPTAC Data Portal
     • All data is publicly released…
     • …subject to responsible use guidelines
     • Consortium has 15 months to publish first global analysis
     • Data available in the meantime
     http://grg.tn/cptac

  5. Proteomics Workflows
     • Modern Instrumentation: Orbitrap, Q-Exactive, AB 5600
     • Protein Enrichment: Phosphoproteins, Glycoproteins
     • Quantitation: Label-free, precursor area or spectral count; or iTRAQ
     • Peptide Fractionation: Deep sampling of less abundant peptides

  6. Available Data
     • Mass Spectrometry Data
       • Raw and mzML formats
     • Experimental Design Meta-Data
       • Link to TCGA, clinical context
     • Analytical Protocol Documents
       • Sample prep, chromatography, MS
     • Peptide-Spectrum-Match Data
       • CPTAC Common analysis pipeline (NIST)
       • MS-GF+ based, TSV and mzIdentML formats
       • Gene inference and quantitation

  7. CPTAC/TCGA Colorectal Cancer (Proteome)
     • Vanderbilt PCC (PI: Liebler), Embargo: 12/2014
     • 95 TCGA samples, 15 fractions / sample
     • Label-free spectral count / precursor XIC quant.
     • Orbitrap Velos; high-accuracy precursor
     • 1425 spectra files ~ 600 GB / ~ 129 GB (mzML.gz)
     • Spectra: ~ 18M; ~ 13M MS/MS
     • 4,644,354 PSMs at 1% MS-GF+ q-value
     • 10,258 genes at 0.01% gene FDR, 9047 groups
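Spectrum counts like the "~ 18M; ~ 13M MS/MS" above can be checked directly against the mzML.gz files. The following is a minimal sketch, assuming each `spectrum` element carries its MS level as a direct-child `cvParam` with accession `MS:1000511` (the PSI-MS term for "ms level"); the function name and the streaming approach are illustrative and are not taken from the CPTAC pipeline.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS controlled vocabulary term "ms level"


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://psi.hupo.org/ms/mzml}spectrum' -> 'spectrum'."""
    return tag.rsplit("}", 1)[-1]


def count_spectra_by_ms_level(path):
    """Stream an .mzML or .mzML.gz file and count spectra per MS level.

    Levels are returned as the strings found in the file ("1", "2", ...);
    spectra without an ms-level cvParam are counted under "unknown".
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rb") as handle:
        # iterparse yields each element once it is complete, so the file
        # is never held in memory as one full tree of peak data.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if local_name(elem.tag) != "spectrum":
                continue
            level = "unknown"
            for child in elem:
                if (local_name(child.tag) == "cvParam"
                        and child.get("accession") == MS_LEVEL_ACCESSION):
                    level = child.get("value")
                    break
            counts[level] += 1
            elem.clear()  # drop the peak arrays of the spectrum just counted
    return counts
```

Summing the per-file counters over all files of a dataset gives the dataset-level totals; a file whose MS/MS count is far below its neighbours is a candidate for a closer look.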

  8. CPTAC/TCGA Breast Cancer (Proteome)
     • Broad PCC (PI: Carr), Embargo: 5/2015
     • 108 TCGA samples, 25 fractions / sample-mixture
     • Proteome; iTRAQ quantitation; 3 samples vs POOL
     • Q-Exactive; high-accuracy precursor
     • 900 spectra files ~ 1 TB / ~ 280 GB (mzML.gz)
     • Spectra: ~ 41M; ~ 32M MS/MS
     • 13,764,193 PSMs at 1% MS-GF+ q-value
     • 13,716 genes at 0.01% gene FDR, 10,007 groups

  9. CPTAC/TCGA Breast Cancer (Phosphoproteome)
     • Broad PCC (PI: Carr), Embargo: 5/2015
     • 108 TCGA samples, 13 fractions / sample-mixture
     • IMAC enriched; iTRAQ quant.; 3 samples vs POOL
     • Q-Exactive; high-accuracy precursor
     • 468 spectra files ~ 600 GB / ~ 130 GB (mzML.gz)
     • Spectra: ~ 16M; ~ 10M MS/MS
     • 3,355,721 PSMs at 1% MS-GF+ q-value
     • 10,352 genes at 0.01% gene FDR, 8875 groups
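The three dataset slides can be compared with a little arithmetic. The sketch below uses only the figures quoted on slides 7 to 9, under some labeled assumptions: sizes are read as gigabytes, "1 Tb" is taken as 1000 GB, the first size on each slide is assumed to be the original vendor files, and each PSM is assumed to correspond to one distinct MS/MS spectrum. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    first_size_gb: float   # first size on the slide (assumed: original files)
    mzml_gz_gb: float      # size as mzML.gz
    msms_spectra: float    # approximate MS/MS spectrum count
    psms: int              # PSMs at 1% MS-GF+ q-value
    genes: int             # genes reported on the slide

    @property
    def size_ratio(self):
        """How many times smaller the mzML.gz files are than the first size."""
        return self.first_size_gb / self.mzml_gz_gb

    @property
    def psm_fraction(self):
        """Share of MS/MS spectra with a PSM (assumes one PSM per spectrum)."""
        return self.psms / self.msms_spectra

    @property
    def psms_per_gene(self):
        return self.psms / self.genes


# Figures copied from slides 7, 8 and 9; spectrum counts are the slides' "~" values.
DATASETS = (
    Dataset("Colorectal proteome", 600, 129, 13e6, 4_644_354, 10_258),
    Dataset("Breast proteome", 1000, 280, 32e6, 13_764_193, 13_716),
    Dataset("Breast phosphoproteome", 600, 130, 10e6, 3_355_721, 10_352),
)


def report(datasets=DATASETS):
    """Return one summary line per dataset."""
    return [
        f"{d.name}: {d.size_ratio:.1f}x smaller as mzML.gz, "
        f"{d.psm_fraction:.0%} of MS/MS spectra with a PSM, "
        f"{d.psms_per_gene:.0f} PSMs per gene"
        for d in datasets
    ]


if __name__ == "__main__":
    print("\n".join(report()))
```

Under these assumptions, roughly a third to two fifths of the MS/MS spectra yield a PSM at the stated threshold, and each gene is supported by hundreds to about a thousand PSMs.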

  10. CPTAC Data Center Lessons
     • Files on disk are "easy"
     • Meta-data, experimental design, semantics HARD
     • File naming conventions seem trivial but do it
     • Backup, access, redundancy is IT and costs $$
     • Advanced network transfer tools really work!
     • Aspera provides order of magnitude improvement
     • Scriptable upload/download/navigation matters!
     • (Spectra) file integrity is really important
     • Platform agnostic chain of custody from lab
     • mzML conversion verifies RAW file semantics
     • mzML embeds checksums, platform agnostic
     • mzML semantic compression (peaks only)
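The file-integrity and chain-of-custody points above can be illustrated with a checksum manifest that travels with the files from the lab to the data center. This is a minimal sketch, not the portal's actual mechanism: the manifest name `checksums.txt`, its "digest, two spaces, file name" layout, and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large spectra files never sit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory, manifest_name="checksums.txt"):
    """Write '<digest>  <file name>' lines for every file in the directory.

    The manifest name and layout are hypothetical, chosen for this sketch.
    """
    directory = Path(directory)
    lines = [
        f"{file_digest(path)}  {path.name}"
        for path in sorted(directory.iterdir())
        if path.is_file() and path.name != manifest_name
    ]
    manifest = directory / manifest_name
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


def verify_manifest(manifest):
    """Return the names of files that are missing or whose digest changed."""
    manifest = Path(manifest)
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        target = manifest.parent / name
        if not target.is_file() or file_digest(target) != expected:
            failures.append(name)
    return failures
```

Because the manifest is plain text and the hash comes from the standard library, the same check can run on the instrument workstation, after upload, and after download, regardless of platform.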

  11. CPTAC TCGA Data Lessons
     • Monolithic computation no longer sufficient!
     • Many data files, distributed computation, out-of-core
     • PSMs are the new RAW data? (~ NGS reads)
     • Many PSMs / gene; # Spectra >> # Sequences!
     • "Poor" acquisitions are not uncommon
     • Need fast, easy QC to permit re-analysis
     • Other issues:
       • Is identifiability information leaking (germline mutations)?
       • Protein inference for human/mouse xenograft spectra?
       • How to really handle isoforms?
       • Proteome coverage – how to estimate?
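The "many data files, out-of-core" point can be made concrete with a per-file map step followed by a merge. The sketch below counts PSMs per gene across many tab-separated PSM files without loading any file whole. The column names `gene` and `qvalue` are hypothetical placeholders, not the actual column names of the CPTAC PSM files, and the 1% threshold mirrors the q-value quoted on the dataset slides.

```python
import csv
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def count_psms_per_gene(path, gene_col="gene", qvalue_col="qvalue", max_qvalue=0.01):
    """Map step: stream one PSM table row by row and count passing PSMs per gene.

    gene_col and qvalue_col are assumed column names for illustration.
    """
    counts = Counter()
    with open_text(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row[qvalue_col]) <= max_qvalue:
                counts[row[gene_col]] += 1
    return counts


def merge_counts(partials):
    """Reduce step: add up the per-file counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_many(paths, map_fn=map):
    """Count over many files; map_fn can be swapped for a parallel map."""
    return merge_counts(map_fn(count_psms_per_gene, paths))
```

Since each file is processed independently, `map_fn` can be replaced by the `map` method of a `concurrent.futures.ProcessPoolExecutor`, or the map step can run on separate machines with only the small per-file counters shipped back for merging.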

  12. Heresy: PSMs as NGS reads
     • Need O(n) spectra → good PSMs
     • We work too hard to identify all spectra, too stringent?
     • Progressive, Pareto, PTAS identification?
     • Output as genome alignments, BAM files?
     • Volume dominates noise and loss of detail:
       • e.g. Twitter; indirect observation of splicing, PTMs?
     • Models of distributed computation
     • Distributed data and/or computation
     • Failure, interruption tolerant computing
     • Heterogeneous computing resources
     • PSM search engine API for mining (social, reward?)
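One simple reading of "failure, interruption tolerant computing" is a job runner that journals each finished file, so a restart skips completed work and retries only what failed. The following is a hedged sketch of that idea, not a design from the talk: `run_resumable`, the JSON-lines journal, and the `worker` callback are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path


def run_resumable(tasks, worker, journal_path):
    """Run worker(task) for each task, recording results in an append-only journal.

    Tasks must be strings and results JSON-serializable. Returns (done, failed):
    a dict of task -> result, and the list of tasks whose worker raised this run.
    """
    journal = Path(journal_path)
    done = {}
    text = journal.read_text() if journal.exists() else ""
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a half-written line left by an interrupted run
        done[record["task"]] = record["result"]

    failed = []
    with journal.open("a") as log:
        if text and not text.endswith("\n"):
            log.write("\n")  # keep new records off a truncated last line
        for task in tasks:
            if task in done:
                continue  # finished in an earlier run
            try:
                result = worker(task)
            except Exception:
                failed.append(task)  # left unjournaled, so the next run retries it
                continue
            log.write(json.dumps({"task": task, "result": result}) + "\n")
            log.flush()  # make the record visible before starting the next task
            done[task] = result
    return done, failed
```

Calling `run_resumable` again with the same task list and journal path after a crash only invokes `worker` on the tasks that have no journal entry, which also suits heterogeneous resources where individual nodes come and go.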
