Big Data & the CPTAC Data Portal

Big Data & the CPTAC Data Portal • Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum • Georgetown University & ESAC • Nathan Edwards, Georgetown University Medical Center

NCI: CPTAC • Clinical Proteomic Tumor Analysis Consortium (CPTAC) • Comprehensive study of genomically characterized (TCGA) cancer biospecimens by bottom-up mass-spectrometry-based proteomics workflows • Follows Clinical Proteomics Technology Assessment Consortium (CPTAC Phase I)

CPTAC Data Portal • All data is publicly released… • …subject to responsible use guidelines • Consortium has 15 months to publish first global analysis • Data available in the meantime. http://grg.tn/cptac

Proteomics Workflows • Modern Instrumentation: • Orbitrap, Q-Exactive, AB 5600 • Protein Enrichment: • Phosphoproteins, Glycoproteins • Quantitation: • Label-free, precursor area or spectral count; or iTRAQ • Peptide Fractionation: • Deep sampling of less abundant peptides

Available Data • Mass Spectrometry Data • Raw and mzML formats • Experimental Design Meta-Data • Link to TCGA, clinical context • Analytical Protocol Documents • Sample prep, chromatography, MS • Peptide-Spectrum-Match Data • CPTAC Common analysis pipeline (NIST) • MS-GF+ based, TSV and mzIdentML formats • Gene inference and quantitation
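As a rough illustration of working with the tab-separated PSM files, the sketch below keeps only rows at or below a 1% q-value. The column names `Peptide` and `QValue` and the sample rows are assumptions made for this example; the actual headers in the CPTAC TSV files may differ.

```python
import csv
import io

def filter_psms(handle, qvalue_column="QValue", max_qvalue=0.01):
    """Yield PSM rows (as dicts) whose q-value is at or below max_qvalue."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        try:
            qvalue = float(row[qvalue_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or unparseable q-value
        if qvalue <= max_qvalue:
            yield row

# Made-up rows standing in for a real PSM file (hypothetical column names).
example = io.StringIO(
    "Peptide\tQValue\n"
    "PEPTIDEK\t0.001\n"
    "SAMPLER\t0.2\n"
    "ANOTHERK\t0.01\n"
)
kept = [row["Peptide"] for row in filter_psms(example)]
print(kept)
```

Because the function is a generator over an open file handle, it reads one row at a time and never holds the full table in memory.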

CPTAC/TCGA Colorectal Cancer (Proteome) • Vanderbilt PCC (PI: Liebler), Embargo: 12/2014 • 95 TCGA samples, 15 fractions / sample • Label-free spectral count / precursor XIC quant. • Orbitrap Velos; high-accuracy precursor • 1425 spectra files ~ 600 GB / ~ 129 GB (mzML.gz) • Spectra: ~ 18M; ~ 13M MS/MS • 4,644,354 PSMs at 1% MS-GF+ q-value • 10,258 genes at 0.01% gene FDR, 9047 groups

CPTAC/TCGA Breast Cancer (Proteome) • Broad PCC (PI: Carr), Embargo: 5/2015 • 108 TCGA samples, 25 fractions / sample-mixture • Proteome; iTRAQ quantitation; 3 samples vs POOL • Q-Exactive; high-accuracy precursor • 900 spectra files ~ 1 TB / ~ 280 GB (mzML.gz) • Spectra: ~ 41M; ~ 32M MS/MS • 13,764,193 PSMs at 1% MS-GF+ q-value • 13,716 genes at 0.01% gene FDR, 10,007 groups

CPTAC/TCGA Breast Cancer (Phosphoproteome) • Broad PCC (PI: Carr), Embargo: 5/2015 • 108 TCGA samples, 13 fractions / sample-mixture • IMAC enriched; iTRAQ quant.; 3 samp. vs POOL • Q-Exactive; high-accuracy precursor • 468 spectra files ~ 600 GB / ~ 130 GB (mzML.gz) • Spectra: ~ 16M; ~ 10M MS/MS • 3,355,721 PSMs at 1% MS-GF+ q-value • 10,352 genes at 0.01% gene FDR, 8875 groups
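The figures on the three dataset slides can be cross-checked with a little arithmetic. The sketch below recomputes the expected file counts from the experimental design, the raw-to-mzML.gz size ratio, and the number of PSMs per MS/MS spectrum. It assumes that the iTRAQ studies combine 3 samples per mixture (so 108 samples give 36 mixtures), that "1 TB" means 1000 gigabytes, and that the rounded MS/MS counts are good enough for a rough ratio; none of these derived numbers appear on the slides themselves.

```python
# Figures copied from the three dataset slides. MS/MS counts are the rounded
# values shown there, and 1 TB is taken as 1000 GB (an assumption).
datasets = {
    "Colorectal proteome": {
        "mixtures": 95, "fractions": 15,
        "raw_gb": 600, "mzml_gb": 129, "msms": 13e6, "psms": 4_644_354,
    },
    "Breast proteome": {
        "mixtures": 108 // 3, "fractions": 25,  # assumes 3 samples per iTRAQ mixture
        "raw_gb": 1000, "mzml_gb": 280, "msms": 32e6, "psms": 13_764_193,
    },
    "Breast phosphoproteome": {
        "mixtures": 108 // 3, "fractions": 13,  # assumes 3 samples per iTRAQ mixture
        "raw_gb": 600, "mzml_gb": 130, "msms": 10e6, "psms": 3_355_721,
    },
}

summary = {}
for name, d in datasets.items():
    expected_files = d["mixtures"] * d["fractions"]   # one spectra file per fraction
    size_ratio = d["raw_gb"] / d["mzml_gb"]           # raw size relative to mzML.gz
    psms_per_msms = d["psms"] / d["msms"]             # rough identification rate
    summary[name] = (expected_files, size_ratio, psms_per_msms)
    print(f"{name}: {expected_files} files, "
          f"{size_ratio:.1f}x smaller as mzML.gz, "
          f"{psms_per_msms:.0%} PSMs per MS/MS spectrum")
```

Under these assumptions the computed file counts match the 1425, 900, and 468 spectra files quoted on the slides, and roughly a third to just under half of the MS/MS spectra yield a PSM at the 1% q-value threshold.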

CPTAC Data Center Lessons • Files on disk are "easy" • Meta-data, experimental design, semantics HARD • File naming conventions seem trivial but do it • Backup, access, redundancy is IT and costs $$ • Advanced network transfer tools really work! • Aspera provides order of magnitude improvement • Scriptable upload/download/navigation matters! • (Spectra) file integrity is really important • Platform agnostic chain of custody from lab • mzML conversion verifies RAW file semantics • mzML embeds checksums, platform agnostic • mzML semantic compression (peaks only)
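One simple way to approach the file-integrity and chain-of-custody points above is a checksum manifest recorded at the lab and re-verified after every transfer. The sketch below is a generic illustration, not the portal's actual mechanism: the manifest layout, the function names, and the choice of SHA-1 are all assumptions made for this example.

```python
import hashlib
import os
import tempfile

def file_checksum(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest, algorithm="sha1"):
    """Return the paths that are missing or whose checksum no longer matches.

    manifest is a hypothetical dict mapping file path -> expected hex digest.
    """
    return [
        path for path, expected in manifest.items()
        if not os.path.isfile(path) or file_checksum(path, algorithm) != expected
    ]

# Demonstration on a throwaway file standing in for a spectra file.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "example.mzML")
    with open(path, "wb") as handle:
        handle.write(b"abc")
    manifest = {path: file_checksum(path)}   # recorded "at the lab"
    intact = verify_manifest(manifest)       # checked after transfer
    with open(path, "ab") as handle:
        handle.write(b"!")                   # simulate corruption in transit
    damaged = verify_manifest(manifest)

print("failures before corruption:", intact)
print("failures after corruption:", damaged)
```

Reading in chunks keeps memory use constant regardless of file size, which matters for multi-gigabyte spectra files.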

CPTAC TCGA Data Lessons • Monolithic computation no longer sufficient! • Many datafiles, distributed computation, out-of-core • PSMs are the new RAW data? (~ NGS reads) • Many PSMs / gene; # Spectra >> # Sequences! • "Poor" acquisitions are not uncommon • Need fast, easy QC to permit re-analysis • Other issues: • Is identifiability information leaking (germline mutations)? • Protein inference for human/mouse xenograft spectra? • How to really handle isoforms? • Proteome coverage – how to estimate?
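The "many datafiles, distributed computation, out-of-core" point can be sketched as a count-then-merge pattern: each PSM file is streamed row by row into a small per-file tally, and the tallies are combined afterwards. The per-file step could run on separate workers. The column name `Gene` and the sample rows are assumptions made for this example, and a plain PSM count per gene is used here only as a stand-in for real spectral-count quantitation.

```python
import csv
import io
from collections import Counter

def count_psms(handle, key_column="Gene"):
    """Stream one PSM table row by row, counting rows per key value."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[key_column]] += 1
    return counts

def merge_counts(partial_counts):
    """Combine per-file tallies into one overall tally."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Two made-up PSM tables standing in for per-fraction files.
file_a = io.StringIO("Peptide\tGene\nAAK\tTP53\nCCR\tTP53\nDDK\tKRAS\n")
file_b = io.StringIO("Peptide\tGene\nEEK\tKRAS\nFFR\tEGFR\n")

partials = [count_psms(handle) for handle in (file_a, file_b)]
merged = merge_counts(partials)
print(merged.most_common())
```

Only the tallies are held in memory, never the PSM rows, so the working set grows with the number of genes rather than the number of spectra. The same per-file tallies, keyed by spectra file instead of gene, would also give a quick QC signal for spotting unusually poor acquisitions.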

Heresy: PSMs as NGS reads • Need O(n) spectra → good PSMs • We work too hard to identify all spectra, too stringent? • Progressive, Pareto, PTAS identification? • Output as genome alignments, BAM files? • Volume dominates noise and loss of detail: • e.g. Twitter; indirect observation of splicing, PTMs? • Models of distributed computation • Distributed data and/or computation • Failure, interruption tolerant computing • Heterogeneous computing resources • PSM search engine API for mining (social, reward?)
