Active Data Biology

Samuel Payne (@OmicsPNNL), Pacific Northwest National Laboratory

Integrative Omics

Biology Repositories • Purpose: Hold data files and disseminate them to the public • Mass Spectrometry Data • Proteomics, Metabolomics • PRIDE-EBI: 200 TB • Sequencing Data • Genomics, Transcriptomics, etc. • SRA – 10^15 bases (through 2013). Discontinued because of space concerns • Imaging Data

Data Growth – EBI

Experience Sharing Data • ~500,000 mass spectrometry files • ~20-50x associated files • All data from 2000-2016 • 350 TB • Personal web server • List data with publications • Send upon request • Biodiversity Library • Shared through ProteomeXchange • 13 TB (zipped). Data from 112 bacteria and archaea • 6 months of data shepherding by 4 individuals to get it transferred • 70% of file downloads from public repository • Represents ~5% of our data.

Overcoming Big Data [Diagram labels: Raw Data, Identification, Hypotheses, Browse & Share; numbered steps 1, 2, 3]

Compliance is a losing venture • Sharing for compliance • Incomplete data • Incomplete meta-data • Low emotional investment • Sharing for collaboration • Invested in cooperation • Contains necessary and sufficient information • Better potential for reuse in general dissemination

Collaborative Infrastructure • Version Control Systems • Allow asynchronous work • ‘track changes’ and save all provenance • GitHub, Bitbucket, SVN, etc.
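
To make the "track changes and save all provenance" point concrete, here is a minimal Python sketch of the idea behind a version-controlled history. It is an illustration only, not how GitHub, Bitbucket, or SVN are implemented, and it is not code from this project. The function names (`commit`, `changed_files`, `verify`), the record layout, and the example authors and file names are all hypothetical. Each record stores who changed what and a pointer to the previous record, so the full chain can be replayed and checked later.

```python
import hashlib
import json


def _digest(record):
    """Hash a record deterministically (keys sorted so ordering never matters)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()


def commit(history, author, message, files):
    """Append a snapshot to history. `files` maps file name -> bytes content."""
    parent = history[-1]["id"] if history else None
    # Store one content hash per file rather than the data itself.
    file_hashes = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    record = {"parent": parent, "author": author, "message": message, "files": file_hashes}
    # The id covers the parent id, so editing an old record invalidates the chain.
    record_id = _digest(record)
    history.append({"id": record_id, **record})
    return record_id


def changed_files(history, index):
    """Names of files added or modified by the commit at a non-negative index."""
    current = history[index]["files"]
    previous = history[index - 1]["files"] if index > 0 else {}
    return sorted(name for name, h in current.items() if previous.get(name) != h)


def verify(history):
    """Check that every record matches its id and points at its predecessor."""
    parent = None
    for entry in history:
        record = {key: entry[key] for key in ("parent", "author", "message", "files")}
        if entry["parent"] != parent or entry["id"] != _digest(record):
            return False
        parent = entry["id"]
    return True


if __name__ == "__main__":
    history = []
    commit(history, "alice", "add raw data", {"raw.tsv": b"a\t1\n"})
    commit(history, "bob", "add analysis",
           {"raw.tsv": b"a\t1\n", "analysis.py": b"print(1)\n"})
    print(changed_files(history, 1))
    print(verify(history))
```

Because collaborators only append records, they can work asynchronously, and anyone can later ask which files a given step touched and who made the change. Real version control systems add merging, branching, and storage of the file contents themselves.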

Active Data Biology • GitHub tracks • Data • Code • Insight • Collaboration

Total Transparency

Acknowledgements • Joon-Yong Lee • Ryan Wilson, Gary Kiebel, Grant Fujimoto • Funding: • PNNL’s Laboratory Directed R&D funds • US Dept of Energy, Early Career Award
