Active Data Biology Samuel Payne @OmicsPNNL Pacific Northwest National Laboratory
Biology Repositories
• Purpose: Hold data files and disseminate to public
• Mass Spectrometry Data
  • Proteomics, Metabolomics
  • PRIDE-EBI: 200 TB
• Sequencing Data
  • Genomics, Transcriptomics, etc.
  • SRA – 10^15 bases (through 2013). Discontinued because of space concerns
• Imaging Data
Experience Sharing Data
• ~500,000 mass spectrometry files
• ~20-50x associated files
• All data from 2000-2016
• 350 TB
• Personal Web-server
  • List data with publications
  • Send upon request
• Biodiversity Library
  • Shared through ProteomeXchange
  • 13 TB (zipped). Data from 112 bacteria and archaea
  • 6 months of data shepherding to get transferred – 4 individuals
  • 70% of file downloads from public repository
  • Represents ~5% of our data.
Overcoming Big Data
[Diagram: Raw Data, Identification, Hypotheses, Browse & Share; steps numbered 1, 2, 3]
Compliance is a losing venture
• Sharing for compliance
  • Incomplete data
  • Incomplete meta-data
  • Low emotional investment
• Sharing for collaboration
  • Invested in cooperation
  • Contains necessary and sufficient information
  • Better potential for reuse in general dissemination
Collaborative Infrastructure
• Version Control Systems
  • Allow asynchronous work
  • ‘Track changes’ and save all provenance
  • GitHub, Bitbucket, SVN, etc.
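As a rough illustration of what "track changes and save all provenance" could look like for data files, the sketch below builds a checksum manifest for a directory and compares two manifests. It is not the tooling used in this talk; the function names (`sha256_of`, `build_manifest`, `diff_manifests`) are hypothetical, and it assumes the manifest itself would be committed to a version control system alongside the data.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]):
    """Return (added, removed, changed) file lists between two manifests."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed
```

Committing such a manifest with each analysis step would let collaborators see which files were added, removed, or modified between versions, without storing the raw files in the repository itself.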
Active Data Biology
• GitHub tracks
  • Data
  • Code
  • Insight
  • Collaboration
Acknowledgements
• Joon-Yong Lee
• Ryan Wilson, Gary Kiebel, Grant Fujimoto
• Funding:
  • PNNL’s Laboratory Directed R&D funds
  • US Dept of Energy, Early Career Award