Revolutionizing the Journal through Big Data Computational Research

DataCite Annual Conference, Inist-CNRS, Vandoeuvre-lès-Nancy, France, 26 August 2014
Amye Kenall, Journal Development Manager, Open Data

Who are we?
• Founded in 2000 (bought by Springer in 2008)
• Publish over 260 open access journals
• ~25,000 peer-reviewed research articles published annually
• Genomics and computational biology are a significant fraction, e.g. Genome Biology, BMC Genomics, BMC Bioinformatics
• Other key fields include Public Health / Global Health / Infectious Disease, and Cancer
• All research articles are CC-BY licensed for reuse
• Since mid-2013, all data is covered by a CC0 rights waiver

Data reuse @BioMedCentral
• Availability of Data section and Data Citation
• Encourage use of ISA-TAB (especially GigaScience and BMC Research Notes)
• Authors of all journals are strongly encouraged to provide underlying datasets; this is required on a select number of journals (e.g. Genome Biology, Genome Medicine, GigaScience)
• CC0 + CC-BY 4.0 by default
In the works…
• Interactive tabular data
• DOIs for all additional files
• Searchability of additional files
• Data Citation clearly tagged in XML to aid harvesting, e.g. by the Data Citation Index
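As a rough illustration of why clearly tagged data citations aid harvesting, the sketch below pulls dataset references out of an article's XML. The sample fragment, the JATS-style element names (`mixed-citation`, `pub-id`) and the `publication-type="data"` attribute are assumptions made for the example, not a description of BioMed Central's actual markup; the DOI is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment. Element and attribute names follow a
# JATS-like convention and are assumptions for illustration only.
SAMPLE = """<article>
  <back>
    <ref-list>
      <ref id="R1">
        <mixed-citation publication-type="journal">Some journal article.</mixed-citation>
      </ref>
      <ref id="R2">
        <mixed-citation publication-type="data">Example dataset.
          <pub-id pub-id-type="doi">10.5555/example-dataset</pub-id>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def harvest_data_citations(xml_text):
    """Return (reference id, DOI or None) for every citation tagged as data."""
    root = ET.fromstring(xml_text)
    found = []
    for ref in root.iter("ref"):
        for citation in ref.iter("mixed-citation"):
            if citation.get("publication-type") != "data":
                continue  # skip ordinary literature references
            doi = citation.find("pub-id[@pub-id-type='doi']")
            has_doi = doi is not None and doi.text
            found.append((ref.get("id"), doi.text.strip() if has_doi else None))
    return found


if __name__ == "__main__":
    for ref_id, doi in harvest_data_citations(SAMPLE):
        print(ref_id, doi)
```

Because the data citation carries an explicit type, a harvester can select it with a single attribute check instead of guessing from free text.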

Journal, data-platform and database for large-scale data
In conjunction with

Linking and Citation

Publishing Reproducible Science: SOAPdenovo2, a case study

Lessons Learned?
• With enough work, results can be replicated with the push of a button.
• But a lot of work costs a lot of money! No one would pay an APC that reflects that cost.
• You learn a huge amount about the study, and the process yields a lot of information not present in the paper.
• This needs to happen before publication.

Reproducibility of computational research
• Computational research should, in principle, be easier to replicate/reproduce than bench studies
• However, practical issues get in the way
• Even if source code is shared, reproducing the entire technical setup, porting software, gathering appropriate input data and rerunning the analysis is a significant effort
• This means readers and even reviewers don’t bother
• We would like to reduce this ‘activation energy’

Strong interest from potential partners

Key technologies

Journal Article + Technologies + Partners

Flexible management/deployment of packaged data/analysis suites using VM infrastructure

Complementary roles of publishers, academia, and cloud providers
• Publishers have a role in enforcement of community standards
• Public/academic databases can provide credible long-term archiving for key data, with a focus on curation and metadata standards
• Academic grid computing infrastructure can provide researchers with access to large-scale computing resources
• Commercial cloud providers universalize/democratize access to large-scale computing. Even if you are not at an institution with its own facilities, you can carry out high-end computations. No bureaucracy/politics: simply pay per CPU-hour.

Specific challenges with respect to data
• To what extent can/should datasets be included in the VM/suite or pulled in externally?
• How can we avoid the cost of moving data around as it gets bigger and bigger?
• To what extent are cross-domain standards for referring to and pulling in underlying datasets feasible? Dataset DOIs typically point to metadata rather than to the data itself.
• Multiple versions of datasets. To what extent is it practical, when dealing with evolving datasets/databases, to make them available as reproducible snapshots?
• Culture of data sharing. How do we get authors to share their data?
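One simple way to picture a reproducible snapshot is a small manifest that pins a dataset version by checksum, so that an analysis suite pulling the data in externally can check it received exactly the version the paper used. The sketch below is a minimal illustration under that assumption; the manifest fields, the placeholder DOI and the toy data are all hypothetical and do not describe any existing platform.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def make_manifest(doi: str, version: str, data: bytes) -> dict:
    """Record enough information to recognise this exact snapshot later."""
    return {
        "doi": doi,          # placeholder identifier, not a real dataset
        "version": version,  # human-readable label for the snapshot
        "bytes": len(data),
        "sha256": sha256_of(data),
    }


def verify(manifest: dict, data: bytes) -> bool:
    """True only if the data matches the pinned snapshot byte for byte."""
    return (len(data) == manifest["bytes"]
            and sha256_of(data) == manifest["sha256"])


# Toy data standing in for an evolving dataset.
snapshot_v1 = b"sample_id,reads\nA,100\nB,250\n"
snapshot_v2 = snapshot_v1 + b"C,75\n"  # the database has since grown

manifest = make_manifest("10.5555/example-dataset", "v1", snapshot_v1)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
    print("v1 matches:", verify(manifest, snapshot_v1))
    print("v2 matches:", verify(manifest, snapshot_v2))
```

A checksum only detects that the data has changed; keeping the old snapshot retrievable is still an archiving question for the database or repository.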

Conclusions
• With big data and computational tools, research is becoming more “reproducible/reusable”
• The infrastructure is out there; we need to do a better job of using it
• What authors need in order to communicate their research is also changing, and as publishers we must respond
• Clearly, publishers have a role, with other organisations, in setting some community standards
• It took a few hundred years, but publishing is now getting exciting

Questions?
“One reason that the worldwide web worked was because people reused each other’s content in ways never imagined or achieved by those who created it. The same will be true of open data.” – Tim Berners-Lee and Nigel Shadbolt, The Times, New Year’s Eve 2011
Amye Kenall, Journal Development Manager (Open Data), BioMed Central
@AmyeKenall (also @OpenDataBMC)
amye.kenall@biomedcentral.com
