Reproducibility: On computational processes, dynamic data, and why we should bother
Outline What are the challenges in reproducibility? What do we gain from reproducibility? (and: why is non-reproducibility interesting?) How to address the challenges of complex processes? How to deal with "Big Data"? Summary
Challenges in Reproducibility http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
Challenges in Reproducibility Excursion: Scientific Processes
Challenges in Reproducibility Excursion: scientific processes [Figure: audio files set1_freq440Hz_Am11.0Hz, set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz processed by Java and Matlab implementations]
Challenges in Reproducibility Excursion: Scientific Processes • Bug? • Psychoacoustic transformation tables? • Forgetting a transformation? • Different implementation of filters? • Limited accuracy of calculation? • Difference in FFT implementation? • ...?
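The last two causes are easy to underestimate: even two mathematically equivalent implementations of the same transform can disagree at floating-point precision. A toy illustration (not from the talk, pure-stdlib Python): the same naive DFT, summed in two different term orders, already yields slightly different results.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT, summing terms in natural order."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft_reversed(x):
    """The same DFT, but summing terms in reverse order."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in reversed(range(n)))
            for j in range(n)]

# A 440 Hz sine sampled at 8 kHz, echoing the file names above
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
a, b = dft(signal), dft_reversed(signal)
max_diff = max(abs(u - v) for u, v in zip(a, b))
print(f"max |difference| between the two summation orders: {max_diff:.2e}")
```

Bit-identical results across a Java and a Matlab FFT are therefore unrealistic; any comparison needs an explicit tolerance.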
Challenges in Reproducibility • Workflows (Taverna)
Challenges in Reproducibility • Large-scale quantitative analysis • Obtain workflows from myExperiment.org • March 2015: almost 2,700 WFs (approx. 300-400/year) • Focus on Taverna 2 WFs: 1,443 WFs • Workflows published by their authors should be of "better quality" • Try to re-execute the workflows • Record data on the reasons for failure • Analyse the most common reasons for failure
Challenges in Reproducibility Re-execution results • The majority of workflows fail • Only 23.6% are successfully executed • No analysis yet on correctness of results… Rudolf Mayer, Andreas Rauber, "A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows", 11th IEEE Intl. Conference on e-Science, 2015.
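A harness for such a re-execution study could look roughly like the sketch below. The `runner` command and the failure categories are hypothetical placeholders for illustration, not the actual tooling of the cited study.

```python
import subprocess
from collections import Counter

def classify_failure(result):
    """Map an execution result to a coarse failure category (illustrative)."""
    if result.returncode == 0:
        return "success"
    err = (result.stderr or "").lower()
    if "unknown host" in err or "connection" in err:
        return "unreachable service"   # e.g. a web service that went offline
    if "no such file" in err:
        return "missing resource"
    return "other error"

def reexecute_all(workflows, runner):
    """Re-run each workflow via the (hypothetical) runner command
    and tally the reasons for failure."""
    reasons = Counter()
    for wf in workflows:
        result = subprocess.run([runner, wf], capture_output=True,
                                text=True, timeout=600)
        reasons[classify_failure(result)] += 1
    return reasons
```

The interesting output is the tally itself: which failure categories dominate tells you where reproducibility breaks in practice.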
Computer Science • 613 papers in 8 ACM conferences • Process • download paper and classify • search for a link to code (paper, web, email twice) • download code • build and execute Christian Collberg and Todd Proebsting. "Repeatability in Computer Systems Research", CACM 59(3):62-69, 2016.
Challenges in Reproducibility In a nutshell – and another aspect of reproducibility: Source: xkcd
Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with dynamic data? Summary
Reproducibility – solved! (?) • Provide source code, parameters, data, … • Wrap it up in a container/virtual machine (e.g. LXC), … • Why do we want reproducibility? • Which levels of reproducibility are there? • What do we gain by different levels of reproducibility?
Reproducibility – solved! (?) • Dagstuhl Seminar: Reproducibility of Data-Oriented Experiments in e-Science, January 2016, Dagstuhl, Germany
Types of Reproducibility • The PRIMAD1 model: which attributes can we "prime"? • Data • Parameters • Input data • Platform • Implementation • Method • Research Objective • Actors • What do we gain by priming one or the other? [1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented Experiments in e-Science. Dagstuhl Reports, 6(1), 2016.
Reproducibility Papers • Aim for reproducibility: for one’s own sake – and as Chairs of conference tracks, editor, reviewer, supervisor, … • Review of reproducibility of submitted work (material provided) • Encouraging reproducibility studies • (Messages to stakeholders in Dagstuhl Report) • Consistency of results, not identity! • Reproducibility studies and papers • Not just re-running code / a virtual machine • When is a reproducibility paper worth the effort / worth being published?
Reproducibility Papers • When is a Reproducibility paper worth being published?
Learning from Non-Reproducibility • Do we always want reproducibility? • Scientifically speaking: yes! • Research is addressing challenges: • Looking for and learning from non-reproducibility! • Non-reproducibility occurs if some (unknown) aspect of a study influences the results • Technical: parameter sweep, bug in code, OS, … -> fix it! • Non-technical: input data! (specifically: "the user")
Learning from Non-Reproducibility Challenges in MIR – “things don’t seem to work” • Virtual Box, Github, <your favourite tool> are starting points • Same features, same algorithm, different data -> • Same data, different listeners -> • Understanding “the rest”: • Isolating unknown influence factors • Generating hypotheses • Verifying these to understand the “entire system”, cultural and other biases, … • Benchmarks and Meta-Studies
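Isolating unknown influence factors can start with a systematic sweep: vary one factor at a time and flag those whose variation alone changes the outcome. A minimal sketch (illustrative, not a tool from the talk):

```python
import itertools

def sweep(experiment, grid):
    """Run `experiment` over the full parameter grid; collect outcomes."""
    keys = sorted(grid)
    results = {}
    for values in itertools.product(*(grid[k] for k in keys)):
        results[tuple(values)] = experiment(**dict(zip(keys, values)))
    return keys, results

def sensitive_parameters(keys, results):
    """Flag parameters whose variation *alone* changes the outcome:
    compare pairs of runs that differ in exactly one position."""
    sensitive = set()
    for combo, outcome in results.items():
        for i, key in enumerate(keys):
            for other, other_outcome in results.items():
                if (other[:i] == combo[:i] and other[i + 1:] == combo[i + 1:]
                        and other[i] != combo[i] and other_outcome != outcome):
                    sensitive.add(key)
    return sensitive
```

Such a sweep only finds technical factors; the non-technical ones ("the user", cultural biases) need hypotheses and meta-studies, as above.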
Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with “Big Data”? Summary
Déjà vu… http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
And the solution is… Standardization and Documentation • Standardized components, procedures, workflows • Documenting the complete system set-up across the entire provenance chain • How to do this – efficiently? Alexander Graham Bell’s Notebook, March 9, 1876 https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG Pieter Bruegel the Elder: De Alchemist (British Museum, London)
Documenting a Process • Context Model: establish what to document and how • Meta-model for describing process & context • Extensible architecture integrated by core model • Reusing existing models as much as possible • Based on ArchiMate, implemented using OWL • Extracted by static and dynamic analysis
Context Model – Static Analysis • Analyses steps, platforms, services, tools called • Dependencies (packages, libraries) • HW, SW, licenses, … #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex [Figure labels: Script, Context Model (OWL ontology), ArchiMate model, Taverna Workflow]
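A crude first step of such static analysis can be sketched in a few lines: scan the script and record the first word of every non-comment command line as an invoked tool. (The real extraction is of course far more thorough: package dependencies, licenses, hardware, etc.)

```python
import re

def extract_tools(script_text):
    """List the external tools a shell script invokes (first word of each
    non-comment command line) -- a crude stand-in for real static analysis."""
    tools = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and the shebang
        tools.append(re.split(r"\s+", line)[0])
    return tools

script = """\
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
R --vanilla < iq_utf8.r > IQout.txt
pdflatex iq.tex
"""
print(extract_tools(script))
```

Each extracted tool name can then be resolved against the package manager to record exact versions in the context model.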
Context Model – Dynamic Analysis • Process Migration Framework (PMF) • designed for automatic redeployments into virtual machines • uses strace to monitor system calls • complete log of all accessed resources (files, ports) • captures and stores process instance data • analyse resources (file formats via PRONOM, PREMIS)
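The dynamic side can be approximated by parsing an strace log for successfully opened files. The log lines below are simplified for illustration; real strace output is considerably messier, and the PMF captures far more than open calls.

```python
import re

# One line per system call, in the style of `strace -e trace=file`
# (format simplified here for illustration)
TRACE = """\
openat(AT_FDCWD, "IQData.zip", O_RDONLY) = 3
openat(AT_FDCWD, "/usr/lib/R/library/base/R/base", O_RDONLY) = 4
openat(AT_FDCWD, "missing.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
"""

def accessed_files(trace):
    """Extract successfully opened paths from a (simplified) strace log:
    a negative return value means the open failed, so we skip it."""
    files = []
    for line in trace.splitlines():
        m = re.match(r'openat\(AT_FDCWD, "([^"]+)".*\) = (-?\d+)', line)
        if m and int(m.group(2)) >= 0:
            files.append(m.group(1))
    return files

print(accessed_files(TRACE))
```

The resulting file list is what gets characterised (formats via PRONOM, preservation metadata via PREMIS) and stored with the process instance.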
Context Model – Dynamic Analysis Taverna Workflow
Process Capture • Preservation and Re-deployment • "Encapsulate" as complex Research Object (RO) • DP: Re-deployment beyond original environment • Format migration of elements of ROs • Cross-compilation of code • Emulation-as-a-Service • Verification upon re-deployment
VFramework Original environment Repository Redeployment environment Preserve Redeploy Are these processes the same?
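One simple (and deliberately strict) way to operationalize "the same process" is to hash the significant outputs of both runs and compare. This is only a sketch of the idea; the actual VFramework compares much richer provenance than output bytes.

```python
import hashlib
from pathlib import Path

def fingerprint(output_dir, patterns=("*.txt", "*.pdf")):
    """Hash every significant output file so two runs can be compared."""
    digests = {}
    base = Path(output_dir)
    for pattern in patterns:
        for path in sorted(base.rglob(pattern)):
            digests[str(path.relative_to(base))] = hashlib.sha256(
                path.read_bytes()).hexdigest()
    return digests

def same_process(original_dir, redeployed_dir):
    """The redeployed process counts as 'the same' only if all compared
    outputs are bit-identical -- a deliberately strict criterion."""
    return fingerprint(original_dir) == fingerprint(redeployed_dir)
```

Bit-identity often fails for benign reasons (timestamps embedded in PDFs, for instance), which is exactly why verification needs to distinguish significant from insignificant differences.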
VFramework #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex
VFramework ADDED #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex NOT USED
Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with “Big Data”? Summary
Data and Data Citation • So far the focus was on the process • Processes work with data • Data as a "1st-class citizen" in science • We need to be able to • preserve data and keep it accessible • cite data to give credit and show which data was used • identify precisely the data used in a study/process for reproducibility, evaluating progress, … • Why is this difficult? (after all, it's being done…)
Data and Data Citation • Common approaches to data management… (from PhD Comics: A Story Told in File Names, 28.5.2010) Source: http://www.phdcomics.com/comics.php?f=1323
Identification of Dynamic Data • Citable datasets have to be static • Fixed set of data, no changes: no corrections to errors, no new data being added • But: (research) data is dynamic • Adding new data, correcting errors, enhancing data quality, … • Changes sometimes highly dynamic, at irregular intervals • Current approaches • Identifying the entire data stream, without any versioning • Using "accessed at" date • "Artificial" versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!) • Would like to identify precisely the data as it existed at a specific point in time
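The prerequisite for this is time-stamped, versioned data. A minimal sketch (illustrative, not a reference implementation): an append-only store where every change closes the old version's validity interval, so the dataset "as of" any past time can be reconstructed.

```python
import datetime

class VersionedStore:
    """Append-only store: every record change is kept with a validity
    interval, so the data 'as of' any past time can be reconstructed."""

    def __init__(self):
        self.rows = []  # tuples of (key, value, valid_from, valid_to)

    def put(self, key, value, now=None):
        now = now or datetime.datetime.now(datetime.timezone.utc)
        for i, (k, v, t0, t1) in enumerate(self.rows):
            if k == key and t1 is None:
                self.rows[i] = (k, v, t0, now)  # close the old version
        self.rows.append((key, value, now, None))

    def as_of(self, ts):
        """Return the dataset exactly as it existed at time `ts`."""
        return {k: v for k, v, t0, t1 in self.rows
                if t0 <= ts and (t1 is None or ts < t1)}
```

Nothing is ever deleted or overwritten, so a citation anchored to a timestamp remains resolvable even after corrections.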
Granularity of Data Identification • What about the granularity of data to be identified? • Databases collect enormous amounts of data over time • Researchers use specific subsets of data • Need to identify precisely the subset used • Current approaches • Storing a copy of the subset as used in the study -> scalability • Citing the entire dataset, providing a textual description of the subset -> imprecise (ambiguity) • Storing a list of record identifiers in the subset -> scalability, not for arbitrary subsets (e.g. when not the entire record is selected) • Would like to be able to identify precisely the subset of (dynamic) data used in a process
RDA WG Data Citation • Research Data Alliance • WG on Data Citation: Making Dynamic Data Citeable • WG officially endorsed in March 2014 • Concentrating on the problems of large, dynamic (changing) datasets • Focus! Identification of data! Not: PID systems, metadata, citation string, attribution, … • Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …) • https://rd-alliance.org/working-groups/data-citation-wg.html
Making Dynamic Data Citeable Data Citation: Data + Means-of-access • Data time-stamped & versioned (aka history) • Researcher creates working-set via some interface • Access: assign PID to QUERY, enhanced with • Time-stamping for re-execution against versioned DB • Re-writing for normalization, unique sort, mapping to history • Hashing result-set: verifying identity/correctness • leading to landing page • Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Pröll. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL), vol. 12, 2016. http://www.ieee-tcdl.org/Bulletin/current/papers/IEEE-TCDL-DC-2016_paper_1.pdf • Stefan Pröll and Andreas Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData 2013), 2013. http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf • Prototype for CSV: http://datacitation.eu/
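The core idea — normalize and time-stamp the query, hash the uniquely sorted result set — can be sketched as follows. This is a simplified illustration, not the reference implementation; `execute` stands in for running the query against the versioned DB as of the given timestamp.

```python
import hashlib
import json

def cite_query(query, timestamp, execute):
    """Make a subset citable: store the normalized, time-stamped query
    plus a hash of its sorted result set (what a PID would resolve to)."""
    normalized = " ".join(query.strip().rstrip(";").split())
    rows = sorted(execute(normalized, timestamp))  # unique sort -> stable hash
    result_hash = hashlib.sha256(
        json.dumps(rows).encode("utf-8")).hexdigest()
    return {"query": normalized, "timestamp": timestamp,
            "result_hash": result_hash}

def verify(citation, execute):
    """Re-execute against the versioned DB and check the hash still matches."""
    redo = cite_query(citation["query"], citation["timestamp"], execute)
    return redo["result_hash"] == citation["result_hash"]
```

Note that only the query and two small strings are stored, not the data itself; the sort order must be deterministic, otherwise the same subset would hash differently on each re-execution.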
Data Citation – Deployment • Researcher uses workbench to identify subset of data • Upon executing selection ("download") user gets • Data (package, access API, …) • PID (e.g. DOI) (query is time-stamped and stored) • Hash value computed over the data for local storage • Recommended citation text (e.g. BibTeX) • PID resolves to landing page • Provides detailed metadata, link to parent data set, subset, … • Option to retrieve original data OR current version OR changes • Upon activating PID associated with a data citation • Query is re-executed against time-stamped and versioned DB • Results as above are returned • Query store aggregates data usage
Data Citation – Deployment Note: the query string provides excellent provenance information on the dataset! This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump! Identify which parts of the data are used; if data changes, identify which queries (studies) are affected.
Data Citation – Output • 14 Recommendations grouped into 4 phases: • Preparing data and query store • Persistently identifying specific data sets • Resolving PIDs • Upon modifications to the data infrastructure • 2-page flyer: https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_151020.pdf • More detailed Technical Report: http://www.ieee-tcdl.org/Bulletin/current/papers/IEEE-TCDL-DC-2016_paper_1.pdf • Reference implementations (SQL, CSV, XML) and Pilots
Join RDA and Working Group If you are interested in joining the discussion, contributing a pilot, or wish to establish a data citation solution, … register for the RDA WG on Data Citation: Website: https://rd-alliance.org/working-groups/data-citation-wg.html Mailing list: https://rd-alliance.org/node/141/archive-post-mailinglist Web Conferences: https://rd-alliance.org/webconference-data-citation-wg.html List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html
3 Take-Away Messages Message 1 Aim at achieving reproducibility at different levels: Re-run, ask others to re-run Re-implement Port to different platforms Test on different data, vary parameters (and report!) If something is not reproducible -> investigate! (you might be onto something!) Encourage reproducibility studies!