Reproducibility: On computational processes, dynamic data, and why we should bother
Outline What are the challenges in reproducibility? What do we gain from reproducibility? (and: why is non-reproducibility interesting?) How to address the challenges of complex processes? How to deal with "Big Data"? Summary
Challenges in Reproducibility http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
Challenges in Reproducibility Excursion: Scientific Processes
Challenges in Reproducibility Excursion: scientific processes [Figure: audio files set1_freq440Hz_Am11.0Hz, set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz processed by Java and Matlab implementations]
Challenges in Reproducibility Excursion: Scientific Processes • Bug? • Psychoacoustic transformation tables? • Forgetting a transformation? • Different implementation of filters? • Limited accuracy of calculation? • Difference in FFT implementation? • ...?
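The last two causes are easy to underestimate: even two mathematically equivalent implementations of the same transform can disagree at floating-point precision. A toy illustration (not from the talk, pure-stdlib Python): the same naive DFT, summed in two different term orders, already yields slightly different results.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT, summing terms in natural order."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft_reversed(x):
    """The same DFT, but summing terms in reverse order."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in reversed(range(n)))
            for j in range(n)]

# A 440 Hz sine sampled at 8 kHz, echoing the file names above
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
a, b = dft(signal), dft_reversed(signal)
max_diff = max(abs(u - v) for u, v in zip(a, b))
print(f"max |difference| between the two summation orders: {max_diff:.2e}")
```

Bit-identical results across a Java and a Matlab FFT are therefore unrealistic; any comparison needs an explicit tolerance.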
Challenges in Reproducibility • Workflows (Taverna)
Challenges in Reproducibility • Large-scale quantitative analysis • Obtain workflows from myExperiment.org • March 2015: almost 2,700 WFs (approx. 300-400/year) • Focus on Taverna 2 WFs: 1,443 WFs • Workflows published by their authors should be of "better quality" • Try to re-execute the workflows • Record data on the reasons for failure • Analyse the most common reasons for failure
Challenges in Reproducibility Re-execution results • The majority of workflows fail • Only 23.6% are successfully executed • No analysis yet on correctness of results… Rudolf Mayer, Andreas Rauber, "A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows", 11th IEEE Intl. Conference on e-Science, 2015.
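A harness for such a re-execution study could look roughly like the sketch below. The `runner` command and the failure categories are hypothetical placeholders for illustration, not the actual tooling of the cited study.

```python
import subprocess
from collections import Counter

def classify_failure(result):
    """Map an execution result to a coarse failure category (illustrative)."""
    if result.returncode == 0:
        return "success"
    err = (result.stderr or "").lower()
    if "unknown host" in err or "connection" in err:
        return "unreachable service"   # e.g. a web service that went offline
    if "no such file" in err:
        return "missing resource"
    return "other error"

def reexecute_all(workflows, runner):
    """Re-run each workflow via the (hypothetical) runner command
    and tally the reasons for failure."""
    reasons = Counter()
    for wf in workflows:
        result = subprocess.run([runner, wf], capture_output=True,
                                text=True, timeout=600)
        reasons[classify_failure(result)] += 1
    return reasons
```

The interesting output is the tally itself: which failure categories dominate tells you where reproducibility breaks in practice.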
Computer Science • 613 papers in 8 ACM conferences • Process • download paper and classify • search for a link to code (paper, web, email twice) • download code • build and execute Christian Collberg and Todd Proebsting. "Repeatability in Computer Systems Research", CACM 59(3):62-69, 2016.
Challenges in Reproducibility In a nutshell – and another aspect of reproducibility: Source: xkcd
Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with dynamic data? Summary
Reproducibility – solved! (?) • Provide source code, parameters, data, … • Wrap it up in a container/virtual machine (e.g. LXC), … • Why do we want reproducibility? • Which levels of reproducibility are there? • What do we gain by different levels of reproducibility?
Reproducibility – solved! (?) • Dagstuhl Seminar: Reproducibility of Data-Oriented Experiments in e-Science, January 2016, Dagstuhl, Germany
Types of Reproducibility • The PRIMAD1 model: which attributes can we "prime"? • Data • Parameters • Input data • Platform • Implementation • Method • Research Objective • Actors • What do we gain by priming one or the other? [1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented Experiments in e-Science. Dagstuhl Reports, 6(1), 2016.
Reproducibility Papers • Aim for reproducibility: for one’s own sake – and as Chairs of conference tracks, editor, reviewer, supervisor, … • Review of reproducibility of submitted work (material provided) • Encouraging reproducibility studies • (Messages to stakeholders in Dagstuhl Report) • Consistency of results, not identity! • Reproducibility studies and papers • Not just re-running code / a virtual machine • When is a reproducibility paper worth the effort / worth being published?
Reproducibility Papers • When is a Reproducibility paper worth being published?
Learning from Non-Reproducibility • Do we always want reproducibility? • Scientifically speaking: yes! • Research is addressing challenges: • Looking for and learning from non-reproducibility! • Non-reproducibility occurs if some (unknown) aspect of a study influences the results • Technical: parameter sweep, bug in code, OS, … -> fix it! • Non-technical: input data! (specifically: "the user")
Learning from Non-Reproducibility Challenges in MIR – “things don’t seem to work” • Virtual Box, Github, <your favourite tool> are starting points • Same features, same algorithm, different data -> • Same data, different listeners -> • Understanding “the rest”: • Isolating unknown influence factors • Generating hypotheses • Verifying these to understand the “entire system”, cultural and other biases, … • Benchmarks and Meta-Studies
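Isolating unknown influence factors can start with a systematic sweep: vary one factor at a time and flag those whose variation alone changes the outcome. A minimal sketch (illustrative, not a tool from the talk):

```python
import itertools

def sweep(experiment, grid):
    """Run `experiment` over the full parameter grid; collect outcomes."""
    keys = sorted(grid)
    results = {}
    for values in itertools.product(*(grid[k] for k in keys)):
        results[tuple(values)] = experiment(**dict(zip(keys, values)))
    return keys, results

def sensitive_parameters(keys, results):
    """Flag parameters whose variation *alone* changes the outcome:
    compare pairs of runs that differ in exactly one position."""
    sensitive = set()
    for combo, outcome in results.items():
        for i, key in enumerate(keys):
            for other, other_outcome in results.items():
                if (other[:i] == combo[:i] and other[i + 1:] == combo[i + 1:]
                        and other[i] != combo[i] and other_outcome != outcome):
                    sensitive.add(key)
    return sensitive
```

Such a sweep only finds technical factors; the non-technical ones ("the user", cultural biases) need hypotheses and meta-studies, as above.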
Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with “Big Data”? Summary
Déjà vu… http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
And the solution is… Standardization and Documentation • Standardized components, procedures, workflows • Documenting the complete system set-up across the entire provenance chain • How to do this – efficiently? Alexander Graham Bell’s Notebook, March 9, 1876 https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG Pieter Bruegel the Elder: De Alchemist (British Museum, London)
Documenting a Process • Context Model: establish what to document and how • Meta-model for describing process & context • Extensible architecture integrated by core model • Reusing existing models as much as possible • Based on ArchiMate, implemented using OWL • Extracted by static and dynamic analysis
Context Model – Static Analysis • Analyses steps, platforms, services, tools called • Dependencies (packages, libraries) • HW, SW, licenses, … #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex [Figure labels: Script, Context Model (OWL ontology), ArchiMate model, Taverna Workflow]
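A crude first step of such static analysis can be sketched in a few lines: scan the script and record the first word of every non-comment command line as an invoked tool. (The real extraction is of course far more thorough: package dependencies, licenses, hardware, etc.)

```python
import re

def extract_tools(script_text):
    """List the external tools a shell script invokes (first word of each
    non-comment command line) -- a crude stand-in for real static analysis."""
    tools = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and the shebang
        tools.append(re.split(r"\s+", line)[0])
    return tools

script = """\
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
R --vanilla < iq_utf8.r > IQout.txt
pdflatex iq.tex
"""
print(extract_tools(script))
```

Each extracted tool name can then be resolved against the package manager to record exact versions in the context model.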
Context Model – Dynamic Analysis • Process Migration Framework (PMF) • designed for automatic redeployments into virtual machines • uses strace to monitor system calls • complete log of all accessed resources (files, ports) • captures and stores process instance data • analyse resources (file formats via PRONOM, PREMIS)
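The dynamic side can be approximated by parsing an strace log for successfully opened files. The log lines below are simplified for illustration; real strace output is considerably messier, and the PMF captures far more than open calls.

```python
import re

# One line per system call, in the style of `strace -e trace=file`
# (format simplified here for illustration)
TRACE = """\
openat(AT_FDCWD, "IQData.zip", O_RDONLY) = 3
openat(AT_FDCWD, "/usr/lib/R/library/base/R/base", O_RDONLY) = 4
openat(AT_FDCWD, "missing.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
"""

def accessed_files(trace):
    """Extract successfully opened paths from a (simplified) strace log:
    a negative return value means the open failed, so we skip it."""
    files = []
    for line in trace.splitlines():
        m = re.match(r'openat\(AT_FDCWD, "([^"]+)".*\) = (-?\d+)', line)
        if m and int(m.group(2)) >= 0:
            files.append(m.group(1))
    return files

print(accessed_files(TRACE))
```

The resulting file list is what gets characterised (formats via PRONOM, preservation metadata via PREMIS) and stored with the process instance.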
Context Model – Dynamic Analysis Taverna Workflow
Process Capture • Preservation and Re-deployment • "Encapsulate" as complex Research Object (RO) • DP: Re-deployment beyond original environment • Format migration of elements of ROs • Cross-compilation of code • Emulation-as-a-Service • Verification upon re-deployment
VFramework Original environment Repository Redeployment environment Preserve Redeploy Are these processes the same?
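One simple (and deliberately strict) way to operationalize "the same process" is to hash the significant outputs of both runs and compare. This is only a sketch of the idea; the actual VFramework compares much richer provenance than output bytes.

```python
import hashlib
from pathlib import Path

def fingerprint(output_dir, patterns=("*.txt", "*.pdf")):
    """Hash every significant output file so two runs can be compared."""
    digests = {}
    base = Path(output_dir)
    for pattern in patterns:
        for path in sorted(base.rglob(pattern)):
            digests[str(path.relative_to(base))] = hashlib.sha256(
                path.read_bytes()).hexdigest()
    return digests

def same_process(original_dir, redeployed_dir):
    """The redeployed process counts as 'the same' only if all compared
    outputs are bit-identical -- a deliberately strict criterion."""
    return fingerprint(original_dir) == fingerprint(redeployed_dir)
```

Bit-identity often fails for benign reasons (timestamps embedded in PDFs, for instance), which is exactly why verification needs to distinguish significant from insignificant differences.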
VFramework #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex
VFramework ADDED #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex NOT USED
Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with “Big Data”? Summary
Data and Data Citation • So far the focus was on the process • Processes work with data • Data as a "1st-class citizen" in science • We need to be able to • preserve data and keep it accessible • cite data to give credit and show which data was used • identify precisely the data used in a study/process for reproducibility, evaluating progress, … • Why is this difficult? (after all, it's being done…)
Data and Data Citation • Common approaches to data management… (from PhD Comics: A Story Told in File Names, 28.5.2010) Source: http://www.phdcomics.com/comics.php?f=1323
Identification of Dynamic Data • Citable datasets have to be static • Fixed set of data, no changes: no corrections to errors, no new data being added • But: (research) data is dynamic • Adding new data, correcting errors, enhancing data quality, … • Changes sometimes highly dynamic, at irregular intervals • Current approaches • Identifying the entire data stream, without any versioning • Using "accessed at" date • "Artificial" versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!) • Would like to identify precisely the data as it existed at a specific point in time
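The prerequisite for this is time-stamped, versioned data. A minimal sketch (illustrative, not a reference implementation): an append-only store where every change closes the old version's validity interval, so the dataset "as of" any past time can be reconstructed.

```python
import datetime

class VersionedStore:
    """Append-only store: every record change is kept with a validity
    interval, so the data 'as of' any past time can be reconstructed."""

    def __init__(self):
        self.rows = []  # tuples of (key, value, valid_from, valid_to)

    def put(self, key, value, now=None):
        now = now or datetime.datetime.now(datetime.timezone.utc)
        for i, (k, v, t0, t1) in enumerate(self.rows):
            if k == key and t1 is None:
                self.rows[i] = (k, v, t0, now)  # close the old version
        self.rows.append((key, value, now, None))

    def as_of(self, ts):
        """Return the dataset exactly as it existed at time `ts`."""
        return {k: v for k, v, t0, t1 in self.rows
                if t0 <= ts and (t1 is None or ts < t1)}
```

Nothing is ever deleted or overwritten, so a citation anchored to a timestamp remains resolvable even after corrections.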
Granularity of Data Identification • What about the granularity of data to be identified? • Databases collect enormous amounts of data over time • Researchers use specific subsets of data • Need to identify precisely the subset used • Current approaches • Storing a copy of the subset as used in the study -> scalability • Citing the entire dataset, providing a textual description of the subset -> imprecise (ambiguity) • Storing a list of record identifiers in the subset -> scalability, not for arbitrary subsets (e.g. when not the entire record is selected) • Would like to be able to identify precisely the subset of (dynamic) data used in a process
RDA WG Data Citation • Research Data Alliance • WG on Data Citation: Making Dynamic Data Citeable • WG officially endorsed in March 2014 • Concentrating on the problems of large, dynamic (changing) datasets • Focus! Identification of data! Not: PID systems, metadata, citation string, attribution, … • Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …) • https://rd-alliance.org/working-groups/data-citation-wg.html
Making Dynamic Data Citeable Data Citation: Data + Means-of-access • Data time-stamped & versioned (aka history) • Researcher creates working-set via some interface • Access: assign PID to QUERY, enhanced with • Time-stamping for re-execution against versioned DB • Re-writing for normalization, unique sort, mapping to history • Hashing result-set: verifying identity/correctness • leading to landing page • Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Pröll. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL), vol. 12, 2016. http://www.ieee-tcdl.org/Bulletin/current/papers/IEEE-TCDL-DC-2016_paper_1.pdf • Stefan Pröll and Andreas Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData 2013), 2013. http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf • Prototype for CSV: http://datacitation.eu/
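The core idea — normalize and time-stamp the query, hash the uniquely sorted result set — can be sketched as follows. This is a simplified illustration, not the reference implementation; `execute` stands in for running the query against the versioned DB as of the given timestamp.

```python
import hashlib
import json

def cite_query(query, timestamp, execute):
    """Make a subset citable: store the normalized, time-stamped query
    plus a hash of its sorted result set (what a PID would resolve to)."""
    normalized = " ".join(query.strip().rstrip(";").split())
    rows = sorted(execute(normalized, timestamp))  # unique sort -> stable hash
    result_hash = hashlib.sha256(
        json.dumps(rows).encode("utf-8")).hexdigest()
    return {"query": normalized, "timestamp": timestamp,
            "result_hash": result_hash}

def verify(citation, execute):
    """Re-execute against the versioned DB and check the hash still matches."""
    redo = cite_query(citation["query"], citation["timestamp"], execute)
    return redo["result_hash"] == citation["result_hash"]
```

Note that only the query and two small strings are stored, not the data itself; the sort order must be deterministic, otherwise the same subset would hash differently on each re-execution.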
Data Citation – Deployment • Researcher uses workbench to identify subset of data • Upon executing selection ("download") user gets • Data (package, access API, …) • PID (e.g. DOI) (query is time-stamped and stored) • Hash value computed over the data for local storage • Recommended citation text (e.g. BibTeX) • PID resolves to landing page • Provides detailed metadata, link to parent data set, subset, … • Option to retrieve original data OR current version OR changes • Upon activating PID associated with a data citation • Query is re-executed against time-stamped and versioned DB • Results as above are returned • Query store aggregates data usage
Data Citation – Deployment Note: the query string provides excellent provenance information on the dataset! This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump! Identify which parts of the data are used; if data changes, identify which queries (studies) are affected.
Data Citation – Output • 14 Recommendations grouped into 4 phases: • Preparing data and query store • Persistently identifying specific data sets • Resolving PIDs • Upon modifications to the data infrastructure • 2-page flyer: https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_151020.pdf • More detailed Technical Report: http://www.ieee-tcdl.org/Bulletin/current/papers/IEEE-TCDL-DC-2016_paper_1.pdf • Reference implementations (SQL, CSV, XML) and Pilots
Join RDA and Working Group If you are interested in joining the discussion, contributing a pilot, or wish to establish a data citation solution, … register for the RDA WG on Data Citation: Website: https://rd-alliance.org/working-groups/data-citation-wg.html Mailing list: https://rd-alliance.org/node/141/archive-post-mailinglist Web Conferences: https://rd-alliance.org/webconference-data-citation-wg.html List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html
3 Take-Away Messages Message 1 Aim at achieving reproducibility at different levels: Re-run, ask others to re-run Re-implement Port to different platforms Test on different data, vary parameters (and report!) If something is not reproducible -> investigate! (you might be onto something!) Encourage reproducibility studies!