Data Management Plans: A good idea, but not sufficient
Outline
• Why are Data Management Plans good but insufficient?
• From Data to Process Management Plans
• How to capture process & context?
• Summary
Sustainable (e-)Science
• Data is a key enabler in science
  • Basis for evaluation and verification
  • Basis for re-use
  • Basis for meta-studies
• Safeguarding the investment made in data
• Need to preserve and curate the data
• Preservation: keeping data usable over time, fighting mostly technical & semantic obsolescence
• How to avoid data being lost after projects end?
Sustainable (e-)Science
• Data Management Plans as integral part of research proposals
• Need recognized by researchers, funding bodies, …
• Focus on
  • Data descriptions
  • Declarations of activities to ensure long-term availability of data
• Data Management Plans are good, but not sufficient!
https://dmp.cdlib.org/
https://data.uni-bielefeld.de/de/data-management-plan
https://dmponline.dcc.ac.uk/
Data Management Plans
• Short, free-form text, requiring human interpretation
• Declarations of intent
• Not enforceable, hardly verifiable
• (Burden remains with researchers / institutions, who need to become data management experts)
• Focus solely on data, ignoring the process: pre-processing, processing, analysis
• Limits availability of data & results, verification of results, re-use and re-purposing
http://rci.ucsd.edu/_files/DMP%20Example%20Cosman.pdf
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/86586/CoE_DMP_template_v1.pdf?sequence=1
From Data to Processes Excursion: Scientific Processes
From Data to Processes
• Rhythm Pattern feature set
  • extracts numeric descriptors from audio
  • basically 2 Fourier transforms
  • some psycho-acoustic modelling
  • some filters (Gaussian, gradient) to make features more robust
• Used for
  • music genre classification
  • clustering of music by similarity
  • retrieval
• Implemented first in Matlab, then in Java
  • both publicly available on website
  • same same but different...
From Data to Processes
Excursion: Scientific Processes
[Figure: Java and Matlab panels for the test signals set1_freq440Hz_Am11.0Hz, set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz]
From Data to Processes
Excursion: Scientific Processes
• Bug?
• Psychoacoustic transformation tables?
• Forgetting a transformation?
• Different implementation of filters?
• Limited accuracy of calculation?
• Difference in FFT implementation?
• ...?
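Two of the candidate causes listed above, limited accuracy of calculation and differences in FFT implementation, can be illustrated with a small sketch. It is not the Rhythm Pattern code: the signal, the function names and the choice of a direct DFT versus a recursive radix-2 FFT are assumptions made only for illustration. Both routines compute the same transform mathematically, yet they sum floating-point terms in a different order, so their outputs need not be bit-identical.

```python
import cmath
import math


def dft_naive(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# Hypothetical test signal: two sinusoids sampled at 64 points.
signal = [
    math.sin(2 * math.pi * 5 * t / 64) + 0.5 * math.sin(2 * math.pi * 12 * t / 64)
    for t in range(64)
]

spectrum_dft = dft_naive(signal)
spectrum_fft = fft_radix2(signal)

# Largest absolute disagreement between the two implementations.
max_diff = max(abs(a - b) for a, b in zip(spectrum_dft, spectrum_fft))
print(f"largest difference between implementations: {max_diff:.3e}")
```

Any such rounding-level disagreement is harmless on its own, but later steps (filters, thresholds, classification) can amplify it, which is one reason to document the process and not only the data.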
From Data to Processes http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
From Data to Processes
To sum up:
• Data
  • is the fuel for scientific processes
  • is the result of scientific processes
• Curation of data thus needs to consider these processes
• Data Management Plans
  • are data-centric
  • put too little focus on the processes associated with data
  • are written by humans for humans
Outline
• Why are Data Management Plans insufficient?
• From Data to Process Management Plans
• How to capture process & context?
• Summary
Process Management Plans
• Process Management Plans (PMPs)
• Go beyond data to cover the research process: ideas, steps, tools, documentation, results, …
• Data is only one (important) element, commonly the result of a research (pre-)process
• Ensure re-executability, re-usability
• Must be machine-actionable & verifiable
• Basis for preservation and re-use of research
• Similar to “research objects”, “executable papers”, …
Process Management Plans
• Need to establish models for representing such Process Management Plans (PMPs)
  • Must be machine-readable and machine-actionable
  • Identify “minimum set” of information
• Devise means to automate (most of) the activity in creating and maintaining those PMPs
• Establish them to replace (enhance / subsume / …) Data Management Plans
Process Management Plans
Structure of PMPs (following the concept of DMPs):
• Overview and context
• Description of processes and their implementation: Process description | Process implementation | Data used and produced by process
• Preservation: Preservation history | Long-term storage and funding
• Sharing and reuse: Sharing | Reuse | Verification | Legal aspects
• Monitoring and external dependencies
• Adherence and review
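To make the idea of a machine-readable and machine-actionable PMP more concrete, the structure above could be sketched roughly as follows. This is a minimal illustration, not a proposed standard: the section names mirror the list above, while the class names, field types and the completeness check are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field

# Section names follow the PMP structure listed above; the field layout is hypothetical.
REQUIRED_SECTIONS = (
    "overview_and_context",
    "processes",
    "preservation",
    "sharing_and_reuse",
    "monitoring_and_external_dependencies",
    "adherence_and_review",
)


@dataclass
class ProcessEntry:
    description: str
    implementation: str
    data_used: list = field(default_factory=list)
    data_produced: list = field(default_factory=list)


@dataclass
class ProcessManagementPlan:
    overview_and_context: str = ""
    processes: list = field(default_factory=list)
    preservation: dict = field(default_factory=dict)
    sharing_and_reuse: dict = field(default_factory=dict)
    monitoring_and_external_dependencies: list = field(default_factory=list)
    adherence_and_review: str = ""


def missing_sections(plan):
    """Return the names of sections that are still empty (a simple automated check)."""
    record = asdict(plan)
    return [name for name in REQUIRED_SECTIONS if not record[name]]


plan = ProcessManagementPlan(
    overview_and_context="Music genre classification experiment",
    processes=[
        ProcessEntry(
            description="Extract numeric features from audio",
            implementation="Java feature extractor",
            data_used=["music (MP3)"],
            data_produced=["feature vectors"],
        )
    ],
)

plan_json = json.dumps(asdict(plan), indent=2)  # machine-readable serialization
gaps = missing_sections(plan)                   # machine-actionable completeness check
print(plan_json)
print("sections still to be filled:", gaps)
```

Unlike a free-form text plan, a record of this kind can be validated, compared across versions and filled in partly by tools.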
Outline
• Why are Data Management Plans insufficient?
• From Data to Process Management Plans
• How to capture process & context?
• Summary
Process Capture
• Need to establish what forms part of a process:
  • analyzing process documentation
  • establishing context of process, relationships between elements
  • monitoring of process activities
• Capture and describe this in a context model
Architectural Concepts
• Based on Enterprise Architecture Framework (Zachman), taxonomies (e.g. PREMIS), …
• DIO: Domain-Independent Ontology
• DSO: Domain-Specific Ontologies (legal, sensor, multimedia codecs, …)
Process Capture
Example: Music Classification Process
• Input: music (e.g. MP3 format)
• Input: training data, i.e. music with genre labels
• Output: classification of music, e.g. into genres
• Intermediate steps
  • extract numeric description (features) from music
  • combine features with ground truth into specific file format, …
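The example process above can be written down as a small dependency model from which inputs, intermediate data and outputs are derived automatically. This is a sketch only: the step and data item names are hypothetical labels, and the final "classify" step is an assumption added so that the stated output has a producer.

```python
from graphlib import TopologicalSorter

# Each step declares the data items it reads and writes (names are hypothetical).
steps = {
    "extract_features": {"inputs": {"music"}, "outputs": {"features"}},
    "combine_with_ground_truth": {
        "inputs": {"features", "genre_labels"},
        "outputs": {"training_file"},
    },
    # Assumed final step, not named on the slide.
    "classify": {"inputs": {"training_file"}, "outputs": {"genre_classification"}},
}

# Which step produces which data item.
producers = {item: name for name, spec in steps.items() for item in spec["outputs"]}

# A step depends on every step that produces one of its inputs.
dependencies = {
    name: {producers[item] for item in spec["inputs"] if item in producers}
    for name, spec in steps.items()
}
execution_order = list(TopologicalSorter(dependencies).static_order())

consumed = set().union(*(spec["inputs"] for spec in steps.values()))
produced = set(producers)

external_inputs = consumed - produced    # must be preserved alongside the process
intermediate_data = consumed & produced  # candidate points of capture
final_outputs = produced - consumed      # results of the process

print(execution_order)
print(sorted(external_inputs), sorted(intermediate_data), sorted(final_outputs))
```

Even this small model shows that the "data" of the experiment is partly produced by the process itself, which is why describing the data alone is not enough.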
Process Capture Taverna …………….
Process Capture
• Software setup can be detected automatically on operating systems with package management (e.g. Linux); this allows detection of licenses and dependencies
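The slide refers to operating-system package managers. As a smaller-scale analogue, the same kind of automatic detection can be sketched for the packages of a Python environment using the standard library's `importlib.metadata`. The record layout and the function name are assumptions; the license field is simply whatever each package declares and may be missing.

```python
from importlib import metadata


def capture_python_environment():
    """List installed Python distributions with version, declared license and dependencies."""
    records = []
    for dist in metadata.distributions():
        meta = dist.metadata
        records.append(
            {
                "name": meta.get("Name"),
                "version": dist.version,
                "license": meta.get("License"),        # may be None if not declared
                "requires": list(dist.requires or []),  # declared dependencies
            }
        )
    return sorted(records, key=lambda record: (record["name"] or "").lower())


environment = capture_python_environment()
for record in environment[:5]:
    print(record["name"], record["version"], record["license"])
```

A snapshot like this can be stored with the process description so that a later re-deployment knows which software versions the original run depended on.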
Process Capture • Example: • Music Classification Workflow
Business Application Technology
Process Re-deployment
• Preservation and Re-deployment
  • “Encapsulate” as complex “research objects” (RO)
  • Re-deployment beyond original environment
  • Format migration of elements of ROs
  • Cross-compilation of code
  • Emulation-as-a-Service, virtual machines, …
Process Re-deployment
• Verification, Validation & Data
  • Verify correctness of re-execution
    • validation and verification framework
    • process instance data
    • points of capture
    • Metrics
  • Data and data citation
    • Identifying subsets of data in large and dynamic databases
    • Timestamping and versioning of data
    • Assigning PID (DOI, …) to time-stamped query
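The data citation idea above (versioned data, a time-stamped query, an identifier assigned to that query) can be sketched with an in-memory SQLite database. Everything specific here is an assumption: the table layout, the integer timestamps and the use of a SHA-256 fingerprint as a stand-in for a real PID such as a DOI.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (track_id INTEGER, genre TEXT, valid_from INTEGER, valid_to INTEGER)"
)
# Rows are never overwritten: an update closes the old version and adds a new one.
# valid_to IS NULL marks the current version. Timestamps are hypothetical integers.
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?, ?)",
    [
        (1, "rock", 1, None),
        (2, "jazz", 1, 5),      # relabelled at time 5
        (2, "blues", 5, None),
        (3, "rock", 7, None),   # added at time 7
    ],
)

QUERY = (
    "SELECT track_id, genre FROM tracks "
    "WHERE genre = ? AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
    "ORDER BY track_id"
)


def cite_subset(genre, as_of):
    """Run the query as of a timestamp and fingerprint query, parameters and result."""
    result = conn.execute(QUERY, (genre, as_of, as_of)).fetchall()
    payload = repr((QUERY, genre, as_of, result)).encode("utf-8")
    return {
        "query": QUERY,
        "parameters": (genre, as_of),
        "result": result,
        "fingerprint": hashlib.sha256(payload).hexdigest(),  # stand-in for a PID
    }


citation_early = cite_subset("rock", as_of=3)
citation_late = cite_subset("rock", as_of=8)
print(citation_early["result"], citation_early["fingerprint"][:12])
print(citation_late["result"], citation_late["fingerprint"][:12])
```

Because the query is stored with its timestamp, the exact subset used in a study can be retrieved again later even though the database has changed in the meantime.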
Sustainable (e-)Science
How to get there?
• Research infrastructure support
  • Versioning systems
  • Logging (“virtual lab-book”)
  • Virtual machines / pre-configured virtual labs for research
  • Data citation support for large, dynamic databases
• R&D in process preservation, re-deployment & verification
  • Evolving research environments, code migration, …
  • Verification of process re-execution
  • Financial impact, business models
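Verification of process re-execution, listed above, could in its simplest form compare values recorded at named points of capture in the original run with those from the re-executed run. The sketch below assumes numeric lists per capture point and a relative tolerance as the metric; the capture point names and tolerance values are hypothetical.

```python
import math


def verify_reexecution(original, rerun, rel_tol=1e-9):
    """Compare values recorded at named capture points of two runs."""
    report = {}
    for point in sorted(set(original) | set(rerun)):
        if point not in original or point not in rerun:
            report[point] = "missing"
            continue
        a, b = original[point], rerun[point]
        same = len(a) == len(b) and all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12) for x, y in zip(a, b)
        )
        report[point] = "match" if same else "deviation"
    return report


# Hypothetical captured values from the original and the re-deployed environment.
original_run = {"after_fft": [1.0, 2.0, 3.0], "after_filter": [0.5, 0.25]}
rerun = {"after_fft": [1.0, 2.0, 3.0 + 1e-13], "after_filter": [0.5, 0.26]}

report = verify_reexecution(original_run, rerun)
print(report)
```

Checking intermediate capture points, not only the final result, helps locate the step at which a re-deployed process starts to diverge.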
Summary
• Need to move beyond the concept of data
• Need to move beyond the focus on description
• Process Management Plans (PMPs) extending DMPs
• Process capture, preservation & verification
• Capture “all” elements of a research process
• Machine-readable and -actionable
• Data and process re-use as basis for data-driven science
Thank you! http://www.ifs.tuwien.ac.at/imp