
RDA WG on Dynamic Data Citation

This working group focuses on making dynamic data citeable by addressing the challenges of citing arbitrary subsets of data and data that is constantly changing. The goal is to develop a scalable and implementable solution that remains stable across technology changes. The group also liaises with other initiatives on data citation.


Presentation Transcript


  1. RDA WG on Dynamic Data Citation

  2. RDA WG Data Citation
     • Research Data Alliance
     • WG on Data Citation: Making Dynamic Data Citeable
     • WG officially endorsed in March 2014
     • 2 areas of focus
       • Citing arbitrary subsets of data
       • Citing data that is dynamic
     • Stable across technology changes, scalable, implementable
     • Not in focus
       • Metadata for citing data, landing page design
       • Which PID solution to adopt (DOI, URI, ARK, …)
     • Liaise with other initiatives on data citation (CODATA, DataCite, Force11, …)
     https://rd-alliance.org/working-groups/data-citation-wg.html

  3. Citation of Dynamic Data
     • Citable datasets have to be static
       • Fixed set of data, no changes: no corrections to errors, no new data being added
     • But: (research) data is dynamic
       • Adding new data, correcting errors, enhancing data quality, …
       • Changes sometimes highly dynamic, at irregular intervals
     • Current approaches
       • Identifying entire data stream, without any versioning
       • Using “accessed at” date (how useful is this on its own?)
       • “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases
         • time-delayed
         • artificial boundaries
     • Would like to cite (and retrieve) precisely the data as it existed at a certain point in time

  4. Granularity of Data Citation
     • What about granularity of data to be cited?
       • Databases collect enormous amounts of data over time
       • Researchers use specific subsets (“rows/columns”) of data
       • Need to identify precisely the subset used
     • Current approaches
       • Storing pre-defined subsets -> not flexible enough
       • Storing a copy of subset as used in study -> scalability
       • Citing entire dataset, providing textual description of subset -> imprecise (ambiguity)
       • Storing list of record identifiers in subset (“rows”) -> scalability, not for arbitrary subsets (e.g. when not entire record selected)
     • Would like to be able to cite precisely the subset of (dynamic) data used in a study

  5. Data Citation – Requirements
     • Dynamic data
       • corrections, additions, …
     • Arbitrary subsets of data (granularity)
       • rows/columns, time sequences, …
       • from a single number to the entire set
     • Stable across technology changes
       • e.g. migration to new database / data representation
     • Machine-actionable
       • not just machine-readable, definitely not just human-readable and interpretable
     • Scalable to very large / highly dynamic datasets
       • but should also work for small, static datasets

  6. Dynamic Data Citation
     • We have: Data + Means-of-access
     • Dynamic Data Citation: cite data dynamically via query
     Steps:
     • Data → versioned (history, with time-stamps)
     Researcher creates working-set via some interface:
     • Access → assign PID to “QUERY”, enhanced with
       • Time-stamping for re-execution against versioned DB
       • Re-writing for normalization, unique-sort, mapping to history
       • Hashing result-set: verifying identity/correctness
       leading to landing page
     S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013. http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf
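The steps on this slide (versioned data, time-stamped query, re-writing with a unique sort, hashing the result set) can be sketched roughly as follows. This is a minimal illustration, not the reference implementation from the cited paper: the `readings` table, the `valid_from`/`valid_to` columns, the integer logical timestamps, and all function names are assumptions made for this example.

```python
import hashlib
import sqlite3

# Hypothetical schema: every row version carries a validity interval.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    " station TEXT, value REAL,"
    " valid_from INTEGER NOT NULL,"  # logical time at which this version was written
    " valid_to INTEGER)"             # NULL while this version is the current one
)

def insert(station, value, ts):
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (station, value, ts))

def correct(station, value, ts):
    # Close the current version instead of overwriting it, then add the new one.
    conn.execute(
        "UPDATE readings SET valid_to = ? WHERE station = ? AND valid_to IS NULL",
        (ts, station),
    )
    insert(station, value, ts)

def execute_as_of(where_sql, params, ts):
    # Re-written query: map to the history (versions valid at ts) and impose a unique sort.
    # where_sql is trusted input in this sketch; a real system would not splice SQL like this.
    sql = (
        "SELECT station, value FROM readings"
        f" WHERE ({where_sql})"
        " AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)"
        " ORDER BY station, valid_from"
    )
    return conn.execute(sql, (*params, ts, ts)).fetchall()

def result_hash(rows):
    # Hash over the ordered result set, used later to verify identity/correctness.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

insert("A", 1.0, 10)
insert("B", 2.0, 10)

# Researcher selects a subset at logical time 20; query, timestamp and hash are stored.
cited = {"where": "value >= ?", "params": (1.0,), "ts": 20}
cited["hash"] = result_hash(execute_as_of(cited["where"], cited["params"], cited["ts"]))

# The data changes afterwards: one correction and one new record.
correct("A", 1.5, 30)
insert("C", 3.0, 40)

original = execute_as_of(cited["where"], cited["params"], cited["ts"])
current = execute_as_of(cited["where"], cited["params"], 50)
```

Re-executing the stored query with its stored timestamp returns the rows as they were when cited, and `result_hash(original)` matches `cited["hash"]`, while the same query at a later timestamp reflects the correction and the added record.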

  7. Data Citation – Deployment
     Note: the query string provides valuable provenance information on the dataset!
     • Researcher uses workbench to identify subset of data
     • Upon executing selection (“download”) user gets
       • Data (package, access API, …)
       • PID (e.g. DOI) (Query is time-stamped and stored)
       • Hash value computed over the data for local storage
       • Recommended citation text (e.g. BibTeX)
     • PID resolves to landing page
       • Provides detailed metadata, link to parent data set, subset, …
       • Option to retrieve original data OR current version OR changes
     • Upon activating PID associated with a data citation
       • Query is re-executed against time-stamped and versioned DB
       • Results as above are returned
     This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers/DB dump!
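As a rough illustration of the deployment flow above, the sketch below stores a citation record under a PID and later resolves it in three modes (original data, current version, changes). Everything here is hypothetical: the in-memory `QUERY_STORE`, the named predicate standing in for a stored query string, the placeholder PID, and the integer timestamps are assumptions for the example, not part of the WG recommendation.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Version:
    key: str
    value: float
    valid_from: int
    valid_to: Optional[int] = None  # None means this version is still current

def as_of(history, ts):
    # Rows valid at time ts, in a unique, stable order.
    rows = [(v.key, v.value) for v in history
            if v.valid_from <= ts and (v.valid_to is None or v.valid_to > ts)]
    return sorted(rows)

def digest(rows):
    return hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()

# A named predicate stands in for the stored query string (assumption).
PREDICATES = {"value>=4": lambda row: row[1] >= 4.0}
QUERY_STORE = {}  # PID -> stored citation record

def cite(pid, query, ts, history):
    rows = [r for r in as_of(history, ts) if PREDICATES[query](r)]
    QUERY_STORE[pid] = {"query": query, "timestamp": ts, "hash": digest(rows)}
    return rows

def resolve(pid, history, now, mode="original"):
    rec = QUERY_STORE[pid]
    pred = PREDICATES[rec["query"]]
    # Re-execute the stored query against the versioned data and verify it.
    original = [r for r in as_of(history, rec["timestamp"]) if pred(r)]
    if digest(original) != rec["hash"]:
        raise ValueError("re-executed result does not match stored hash")
    if mode == "original":
        return original
    current = [r for r in as_of(history, now) if pred(r)]
    if mode == "current":
        return current
    if mode == "changes":
        return {"removed": [r for r in original if r not in current],
                "added": [r for r in current if r not in original]}
    raise ValueError(f"unknown mode: {mode}")

history = [
    Version("s1", 4.0, 1, 5),  # superseded at time 5
    Version("s1", 4.2, 5),     # corrected value
    Version("s2", 9.0, 1),
    Version("s3", 7.5, 6),     # added after the citation
]

cited_rows = cite("hypothetical-pid-1", "value>=4", 3, history)
original = resolve("hypothetical-pid-1", history, now=10)
current = resolve("hypothetical-pid-1", history, now=10, mode="current")
changes = resolve("hypothetical-pid-1", history, now=10, mode="changes")
```

Only the query, its timestamp, and a hash are stored per citation, so the cost of a citation does not grow with the size of the subset, in contrast to keeping a list of identifiers or a dump.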

  8. To Discuss
     • Today
       • Q/A on RDA recommendation
       • Needs in pilots present: subsets, dynamics, precision/automation
       • Feasibility of approach for pilots: SQL, CSV, XML, within-file, …
       • Effort for / Impact of / Interest in moving forward
     • Maybe for later
       • Which timestamp to assign (query-time, latest change global, latest change of data affected)
       • Importance of unique sort and query re-write complexity
       • Hash-key computation (entire result set, row/column header, subset, none)
       • Relationship between subset identified and entire dataset
       • Distributed data sources: linked / independent
       • Migration across technological changes
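To make the points on unique sort and hash-key computation more concrete, here is a small sketch showing why a hash over the result set is only comparable across executions if the row order is fixed, and how including column headers changes the hash. The `hash_rows` helper and the sample rows are made up for this illustration.

```python
import hashlib

def hash_rows(rows, columns=None, sort_key=None):
    """Hash a result set; optionally include column headers and impose a sort."""
    if sort_key is not None:
        rows = sorted(rows, key=sort_key)
    h = hashlib.sha256()
    if columns is not None:
        h.update("|".join(columns).encode("utf-8") + b"\n")
    for row in rows:
        h.update("|".join(str(c) for c in row).encode("utf-8") + b"\n")
    return h.hexdigest()

run_1 = [(1, "a"), (2, "b"), (3, "c")]
run_2 = [(3, "c"), (1, "a"), (2, "b")]  # same records, different return order

def by_id(row):
    # Assumes the first column is a unique key, so the sort order is unambiguous.
    return row[0]

unsorted_match = hash_rows(run_1) == hash_rows(run_2)
sorted_match = hash_rows(run_1, sort_key=by_id) == hash_rows(run_2, sort_key=by_id)
header_changes_hash = hash_rows(run_1, columns=["id", "label"]) != hash_rows(run_1)
```

Without a unique sort the two executions hash differently even though they contain the same records; with it they agree. Whether headers belong in the hash is one of the open choices listed on the slide.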

  9. Join RDA and Working Group
     If you are interested in joining the discussion, contributing a pilot, wish to establish a data citation solution, …
     Register for the RDA WG on Data Citation:
     Website: https://rd-alliance.org/working-groups/data-citation-wg.html
     Mailing list: https://rd-alliance.org/node/141/archive-post-mailinglist
     Web conferences: https://rd-alliance.org/webconference-data-citation-wg.html
     List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html

  10. Thank you! https://rd-alliance.org/working-groups/data-citation-wg.html
