This project aims to develop a functioning registry system that allows for the explication and sharing of assumptions in data through types and type registries. It addresses the problem of implicit assumptions in data and seeks to provide a solution for identifying and characterizing these assumptions. The goal is to enable data to be parsed, understood, and reused by individuals and applications other than the data creators.
Data Type Registries #2 RDA Chairs’ Mtg Gothenburg June 2017 Co-Chairs: Larry Lannom (CNRI), Tobias Weigel (DKRZ)
Problem: Implicit Assumptions in Data • Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data • How do we do this now? • For documents – formats are enough, e.g., PDF, and then the document explains itself to humans • This doesn’t work well with data – numbers are not self-explanatory • What does the number 7 mean in cell B27? • Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc. • Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in its creation
Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries • Evaluate and identify a few assumptions in data that can be codified and shared in order to… • Produce a functioning Registry system that can easily be evaluated by organizations before adoption • Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization • Supports several Type record dissemination variations • Design for allowing federation between multiple Registry instances • The emphasis is not on • Identifying every possible assumption and data characteristic applicable for all domains • Technology
What is a Data Type? • A unique and resolvable identifier • Which resolves to characterization of structures, conventions, semantics, and representations of data • Serves as a shortcut for humans and machines to understand and process data • File formats and mime types have solved the ‘representation’ problem at a ‘unit’ level • Examples of problems we aim to solve with data types: • It is a number in cell A3, but is it temperature? If so, in Celsius? • It is a dataset consisting of location, temperature, and time, but what variable names should I look for? • Is it all packaged as CSV or NetCDF? And as a single unit or a collection of units? • Type record structure will continue to evolve – not finished, but functioning
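As a minimal sketch of the idea on this slide, a type record can be pictured as a small structure that an identifier resolves to. The field names and the identifier `example/air-temperature-celsius` are hypothetical and chosen only for illustration; they are not the DTR record schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRecord:
    """Illustrative type record; the fields are assumptions, not the DTR data model."""
    identifier: str      # unique, resolvable identifier (hypothetical value below)
    concept: str         # semantics: what the value measures
    unit: str            # convention: unit of measurement
    representation: str  # how the value is encoded


# Hypothetical record answering "is it temperature? If so, in Celsius?"
CELSIUS_TEMPERATURE = TypeRecord(
    identifier="example/air-temperature-celsius",
    concept="air temperature",
    unit="degree Celsius",
    representation="floating point",
)


def describe(cell: str, value: float, record: TypeRecord) -> str:
    """Turn a bare number into a statement a human or program can act on."""
    return f"{cell} = {value} is {record.concept} in {record.unit} (type {record.identifier})"


print(describe("A3", 7.0, CELSIUS_TEMPERATURE))
```

With the record attached, the number in cell A3 no longer depends on knowledge held only by the data producer.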
What is a Data Type Registry? • A low-level infrastructure with wide applicability to record and disseminate type records • Not an immediate ROI application • Assigns unique and resolvable identifiers to type records • Enforces and validates common data model & expression for interoperation between multiple instances of Registries • API for machine consumption • UI for human use
Process Use Case
[Diagram: users; typed data (ID, type, payload); a federated set of Type Registries; services for visualization, rights (terms), data set dissemination, and data processing; numbered steps 1-4.]
Client (process or people) encounters unknown data type. Resolved to Type Registry. Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally typed data or reference to typed data can be sent to service provider.
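The process use case can be sketched with in-memory stand-ins for the registries. Everything here is an assumption made for illustration: the registry contents, the identifiers, the record fields, and the service URL are hypothetical, and a real client would resolve identifiers over a network API rather than a dictionary lookup.

```python
from typing import Optional

# In-memory stand-ins for two registry instances in a federation (hypothetical content).
REGISTRY_A = {
    "example/stream-gauge": {
        "definition": "Stream gauge reading",
        "properties": ["location", "water level", "timestamp"],
        "services": ["https://services.example.org/plot"],
    }
}
REGISTRY_B = {
    "example/bounding-box": {
        "definition": "Spatial bounding box",
        "properties": ["south-west corner", "north-east corner"],
        "services": [],
    }
}
FEDERATION = [REGISTRY_A, REGISTRY_B]


def resolve(type_id: str) -> Optional[dict]:
    """Ask each registry in turn; return the first matching type record."""
    for registry in FEDERATION:
        record = registry.get(type_id)
        if record is not None:
            return record
    return None


def handle(type_id: str) -> str:
    """Decide what a client does after encountering a type identifier."""
    record = resolve(type_id)
    if record is None:
        return "unknown type: cannot interpret payload"
    if record["services"]:
        # Optionally hand the typed data to a service named in the response.
        return f"send to {record['services'][0]}"
    return f"process locally as {record['definition']}"


print(handle("example/stream-gauge"))
print(handle("example/bounding-box"))
```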
Discovery Use Case
[Diagram: users; repositories and metadata registries; a federated set of Type Registries; typed data (ID, type, payload); numbered steps 1-4.]
Clients (process or people) look for types that match their criteria for data, e.g., types that combine location, temperature, and date-time stamp. Type Registry returns matching types. Clients look up in repositories and metadata registries for data sets matching those types. Appropriate typed data is returned.
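The two lookups in the discovery use case can be sketched as follows. The type identifiers, property names, and dataset names are hypothetical, and the registry and repository are plain in-memory structures standing in for real services.

```python
from typing import Iterable, List, Set

# Hypothetical type registry: type identifier -> properties the type combines.
TYPE_REGISTRY = {
    "example/weather-observation": {"location", "temperature", "date-time"},
    "example/stream-gauge": {"location", "water level", "date-time"},
}

# Hypothetical repository index: each data set is tagged with its type identifier.
REPOSITORY = [
    {"dataset": "station-2016.csv", "type": "example/weather-observation"},
    {"dataset": "river-levels.nc", "type": "example/stream-gauge"},
]


def find_types(required: Set[str]) -> List[str]:
    """Return identifiers of types whose properties include every required one."""
    return sorted(t for t, props in TYPE_REGISTRY.items() if required <= props)


def find_datasets(type_ids: Iterable[str]) -> List[str]:
    """Return data sets in the repository that carry one of the given types."""
    wanted = set(type_ids)
    return [entry["dataset"] for entry in REPOSITORY if entry["type"] in wanted]


matching = find_types({"location", "temperature", "date-time"})
print(matching)
print(find_datasets(matching))
```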
Type Registry History • Handle Types – 0.Type/SomeType • Good idea, limited applicability • Profiles – ‘type’ the whole set of handle/type/value triples. No traction • Sloan Grant: 2012-2014 • Generic Registry system using Type Registry as a use case • NSF Grant: 2013-2014 • Included support for Type Registry • Research Data Alliance (RDA) Data Type Registries Working Group • One of the first two WGs approved at Plenary 1 – March 2013 • International representation, > 50 members • Co-chairs Lannom (CNRI), Broeder (Max Planck Psycholinguistics Institute) • Approved as RDA Recommendation (2015) • DTR Phase 2: Follow-on Group • Focus on data type records • Help data producers create useful record types • Co-chairs Lannom (CNRI), Weigel (DKRZ)
Current State • A prototype is at: http://typeregistry.org/ • Multiple other implementations/projects • DKRZ, Vermont Monitoring Coop, RPID project • Implementation supports notions of primitives and derived types • Primitives are fundamental types that we expect humans and software to parse and understand • Integer, floating point, boolean value, string, date, timestamp, etc. • Derived types depend on primitives to describe something complex • Stream gauge, Lidar, Spatial bounding box, etc. • Registered types are assigned unique identifiers • ISO Study Group
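The relationship between primitives and derived types can be illustrated with a small sketch. The type names follow the examples on the slide, but the field layout of the derived types is an assumption made for illustration, not the registry's actual definitions.

```python
# Primitives: types that humans and software are expected to understand directly.
PRIMITIVES = {"integer", "floating point", "boolean", "string", "date", "timestamp"}

# Derived types: named fields whose types are primitives or other derived types.
# The field layout below is hypothetical.
DERIVED = {
    "coordinate": {"latitude": "floating point", "longitude": "floating point"},
    "spatial bounding box": {"south_west": "coordinate", "north_east": "coordinate"},
}


def flatten(type_name: str, prefix: str = "") -> dict:
    """Expand a type into a mapping from field path to primitive type."""
    if type_name in PRIMITIVES:
        return {prefix or type_name: type_name}
    flat = {}
    for field, field_type in DERIVED[type_name].items():
        path = f"{prefix}.{field}" if prefix else field
        flat.update(flatten(field_type, path))
    return flat


for path, primitive in flatten("spatial bounding box").items():
    print(path, "->", primitive)
```

However deeply a derived type is nested, a consumer ends up with fields it can parse, because every branch ends in a primitive.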
ISO Study Group Activity on Data Type Records: Background • ISO-IEC/JTC1/SC32/WG2 is currently working on a meta model for dataset description. • That meta model covers data elements "about" datasets versus "internal details" of datasets. • We refer to the record that captures internal details as a "data type record". • WG2 was receptive to the idea of exploring the "data type record" space. A study group was authorized in Nov 2016, with CNRI as lead • ISO-IEC/JTC1/SC32/WG2/SG • Present use case(s) from existing RDA members and see what fields are needed to describe the internals of the datasets pertaining to the use case(s). • Evaluate what existing ISO standards cover and what they do not. • If applicable, recommend a technical report or technical specification or standard. • The study group terminates June 2017
ISO Study Group Activity on Data Type Records: Work and Possible Outcomes • ISO members referenced portions of existing ISO standards that are applicable to the proposed data type record structure (11179-3, 11179-7, 19763-12, and 11404) • CNRI will create a UML diagram that, wherever applicable, references the pieces from each of those standards. • Feb 10th – ‘Data Type Record – Elements for Characterizing Data’ submitted to WG. • Has been posted to DTR site • Based in part on CMIP6 and Vermont Monitoring Coop examples • For discussion purposes, CNRI introduced the notion of a simple data type and a complex data type. • Simple data type describes characteristics of a single value (e.g., a cell in a spreadsheet) • Complex data type is an aggregate of simple data types (e.g., a row in a spreadsheet can be described using a complex data type) • June ’17 Decision • Technical Report: documentation on how to use existing standards for the data type use case • Technical Specification: intermediate spec, possible future standard, still under development • Technical Standard: something new and useful, fully standardized • Good news: we get an ISO number, no matter what happens
Data Type Registry Data set descriptions for automation Stephen M Richard, IEDA, EarthCube
Use cases • Document the meaning of entities and attributes in data. • Re-use of data type and attribute definitions • Machine-assisted data integration: matching attribute content. • Validation of data instances against a type definition. • Tools that spin up a UI for a particular data type. • Link software to data sources that it can use • Support file introspection to assist with deep data registration
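The validation use case can be sketched with a hand-rolled checker. This is not the JSON schema mentioned elsewhere in the deck; the type definition, field names, and primitive checks are assumptions made for illustration.

```python
from datetime import datetime

# How to check each primitive type (illustrative subset).
CHECKS = {
    "floating point": lambda v: isinstance(v, float),
    "string": lambda v: isinstance(v, str),
    "timestamp": lambda v: isinstance(v, datetime),
}

# Hypothetical type definition: field name -> primitive type.
OBSERVATION_TYPE = {
    "station": "string",
    "temperature": "floating point",
    "observed_at": "timestamp",
}


def validate(instance: dict, type_definition: dict) -> list:
    """Return a list of problems; an empty list means the instance conforms."""
    errors = []
    for field, primitive in type_definition.items():
        if field not in instance:
            errors.append(f"missing field: {field}")
        elif not CHECKS[primitive](instance[field]):
            errors.append(f"{field}: expected {primitive}")
    return errors


good = {"station": "A", "temperature": 7.0, "observed_at": datetime(2017, 6, 1, 12, 0)}
bad = {"station": "A", "temperature": "7"}
print(validate(good, OBSERVATION_TYPE))
print(validate(bad, OBSERVATION_TYPE))
```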
Progress • JSON schema implemented for data type model • https://github.com/usgin/digital-crust-LDRDataTypeJSON.json • Initial testing with Cordra • Next steps: how to get model compilations from spreadsheet or RDF to Cordra. • Where to deploy
Corporation for National Research Initiatives enrich.cordra.org • Enrich DTR • Service enhances dataset metadata • Syntactic nature (Decimal values between 0.0 and 24.0) • Semantic information (Concept time in unit hours) • Currently applying the DTR at an attribute level (Concept and Unit) • Expand to the dataset level • Dataset Metadata (row/column count, maintainer, description, etc.) • Dataset Schema (column information: constraints, formats, semantic information, etc.) • Utilizing DTR mapping, in addition to other metadata, to drive a dataset recommendation system. • Continuing to collaborate with CNRI on this.
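Attribute-level enrichment of the kind described here might look like the sketch below, which attaches the semantic information (concept and unit) to a column and checks the syntactic constraint (decimal range). The structure, function, and column name are hypothetical and do not reflect the Enrich service's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeType:
    """Illustrative attribute-level type: semantics plus a syntactic range."""
    concept: str
    unit: str
    minimum: float
    maximum: float


# The slide's example: concept "time" in unit "hours", decimal values 0.0 to 24.0.
HOURS = AttributeType(concept="time", unit="hours", minimum=0.0, maximum=24.0)


def enrich_column(name: str, values: list, attribute_type: AttributeType) -> dict:
    """Build column metadata and report values outside the declared range."""
    out_of_range = [
        v for v in values
        if not attribute_type.minimum <= v <= attribute_type.maximum
    ]
    return {
        "column": name,
        "concept": attribute_type.concept,
        "unit": attribute_type.unit,
        "conforms": not out_of_range,
        "out_of_range": out_of_range,
    }


print(enrich_column("duration", [0.5, 12.0, 24.0], HOURS))
print(enrich_column("duration", [3.0, 25.5], HOURS))
```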
Concept as it was at P8
[Diagram labels: netCDF-Files; Agent; Collection script; Processing service (WPS); <Metadata> (xml); ? (third-party input); output; described in DTR]
• well-defined ways to publish it (automatically) • possible repacking into a new collection • output multiple types, e.g. netcdf, xml, linked data, text reports, PROV record
https://rd-alliance.org/ - https://twitter.com/resdatall
Pathway for climate data processing services • Multiple upcoming angles for attaching a DT solution: • CMIP6 data • Handles, Handle records, potential PID Kernel Information profile • Copernicus Data Services • re-using existing WPS-based services • Handles? – Depends on success and acceptance of CMIP6 solution and available effort • Climate Analytics Service (DKRZ/CMCC) • server-side data processing with PID support • data sources: CMIP6, possibly Copernicus, ...
Typing angles • a) Service typing • automated discovery or at least verification • b) Data input and output typing (coarse) • to distinguish the data sources and help with management • c) Typing of data internals (fine) • traditional DTR: enable machines to understand meaning of data