Implementing the DDI4 Model in the R Language
Package DDI4R - DDI4 in R6 Classes
European DDI User Conference 2018, Berlin, Germany
Larry Hoyle and Joachim Wackerow, Implementing the DDI4 Model in the R Language
Metadata in the Statistical Package Workflow
Transparency and replication are crucial for open science
• Capture metadata at the earliest point possible
• Building this into the scientific workflow means making tools available in the platforms people use
• Attach the metadata to the data so that they travel together
  • Especially important for tables and variables
  • Example: unit of measurement, scale, precision
  • When a variable is created, this has to be known. Attach it to the variable.
• Operating on Metadata
  • Harmonization procedures
Why DDI4?
• UML Class Model maps well to programming languages
  • Classes and Objects
Why R?
• Open Source
• Popular, cross domain
• Supports user extensions (packages)
• Supports an Object Oriented approach
DDI4 implemented as R6 Classes
• R6 classes are simple to work with and follow familiar object-oriented approaches
• One R6 class for each DDI4 class
• Additional support functions
• R6 documentation:
  • https://cran.r-project.org/web/packages/R6/vignettes/Introduction.html
  • https://www.rdocumentation.org/packages/R6/versions/2.2.2
• DDI4 Model
  • https://lion.ddialliance.org/
Classes vs Objects
• A Class is a template – it defines a structure
• An Object implements a Class and contains specific content
• Example
  • “Individual” is a class that defines what information can be contained about a person
  • “Bob” is an object that implements “Individual” and contains information about an actual person
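The class/object distinction above can be sketched with R6 directly. This is a minimal illustration only: the class below is a hypothetical stand-in, and its fields (name, birthYear) are assumptions, not the properties of the DDI4 Individual class.

```r
library(R6)

# Hypothetical class for illustration; NOT the DDI4R definition of Individual.
# The class is the template: it declares which fields an object may hold.
Individual <- R6Class("Individual",
  public = list(
    name = NULL,
    birthYear = NULL,
    initialize = function(name = NULL, birthYear = NULL) {
      self$name <- name
      self$birthYear <- birthYear
    }
  )
)

# The object is one concrete instance of that template, with actual content.
bob <- Individual$new(name = "Bob", birthYear = 1970)
bob$name
```

Calling Individual$new() again with different arguments would give a second, independent object built from the same template.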
Production of the DDI4R package
DDI4 Model → XMI Exporter → XMI* → R Generator (Python) → R Package Source Code → R devtools → DDI4R R Package
* https://lion.ddialliance.org/xmi.xml
Example Class: InstanceVariable - Definition and Other Explanatory Material
Parent: RepresentedVariable
Descriptive information
• Definition
• Explanatory Notes
• Example
• Synonyms
• DDI3.2 Mapping
• GSIM mapping
R help for each Class
    ?InstanceVariable
Includes all of the documentation from the DDI4 model
?InstanceVariable Continued: Ancestry
R Help – Package level ??DDI4R
R Help – Package level ?DDI4R
DDI4 Classes and Attributes Categorized
For the purpose of implementing DDI4 in R, the DDI4 properties and associations can be assigned to five categories:
• Primitives
  • e.g. xs:date, xs:string
• Structured Datatypes
  • Not Identifiable, but can have multiple properties; no associations
• Enumerations
  • Single value, enforced from a list
• Regular Expressions
  • Single value, enforced through a regular expression
• All other content classes (Identifiables)
  • Can have multiple properties and relationships
  • Identifiable (has a DDI identifier – agency, ID, and version)
Example Class: InstanceVariable - Properties with Primitive Datatypes
Properties
These datatypes are primitives
Example Class ValueString: An enumerated value
This datatype is a DDI4 Enumerated Class
WhiteSpaceRule specifies that only these text strings can be entered
InstanceVariable – Property with a Structured Datatype
This datatype is a DDI4 Structured Datatype Class
Example Class: InstanceVariable - Relationships
Relationships
These are other Identifiable DDI4 Classes. They are reusable.
DDI4 Example: UML diagram for Code
• The content class Code:
  • Inherits from Designation, an abstract class
  • Has the relationship “denotes” to a Category (another content class)
  • Has a property “representation” whose structured datatype is ValueString, which has
    • “content”, a primitive text property
    • “whiteSpace”, an enumeration property
DDI4 Inheritance Example: Identifiable
• Inheritance
  • Code, Concept, and Category all inherit from Identifiable
  • This means they all have the properties:
    • agency
    • id
    • version
  • Together these properties form a globally unique DDI identifier
  • These can be embedded in a single string – the DDI URN
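As a rough sketch of how agency, id, and version could be folded into one string and recovered again, the base R helpers below assume the colon-separated layout of the example URN that appears elsewhere in these slides (ddi:urn:example.org:MyId17:1). The function names makeDdiUrn and parseDdiUrn are hypothetical and are not part of DDI4R.

```r
# Hypothetical helpers, not part of DDI4R. They assume the layout
# prefix:agency:id:version seen in the slides' example URN.
makeDdiUrn <- function(agency, id, version, prefix = "ddi:urn") {
  parts <- as.character(c(agency, id, version))
  if (length(parts) != 3 || any(is.na(parts)) || any(!nzchar(parts))) {
    stop("agency, id and version must each be a single non-empty value")
  }
  paste(prefix, parts[1], parts[2], parts[3], sep = ":")
}

# Split a URN back into its parts by taking the last three colon-separated
# fields. This assumes none of the three parts contains a colon itself.
parseDdiUrn <- function(urn) {
  fields <- strsplit(urn, ":", fixed = TRUE)[[1]]
  n <- length(fields)
  if (n < 5) stop("unexpected URN layout: ", urn)
  list(agency = fields[n - 2], id = fields[n - 1], version = fields[n])
}

urn <- makeDdiUrn("example.org", "MyId17", 1)
parsed <- parseDdiUrn(urn)
```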
DDI4R Objects Represent the following DDI4 Objects:
In the DDI Model: Primitives, Structured Datatypes, Enumerations, Regular Expressions, all other content classes (Identifiables)
In R:
• R6 class instances (in R lists), not identifiable
  • StructuredDatatypes
• R vectors with enforcement rules
  • Enumerations – each text value must come from a list of valid values
  • RegularExpressions – each text value must satisfy a regular expression
• R vectors of DDI URNs (e.g. ddi:urn:example.org:MyId17:1)
  • Relationships to objects inheriting from the Identifiable class (Identifiables)
  • Check for existence of the referenced object – local or in XCatalog; warn if not found
• R atomic vectors
  • Primitives become simple R vectors (e.g. c(1,2,3))
  • Values must simply match the atomic datatype (e.g. numeric, character, Boolean), enforced by normal R mechanisms
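The "R vectors with enforcement rules" idea can be sketched in base R as two small validators. Everything here is illustrative: the function names, the list of allowed whiteSpace values (only "Replace" appears in the slides), and the four-digit pattern are assumptions, and DDI4R may enforce these rules differently.

```r
# Hypothetical validators, not DDI4R functions.

# Enumeration rule: every value must come from a list of valid values.
checkEnumeration <- function(values, allowed) {
  bad <- setdiff(values, allowed)
  if (length(bad) > 0) {
    stop("Invalid enumeration value(s): ", paste(bad, collapse = ", "))
  }
  invisible(values)
}

# Regular expression rule: every value must match the pattern.
checkRegularExpression <- function(values, pattern) {
  ok <- grepl(pattern, values)
  if (!all(ok)) {
    stop("Value(s) not matching '", pattern, "': ",
         paste(values[!ok], collapse = ", "))
  }
  invisible(values)
}

# Assumed value list; only "Replace" is confirmed by the slides.
allowedWhiteSpace <- c("Preserve", "Replace", "Collapse")

accepted <- checkEnumeration(c("Replace", "Collapse"), allowedWhiteSpace)
enumRejected <- tryCatch({
  checkEnumeration("Squash", allowedWhiteSpace)
  FALSE
}, error = function(e) TRUE)

# Assumed pattern for illustration: a four-digit string.
regexRejected <- tryCatch({
  checkRegularExpression(c("2018", "18"), "^[0-9]{4}$")
  FALSE
}, error = function(e) TRUE)
```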
Using the DDI4R package – A StructuredDatatype
    vString <- DDI4_ValueString$new(content="M", whiteSpace="Replace")
• R6 Object “vString”
• R6 Class “DDI4_ValueString”
• Multiple properties for the class
Code Object Initialization Example (an Identifiable)
Initialization
    vString <- DDI4_ValueString$new(content="M", whiteSpace="Replace")
    MyCode <- DDI4_Code$new(representation=list(vString))
• MyCode is Identifiable
• A ValueString Structured Datatype
Code and Category Initialization and Active Binding Example
Initialization
    vString <- DDI4_ValueString$new(content="M", whiteSpace="Replace")
    MyCode <- DDI4_Code$new(representation=list(vString))
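The slide title mentions active bindings, an R6 feature in which a field-like name runs a function whenever it is read or assigned. The sketch below shows one way an enumerated property such as whiteSpace might be guarded this way. The class ValueStringSketch is hypothetical, the allowed value list is an assumption (only "Replace" appears in the slides), and the actual DDI4_ValueString class may be implemented differently.

```r
library(R6)

# Assumed list of valid values; only "Replace" is confirmed by the slides.
allowedWhiteSpace <- c("Preserve", "Replace", "Collapse")

# Hypothetical stand-in for a structured datatype with one enumerated property.
ValueStringSketch <- R6Class("ValueStringSketch",
  public = list(
    content = NULL,
    initialize = function(content = NULL, whiteSpace = NULL) {
      self$content <- content
      # Assigning through the active binding applies the same check.
      if (!is.null(whiteSpace)) self$whiteSpace <- whiteSpace
    }
  ),
  private = list(
    .whiteSpace = NULL
  ),
  active = list(
    # Called with no argument on read, and with `value` on assignment.
    whiteSpace = function(value) {
      if (missing(value)) return(private$.whiteSpace)
      if (length(value) != 1 || !(value %in% allowedWhiteSpace)) {
        stop("whiteSpace must be one of: ",
             paste(allowedWhiteSpace, collapse = ", "))
      }
      private$.whiteSpace <- value
    }
  )
)

vs <- ValueStringSketch$new(content = "M", whiteSpace = "Replace")

# An invalid assignment raises an error and leaves the stored value unchanged.
rejected <- tryCatch({
  vs$whiteSpace <- "Squash"
  FALSE
}, error = function(e) TRUE)
```

The appeal of this pattern is that users keep ordinary `object$property <- value` syntax while the rule is still applied on every assignment.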
Attaching Metadata to Data: R attributes
• R objects can have attributes as key-value pairs
• The value can be other R objects
• Example:
  • For a variable named “ivX”
  • Attach a persistent link to the variable’s metadata to the dataframe “df”
  • Retrieve a piece of that metadata
    # attach the metadata using the DDI URN
    attr(df,"DDIivX_DdiUrn") <- DDIivX$DdiUrn
    # get the DDI URN
    DDIivX_DdiUrn <- attr(df,"DDIivX_DdiUrn")
    # retrieve metadata using the DDI URN
    getObjectFromDdiUrn(DDIivX_DdiUrn)$unitOfMeasurement
Identifiable Objects vs the others
• Identifiable DDI4R objects have:
  • One or more R names. These are keys in an R environment. The R names cannot be counted on to persist
  • One globally unique, persistent identifier – the DDI URN.
• DDI4R Structured Datatypes, Enumerations, and RegularExpressions:
  • Can have one or more R names. These are keys in an R environment. The R names cannot be counted on to persist
  • When one of these R6 objects is used as a property of an Identifiable, it becomes part of the structure of the Identifiable, even if the original R name of the StructuredDatatype goes away.
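The last point follows from reference semantics: an R name is only a binding, and removing it does not destroy an object that something else still holds. The base R sketch below uses a plain environment as a stand-in for an R6 object (R6 objects are themselves environments, so they behave the same way here); the names vString and myCode mirror the slides, but nothing in it calls DDI4R.

```r
# A plain environment stands in for an R6 structured datatype object.
vString <- new.env()
vString$content <- "M"

# Use it as a property of a containing structure (a plain list here,
# standing in for an Identifiable such as a Code).
myCode <- list(representation = list(vString))

# Remove the original R name ...
rm(vString)
nameGone <- !exists("vString", inherits = FALSE)

# ... the object is still reachable as part of the containing structure.
stillThere <- myCode$representation[[1]]$content
```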
Support Functions
The package provides a number of functions that work with DDI4R objects. These import DDI4 XML, export DDI4 XML, validate DDI4R object references, search for DDI URNs in an XML Catalog, and manage the local registry. Future functions could output DDI in other bindings and generate codebooks and scripts for other platforms.
Tooltips in RStudio
R Help
Attaching DDI4R objects to R Native Objects
    # make a test dataset
    df <- data.frame(x=c(1,2,3), y=c('one','two','three'), stringsAsFactors=FALSE)
    # define InstanceVariable metadata for potential dataframe
    DDIivX <- DDI4_InstanceVariable$new(id="DDIivX", unitOfMeasurement="FooBars")
    DDIivY <- DDI4_InstanceVariable$new(id="DDIiDivY", unitOfMeasurement="count")
    # retrieve the DDI URN for the ivX metadata
    DDIivX_DdiUrn <- DDIivX$DdiUrn
    DDIivX_DdiUrn
Attaching DDI4R objects to R Native Objects
    # attach the metadata using the DDI URN
    attr(df, "DDIivX_DdiUrn") <- DDIivX$DdiUrn
    attr(df, "DDIivY_DdiUrn") <- DDIivY$DdiUrn
    attributes(df)
Attaching DDI4R objects to R Native Objects
    # retrieve the metadata
    DDIivX_DdiUrn <- attr(df, "DDIivX_DdiUrn")
    DDIivX_DdiUrn
    getObjectFromDdiUrn(DDIivX_DdiUrn)$unitOfMeasurement
Attaching DDI4R objects to R Native Objects
    # make a test dataset
    df <- data.frame(x=c(1,2,3), y=c('one','two','three'), stringsAsFactors=FALSE)
    # define InstanceVariable metadata for potential dataframe
    DDIivX <- DDI4_InstanceVariable$new(id="iDivX", unitOfMeasurement="FooBars")
    ivY <- DDI4_InstanceVariable$new(id="iDivY", unitOfMeasurement="count")
    # retrieve the DDI URN for the DDIivX metadata
    DDIivX_DdiUrn <- DDIivX$DdiUrn
    DDIivX_DdiUrn
    # attach the metadata using the DDI URN
    attr(df,"DDIivX_DdiUrn") <- DDIivX$DdiUrn
    attr(df,"ivY_DdiUrn") <- ivY$DdiUrn
    attributes(df)
    # retrieve the metadata
    DDIivX_DdiUrn <- attr(df,"DDIivX_DdiUrn")
    DDIivX_DdiUrn
    getObjectFromDdiUrn(DDIivX_DdiUrn)$unitOfMeasurement
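For readers without the package installed, the same round trip can be imitated in base R. This is only a stand-in under stated assumptions: the registry environment, registerObject, and lookupObject are hypothetical, the metadata is a plain list rather than an R6 object, and the URN string is made up to follow the example format in these slides. How DDI4R's own registry and getObjectFromDdiUrn() work internally is not shown in the source.

```r
# Stand-in local registry: an environment keyed by URN string.
registry <- new.env()

# Hypothetical helper: store an object under its URN.
registerObject <- function(urn, object) {
  assign(urn, object, envir = registry)
  invisible(urn)
}

# Hypothetical helper: fetch an object by URN, warning if it is unknown.
lookupObject <- function(urn) {
  if (!exists(urn, envir = registry, inherits = FALSE)) {
    warning("No object registered for ", urn)
    return(NULL)
  }
  get(urn, envir = registry, inherits = FALSE)
}

# Metadata for column x, held as a plain list instead of an R6 object.
# The URN is invented for illustration.
urnX <- "ddi:urn:example.org:iDivX:1"
registerObject(urnX, list(id = "iDivX", unitOfMeasurement = "FooBars"))

# Test dataset, with the URN attached as an attribute of the data frame.
df <- data.frame(x = c(1, 2, 3), y = c("one", "two", "three"),
                 stringsAsFactors = FALSE)
attr(df, "DDIivX_DdiUrn") <- urnX

# Later: follow the URN stored on the data frame back to the metadata.
unitX <- lookupObject(attr(df, "DDIivX_DdiUrn"))$unitOfMeasurement
unitX
```

Storing only the URN on the data frame, rather than the metadata object itself, keeps the attribute small and lets the link survive as long as the identifier can be resolved.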
What Have We Gotten Done?
• Production Process
  • Reads DDI4 model XMI
  • Writes an R6 Class Definition for each DDI4 Class
  • Multiple helper functions
    • Local registry management for R6 identifiables
    • XCatalog DDI URN lookup
  • Converts DDI4 documentation to R package format
• Import DDI4 XML
  • Creates R6 objects, and writes an R script that can reproduce them
• Export DDI4 XML
What Have We Learned?
• An automated production process from the DDI4 XMI is possible
  • This could be incorporated as “official”, or
  • External user communities could develop similar processes for other languages
• Having the DDI4 model in XMI is useful
• The basic infrastructure for DDI4 in R works
• The complexity of the DDI4 model is reflected in the R model
  • Several things contribute: e.g. multiple language support, complex datatypes
• Uncovered several quirks in the DDI4 model
• Found issues with round-tripping from the DDI4 XML
What Next?
• Human Readable Codebooks
  • PDF
  • RTF?
• Import and export to other bindings
  • DDI4 RDF
  • DDI2.x and 3.x
  • JSON?
• User Interface
User Interface
• Data Born in R
  • User prompting for structure
• Nicer / interactive reporting
  • Shiny?
  • JavaScript extensions?