Building Infrastructure for Data Management
25 April 2014
Larry Lannom
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
Three Part Talk
• Organizing for Infrastructure: RDA
• Building Infrastructure: Data Type Registries
• Using Infrastructure: Deep Carbon Observatory, Handles/DOIs
The Information Age – Extraordinary Potential for Driving Science and Bettering Society
• More Efficient Physical Infrastructure
• Contribution to a safer and more secure world
• Transformative strategies for disease treatment and well-being
• More goods and services
• More Research Insights
Key Driver 1: Data Sharing Accelerating Discovery and Innovation
Data Sharing is a Global Issue
• Libraries, Archives, Repositories, Museums
• Science, Humanities, Arts Communities
• Cyberinfrastructure professionals, data analysts, data center staff, …
• Data Scientists
Key Driver 2: Community Effort Accelerating Impact
• Development of public access shared data collection enabling new results for Alzheimer’s
• Creation / adoption of data sharing policies have accelerated research innovation
• Development and adoption of shared parallel communication protocols through the MPI Forum drove a generation of advances
• Now 25 years old, the Internet Engineering Task Force’s mission “to make the Internet work better” has resulted in key specifications of Internet common community standards that support innovation
• “Just do it”: Focused efforts help communities drive tangible progress
Image credits: MPI Forum photo by ErezHeba; PDB molecule of the month at http://www.rcsb.org/pdb/home/home.do
Enabling Technologies [diagram labels: ID; Scientists, Data Curators, End Users, Applications; Datasets]
Enabling Technologies [diagram labels: ID; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories]
Enabling Technologies [diagram labels: ID; Discovery; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories]
Discovery & Evaluation
• Search
  • Metadata registries
    • Subject
    • Parties
    • Dates
    • Etc
  • Crawlers – more ad hoc
• Citation
• Formats
• Permissions
  • Can I see it?
  • Can I use it?
• Trust
Enabling Technologies [diagram labels: ID; Discovery; Access; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories]
Access
• ID / reference resolution
• Access Protocols
  • How to get it
  • Protocol registries
  • Bootstrapping into new protocols
• Authentication & Authorization
  • Proof of identity (tradeoff: usability vs security)
  • Permissions: with the object or in some external system?
Enabling Technologies [diagram labels: ID; Discovery; Access; Interpretation; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories]
Interpretation
• Registries
  • Schemas
  • Vocabularies
  • Formats
  • Available services
  • Useful client-side tools
• Trust
  • Who did this?
  • Who owns this?
• Provenance
  • Data Source
  • Processing steps
  • Computing environment
  • What is needed to trust the numbers?
  • Domain specific?
Enabling Technologies [diagram labels: ID; Discovery; Access; Interpretation; Reuse; Scientists, Data Curators, End Users, Applications; Datasets Accessed via Repositories]
Reuse
• Everything from Interpretation slide + Permissions
  • Example: I need to understand a data set for peer review but that doesn’t give me permission to use the data
• Validation
• Education & Training
  • Integrate ‘live’ data into education and training
• Repurpose data
The Research Data Alliance (RDA)
• Global community-driven organization launched in March 2013 to accelerate data-driven innovation
• RDA focus is on building the social, organizational and technical infrastructure to
  • reduce barriers to data sharing and exchange
  • accelerate the development of coordinated global data infrastructure
RDA Vision and Mission
• Research Data Alliance Mission: RDA builds the social and technical bridges that enable data sharing.
• Research Data Alliance Vision: Researchers and innovators openly share data across technologies, disciplines, and countries to address the grand challenges of society.
Goal of RDA Infrastructure: Support Data Sharing and Interoperability Across Cultures, Scales, Technologies
• Common data types for data interoperability
• Persistent identifiers
• Domain-focused portals
• Harmonized standards
• Data access and preservation policy and practice
• Tools for data discoverability, …
RDA Members come together as
• Working Groups – 12-18 month efforts to build, adopt, and use specific pieces of infrastructure
• Interest Groups – longer-lived discussion forums that spawn Working Groups as specific pieces of needed infrastructure are identified
• Working Group efforts focus on the development and use of data sharing infrastructure
  • Code, policy, infrastructure, standards, or best practices that are adopted and used by communities to enable data sharing
  • “Harvestable” efforts for which 12-18 months of work can eliminate a roadblock
  • Efforts that have substantive applicability to groups within the data community, but may not apply to everyone
  • Efforts for which working scientists and researchers can start today
[figure labels: CREATE, ADOPT, USE]
RDA Plenaries: Venue for community building and WG / IG progress
• RDA Plenary 1 / Launch: March 2013 in Gothenburg, Sweden; 240 participants; 3 WG, 9 IG
• RDA Plenary 2: September 2013 in Washington, DC; 380 participants; 6 WG, 17 IG, 5 BOF
• RDA Plenary 3: March 2014 in Dublin, Ireland; 497 participants; 12 WG, 22 IG, 14 BOF; 6 co-located events
• RDA Plenary 4: Sept 2014 in Amsterdam
[photo captions: Plenary 1, Plenary 2, Plenary 3]
Fran Berman
RDA Plenaries Emerging as a Data Community “Town Square”
Emerging Plenary Format:
• All-hands sessions: Place for community networking and exchange of information (funding agencies, data organizations, key stakeholders)
• Working sessions: Face-to-face opportunities for global Interest Groups, Working Groups, and BOFs to meet and advance their agendas
• Neutral meeting place: Place for multiple groups to meet and form a common agenda and action plan (e.g. Plenary 2 Data Citation Harmonization Summit)
Precipitous Growth
• RDA Launch / First Plenary, March 2013: First Working Groups and Interest Groups; 240 participants
• RDA Second Plenary, September 2013: First “neutral space” community meeting (Data Citation Summit); First Org. Partner Meet-up; First BOFs; 380 participants from 22 countries
• RDA Third Plenary, March 2014: First Org. Assembly; 6 co-located events; 14 BOF, 12 Working Groups, 22 Interest Groups; 497 participants
• RDA Fourth Plenary, September 2014: Amsterdam
RDA Community Evolving Rapidly: Over 1500 members from 70+ countries (as of 3/15/14)
[membership map, legible region labels: Africa 2%, South America 1%, Asia 4%, Austral-pacific 4%; map courtesy traveltip.org]
RDA Interest (IG) and Working Groups (WG) effectively doubling each Plenary (Groups as of 1/14)
Community Needs - focused
• Community Capability Model IG
• Engagement IG
• Clouds in Developing Countries IG
Domain Science - focused
• Toxicogenomics Interoperability IG
• Structural Biology IG
• Biodiversity Data Integration IG
• Agricultural Data Interoperability IG
• Digital History and Ethnography IG
• Defining Urban Data Exchange for Science IG
• Marine Data Harmonization IG
• Materials Data Management IG
Reference and Sharing - focused
• Data Citation IG
• Data Categories and Codes WG
• Legal Interoperability IG
Data Stewardship - focused
• Research Data Provenance IG
• Certification of Digital Repositories IG
• Preservation e-infrastructure
• Long-tail of Research Data IG
• Publishing Data IG
• Domain Repositories IG
• Global Registry of Trusted Data Repositories and Services IG
Base Infrastructure - focused
• Data Foundations and Terminology WG
• Metadata Standards WG
• Practical Policy WG
• PID Information Types WG
• Data Type Registries WG
• Metadata IG
• Big Data Analytics IG
• Data Brokering IG
RDA Organizational Framework nearly at Steady State
• RDA Council: Responsible for overarching mission, vision, impact of RDA
• Technical Advisory Board: Responsible for Technical roadmap and interactions
• Secretary-General and Secretariat: Responsible for administration and operations
• Organizational Advisory Board and Organizational Assembly: Responsible for organizational and strategic advice
• RDA Membership
• Working Groups: Responsible for impactful, outcome-oriented efforts
• Interest Groups: Responsible for defining and refining common issues
• RDA Colloquium (Research Funders): Operational and community sponsorship
Coming in Fall: First RDA Infrastructure Deliverables
Scheduled to Complete Summer 2014
• Data Type Registries WG
  • Deliverables: System of data type registries, formal model for describing types, working model of a registry.
  • Initial Adopters and Users: CNRI, International DOI Foundation, Deep Carbon Observatory
• Practical Code Policies
  • Deliverables: Survey of policies in production use, testbed of machine actionable policies, deployment of 5 policy sets, policy starter kits
  • Initial Adopters and Users: RENCI, DataNet Federation Consortium, CESNET, Odum Institute
• Persistent Identifier Information Types
  • Deliverables: Minimal set of PID types, API
  • Initial Adopters and Users: Data Conservancy, DKRZ
Scheduled to Complete Fall 2014
• Language Codes
  • Deliverables: Operationalization of ISO language categories for repositories.
  • Initial Adopters and Users: Language Archive, Paradisec
• Data Foundations and Terminology
  • Deliverables: Common vocabulary for data terms, formal definitions and open registry for data terms
  • Initial Adopters and Users: EUDAT, DKRZ, Deep Carbon Observatory, CLARIN, EPOS
• Metadata Standards
  • Deliverables: Use cases and prototype directory of current metadata standards starting from DCC directory
  • Initial Adopters and Users: JISC, DataOne
RDA Medium Term (3-5 year) Goals
• Create a pipeline of data sharing infrastructure efforts
  • that are adopted and used by communities during their development
  • that increase their impact through greater adoption over time
• Build and expand the research data community for effective impact
  • globally, regionally, and within constituent groups
• Evolve as a useful, relevant, and agile organization
  • that helps the community capitalize on opportunity and respond to challenges within the data community
RDA as an Accelerant of Existing Projects
• This is already the case
• RDA is helping expand the impact of at least two Sloan-funded projects.
  • CNRI Interoperability Platform
    • LEI Prototype
    • Type Registry
  • Deep Carbon Observatory (DCO)
    • Data science infrastructure (RPI)
• DCO now working with CNRI in the context of the RDA Data Type Registries Working Group
What are Data Types?
• Characterize data structures at multiple levels of granularity
• Serve as macro or shortcut for understanding and processing data
• File formats & mime types are examples of solved problems at the container level but don’t solve finer grained interpretation
  • It’s a number in cell A3 but what does it mean?
• Other structures with more limited use, e.g., many scientific data sets, may need multiple levels of typing
• Data types enable humans and machines to discover, process, and reason about data
Data Type Registries
• Each type registered with unique identifier
• Common data model and expression
• Associate with services, tools, format registries, etc.
• Common API for machine consumption
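The registry properties on this slide can be illustrated with a small in-memory sketch. This is not the working group's registry: the `TypeRegistry` and `TypeRecord` names, the `example` identifier prefix, the identifier scheme, and the method names are all assumptions made for illustration.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class TypeRecord:
    """Hypothetical common data model shared by every registered type."""
    identifier: str
    name: str
    description: str
    related: list[str] = field(default_factory=list)  # services, tools, format registries


class TypeRegistry:
    """Hypothetical registry exposing one small API for machine clients."""

    def __init__(self, prefix: str = "example") -> None:
        self.prefix = prefix
        self._records: dict[str, TypeRecord] = {}

    def register(self, name: str, description: str,
                 related: list[str] | None = None) -> TypeRecord:
        # Mint a unique identifier for the new type (illustrative scheme only).
        identifier = f"{self.prefix}/{uuid.uuid4().hex}"
        record = TypeRecord(identifier, name, description, list(related or []))
        self._records[identifier] = record
        return record

    def lookup(self, identifier: str) -> TypeRecord | None:
        # Return the record for an identifier, or None if it is not registered.
        return self._records.get(identifier)


registry = TypeRegistry()
celsius = registry.register(
    "temperature-celsius",
    "Air temperature in degrees Celsius",
    related=["https://example.org/convert"],  # placeholder URL
)
print(registry.lookup(celsius.identifier).name)
```

A real registry would persist records and mint resolvable persistent identifiers; the sketch only shows the shape of register and lookup.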
RDA Data Type Registries WG
• Goal: Interoperable set of Type Registries
• Approved as RDA WG at Plenary 1
• Co-chairs
  • Larry Lannom – CNRI
  • Daan Broeder – Max Planck Institute for Psycholinguistics
• Membership
  • 44 participants
  • U.S., UK, Netherlands, Germany, Italy, Australia, Finland, Canada, Kenya, Japan
  • Various scientific fields, Practitioners, Librarians, Publishers
• Schedule
  • 3/2013 – 9/2013: gather use cases, begin design, including data model
  • 10/2013 – 12/2013: refine model, begin prototyping
  • 1/2014 – 5/2014: finalize data model & functional specs, deploy functional registry for Handle types, release turnkey registry
DTR Use Cases
• Broad Functional Classification
  • Repos hold widely varying levels of data & metadata
  • High-level functional classification of the identified object needed to make sense of what is available, e.g., data object, metadata, repo description, contact info, etc.
• Simple License Information via PID Resolution
  • Data set access conditions cannot be predicted based on ID
  • For DataCite DOIs, a handle/type/value triple could be used to provide access information, probably through a level of indirection, resulting in a pop-up or intervening page or open linked data
• Object Types as a Short-cut for Dependent Services to Match Processing Requirements to Data Objects
  • Using data acquisition as an example
  • Determine object type you are trying to build
  • Consult registry to index into an ontology to dynamically define required and optional properties
  • Does the input data have what is needed?
• Registration of PID Types (in ID/Type/Value triples) for Data Processing and Interpretation
  • Distinguish pointers to objects from pointers to metadata from pointers to services
  • Enable complex client interactions as opposed to simple one-to-one re-direction
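The data acquisition use case ends with the question "Does the input data have what is needed?". A minimal sketch of that check follows, assuming a registry has already returned a type definition as a dictionary of required and optional property names; the dictionary layout, the property names, and the `check_against_type` helper are hypothetical.

```python
def check_against_type(type_definition: dict, candidate: dict) -> dict:
    """Compare a candidate record with a type's required and optional properties."""
    required = set(type_definition["required"])
    optional = set(type_definition.get("optional", []))
    present = set(candidate)
    return {
        "ok": required <= present,                      # every required property supplied
        "missing": sorted(required - present),          # required but absent
        "unrecognized": sorted(present - required - optional),  # not named by the type
    }


# Hypothetical definition, as a registry might return it for an observation type.
sample_type = {
    "required": ["location", "temperature", "timestamp"],
    "optional": ["instrument"],
}

report = check_against_type(
    sample_type,
    {"location": "45.0,-122.0", "temperature": 11.5},
)
print(report)
```

Reporting unrecognized properties separately, rather than rejecting them, leaves the decision about extra fields to the acquiring service.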
Discovery Use Case
[diagram labels: Users; ID, Type, Payload; Repositories and Metadata Registries; Federated Set of Type Registries; step markers 1-4]
1. Clients (process or people) look for types that match their criteria for data, e.g., types that combine location, temperature, and date-time stamp.
2. Type Registry returns matching types.
3. Clients look up in repositories and metadata registries for data sets matching those types.
4. Appropriate typed data is returned.
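The discovery flow described on this slide (find matching types, then find data sets of those types) can be sketched with two lookups. Everything here is an assumption for illustration: the in-memory `TYPE_REGISTRY` and `REPOSITORY` structures, the `example/` identifiers, and the idea of matching on property names.

```python
# Hypothetical registry contents: type identifier -> properties the type carries.
TYPE_REGISTRY = {
    "example/weather-obs": {"location", "temperature", "timestamp"},
    "example/site-list": {"location", "name"},
}

# Hypothetical repository holdings: each data set records the type it conforms to.
REPOSITORY = [
    {"id": "example/ds-1", "type": "example/weather-obs"},
    {"id": "example/ds-2", "type": "example/site-list"},
    {"id": "example/ds-3", "type": "example/weather-obs"},
]


def find_types(required_properties: set) -> list:
    """Return identifiers of types that carry every requested property."""
    return sorted(
        type_id
        for type_id, properties in TYPE_REGISTRY.items()
        if required_properties <= properties
    )


def find_datasets(type_ids: list) -> list:
    """Return identifiers of data sets whose declared type is in type_ids."""
    wanted = set(type_ids)
    return [dataset["id"] for dataset in REPOSITORY if dataset["type"] in wanted]


types = find_types({"location", "temperature", "timestamp"})
print(types, find_datasets(types))
```

In a federated deployment the first lookup would fan out across several registries; here a single dictionary stands in for all of them.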
Process Use Case
[diagram labels: Users; ID, Type, Payload; Federated Set of Type Registries; Typed Data; Terms:… I Agree; Visualization Rights Data Set Dissemination Data Processing Services; step markers 1-4]
1. Client (process or people) encounters unknown type.
2. Resolved to Type Registry.
3. Response includes type definitions, relationships, properties, and possibly service pointers.
4. Response can be used locally for processing, or, optionally, typed data or reference to typed data can be sent to service provider.
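The processing flow on this slide amounts to a dispatch decision: process locally if the type is known, otherwise resolve it and either use the definition or hand off to a service. The sketch below assumes an in-memory registry and handler table; the `process` function, its return labels, and all identifiers and URLs are hypothetical.

```python
# Hypothetical registry response for one type: a definition plus an optional service pointer.
TYPE_REGISTRY = {
    "example/csv-temps": {
        "definition": "CSV rows of timestamp,temperature",
        "service": "https://example.org/plot",  # placeholder service pointer
    },
    "example/notes": {"definition": "Free-text field notes"},
}

# Types this client already knows how to process locally.
LOCAL_HANDLERS = {
    "example/plain-text": lambda payload: payload.upper(),
}


def process(type_id: str, payload: str) -> tuple:
    """Process locally when possible; otherwise resolve the type and choose a route."""
    handler = LOCAL_HANDLERS.get(type_id)
    if handler is not None:
        return ("local", handler(payload))
    record = TYPE_REGISTRY.get(type_id)  # resolve the unknown type
    if record is None:
        return ("unresolved", None)
    if record.get("service"):
        # A real client would send the data, or a reference to it, to this service.
        return ("remote", record["service"])
    return ("definition-only", record["definition"])


print(process("example/csv-temps", "2014-04-25T00:00Z,11.5"))
```

Preferring the service pointer whenever one exists is a choice made for the sketch; a client could equally use the returned definition to process the data itself.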
Deep Carbon Observatory Data Science and Data Management Infrastructure Overview
Global research program to transform our understanding of carbon in Earth
• Community of scientists (biologists, physicists, geoscientists, chemists, and many others) whose work crosses these disciplinary lines, forging a new, integrative field of deep carbon science
• 10-year initiative to intensify global attention and scientific effort in the burgeoning field of deep carbon science
• DCO infrastructure includes: public engagement and education, online and offline community support, innovative data management, and novel instrumentation
deepcarbon.net
• Alfred P. Sloan Foundation pledged $50 million over the duration to fund: infrastructure development, scientific workshops, novel technology development, and preliminary research and fieldwork.
• “Seed funding” awarded to catalyze collaborative scientific efforts around the world, increase public and private sector spending in deep carbon science, and leave a thriving community of international scientists as its legacy.
• DCO will synthesize 10 years of scientific research to generate unique and unprecedented views of Earth, looking at both scientific and human societal issues through a new, sharper lens.
DCO-Data Science World View: Everything is a first-class (science) object
Entry point for DCO object registration and deposit
RDA Brings Together DCO & DTR
• Benefits to DTR
  • DCO brought the data acquisition use case – no one else thought of it
  • DCO as early adopter will benefit testing and use of RDA result
• Benefits to DCO
  • Needed facility specified and prototyped with DCO use case in mind
  • Turn-key DTR will be available to DCO
  • DCO data science approaches and accomplishments presented to wide multi-disciplinary audience
• Benefits to Sloan
  • Two funded projects each augmented through interaction in RDA
Types and the Handle System
• Typing makes sense of data, which is just bits
• Handles resolve to type/value pairs – all other functions reside in the applications
• Handles identify digital entities which are implicitly or explicitly typed
• So – to develop Handle-based applications
  • Must understand the types of returned values
  • Will at some point need to understand the downstream data identified by handles
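Because a handle resolves to type/value pairs and leaves everything else to the application, client code largely consists of selecting the values whose types it understands. A minimal sketch, assuming resolution has already happened and produced a list of pairs: the values shown, the `TYPE_REGISTRY_ID` and `LICENSE` type names, and both helper functions are hypothetical, with `URL` and `EMAIL` used as familiar type names.

```python
from collections import defaultdict

# Hypothetical resolution result for one handle: a list of (type, value) pairs.
RESOLVED = [
    ("URL", "https://example.org/datasets/42"),
    ("EMAIL", "curator@example.org"),
    ("TYPE_REGISTRY_ID", "example/weather-obs"),  # illustrative type name only
]


def group_by_type(pairs):
    """Index resolved values by type so the application can pick what it understands."""
    grouped = defaultdict(list)
    for value_type, value in pairs:
        grouped[value_type].append(value)
    return dict(grouped)


def first_value(pairs, value_type, default=None):
    """Return the first value of the requested type, or default if the type is absent."""
    return group_by_type(pairs).get(value_type, [default])[0]


print(first_value(RESOLVED, "URL"))
print(first_value(RESOLVED, "LICENSE", default="unknown"))
```

An application that does not recognize a type can ignore it or look it up in a type registry, which is where the two systems meet.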
What do Data Type Records contain?
• Data type records contain
  • textual description for human understanding
  • provenance information (who created when and what)
• Records could contain
  • structured metadata about types for machines to process
  • encoding information (think file formats)
  • service information (think APIs to systems or applications that can process typed data)
  • semantic information (think description or predicate logic, useful for reasoning)
• Records do not enforce or define new ways to describe or represent data structures, but rely on existing frameworks and technologies
  • File formats (mime types), etc., may be used for describing encoding information
  • WSDL, REST APIs, etc., may be used for describing service information
  • OWL, KIF, etc., may be used for representing semantics and knowledge
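The split between what a record contains and what it could contain maps naturally onto required and optional fields. The sketch below is one possible shape, not the proposed data model: the `DataTypeRecord` class, its field names, and the JSON serialization are assumptions for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataTypeRecord:
    # Always present: human-readable description and provenance.
    description: str
    created_by: str
    created_on: str
    # Optional parts; each refers to an existing framework rather than defining a new one.
    encoding: str | None = None                         # e.g. a MIME type
    services: list[str] = field(default_factory=list)   # e.g. REST endpoint URLs
    semantics: str | None = None                        # e.g. a reference to an OWL class

    def __post_init__(self) -> None:
        # Reject records that omit the mandatory description or provenance.
        for name in ("description", "created_by", "created_on"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def to_json(self) -> str:
        # Serialize with sorted keys so output is stable for machine consumers.
        return json.dumps(asdict(self), sort_keys=True)


record = DataTypeRecord(
    description="Daily mean air temperature in degrees Celsius",
    created_by="example-user",
    created_on="2014-04-25",
    encoding="text/csv",
)
print(record.to_json())
```

Keeping encoding, service, and semantic information as plain references is meant to mirror the slide's point that records rely on existing frameworks.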
Proposed Data Type Data Model
Proposed Use of Data Types
• Multiple type registries will be deployed; perhaps one per community
• Type registries federate across each other; local policies may restrict (the scope of) such federation
• Users register data structures within a type registry and acquire a unique, persistent identifier (data type)
• Data type identifiers are then associated with corresponding data
• Registered type records are additionally disseminated by type registries as Linked Data compatible outputs
• General Guidelines
  • Users decide what data structures to register or not. If a data structure is expected to play a global role, then users are encouraged to register that data structure
  • Users are encouraged to first search if the data structure is registered prior to registering to avoid duplicates
  • Users decide the encoding, service, and semantic technology or framework that best suits them
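The guideline to search before registering, combined with federation that local policy may restrict, can be sketched as follows. This is a simplification under stated assumptions: the `Registry` class, the boolean `federates` policy flag, the identifier scheme, and matching data structures by name are all hypothetical.

```python
import uuid


class Registry:
    """Hypothetical community registry holding name -> identifier mappings."""

    def __init__(self, community: str, federates: bool = True) -> None:
        self.community = community
        self.federates = federates  # local policy: answer queries from peers?
        self.types = {}

    def find(self, name: str):
        # Return the identifier registered for this name, or None.
        return self.types.get(name)

    def add(self, name: str) -> str:
        # Mint a new identifier in this community's namespace (illustrative scheme).
        identifier = f"{self.community}/{uuid.uuid4().hex}"
        self.types[name] = identifier
        return identifier


def register_once(name: str, home: Registry, peers: list) -> str:
    """Search the home registry and federating peers before minting a new identifier."""
    for registry in [home] + [peer for peer in peers if peer.federates]:
        existing = registry.find(name)
        if existing is not None:
            return existing
    return home.add(name)


geo = Registry("geo")
bio = Registry("bio")
closed = Registry("closed", federates=False)
first = bio.add("temperature-celsius")
print(register_once("temperature-celsius", geo, [bio, closed]) == first)
```

Types held only by a non-federating registry stay invisible to the search, so a duplicate can still be minted elsewhere; that is the trade-off a restrictive local policy accepts.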