Status report from the TWG/CCIT to the CEWG

2009-08-12, Dave Vieglais and Ryan Scherle

TWG Overview
• Activity
  • Two meetings
  • Weekly (or so) telecon
  • Time contributions by Duane and Mark
• Significant outcomes thus far
  • Project infrastructure (Plone sites, svn)
  • Use cases, interactions, requirements
  • Discussions, especially Identifiers and Identity
  • Student projects

Architecture
• Process (Meta-architecture > Conceptual A. > Logical A.)
  • Use cases
  • Functional requirements
  • Interfaces and interactions
• Prototyping
  • Core pieces
  • Fluff
• Iterative process (somewhat)
  • Identify and resolve issues at all stages
  • Limited resources, so important to get design right

Use Cases
• Identified major categories and obvious use cases early
• Subsequently expanded to 34 or so
• Diagrams developed to illustrate interactions for each use case
• Capture desired system functional requirements
• APIs identified, getting more stable

Use Case Issues
• UC2 "Get list of GUIDs from metadata search"
  • Can queries be done at MN with equivalent results?
  • Where is result filtering based on access privileges performed?
  • Authentication issue: if a search spans many nodes, where is identity resolved?
• UC3 "Registration of a new member node"
  • Should new nodes be registered with specified trust levels?

Use Case Issues (2)
• UC4,5 "Create/Update/Delete metadata record in Member Node."
  • What is the policy on archival copies of data and metadata? (Can data packages be deleted? Published packages modified?)
• UC12 "User Authentication - Person via client software authenticates against Identity Provider to establish session token."
  • Where is identity stored? MN? CN? Combination of all?

Use Case Issues (3)
• UC24 "Transactions - CNs and MNs should support transaction sets where operations all complete successfully or get rolled back (e.g., upload both data and metadata records)."
  • Do transactions span multiple MNs, CNs?
• UC27 "CN should support forward migration of metadata documents from one version to another within a standard and to other standards."
  • 20+ metadata standards
  • How to handle lossy conversions?
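
The all-or-nothing behaviour described in UC24 can be sketched for the single-node case as follows. This is a minimal illustration, not a DataONE API: `InMemoryNode` and `upload_package` are hypothetical names, the node is an in-memory stand-in, and the sketch assumes every identifier in the package is new (rolling back an update would also need the previous content).

```python
class InMemoryNode:
    """Hypothetical stand-in for a Member Node object store."""

    def __init__(self):
        self.objects = {}  # guid -> content

    def put(self, guid, content):
        if content is None:
            raise ValueError(f"no content supplied for {guid}")
        self.objects[guid] = content

    def delete(self, guid):
        self.objects.pop(guid, None)


def upload_package(node, records):
    """Store every (guid, content) pair on the node, or none of them."""
    stored = []
    try:
        for guid, content in records:
            node.put(guid, content)
            stored.append(guid)
    except Exception:
        # Roll back: remove everything written so far, newest first,
        # then re-raise so the caller sees the original failure.
        for guid in reversed(stored):
            node.delete(guid)
        raise
    return stored
```

A transaction that spans several MNs or CNs, the open question above, would need a coordination protocol on top of this; the sketch does not address that.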

Use Case Issues (4)
• UC28 "Relationships/Versioning - Derived products should be linked to source objects so that notifications can be made to users of derived products when source products change."
  • Who asserts these relationships? How are relationships managed?
• UC31 "Manage Access Policies - Client can specify access restrictions for their data and metadata objects. Also supports release time embargoes."
  • Group management has an important, perhaps unusual, temporal component.
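
The temporal component of UC31 can be illustrated with a small access check that takes the current time as an input. This is a sketch under stated assumptions: `AccessPolicy`, `can_read`, the `"public"` reader marker, and the rule that only the owner can read during an embargo are all hypothetical choices, not decisions recorded in the use case.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class AccessPolicy:
    """Hypothetical access policy for one data or metadata object."""
    owner: str
    readers: Set[str] = field(default_factory=set)
    embargo_until: Optional[datetime] = None  # None means no embargo


def can_read(policy, user, now):
    """Return True if `user` may read the object at time `now`."""
    if user == policy.owner:
        return True
    # Assumption: while the embargo is active, only the owner has access.
    if policy.embargo_until is not None and now < policy.embargo_until:
        return False
    return "public" in policy.readers or user in policy.readers
```

Passing `now` explicitly keeps the check deterministic and makes the time dependence of the decision visible.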

Coordinating Node Requirements
• CNs play a central role in the infrastructure ∴ critical to identify functional and non-functional requirements early
• Non-exhaustive list of 21 requirements (so far)
• e.g.:
  • “Coordinating Node services should be designed to be independently scalable.”
  • “Data packages are not discoverable through any public interface until all Coordinating Nodes have confirmed that they have a copy of the corresponding metadata document.”
  • “Metadata searches should return in a maximum of “xxx” seconds.”
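
The second example requirement, that a package stays hidden until every Coordinating Node has confirmed its copy of the metadata, can be sketched as a simple confirmation gate. The class and method names here are hypothetical, and the sketch assumes the set of CNs is fixed and known in advance.

```python
class DiscoveryGate:
    """Hypothetical tracker of which CNs hold a copy of each metadata document."""

    def __init__(self, coordinating_nodes):
        self.coordinating_nodes = frozenset(coordinating_nodes)
        if not self.coordinating_nodes:
            raise ValueError("at least one coordinating node is required")
        self.confirmations = {}  # guid -> set of CN ids that have confirmed

    def confirm(self, guid, cn_id):
        """Record that `cn_id` holds the metadata document for `guid`."""
        if cn_id not in self.coordinating_nodes:
            raise KeyError(f"unknown coordinating node: {cn_id}")
        self.confirmations.setdefault(guid, set()).add(cn_id)

    def is_discoverable(self, guid):
        """True only once every known CN has confirmed its copy."""
        return self.confirmations.get(guid, set()) == self.coordinating_nodes
```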

General conclusions
The member nodes come with a diverse set of technologies and practices. The coordinating nodes will need to be very permissive while providing quality services.
History/versioning:
• Keep all versions of metadata, so we can see where it came from (and metadata doesn't take much storage)
• The original data package should always be stored. Transformed versions may be needed for some operations of the coordinating nodes.
• It may be too much of a burden to store all versions of a data file.
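
The "keep all versions of metadata" conclusion amounts to an append-only history per identifier. The following is an illustrative sketch only; `MetadataHistory` and its methods are hypothetical names, and storage is an in-memory dictionary.

```python
class MetadataHistory:
    """Hypothetical append-only store: versions are added, never overwritten."""

    def __init__(self):
        self._versions = {}  # guid -> list of documents, oldest first

    def add_version(self, guid, document):
        """Append a new version and return its 1-based version number."""
        versions = self._versions.setdefault(guid, [])
        versions.append(document)
        return len(versions)

    def latest(self, guid):
        """Return the most recent version; raises KeyError for unknown guids."""
        return self._versions[guid][-1]

    def history(self, guid):
        """Return all versions, oldest first, as an immutable tuple."""
        return tuple(self._versions.get(guid, ()))
```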

Identity, Authentication, Authorization
• MN & CN security services necessary to
  • preserve and verify integrity of data packages (in D1)
  • prevent malicious intent or inappropriate access
• Six identity / security models in industry:
  • Centralized (LDAP)
  • Distributed directories (LDAP + referrals)
  • Distributed management and replication (LDAP + replication)
  • Grid Security Infrastructure proxy certificates
  • OpenID
  • Shibboleth + InCommon

Identities
Types of users:
• non-authenticated user
• registered user (at member node)
• registered user (DataONE central)
• group member
• site manager (for harvests, system operations, etc.)
• change request approval workflow
• owner of intellectual property rights
Privileges:
• access/modify both data and metadata
• Member Node Write
• create/execute system functions
• access logged information

Metadata Standards
• 20 or so relevant standards: DC, DwC, EML, CSDGM, GCMD-DIF, ISO 19137:2007, NeXML, WaterML, GenBank-FFF, ISO 19115, GML, CDF, DDI, GEML, ESML, CSR, ESG, ECHO, ...
• Conversion between standards is a lossy process
• Issues of compatibility in metadata storage across MNs
• Original metadata will be stored unchanged
• Need to define the metadata standard that will be used to support search and discovery operations (CN)
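
Why conversion is lossy, and why the original is kept unchanged, can be shown with a toy crosswalk: any source field with no counterpart in the target standard is dropped. The field names below are hypothetical and are not taken from any of the standards listed; `convert` is an illustrative helper, not an existing tool.

```python
def convert(record, crosswalk):
    """Map source fields to target fields and report what could not be carried over."""
    converted = {}
    dropped = []
    for source_field, value in record.items():
        target_field = crosswalk.get(source_field)
        if target_field is None:
            # No equivalent in the target standard: this information is lost.
            dropped.append(source_field)
        else:
            converted[target_field] = value
    return converted, sorted(dropped)


# Hypothetical field names, for illustration only.
CROSSWALK = {"dataset_title": "title", "originator": "creator"}
```

Reporting the dropped fields alongside the result makes the loss explicit, so a search index built from the converted record can still point back to the unchanged original.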

Search Terms
• Principal Investigator/Author
• Keywords
• Key Concept (drawn from ontologies)
• Spatial bounding box
• Spatial window (series of spatial envelopes)
• Named places
• Temporal window
• Abstract / full text
• Title
• Data format
• Scientific variables
• Subject domain
• Biological taxonomic extents
• Associated publications
• Data source
• Related data
• Data quality
• Organization domains
• Size of data
• Number, location of replicas
• Data dimensionality
• Scientific units
• Globally unique identifier
• Object permissions

Identifiers
• Fundamental component of entire architecture
• Many schemes (handle, LSID, PURL, ...), each with advantages and faults
• Not practical for DataONE to dictate single identifier scheme across all Member Nodes
• Feasible to require that identifiers are unique across all participating MNs
• However, not feasible to assume that all MNs will support all identifier schemes
• Key question: Must an identifier always resolve to the same sequence of bits? Or should it be more abstract?
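
Requiring uniqueness across MNs without dictating a scheme suggests treating identifiers as opaque strings and checking only for collisions. A minimal sketch under that assumption follows; `IdentifierRegistry` is a hypothetical name and the example identifiers are placeholders.

```python
class IdentifierRegistry:
    """Hypothetical registry enforcing uniqueness across Member Nodes."""

    def __init__(self):
        self._owners = {}  # identifier -> id of the Member Node that registered it

    def register(self, identifier, member_node):
        """Claim an identifier for a Member Node.

        Identifiers are compared as opaque strings, so any scheme is accepted.
        Re-registering by the same node is a no-op; a different node is rejected.
        """
        owner = self._owners.setdefault(identifier, member_node)
        if owner != member_node:
            raise ValueError(f"{identifier!r} is already registered by {owner}")
        return identifier

    def owner_of(self, identifier):
        """Return the owning Member Node id, or None if unregistered."""
        return self._owners.get(identifier)
```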

Prototypes
By November 2009 meeting (hmm...):
• Member Node contributes metadata to Coordinating Node using GUID
• CN initiates replication of data object from MN to MN
• Logging for instrumentation and usage
• Update data object (revision) by Member Node
Other targets, in order of importance:
• Replication of metadata and system information between CNs
• Failover and load balancing between CNs
• Formalize all service API specs using a language-agnostic IDL
• Comparison and evaluation of existing systems/standards/protocols used by prototype implementations
• Authentication and authorization using LDAP (initial impl.)
• Search portal user interface using Coordinating Node metadata content
• Heartbeat/state of health services
• Registry services using, perhaps, a simple list as an initial method
• Stress and load testing

Current activities
• Wrapping up this year’s student internships.
• Addressing the general questions arising out of the use case diagrams (some of these questions will be discussed at the coordination meeting)
• Developing a report on identifier usage.
• Creating APIs to be used in prototypes.

Hurdles
• Resources & Contributors
• Identity, authentication, authorization
• Identifiers
• Rules for data handling and archive (what is data?)
• Metadata extraction
• CN replication

Feedback from CEWG
What is the vision for access management to DataONE, and how much of that will be left up to member nodes?
• Answer: Data providers must "establish trust" to publish/modify content.
• What does "establish trust" entail? Is there a technical component?
• Who are “data providers”? The member nodes or the end users?

Open Questions
• What policies should we have for managing DataONE documents?
• What properties should we enforce regarding identifiers?
• What are the minimum requirements for a member node to join the DataONE community? Or, how accommodating should we be?
• Can we identify some member nodes that will implement all best practices and serve as models for the other member nodes?
• How much data should we expect to handle? It is unclear what the uptake curve will be, but this has major implications for our architectural planning.
• Do we want/need a registry of name spaces for identifiers?
• Is it reasonable to store replicas using the ID scheme of the secondary member node, as long as the coordinating nodes are capable of resolving the original identifier to the correct location?
• What types of access control should be allowed?
• What time constraints are we under?
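
To make the replica question concrete, this is roughly what a Coordinating Node would have to maintain if replicas were stored under the secondary node's own ID scheme: a mapping from the original identifier to each holding node and the local ID used there. It is a sketch of one possible answer, with hypothetical names throughout, not a proposed design.

```python
class ReplicaResolver:
    """Hypothetical CN-side lookup from an original identifier to replica locations."""

    def __init__(self):
        self._locations = {}  # original id -> list of (member node, local id)

    def add_location(self, original_id, member_node, local_id=None):
        """Record that `member_node` holds the object, optionally under its own local ID."""
        entry = (member_node, local_id if local_id is not None else original_id)
        self._locations.setdefault(original_id, []).append(entry)

    def resolve(self, original_id):
        """Return every known (member node, local id) pair; empty list if unknown."""
        return list(self._locations.get(original_id, []))
```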

Open Questions (2)
• Can the CEWG produce some science-oriented use cases that augment our current technical cases?
• Will member nodes be willing to use central DataONE services and/or create adapters that allow their services to communicate with DataONE?
• Are there technologies that are widely used across the member node community? If so, these would be promising targets, as we could create a small number of adapters that could be used for a large number of member nodes.
• What are the high-value member nodes, for which we must provide custom adapters?
