
Status report from the TWG/CCIT to the CEWG

Status report from the TWG/CCIT to the CEWG. 2009-08-12 Dave Vieglais and Ryan Scherle. TWG Overview. Activity Two Meetings Weekly (or so) telecon Time contributions by Duane and Mark Significant outcomes thus far Project infrastructure (plone sites, svn)


Presentation Transcript


  1. Status report from the TWG/CCIT to the CEWG 2009-08-12 Dave Vieglais and Ryan Scherle

  2. TWG Overview • Activity • Two Meetings • Weekly (or so) telecon • Time contributions by Duane and Mark • Significant outcomes thus far • Project infrastructure (plone sites, svn) • Use cases, interactions, requirements • Discussions, especially Identifiers and Identity • Student projects

  3. Architecture • Process (Meta-architecture > Conceptual A. > Logical A.) • Use Cases • Functional requirements • Interfaces and interactions • Prototyping • Core pieces • Fluff • Iterative process (somewhat) • Identify and resolve issues at all stages • Limited resources - so important to get design right

  4. Use Cases • Identified major categories and obvious use cases early • Subsequently expanded to 34 or so • Diagrams developed to illustrate interactions for each use case • Capture desired system functional requirements • APIs identified, getting more stable

  5. Use Case Issues • UC2 "Get list of GUIDs from metadata search" • Can queries be done at MN with equivalent results? • Where is result filtering based on access privileges performed? • Authentication issue - if a search runs across many nodes, where is identity resolved? • UC3 "Registration of a new member node" • Should new nodes be registered with specified trust levels?

  6. Use Case Issues (2) • UC4,5 "Create/Update/Delete metadata record in Member Node." • What is the policy on archival copies of data and metadata? (Can data packages be deleted? Published packages modified?) • UC12 "User Authentication - Person via client software authenticates against Identity Provider to establish session token." • Where is identity stored? MN? CN? Combination of all?

  7. Use Case Issues (3) • UC24 "Transactions - CNs and MNs should support transaction sets where operations all complete successfully or get rolled back (e.g., upload both data and metadata records)." • Do transactions span multiple MNs, CNs? • UC27 "CN should support forward migration of metadata documents from one version to another within a standard and to other standards." • 20+ metadata standards • How to handle lossy conversions?
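The all-or-nothing behaviour described in UC24 can be sketched for the simplest case, a single node. This is a minimal illustration, not a DataONE API: `InMemoryNode`, `upload_package`, and the `fail_on` switch are hypothetical names, and the sketch assumes every identifier in the set is new, so rolling back is just deleting what was written. It does not address the open question of transactions that span multiple MNs or CNs.

```python
class InMemoryNode:
    """Hypothetical stand-in for a Member Node object store."""

    def __init__(self, fail_on=None):
        self.objects = {}
        self.fail_on = fail_on  # identifier whose write is made to fail

    def put(self, guid, content):
        if guid == self.fail_on:
            raise OSError(f"write failed for {guid}")
        self.objects[guid] = content

    def delete(self, guid):
        self.objects.pop(guid, None)


def upload_package(node, items):
    """Store every (guid, content) pair, or none of them.

    Assumes all identifiers are new, so undoing a write is a delete.
    """
    written = []
    try:
        for guid, content in items:
            node.put(guid, content)
            written.append(guid)
    except OSError:
        # Roll back in reverse order so no partial package remains.
        for guid in reversed(written):
            node.delete(guid)
        return False
    return True


package = [("data-1", b"1,2,3"), ("meta-1", "<metadata/>")]
healthy = InMemoryNode()
failing = InMemoryNode(fail_on="meta-1")
print(upload_package(healthy, package), sorted(healthy.objects))
print(upload_package(failing, package), sorted(failing.objects))
```

In the second call the data record is written first and then removed again when the metadata write fails, which is the rollback the use case asks for.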

  8. Use Case Issues (4) • UC28 "Relationships/Versioning - Derived products should be linked to source objects so that notifications can be made to users of derived products when source products change." • Who asserts these relationships? How are relationships managed? • UC31 "Manage Access Policies - Client can specify access restrictions for their data and metadata objects. Also supports release time embargoes." • Group management has an important, perhaps unusual temporal component.
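The temporal component of UC31 can be illustrated with a small access check that takes the current time as an input. This is a sketch under stated assumptions, not a proposed policy model: `AccessPolicy`, `can_read`, and the rule that only the owner can read during an embargo are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class AccessPolicy:
    """Hypothetical access policy for one data or metadata object."""
    owner: str
    readers: Set[str] = field(default_factory=set)
    embargo_until: Optional[datetime] = None  # None means no embargo


def can_read(policy, user, now):
    """Return True if `user` may read the object at time `now`."""
    if user == policy.owner:
        return True
    # Assumption: during an embargo only the owner has access.
    if policy.embargo_until is not None and now < policy.embargo_until:
        return False
    return user in policy.readers


release = datetime(2010, 1, 1, tzinfo=timezone.utc)
policy = AccessPolicy(owner="alice", readers={"bob"}, embargo_until=release)
before = datetime(2009, 8, 12, tzinfo=timezone.utc)
after = datetime(2010, 6, 1, tzinfo=timezone.utc)
print(can_read(policy, "bob", before), can_read(policy, "bob", after))
```

Passing `now` explicitly keeps the check deterministic, which makes time-dependent rules such as embargoes easier to test.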

  9. Coordinating Node Requirements • CNs provide a central role in infrastructure ∴ critical to identify functional and non-functional requirements early • Non-exhaustive list of 21 requirements (so far) • e.g.: • “Coordinating Node services should be designed to be independently scalable.” • “Data packages are not discoverable through any public interface until all Coordinating Nodes have confirmed that they have a copy of the corresponding metadata document.” • “Metadata searches should return in a maximum of “xxx” seconds.”
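The discoverability requirement quoted above amounts to a gate on the set of confirmations received. A minimal sketch, assuming confirmations are tracked as a mapping from identifier to the set of Coordinating Nodes that hold the metadata; `is_discoverable` and the node names are hypothetical:

```python
def is_discoverable(guid, confirmations, coordinating_nodes):
    """True only when every Coordinating Node has confirmed a metadata copy.

    `confirmations` maps an identifier to the set of node names that have
    confirmed it; `coordinating_nodes` lists all nodes that must confirm.
    """
    required = set(coordinating_nodes)
    confirmed = confirmations.get(guid, set())
    # With no Coordinating Nodes known, nothing is exposed.
    return bool(required) and required <= confirmed


nodes = ["cn-a", "cn-b", "cn-c"]
confirmations = {
    "pkg-1": {"cn-a", "cn-b", "cn-c"},
    "pkg-2": {"cn-a", "cn-b"},
}
print(is_discoverable("pkg-1", confirmations, nodes))
print(is_discoverable("pkg-2", confirmations, nodes))
```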

  10. General conclusions The member nodes come with a diverse set of technologies and practices. The coordinating nodes will need to be very permissive while providing quality services. History/versioning: • Keep all versions of metadata, so we can see where it came from (and metadata doesn't take much storage) • The original data package should always be stored. Transformed versions may be needed for some operations of the coordinating nodes. • It may be too much of a burden to store all versions of a data file.
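Keeping every version of a metadata document can be pictured as an append-only history per identifier. The sketch below is illustrative only; `MetadataHistory` and its methods are hypothetical names, and storage is an in-memory dictionary rather than anything a Coordinating Node would actually use.

```python
class MetadataHistory:
    """Hypothetical append-only store: versions are added, never replaced."""

    def __init__(self):
        self._versions = {}

    def add_version(self, guid, document):
        """Append a new version and return its 1-based version number."""
        history = self._versions.setdefault(guid, [])
        history.append(document)
        return len(history)

    def original(self, guid):
        """The document exactly as first submitted."""
        return self._versions[guid][0]

    def latest(self, guid):
        return self._versions[guid][-1]

    def all_versions(self, guid):
        # Return a copy so callers cannot alter the stored history.
        return list(self._versions[guid])


store = MetadataHistory()
store.add_version("meta-1", "<eml>draft</eml>")
store.add_version("meta-1", "<eml>revised</eml>")
print(store.original("meta-1"), store.latest("meta-1"))
```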

  11. Identity, Authentication, Authorization • MN & CN security services necessary to • preserve and verify integrity of data packages (in D1) • prevent malicious intent or inappropriate access • Six identity / security models in industry: • Centralized (LDAP) • Distributed directories (LDAP + referrals) • Distributed management and replication (LDAP + replication) • Grid Security Infrastructure proxy certificates • OpenID • Shibboleth + InCommon

  12. Identities Types of users: • non-authenticated user • registered user (at member node) • registered user (DataONE central) • group member • site manager (for harvests, system operations, etc.) • change request approval workflow • owner of intellectual property rights Privileges: • access/modify both data and metadata • Member Node Write • create/execute system functions • access logged information

  13. Metadata Standards • 20 or so relevant standards DC, DwC, EML, CSDGM, GCMD-DIF, ISO 19137:2007, NeXML, WaterML, Genbank-FFF, ISO 19115, GML, CDF, DDI, GEML, ESML, CSR, ESG, ECHO, ... • Conversion between standards is a lossy process • Issues of compatibility in metadata storage across MNs • Original metadata will be stored unchanged • Need to define metadata standard that will be used to support search and discovery operations (CN)

  14. Search Terms • Principal Investigator/Author • Keywords • Key Concept (drawn from ontologies) • Spatial bounding box • Spatial window (series of spatial envelopes) • Named places • Temporal window • Abstract / full text • Title • Data format • Scientific variables • Subject domain • Biological taxonomic extents • Associated publications • Data source • Related data • Data quality • Organization domains • Size of data • Number, location of replicas • Data dimensionality • Scientific units • Globally unique identifier • Object permissions

  15. Identifiers • Fundamental component of entire architecture • Many schemes (handle, LSID, PURL, ...), each with advantages and faults • Not practical for DataONE to dictate single identifier scheme across all Member Nodes • Feasible to require that identifiers are unique across all participating MNs • However, not feasible to assume that all MNs will support all identifier schemes • Key question: Must an identifier always resolve to the same sequence of bits? Or should it be more abstract?
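Requiring uniqueness without dictating a scheme suggests treating identifiers as opaque strings and checking only for collisions between nodes. A minimal sketch under that assumption; `IdentifierRegistry` is a hypothetical name and the identifiers in the usage lines are made up for illustration:

```python
class IdentifierRegistry:
    """Hypothetical registry that treats identifiers as opaque strings."""

    def __init__(self):
        self._owner = {}

    def register(self, identifier, member_node):
        """Record which Member Node issued an identifier.

        Raises ValueError if a different node already registered it.
        The identifier's scheme (handle, LSID, PURL, ...) is not inspected.
        """
        existing = self._owner.get(identifier)
        if existing is not None and existing != member_node:
            raise ValueError(
                f"{identifier!r} already registered by {existing}"
            )
        self._owner[identifier] = member_node

    def owner(self, identifier):
        return self._owner.get(identifier)


registry = IdentifierRegistry()
registry.register("hdl:example/1", "mn-one")                # made-up handle
registry.register("urn:lsid:example.org:obs:42", "mn-two")  # made-up LSID
try:
    registry.register("hdl:example/1", "mn-two")
except ValueError as err:
    print("rejected:", err)
```

This leaves the key question from the slide untouched: the registry says who issued an identifier, not whether it must always resolve to the same sequence of bits.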

  16. Prototypes By November 2009 meeting (hmm...): • Member Node contributes metadata to Coordinating Node using GUID • CN initiates replication of data object from MN to MN • Logging for instrumentation and usage • Update data object (revision) by Member Node Other targets, in order of importance: • Replication of metadata and system information between CNs • Failover and load balancing between CNs • Formalize all service API specs. using a language-agnostic IDL • Comparison and evaluation of existing systems/standards/protocols used by prototype implementations • Authentication and authorization using LDAP (initial impl.) • Search portal user interface using Coordinating Node metadata content • Heartbeat/state of health services • Registry services using, perhaps, a simple list as an initial method • Stress and load testing

  17. Current activities • Wrapping up this year’s student internships. • Addressing the general questions arising out of the use case diagrams (some of these questions will be discussed at the coordination meeting) • Developing a report on identifier usage. • Creating APIs to be used in prototypes.

  18. Hurdles • Resources & Contributors • Identity, authentication, authorization • Identifiers • Rules for data handling and archive (what is data?) • Metadata extraction • CN replication

  19. Feedback from CEWG What is the vision for access management to DataONE, and how much of that will be left up to member nodes? • Answer: Data providers must "establish trust" to publish/modify content. • What does "establish trust" entail? Is there a technical component? • Who are “data providers”? The member nodes or the end users?

  20. Open Questions • What policies should we have for managing DataONE documents? • What properties should we enforce regarding identifiers? • What are the minimum requirements for a member node to join the DataONE community? Or, how accommodating should we be? • Can we identify some member nodes that will implement all best practices and serve as models for the other member nodes? • How much data should we expect to handle? It is unclear what the uptake curve will be, but this has major implications for our architectural planning. • Do we want/need a registry of name spaces for identifiers? • Is it reasonable to store replicas using the ID scheme of the secondary member node, as long as the coordinating nodes are capable of resolving the original identifier to the correct location? • What types of access control should be allowed? • What time constraints are we under?

  21. Open Questions (2) • Can the CEWG produce some science-oriented use cases that augment our current technical cases? • Will member nodes be willing to use central DataONE services and/or create adapters that allow their services to communicate with DataONE? • Are there technologies that are widely used across the member node community? If so, these would be promising targets, as we could create a small number of adapters that could be used for a large number of member nodes. • What are the high-value member nodes, for which we must provide custom adapters?
