Explore how DataONE bridges diverse data repositories and enables new science through its core cyberinfrastructure. Learn about its components, functionality, and the DataONE Federation's Investigator Toolkit.
DataONE Cyberinfrastructure Overview • USGS Workshop, April 2012
Data deluge • Sensors, sensor networks, and remote sensing gather observations • Data management and stewardship (Photo courtesy of www.carboafrica.net)
The long tail of orphan data • “Most of the bytes are at the high end, but most of the datasets are at the low end” (Jim Gray) • [Figure: volume vs. rank frequency of datatype, labeling specialized repositories (e.g. GenBank, PDB) and orphan data (B. Heidorn)]
Data entropy • [Figure: information content vs. time, with markers for time of publication, specific details, general details, retirement or career change, accident, and death] (Michener et al. 1997)
DataONE vision and approach • Enable new science and knowledge creation through easy access to data about life on earth and the environment that sustains it, plus access to key tools • Build on existing cyberinfrastructure • Create new cyberinfrastructure • Support communities of practice
Data Access Platform • DataONE offers a platform that bridges across heterogeneous existing and new repositories to provide consistent, reliable access to diverse data • Operates core services necessary to maintain platform consistency • Existing tools and techniques modified to work with a single common data layer instead of a multitude
Enabling Functionality Fundamentals of the core cyberinfrastructure: • Identifiers • Preservation • Identity • Discovery
DataONE Cyberinfrastructure Three major components for a flexible, scalable, sustainable network • Member Nodes: diverse institutions; serve local community; provide resources for managing their data; retain copies of data • Coordinating Nodes: retain complete metadata catalog; indexing for search; network-wide services; ensure content availability (preservation); replication services • Investigator Toolkit
Three Major Components • Investigator Toolkit: Client Libraries (Java, Python); Command Line; Web Interface; Analysis, Visualization; Data Management • Member Nodes: Service Interfaces (Tier 1 – Read only, Public; Tier 2 – Read only, Auth-z; Tier 3 – Read, Write; Tier 4 – Replication target); Service Bridge; Data Repository • Coordinating Nodes: Service Interfaces (Resolution, Discovery, Replication, Registration, Identifiers, Catalog, Preservation, Monitor, Auth-z, Identity); Object Store; Index
1. Coordinating Nodes • Object tracking and replica management • High availability • Performance • Scalable architecture • Java on Tomcat, Hazelcast, SOLR • Leveraging Metacat
2. Member Nodes • Data storage • Data access • Access control • Replication • Metadata quality • Primary user interaction
Member Node Functional Tiers • Tier 1: Read only, public content: ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed() • Tier 2: Read only, with access control: isAuthorized(), systemMetadataChanged() • Tier 3: Read/Write using client tools: create(), update(), delete() • Tier 4: Able to operate as a replication target: replicate(), getReplica() • API reference: http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html
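The tier list above can be read as a lookup table. The following is a minimal Python sketch of that reading, not DataONE code: the method names are copied from the slide, while the helper names `methods_for_tier` and `minimum_tier` are hypothetical, and the rule that a Tier N node also offers every lower tier is an assumption the slide implies but does not state.

```python
# Method names come from the tier list above. The cumulative rule (a Tier N
# node also exposes every lower tier) is an assumption, not a stated fact.
TIER_METHODS = {
    1: ["ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"],
    2: ["isAuthorized", "systemMetadataChanged"],
    3: ["create", "update", "delete"],
    4: ["replicate", "getReplica"],
}


def methods_for_tier(tier):
    """Return every method a node at `tier` would expose (hypothetical helper)."""
    if tier not in TIER_METHODS:
        raise ValueError(f"unknown tier: {tier}")
    methods = []
    for level in range(1, tier + 1):
        methods.extend(TIER_METHODS[level])
    return methods


def minimum_tier(method):
    """Return the lowest tier whose list contains `method` (hypothetical helper)."""
    for level in sorted(TIER_METHODS):
        if method in TIER_METHODS[level]:
            return level
    raise KeyError(method)
```

A client could use such a table to decide, for example, that a node must be at least Tier 3 before a `create()` call is attempted.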
3. The Investigator Toolkit • Developer, end-user tools • Creation, search, retrieval, management • Plugins, extensions for analysis tools (Kepler is shown in the diagram)
Investigator Toolkit Activities • DMP-Tool • Kepler
Libraries and CLI • Client libraries available in Java and Python (+bash) • Low-level direct interaction with service endpoints • Higher level abstraction (e.g. data packages) • Command Line Client (CLI) • Interact with DataONE platform from command line • One-shot or command shell operation • Intended for developers or “technical” users
Using the DataONE R Client
# Initialize client object
d1 <- D1Client()
# Resolve, download, and convert data
dataPackage <- getD1Object(d1, "erd.362.1")
erd.train.locs <- asDataFrame(dataPackage, 1)
# Store model results on Member Node
d1object <- createD1Object(d1, dataId, doc_char, format, mn_nodeid)
d1object$create()
d1object$setPublicAccess()
ONE Mercury • Data discovery tool • Enables search and retrieval of content indexed by DataONE • Primary web based user interface for DataONE • Operates on each Coordinating Node • Same SOLR / Lucene index is utilized by other client tools
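Because other client tools share the same SOLR index, a search can be thought of as a plain HTTP query. The sketch below only builds such a query URL and sends nothing. The host, path, and field names (`id`, `title`) are hypothetical placeholders; only `q`, `rows`, `fl`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real Coordinating Node search URL may differ.
SOLR_SELECT_URL = "https://cn.example.org/solr/select"


def build_search_url(text, rows=10, fields=("id", "title")):
    """Build a Solr select URL for a free-text query.

    `fields` names are placeholders; a real index defines its own schema.
    """
    params = {
        "q": text,                # query string
        "rows": rows,             # maximum number of hits to return
        "fl": ",".join(fields),   # fields to include in each hit
        "wt": "json",             # response format
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"
```

Building the URL separately from sending it keeps the query logic easy to test without network access.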
ONE Mercury Architecture • Single portal • Numerous search capabilities • Search sharing functions (RSS, Web Services) • Metadata has link to data, which reside at Data Center • [Diagram: metadata extraction from Data Centers / Member Nodes into an internal metadata index stored at the Coordinating Node; sources shown with their metadata formats: Catalog for Earth Observations, NBII (BDP), LTER (EML), NCEAS (EML), ORNL DAAC (FGDC), OBFS (EML), IAI (FGDC), LP DAAC (DIF), LBA (FGDC), I-LTER (EML), TERN (EML), SAEON (EML)]
Others • ONEDrive • Excel add-in • Morpho metadata editor • Workflow tools
Technical activities • DataONE Architecture • USGS data contributions through Member Node(s) • USGS data access through DataONE • Investigator Toolkit access, plug-ins • Data Management Planning Tool • EzID DOI Service • USGS technology leveraging (e.g. Clearinghouse) • USGS Metadata leveraging (e.g. training, tools)
Investigator Toolkit Development • Work in more robust metadata options for USGS CDI Tools • Implement FuseFS & Dokan (network file system) to enable DataONE Toolkit interaction with loaded datasets • Contribute ScienceBase components as plugins to DataONE • Expose ArcGIS to DataONE APIs for data access, contributions
DataONE Drive – Windows implementation • Concept: Virtual drive integrated into the OS that lets scientists access and deposit data • DataONE has developed a Mac/Linux implementation • USGS is working on a Windows implementation
Component Communications • HTTPS • Representational State Transfer (REST) end points • XML encoded messages • Message structures defined by schema • [Diagram: Investigator Tools, Member Nodes (MN), and Coordinating Nodes (CN) exchanging messages]
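The communication pattern above (HTTPS, REST end points, XML messages) can be illustrated with the Python standard library. This is a sketch under stated assumptions: the base URL is a placeholder, the `/object/{identifier}` path is assumed rather than taken from the slides, and the XML message is a made-up simplification of whatever the real schema defines.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

# Hypothetical base URL; a real Member Node publishes its own.
BASE_URL = "https://mn.example.org/mn/v1"


def object_url(base_url, pid):
    """Build the URL for one object (the /object/ path is an assumption).

    The identifier is percent-encoded as a single path segment, so
    characters such as '/' or ':' in it cannot alter the path.
    """
    return f"{base_url.rstrip('/')}/object/{quote(pid, safe='')}"


# Simplified, made-up message; real structures are defined by schema.
SAMPLE_RESPONSE = """<objectInfo>
  <identifier>erd.362.1</identifier>
  <size>1024</size>
</objectInfo>"""


def parse_object_info(xml_text):
    """Extract the identifier and size from the sample message layout."""
    root = ET.fromstring(xml_text)
    return {
        "identifier": root.findtext("identifier"),
        "size": int(root.findtext("size")),
    }
```

Encoding the identifier matters because identifiers such as DOIs often contain slashes and colons.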
Application Programming Interfaces • Coordinating Node: Core, Read, Authorization, Identity, Replication, Register • Member Node: Core, Read, Authorization, Storage, Replication
Data Model • Package: one ResourceMap (OAI-ORE, RDF) relating n ScienceMetadata objects (XML documents: ISO19115, EML, FGDC, …) and n Data objects (any data object) • Each ScienceMetadata, Data, and ResourceMap object has exactly one SystemMetadata record (1:1)
Access Control • [Diagram: Investigator Tools create, retrieve, manage, and discover content on the DataONE Platform • Member Nodes (MN) take part in replication and synchronization • Coordinating Nodes (CN) provide indexing, authentication, identity, and CN replication • Identity providers connect through CILogon]
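One way to read the data model is as a small set of record types. The Python sketch below is an illustration only: the class and field names (`D1Object`, `format_id`, `identifiers`) are hypothetical, and the cardinalities (one system metadata record per object, any number of science metadata and data objects per package) are an interpretation of the diagram.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SystemMetadata:
    identifier: str
    format_id: str  # hypothetical field, e.g. "EML" or "text/csv"


@dataclass
class D1Object:
    """Any stored object, paired with exactly one SystemMetadata record."""
    content: bytes
    sysmeta: SystemMetadata


@dataclass
class Package:
    """A resource map plus the science metadata and data it relates."""
    resource_map: D1Object
    science_metadata: List[D1Object] = field(default_factory=list)
    data: List[D1Object] = field(default_factory=list)

    def identifiers(self):
        """List the identifiers of every object in the package."""
        members = [self.resource_map, *self.science_metadata, *self.data]
        return [obj.sysmeta.identifier for obj in members]


def make_object(identifier, format_id, content=b""):
    """Convenience constructor for the sketch (hypothetical helper)."""
    return D1Object(content, SystemMetadata(identifier, format_id))
```

Keeping system metadata on every object, including the resource map itself, mirrors the 1:1 pairing shown on the slide.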
Targets for Initial Public Release • Operational core infrastructure • Three coordinating nodes: ORC, UCSB, UNM • Eight member nodes: KNB, SANParks, Avian Knowledge Network, Dryad, ORNL DAAC, Merritt, USGS, PISCO, LTER • Essential investigator toolkit components: Search interface (ONE Mercury); ONE R-plugin; Developer tools in Python and Java • Design and component documentation
DataONE Team and Sponsors • Ewa Deelman • Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Mark Servilla • Peter Honeyman • Dave Vieglais • Suzie Allard, Carol Tenopir, Maribeth Manoff, Robert Waltz, Bruce Wilson • Jeff Horsburgh • John Cobb, Bob Cook, Giri Palanisamy, Line Pouchard • Bertram Ludaescher • Robert Sandusky • Patricia Cruse, John Kunze • Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly • Peter Buneman • Chad Berkley, Stephanie Hampton, Matt Jones • David DeRoure • Paul Allen, Rick Bonney, Steve Kelling • Carole Goble • Ryan Scherle, Todd Vision • Donald Hobern • Randy Butler • Cliff Duke • Leon Levy Foundation
Resources • Architecture Docs: http://mule1.dataone.org/ArchitectureDocs-current • Operations Docs: http://mule1.dataone.org/OperationDocs/ • Component Docs: Distributed with component • Source Code Repository: https://repository.dataone.org/ • Community: DUG (DataONE Users Group), Developers, CCIT, Working Groups