Data Grids, Digital Libraries, and Persistent Archives Reagan W. Moore San Diego Supercomputer Center http://www.npaci.edu/DICE moore@sdsc.edu
Archive Definition • Computer science - an archive is the hardware and software infrastructure used to manage data • Preservation community - an archive is the material that is being preserved
Persistent Archive • Software system that manages evolution of the hardware and software infrastructure • A persistent archive preserves the authenticity and integrity of digital entities while the underlying technology evolves • Combination of the material that is being preserved and the infrastructure used to preserve the material
Data Grid • Grid Community definition • The infrastructure used to manage distributed data as a collection • Digital library and preservation community definition • The distributed data that is being organized and managed as a collection • A data grid is a mechanism to support sharing of data and the collection that is being shared
Data Sharing • Management of access controls on local resources to share data • Put controls on resources • Creation of a collection that is being shared across distributed resources • Put controls on collection • The SRB data grid does both: it enforces controls on resources and on collections (data and metadata)
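The two levels of control on this slide can be pictured with a minimal sketch. It assumes, for illustration only, that a read succeeds when both the collection and the storage resource grant it; the class and function names are hypothetical and are not the SRB API.

```python
# Hypothetical sketch: names are illustrative and are not the SRB API.
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # Maps a user name to the set of operations that user may perform.
    grants: dict = field(default_factory=dict)

    def allow(self, user, op):
        self.grants.setdefault(user, set()).add(op)

    def permits(self, user, op):
        return op in self.grants.get(user, set())


def can_read(user, collection_policy, resource_policy):
    # Assumption: a read needs a grant on the collection AND on the
    # storage resource that holds the data.
    return (collection_policy.permits(user, "read")
            and resource_policy.permits(user, "read"))


collection = AccessPolicy()
resource = AccessPolicy()
collection.allow("alice", "read")
resource.allow("alice", "read")
collection.allow("bob", "read")  # bob has no grant on the resource
```

Here `can_read("alice", collection, resource)` is true, while the same call for `"bob"` is false because only one of the two policies grants the operation.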
Topics • Data Grids - managing distributed data • Distributed data management for a project • Digital Libraries - publication of data • Management of collection hierarchies • Persistent Archives - preservation of data • Management of technology evolution • Storage Resource Broker example • Currently supporting all three (seven) data management environments
Data Management Systems (Supported by Storage Resource Broker) • Data collecting • Sensor systems, object ring buffers and portals • Data organization • Collections, manage data context • Data sharing • Data grids, manage heterogeneity of resources • Data publication • Digital libraries, support discovery • Data preservation • Persistent archives, manage technology evolution • Data analysis • Processing pipelines, manage knowledge extraction
Data Management Systems • Data grid for managing distributed data • Latency management for bulk analyses of collections • Infrastructure independent name spaces for describing data, resources, users, and state information • Digital library for managing data context • Curation services for managing collections • Descriptive metadata for discovery • Persistent archive to manage technology evolution • Interoperability mechanisms between heterogeneous storage systems and user access mechanisms
Provide Context for Data • Properties of files • Provenance - source • Descriptive attributes • Structure • Organize properties as metadata in a collection hierarchy • Define operations on file properties • Manage state information - location, replicas, containers • Separate context management from content management • Maintain consistency of context as operations are done on content
Data Grids • Software systems that manage distributed data • Control global name spaces for • Resources • Users • Files • Metadata context • Provide standard operations on each name space • Provide single sign-on authentication, collection management, latency management, replication, and federation • Generic distributed data management technology
Managing Distributed Data Data Access Methods (Web Browser, DSpace, OAI-PMH) • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access constraints Naming conventions provided by storage systems
Data Grids Provide a Level of Indirection for Each Naming Convention Data Access Methods (C library, Unix, Web Browser) Data Collection • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access constraints • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints Data is organized as a collection
Logical Name Spaces • Storage resources • Logical names for managing collections of resources • User names (user-name / domain / data grid) • Distinguished names for users to manage access controls • Digital Entities (files, blobs, structured data, …) • Logical name space for global identifiers for files • Context - Metadata attributes • Standard metadata attributes, Dublin Core • State information resulting from data grid operations • User-defined metadata
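As a rough illustration of the logical file name space, the sketch below keeps a catalog that maps one global identifier to its physical replicas. The catalog class, paths, and resource names are hypothetical; the point is only that the logical name stays fixed while physical locations change.

```python
# Hypothetical sketch of a logical-to-physical mapping; not the MCAT schema.
class LogicalFileCatalog:
    def __init__(self):
        # logical name -> list of (resource name, physical path)
        self._replicas = {}

    def register(self, logical_name, resource, physical_path):
        self._replicas.setdefault(logical_name, []).append(
            (resource, physical_path))

    def resolve(self, logical_name):
        # Return a copy so callers cannot mutate catalog state.
        return list(self._replicas.get(logical_name, []))

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        # Replace the replica on old_resource; the logical name is unchanged.
        self._replicas[logical_name] = [
            (new_resource, new_path) if res == old_resource else (res, path)
            for res, path in self._replicas[logical_name]
        ]


catalog = LogicalFileCatalog()
catalog.register("/grid/project/run1.dat", "unix-disk", "/data/a/run1.dat")
catalog.register("/grid/project/run1.dat", "tape-archive", "/hpss/x/0001")
catalog.migrate("/grid/project/run1.dat", "unix-disk",
                "new-disk", "/vol2/run1.dat")
```

After the migration, users still ask for `/grid/project/run1.dat`; only the catalog entry behind that name has changed.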
Logical Resource Name • Represents a list of physical resources • Operations on the logical resource name result in operations on the list of physical resources • Load leveling - write to the next physical resource in the list • Fault tolerance - write to “k” of “n” physical resources • Replication - write to each physical resource • Compound resource - write to the disk cache in front of the tape archive • Federated resource - write to the controlled resource in another data grid
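Three of the write policies listed above can be sketched as target-selection rules over a list of physical resources. This is a simplified illustration under stated assumptions: the class name, the round-robin reading of "next physical resource", and the `is_up` health callback are all hypothetical.

```python
# Hypothetical sketch of write-target selection for a logical resource.
from itertools import cycle


class LogicalResource:
    def __init__(self, physical_resources):
        self.physical = list(physical_resources)
        self._next = cycle(self.physical)  # round-robin cursor

    def load_level(self):
        # Load leveling: write to the next physical resource in the list.
        return [next(self._next)]

    def replicate(self):
        # Replication: write to each physical resource.
        return list(self.physical)

    def fault_tolerant(self, k, is_up):
        # Fault tolerance: write to k of the n resources, skipping any
        # that the (assumed) health check reports as unavailable.
        targets = [r for r in self.physical if is_up(r)][:k]
        if len(targets) < k:
            raise RuntimeError("fewer than k resources available")
        return targets


lr = LogicalResource(["disk1", "disk2", "tape1"])
first = lr.load_level()    # ["disk1"]
second = lr.load_level()   # ["disk2"]
```

The caller names only the logical resource; which physical systems receive the bytes is decided by the policy attached to that name.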
Storage Repository Virtualization User Application How does one access data stored on multiple systems? Database File System Archive
Storage Repository Virtualization (Standard Operations on Logical Resource Names) Remote operations Unix file system Latency management Procedures Transformations Third party transfer Filtering Queries Collective operations Load leveling Fault tolerance Replication User Application Common set of operations for interacting with every type of storage repository Database File System Archive
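The idea of a common set of operations over every type of storage repository can be sketched as a driver interface. The interface and both drivers below are hypothetical stand-ins (an in-memory store and a local directory), not SRB drivers; they only show how an application can copy data without knowing which backend it is talking to.

```python
# Hypothetical driver interface; not the SRB storage driver API.
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...


class MemoryDriver(StorageDriver):
    """Stand-in for a database blob store."""

    def __init__(self):
        self._blobs = {}

    def write(self, name, data):
        self._blobs[name] = bytes(data)

    def read(self, name):
        return self._blobs[name]


class LocalFileDriver(StorageDriver):
    """Stand-in for a Unix file system repository."""

    def __init__(self, root):
        self._root = root

    def write(self, name, data):
        with open(os.path.join(self._root, name), "wb") as f:
            f.write(data)

    def read(self, name):
        with open(os.path.join(self._root, name), "rb") as f:
            return f.read()


def copy(name, source, destination):
    # The same call works for any pair of drivers.
    destination.write(name, source.read(name))


mem = MemoryDriver()
fs = LocalFileDriver(tempfile.mkdtemp())
mem.write("sample.dat", b"payload")
copy("sample.dat", mem, fs)
```

Adding a new repository type then means writing one more driver class; `copy` and any other application code stay unchanged.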
Logical File Name Abstraction How does one identify files stored on multiple systems? User Application Database At U Md File System at NARA Archive at SDSC
Context Abstraction Logical name space Location independent identifier Persistent identifier Collection owned data Access controls Audit trails Checksums Descriptive metadata Inter-realm authentication Single sign-on system User Application Common naming convention and set of attributes for describing digital entities Database At U Md File System at U Texas Archive at SDSC
Federated Server Architecture Peer-to-peer Brokering Read Application Parallel Data Access Logical Name Or Attribute Condition 1 6 5/6 SRB server SRB server 3 4 5 SRB agent SRB agent 2 Server(s) Spawning R1 MCAT 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control R2 Data Access
SRB Latency Management Remote Proxies, Staging Data Aggregation Containers Prefetch Network Destination Network Destination Source Caching Client-initiated I/O Streaming Parallel I/O Replication Server-initiated I/O
Latency Management - Bulk Operations • Bulk register • Create a logical name for a file • Bulk load • Create a copy of the file on a data grid storage repository • Bulk unload • Provide containers to hold small files and pointers to each file location • Bulk delete • Mark as deleted in metadata catalog • After specified interval, delete file • Bulk metadata load • Requests for bulk operations for access control setting, …
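The container idea (many small files held in one object, with a pointer to each file's location) can be illustrated with a minimal sketch. The byte layout and function names are assumptions made for this example, not the SRB container format.

```python
# Hypothetical container layout: concatenated bytes plus an offset index.
def pack_container(files):
    """Pack a dict of name -> bytes into one blob and an index.

    The index maps each name to (offset, length) inside the blob.
    """
    index = {}
    chunks = []
    offset = 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks), index


def read_from_container(container, index, name):
    # One lookup in the index, then a slice of the container.
    offset, length = index[name]
    return container[offset:offset + length]


small_files = {"a.txt": b"alpha", "b.txt": b"be", "c.txt": b"gamma!"}
container, index = pack_container(small_files)
```

Moving `container` across a wide-area network costs one transfer instead of three, which is the latency saving the slide refers to; individual files remain addressable through `index`.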
Data Grid Federation • Link multiple independent data grids • Coordinate metadata between independent metadata catalogs • Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata) • Peer-to-peer federations, data access • Replication federations, shared resources • Hierarchical federations, consistency constraints • Tune data grid federation by implementing different consistency and access constraints
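Cross-registration between two independent data grids can be sketched as follows. The `Zone` class, the `accepts_from` trust list, and the lookup rule are hypothetical simplifications used to show an access constraint on federation; they do not describe how SRB zones are implemented.

```python
# Hypothetical sketch of cross-registration between two zones.
class Zone:
    def __init__(self, name, accepts_from=()):
        self.name = name
        self.accepts_from = set(accepts_from)  # zones trusted to register here
        self.local = {}    # logical name -> physical location in this zone
        self.remote = {}   # logical name -> name of the zone holding the data

    def cross_register(self, logical_name, target):
        # Publish a pointer to a local entity in another zone's catalog,
        # subject to the target zone's access constraint.
        if logical_name not in self.local:
            raise KeyError(logical_name)
        if self.name not in target.accepts_from:
            raise PermissionError(
                f"{target.name} does not accept registrations from {self.name}")
        target.remote[logical_name] = self.name

    def locate(self, logical_name, zones):
        # Resolve locally first, then follow a cross-registration pointer.
        if logical_name in self.local:
            return self.name, self.local[logical_name]
        home = zones[self.remote[logical_name]]
        return home.name, home.local[logical_name]


zone_a = Zone("A")
zone_b = Zone("B", accepts_from={"A"})
zones = {"A": zone_a, "B": zone_b}
zone_a.local["/A/survey/img001"] = "diskA:/vol/img001"
zone_a.cross_register("/A/survey/img001", zone_b)
```

Each zone keeps its own catalog; changing `accepts_from` is one way to picture tuning a federation toward tighter or looser sharing.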
Federation Data Access Methods (Web Browser, DSpace, OAI-PMH) Data Collection A Data Collection B • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints Access controls and consistency constraints on cross registration of digital entities
Peer-to-Peer Data Grids Free Floating Partial User-ID Sharing Replication Constraints Consistency Constraints Occasional Interchange Partial Resource Sharing Replicated Data No Metadata Synch Hierarchical Zone Organization One Shared User-ID System Set Access Controls System Controlled Complete Synch Complete User-ID Sharing Resource Interaction Nomadic Access Constraints System Managed Replication System Set Access Controls System Controlled Partial Synch No Resource Sharing User and Data Replica System Managed Replication Connection From Any Zone Complete Resource Sharing Snow Flake Super Administrator Zone Control Replicated Catalog Master Slave Replication Data Grids System Controlled Complete Synch No User-ID Sharing Federation Environments Deep Archive Hierarchical Data Grids
Generic Infrastructure • SDSC developed the Storage Resource Broker (SRB) to support access to distributed data • Effort started in 1996 as a DARPA funded project • Now supports over 30 national/international projects • Development team of 12 staff is led by • Michael Wan, data management systems • Arcot Rajasekar, information management systems
Data Grid Capabilities • Data manipulation • Containers • Parallel I/O • Firewall interactions • Resource interactions • Fault tolerance • Load leveling • Replication • HIPAA security requirements • Authentication of all users • Access controls on data and metadata • Audit trails • Data encryption • Centralized control • Application interfaces • C library, Shell commands, Java, Perl, Python, WSDL, workflow
Digital Library • Collection hierarchy for organizing data • User-defined metadata • Collection level metadata • Metadata manipulation • Schema extension • Bulk metadata processing • Queries on metadata • Access controls on metadata • Views on collections • Digital library APIs • DSpace, Fedora, OAI-PMH, web browsers • METS metadata XML schema
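A query on user-defined metadata within a collection hierarchy might look like the sketch below. The catalog class, attribute names, and the exact-match query rule are hypothetical; real systems expose much richer query languages.

```python
# Hypothetical metadata catalog with queries scoped to a collection.
class MetadataCatalog:
    def __init__(self):
        self._meta = {}  # logical path -> dict of attribute values

    def set_metadata(self, path, **attrs):
        self._meta.setdefault(path, {}).update(attrs)

    def query(self, collection, **conditions):
        # Match entries at any depth below the collection whose
        # attributes equal every given condition.
        prefix = collection.rstrip("/") + "/"
        return sorted(
            path for path, attrs in self._meta.items()
            if path.startswith(prefix)
            and all(attrs.get(k) == v for k, v in conditions.items())
        )


cat = MetadataCatalog()
cat.set_metadata("/lib/maps/1900/sd.tif", creator="USGS", format="tiff")
cat.set_metadata("/lib/maps/1950/la.tif", creator="USGS", format="tiff")
cat.set_metadata("/lib/photos/p1.jpg", creator="USGS", format="jpeg")
```

For example, `cat.query("/lib/maps", creator="USGS")` returns only the two map entries, while `cat.query("/lib", format="jpeg")` searches the whole hierarchy and returns the photo.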
Persistent Archives • Authenticity metadata • Provenance • User logical name space • Integrity metadata • Audit trails, checksums • Access controls • Consistency • Context update on all content operations • Persistency • Infrastructure independence • Storage repository abstraction • Information repository abstraction • Access abstraction (standard operations)
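Integrity metadata of the kind listed here (checksums and audit trails) can be sketched in a few lines. The choice of SHA-256, the class name, and the audit record layout are assumptions for illustration; the slide does not say which checksum algorithm or log format is used.

```python
# Hypothetical sketch: checksum verification with an audit trail.
import hashlib
from datetime import datetime, timezone


class ArchivedEntity:
    def __init__(self, logical_name, content):
        self.logical_name = logical_name
        self.content = content
        # Assumption: SHA-256 is used as the integrity checksum.
        self.checksum = hashlib.sha256(content).hexdigest()
        self.audit_trail = []
        self._log("system", "ingest")

    def _log(self, user, operation):
        # Context (the audit trail) is updated on every content operation.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((stamp, user, operation))

    def verify(self, user):
        ok = hashlib.sha256(self.content).hexdigest() == self.checksum
        self._log(user, "verify:ok" if ok else "verify:failed")
        return ok


entity = ArchivedEntity("/archive/record-001", b"original bytes")
intact = entity.verify("curator")
entity.content = b"original bytez"   # simulate silent corruption
corrupted = entity.verify("curator")
```

The stored checksum detects the change, and the audit trail records who checked and what the outcome was.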
National Archives Persistent Archive NARA U Md SDSC MCAT MCAT MCAT Principal copy stored at NARA with complete metadata catalog Replicated copy at U Md for improved access, load balancing and disaster recovery Deep Archive at SDSC, no user access, but complete copy
C, C++, Java Libraries Unix Shell Databases DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix File Systems Unix, NT, Mac OSX Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS Data Grid Federation - zoneSRB Application HTTP DSpace OpenDAP OAI, WSDL, WSRF DLL / Python, Perl Linux I/O Java, NT Browser Kepler Actors Federation Management Consistency & Metadata Management / Authorization, Authentication, Audit Latency Management Metadata Transport Logical Name Space Data Transport Catalog Abstraction Storage Repository Virtualization Databases DB2, Oracle, Sybase, Postgres, mySQL, Informix ORB
Examples of Extensibility • Storage Repository Driver evolution • Initially supported Unix file system • Added archival access - UniTree, HPSS • Added FTP/HTTP • Added database blob access • Added database table interface • Added Windows file system • Added project archives - Dcache, Castor, ADS • Added Object Ring Buffer, Datascope • Adding GridFTP version 3.3 • Database management evolution • Postgres • DB2 • Oracle • Informix • Sybase • mySQL (most difficult port - no locks, no views, limited SQL)
Examples of Extensibility • The 3 fundamental APIs are C library, shell commands, Java • Other access mechanisms are ported on top of these interfaces • API evolution • Initial access through C library, Unix shell command • Added iNQ Windows browser (C++ library) • Added mySRB Web browser (C library and shell commands) • Added Java (Jargon) • Added Perl/Python load libraries (shell command) • Added WSDL (Java) • Added OAI-PMH, OpenDAP, DSpace digital library (Java) • Added Kepler actors for dataflow access (Java) • Adding GridFTP version 3.3 (C library)
Grid Interfaces • GSI, support versions 1, 2, 3, Java • GridFTP version 3.3 interface to SRB collection • Use GSI certificate to identify the user to the SRB • Reference file by an SRB logical name space • Use SRB access controls for allowed operations • Initially support serial transport • SRB supports 4 different firewall interaction protocols (client-driven parallel I/O, server-driven parallel I/O, bulk file registration, federated data grid access) • GridFTP version 3.3 driver for SRB collection • Store data at a remote site under the SRB ID • Data will be shareable through SRB access controls • Store data at a remote site under user GSI certificate • Data will not be shareable through SRB access controls
Grid Interfaces • Replica Location Service Interface • Simon Metson <s.metson@bristol.ac.uk> • GMCat mimics the LRC interface, enabling the files registered in an MCat to appear on the giggle framework (RLS). • Available from http://tuber1.phy.bris.ac.uk:8080/GMCatWS3 • (also linked from the third party software on the SRB page) • Storage Resource Manager • SRM Version 1, SRB driver created to store data in SRM • SRM Version 2, development effort to put SRM interface on top of SRB (Alasdair Earl) • SRM Version 3, development effort to put SRM interface on top of SRB (Peter Kunszt)
Conclusion • Distributed data management systems can be built on generic data grid infrastructure • Data grids to support bulk access across remote sites • Integration of data grid and digital library capabilities to manage massive data collections • Federation of data grids to build international discipline-wide collections
SDSC SRB Team (left to right) • Arun Jagatheesan • George Kremenek • Sheau-Yen Chen • Arcot Rajasekar (SRB development lead) • Reagan Moore (SRB PI) • Michael Wan (SRB architect) • Roman Olschanowsky (BIRN) • Bing Zhu • Charlie Cowart • Lucas Gilbert • Tim Warnock • Wayne Schroeder (SRB product) • Adam Birnbaum (SRB production) • Antoine De Torcy • Vicky Rowley (BIRN) • Marcio Faerman (SCEC) • Students & emeritus • Erik Vandekieft • Reena Mathew • Xi (Cynthia) Sheng • Allen Ding • Grace Lin • Qiao Xin • Daniel Moore • Ethan Chen • Jon Weinburg • Supported by about 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)
For More Information Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE http://www.npaci.edu/DICE/SRB http://www.npaci.edu/dice/srb/mySRB/mySRB.html