Data Management Services
Reagan W. Moore
San Diego Supercomputer Center
9500 Gilman Drive, La Jolla, CA 92093-0505
Phone: 858 534-5073
FAX: 858 534-5152
E-mail: moore@sdsc.edu
http://www.npaci.edu/DICE/
Staff: Reagan Moore, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, Bertram Ludäscher, Richard Marciano, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
Students - GSRA: Martin Kuhl, Liying Sui, Yang Yu, Valter Crescenzi
Students - Undergrad Interns: Peter Shin, Roman Olshanowsky, Shabbar Tambawala, Pratik Mukhopadhyay +/- NN
Data Intensive Computing Environment Group
Topics
• Data management systems
• Examples of large-scale data management
• Characterization of data, information, and knowledge for digital libraries
Evolution of Data Management
Collection - managed data
• Use database to organize attributes about data objects
• Separate information management from data storage
• Support APIs for information discovery, data access
[Figure labels: Database A; Storage; Storage Resource Broker]
Integration is accomplished through a data handling system which characterizes the storage systems
Evolution of Data Management
Distributed Data Collection
• Same name space
• Same schema
• Separate administration domains
• Heterogeneous database instances
[Figure labels: Database A; Database B; Storage Resource Broker]
Integration requires the ability to characterize both the schemas and the table structures of each information repository
Data Grids
Data Grid - linking multiple data collections
• Separate name spaces
• Separate schema
• Separate administration domains
• Heterogeneous database instances
[Figure labels: Database A; Data grid; Database B]
The data grid is itself a collection that provides mechanisms to hide latency and manage semantics
Astronomy Sky Survey Data Grid
[Figure: layered architecture diagram with numbered layers. Labels, in extraction order: 1. Portals and Workbenches; 2. Knowledge & Resource Management; Bulk Data Analysis; Metadata View; Data View; Catalog Analysis; 3. Standard APIs and Protocols; Concept space; 4. Grid Security; Caching; Replication; Backup; Scheduling; Information Discovery; Metadata delivery; Data Discovery; Data Delivery; 5. Standard Metadata format, Data model, Wire format; 6. Catalog Mediator; Data mediator; Catalog/Image Specific Access; Compute Resources; Catalogs; Data Archives; Derived Collections; 7.]
Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
• Ability to execute processes to recreate derived data
[Figure labels: Database A; Services; Virtual Data Grid; Database B; Services]
The virtual data grid integrates data grid and digital library technology to manage processes
[Figure: NSDL architecture diagram. Labels: Portals & Clients; NSDL Services; Other NSDL Services; NSDL Collections; Referenced Items & Collections; Core Services: annotation, metadata normalizing; Core Collection-Building Services: metadata harvesting, persistent storage; CI Services: query transform, topic-map registry, personalization, discussion, visualization...; User Interfaces; NSDL Usage Enhancement; Delivery; Presentation; Aggregation - Channels; Information about collections; Core NSDL Bus; Meta-data delivery; Data delivery; Query; Global Ids; Security; Network; Metadata & data access-based services; Virtual Collections & Mediators; Collection Building]
Persistent Archive
Persistent archive
• Describe archived data as collections
• Describe processes used to create collections
• Manage evolution of technology
[Figure labels: Database A (today); Virtual Data Grid; Database A (tomorrow)]
The persistent archive is itself a virtual data grid that provides mechanisms to manage relationships over time
Data Management Systems
• Distributed data collections
  • Single name space
  • Distributed data storage systems
• Data Grid - integration of multiple data collections
  • Each collection has a separate name space
  • Infrastructure that interconnects the collections can use its own name space, containers, replication
• Virtual Data Grids - federation of digital libraries
  • In addition, support interoperability between services for manipulation, presentation, discovery of digital objects
• Persistent archive
  • In addition, manage evolution of technology components
Distributed Environment Hurdles
• Access to data distributed across multiple administration domains
• Access to local name spaces
• Persistence / consistency of distributed digital objects
• Latency hiding mechanisms
Distributed Data Collection
• Logical organization of distributed digital objects into a collection
• Access through federated servers
• Collection-owned data, which implies that the server at each storage repository runs under a collection user-ID
• Collection attributes define a global namespace
• Self-consistent attribute update on all data accesses
• Support for multiple access APIs
• Extensible support for access to any type of storage system (archive, file system, database)
• Extensible collection attributes
Logical Collections
• Separate the organization of digital objects into a collection from their physical storage location
• Metadata catalog to manage attributes about the digital objects
• Data handling system to manage interaction with remote storage systems
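The separation between logical organization and physical location can be sketched in a few lines. This is a toy in-memory illustration, not the MCAT or SRB interface: the class names, field names, logical paths, and resource names below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Physical location of one copy of a digital object (hypothetical fields)."""
    resource: str  # name of a storage system, e.g. an archive or a file system
    path: str      # location of the object inside that storage system


@dataclass
class CatalogEntry:
    """What the catalog records about one digital object."""
    attributes: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)


class MetadataCatalog:
    """Toy metadata catalog keyed by logical name, independent of storage."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, replica, **attributes):
        # One logical name may point at several physical copies.
        entry = self._entries.setdefault(logical_name, CatalogEntry())
        entry.replicas.append(replica)
        entry.attributes.update(attributes)

    def locate(self, logical_name):
        # Resolve a logical name to its physical copies.
        return list(self._entries[logical_name].replicas)

    def query(self, **criteria):
        # Discover objects by attribute value, never by physical path.
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.attributes.get(k) == v for k, v in criteria.items())
        )


catalog = MetadataCatalog()
catalog.register("/sky/2mass/tile001", Replica("archive-a", "/hpss/x/001.fits"), survey="2MASS")
catalog.register("/sky/2mass/tile001", Replica("disk-b", "/cache/001.fits"))
catalog.register("/sky/other/tile001", Replica("archive-a", "/hpss/y/001.fits"), survey="other")

print(catalog.query(survey="2MASS"))
print([r.resource for r in catalog.locate("/sky/2mass/tile001")])
```

Moving a file to a different storage system only changes a `Replica` record in this sketch; the logical name and the attributes used for discovery stay the same.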
Interoperability across Data and Information Repositories
• Define a representation for storage that is independent of the implementation of the storage system
  • Unix file system semantics - Open/Close/Read/Write/Seek
• Define a representation of a collection that is independent of the choice of database
  • XML DTD defining schema, table structures
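A storage representation based on Open/Close/Read/Write/Seek can be illustrated with a small driver interface. The sketch below is an assumption about how such an abstraction might look, not the SRB driver API: `StorageDriver`, `MemoryDriver`, `LocalFileDriver`, and `roundtrip` are hypothetical names.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical driver interface: every storage system exposes the same calls."""

    @abstractmethod
    def open(self, path, mode):
        """Return a handle for path; mode is 'r' or 'w'."""

    def read(self, handle, size=-1):
        return handle.read(size)

    def write(self, handle, data):
        return handle.write(data)

    def seek(self, handle, offset):
        return handle.seek(offset)

    def close(self, handle):
        handle.close()


class LocalFileDriver(StorageDriver):
    """Driver backed by a directory on a local file system."""

    def __init__(self, root):
        self._root = root

    def open(self, path, mode):
        # Inside the method, the bare name 'open' is the Python builtin.
        return open(os.path.join(self._root, path), mode + "b")


class MemoryDriver(StorageDriver):
    """Driver backed by in-memory buffers, standing in for some other system."""

    def __init__(self):
        self._blobs = {}

    def open(self, path, mode):
        if "w" in mode:
            self._blobs[path] = io.BytesIO()
        handle = self._blobs[path]
        handle.seek(0)
        return handle

    def close(self, handle):
        pass  # keep the buffer alive so the object can be reopened


def roundtrip(driver, path, payload):
    """Write payload, then read it back from byte 4; identical for any driver."""
    handle = driver.open(path, "w")
    driver.write(handle, payload)
    driver.close(handle)
    handle = driver.open(path, "r")
    driver.seek(handle, 4)
    data = driver.read(handle)
    driver.close(handle)
    return data


with tempfile.TemporaryDirectory() as root:
    from_file = roundtrip(LocalFileDriver(root), "obj.bin", b"datagrid")
from_memory = roundtrip(MemoryDriver(), "obj.bin", b"datagrid")
print(from_file, from_memory)
```

The point of the design is that `roundtrip` never needs to know which kind of storage system it is talking to; supporting a new system means writing one more driver.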
Defining Collection Attributes
• Composing schema - define sets of attributes that are needed for each collection function
  • SRB metadata - Unix file system semantics
  • Provenance metadata - Dublin Core
  • Resource metadata - User access control lists
  • Discipline metadata - User defined attributes
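Composing a schema from attribute sets might look like the sketch below. The four set names follow the slide; the individual attribute names are illustrative assumptions (the Dublin Core entries are a small subset of the real element set, and `path`, `size`, `acl`, and `band` are hypothetical).

```python
# A few real Dublin Core element names; the full element set is larger.
DUBLIN_CORE_SUBSET = ("title", "creator", "date", "format")


def compose_schema(**attribute_sets):
    """Merge named attribute sets into one flat list of 'set.attribute' names."""
    schema = []
    for set_name, attrs in attribute_sets.items():
        schema.extend(f"{set_name}.{attr}" for attr in attrs)
    return schema


def missing_attributes(schema, record):
    """List the schema attributes that a metadata record does not yet supply."""
    return [name for name in schema if name not in record]


schema = compose_schema(
    srb=("path", "size"),           # hypothetical file-system style attributes
    provenance=DUBLIN_CORE_SUBSET,  # provenance metadata
    resource=("acl",),              # hypothetical access control attribute
    discipline=("band",),           # hypothetical user-defined attribute
)

record = {"srb.path": "/sky/tile001", "srb.size": 1024}
print(schema)
print(missing_attributes(schema, record))
```

Prefixing each attribute with the name of its set keeps the sets independent, so a collection can add or drop a discipline-specific set without touching the others.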
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: architecture diagram. Labels: Application; C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Third-party copy; Web; SRB; MCAT; HRM; Dublin Core; DataCutter; Resource, User; User Defined; Application Meta-data; Remote Proxies; Databases: DB2, Oracle, Postgres; Archives: HPSS, ADSM, UniTree, DMF; File Systems: Unix, NT, Mac OSX]
Latency Management
• Data streaming
  • Overlap I/O access time with data movement
• Data caching
  • Create a local copy to minimize I/O access time
• Data replication
  • Choose between multiple sources for data access
• Data aggregation
  • Use containers to hold multiple small data sets
• I/O aggregation
  • Use remote proxies to do remote filtering/data subsetting
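Data aggregation with containers can be sketched as one byte stream plus an offset index, so that many small data sets travel in a single transfer. This is a minimal illustration under that assumption; the `Container` class and its layout are hypothetical and do not describe the SRB container format.

```python
import io


class Container:
    """Packs many small data sets into one byte stream plus an offset index."""

    def __init__(self):
        self._buffer = io.BytesIO()
        self._index = {}  # name -> (offset, length)

    def add(self, name, payload):
        # Always append at the end, regardless of earlier reads.
        offset = self._buffer.seek(0, io.SEEK_END)
        self._buffer.write(payload)
        self._index[name] = (offset, len(payload))

    def get(self, name):
        # Extract one data set without scanning the others.
        offset, length = self._index[name]
        self._buffer.seek(offset)
        return self._buffer.read(length)

    def index(self):
        return dict(self._index)

    def as_bytes(self):
        # The whole container can be moved with one I/O request.
        return self._buffer.getvalue()


container = Container()
container.add("a.dat", b"one")
container.add("b.dat", b"three")

print(container.index())
print(container.get("b.dat"))
print(len(container.as_bytes()))
```

The per-request latency is paid once for the container instead of once per small data set, which is the benefit the slide attributes to aggregation.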
Minimizing Latency in I/O Pipes
[Figure: I/O pipe diagram. Labels: Data Aggregation; Remote Proxies; Staging; Streaming; Caching; Replication; Network; Destination; Source]
Knowledge Management
• Must manage semantic relationships between the multiple name spaces
  • Data Grid
• Must manage procedural relationships between digital library services
  • Federated digital library
• Must manage structural relationships between different versions of software systems
  • Persistent archive
Differentiating between Data, Information, and Knowledge
• Data
  • Digital object
  • Objects are streams of bits
• Information
  • Any tagged data, which is treated as an attribute
  • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
  • Relationships between attributes
  • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
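The three levels can be made concrete with a small sketch: data as uninterpreted bytes, information as tagged attributes, and knowledge as typed relationships between attribute names. The `Relationship` class, the `crosswalk` helper, and the example attribute values are hypothetical and serve only to illustrate the distinction.

```python
from dataclasses import dataclass

# The four kinds of relationship named on the slide.
KINDS = {"procedural/temporal", "structural/spatial", "logical/semantic", "functional"}


@dataclass(frozen=True)
class Relationship:
    """A typed relationship between two attribute names."""
    kind: str
    source: str
    target: str

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown relationship kind: {self.kind}")


# Data: a digital object is an uninterpreted stream of bits.
data = bytes([0x53, 0x44, 0x53, 0x43])

# Information: tagged data treated as attributes of the object.
information = {"creator": "hypothetical survey team", "format": "raw"}

# Knowledge: relationships between attributes, here a semantic equivalence.
knowledge = [Relationship("logical/semantic", "creator", "author")]


def crosswalk(attributes, relationships):
    """Rename attributes along logical/semantic relationships."""
    mapping = {
        r.source: r.target
        for r in relationships
        if r.kind == "logical/semantic"
    }
    return {mapping.get(name, name): value for name, value in attributes.items()}


print(crosswalk(information, knowledge))
```

Only the semantic relationships are applied here; procedural, structural, and functional relationships would need their own interpreters.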
Types of Knowledge Relationships
• Logical / semantic
  • Digital Library cross-walks
• Temporal / procedural
  • Workflow systems
• Spatial / structural
  • GIS systems
• Functional / algorithmic
  • Scientific feature analysis
Knowledge Based Digital Libraries
[Figure: diagram with the headings Ingest Services, Management, Access Services and the levels Knowledge, Information, Data. Other labels, in extraction order: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; XTM DTD; Rules - KQL; (Model-based Access); Information Repository; Attribute-based Query; XML DTD; Attributes; Semantics; SDLIP; (Data Handling System - SRB); Fields; Containers; Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF]
Information Management Projects
• Digital Libraries
  • NSF Digital Library Initiative, Phase II - UCSB, Stanford
  • Digital Embryo digital library - GMU
  • NPACI Digital Sky - Caltech 2MASS sky survey
  • CDL - AMICO
  • NSF NSDL - UCAR / DLESE
• Grid Environments
  • NASA Information Power Grid - NASA Ames
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford, Caltech
  • NSF Grid Physics Network - U Fl
• Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Scalable archives
Further Information http://www.npaci.edu/DICE