Digital Libraries, Data Grids, and Persistent Archives Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE/
Data and Knowledge Systems Group • Staff: Reagan Moore, Ilkai Altintas, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, Bertram Ludäscher, Richard Marciano, XuFei Qian, Roman Olshanowsky, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu • Graduate Students: A. Bagchi, S. Bansal, A. Behere, R. Bharath, S. Bharath, M. Kulrul, L. Sui • Undergraduate Interns: N. Cotofana, M. Shumaker, J. Trang, L. Yin • +/- NN
Topics • Application of data management systems, information management systems, and knowledge management systems • to distributed data collections, digital libraries, data grids, and persistent archives • by defining levels of abstraction
Information Management Projects • Digital Libraries • CDL - AMICO • DARPA/USPTO - patent digital library • NLM Visible Embryo digital library - GMU • NSF Digital Library Initiative, Phase II - UCSB, Stanford • NSF NPACI Digital Sky - Caltech 2MASS sky survey • NSF NSDL - UCAR / Columbia / Cornell / UCSB • Data Grid Environments • DOE Data Visualization Corridor - LLNL • DOE Particle Physics Data Grid - Stanford, Caltech • NASA Information Power Grid - NASA Ames • NIH Biomedical Informatics Research Network • NSF Grid Physics Network - U Florida • NSF National Virtual Observatory - Johns Hopkins University / Caltech • NSF Southern California Earthquake Center - ISI • Persistent Archives • NARA Persistent Archive • NHPRC - Archivist workbench
Managing Distributed Storage • Separate the organization of digital objects from their physical storage • Logical Name Space to manage attributes about the digital objects • Data handling system to manage interactions with remote storage systems • Create storage abstraction layer • Storage Resource Broker (SRB) provides data management system
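As a rough illustration of the separation described above, the sketch below keeps a logical name space (logical name plus attributes) apart from the physical replicas it points to. The class names, resource names, and paths are hypothetical and chosen only for the example; this is not the SRB API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # location inside that resource


@dataclass
class LogicalEntry:
    attributes: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)


class LogicalNameSpace:
    """Maps logical names to attributes and to physical replicas."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, replica, **attributes):
        # Create the entry on first use, then record the replica and attributes.
        entry = self._entries.setdefault(logical_name, LogicalEntry())
        entry.replicas.append(replica)
        entry.attributes.update(attributes)

    def resolve(self, logical_name):
        # Return a copy so callers cannot mutate the catalog by accident.
        return list(self._entries[logical_name].replicas)

    def move(self, logical_name, old_resource, new_replica):
        # Physical storage changes; the logical name and attributes do not.
        entry = self._entries[logical_name]
        entry.replicas = [r for r in entry.replicas if r.resource != old_resource]
        entry.replicas.append(new_replica)


ns = LogicalNameSpace()
name = "/collection/sky/image001"
ns.register(name, Replica("unix-fs", "/data/a/001.fits"), owner="moore")
ns.move(name, "unix-fs", Replica("hpss-archive", "/hpss/x/001.fits"))
print(ns.resolve(name))
```

The point of the design is that users keep referring to the same logical name after the object has been moved to a different storage system.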
Information Management - Logical Name Space • Set of attributes to describe digital entities that are registered into the logical name space • SRB metadata - Unix file system semantics • Provenance metadata - Dublin Core • Resource metadata - User access control lists • Discipline metadata - User defined attributes • Each digital entity may have unique attributes
Information Management • Abstraction layer for interacting with information repositories • Manage the schema and physical table structures of a database • Extensible schema • User defined attributes • Extensible Metadata CATalog (EMCAT) manages collections • mySRB.html interface supports dynamic collection creation
Knowledge Management - Discovery across Collections • Characterization of relationships between attributes • Semantic / logical - cross-walks • Procedural / temporal - records management • Structural / spatial - GIS • Abstraction layer for knowledge repositories • Mapping from collection attributes to discipline concepts • Model-based Mediation supports mapping from knowledge relationships to rule-based inference engines
Presentation of Digital Objects [Diagram labels: Application; Operating System; Storage System; Display System; Digital Object]
Technology Management [Diagram labels: Application; Wrap Application; Operating System; Storage System; Display System; Digital Object]
Technology Management [Diagram labels: Application; Add Operating System Call; Operating System; Storage System; Display System; Digital Object]
Technology Management [Diagram labels: Application; Add Operating System Call; Operating System; Add Operating System Call; Storage System; Display System; Digital Object]
Technology Management [Diagram labels: Application; Add Operating System Call; Operating System; Wrap Storage System; Wrap Display System; Storage System; Display System; Digital Object]
Technology Management [Diagram labels: Application; Operating System; Storage System; Display System; Migrate Encoding Format; Digital Object]
Specifying levels of Abstraction • Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource • Can we abstract • Digital object • Storage
Technology Management [Diagram labels: Application; Operating System; Storage System Abstraction; Display System Abstraction; Storage System; Display System; Digital Object Abstraction; Digital Object]
Types of Digital Entity Abstractions • Logical representation • What does the digital entity represent? • What is the associated meaning? • Physical representation • What is the physical structure of the digital entity?
Levels of Abstraction for Bits • Digital Entity: Bit Stream • Abstraction for Digital Entity - Logical: I-nodes; Physical: Track / Sector • Repository: Disk • Abstraction for Repository - Logical: File Name; Physical: File System (NFS/AFS/NTFS)
Levels of Abstraction for Data • Digital Entity: Files • Abstraction for Digital Entity - Logical: Data Model (units, semantics); Physical: Encoding Format (syntax, structure) • Repository: File System, Archive • Abstraction for Repository - Logical: Name Space; Physical: Data Handling System - SRB/MCAT
Levels of Abstraction for Information • Digital Entity: Metadata Attributes • Abstraction for Digital Entity - Logical: Collection Schema; Physical: XML Syntax • Repository: Database • Abstraction for Repository - Logical: Database Schema; Physical: EMCAT/CWM
Levels of Abstraction for Knowledge • Digital Entity: Concept Space (ontology instance) • Abstraction for Digital Entity - Logical: Relationship Schema; Physical: ER/UML/XMI/RDF syntax • Repository: Knowledge Repository • Abstraction for Repository - Logical: Knowledge Repository Schema; Physical: Model-based Mediation System
Information Management Projects • Digital Libraries • CDL - AMICO • DARPA/USPTO - patent digital library • NLM Visible Embryo digital library - GMU • NSF Digital Library Initiative, Phase II - UCSB, Stanford • NSF NPACI Digital Sky - Caltech 2MASS sky survey • NSF NSDL - UCAR / Columbia / Cornell / UCSB • Data Grids • DOE Data Visualization Corridor - LLNL • DOE Particle Physics Data Grid - Stanford, Caltech • NASA Information Power Grid - NASA Ames • NIH Biomedical Informatics Research Network • NSF Grid Physics Network - U Florida • NSF National Virtual Observatory - Johns Hopkins University / Caltech • NSF Southern California Earthquake Center - ISI • Persistent Archives • NARA Persistent Archive • NHPRC - Archivist workbench
Evolution of Data Management • Collection-managed data • Use a database to organize attributes about data objects • Separate information management from data storage • Support APIs for information discovery and data access • [Diagram labels: Database A; Storage; Storage Resource Broker] • Integration is accomplished through a data handling system that characterizes the storage systems
SDSC Storage Resource Broker & Meta-data Catalog [Architecture diagram labels: Application; C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web; Third-party copy; SRB; MCAT; Dublin Core; Resource, User; User Defined; Application Meta-data; Remote Proxies; DataCutter; HRM • Databases: DB2, Oracle, Postgres • Archives: HPSS, ADSM, UniTree, DMF • File Systems: Unix, NT, Mac OSX]
Evolution of Data Management • Distributed data collection • Same name space • Same schema • Separate administration domains • Heterogeneous database instances • [Diagram labels: Database A; Database B; Storage Resource Broker] • Integration requires the ability to characterize both the schemas and the table structures of each information repository
Distributed Data Collection • Logical organization of distributed digital objects into a collection • Access through federated servers • Collection-owned data implies that the server at each storage repository runs under a collection user ID • Collection attributes define a global name space • Self-consistent attribute update on all data accesses • Support for multiple access APIs • Extensible support for access to any type of storage system (archive, file system, database) • Extensible collection attributes
Interoperability across Data and Information Repositories • Define a representation for storage that is independent of the implementation of the storage system • Unix file system semantics - Open/Close/Read/Write/Seek • Define a representation of a collection that is independent of the choice of database • schema, table structures
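One way to picture a storage representation that is independent of the storage system is an abstract driver exposing only the Unix-style operations named above (open, close, read, write, seek). The sketch below is an assumption-laden toy: `StorageDriver` and `InMemoryDriver` are hypothetical names, and the in-memory back end merely stands in for an archive, file system, or database driver.

```python
import io
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Storage operations that callers use regardless of the back end."""

    @abstractmethod
    def open(self, name, mode="r"): ...

    @abstractmethod
    def read(self, handle, size=-1): ...

    @abstractmethod
    def write(self, handle, data): ...

    @abstractmethod
    def seek(self, handle, offset): ...

    @abstractmethod
    def close(self, handle): ...


class InMemoryDriver(StorageDriver):
    """Toy back end; a real driver would wrap an actual storage system."""

    def __init__(self):
        self._objects = {}  # object name -> stored bytes
        self._handles = {}  # integer handle -> (name, buffer)
        self._next = 0

    def open(self, name, mode="r"):
        # "w" starts from an empty object; any other mode loads existing bytes.
        buf = io.BytesIO() if mode == "w" else io.BytesIO(self._objects[name])
        self._next += 1
        self._handles[self._next] = (name, buf)
        return self._next

    def read(self, handle, size=-1):
        return self._handles[handle][1].read(size)

    def write(self, handle, data):
        return self._handles[handle][1].write(data)

    def seek(self, handle, offset):
        return self._handles[handle][1].seek(offset)

    def close(self, handle):
        # Persist the buffer contents when the handle is released.
        name, buf = self._handles.pop(handle)
        self._objects[name] = buf.getvalue()


driver = InMemoryDriver()
h = driver.open("obj", "w")
driver.write(h, b"digital object")
driver.close(h)

h = driver.open("obj")
driver.seek(h, 8)
tail = driver.read(h)
driver.close(h)
print(tail)
```

Code written against `StorageDriver` does not change when a different back end is substituted, which is the property the slide asks of the storage representation.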
Visible Embryo Project Disk Cache AFIP: Collab WS Image Generation OHSU Eolas GST ATD Net NIC Disk Cache UIC Startap ASX200 BEN MSWS NT WS MSWS NT WS Oakland HSCC WRL 100 Gbit Vegas OC-3 JHU Disk Cache DS3 Los Angeles VBNS OC-12 Abilene OC-3 GMU Disk Cache DC POP OC-3 Abilene OC-3 SDSC Archive
Data Grids • Data grid - linking multiple data collections • Separate name spaces • Separate schema • Separate administration domains • Heterogeneous database instances • [Diagram labels: Database A; Data grid; Database B] • The data grid is itself a collection that provides mechanisms to hide latency and manage semantics
National Virtual Observatory Data Grid 1. Portals and Workbenches 2.Knowledge & Resource Management Bulk Data Analysis Metadata View Data View Catalog Analysis 3. Standard APIs and Protocols Concept space 4.Grid Security Caching Replication Backup Scheduling Information Discovery Metadata delivery Data Discovery Data Delivery 5. Standard Metadata format, Data model, Wire format 6. Catalog Mediator Data mediator Catalog/Image Specific Access Compute Resources Catalogs Data Archives Derived Collections 7.
Federated Digital Libraries • Virtual data grid - linking multiple data collections • Ability to execute processes to recreate derived data • [Diagram labels: Database A, Services; Virtual Data Grid; Database B, Services] • The virtual data grid integrates data grid and digital library technology to manage processes
Portals & Clients Portals & Clients Portals & Clients NSDL Services NSDL Services Other NSDL Services NSDL Collections NSDL Collections NSDL Collections Core Services: annotation CI Services query transform CI Services topic-map registry referenced items & collections Core Services: metadata normalizing CI Services personalization referenced items & collections Referenced Items & Collections Core Collection- Building Services metadata harvesting CI Services discussion Core Collection- Building Services persistent storage CI Services visualization... User Interfaces NSDL Usage Enhancement Delivery Presentation Aggregation - Channels Information about collections Core NSDL Bus Meta-data delivery Data delivery Query Global Ids Security Network Metadata & data access-based services Virtual Collections & Mediators Collection Building
Persistent Archive • Persistent archive • Describe archived data as collections • Describe processes used to create collections • Manage evolution of technology • [Diagram labels: Database A (today); Virtual Data Grid; Database A (tomorrow)] • The persistent archive is itself a virtual data grid that provides mechanisms to manage migration to new technology
Persistent Archives • Storage system abstraction • Logical name space and data manipulations • Information repository abstraction • Logical schema and physical table structure • Knowledge repository abstraction • Topic maps and inference rules • Digital object abstraction • Data model and encoding format
Persistent Collection • Define context for archiving data - annotate information content • Create archivable form - standard encoding format • Archive information content along with data • Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection • Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information • Differentiate between inherent knowledge and anomalies / artifacts
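The closure test can be sketched as a simple membership check: every object that can be discovered by following references inside the collection must itself be a member. The data model here (a set of identifiers plus a dictionary of references) and the function name `closure_violations` are assumptions made for illustration only.

```python
def closure_violations(members, references):
    """Return references found inside the collection that point outside it.

    members:    set of object identifiers that belong to the collection
    references: dict mapping a member to the identifiers it refers to
    """
    missing = {}
    for obj, targets in references.items():
        outside = sorted(t for t in targets if t not in members)
        if outside:
            missing[obj] = outside
    return missing


members = {"img-1", "img-2", "catalog"}
references = {"catalog": ["img-1", "img-2", "img-3"], "img-1": []}

violations = closure_violations(members, references)
print(violations)  # {'catalog': ['img-3']}
```

An empty result means the collection is closed under this model; a non-empty result lists the discoverable objects that still need to be archived or removed from the references.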
Self-Instantiating Archive • Archive the processes that are used to control the ingestion process • Conversion to archivable form • Annotation of information content • When accessing the collection, retrieve the processes and the original digital objects • Apply the processing steps to re-create the information content • Query the result to discover desired digital objects • A self-instantiating archive is a virtual data grid
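The steps above can be sketched as replaying archived ingestion processes over the original digital objects and then querying the result. Everything in this sketch is hypothetical: the step names, the `STEPS` registry, and the sample strings are illustrative, and a real archive would need a durable, technology-independent description of each process rather than Python functions.

```python
# Registry of processing steps; the archive itself stores only the step names.
STEPS = {
    "to_archivable_form": lambda obj: {"payload": obj.strip().lower()},
    "annotate": lambda rec: {**rec, "length": len(rec["payload"])},
}

# The archive holds the original objects and the process used at ingestion.
archive = {
    "objects": ["  Alpha ", "BETA", " gamma-ray "],
    "ingest_process": ["to_archivable_form", "annotate"],
}


def instantiate(archive, steps=STEPS):
    """Re-create the information content by replaying the archived process."""
    collection = []
    for obj in archive["objects"]:
        for name in archive["ingest_process"]:
            obj = steps[name](obj)
        collection.append(obj)
    return collection


def query(collection, predicate):
    """Discover the records that satisfy a predicate on their attributes."""
    return [rec for rec in collection if predicate(rec)]


collection = instantiate(archive)
hits = query(collection, lambda rec: rec["length"] > 5)
print(hits)  # [{'payload': 'gamma-ray', 'length': 9}]
```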
Data Management Systems • Distributed data collections • Single name space • Distributed data storage systems • Data Grid - integration of multiple data collections • Each collection has a separate name space • Infrastructure that interconnects the collections can use its own name space, containers, replication • Virtual Data Grids - federation of digital libraries • In addition, support interoperability between services for manipulation, presentation, discovery of digital objects • Persistent archive • In addition, manage evolution of technology components
Differentiating between Data, Information, and Knowledge • Data • Digital object • Objects are streams of bits • Information • Any tagged data, which is treated as an attribute. • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object • Knowledge • Relationships between attributes • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
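The three-way distinction can be written down as three small types: a digital object is a stream of bits, an attribute is tagged data, and a relationship links attributes. The type names, tags, and sample values below are hypothetical and serve only to make the definitions concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalObject:
    """Data: a stream of bits with no interpretation attached."""
    bits: bytes


@dataclass(frozen=True)
class Attribute:
    """Information: tagged data, treated as an attribute."""
    tag: str
    value: object


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a relationship between attributes."""
    kind: str  # e.g. "procedural/temporal", "structural/spatial"
    name: str
    source: Attribute
    target: Attribute


obj = DigitalObject(bits=b"\x00\x01\x02")
created = Attribute(tag="date_created", value="2001-01-01")
ingested = Attribute(tag="date_ingested", value="2001-02-01")
rel = Relationship("procedural/temporal", "precedes", created, ingested)

print(rel.source.tag, rel.name, rel.target.tag)
```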
Knowledge Management • Data Grid - must manage semantic relationships between the multiple name spaces • Federated digital library - must manage procedural relationships between digital library services • Persistent archive - must manage structural relationships between different archivable forms (encoding formats)
Types of Knowledge Relationships • Logical / semantic • Digital Library cross-walks • Temporal / procedural • Workflow systems • Spatial / structural • GIS systems • Functional / algorithmic • Scientific feature analysis
Knowledge Based Data Grids Ingest Services Management Access Services Relationships Between Concepts Knowledge Repository for Rules Knowledge or Topic-Based Query / Browse Knowledge XTM DTD • Rules - KQL (Model-based Access) XML DTD Information Repository Attribute- based Query Attributes Semantics SDLIP Information (Data Handling System - SRB) Data Fields Containers Folders Storage (Replicas, Persistent IDs) Grids Feature-based Query MCAT/HDF
Further Information http://www.npaci.edu/DICE