Knowledge-Based Persistent Archives Reagan W. Moore San Diego Supercomputer Center 9500 Gilman Drive, La Jolla, CA 92093-0505 Phone: 858 534-5073 FAX: 858 534-5152 E-mail: moore@sdsc.edu
Data Intensive Computing Environment • Staff: Reagan Moore, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, Bertram Ludäscher, Richard Marciano, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu • Students - GSRA: Martin Kuhl, Liying Sui, Yang Yu, Valter Crescenzi • Students - Undergrad Interns: Peter Shin, Roman Olshanowsky, Shabbar Tambawala, Pratik Mukhopadhyay, +/- NN
Research and Development Activities - FY00 • Demonstration of scalable systems • Expansion of persistent archive Framework • Knowledge-based persistent archives • Demonstration of archivable forms for new types of data • Web, GIS, compound documents, collections • Knowledge and anomaly processing • Tightness of fit of XML DTDs • Self validating archives as a preservation strategy
Topics • Persistent archive functionality • Characterization of • Data / Information / Knowledge • Integration of Digital Library, Grid environments, and Persistent Archives
Persistent Archive • Manage digital objects for the “life of the republic” • Maintain the ability to discover and access digital objects while the supporting hardware and software systems evolve
Fundamental Concept for a Persistent Archive • Persistence requires migration over time onto new technology • While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology. • A persistent archive is an interoperability system.
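The interoperability requirement above can be illustrated with a minimal sketch. It assumes a hypothetical `StorageDriver` interface and in-memory stand-ins for the old and new storage systems; none of these names come from the slides, and a real archive would sit on top of actual storage systems. The point is only that reads keep working against both technologies while objects are copied across.

```python
import hashlib
from typing import Dict, Optional, Protocol


class StorageDriver(Protocol):
    """Hypothetical minimal interface that every storage technology must offer."""

    def get(self, object_id: str) -> Optional[bytes]: ...
    def put(self, object_id: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a real storage system (assumption for illustration)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, object_id: str) -> Optional[bytes]:
        return self._objects.get(object_id)

    def put(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data


class MigratingArchive:
    """Serves reads from both old and new technology during a migration."""

    def __init__(self, old: StorageDriver, new: StorageDriver) -> None:
        self.old = old
        self.new = new

    def read(self, object_id: str) -> bytes:
        # Prefer the new system; fall back to the old one if not yet migrated.
        data = self.new.get(object_id)
        if data is None:
            data = self.old.get(object_id)
        if data is None:
            raise KeyError(object_id)
        return data

    def migrate(self, object_id: str) -> str:
        # Copy one object to the new system and verify the copy bit for bit.
        data = self.old.get(object_id)
        if data is None:
            raise KeyError(object_id)
        self.new.put(object_id, data)
        copied = self.new.get(object_id)
        expected = hashlib.sha256(data).hexdigest()
        if copied is None or hashlib.sha256(copied).hexdigest() != expected:
            raise IOError(f"migration of {object_id} failed verification")
        return expected
```

A caller would construct `MigratingArchive(old_store, new_store)` and call `migrate` object by object; `read` returns the same bits before and after each step.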
Implicit Concepts for Persistent Archive • Infrastructure independence • Data set access • Authentication • Collection management • Presentation • Non-proprietary formatting • Information models • XML - Information markup language • GML - Graphics markup language • Support for ingestion, management, access • Accessioning workbench, archive, access workbench
Standard Information Markup Language • XML representation of metadata attributes • Standardization of DTDs - MOA II DTD for text • Standardization of markup language • XML based representation of collection structure • Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, …) • XML databases & XML organized data collections • Commercial systems: Excelon, TAMINO, Oracle8i • XML based Topic Maps • Represent relationships between collection domain concepts and collection attributes
E-mail Collection • Test of the scalability of the technology • Archived a one-million record E-mail collection (1999) • Ingestion • Tagged E-mail using XML syntax (6 required, 13 optional, 1000 user-defined tags) • Created description of the collection • Aggregated E-mail into containers, stored in an archive • Retrieved collection description, created database, and optimized for query • Total time was 27 hours (used 10 Mbit/sec Ethernet)
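The tagging and aggregation steps above might look roughly like the following sketch. The slide does not name the required tags, so the header list and the element names (`email`, `container`, `body`) are assumptions chosen for illustration; only the Python standard library is used.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# Assumption: the slide mentions 6 required tags but does not list them.
# These header names are hypothetical placeholders.
REQUIRED_HEADERS = ["From", "To", "Date", "Subject", "Message-ID"]


def tag_email(raw: str) -> ET.Element:
    """Wrap one raw e-mail message in XML, one element per required header."""
    msg = message_from_string(raw)
    record = ET.Element("email")
    for header in REQUIRED_HEADERS:
        child = ET.SubElement(record, header.lower().replace("-", "_"))
        child.text = msg.get(header, "")
    body = ET.SubElement(record, "body")
    body.text = msg.get_payload()
    return record


def build_container(raw_messages, container_id: str) -> ET.Element:
    """Aggregate many tagged messages into a single container element."""
    container = ET.Element("container", id=container_id)
    for raw in raw_messages:
        container.append(tag_email(raw))
    return container
```

Aggregating many small records into one container means the archive stores a few large files instead of a million small ones; `ET.tostring(container)` would give the bytes to hand to the storage system.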
What Types of Interoperability are Needed? • Data management (digital objects) • Ability to work with multiple types of storage systems, across separate administration domains • Information management (attributes) • Ability to define a collection independent of database choice • Ability to migrate collection onto new databases • Knowledge management (relationships) • Ability to manage relationships • Ability to map domain concepts to collection attributes
Simplest Definitions • Data • Digital object • Objects are streams of bits • Information • Any tagged data, which is treated as an attribute. • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object • Knowledge • Relationships between attributes • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
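These three definitions map naturally onto a small data model. The sketch below is one possible reading, with hypothetical class and field names: a digital object is a stream of bits, an attribute is tagged data associated with it, and a relationship links two attributes and carries one of the four kinds listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RelationshipKind(Enum):
    """The four relationship types named on the slide."""

    PROCEDURAL_TEMPORAL = "procedural/temporal"
    STRUCTURAL_SPATIAL = "structural/spatial"
    LOGICAL_SEMANTIC = "logical/semantic"
    FUNCTIONAL = "functional"


@dataclass(frozen=True)
class Attribute:
    """Information: a tagged piece of data."""

    name: str
    value: str


@dataclass
class DigitalObject:
    """Data: a stream of bits, plus any attributes associated with it."""

    bits: bytes
    attributes: List[Attribute] = field(default_factory=list)


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two attribute names."""

    kind: RelationshipKind
    source: str
    target: str
```

For example, a digital library cross-walk that equates a `creator` tag with an `author` tag would be `Relationship(RelationshipKind.LOGICAL_SEMANTIC, "creator", "author")`.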
Types of Knowledge Relationships • Logical / semantic • Digital Library cross-walks • Temporal / procedural • Workflow systems • Spatial / structural • GIS systems • Functional / algorithmic • Scientific feature analysis
Data Archive [diagram] • Ingest Services - Ingestion platform • Management - Data repositories • Access Services - Access platform • Interoperability Standards • Interoperability Protocols
Collection Based Persistent Archive [diagram] • Columns: Ingest Services, Management, Access Services • Information level: Attributes, Semantics (ingest); Information Repository (management); Attribute-based Query (access); interfaces XML DTD and SDLIP • Data Handling System: SRB / FTP / HTTP • Data level: Fields, Containers, Folders (ingest); Storage (Replicas, Persistent IDs) (management); Feature-based Query (access); interfaces MCAT/HDF and Grids
Knowledge Based Persistent Archive [diagram] • Columns: Ingest Services, Management, Access Services • Knowledge level: Relationships Between Concepts, Rules - KQL (ingest); Knowledge Repository for Rules (management); Knowledge or Topic-Based Query / Browse (access); interfaces XTM DTD and (Topic Maps / Buckets / Model-based Access) • Information level: Attributes, Semantics (ingest); Information Repository (management); Attribute-based Query (access); interfaces XML DTD and SDLIP • Data Handling System: SRB / FTP / HTTP • Data level: Fields, Containers, Folders (ingest); Storage (Replicas, Persistent IDs) (management); Feature-based Query (access); interfaces MCAT/HDF and Grids
Ingestion Processes for Collection Creation [diagram; box labels] • Accession Template • Closure • Concept/Attribute • Attribute Inverse Indexing • Information Generation • Knowledge Generation • Attribute Selection • Attribute Tagging • Occurrence Tagging • View Management • Data Organization • Collection
Examples of Implied Knowledge: Senate Legislative Activities • Structural knowledge • Pertinent information embedded in document headers • Procedural knowledge • Naming convention • Senator represented by last name • Senator represented by last name and state • Senator represented by last name, first name, and state • Collection knowledge • Referenced senators include senators no longer in the senate
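The naming convention above is a concrete case where implied knowledge has to be made explicit before references can be matched. The sketch below assumes three hypothetical surface forms (`Smith`, `Smith (CA)`, and `Smith, Jane (CA)`); the actual forms used in the collection are not given on the slide. It parses each form into one structure and then looks for candidates in a roster that may include former senators.

```python
import re
from typing import List, NamedTuple, Optional


class SenatorRef(NamedTuple):
    last: str
    first: Optional[str]
    state: Optional[str]


# Assumed surface forms: "Last", "Last (ST)", "Last, First (ST)".
_PATTERN = re.compile(
    r"^\s*(?P<last>[A-Za-z'\-]+)"
    r"(?:,\s*(?P<first>[A-Za-z'\-]+))?"
    r"(?:\s*\((?P<state>[A-Z]{2})\))?\s*$"
)


def parse_reference(text: str) -> SenatorRef:
    """Parse any of the three assumed forms into a single structure."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized senator reference: {text!r}")
    return SenatorRef(match["last"], match["first"], match["state"])


def candidates(ref: SenatorRef, roster: List[SenatorRef]) -> List[SenatorRef]:
    """Return roster entries consistent with a possibly partial reference."""
    return [
        entry
        for entry in roster
        if entry.last == ref.last
        and (ref.first is None or entry.first == ref.first)
        and (ref.state is None or entry.state == ref.state)
    ]
```

A reference by last name alone can return several candidates, which is exactly the ambiguity the fuller forms of the convention resolve.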
Knowledge Generation • Accessioning Template • Defines the concepts under which the data objects will be tagged and organized • Attribute selection • Define the attributes that represent the information content associated with the domain concepts • Tag attributes using minimal constraint language, such as XML or XMLSchema • Evaluate closure of mined attributes compared to expected attributes • Refine concept map
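One simple way to read "evaluate closure of mined attributes compared to expected attributes" is as a set comparison, sketched below. This interpretation, the function names, and the two-level XML layout (a root element containing records, each containing attribute elements) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Set


def mined_attributes(xml_text: str) -> Set[str]:
    """Collect every tag name that appears inside any record of the collection."""
    root = ET.fromstring(xml_text)
    return {child.tag for record in root for child in record}


def evaluate_closure(expected: Iterable[str], mined: Iterable[str]) -> Dict[str, object]:
    """Compare the attributes the accessioning template expects with those found."""
    expected_set, mined_set = set(expected), set(mined)
    return {
        "missing": sorted(expected_set - mined_set),
        "unexpected": sorted(mined_set - expected_set),
        "closed": expected_set == mined_set,
    }
```

A non-empty `unexpected` list is the signal to refine the concept map; a non-empty `missing` list suggests either incomplete tagging or an over-specified template.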
Information Generation • Create occurrence index • (Occurrence, attribute, value) • This is needed to be able to recreate original form of digital object • Analyze completeness of information • Inverse index of attribute values • Identifies unexpected values - consistency • Analyze closure of collection • Are additional attributes needed to represent inverse index value ranges?
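The occurrence index and inverse index described above can be sketched as follows. The record layout (an object identifier mapped to an ordered list of attribute/value pairs) and the use of `(object_id, position)` as the occurrence are assumptions; the slide only specifies the triple `(occurrence, attribute, value)`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Occurrence = Tuple[str, int]           # (object id, position within the object)
Triple = Tuple[Occurrence, str, str]   # (occurrence, attribute, value)


def build_occurrence_index(records: Dict[str, List[Tuple[str, str]]]) -> List[Triple]:
    """Flatten tagged records into (occurrence, attribute, value) triples."""
    triples: List[Triple] = []
    for object_id, pairs in records.items():
        for position, (attribute, value) in enumerate(pairs):
            triples.append(((object_id, position), attribute, value))
    return triples


def build_inverse_index(triples: List[Triple]) -> Dict[str, Dict[str, List[Occurrence]]]:
    """Map attribute -> value -> occurrences, exposing the value range per attribute."""
    inverse: Dict[str, Dict[str, List[Occurrence]]] = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in triples:
        inverse[attribute][value].append(occurrence)
    return inverse


def recreate(triples: List[Triple], object_id: str) -> List[Tuple[str, str]]:
    """Rebuild the original ordered attribute/value pairs of one object."""
    hits = [(occ[1], attr, val) for occ, attr, val in triples if occ[0] == object_id]
    return [(attr, val) for _, attr, val in sorted(hits)]
```

Listing `sorted(inverse["state"])` shows every distinct value recorded for that attribute, which is how a misspelled or out-of-range value would surface; `recreate` shows why the occurrence is kept, since without the position the original ordering could not be restored.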
Data Organization • Archive preferred views of collection • Original data • XML tagged representation • Minimal representation of consolidated information • ‘Noise-free’ version based upon occurrence tags • Object-relational database version • Archive occurrence tagged view • Archive ingestion procedures that transform collection from the original digital objects to the preferred views
Information Management Projects • Digital Libraries • NSF Digital Library Initiative, Phase II - UCSB, Stanford • Digital Embryo digital library - GMU • NPACI Digital Sky - Caltech 2MASS sky survey • CDL - AMICO • NSF NSDL - UCAR / DLESE • Grid Environments • NASA Information Power Grid - NASA Ames • DOE Data Visualization Corridor - LLNL • DOE Particle Physics Data Grid - Stanford, Caltech • NSF Grid Physics Network - U Fl • Persistent Archives • NARA Persistent Archive • NHPRC - Scalable archives
Data Handling System: SDSC Storage Resource Broker & Meta-data Catalog [diagram] • Application • SRB • MCAT: Dublin Core, Application Meta-data • Storage resources: Unix, DB2, Oracle, ADSM, HPSS • Identifiers: File SID, DBLobj SID, Obj SID • Other labels: Resource, User, Remote Proxies, Third-party copy, DataCutter
NVO Data Grid [diagram] • 1. NVO Portals and Workbenches • 2. Knowledge & Resource Management • Bulk Data Analysis, Metadata View, Data View, Catalog Analysis • 3. Concept space • Standard APIs and Protocols • 4. Grid Security, Caching, Replication, Backup, Scheduling • Information Discovery, Metadata delivery, Data Discovery, Data Delivery • 5. Standard Metadata format, Data model, Wire format • 6. Catalog Mediator, Data mediator • Catalog/Image Specific Access • Compute Resources, Catalogs, Data Archives, Derived Collections • 7. Data model, schema, services, and mapping to NVO concept space published into (2) when a collection joins the federation
Persistent Archive Framework • Persistent archive functionality - Accessioning platform • Data management - Archive Markup Language (AML), Container management • Collection management - Validation of collection, collection characterization • Knowledge management - Workflow staging, procedure management for ingestion process, anomaly detection, characterization of inherent implied knowledge • Scale - collections of millions to billions of objects
Globus Data Grid Architecture • Appln - Discipline-Specific Data Grid Application • User - Coherency control, replica selection, task management, virtual data catalog, virtual data code catalog, … • Collective - Replica catalog, replica management, co-allocation, certificate authorities, metadata catalogs • Resource - Access to data, access to computers, access to network performance data, … • Connect - Communication, service discovery (DNS), authentication, authorization, delegation • Fabric - Storage systems, clusters, networks, network caches, …
Persistent Archive Framework • Persistent archive functionality - Repository • Data management - Storage system (robot, media, caching software), media migration, disaster recovery (archive namespace to container mapping) • Collection management - Container to object mapping, object metadata storage • Knowledge management - Transaction logging, AML migration on access or on media migration • Scale - thousands of collections, billions of objects, petabytes of data
Globus Protocols, Services, and Interfaces Occur at Each Level • Applications • Languages/Frameworks • User Service APIs and SDKs; User Service Protocols; User Services • Collective Service APIs and SDKs; Collective Service Protocols; Collective Services • Resource APIs and SDKs; Resource Service Protocols; Resource Services • Connectivity APIs; Connectivity Protocols • Local Access APIs and Protocols • Fabric Layer
Persistent Archive Framework • Persistent archive functionality - Access platform • Data management - Data caching, container caching, disk cache management • Information management - Collection instantiation, access query, browsing support • Knowledge management - Order processing and workflow tracking, product authentication, usage characterization, presentation management • Scale - Millions of accesses per day
Globus Layered Grid Architecture (By Analogy to Internet Architecture) • Application • User - “Specialized services”: user- or appln-specific distributed services • Collective - “Managing multiple resources”: ubiquitous infrastructure services • Resource - “Sharing single resources”: negotiating access, controlling use • Connectivity - “Talking to things”: communication (Internet protocols) & security • Fabric - “Controlling things locally”: access to, & control of, resources • Internet Protocol Architecture: Application, Transport, Internet, Link
Persistent Archive Framework • Persistent archive functionality - ARC • Data management - Finding aid storage • Collection management - Catalog of collections, access query, browse, disaster backup mechanisms, collection descriptors • Knowledge management - Characterization of finding aid efficiency, presentation management, concept spaces spanning collections • Scale - thousands of collections
NSDL [diagram labels] • Referenced Items & Collections • Portals & Clients • NSDL Services, Other NSDL Services • NSDL Collections • Core Services: annotation, metadata normalizing • CI Services: query transform, topic-map registry, personalization, discussion, visualization... • Core Collection-Building Services: metadata harvesting, persistent storage • User Interfaces, Usage Enhancement, Delivery, Presentation, Aggregation - Channels, Information about collections • Core NSDL Bus • Meta-data delivery, Data delivery, Query, Global Ids, Security, Network • Metadata & data access-based services, Virtual Collections & Mediators, Collection Building
Cross Cutting Issues • Global namespace • Metadata used by data handling system to locate containers • Metadata used to characterize objects in containers • Metadata used to characterize collections • Metadata used to locate collections • Consistency of metadata while updating
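The layers of metadata listed above (locating collections, characterizing objects in containers, locating containers) suggest a two-step resolution from a logical name to a physical location. The sketch below is a hypothetical in-memory model, not the MCAT schema; class names, fields, and example paths are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ContainerLocation:
    storage_system: str
    path: str


class GlobalNamespace:
    """Resolves (collection, object name) to a container location and byte range."""

    def __init__(self) -> None:
        # collection -> object name -> (container id, offset, length)
        self._objects: Dict[str, Dict[str, Tuple[str, int, int]]] = {}
        # container id -> physical location
        self._containers: Dict[str, ContainerLocation] = {}

    def register_container(self, container_id: str, location: ContainerLocation) -> None:
        self._containers[container_id] = location

    def register_object(
        self, collection: str, name: str, container_id: str, offset: int, length: int
    ) -> None:
        # Consistency check: an object may only point at a known container.
        if container_id not in self._containers:
            raise KeyError(f"unknown container {container_id}")
        self._objects.setdefault(collection, {})[name] = (container_id, offset, length)

    def resolve(self, collection: str, name: str) -> Tuple[ContainerLocation, int, int]:
        container_id, offset, length = self._objects[collection][name]
        return self._containers[container_id], offset, length

    def relocate_container(self, container_id: str, location: ContainerLocation) -> None:
        # One update moves every object in the container; logical names are unchanged.
        if container_id not in self._containers:
            raise KeyError(f"unknown container {container_id}")
        self._containers[container_id] = location
```

Keeping the object-to-container and container-to-location mappings separate is what makes a media migration a single metadata update rather than one update per object.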
Cross Cutting Issues • Knowledge management • Workflow systems to monitor state of system, monitor transactions, monitor updates to system architecture, monitor consistency of global namespace • Data distribution • Caching of data between accessioning platform, archive, and access platform • Consistency during updates
Cross Cutting Issues • Security • Authentication across platforms • Authorization across platforms for updates • Consistency of architecture • Audit trails for updates • Validation of integrity of system • State management for system components
Research Challenges - 2000 • Infrastructure independence • Progress on archivable form creation • Digital paper • Finding aids for a million collections • Concept spaces that support identification of collection • Product authentication • Tracking all updates, movements, media migrations, collection instantiations • Choice of Archival Markup Language • Tracking of E-commerce implementations • Knowledge management systems • Workflow, ingestion processing steps, system evolution procedures, finding aid concept spaces
Further Information http://www.npaci.edu/DICE