Knowledge-Based Persistent Archives Reagan W. Moore San Diego Supercomputer Center

9500 Gilman Drive, La Jolla, CA 92093-0505
Phone: 858 534-5073
FAX: 858 534-5152
E-mail: moore@sdsc.edu

Data Intensive Computing Environment
Staff: Reagan Moore, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, Bertram Ludäscher, Richard Marciano, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
Students - GSRA: Martin Kuhl, Liying Sui, Yang Yu, Valter Crescenzi
Students - Undergrad Interns: Peter Shin, Roman Olshanowsky, Shabbar Tambawala, Pratik Mukhopadhyay, +/- NN

Research and Development Activities - FY00
• Demonstration of scalable systems
• Expansion of persistent archive framework
  • Knowledge-based persistent archives
• Demonstration of archivable forms for new types of data
  • Web, GIS, compound documents, collections
• Knowledge and anomaly processing
  • Tightness of fit of XML DTDs
  • Self validating archives as a preservation strategy

Topics
• Persistent archive functionality
• Characterization of Data / Information / Knowledge
• Integration of Digital Library, Grid environments, and Persistent Archives

Persistent Archive
• Manage digital objects for the “life of the republic”
• Maintain the ability to discover and access digital objects while the supporting hardware and software systems evolve

Fundamental Concept for a Persistent Archive
• Persistence requires migration over time onto new technology
• While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology
• A persistent archive is an interoperability system
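
The interoperability requirement above can be sketched in a few lines. This is only an illustration, assuming two hypothetical in-memory key-value stores stand in for the old and the new technology; the class and method names are not from the slides.

```python
class MigratingArchive:
    """Serve reads from either store while objects move to the new one."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old_store = old_store  # hypothetical stand-in for the old technology
        self.new_store = new_store  # hypothetical stand-in for the new technology

    def get(self, object_id: str) -> bytes:
        # Prefer the migrated copy, fall back to the legacy system.
        if object_id in self.new_store:
            return self.new_store[object_id]
        return self.old_store[object_id]

    def migrate(self, batch_size: int) -> int:
        """Copy up to batch_size not-yet-migrated objects; return how many moved."""
        pending = [k for k in self.old_store if k not in self.new_store]
        batch = pending[:batch_size]
        for object_id in batch:
            self.new_store[object_id] = self.old_store[object_id]
        return len(batch)
```

Every object stays readable through `get` at each point of the migration, which is the sense in which the archive acts as an interoperability system.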

Implicit Concepts for Persistent Archive
• Infrastructure independence
  • Data set access
  • Authentication
  • Collection management
  • Presentation
• Non-proprietary formatting
  • Information models
  • XML - Information markup language
  • GML - Graphics markup language
• Support for ingestion, management, access
  • Accessioning workbench, archive, access workbench

Standard Information Markup Language
• XML representation of metadata attributes
  • Standardization of DTDs - MOA II DTD for text
  • Standardization of markup language
• XML based representation of collection structure
  • Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, …)
• XML databases & XML organized data collections
  • Commercial systems: Excelon, TAMINO, Oracle8i
• XML based Topic Maps
  • Represent relationships between collection domain concepts, collection attributes

E-mail Collection
• Test of the scalability of the technology
  • Archived a one-million record E-mail collection (1999)
• Ingestion
  • Tagged E-mail using XML syntax (6 required, 13 optional, 1000 user-defined tags)
  • Created description of the collection
  • Aggregated E-mail into containers, stored in an archive
• Retrieved collection description, created database, and optimized for query
• Total time was 27 hours (used 10 Mbit/sec Ethernet)
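
The tagging and container steps of the E-mail test could look roughly like the sketch below. It is not the SDSC ingestion code: the tag names in `HEADER_TAGS` are hypothetical (the slide does not list the 6 required tags), and a tar file is used as an assumed stand-in for an archive container.

```python
import io
import tarfile
import xml.etree.ElementTree as ET
from email import message_from_string

# Hypothetical tag set; the actual required and optional tags are not given in the slides.
HEADER_TAGS = ("From", "To", "Date", "Subject")


def tag_email(raw: str) -> ET.Element:
    """Wrap the headers and body of one message in XML elements."""
    msg = message_from_string(raw)
    record = ET.Element("email")
    for name in HEADER_TAGS:
        ET.SubElement(record, name.lower()).text = msg.get(name, "")
    ET.SubElement(record, "body").text = msg.get_payload()
    return record


def build_container(records) -> bytes:
    """Aggregate tagged records into a single tar container (assumed format)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i, rec in enumerate(records):
            data = ET.tostring(rec, encoding="utf-8")
            info = tarfile.TarInfo(name=f"record-{i:07d}.xml")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Overall rate implied by the slide: one million records in 27 hours,
# which is roughly 10 records per second.
records_per_second = 1_000_000 / (27 * 3600)
```

Aggregating many small records into one container keeps the number of files the storage system has to track far below the number of records.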

What Types of Interoperability are Needed?
• Data management (digital objects)
  • Ability to work with multiple types of storage systems, across separate administration domains
• Information management (attributes)
  • Ability to define a collection independent of database choice
  • Ability to migrate collection onto new databases
• Knowledge management (relationships)
  • Ability to manage relationships
  • Ability to map domain concepts to collection attributes

Simplest Definitions
• Data
  • Digital object
  • Objects are streams of bits
• Information
  • Any tagged data, which is treated as an attribute
  • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
  • Relationships between attributes
  • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
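
One way to make the three definitions concrete is as three small record types. This is a sketch under assumptions: the class and field names are illustrative and do not come from the slides; only the four relationship kinds are taken from the text above.

```python
from dataclasses import dataclass

# The four kinds of relationship named on the slide.
RELATIONSHIP_KINDS = {
    "procedural/temporal",
    "structural/spatial",
    "logical/semantic",
    "functional",
}


@dataclass(frozen=True)
class DigitalObject:
    """Data: a digital object is a stream of bits."""
    object_id: str
    bits: bytes


@dataclass(frozen=True)
class Attribute:
    """Information: tagged data within, or associated with, a digital object."""
    object_id: str
    tag: str
    value: str


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two attributes (by tag)."""
    kind: str
    source_tag: str
    target_tag: str

    def __post_init__(self):
        if self.kind not in RELATIONSHIP_KINDS:
            raise ValueError(f"unknown relationship kind: {self.kind}")
```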

Types of Knowledge Relationships
• Logical / semantic
  • Digital Library cross-walks
• Temporal / procedural
  • Workflow systems
• Spatial / structural
  • GIS systems
• Functional / algorithmic
  • Scientific feature analysis

ANATOM

Data Archive
[Figure labels: Ingest Services, Management, Access Services; Ingestion platform, Data repositories, Access platform; Interoperability Standards, Interoperability Protocols]

Collection Based Persistent Archive
[Figure: matrix with columns Ingest Services, Management, Access Services]
• Information row labels: Attributes, Semantics; Information Repository; Attribute-based Query; SDLIP; XML DTD
• Between the rows: (Data Handling System - SRB / FTP / HTTP)
• Data row labels: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF

Knowledge Based Persistent Archive
[Figure: matrix with columns Ingest Services, Management, Access Services]
• Knowledge row labels: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; XTM DTD; Rules - KQL
• Between the rows: (Topic Maps / Buckets / Model-based Access)
• Information row labels: Attributes, Semantics; Information Repository; Attribute-based Query; SDLIP; XML DTD
• Between the rows: (Data Handling System - SRB / FTP / HTTP)
• Data row labels: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF

Ingestion Processes for Collection Creation
[Figure labels: Accession Template; Closure; Concept/Attribute; Attribute Inverse Indexing; Information Generation; Knowledge Generation; Attribute Selection; Attribute Tagging; Occurrence Tagging; View Management; Data Organization; Collection]

Examples of Implied Knowledge: Senate Legislative Activities
• Structural knowledge
  • Pertinent information embedded in document headers
• Procedural knowledge
  • Naming convention
    • Senator represented by last name
    • Senator represented by last name and state
    • Senator represented by last name, first name, and state
• Collection knowledge
  • Referenced senators include senators no longer in the senate
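
The naming convention is a rule that has to be captured if references are to be matched across documents. The sketch below is illustrative only: the string layouts it accepts ("Last", "Last (ST)", "Last, First (ST)") are assumptions, since the slides do not show how the three forms are actually written, and the example names are fictional.

```python
import re

# Assumed layouts: "Doe", "Doe (CA)", "Doe, Jane (CA)".
_REFERENCE = re.compile(
    r"(?P<last>[^,(]+?)"                # last name
    r"(?:,\s*(?P<first>[^(]+?))?"       # optional ", First"
    r"\s*(?:\((?P<state>[A-Z]{2})\))?"  # optional "(ST)"
)


def parse_reference(text: str) -> dict:
    """Split a senator reference into last name, first name, and state."""
    match = _REFERENCE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized reference: {text!r}")
    return match.groupdict()
```

For example, `parse_reference("Doe, Jane (CA)")` yields the last name, first name, and state as separate fields, while `parse_reference("Doe")` leaves first name and state as `None`, so the three forms can be compared on the fields they share.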

Knowledge Generation
• Accessioning Template
  • Defines the concepts under which the data objects will be tagged and organized
• Attribute selection
  • Define the attributes that represent the information content associated with the domain concepts
  • Tag attributes using a minimal constraint language, such as XML or XML Schema
  • Evaluate closure of mined attributes compared to expected attributes
  • Refine concept map
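
The closure evaluation can be read as a set comparison between the attributes the accessioning template expects and the attributes actually mined from the data. A minimal sketch, with hypothetical function names and under that reading of "closure":

```python
def closure_report(expected: set, mined: set) -> dict:
    """Compare expected attributes with the attributes found in the collection."""
    return {
        "missing": expected - mined,      # expected but never seen in the data
        "unexpected": mined - expected,   # seen in the data but not in the template
    }


def is_closed(expected: set, mined: set) -> bool:
    """True when the mined attributes match the expected attributes exactly."""
    report = closure_report(expected, mined)
    return not report["missing"] and not report["unexpected"]
```

A non-empty "unexpected" set is the signal to refine the concept map; a non-empty "missing" set suggests the template expects information the collection does not contain.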

Information Generation
• Create occurrence index
  • (Occurrence, attribute, value)
  • This is needed to be able to recreate original form of digital object
• Analyze completeness of information
  • Inverse index of attribute values
  • Identifies unexpected values - consistency
• Analyze closure of collection
  • Are additional attributes needed to represent inverse index value ranges?
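
The occurrence index and the inverse index can be sketched as follows. The data layout is an assumption made for illustration: each object is a list of (attribute, value) pairs in document order, and an occurrence is identified by (object id, position).

```python
from collections import defaultdict


def occurrence_index(tagged_objects: dict) -> list:
    """Build (occurrence, attribute, value) triples from tagged objects."""
    triples = []
    for object_id, pairs in tagged_objects.items():
        for position, (attribute, value) in enumerate(pairs):
            triples.append(((object_id, position), attribute, value))
    return triples


def rebuild(triples: list, object_id: str) -> list:
    """Recreate the original ordered (attribute, value) pairs of one object."""
    own = [t for t in triples if t[0][0] == object_id]
    return [(attribute, value) for _, attribute, value in sorted(own)]


def inverse_index(triples: list) -> dict:
    """Map attribute -> value -> list of occurrences."""
    index = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in triples:
        index[attribute][value].append(occurrence)
    return index


def unexpected_values(index: dict, attribute: str, allowed: set) -> set:
    """Values seen for an attribute that fall outside the allowed set."""
    return set(index.get(attribute, {})) - allowed
```

Because each triple records where a value occurred, the tagged object can be rebuilt in its original order, and the inverse index exposes every distinct value of an attribute for consistency checks.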

Data Organization
• Archive preferred views of collection
  • Original data
  • XML tagged representation
  • Minimal representation of consolidated information
  • ‘Noise-free’ version based upon occurrence tags
  • Object-relational database version
• Archive occurrence tagged view
• Archive ingestion procedures that transform collection from the original digital objects to the preferred views

Information Management Projects
• Digital Libraries
  • NSF Digital Library Initiative, Phase II - UCSB, Stanford
  • Digital Embryo digital library - GMU
  • NPACI Digital Sky - Caltech 2MASS sky survey
  • CDL - AMICO
  • NSF NSDL - UCAR / DLESE
• Grid Environments
  • NASA Information Power Grid - NASA Ames
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford, Caltech
  • NSF Grid Physics Network - U Fl
• Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Scalable archives

ERA Concept model

Data Handling System: SDSC Storage Resource Broker & Meta-data Catalog
[Figure labels: Application; User; Resource; SRB; MCAT; Dublin Core; Application Meta-data; DataCutter; Remote Proxies; Third-party copy; File SID, DBLobj SID, Obj SID; Unix, DB2, Oracle, ADSM, HPSS]

NVO Data Grid
[Figure: layered architecture with numbered callouts]
1. NVO Portals and Workbenches
2. Knowledge & Resource Management
3. Concept space
4. Grid
5. Standard Metadata format, Data model, Wire format
6. Catalog Mediator, Data mediator
7. Data model, schema, services, and mapping to NVO concept space published into (2) when a collection joins the federation
Other figure labels: Bulk Data Analysis, Metadata View, Data View, Catalog Analysis; Standard APIs and Protocols; Security, Caching, Replication, Backup, Scheduling; Information Discovery, Metadata delivery, Data Discovery, Data Delivery; Catalog/Image Specific Access; Compute Resources, Catalogs, Data Archives, Derived Collections

Persistent Archive Framework
• Persistent archive functionality - Accessioning platform
• Data management - Archive Markup Language (AML), container management
• Collection management - Validation of collection, collection characterization
• Knowledge management - Workflow staging, procedure management for ingestion process, anomaly detection, characterization of inherent implied knowledge
• Scale - collections of millions to billions of objects

Globus Data Grid Architecture
• Appln - Discipline-Specific Data Grid Application
• User - Coherency control, replica selection, task management, virtual data catalog, virtual data code catalog, …
• Collective - Replica catalog, replica management, co-allocation, certificate authorities, metadata catalogs
• Resource - Access to data, access to computers, access to network performance data, …
• Connect - Communication, service discovery (DNS), authentication, authorization, delegation
• Fabric - Storage systems, clusters, networks, network caches, …

Persistent Archive Framework
• Persistent archive functionality - Repository
• Data management - Storage system (robot, media, caching software), media migration, disaster recovery (archive namespace to container mapping)
• Collection management - Container to object mapping, object metadata storage
• Knowledge management - Transaction logging, AML migration on access or on media migration
• Scale - thousands of collections, billions of objects, petabytes of data
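
The two mappings named on this slide (archive namespace to container, and container to object) can be illustrated with a small in-memory model. This is a sketch under assumptions: the class names, the byte-offset index, and the identifiers are hypothetical and do not describe the SRB container format.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Many objects packed into one byte stream, with a container-to-object index."""
    data: bytearray = field(default_factory=bytearray)
    index: dict = field(default_factory=dict)  # object_id -> (offset, length)

    def add(self, object_id: str, payload: bytes) -> None:
        self.index[object_id] = (len(self.data), len(payload))
        self.data.extend(payload)

    def read(self, object_id: str) -> bytes:
        offset, length = self.index[object_id]
        return bytes(self.data[offset:offset + length])


class Repository:
    """Archive namespace on top of a set of containers."""

    def __init__(self):
        self.containers = {}  # container_id -> Container
        self.namespace = {}   # archive name -> (container_id, object_id)

    def store(self, name: str, container_id: str, object_id: str, payload: bytes) -> None:
        container = self.containers.setdefault(container_id, Container())
        container.add(object_id, payload)
        self.namespace[name] = (container_id, object_id)

    def fetch(self, name: str) -> bytes:
        container_id, object_id = self.namespace[name]
        return self.containers[container_id].read(object_id)
```

Keeping the namespace separate from the containers is what lets media migration move a container without changing the names by which objects are found.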

Globus Protocols, Services, and Interfaces Occur at Each Level
• Applications
• Languages/Frameworks
• User Service APIs and SDKs
• User Service Protocols
• User Services
• Collective Service APIs and SDKs
• Collective Service Protocols
• Collective Services
• Resource APIs and SDKs
• Resource Service Protocols
• Resource Services
• Connectivity APIs
• Connectivity Protocols
• Local Access APIs and Protocols
• Fabric Layer

Persistent Archive Framework
• Persistent archive functionality - Access platform
• Data management - Data caching, container caching, disk cache management
• Information management - Collection instantiation, access query, browsing support
• Knowledge management - Order processing and workflow tracking, product authentication, usage characterization, presentation management
• Scale - Millions of accesses per day

Globus Layered Grid Architecture (By Analogy to Internet Architecture)
• Application
• User - “Specialized services”: user- or appln-specific distributed services
• Collective - “Managing multiple resources”: ubiquitous infrastructure services
• Resource - “Sharing single resources”: negotiating access, controlling use
• Connectivity - “Talking to things”: communication (Internet protocols) & security
• Fabric - “Controlling things locally”: access to, & control of, resources
Internet Protocol Architecture: Application, Transport, Internet, Link

Persistent Archive Framework
• Persistent archive functionality - ARC
• Data management - Finding aid storage
• Collection management - Catalog of collections, access query, browse, disaster backup mechanisms, collection descriptors
• Knowledge management - Characterization of finding aid efficiency, presentation management, concept spaces spanning collections
• Scale - thousands of collections

[Figure: NSDL architecture]
• Top-level blocks: Portals & Clients; NSDL Services; Other NSDL Services; NSDL Collections; Referenced Items & Collections
• Core Services: annotation, metadata normalizing
• Core Collection-Building Services: metadata harvesting, persistent storage
• CI Services: query transform, topic-map registry, personalization, discussion, visualization...
• Core NSDL Bus: Meta-data delivery, Data delivery, Query, Global Ids, Security
• Other labels: User Interfaces; Usage Enhancement; Delivery; Presentation; Aggregation - Channels; Information about collections; Network; Metadata & data access-based services; Virtual Collections & Mediators; Collection Building

Cross Cutting Issues
• Global namespace
  • Metadata used by data handling system to locate containers
  • Metadata used to characterize objects in containers
  • Metadata used to characterize collections
  • Metadata used to locate collections
  • Consistency of metadata while updating

Cross Cutting Issues
• Knowledge management
  • Workflow systems to monitor state of system, monitor transactions, monitor updates to system architecture, monitor consistency of global namespace
• Data distribution
  • Caching of data between accessioning platform, archive, and access platform
  • Consistency during updates

Cross Cutting Issues
• Security
  • Authentication across platforms
  • Authorization across platforms for updates
• Consistency of architecture
  • Audit trails for updates
  • Validation of integrity of system
  • State management for system components
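
Audit trails and integrity validation can be combined in a tamper-evident log. The slides do not say how this was done; the sketch below uses hash chaining with SHA-256 as one assumed technique, and the function names and event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # digest used as the predecessor of the first entry


def _digest(prev: str, event: dict) -> str:
    # Serialize deterministically so the same event always hashes the same way.
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an update event, chaining it to the previous entry's digest."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"event": event, "prev": prev, "digest": _digest(prev, event)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["digest"] != _digest(prev, entry["event"]):
            return False
        prev = entry["digest"]
    return True
```

Because each digest depends on every earlier entry, validating the last digest validates the whole update history.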

Research Challenges - 2000
• Infrastructure independence
  • Progress on archivable form creation
  • Digital paper
• Finding aids for a million collections
  • Concept spaces that support identification of collection
• Product authentication
  • Tracking all updates, movements, media migrations, collection instantiations
• Choice of Archival Markup Language
  • Tracking of E-commerce implementations
• Knowledge management systems
  • Workflow, ingestion processing steps, system evolution procedures, finding aid concept spaces

Further Information http://www.npaci.edu/DICE
