Workshop on Research Challenges in Digital Archiving: Towards a National Infrastructure for Long-Term Preservation of Digital Information Architecture for Repositories Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE/
Data and Knowledge Systems Group
• Staff: Reagan Moore, Ilkai Altintas, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, M. Kulrul, Bertram Ludäscher, Richard Marciano, A. Memon, XuFei Qian, Roman Olshanowsky, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
• Graduate Students: A. Bagchi, S. Bansal, A. Behere, R. Bharath, S. Bharath, L. Sui
• Undergraduate Interns: N. Cotofana, D. Le, J. Trang, L. Yin
• +/- NN
Topics • Digital entities • Digital archiving approaches • Common infrastructure requirements • Research issues
Digital Archive Approaches • Archivist - support archival processes • Librarian - provide metadata catalog • Scientist - manage data copies on tape • Technologist - manage technology evolution
Digital Entities
• Digital entities are “images of reality”, made of
  • Data, the bits (zeros and ones) put on a storage system
  • Information, the attributes used to assign semantic meaning to the data
  • Knowledge, the structural relationships described by a data model
• Every digital entity requires information and knowledge to be correctly interpreted and displayed
Manipulating Images of Reality [Diagram labels: Application; Operating System; Storage System; Display System; Digital Object]
Technology Management - Emulation [Diagram labels: Old Application; Wrap Application; New Operating System; New Storage System; New Display System; Digital Object]
Technology Management - Migration [Diagram labels: New Application; New Operating System; New Storage System; New Display System; Migrate Encoding Format; Digital Object]
Technology Management - SDSC [Diagram labels: New Application; New Operating System; Wrap Storage System; Wrap Display System; Old Storage System; Old Display System; Migrate Encoding Format; Digital Object]
Data, Information, and Knowledge Content of Digital Entities
• Data
  • Digital object
  • Objects are streams of bits
• Information
  • Any tagged data, which is treated as an attribute
  • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
  • Relationships between attributes
  • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
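The three components on this slide can be pictured as a small data structure. The following is a minimal sketch, not the SDSC implementation: the class names, field names, the `relationships_for` helper, and the sample attribute values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two attributes (hypothetical model)."""
    kind: str      # e.g. "procedural/temporal", "structural/spatial"
    source: str    # name of the first attribute
    relation: str  # illustrative relation label
    target: str    # name of the second attribute


@dataclass
class DigitalEntity:
    """A digital entity as data plus information plus knowledge."""
    data: bytes                                                  # the stream of bits
    information: Dict[str, str] = field(default_factory=dict)    # tagged attributes
    knowledge: List[Relationship] = field(default_factory=list)  # attribute relationships

    def relationships_for(self, attribute: str) -> List[Relationship]:
        """Return every relationship that mentions the given attribute."""
        return [r for r in self.knowledge if attribute in (r.source, r.target)]


# Illustrative values only; none of these come from a real collection.
ordering = Relationship("procedural/temporal", "created", "precedes", "archived")
entity = DigitalEntity(
    data=b"\x00\x01\x02",
    information={"created": "2001-03-01", "archived": "2001-06-15", "format": "TIFF"},
    knowledge=[ordering],
)
print(entity.relationships_for("archived"))
```

Keeping the bits, the attributes, and the relationships in separate fields mirrors the slide's point that each can be stored in a different kind of repository.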
Infrastructure Independence
• Characterize components of digital entity
  • Data
  • Information
  • Knowledge
• Characterize repositories
  • Storage systems
  • Databases
  • Knowledge repositories
Levels of Abstraction for Data
• Abstraction for Digital Entity (Digital Entity: Files)
  • Logical: Data Model (units, semantics)
  • Physical: Encoding Format (syntax, structure)
• Abstraction for Repository (Repository: File System, Archive)
  • Logical: Name Space
  • Physical: Data Handling System - SRB/MCAT
Information Management • Abstraction layer for interacting with information repositories • Manage the schema and physical table structures of a database • Extensible schema • User defined attributes • Extensible Metadata CATalog (EMCAT) manages collections • mySRB.html interface supports dynamic collection creation
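One common way to get an extensible schema with user-defined attributes is to store attributes as rows rather than columns, so a new attribute needs no change to the physical tables. The sketch below uses Python's standard `sqlite3` module to illustrate that idea. It is an assumption for illustration only and does not describe how EMCAT is actually built; the table names, the `add_entity` and `find` helpers, and the sample values are hypothetical.

```python
import sqlite3

# In-memory database with one table for entities and one for their attributes.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (
        id INTEGER PRIMARY KEY,
        logical_name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE attribute (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        name TEXT NOT NULL,
        value TEXT NOT NULL,
        PRIMARY KEY (entity_id, name)
    );
    """
)


def add_entity(logical_name, **attributes):
    """Register an entity together with any user-defined attributes."""
    cursor = conn.execute(
        "INSERT INTO entity (logical_name) VALUES (?)", (logical_name,)
    )
    entity_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        [(entity_id, name, str(value)) for name, value in attributes.items()],
    )
    return entity_id


def find(name, value):
    """Return the logical names of entities whose attribute `name` equals `value`."""
    rows = conn.execute(
        """
        SELECT e.logical_name
        FROM entity AS e JOIN attribute AS a ON a.entity_id = e.id
        WHERE a.name = ? AND a.value = ?
        ORDER BY e.logical_name
        """,
        (name, value),
    )
    return [row[0] for row in rows]


# The second entity introduces a new attribute ("seeing") without a schema change.
add_entity("survey/plate-17", instrument="telescope-a")
add_entity("survey/plate-18", instrument="telescope-a", seeing="poor")
print(find("seeing", "poor"))
```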
Levels of Abstraction for Information
• Abstraction for Digital Entity (Digital Entity: Metadata Attributes)
  • Logical: Collection Schema
  • Physical: XML Syntax
• Abstraction for Repository (Repository: Database)
  • Logical: Database Schema
  • Physical: EMCAT/CWM
SDSC Storage Resource Broker & Meta-data Catalog: Levels of Abstraction
• Application
• Access APIs: C, C++, Libraries; Unix Shell; Linux I/O; DLL / Python; Java, NT Browsers; Web; WSDL; GridFTP
• Consistency Management / Authorization-Authentication; Prime Server
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction: Databases (DB2, Oracle, Sybase)
• Storage Abstraction: Databases (DB2, Oracle, Postgres); Archives (HPSS, ADSM, UniTree, DMF); File Systems (Unix, NT, Mac OSX); Servers (HRM)
Knowledge Management - Characterizing Properties of Collections
• Characterization of relationships between attributes
  • Semantic / logical - cross-walks
  • Procedural / temporal - records management
  • Structural / spatial - GIS
• Characterization of knowledge repository operations
  • Mapping from collection attributes to discipline concepts
  • Mapping from knowledge relationships to rules for application in inference engines
Levels of Abstraction for Knowledge
• Abstraction for Digital Entity (Digital Entity: Concept Space, ontology instance)
  • Logical: Relationship Schema
  • Physical: ER/UML/XMI/RDF syntax
• Abstraction for Repository (Repository: Knowledge Repository)
  • Logical: Knowledge Repository Schema
  • Physical: Model-based Mediation System
Data Management Systems
• Persistent archives
  • Archival processes
  • Management of technology evolution
• Data grids
  • Federation of multiple collections
  • Data management across administration domains
• Digital libraries
  • Management of services for discovery, display, annotation
Processes versus Data • Build infrastructure to support processing of digital entities • Build infrastructure to manage storage of digital entities • Build infrastructure to support access to digital entities
Archival Processes
• Appraisal - determine the archivable content
• Accession - determine the initial physical location for the data, and the relationship of the new collection to existing collections
• Arrangement - add administration control, describe the information content (provenance, authenticity, structure, administrative), and decompose digital objects into their components as needed
• Description - complete the definition of collection attributes by iterating between arrangement, reformatting, and representation
• Preservation - build an archivable form of the digital entities, characterize the collection context, and manage their storage
• Access - provide query mechanisms for discovering, retrieving, and presenting the digital entities
Self-Instantiating Archive • Archive the processes that are used to control the ingestion process • Conversion to archivable form • Annotation of information content • When accessing the collection, retrieve the processes and the original digital objects • Apply the processing steps to re-create the information content • Query the result to discover desired digital objects • A self-instantiating archive is a virtual data grid
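The idea on this slide can be sketched in a few lines: the archive keeps the original digital objects together with the ordered list of ingestion processes, and the collection is re-created on access by re-applying those processes. This is a toy illustration under stated assumptions, not the SDSC system. The process functions, the `REGISTRY` lookup, the archive layout, and the sample objects are all hypothetical.

```python
def to_archivable_form(raw: bytes) -> dict:
    """Conversion to archivable form: decode the original bits into a text record."""
    return {"text": raw.decode("utf-8")}


def annotate(record: dict) -> dict:
    """Annotation of information content: add a derived attribute to the record."""
    annotated = dict(record)
    annotated["word_count"] = len(record["text"].split())
    return annotated


# Hypothetical lookup from archived process names to executable steps.
REGISTRY = {"to_archivable_form": to_archivable_form, "annotate": annotate}

# The archive stores the originals AND the ordered processing steps.
archive = {
    "originals": {
        "memo-1": b"short note",
        "memo-2": b"a longer archived memo",
    },
    "processes": ["to_archivable_form", "annotate"],
}


def instantiate(archive: dict, registry: dict) -> dict:
    """Re-create the information content by re-applying the archived processes."""
    collection = {}
    for object_id, raw in archive["originals"].items():
        result = raw
        for process_name in archive["processes"]:
            result = registry[process_name](result)
        collection[object_id] = result
    return collection


def query(collection: dict, predicate) -> list:
    """Return the sorted ids of the objects whose record satisfies the predicate."""
    return sorted(oid for oid, record in collection.items() if predicate(record))


collection = instantiate(archive, REGISTRY)
print(query(collection, lambda record: record["word_count"] >= 3))
```

Nothing derived is stored in this sketch: the attributes used by the query exist only after the archived processes have been re-run against the originals.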
Persistent Archives (Similar requirements to a data grid)
• Name transparency
  • Find a file by attributes (map from attributes to global name)
• Location transparency
  • Access a file by a global identifier (map from global to local file name)
• Access transparency
  • Use same API to access data in archive or file cache
• Authenticity
  • Disaster recovery, replicate data across storage systems
  • Audit and process management
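The two mappings named on this slide, attributes to global name and global name to local file name, can be illustrated with a small in-memory catalog. This is a minimal sketch under stated assumptions and is not the SRB/MCAT API: the `LogicalNameSpace` class, its methods, and every name and path used below are hypothetical.

```python
class LogicalNameSpace:
    """Toy catalog illustrating name transparency and location transparency."""

    def __init__(self):
        self._attributes = {}  # global name -> attribute dictionary
        self._replicas = {}    # global name -> list of (storage system, local file name)

    def register(self, global_name, attributes, replicas):
        """Record an entity, its attributes, and where its copies are stored."""
        self._attributes[global_name] = dict(attributes)
        self._replicas[global_name] = list(replicas)

    def find(self, **query):
        """Name transparency: map attribute values to global names."""
        return sorted(
            name
            for name, attrs in self._attributes.items()
            if all(attrs.get(key) == value for key, value in query.items())
        )

    def locate(self, global_name):
        """Location transparency: map a global name to its local file names."""
        return list(self._replicas[global_name])


namespace = LogicalNameSpace()
namespace.register(
    "/demo/collection/scan-001",
    {"creator": "example-user", "type": "image"},
    [("archive", "/archive/u1/f0001"), ("cache", "/cache/f0001")],
)
namespace.register(
    "/demo/collection/report-001",
    {"creator": "example-user", "type": "text"},
    [("archive", "/archive/u1/f0002")],
)

for name in namespace.find(type="image"):
    print(name, namespace.locate(name))
```

Holding two replicas for one global name also shows the disaster-recovery point: a caller that only knows the global name can be served from whichever storage system is available.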
Common Approach (digital library, data grid, persistent archive) • Logical name space used to organize digital entities, and associate attributes • Separation of information management from data storage management • Definition of abstraction mechanisms for dealing with repositories • Emergence of need for knowledge management
Knowledge Based Data Grids
[Diagram: three levels (Knowledge, Information, Data) across three columns (Ingest Services, Management, Access Services)]
• Knowledge: Relationships Between Concepts | Knowledge Repository for Rules | Knowledge or Topic-Based Query / Browse
• Information: Attributes, Semantics | Information Repository | Attribute-based Query
• Data: Fields, Containers, Folders | Storage (Replicas, Persistent IDs) | Feature-based Query
• Other labels: XTM DTD; Rules - KQL (Model-based Access); XML DTD; SDLIP; (Data Handling System - SRB); Grids; MCAT/HDF
Further Information http://www.npaci.edu/DICE