

Workshop on Research Challenges in Digital Archiving: Towards a National Infrastructure for Long-Term Preservation of Digital Information. Architecture for Repositories. Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu, http://www.npaci.edu/DICE/


Presentation Transcript


  1. Workshop on Research Challenges in Digital Archiving: Towards a National Infrastructure for Long-Term Preservation of Digital Information. Architecture for Repositories. Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu, http://www.npaci.edu/DICE/

  2. Data and Knowledge Systems Group
     • Staff: Reagan Moore, Ilkai Altintas, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, M. Kulrul, Bertram Ludäscher, Richard Marciano, A. Memon, XuFei Qian, Roman Olshanowsky, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
     • Graduate Students: A. Bagchi, S. Bansal, A. Behere, R. Bharath, S. Bharath, L. Sui
     • Undergraduate Interns: N. Cotofana, D. Le, J. Trang, L. Yin
     • +/- NN

  3. Topics
     • Digital entities
     • Digital archiving approaches
     • Common infrastructure requirements
     • Research issues

  4. Digital Archive Approaches
     • Archivist - support archival processes
     • Librarian - provide metadata catalog
     • Scientist - manage data copies on tape
     • Technologist - manage technology evolution

  5. Digital Entities
     • Digital entities are "images of reality", made of
       • Data, the bits (zeros and ones) put on a storage system
       • Information, the attributes used to assign semantic meaning to the data
       • Knowledge, the structural relationships described by a data model
     • Every digital entity requires information and knowledge to correctly interpret and display
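The three-part definition above can be sketched as a small data structure. This is an illustrative sketch only, not code from the presentation: the class name `DigitalEntity`, the `encoding` and `units` attributes, and the triple form used for relationships are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    """Hypothetical model of a digital entity: data + information + knowledge."""
    data: bytes                                      # the bits put on a storage system
    information: dict = field(default_factory=dict)  # attributes that give the bits meaning
    knowledge: list = field(default_factory=list)    # (subject, relation, object) triples

    def interpret(self) -> str:
        """Decode the bits; this fails when the needed attribute is absent."""
        encoding = self.information.get("encoding")
        if encoding is None:
            raise ValueError("cannot interpret bits without an 'encoding' attribute")
        return self.data.decode(encoding)


entity = DigitalEntity(
    data="Temperature: 21".encode("utf-8"),
    information={"encoding": "utf-8", "units": "celsius"},
    knowledge=[("units", "describes", "Temperature")],
)
bare = DigitalEntity(data=b"Temperature: 21")  # bits only, no information
```

Calling `entity.interpret()` succeeds, while `bare.interpret()` raises an error, which mirrors the point that the bits alone are not enough to interpret and display an entity.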

  6. Manipulating Images of Reality (diagram: Application, Operating System, Storage System, Display System, Digital Object)

  7. Technology Management - Emulation (diagram: Old Application, Wrap Application, New Operating System, New Storage System, New Display System, Digital Object)

  8. Technology Management - Migration (diagram: New Application, New Operating System, New Storage System, New Display System, Migrate Encoding Format, Digital Object)

  9. Technology Management - SDSC (diagram: New Application, New Operating System, Wrap Storage System, Wrap Display System, Old Storage System, Old Display System, Migrate Encoding Format, Digital Object)

  10. Data, Information, and Knowledge: Content of Digital Entities
     • Data
       • Digital object
       • Objects are streams of bits
     • Information
       • Any tagged data, which is treated as an attribute
       • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
     • Knowledge
       • Relationships between attributes
       • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional

  11. Infrastructure Independence
     • Characterize components of digital entity
       • Data
       • Information
       • Knowledge
     • Characterize repositories
       • Storage systems
       • Databases
       • Knowledge repositories

  12. Levels of Abstraction for Data
     • Abstraction for Digital Entity (digital entity: Files)
       • Logical: Data Model (units, semantics)
       • Physical: Encoding Format (syntax, structure)
     • Abstraction for Repository (repository: File System, Archive)
       • Logical: Name Space
       • Physical: Data Handling System - SRB/MCAT

  13. Information Management
     • Abstraction layer for interacting with information repositories
       • Manage the schema and physical table structures of a database
     • Extensible schema
       • User defined attributes
     • Extensible Metadata CATalog (EMCAT) manages collections
     • mySRB.html interface supports dynamic collection creation
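One common way to get an extensible schema with user-defined attributes is an attribute-value table, sketched below with Python's standard `sqlite3` module. This is not EMCAT's actual schema or API, which the slides do not show; the table names, the helper functions `add_entity` and `find`, and the sample paths and values are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the information repository (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
    CREATE TABLE attribute (
        entity_id INTEGER REFERENCES entity(id),
        name      TEXT,
        value     TEXT
    );
""")


def add_entity(logical_name, **attributes):
    """Register an entity with any user-defined attributes; no schema change needed."""
    cur = conn.execute("INSERT INTO entity (logical_name) VALUES (?)", (logical_name,))
    entity_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        [(entity_id, key, str(val)) for key, val in attributes.items()],
    )
    return entity_id


def find(name, value):
    """Return logical names of entities whose attribute `name` equals `value`."""
    rows = conn.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN attribute a ON a.entity_id = e.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY e.logical_name",
        (name, value),
    )
    return [row[0] for row in rows]


add_entity("/demo/scan001.tif", creator="alice", format="TIFF")
add_entity("/demo/notes.xml", creator="alice", reviewer="bob")  # new attribute, same tables
```

The second call introduces a `reviewer` attribute that the first entity never had, without altering any table definition. The trade-off of this layout is that all values are stored as text, so typed queries need extra handling.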

  14. Levels of Abstraction for Information
     • Abstraction for Digital Entity (digital entity: Metadata Attributes)
       • Logical: Collection Schema
       • Physical: XML Syntax
     • Abstraction for Repository (repository: Database)
       • Logical: Database Schema
       • Physical: EMCAT/CWM

  15. SDSC Storage Resource Broker & Meta-data Catalog: Levels of Abstraction (diagram; labels as extracted)
     • Application
     • Access APIs: C, C++ Libraries; Unix Shell; Linux I/O; Web; WSDL; DLL / Python; Java, NT Browsers; GridFTP
     • Consistency Management / Authorization-Authentication
     • Prime Server
     • Logical Name Space; Latency Management; Data Transport; Metadata Transport
     • Catalog Abstraction: Databases (DB2, Oracle, Sybase)
     • Storage Abstraction: Databases (DB2, Oracle, Postgres); Archives (HPSS, ADSM, UniTree, DMF); File Systems (Unix, NT, Mac OSX); Servers (HRM)

  16. Knowledge Management - Characterizing Properties of Collections
     • Characterization of relationships between attributes
       • Semantic / logical - cross-walks
       • Procedural / temporal - records management
       • Structural / spatial - GIS
     • Characterization of knowledge repository operations
       • Mapping from collection attributes to discipline concepts
       • Mapping from knowledge relationships to rules for application in inference engines
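A semantic cross-walk, in its simplest form, is a mapping from one collection's attribute names to the concept names a discipline uses. The sketch below assumes such a flat one-to-one mapping; the attribute names, the `CROSSWALK` table, and the function `apply_crosswalk` are invented for illustration and do not come from the slides.

```python
# Hypothetical cross-walk: collection attribute name -> discipline concept name.
CROSSWALK = {
    "creator": "author",
    "date_created": "date",
    "title": "title",
}


def apply_crosswalk(record, crosswalk):
    """Split a record into attributes mapped to concepts and attributes with no mapping."""
    mapped, unmapped = {}, {}
    for name, value in record.items():
        if name in crosswalk:
            mapped[crosswalk[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped


record = {"creator": "A. Author", "date_created": "2001-05-01", "scanner_id": "S7"}
mapped, unmapped = apply_crosswalk(record, CROSSWALK)
```

Keeping the unmapped attributes visible, rather than silently dropping them, shows where the cross-walk is incomplete. Real cross-walks often need one-to-many mappings or value conversions, which this sketch does not cover.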

  17. Levels of Abstraction for Knowledge
     • Abstraction for Digital Entity (digital entity: Concept Space, an ontology instance)
       • Logical: Relationship Schema
       • Physical: ER/UML/XMI/RDF syntax
     • Abstraction for Repository (repository: Knowledge Repository)
       • Logical: Knowledge Repository Schema
       • Physical: Model-based Mediation System

  18. Data Management Systems
     • Persistent archives
       • Archival processes
       • Management of technology evolution
     • Data grids
       • Federation of multiple collections
       • Data management across administration domains
     • Digital libraries
       • Management of services for discovery, display, annotation

  19. Processes versus Data
     • Build infrastructure to support processing of digital entities
     • Build infrastructure to manage storage of digital entities
     • Build infrastructure to support access to digital entities

  20. ERA Concept model

  21. Archival Processes
     • Appraisal - determine the archivable content
     • Accession - determine the initial physical location for the data, and the relationship of the new collection to existing collections
     • Arrangement - add administrative control, describe the information content (provenance, authenticity, structure, administrative), and decompose digital objects into their components as needed
     • Description - complete the definition of collection attributes by iterating between arrangement, reformatting, and representation
     • Preservation - build an archivable form of the digital entities, characterize the collection context, and manage their storage
     • Access - provide query mechanisms for discovering, retrieving, and presenting the digital entities

  22. Self-Instantiating Archive
     • Archive the processes that are used to control the ingestion process
       • Conversion to archivable form
       • Annotation of information content
     • When accessing the collection, retrieve the processes and the original digital objects
       • Apply the processing steps to re-create the information content
       • Query the result to discover desired digital objects
     • A self-instantiating archive is a virtual data grid
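The idea on this slide can be sketched in a few lines: store the original objects together with a description of the ingestion steps, then re-apply those steps on access. This is a toy sketch under stated assumptions, not the presenter's system. The step names, the `STEPS` registry, and the functions `ingest` and `instantiate` are hypothetical, and only the step names are archived here; a real archive would also have to preserve the step implementations themselves.

```python
import json

# Hypothetical registry of ingestion steps, keyed by a name that can be archived.
STEPS = {
    "normalize_whitespace": lambda text: " ".join(text.split()),
    "extract_title": lambda text: {"title": text.split(".")[0], "body": text},
}


def ingest(raw_objects, step_names):
    """Archive the original objects together with the names of the processing steps."""
    return json.dumps({"steps": step_names, "objects": raw_objects})


def instantiate(archive_blob):
    """Re-create the information content by re-applying the archived steps in order."""
    archive = json.loads(archive_blob)
    results = []
    for obj in archive["objects"]:
        for name in archive["steps"]:
            obj = STEPS[name](obj)
        results.append(obj)
    return results


blob = ingest(
    ["  Annual report.   Budget rose. "],
    ["normalize_whitespace", "extract_title"],
)
collection = instantiate(blob)

# Query the re-created content to discover the desired objects.
hits = [item for item in collection if "report" in item["title"].lower()]
```

The archive blob holds only the untouched originals and the recipe; the annotated collection is derived on demand, in the same way a virtual data grid derives data products from recorded procedures.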

  23. Persistent Archives (Similar requirements to a data grid)
     • Name transparency
       • Find a file by attributes (map from attributes to global name)
     • Location transparency
       • Access a file by a global identifier (map from global to local file name)
     • Access transparency
       • Use same API to access data in archive or file cache
     • Authenticity
       • Disaster recovery, replicate data across storage systems
       • Audit and process management
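The three transparencies above can be illustrated with a toy logical name space. This sketch does not use or imitate the SRB/MCAT API; the classes `DictStore` and `LogicalNameSpace`, their methods, and the sample paths are assumptions made for the example, with plain dictionaries standing in for storage systems.

```python
class DictStore:
    """Hypothetical stand-in for a storage system (an archive or a file cache)."""

    def __init__(self):
        self.files = {}  # local file name -> bytes

    def read(self, local_name):
        return self.files[local_name]


class LogicalNameSpace:
    def __init__(self):
        self.attributes = {}  # global name -> attribute dict
        self.replicas = {}    # global name -> list of (store, local name)

    def register(self, global_name, attributes, store, local_name):
        """Record one replica of an entity and merge in its attributes."""
        self.attributes.setdefault(global_name, {}).update(attributes)
        self.replicas.setdefault(global_name, []).append((store, local_name))

    def find(self, **query):
        """Name transparency: map attributes to global names."""
        return sorted(
            name for name, attrs in self.attributes.items()
            if all(attrs.get(key) == val for key, val in query.items())
        )

    def read(self, global_name):
        """Location and access transparency: one call, whichever replica answers."""
        for store, local_name in self.replicas[global_name]:
            try:
                return store.read(local_name)
            except KeyError:
                continue  # this replica is missing; try the next copy
        raise FileNotFoundError(global_name)


archive, cache = DictStore(), DictStore()
archive.files["/tape/0001"] = b"census 1990"

ns = LogicalNameSpace()
ns.register("/demo/census", {"year": 1990}, cache, "/tmp/c1")      # cache copy was evicted
ns.register("/demo/census", {"year": 1990}, archive, "/tape/0001")
```

Here `ns.find(year=1990)` resolves attributes to the global name, and `ns.read("/demo/census")` falls back from the empty cache to the archive copy without the caller knowing where the bits live. The replica list is also what makes the disaster-recovery bullet possible: losing one storage system leaves the other copies reachable under the same name.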

  24. Common Approach (digital library, data grid, persistent archive)
     • Logical name space used to organize digital entities, and associate attributes
     • Separation of information management from data storage management
     • Definition of abstraction mechanisms for dealing with repositories
     • Emergence of need for knowledge management

  25. Data Naming Ontologies

  26. Knowledge Based Data Grids (diagram; columns are Ingest Services, Management, Access Services)
     • Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse
     • Information: Attributes, Semantics; Information Repository; Attribute-based Query
     • Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query
     • Other labels: XTM DTD; Rules - KQL (Model-based Access); XML DTD; SDLIP; (Data Handling System - SRB); Grids; MCAT/HDF

  27. Further Information http://www.npaci.edu/DICE
