Slide 2:Context Fact:
LANL Research Library stores a significant scholarly collection locally (A&I databases, journal articles, …) and creates applications based on that collection.
Initial aDORe motivation:
Undo tight integration between data and application
Uniform approach for ingesting, storing, and disseminating LANL RL data collections
Bigger picture:
Allow for multiple, parallel applications on top of stored content
Create an environment that provides guarantees regarding long-term accessibility of stored content
Slide 3:aDORe characteristics Standards-based:
MPEG-21 Digital Item Declaration, MPEG-21 Digital Item Identification, URI, info URI, OAI-PMH, NISO OpenURL, SRU, Information Environment Service Registry, Internet Archive ARC file format, OAIS concepts, XML, XML Schema, XQuery.
Component-based, highly modular:
Multiple content repositories, Identifier Locator, Service Registry, Format Registry, Semantic Registry, Harvesting front-end, Dissemination front-end
Protocol-based:
Components expose (REST-based) Web services
All “read” services based on 4 standards: OAI-PMH, NISO OpenURL, SRU, XQuery.
Interaction between modules is protocol-driven.
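As a minimal sketch of what a protocol-based "read" service looks like from the client side, the snippet below builds OAI-PMH request URLs. The verbs (ListRecords, GetRecord) and the metadataPrefix and identifier arguments come from the OAI-PMH specification; the base URL, the sample identifier, and the helper name oai_pmh_url are assumptions made for illustration and are not taken from the aDORe implementation.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a repository's OAI-PMH interface (assumption).
BASE_URL = "http://repository.example.org/oai"


def oai_pmh_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL: the verb plus its arguments as a query string."""
    query = {"verb": verb}
    # Skip arguments that were passed as None so optional ones can be omitted.
    query.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(query)


# Harvest all records in a given metadata format.
list_url = oai_pmh_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Fetch a single record by identifier (the identifier is a made-up example).
get_url = oai_pmh_url(
    BASE_URL, "GetRecord", identifier="info:example/1", metadataPrefix="oai_dc"
)
```

Because only the URL shape is fixed by the protocol, a client written this way does not need to know what storage technology sits behind the interface.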
Slide 4:aDORe characteristics Scalable
Etc.
Slide 5:aDORe effort aDORe is 2 things:
A standards-based, repository federation architecture
Actual implementation of the architecture at LANL for local storage of digital assets
Prototype version was in production for 2 years!
Production version finalized June 2007.
Slide 6:aDORe overview Representing Digital Objects
MPEG-21 DID & DIDL to represent Digital Objects using XML packages
Identification of Digital Objects, datastreams, and XML Packages
Storing Digital Objects
Autonomous distributed repositories with OAI-PMH and OpenURL-based service interfaces
Locating Digital Objects, datastreams, and XML Packages
Identifier Locator
Registries:
Service Registry: Locating service interfaces for autonomous distributed repositories
Format Registry: Sharing media type identifiers across autonomous distributed repositories
Semantic Registry: Sharing intellectual content type identifiers across autonomous distributed repositories
Providing federated access to the autonomous distributed repositories:
OAI-PMH Federator: Harvesting XML packages
OpenURL Resolver: Requesting services pertaining to Digital Objects, datastreams, and XML Packages
Slide 7:Representing Digital Objects
Slide 8:sample Digital Object Create an XML-based surrogate for each Digital Object:
Glues all components together in a single XML Package
Contains all required metadata (descriptive, technical, identifiers, …) in the XML Package
Initial access format for all materials is the same (XML) irrespective of their native media type
Assign identifiers to the XML Package, the Digital Object, and the datastreams. Maintain original identifiers.
Slide 9:representing Digital Objects using MPEG-21 DID & DIDL An XML Package is available for every Digital Object
The Package is an XML document compliant with the MPEG-21 Digital Item Declaration Language, i.e. a DIDL document
The DIDL document typically contains:
By-Value: descriptive metadata datastream & ingest/repository related metadata
By-Reference: all constituent datastreams of the Digital Object
Creation of DIDL documents can be:
static, at ingestion time, cf. aDORe Archive
dynamic, via add-on capability to existing content management system, cf. Ghent University eRez add-ons
A new DIDL document is created when a new version of a previously ingested Digital Object is ingested (update is considered re-ingestion).
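The by-value / by-reference split can be sketched as follows. This is an illustrative outline only: it uses the DIDL element names Item, Descriptor, Statement, Component and Resource and the DIDL namespace URI as commonly published for MPEG-21 DID, while the function name build_didl, the sample metadata, and the datastream URL are assumptions. A real aDORe DIDL document carries more information (identifiers, digests, ingest metadata) than shown here.

```python
import xml.etree.ElementTree as ET

# DIDL namespace as commonly used for MPEG-21 DID (verify against the standard).
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"
ET.register_namespace("didl", DIDL_NS)


def q(tag):
    """Return the namespace-qualified name of a DIDL element."""
    return "{%s}%s" % (DIDL_NS, tag)


def build_didl(metadata_xml, datastreams):
    """Build a minimal DIDL document.

    metadata_xml: descriptive metadata as an XML string, embedded by value.
    datastreams:  list of (mime_type, url) pairs, included by reference.
    """
    didl = ET.Element(q("DIDL"))
    item = ET.SubElement(didl, q("Item"))

    # By-value: the metadata itself is embedded inside a Statement.
    descriptor = ET.SubElement(item, q("Descriptor"))
    statement = ET.SubElement(descriptor, q("Statement"), {"mimeType": "text/xml"})
    statement.append(ET.fromstring(metadata_xml))

    # By-reference: each datastream is only pointed to via a Resource.
    for mime_type, url in datastreams:
        component = ET.SubElement(item, q("Component"))
        ET.SubElement(component, q("Resource"), {"mimeType": mime_type, "ref": url})
    return didl


# Hypothetical example data.
example = build_didl(
    "<title>Sample article</title>",
    [("application/pdf", "http://repository.example.org/datastream/1")],
)
print(ET.tostring(example, encoding="unicode"))
```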
Slide 10:sample Digital Object
Slide 11:representing Digital Objects using MPEG-21 DID
Slide 12:Identification: digital objects, datastreams, DIDL documents
Slide 13:aDORe DIDLTools aDORe DIDLTools software is available from http://african.lanl.gov/aDORe/projects/DIDLTools/
Slide 14:The aDORe architecture
Slide 15:the aDORe architecture : 3 layers Layer 1: the aDORe repositories
Networked systems that host digital object content and that make that content accessible by exposing core service interfaces.
In LANL Implementation: XMLtapes and ARCfiles (aDORe Archive)
Other Content Management Systems can be turned into an aDORe repository by implementing the core service interfaces.
Layer 2: the aDORe federation components
Networked systems that facilitate presenting the aDORe repositories as a single logical repository; these federation components expose core service interfaces to allow access to their content.
Federation components are: Identifier Locator, Service Registry, Format Registry, Semantic Registry
Layer 3: the aDORe front-ends
Networked systems that make digital object content hosted in the multitude of physical aDORe repositories accessible by exposing core service interfaces that present those aDORe repositories as a single logical repository
aDORe front-ends are: OAI-PMH Federator, OpenURL Resolver
Slide 17:The aDORe architecture
Slide 19:aDORe repositories Networked systems that host digital object content and that have core service interfaces to facilitate access to that content.
Currently 2 types in LANL implementation:
XMLtapes concatenating XML Packages
ARCfiles concatenating datastreams
Combination of OAI-PMH and OpenURL-based core service interfaces
Generic XMLtape XQuery Resolver
Other Content Management Systems can be turned into an aDORe repository by implementing the core service interfaces.
Cf. Aleph
Cf. Ghent University eRez
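The idea of a file that concatenates XML Packages, with a separate index that can be rebuilt from the file at any time, can be sketched as below. This is not the actual XMLtape format: the tape and record element names, the in-memory byte string, and the regular-expression scan are all assumptions chosen to keep the example short. It also assumes that a package never contains the literal closing tag of the wrapper record.

```python
import io
import re


def write_tape(packages):
    """Concatenate (identifier, xml_bytes) pairs into one well-formed XML file."""
    buf = io.BytesIO()
    buf.write(b"<tape>\n")
    for identifier, xml_bytes in packages:
        buf.write(b'<record id="' + identifier.encode("utf-8") + b'">')
        buf.write(xml_bytes)
        buf.write(b"</record>\n")
    buf.write(b"</tape>\n")
    return buf.getvalue()


# Matches one wrapper record and captures its identifier and its payload.
RECORD = re.compile(rb'<record id="([^"]*)">(.*?)</record>', re.DOTALL)


def build_index(tape):
    """Scan the tape and map identifier -> (byte offset, byte length) of each package."""
    return {
        m.group(1).decode("utf-8"): (m.start(2), m.end(2) - m.start(2))
        for m in RECORD.finditer(tape)
    }


def read_package(tape, index, identifier):
    """Fetch one package by seeking directly to its recorded byte range."""
    offset, length = index[identifier]
    return tape[offset:offset + length]


# Hypothetical identifiers and payloads.
tape = write_tape([
    ("info:example/a", b"<didl>A</didl>"),
    ("info:example/b", b"<didl>B</didl>"),
])
index = build_index(tape)
```

Since the index is derived entirely from the file, it can be deleted and regenerated, or replaced by an index built with a different technology, without touching the stored packages.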
Slide 20:aDORe Archive : XMLtapes
Slide 21:aDORe Archive : XMLtape XQuery Resolver
Slide 22:aDORe Archive : ARCfiles
Slide 23:The aDORe architecture
Slide 25:Identifier Locator Stores all identifiers of aDORe repositories (DIDLDocumentIdentifier, digital object identifier, datastream identifier)
Loaded by retrieving identifiers from aDORe repositories using their “give me your identifiers” OpenURL service interface
Stores [identifier, repository identifier]
1 OpenURL-based service interface to the Identifier Locator
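Conceptually, the Identifier Locator is a lookup table from an identifier to the repository that holds it. A minimal in-memory sketch, assuming hypothetical class and method names and made-up identifiers (the LANL implementation is a networked service behind an OpenURL interface, not a Python dictionary):

```python
class IdentifierLocator:
    """Toy lookup table of [identifier, repository identifier] pairs."""

    def __init__(self):
        self._locations = {}

    def load(self, repository_id, identifiers):
        """Register identifiers reported by one repository."""
        for identifier in identifiers:
            self._locations[identifier] = repository_id

    def locate(self, identifier):
        """Return the repository holding the identifier, or None if unknown."""
        return self._locations.get(identifier)


locator = IdentifierLocator()
# In aDORe the identifier lists would come from each repository's
# "give me your identifiers" service; here they are hard-coded examples.
locator.load("tape-2007-01", ["info:example/didl/1", "info:example/object/1"])
locator.load("arc-2007-01", ["info:example/datastream/1"])
```

A front-end can then ask the locator where an identifier lives and forward the request to that repository's own service interface.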
Slide 27:Service Registry
Slide 28:Registries: Service Registry
Slide 32:Registries: Format Registry
Slide 33:Registries: Semantic Registry
Slide 34:The aDORe architecture
Slide 36:Expose aDORe repositories as a single repository
Slide 37:OAI-PMH Federator
Slide 38:OpenURL Resolver
Slide 40:OpenURL Resolver (a bit more)
Slide 42:LANL aDORe implementation
Slide 43:LANL aDORe software Largely based on off-the-shelf software components:
Berkeley DB Java Edition
Heritrix toolkit
MySQL db
OCLC OAICat
OCLC OpenURL software
Ockam IESR service registry
aDORe Archive software (Layer 1: XMLtape & ARCfiles) is available from http://african.lanl.gov/aDORe/projects/adoreArchive/
Plans to “one way or another” make the entire LANL aDORe solution (revised Layer 1, Layer 2, Layer 3) available.
Slide 44:LANL aDORe @ 2 Sep 2007
Slide 45:LANL aDORe hardware
Slide 46:LANL aDORe Performance
Slide 47:aDORe Ingestion : Overview
Slide 48:Conclusion aDORe Archive:
The file-based approach (XMLtape/ARCfile) is inherently simple, and reduces dependency on database systems.
The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
Off-the-shelf XML tools can be used to parse/validate an XMLtape
All Digital Object metadata can be stored in the XML Package
The autonomy of the indexes allows retaining the files over time, while the indexes can be created using other techniques as technologies evolve.
All indexes can be thrown out and rebuilt from scratch.
Data integrity:
The XML Package contains a SHA1 digest for each datastream of the Digital Object it represents
A SHA1 digest for each XMLtape and ARCfile is stored in the XMLtape Registry and the ARCfile Registry, respectively
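A digest-based integrity check of the kind listed under "Data integrity" can be sketched with Python's standard hashlib module. The function names and the sample data are assumptions; how and where aDORe records and compares its digests is not described in these slides.

```python
import hashlib


def sha1_hex(data):
    """Return the SHA1 digest of a byte string as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


def is_intact(data, recorded_digest):
    """Compare the digest of the current bytes with the digest recorded at ingest."""
    return sha1_hex(data) == recorded_digest


# Hypothetical datastream content and the digest recorded for it at ingest time.
datastream = b"abc"
recorded = sha1_hex(datastream)
```

Recomputing the digest later and comparing it with the recorded value reveals whether the stored bytes have changed since ingestion.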
Slide 49:Conclusion aDORe:
The protocol-based nature of the access increases the flexibility in light of evolving technologies through the introduction of a layer of abstraction.
Any technology can be thrown out and the same protocol interface re-implemented using another technology.
The protocol-based nature of the solution allows a fully distributed implementation.
The component-based nature yields scalability.
The standard-based design allows the use of off-the-shelf tools.
A standard-based approach typically allows for a less painful migration (to a new standard).
All kinds of Content Management Systems can be aDORe-ized.