Depositing e-material to The National Library of Sweden
KB - Overview 1661 – First legal deposit law 1877 – Becomes a government institution 1996 – First steps in digitization 1997 – Kulturarw3 - the first collection of the Swedish web 20?? – Deposit law expanded to include electronically published documents
KB – Aim of repository • Be able to receive different kinds of data in different kinds of formats • Be able to handle large amounts of incoming data (scalability) • Have a flexible and modular design • Be able to utilize services that can receive data from organizations with different technical capabilities • A system for long term preservation and presentation
Reality – Types of material • Will receive widely different kinds of materials • Different: • file formats • metadata formats • structure of data • naming schemas • From a lot of different sources • Local file system, FTP, Database, URL on the web • Should still try to use the same services • Solution: • Normalize received material to an internal format • Represent data + metadata as DIDLXML
Fundamentals of deposit system • Modular design • One internal format for representing packages • Try to use as simple interfaces between services as possible • REST services (HTTP + XML) • Message Queue to drop packages for the system in • This makes the system independent of platform and programming framework • Each module should be highly configurable with smaller sub-components • Build services as chains of simple components concerned with just one task • Use Spring Framework for configuration
Internal package format • Uses Digital Item Declaration Language (DIDL) • An MPEG-21 standard • An XML format for both data and metadata • Do not inline data, just metadata • Store datastreams centrally and reference them • 1 DIDL file = 1 "object" • One package has: • ID • Type • List of Attributes (name/value pairs) • List of Metadata (as XML) • List of Resources (as references)
Internal package format • Represent a package as a DIDL file • Parser to read a DIDL file into a Java object • Serializer to write a Java object to a DIDL file • Usually works with the package as a Java object • BUT: • Only plain XML is sent between services • Decouples services from programming language, anything that can handle XML is fine
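The parser/serializer pair described above can be sketched roughly as follows. This is a minimal illustration using the JDK's built-in DOM API, not the library's actual code: the class names `SimplePackage` and `PackageXml` are hypothetical, and the element names (`package`, `attribute`) are placeholders rather than the real DIDL vocabulary.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical, stripped-down package: id, type, and name/value attributes. */
final class SimplePackage {
    final String id;
    final String type;
    final Map<String, String> attributes = new LinkedHashMap<>();

    SimplePackage(String id, String type) {
        this.id = id;
        this.type = type;
    }
}

/** Converts between SimplePackage and XML. Element names are placeholders, not DIDL. */
final class PackageXml {

    /** Writes the Java object as plain XML, the only form sent between services. */
    static String serialize(SimplePackage pkg) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("package");
            root.setAttribute("id", pkg.id);
            root.setAttribute("type", pkg.type);
            doc.appendChild(root);
            for (Map.Entry<String, String> entry : pkg.attributes.entrySet()) {
                Element attribute = doc.createElement("attribute");
                attribute.setAttribute("name", entry.getKey());
                attribute.setAttribute("value", entry.getValue());
                root.appendChild(attribute);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize package " + pkg.id, e);
        }
    }

    /** Reads the XML back into a Java object for use inside a service. */
    static SimplePackage parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            SimplePackage pkg = new SimplePackage(root.getAttribute("id"), root.getAttribute("type"));
            NodeList attributes = root.getElementsByTagName("attribute");
            for (int i = 0; i < attributes.getLength(); i++) {
                Element attribute = (Element) attributes.item(i);
                pkg.attributes.put(attribute.getAttribute("name"), attribute.getAttribute("value"));
            }
            return pkg;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable package document", e);
        }
    }
}
```

Because only the XML string crosses service boundaries, a service written in another language would need just its own equivalent of `parse` and `serialize`.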
Internal package format - Attributes • Attributes • Name/value pairs (Example: page-number = 5) • Flexible way of representing additional information about a package
Internal package format - Metadata • Metadata • Name • Description (optional) • XML that represents the metadata
Internal package format - Resource • Resource • ID • Mimetype • List of Attributes (for this Resource only) • List of Metadata (for this Resource only) • Reference to the datastream (a URL)
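Taken together, the package, metadata and resource slides suggest an object model along these lines. The sketch below is an assumption about shape only: the class names `DepositPackage`, `MetadataRecord` and `ResourceRef` are hypothetical, and attributes are held in a map on the assumption that attribute names are unique within their owner.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One metadata entry: a name, an optional description, and the metadata as XML. */
final class MetadataRecord {
    final String name;
    final String description; // optional: null when absent
    final String xml;

    MetadataRecord(String name, String description, String xml) {
        this.name = name;
        this.description = description;
        this.xml = xml;
    }
}

/** One datastream, referenced by URL rather than inlined in the package. */
final class ResourceRef {
    final String id;
    final String mimeType;
    final URI reference;
    // Attributes and metadata that belong to this resource only.
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<MetadataRecord> metadata = new ArrayList<>();

    ResourceRef(String id, String mimeType, URI reference) {
        this.id = id;
        this.mimeType = mimeType;
        this.reference = reference;
    }
}

/** One package, corresponding to one DIDL file. */
final class DepositPackage {
    final String id;
    final String type;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<MetadataRecord> metadata = new ArrayList<>();
    final List<ResourceRef> resources = new ArrayList<>();

    DepositPackage(String id, String type) {
        this.id = id;
        this.type = type;
    }
}
```

A page image would then be a `ResourceRef` with mimetype `image/jpeg`, the attribute `page-number = 5`, and a URL pointing into the central store.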
Package normalizer • Takes data in one format and creates an internal package • Creates the DIDL file and writes the datastreams to the Resource Store • Places the package on a queue for further processing • One normalizer per type of data package delivered • Has to know the contract for the delivered data • Looks in an inbox at regular intervals for new packages • File system directory • Data could be delivered via FTP or file copy on local file system • URL • OAI-PMH server with metadata that has links to actual resources • OAI-ORE fits in nicely here • Database • Web form operated by human • Anything else?
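The polling behaviour of a normalizer could look something like the sketch below. It assumes a hypothetical `Inbox` interface that hides whether deliveries come from a directory, an FTP drop or an OAI-PMH feed, and it uses a plain `java.util.Queue` as a stand-in for the message queue. The `normalize` step is a placeholder: a real normalizer would create the DIDL file and write the datastreams to the Resource Store at that point.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Anything that can list pending deliveries (hypothetical abstraction). */
interface Inbox {
    List<String> pendingDeliveries();
}

final class PollingNormalizer {
    private final Inbox inbox;
    private final Queue<String> processingQueue;
    // Remembers what has been queued so a delivery is not processed twice.
    private final Set<String> alreadyQueued = new HashSet<>();

    PollingNormalizer(Inbox inbox, Queue<String> processingQueue) {
        this.inbox = inbox;
        this.processingQueue = processingQueue;
    }

    /** Checks the inbox once, queues every new delivery, and returns how many were queued. */
    int pollOnce() {
        int queued = 0;
        for (String delivery : inbox.pendingDeliveries()) {
            if (alreadyQueued.add(delivery)) {
                processingQueue.add(normalize(delivery));
                queued++;
            }
        }
        return queued;
    }

    /** Placeholder for the real work of building the internal package. */
    private String normalize(String delivery) {
        return "package-for:" + delivery;
    }
}
```

To look in the inbox at regular intervals, `pollOnce` could be handed to a `ScheduledExecutorService` via `scheduleWithFixedDelay`.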
Enriching a package • REST service • POST a DIDL file and get it back enriched • Implemented with Spring and a chain of enrichers • Each doing one specific task, for example adding a urn:nbn • Some only make sense for a specific kind of package • Can be a different set of enrichers for different package types • Examples of enrichers • Adding urn:nbn • Updating MARCXML to reflect that it is an electronic copy • Adding extracted technical metadata from JHove or DROID • And so on... • Possible to have enrichers that involve human intervention
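A chain of single-purpose enrichers might be structured as in the following sketch. All names here (`Enricher`, `EnricherChain`, `UrnNbnEnricher`, `DigitizedMarkerEnricher`, `EnrichablePackage`) are hypothetical, the chain is wired by hand rather than through Spring, and the generated urn:nbn value is a placeholder rather than a real assignment scheme.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal package: id, type, and name/value attributes. */
final class EnrichablePackage {
    final String id;
    final String type;
    final Map<String, String> attributes = new LinkedHashMap<>();

    EnrichablePackage(String id, String type) {
        this.id = id;
        this.type = type;
    }
}

/** One enricher does one task and says which packages it makes sense for. */
interface Enricher {
    boolean appliesTo(EnrichablePackage pkg);
    void enrich(EnrichablePackage pkg);
}

/** Adds an identifier to any package that lacks one (placeholder value only). */
final class UrnNbnEnricher implements Enricher {
    public boolean appliesTo(EnrichablePackage pkg) {
        return !pkg.attributes.containsKey("urn:nbn");
    }

    public void enrich(EnrichablePackage pkg) {
        pkg.attributes.put("urn:nbn", "urn:nbn:placeholder-" + pkg.id);
    }
}

/** Only makes sense for one package type, here the PagedObject type. */
final class DigitizedMarkerEnricher implements Enricher {
    public boolean appliesTo(EnrichablePackage pkg) {
        return "PagedObject".equals(pkg.type);
    }

    public void enrich(EnrichablePackage pkg) {
        pkg.attributes.put("digitized", "true");
    }
}

/** Runs the configured enrichers in order, skipping those that do not apply. */
final class EnricherChain {
    private final List<Enricher> enrichers;

    EnricherChain(List<Enricher> enrichers) {
        this.enrichers = enrichers;
    }

    EnrichablePackage run(EnrichablePackage pkg) {
        for (Enricher enricher : enrichers) {
            if (enricher.appliesTo(pkg)) {
                enricher.enrich(pkg);
            }
        }
        return pkg;
    }
}
```

Different package types can get different chains simply by configuring a different list of enrichers.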
Validating a package • Similar in design to Enricher • REST service • POST a DIDL file and get back a status report • Implemented with Spring and a chain of tests • Each test doing one specific task • Some only make sense for a specific kind of package • Can be a different set of tests for different package types • Examples of tests • Verifying that a PDF is readable • Validating metadata • And so on... • Possible to have tests that involve human intervention
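The validation chain can be sketched in the same spirit. The example below is a simplification under several assumptions: checks run per resource rather than per package, all class names are hypothetical, and the PDF check only confirms that the content starts with the `%PDF-` header, which is a much weaker test than verifying that the file is actually readable.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resource with its content loaded for checking. */
final class CheckedResource {
    final String mimeType;
    final byte[] content;

    CheckedResource(String mimeType, byte[] content) {
        this.mimeType = mimeType;
        this.content = content;
    }
}

/** One check does one task and says which resources it makes sense for. */
interface ResourceCheck {
    String name();
    boolean appliesTo(CheckedResource resource);
    boolean passes(CheckedResource resource);
}

final class NonEmptyCheck implements ResourceCheck {
    public String name() { return "non-empty"; }
    public boolean appliesTo(CheckedResource resource) { return true; }
    public boolean passes(CheckedResource resource) { return resource.content.length > 0; }
}

/** Cheap sanity check: PDF files begin with the bytes "%PDF-". */
final class PdfHeaderCheck implements ResourceCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    public String name() { return "pdf-header"; }

    public boolean appliesTo(CheckedResource resource) {
        return "application/pdf".equals(resource.mimeType);
    }

    public boolean passes(CheckedResource resource) {
        if (resource.content.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (resource.content[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}

/** The status report returned to the caller: one line per failed check. */
final class StatusReport {
    final List<String> failures = new ArrayList<>();

    boolean isValid() {
        return failures.isEmpty();
    }
}

final class ValidationChain {
    private final List<ResourceCheck> checks;

    ValidationChain(List<ResourceCheck> checks) {
        this.checks = checks;
    }

    StatusReport validate(List<CheckedResource> resources) {
        StatusReport report = new StatusReport();
        for (int i = 0; i < resources.size(); i++) {
            CheckedResource resource = resources.get(i);
            for (ResourceCheck check : checks) {
                if (check.appliesTo(resource) && !check.passes(resource)) {
                    report.failures.add(check.name() + ": resource " + i);
                }
            }
        }
        return report;
    }
}
```

Unlike an enricher, a check never modifies the package; it only contributes lines to the report.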
Ingest • REST service • PUT a DIDL file and get back an id pointing into the repository • In future: • Perhaps add the possibility to update or delete a package in the repository using POST and DELETE • Abstraction that hides the actual repository used • Can change repository without affecting the rest of the system • Repository-dependent enrichments and tests can be done here • We use Fedora as our repository • The same principle is used for ingestion into the long-term preservation archive
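The idea of hiding the repository behind an abstraction can be illustrated with a small sketch. The `Repository` interface, `IngestService` and `InMemoryRepository` are hypothetical names, and the in-memory store merely stands in for Fedora or the preservation archive; the id format is invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** What the ingest service needs from any repository: store a package, return its id. */
interface Repository {
    String store(String didlXml);
}

/** Stand-in repository for illustration; a real one would talk to Fedora. */
final class InMemoryRepository implements Repository {
    private final String prefix;
    private final Map<String, String> records = new LinkedHashMap<>();

    InMemoryRepository(String prefix) {
        this.prefix = prefix;
    }

    public String store(String didlXml) {
        String id = prefix + ":" + (records.size() + 1);
        records.put(id, didlXml);
        return id;
    }

    String fetch(String id) {
        return records.get(id);
    }
}

/** Handles a PUT: the caller never learns which repository sits behind it. */
final class IngestService {
    private final Repository repository;

    IngestService(Repository repository) {
        this.repository = repository;
    }

    String put(String didlXml) {
        if (didlXml == null || didlXml.trim().isEmpty()) {
            throw new IllegalArgumentException("Empty package document");
        }
        // Repository-dependent enrichments and tests would run here.
        return repository.store(didlXml);
    }
}
```

Swapping the repository then means constructing `IngestService` with a different `Repository` implementation, with no change to the services upstream.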
Fedora • Fedora is used as the repository • Reasons why: • Open-source • Actively developed • Large (and growing) user base • Good design and nice features • We use version 2.2 • obviously going to move to 3.0 in the future • Used for storage and presentation • Stores both relevant datastreams and metadata • Has relations between datastreams (e.g. sequence-number) • Possible to search against the repository • By default, search runs against DC fields
Fedora – Content Models • Content Model • A contract of available Datastreams and Behaviour Definitions in a Fedora record • In Fedora 2.x just an informal agreement • But from Fedora 3.0 a new mechanism exists for this • Called Content Model Architecture (CMA) • A Content Model could involve multiple Fedora records • Atomistic versus Compound model • Also specifies relations • Both between datastreams and Fedora records • Using RDF in the RELS-EXT datastream
Fedora - An example Content Model • PagedObject Content Model • Used for digitized material where each page is an image • Atomistic, i.e. one page becomes one Fedora record • Also has one Fedora record for the object as a whole • Record for the object • Datastreams • DC • MODS • MARCXML • Behaviour Definitions • view • list • getPreview • Relations • member of a collection • member of OAI-PMH set • Record for an individual page • Datastreams • WEBIMAGE • THUMBNAIL • Behaviour Definitions • getImage • getZoom • Relations • member of the object • sequence-number etc.
Fedora - Ingest • Gets a DIDL package and creates corresponding FOXML • Different FOXML for different Content Models • Which Content Model depends on Type of package • A Content Model can result in multiple FOXML files (and accordingly multiple Fedora records) • Uses Fedora's Web Services to ingest the FOXML to the repository • The datastreams are also transferred to the Fedora repository • (Also a urn:nbn is mapped to the object's location in Fedora)
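How one package can fan out into several Fedora records, depending on its type, might be sketched as follows. This does not generate real FOXML or call Fedora; it only plans which records an atomistic PagedObject would need. The names `PlannedRecord`, `ContentModelPlan`, `PagedObjectPlan` and `ContentModels`, and the record id format, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** A record that would later be written out as one FOXML file. */
final class PlannedRecord {
    final String pid;
    final String memberOf;      // null for the record of the object as a whole
    final int sequenceNumber;   // 0 for the object record, 1..n for pages

    PlannedRecord(String pid, String memberOf, int sequenceNumber) {
        this.pid = pid;
        this.memberOf = memberOf;
        this.sequenceNumber = sequenceNumber;
    }
}

/** Decides which records a package of a given type turns into. */
interface ContentModelPlan {
    List<PlannedRecord> plan(String objectPid, int pageCount);
}

/** Atomistic model: one record for the object plus one record per page. */
final class PagedObjectPlan implements ContentModelPlan {
    public List<PlannedRecord> plan(String objectPid, int pageCount) {
        List<PlannedRecord> records = new ArrayList<>();
        records.add(new PlannedRecord(objectPid, null, 0));
        for (int page = 1; page <= pageCount; page++) {
            records.add(new PlannedRecord(objectPid + "-page" + page, objectPid, page));
        }
        return records;
    }
}

/** Picks the content model from the package type. */
final class ContentModels {
    static ContentModelPlan forPackageType(String type) {
        if ("PagedObject".equals(type)) {
            return new PagedObjectPlan();
        }
        throw new IllegalArgumentException("No content model configured for package type " + type);
    }
}
```

For a three-page object this yields four planned records, with each page record pointing back to the object and carrying its sequence number.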
Fedora - Access • Built-in search system • Search for DC terms and some Fedora terms • Built-in OAI-PMH provider • We give access to DC, MODS and MARCXML • Built-in RDF Query Server • Query against the RDF in RELS-EXT • In future: OAI-ORE provider for Fedora • We provide our own viewer for digitized objects • Developed with Google Web Toolkit (GWT) • Has one tab with an overview of all pages • Another tab with an individual page with zooming functionality and the ability to navigate between pages • Some simple metadata displayed
A demo of viewing e-material from our Fedora repository. Accessing SOT from LIBRIS. Example