File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1) (1) Digital Library Research & Prototyping Team Research Library, Los Alamos National Laboratory (2) University Library Ghent University liu_x@lanl.gov , ludab@lanl.gov , patrick.hochstenbach@ugent.be , herbertv@lanl.gov
Disclaimer • The term Digital Object (DO) will be used as in Kahn/Wilensky: • Compound object • Multiple datastreams of different mime types • Secondary information pertaining to object and datastreams • Identifiers for object (and datastreams) • This is ~ OAIS Content Information
XML-based representation of DOs • Growing interest in XML-based representation of DOs in Digital Library architectures: • Platform independence • Industry support • Longevity, potential migration paths • Processing tools, validation capabilities • XML-based Compound Object formats: • ISO/IEC 21000-2 MPEG-21 DID & DIDL • METS • IMS/CP • CCSDS XFDU • Typical functionality: • By-Value (base64) and/or By-Reference provision of constituent datastreams • By-Value and/or By-Reference provision of secondary information • Provision of identifiers
Storing XML-based representations of DOs • Existing approaches: • storage of the XML-representations as individual files in a file system: • Poor access performance • Poor backup performance • storage of the XML-representations in (SQL, XML, object) databases • Long term? Data are dependent on the underlying system • storage of the XML-representations by concatenating many such documents into a single file such as tar or zip • Not XML aware, hence, no use of off-the-shelf XML tools • Increasing storage space (base64-encoding of the constituent datastreams)
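The storage cost of By-Value (base64) embedding can be estimated with a few lines of Python. This is a minimal sketch: the random payload is a stand-in for a binary datastream such as a PDF, and `base64_overhead` is a hypothetical helper name, not part of any aDORe tool.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return the size of the base64 encoding divided by the raw size."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# Stand-in for a binary datastream; the length is a multiple of 3,
# so the encoding needs no padding characters.
payload = os.urandom(300_000)
ratio = base64_overhead(payload)
print(f"base64 size / raw size = {ratio:.3f}")
```

Standard base64 maps every 3 input bytes to 4 output characters, so an embedded datastream grows by about one third before any line wrapping is added.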
aDORe XMLtape/ARCfile solution • Part of LANL aDORe repository effort: • Standards-based, modular repository architecture • Distributed architecture • Protocol-based interactions between modules • Usable to create interoperable federations of heterogeneous repositories • Actual implementation of the architecture at LANL • Components of aDORe software will be released • Inspired by Internet Archive ARC file approach: • File-based mechanism to store datastreams resulting from Web-crawling • Concatenation of multiple datastreams into a single file • Metadata as separators between datastreams • But not suitable for storing XML-based representations of DOs: • Metadata capabilities very limited & crawling related • Lose power of XML processing tools
aDORe XMLtape/ARCfile solution • Two interconnected file-based storage mechanisms: • XMLtapes: File storage of XML-based representations of Digital Objects • ARCfiles: File storage of constituent datastreams of Digital Objects • The ARC files are interconnected with one or more XMLtapes during the ingestion process • A protocol-based access mechanism is introduced: • XMLtape is exposed as an autonomous OAI-PMH repository • ARCfile is exposed as an OpenURL Resolver • Write once - Read many: • Files remain stable • Protocol-based access mechanism remains stable • Indexing mechanisms can change as technologies evolve • Storage approach is independent from the compound object format used to represent DOs as XML • aDORe uses MPEG-21 DIDL
ISO/IEC 21000-2: MPEG-21 DID & DIDL [Diagram relating the MPEG-21 Abstract Model, MPEG-21 DIDL, Digital Item, Digital Item Declaration, and DIDL document through "based on", "has declaration", and "has XML serialization" links]
Representing DOs using MPEG-21 DID [Figure: a Digital Object and its package, shown as a sample DIDL document]
aDORe XMLtape • An XML file that concatenates the XML-based representations of multiple DOs • Structure is defined by an XML Schema • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd • tape-level administrative section: • Open-ended content • Plug-in for processing-related information, indication of related ARCfiles: • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd • concatenation of records, each of which consists of: • record-level administrative section • identifier and datestamp of the contained record • other record-level administrative information • a record (can be from any XML Namespace). DIDL in case of aDORe: • http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd • An XMLtape is a valid and well-formed XML file • Independent from chosen XML-based Compound Object Format
aDORe XMLtape: sample XMLtape
<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin> ... </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
      <ta:recordAdmin> ... </ta:recordAdmin>
    </ta:tapeRecordAdmin>
    <ta:record>
      <didl:DIDL>...</didl:DIDL>
    </ta:record>
  </ta:tapeRecord>
</ta:tape>
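Because an XMLtape is ordinary well-formed XML, its records can be walked with a stock XML parser. The following is a minimal sketch using Python's standard library; the `ta` namespace URI and element names come from the sample, while the `payload` element and its `urn:example:placeholder` namespace are placeholders standing in for the elided DIDL content.

```python
import io
import xml.etree.ElementTree as ET

TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"

# Small tape following the structure of the sample; the record content is a placeholder.
SAMPLE_TAPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload xmlns="urn:example:placeholder">...</payload></ta:record>
  </ta:tapeRecord>
</ta:tape>
"""


def iter_tape_records(stream):
    """Yield (identifier, date, record_element) for each ta:tapeRecord in the stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == f"{{{TA}}}tapeRecord":
            admin = elem.find(f"{{{TA}}}tapeRecordAdmin")
            identifier = admin.findtext(f"{{{TA}}}identifier")
            date = admin.findtext(f"{{{TA}}}date")
            record = elem.find(f"{{{TA}}}record")
            yield identifier, date, record
            elem.clear()  # drop the finished tapeRecord's children to limit memory use


records = list(iter_tape_records(io.BytesIO(SAMPLE_TAPE)))
for identifier, date, _record in records:
    print(identifier, date)
```

Streaming with `iterparse` avoids loading the whole tape into memory, which matters for tapes that hold many records.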
aDORe XMLtape index [Figure: an XMLtape as a sequence of records, alongside an XMLtape index whose entries hold the identifier and the datestamp of ingestion taken from each record's <ta:tapeRecordAdmin>] • Indexing: • Can be achieved with a variety of technologies • Current implementation: Berkeley DB Java Edition
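One way to picture such an index is a map from identifier to datestamp plus the position of the record in the file. The sketch below is an illustration under stated assumptions, not the aDORe implementation: it uses a plain Python dict instead of Berkeley DB Java Edition, assumes the literal `ta:` prefix is used throughout the tape, and the byte offset and length fields are an assumption about what an index could store. The second identifier in the sample is hypothetical.

```python
import re

# Assumes the tape uses the literal "ta:" prefix, as in the sample XMLtape.
RECORD_RE = re.compile(rb"<ta:tapeRecord>.*?</ta:tapeRecord>", re.DOTALL)
ID_RE = re.compile(rb"<ta:identifier>(.*?)</ta:identifier>", re.DOTALL)
DATE_RE = re.compile(rb"<ta:date>(.*?)</ta:date>", re.DOTALL)

SAMPLE_TAPE = (
    b'<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">'
    b"<ta:tapeAdmin/>"
    b"<ta:tapeRecord><ta:tapeRecordAdmin>"
    b"<ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>"
    b"<ta:date>2005-03-29T04:31:22Z</ta:date>"
    b"</ta:tapeRecordAdmin><ta:record><x/></ta:record></ta:tapeRecord>"
    b"<ta:tapeRecord><ta:tapeRecordAdmin>"
    b"<ta:identifier>oai:example.org:hypothetical-2</ta:identifier>"
    b"<ta:date>2005-03-30T00:00:00Z</ta:date>"
    b"</ta:tapeRecordAdmin><ta:record><y/></ta:record></ta:tapeRecord>"
    b"</ta:tape>"
)


def build_index(tape_bytes):
    """Map each identifier to (datestamp, byte offset, byte length) of its tapeRecord."""
    index = {}
    for match in RECORD_RE.finditer(tape_bytes):
        chunk = match.group(0)
        identifier = ID_RE.search(chunk).group(1).strip().decode("utf-8")
        date = DATE_RE.search(chunk).group(1).strip().decode("utf-8")
        index[identifier] = (date, match.start(), len(chunk))
    return index


def fetch_record(tape_bytes, index, identifier):
    """Return the raw bytes of one tapeRecord without parsing the rest of the tape."""
    _date, offset, length = index[identifier]
    return tape_bytes[offset:offset + length]


index = build_index(SAMPLE_TAPE)
print(fetch_record(SAMPLE_TAPE, index, "oai:example.org:hypothetical-2"))
```

Since the tape is written once and never modified, an index like this can be discarded and rebuilt with a different technology at any time.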
aDORe XMLtape as OAI-PMH repository [Figure: an OAI-PMH request is resolved through the XMLtape index to a record in the XMLtape, and a DIDL document is returned] • OAI-PMH identifier = identifier from <ta:tapeRecordAdmin> • OAI-PMH datestamp = date from <ta:tapeRecordAdmin> • OAI-PMH response = content of <ta:record>
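A client would address a record in the tape with standard OAI-PMH requests. The sketch below only builds request URLs; the base URL `http://example.org/xmltape/oai` and the `didl` metadata prefix are assumptions for illustration, since the slides do not state either value.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical base URL of an XMLtape exposed over OAI-PMH (not given in the slides).
BASE_URL = "http://example.org/xmltape/oai"


def get_record_url(base_url, identifier, metadata_prefix="didl"):
    """Build an OAI-PMH GetRecord request for one tape record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,  # assumed prefix
    }
    return base_url + "?" + urlencode(params)


def list_records_url(base_url, from_date=None, until_date=None, metadata_prefix="didl"):
    """Build an OAI-PMH ListRecords request, optionally bounded by datestamps."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)


url = get_record_url(BASE_URL, "oai:aps.org:PhysRevA.71.040101")
print(url)
print(list_records_url(BASE_URL, from_date="2005-03-29T00:00:00Z"))
```

The datestamp bounds in `ListRecords` correspond to the date stored in `<ta:tapeRecordAdmin>`, which is what makes date-selective harvesting of a tape possible.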
Internet Archive ARCfile • Concatenation of binary files • Designed and used by the Internet Archive (Wayback Machine) • > 400 TB web data • Under revision by the International Internet Preservation Consortium (IIPC): WARC file format • Input from LANL to facilitate non-Web-crawling use case • The ARC file format is structured as follows: • file header that provides administrative information about the ARC file itself • a sequence of document records, each consisting of: • a header line containing some, mainly crawl-related, metadata: URI of the crawled document, timestamp of acquisition of the data, size of the data block • a response to a protocol request such as an HTTP GET
Internet Archive ARC file: sample ARC file
filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30

<HTML>
Hello World!!!
</HTML>
Internet Archive ARC file in aDORe: sample aDORe ARC file
filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
1 0 Internet Archive
URL IP-address Archive-date Content-type Archive-length

info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
%PDF-1.3
%âãÏÓ
290 0 obj
<< /Linearized 1 /O 295 /H [ 3642 1057 ] /L 415025 …
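The record layout shown in the samples (a header line of URL, IP-address, Archive-date, Content-type, and Archive-length, followed by that many bytes of data) can be read with a short loop. This is a minimal sketch for uncompressed files in that layout; `read_arc_records` and `make_record` are hypothetical helpers, the PDF bytes are placeholders, and the sketch is not the Heritrix reader.

```python
import io
from typing import NamedTuple


class ArcRecord(NamedTuple):
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    data: bytes


def read_arc_records(stream):
    """Yield ArcRecord tuples from an uncompressed binary stream in the layout above."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank line separating two records
        url, ip, date, ctype, length = line.decode("utf-8").split()
        data = stream.read(int(length))  # the header states the size of the data block
        yield ArcRecord(url, ip, date, ctype, data)


def make_record(url, ip, date, ctype, data):
    """Serialize one record: header line, data block, blank separator line."""
    header = f"{url} {ip} {date} {ctype} {len(data)}\n".encode("utf-8")
    return header + data + b"\n"


VERSION_BLOCK = (
    b"1 0 Internet Archive\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
)
SAMPLE_ARC = make_record(
    "filedesc://singletape.arc", "0.0.0.0", "20050922142103", "text/plain", VERSION_BLOCK
) + make_record(
    "info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a",
    "0.0.0.0", "20050907221344", "application/pdf",
    b"%PDF-1.3\nplaceholder bytes",  # stand-in for the real PDF content
)

records = list(read_arc_records(io.BytesIO(SAMPLE_ARC)))
for rec in records:
    print(rec.url, rec.content_type, len(rec.data))
```

Because each header carries the length of its data block, a reader can skip from record to record without inspecting the binary content, which is also what makes offset-based indexing straightforward.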
Internet Archive ARC file index [Figure: an ARC file as a sequence of datastreams, alongside an ARC index whose entries map each URL to a datastream; each record header carries URL, IP-address, Archive-date, Content-type, Archive-length] • Indexing: • Can be achieved with a variety of technologies • Current implementation in aDORe: Heritrix toolkit
ARC file as OpenURL Resolver [Figure: an OpenURL request is resolved through the index to a datastream in the ARC file, and the datastream is returned] • Referent Identifier = datastream identifier = URL from ARC record header • Resolver Identifier = identifier of ARC file
Associating an XMLtape with ARC Files (1) • A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID) • The resulting package (e.g. DIDL document) is stored in an XMLtape • Constituent datastreams of the Digital Object are provided By-Reference: • Using the ref attribute of the Resource element in MPEG-21 DID • The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework: baseURL(ARCfile OpenURL Resolver)? url_ver = Z39.88-2004 & rft_id = Datastream Identifier & res_id = ARCfile identifier
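The By-Reference location described above can be assembled from its three parameters. A minimal sketch follows, assuming a hypothetical resolver base URL (`http://example.org/arcfile-resolver`); the two `info:lanl-repo` identifiers are borrowed from samples elsewhere in this deck purely for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def datastream_openurl(resolver_base_url, datastream_id, arcfile_id):
    """Build the OpenURL that points at one datastream inside one ARC file."""
    query = urlencode({
        "url_ver": "Z39.88-2004",   # version of the NISO OpenURL Framework
        "rft_id": datastream_id,    # Referent: the datastream identifier
        "res_id": arcfile_id,       # Resolver: the ARC file identifier
    })
    return f"{resolver_base_url}?{query}"


openurl = datastream_openurl(
    "http://example.org/arcfile-resolver",  # hypothetical base URL
    "info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a",
    "info:lanl-repo/arc/singlescitape",
)
print(openurl)
```

`urlencode` percent-encodes the colons and slashes in the identifiers, so the resulting string is safe to place in a `ref` attribute after XML-escaping the `&` separators.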
Associating an XMLtape with ARC Files (1): extract from DIDL
<?xml version="1.0" encoding="UTF-8"?>
<didl:DIDL>
  ……
  <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml; charset=utf-8">
        <dii:Identifier … >info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b</dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
  </didl:Component>
  ……
</didl:DIDL>
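Going the other way, a consumer of a DIDL document can recover the ARC file and datastream identifiers from each `ref` attribute. The sketch below is illustrative only: the DIDL namespace URI `urn:mpeg:mpeg21:2002:02-DIDL-NS` is an assumption (the extract elides its namespace declarations), and the resolver URL and identifiers in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# Assumed DIDL namespace; the extract in the slides does not show the declaration.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"

# Hypothetical document; "&" is escaped as "&amp;" inside the XML attribute.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="application/pdf"
        ref="http://example.org/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/example-arc&amp;rft_id=info:lanl-repo/ds/example-datastream"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def datastream_references(didl_xml):
    """Return one dict per By-Reference Resource with its ARC file and datastream ids."""
    root = ET.fromstring(didl_xml)
    refs = []
    for resource in root.iter(f"{{{DIDL_NS}}}Resource"):
        ref = resource.get("ref")
        if ref is None:
            continue  # By-Value resource: nothing to resolve
        query = parse_qs(urlsplit(ref).query)
        refs.append({
            "mime_type": resource.get("mimeType"),
            "arcfile_id": query["res_id"][0],
            "datastream_id": query["rft_id"][0],
        })
    return refs


refs = datastream_references(SAMPLE_DIDL)
print(refs)
```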
Associating an XMLtape with ARC Files (2) • An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.
Associating an XMLtape with ARC Files (2): XMLtape header
<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
    …
</ta:tape>
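The association can be read back from the tape header without touching the records. A minimal sketch, assuming the namespace URIs and element names shown in the sample header; `read_tape_basics` is a hypothetical helper, and the returned list allows for more than one `ARCfileId` since a tape may reference several ARC files.

```python
import io
import xml.etree.ElementTree as ET

TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"
TB = "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/"

SAMPLE_TAPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
</ta:tape>
"""


def read_tape_basics(stream):
    """Return the tape id and associated ARC file ids from the tape-level admin section."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == f"{{{TA}}}tapeAdmin":
            basics = elem.find(f"{{{TB}}}XMLtapeBasics")
            if basics is None:
                return None  # the admin section is open-ended; the plug-in may be absent
            return {
                "tape_id": basics.findtext(f"{{{TB}}}XMLtapeId"),
                "arcfile_ids": [e.text for e in basics.findall(f"{{{TB}}}ARCfileId")],
            }
    return None


basics = read_tape_basics(io.BytesIO(SAMPLE_TAPE))
print(basics)
```

Returning as soon as `tapeAdmin` closes means the parser stops before the records, which is useful when registering a large tape.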
[Figure: access flow between an agent, the Identifier Locator, the XMLtape, and the ARC file. Labels recovered from the diagram: list of (baseURL, DIDLDocument-id); DIDLDocument-id or content-id; DIDL document; datastream ref; OpenURL; datastream; DIDLDocument-id index with creation datetime; datastream-id index]
Implementation • XMLtapes: • Berkeley DB Java Edition • OCLC OAICat • ARCfiles: • Heritrix • OCLC OpenURL software • XMLtape Registry: • MySQL db • OCLC OAICat • ARCfile Registry: • MySQL db • OCLC OAICat
Performance indicators • System: • Model: Dell 2650 2U rack-mount server • CPU: dual 2.8 GHz Intel Xeon processors • RAM: 5 GB • Disks: 10k RPM SCSI disks • XMLtape: • 1786 MB, 201872 DIDL records • download 100 consecutive DIDL records (787 KB) => 0.18 seconds • download static file of same size => 0.09 seconds • ARCfile: • 272 MB, 4910 files • download a sample PDF file (312 KB) => 0.24 seconds • download static file of same size => 0.036 seconds
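The relative cost of protocol-based access follows directly from the figures above. The short calculation below uses only the reported timings; the `slowdown` helper name is an illustration.

```python
# (measured seconds via protocol access, seconds for a static file of the same size)
MEASUREMENTS = {
    "XMLtape, 100 DIDL records (787 KB)": (0.18, 0.09),
    "ARCfile, one PDF (312 KB)": (0.24, 0.036),
}


def slowdown(measured_seconds, static_seconds):
    """Ratio of protocol-based access time to static-file download time."""
    return measured_seconds / static_seconds


for label, (measured, static) in MEASUREMENTS.items():
    ratio = slowdown(measured, static)
    extra = measured - static
    print(f"{label}: {ratio:.1f}x a static file, {extra:.3f} s extra")

# Average time per DIDL record in the XMLtape test, in milliseconds.
per_record_ms = 0.18 / 100 * 1000
print(f"about {per_record_ms:.1f} ms per DIDL record")
```

On these numbers, XMLtape access takes about twice as long as serving a static file, and ARCfile access about 6.7 times as long, although the absolute overhead stays near or below 0.2 seconds in both cases.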
Software • ARC files: • Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/ • NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the internet for future generations. http://www.netarchive.dk/ • Many other tools: http://archive-access.sourceforge.net • XMLtapes: • Perl tool, XML::Tape (LANL & Ghent University), http://search.cpan.org/~hochsten/XML-Tape/ • Combined aDORe XMLtape/ARCfile environment: • Java tool (LANL), soon to be released on SourceForge
Conclusion • The file-based approach is inherently simple and reduces dependency on a database system. • The autonomy of the indexes allows retaining the files over time, while the indexes can be re-created using other techniques as technologies evolve. • The protocol-based nature of the access increases flexibility in light of evolving technologies, as it introduces another layer of abstraction. • The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features: • Off-the-shelf XML tools can be used to parse/validate an XMLtape • All DO metadata can be stored in an XML-based compound object format • Presentation available via http://public.lanl.gov/herbertv/