
File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files




  1. File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files. Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1). (1) Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory; (2) University Library, Ghent University. liu_x@lanl.gov, ludab@lanl.gov, patrick.hochstenbach@ugent.be, herbertv@lanl.gov

  2. Disclaimer • The term Digital Object (DO) will be used as in Kahn/Wilensky: • Compound object • Multiple datastreams of different MIME types • Secondary information pertaining to object and datastreams • Identifiers for object (and datastreams) • This roughly corresponds to OAIS Content Information

  3. XML-based representation of DOs • Growing interest in XML-based representation of DOs in Digital Library architectures: • Platform independence • Industry support • Longevity, potential migration paths • Processing tools, validation capabilities • XML-based Compound Object formats: • ISO/IEC 21000-2 MPEG-21 DID & DIDL • METS • IMS/CP • CCSDS XFDU • Typical functionality: • By-Value (base64) and/or By-Reference provision of constituent datastreams • By-Value and/or By-Reference provision of secondary information • Provision of identifiers

  4. Storing XML-based representations of DOs • Existing approaches: • storage of the XML-representations as individual files in a file system: • Poor access performance • Poor backup performance • storage of the XML-representations in (SQL, XML, object) databases • Long term? Data are dependent on the underlying system • storage of the XML-representations by concatenating many such documents into a single file such as tar or zip • Not XML aware, hence, no use of off-the-shelf XML tools • Increasing storage space (base64-encoding of the constituent datastreams)
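The storage cost of By-Value (base64) embedding mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: the 300,000-byte random payload is a hypothetical stand-in for a PDF or image datastream, not data from the presentation.

```python
import base64
import os

# Hypothetical 300,000-byte datastream standing in for a PDF or image.
datastream = os.urandom(300_000)

# Base64 maps every 3 input bytes to 4 output characters.
encoded = base64.b64encode(datastream)

overhead = len(encoded) / len(datastream)
print(f"raw: {len(datastream)} bytes, base64: {len(encoded)} bytes, "
      f"ratio: {overhead:.3f}")
```

The ratio of 4/3 means roughly one third more storage for every datastream embedded By-Value, before any XML markup is counted.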

  5. aDORe XMLtape/ARCfile solution • Part of LANL aDORe repository effort: • Standards-based, modular repository architecture • Distributed architecture • Protocol-based interactions between modules • Usable to create interoperable federations of heterogeneous repositories • Actual implementation of the architecture at LANL • Components of aDORe software will be released • Inspired by Internet Archive ARC file approach: • File-based mechanism to store datastreams resulting from Web crawling • Concatenation of multiple datastreams into a single file • Metadata as separators between datastreams • But not suitable for storing XML-based representations of DOs: • Metadata capabilities very limited and crawling-related • Lose power of XML processing tools

  6. aDORe XMLtape/ARCfile solution • Two interconnected file-based storage mechanisms: • XMLtapes: File storage of XML-based representations of Digital Objects • ARCfiles: File storage of constituent datastreams of Digital Objects • The ARC files are interconnected with one or more XMLtapes during the ingestion process • A protocol-based access mechanism is introduced: • XMLtape is exposed as an autonomous OAI-PMH repository • ARCfile is exposed as an OpenURL Resolver • Write once, read many: • Files remain stable • Protocol-based access mechanism remains stable • Indexing mechanisms can change as technologies evolve • Storage approach is independent of the compound object format used to represent DOs as XML • aDORe uses MPEG-21 DIDL

  7. ISO/IEC 21000-2: MPEG-21 DID & DIDL [Diagram. Nodes: MPEG-21 Abstract Model, MPEG-21 DIDL, Digital Item, Digital Item Declaration, DIDL document. Edge labels: "has declaration", "has XML serialization", "based on".]

  8. Representing DOs using MPEG-21 DID [Figure labels: Digital Object, Package, sample DIDL document.]

  9. aDORe XMLtape • An XML file that concatenates the XML-based representations of multiple DOs • Structure is defined by an XML Schema • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd • tape-level administrative section: • Open-ended content • Plug-in for processing-related information, indication of related ARCfiles: • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd • concatenation of records, each of which consists of: • record-level administrative section • identifier and datestamp of the contained record • other record-level administrative information • a record (can be from any XML Namespace). DIDL in case of aDORe: • http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd • An XMLtape is a valid and well-formed XML file • Independent from chosen XML-based Compound Object Format

  10. aDORe XMLtape (sample XMLtape)

      <?xml version="1.0" encoding="UTF-8"?>
      <ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
        <ta:tapeAdmin> ... </ta:tapeAdmin>
        <ta:tapeRecord>
          <ta:tapeRecordAdmin>
            <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
            <ta:date>2005-03-29T04:31:22Z</ta:date>
            <ta:recordAdmin> ... </ta:recordAdmin>
          </ta:tapeRecordAdmin>
          <ta:record>
            <didl:DIDL>...</didl:DIDL>
          </ta:record>
        </ta:tapeRecord>
      </ta:tape>
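Because an XMLtape is ordinary XML, off-the-shelf parsers can walk its records. The following sketch uses Python's standard `xml.etree.ElementTree` on a tiny tape modeled on the sample. The `ta` namespace URI is taken from the slide; the DIDL namespace URI, the empty `<didl:DIDL/>` payload, and the helper name `list_records` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/"}

# Tiny tape modeled on the sample; the DIDL namespace URI and the empty
# DIDL payload are placeholders, not taken from the presentation.
SAMPLE_TAPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/"
         xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><didl:DIDL/></ta:record>
  </ta:tapeRecord>
</ta:tape>
"""


def list_records(tape_xml: bytes):
    """Return (identifier, date, payload root tag) for each tape record."""
    root = ET.fromstring(tape_xml)
    results = []
    for tape_record in root.findall("ta:tapeRecord", NS):
        admin = tape_record.find("ta:tapeRecordAdmin", NS)
        identifier = admin.findtext("ta:identifier", namespaces=NS)
        date = admin.findtext("ta:date", namespaces=NS)
        # <ta:record> may wrap a document from any XML namespace.
        payload = tape_record.find("ta:record", NS)[0]
        results.append((identifier, date, payload.tag))
    return results


records = list_records(SAMPLE_TAPE)
for identifier, date, tag in records:
    print(identifier, date, tag)
```

For a tape of the size reported later in the presentation, a streaming parser would be a more realistic choice than loading the whole file at once; the in-memory parse here only keeps the sketch short.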

  11. aDORe XMLtape index • Index entries: identifier and datestamp of ingestion, taken from <ta:tapeRecordAdmin> • Indexing: • Can be achieved with a variety of technologies • Current implementation: Berkeley DB Java Edition [Figure: an XMLtape drawn as a sequence of records next to its XMLtape index.]

  12. aDORe XMLtape as OAI-PMH repository • OAI-PMH identifier = identifier from <ta:tapeRecordAdmin> • OAI-PMH datestamp = datetime from <ta:tapeRecordAdmin> • OAI-PMH response = content of <ta:record> [Figure labels: XMLtape records, XMLtape index, OAI-PMH request, DIDL document.]
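The mapping from tape records to OAI-PMH responses can be sketched with a small in-memory index. This is not the aDORe implementation, which uses Berkeley DB Java Edition; the class `TapeIndex`, its method names, and the second record identifier are hypothetical and exist only to illustrate how identifier and datestamp lookups could serve GetRecord and datestamp-bounded ListIdentifiers requests.

```python
from typing import Dict, List, NamedTuple, Optional


class IndexEntry(NamedTuple):
    datestamp: str   # value of <ta:date>, UTC in ISO 8601 form
    record_xml: str  # serialized content of <ta:record>


class TapeIndex:
    """In-memory stand-in for an XMLtape index (hypothetical API)."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexEntry] = {}

    def add(self, identifier: str, datestamp: str, record_xml: str) -> None:
        self._entries[identifier] = IndexEntry(datestamp, record_xml)

    def get_record(self, identifier: str) -> Optional[str]:
        """GetRecord: the response payload is the content of <ta:record>."""
        entry = self._entries.get(identifier)
        return entry.record_xml if entry is not None else None

    def list_identifiers(self, from_: Optional[str] = None,
                         until: Optional[str] = None) -> List[str]:
        """ListIdentifiers with optional datestamp bounds.

        Same-format UTC ISO 8601 strings sort chronologically, so plain
        string comparison is enough for this sketch.
        """
        return sorted(
            identifier
            for identifier, entry in self._entries.items()
            if (from_ is None or entry.datestamp >= from_)
            and (until is None or entry.datestamp <= until)
        )


index = TapeIndex()
index.add("oai:aps.org:PhysRevA.71.040101", "2005-03-29T04:31:22Z",
          "<didl:DIDL>...</didl:DIDL>")
# Made-up second record, used only to show datestamp filtering.
index.add("oai:example.org:second", "2005-06-01T00:00:00Z",
          "<didl:DIDL>...</didl:DIDL>")

print(index.list_identifiers(from_="2005-04-01T00:00:00Z"))
```

Keeping the index separate from the tape file matches the write-once idea: the index can be rebuilt with a different technology without touching the stored records.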

  13. Internet Archive ARCfile • Concatenation of binary files • Designed and used by the Internet Archive (Wayback Machine) • > 400 TB of web data • Under revision by the International Internet Preservation Consortium (IIPC): WARC file format • Input from LANL to facilitate non-Web-crawling use cases • The ARC file format is structured as follows: • a file header that provides administrative information about the ARC file itself • a sequence of document records, each consisting of: • a header line containing some, mainly crawl-related, metadata: • URI of the crawled document • timestamp of acquisition of the data • size of the data block • a response to a protocol request such as an HTTP GET

  14. Internet Archive ARC file (sample ARC file)

      filedesc://IA-001102.arc 0 19960923142103 text/plain 76
      1 0 Alexa Internet
      URL IP-address Archive-date Content-type Archive-length

      http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
      HTTP/1.0 200 Document follows
      Date: Mon, 04 Nov 1996 14:21:06 GMT
      Server: NCSA/1.4.1
      Content-type: text/html
      Last-modified: Sat,10 Aug 1996 22:33:11 GMT
      Content-length: 30
      <HTML>Hello World!!! </HTML>
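The record layout in the sample (a header line of five space-separated fields, followed by a data block whose size is the last field) can be read back with a short loop. This is a simplified sketch, not the Heritrix reader: it assumes uncompressed input, exactly five header fields without embedded spaces, and a single newline after each data block. The helper names `read_arc_records` and `arc_record` are made up for the example, and the demo records are built in code so that the declared lengths match the data.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple


class ArcRecord(NamedTuple):
    url: str
    ip: str
    archive_date: str
    content_type: str
    length: int
    body: bytes


def read_arc_records(stream: BinaryIO) -> Iterator[ArcRecord]:
    """Yield records: one header line, then `length` bytes of data."""
    while True:
        line = stream.readline()
        if not line:
            return          # end of file
        if not line.strip():
            continue        # blank separator between records
        url, ip, date, ctype, length = line.decode("latin-1").split()
        body = stream.read(int(length))
        yield ArcRecord(url, ip, date, ctype, int(length), body)


def arc_record(url: str, ip: str, date: str, ctype: str, body: bytes) -> bytes:
    """Serialize one record with a length field that matches the body."""
    header = f"{url} {ip} {date} {ctype} {len(body)}\n".encode("latin-1")
    return header + body + b"\n"


version_block = (b"1 0 Alexa Internet\n"
                 b"URL IP-address Archive-date Content-type Archive-length\n")
document = b"<HTML>Hello World!!! </HTML>\n"

arc_bytes = (
    arc_record("filedesc://IA-001102.arc", "0", "19960923142103",
               "text/plain", version_block)
    + arc_record("http://www.dryswamp.edu:80/index.html", "127.10.100.2",
                 "19961104142103", "text/html", document)
)

records = list(read_arc_records(io.BytesIO(arc_bytes)))
for record in records:
    print(record.url, record.content_type, record.length)
```

Note that the file header is itself read as a record here, with the `filedesc://` URL; a caller interested only in datastreams would skip it.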

  15. Internet Archive ARC file in aDORe (sample aDORe ARC file)

      filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
      1 0 Internet Archive
      URL IP-address Archive-date Content-type Archive-length

      info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
      %PDF-1.3 %âãÏÓ 290 0 obj << /Linearized 1 /O 295 /H [ 3642 1057 ] /L 415025 …

  16. Internet Archive ARC file • Indexing: • Can be achieved with a variety of technologies • Current implementation in aDORe: Heritrix toolkit [Figure: an ARC file drawn as a sequence of datastreams with record headers (URL IP-address Archive-date Content-type Archive-length), next to an ARC index of URL/datastream entries.]

  17. ARC file as OpenURL Resolver • Referent Identifier = datastream identifier = URL from ARC record header • Resolver Identifier = identifier of ARC file [Figure labels: OpenURL request, index, ARC file datastreams, returned datastream.]

  18. Associating an XMLtape with ARC Files (1) • A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID) • The resulting package (e.g. DIDL document) is stored in an XMLtape • Constituent datastreams of the Digital Object are provided By-Reference: • Using the ref attribute of the Resource element in MPEG-21 DID • The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework: baseURL(ARCfile OpenURL Resolver)? url_ver = Z39.88-2004 & rft_id = Datastream Identifier & res_id = ARCfile identifier
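The OpenURL pattern above can be assembled and taken apart with Python's standard `urllib.parse`. In this sketch the key names `url_ver`, `rft_id` and `res_id` and the sample values come from the presentation, while the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def datastream_openurl(base_url: str, datastream_id: str, arcfile_id: str) -> str:
    """Build the By-Reference location of a datastream held in an ARCfile."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": datastream_id,  # Referent: the datastream identifier
        "res_id": arcfile_id,     # Resolver: the ARCfile identifier
    })
    return f"{base_url}?{query}"


def parse_datastream_openurl(openurl: str):
    """Return (datastream_id, arcfile_id) from such an OpenURL."""
    params = parse_qs(urlsplit(openurl).query)
    return params["rft_id"][0], params["res_id"][0]


BASE = "http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver"
DS_ID = "info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"
ARC_ID = "info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2"

openurl = datastream_openurl(BASE, DS_ID, ARC_ID)
print(openurl)
print(parse_datastream_openurl(openurl))
```

`urlencode` percent-encodes the `:` and `/` characters inside the identifiers, and `parse_qs` decodes them again. When such a URL is written into an XML attribute, the `&` separators additionally have to be escaped as `&amp;`.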

  19. Associating an XMLtape with ARC Files (1) (extract from DIDL)

      <?xml version="1.0" encoding="UTF-8"?>
      <didl:DIDL>
        ……
        <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
          <didl:Descriptor>
            <didl:Statement mimeType="application/xml; charset=utf-8">
              <dii:Identifier … >info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b</dii:Identifier>
            </didl:Statement>
          </didl:Descriptor>
          <didl:Resource mimeType="application/pdf"
            ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
        </didl:Component>
        ……
      </didl:DIDL>

  20. Associating an XMLtape with ARC Files (2) • An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.

  21. Associating an XMLtape with ARC Files (2) (XMLtape header)

      <?xml version="1.0" encoding="UTF-8"?>
      <ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
        <ta:tapeAdmin>
          <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
            <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
            <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
            <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
            <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
          </tb:XMLtapeBasics>
        </ta:tapeAdmin>
        <ta:tapeRecord>
          <ta:tapeRecordAdmin>
          …
      </ta:tape>
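Reading the tape-level administrative section is enough to discover which ARCfiles belong to an XMLtape. The sketch below parses a shortened, self-closed version of the header (the elided record part is left out so that the snippet is well-formed) with `xml.etree.ElementTree`. The namespace URIs and element names are those shown on the slide; the function name `related_arcfiles` is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/",
    "tb": "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/",
}

# Shortened header: only the administrative section, no tape records.
TAPE_HEADER = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
</ta:tape>
"""


def related_arcfiles(tape_xml: bytes):
    """Return (XMLtape identifier, list of associated ARCfile identifiers)."""
    root = ET.fromstring(tape_xml)
    basics = root.find("ta:tapeAdmin/tb:XMLtapeBasics", NS)
    if basics is None:
        return None, []
    tape_id = basics.findtext("tb:XMLtapeId", namespaces=NS)
    # findall is used so that a tape listing several ARCfiles is handled too.
    arc_ids = [element.text for element in basics.findall("tb:ARCfileId", NS)]
    return tape_id, arc_ids


tape_id, arc_ids = related_arcfiles(TAPE_HEADER)
print(tape_id, arc_ids)
```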

  22. [Diagram without title. Labels: AGENT; Identifier Locator; XMLtape; ARC file; DIDL document; datastream; OpenURL; DIDLDocument-id index; datastream-id index; creation datetime index; list of (baseURL, DIDLDocument-id); DIDLDocument-id or content-id; datastream ref; datastream-id.]

  23. aDORe XMLtape/ARCfile environment

  24. Implementation • XMLtapes: • Berkeley DB Java Edition • OCLC OAICat • ARCfiles: • Heritrix • OCLC OpenURL software • XMLtape Registry: • MySQL database • OCLC OAICat • ARCfile Registry: • MySQL database • OCLC OAICat

  25. Performance indicators • System: • Model: Dell 2650 2U rack-mount server • CPU: dual 2.8 GHz Intel Xeon processors • RAM: 5 GB • Disks: 10k RPM SCSI disks • XMLtape: • 1786 MB, 201872 DIDL records • download of 100 consecutive DIDL records (787 KB): 0.18 seconds • download of a static file of the same size: 0.09 seconds • ARCfile: • 272 MB, 4910 files • download of a sample PDF file (312 KB): 0.24 seconds • download of a static file of the same size: 0.036 seconds
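The timings above are easier to compare as ratios against the static-file baseline. The short calculation below uses only the four figures from the slide; the dictionary layout and the function name `slowdown` are just one way to organize them.

```python
# (protocol-based download, static file of the same size), in seconds
measurements = {
    "XMLtape, 100 DIDL records (787 KB)": (0.18, 0.09),
    "ARCfile, sample PDF (312 KB)": (0.24, 0.036),
}


def slowdown(served: float, static: float) -> float:
    """How many times slower the protocol-based download is."""
    return served / static


for label, (served, static) in measurements.items():
    extra_ms = (served - static) * 1000
    print(f"{label}: {slowdown(served, static):.1f}x the static download, "
          f"about {extra_ms:.0f} ms extra")
```

By this measure the XMLtape access takes about 2 times as long as a static download and the ARCfile access about 6.7 times as long, while the absolute extra cost stays at roughly 90 ms and 204 ms respectively.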

  26. Software • ARC files: • Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/ • NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the Internet for future generations. http://www.netarchive.dk/ • Many other tools: http://archive-access.sourceforge.net • XMLtapes: • Perl tool, XML::Tape (LANL & Ghent University), http://search.cpan.org/~hochsten/XML-Tape/ • Combined aDORe XMLtape/ARCfile environment: • Java tool (LANL), soon to be released on SourceForge

  27. Conclusion • The file-based approach is inherently simple and reduces dependency on a database system. • The autonomy of the indexes allows retaining the files over time, while the indexes can be re-created using other techniques as technologies evolve. • The protocol-based nature of the access increases flexibility in light of evolving technologies, as it introduces another layer of abstraction. • The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features: • Off-the-shelf XML tools can be used to parse/validate an XMLtape • All DO metadata can be stored in an XML-based compound object format. Presentation available via http://public.lanl.gov/herbertv/ (install the TSCC codec for AVI movies).
