A New Model for Web Resource Harvesting

A New Model for Web Resource Harvesting Michael L. Nelson Old Dominion University joint work with: Her Herbert Van de Sompel Xiaoming Liu Carl Lagoze Simeon Warner Terry Harrison This work supported in part by the Andrew Mellon Foundation & Library of Congress

My Research Interests • Digital Objects and Repositories • Interaction between them: roles, responsibilities, architecture, scalability • Bumper sticker: Free the Object from the Tyranny of the Repository • Digital Preservation • Shared infrastructure models, automation, large-scale best effort strategies • Bumper sticker: We Need Fewer Heroes • User / System Co-Evolution • Discerning intent from access large-scale patterns • Bumper sticker: Freedom From Choice

Outline (0) The Problem (1) OAI-PMH Mechanics (2) OAI-PMH for Resource Harvesting (3) mod_oai (4) Future Research

WWW and DL: Separated at Birth The Good: XML, BitTorrent, Web Services The Bad: RSS The Ugly: Semantic Web WWW WWW DL DL The Good: OAIS, DOI, OAI-PMH The Bad: Dublin Core The Ugly: SRU/W Today 1994 The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered. more on DL/WWW, from the NSF Post-DL Workshop: http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html

what is this file? what are its relationships to other files? how often does it change? Web Robots what documents have been modified since 2003-11-15 ? www.getty.edu … doc1; last mod 2003-03-12 doc2; last mod 2002-07-19 doc100; last mod 2003-09-11 robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

<co> <metadata/> <link/> <link/> <change/> … </co> A More Efficient Way what documents have been modified since 2003-11-15 ? www.getty.edu with mod_oai … doc1; last mod 2003-03-12 doc2; last mod 2002-07-19 doc100; last mod 2003-09-11

A Very Brief OAI-PMH History • Universal Preprint Service • A cross-archive DL that that provides services on a collection of metadata harvested from multiple archives • not distributed searching • Demonstrated at Santa Fe NM, October 21-22, 1999 • D-Lib Magazine, 6(2) 2000 (2 articles) • http://www.dlib.org/dlib/february00/02contents.html • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/ • The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) • in use around the world, 600+ known instances • registration not required; many unknown instances • http://gita.grainger.uiuc.edu/registry/ • http://celestial.eprints.org/cgi-bin/status • used by Google ca. late 2004 • http://www.nla.gov.au/digicoll/oai/

Data Providers / Repositories Service Providers / Harvesters “A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.” “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Aggregators • aggregators allow for: • scalability for OAI-PMH • load balancing • community building • discovery data providers (repositories) service providers (harvesters) aggregator

resource OAI-PMH sets OAI-PMH identifier item OAI-PMH identifier metadataPrefix datestamp Dublin Core metadata MARCXML metadata records OAI-PMH data model entry point to all records pertaining to the resource metadata pertaining to the resource

Overview of OAI-PMH Verbs metadata about the repository harvesting verbs most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

Resource Harvesting: Use cases • Discovery: use content itself in the creation of services • search engines that make full-text searchable • citation indexing systems that extract references from the full-text content • browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections • Preservation: • periodically transfer digital content from a data repository to one or more trusted digital repositories • trusted digital repositories need a mechanism to automatically synchronize with the originating data repository Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html

Existing OAI-PMH based approaches Typical scenario: • An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository. • The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource. • A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.

Existing OAI-PMH based approaches : Issue 1 • Locating the resource based on information provided in dc.identifier • dc.identifier used to convey a variety of identifier: (simultaneously) URL DOI, bibliographic citation, … Not expressive enough to distinguish between identifier, locator. • Several dereferencing attempts required • URI provided in dc.identifier is commonly that of a bibliographic “splash page” • How to know it is a bibliographic “splash page”, not the resource? • If it is a bibliographic “splash page”, where is the resource?

Existing OAI-PMH based approaches : Issue 2 • Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting: • Datestamp of DC record does not necessarily change when resource changes

Existing OAI-PMH based approaches : Conventions • Cannot really address issue 2 (datestamps) with metadata conventions • Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions • First dc.identifier is locator of the resource • what if the resource is not digital? • Use of dc.format and/or dc.relation to convey locator

splash page locator of resource Existing OAI-PMH based approaches : Conventions

splash page splash page locator of resource Existing OAI-PMH based approaches : Conventions

Existing OAI-PMH based approaches : Other attempts • dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s) • What if there is no splash page? • How does a harvester recognize this situation? • OA-X: protocol extension • OK in local context • Strategic problem to generalize • How to consolidate with OAI-PMH data model • Qualified Dublin Core • Could bring expressiveness to distinguish between locator & identifier • But what about the datestamp issue?

Proposed OAI-PMH based approach • Use metadata formats that were specifically created for representation of digital objects: • Complex Object Formats as OAI-PMH metadata formats • MPEG-21 DIDL, METS, ..

resource item Dublin Core metadata MARCXML metadata MPEG-21 DIDL METS records OAI-PMH data model OAI-PMH identifier = entry point to all records pertaining to the resource metadata pertaining to the resource simple more expressive highly expressive highly expressive

Complex Object Formats : characteristics • Representation of a digital object by means of a wrapper XML document. • Represented resource can be: • simple digital object (consisting of a single datastream) • compound digital object (consisting of multiple datastreams) • Unambiguous approach to convey identifiers of the digital object and its constituent datastreams. • Include datastream: • By-Value: embedding of base64-encoded datastream • By-Reference: embedding network location of the datastream • not mutually exclusive; equivalent • Include a variety of secondary information • By-Value • By-Reference • Descriptive metadata, rights information, technical metadata, …

Complex Object Formats & OAI-PMH • Resource represented via XML wrapper => OAI-PMH <metadata> • Uniform solution for simple & compound objects • Unambiguous expression of locator of datastream • Disambiguation between locators & identifiers • OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes • OAI-PMH semantics apply: “about” containers, set membership

OAI-PMH based approach using Complex Object Format Typical scenario: • An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb • The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected. • A parser at the end of the harvesting application analyzes each harvested complex object record: • The parser extracts the bitstreams that were delivered By-Value. • The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference. • A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.

Complex Object Formats & OAI-PMH : issues • Which Complex Object Format(s) • How to Profile Complex Object Format(s) for OAI-PMH Harvesting • Large records • Making resources re-harvestable • Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline? • Tools: • Software library to write compliant complex objects • Integration of this library with repository systems (Fedora, DSpace, eprints.org, ….)

mod_oai approach • Goal: integrate OAI-PMH functionality into the web server itself… • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server • written in C • respects values in .htaccess, httpd.conf • compile mod_oai on http://www.foo.edu/ • baseURL is now http://www.foo.edu/modoai • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) • http://www.foo.edu/modoai? verb=ListIdentifiers & metdataPrefix=oai_dc & from=2004-09-15 & set=mime:video:mpeg

resource OAI-PMH sets MIME type item HTTP header metadata Dublin Core metadata MPEG-21 DIDL records OAI-PMH data model in mod_oai http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf OAI-PMH identifier = entry point to all records pertaining to the resource metadata pertaining to the resource

OAI-PMH concepts : typical repository

OAI-PMH concepts : mod_oai

Resource Discovery: ListIdentifiers harvester • issues a ListIdentifiers, • finds URLs of updated resources • does HTTP GETs updates only • can get URLs of resources with specified MIME types

Preservation: ListRecords harvester • issues a ListRecords, • Gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference) • can get resources with specified MIME types

performance of mod_oai and wget on www.cs.odu.edu for more detail: “mod_oai: An Apache Module for Metadata Harvesting “ http://arxiv.org/abs/cs.DL/0503069

Issues and Future Work • For a given server, there are a set of URLs, U, and a set of files F • Apache maps U F • mod_oai maps F U • Neither function is 1-1 nor onto • We can easily check if a single u maps to F, but given F we cannot (easily) generate U • Short-term issues: • dynamic files • exporting unprocessed server-side files would be a security hole • IndexIgnore • httpd will “hide” valid URLs • File permissions • httpd will advertise files it cannot read • Long-term issues • Alias, Location • files can be covered up by the httpd • UserDir • interactions between the httpd and the filesystem

IndexIgnore & File Permissions

Alias: Covering Up Files httpd.conf: Alias /A /usr/local/web/htdocs/B Alias /B /usr/local/web/htdocs/A the files “A” and “B” will be different from the URLs http://server/A http://server/B

UserDir: “Just in Time” mounting of directories whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home liu_x/ mln/ whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso /home/tharriso/ whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home liu_x/ mln/ tharriso/ whiskey.cs.odu.edu:/ftp/WWW/conf %

Looking Further Down the Road for mod_oai • “Reverse” the method of URL discovery • cannot look to the files; • listen to incoming requests and build a list of valid URLs • could be seeded with files at start • also the method for handling server processed files / URLs • Plug-ins for descriptive metadata • DC tags in HTML • MS Office formats, PDF • Tags from JPEG, TIFF, MP3, etc. • Additional metadata in the DIDL • technical metadata from JHOVE • estimated change rate • cf. Cho & Garcia-Molina, ACM TOIT 28(4) • http log access as separate metadata formats • cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)

Expanding OAI-PMH / Complex Object Access • OAI-PMH / CO access for: • blogs • message boards • native file systems • e.g. Mac OS X “Spotlight” • More aggressive use of OAI-PMH / CO for preservation • recently funded NSF DIGARCH program • use for preservation: • Usenet • Email • Multicasting

OAI-PMH + Complex Objects:A New Model for Web Resource Harvesting • Better web harvesting can be achieved through: • OAI-PMH: structured access to updates • Complex object formats: modeled representation of digital objects • Use cases: • Preservation (ListRecords) • Web crawling (ListIdentifiers) • mod_oai: reference implementation • Better performance than wget • static files only; dynamic files in the future • not a replacement for DSpace, Fedora, eprints.org, etc. • More info: • http://www.modoai.org/ • http://whiskey.cs.odu.edu/

A New Model for Web Resource Harvesting