This research paper presents mod_oai, an Apache module that integrates OAI-PMH functionality into the web server to enable web harvesting with OAI-PMH semantics. It discusses the problem of web resource discovery and outlines future research directions.
A New Model for Web Resource Harvesting • Michael Nelson, Computer Science Department, Old Dominion University • Herbert Van de Sompel, Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory • This work supported in part by the Andrew Mellon Foundation & Library of Congress
Outline (0) The Problem (1) mod_oai (2) Future Research
WWW and DL: Separated at Birth • Timeline: 1994 to today • WWW: The Good: XML, BitTorrent, Web Services; The Bad: RSS; The Ugly: Semantic Web • DL: The Good: OAIS, DOI, OAI-PMH; The Bad: Dublin Core; The Ugly: SRU/W • The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.
Web Robots • what is this file? • what are its relationships to other files? • how often does it change? • Robot asks www.getty.edu: what documents have been modified since 2003-11-15? • www.getty.edu: doc1 (last mod 2003-03-12), doc2 (last mod 2002-07-19), …, doc100 (last mod 2003-09-11) • robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
A More Efficient Way • Robot asks www.getty.edu with mod_oai: what documents have been modified since 2003-11-15? • www.getty.edu: doc1 (last mod 2003-03-12), doc2 (last mod 2002-07-19), …, doc100 (last mod 2003-09-11) • Complex object: <co> <metadata/> <link/> <link/> <change/> … </co>
Outline (0) The Problem (1) mod_oai (2) Future Research
mod_oai approach • Goal: integrate OAI-PMH functionality into the web server itself… • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server • written in C • respects values in .htaccess, httpd.conf • compile mod_oai on http://www.foo.edu/ • baseURL is now http://www.foo.edu/modoai • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) • http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
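The request on this slide can be assembled programmatically. The following is a minimal sketch, assuming the baseURL from the slide; the helper name `build_oai_request` and the trailing-underscore convention for `from_` are illustrative choices, not part of mod_oai or OAI-PMH.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **args):
    """Return an OAI-PMH request URL for the given verb and arguments.

    Hypothetical helper: 'from' is a Python keyword, so callers pass it
    as 'from_' and the trailing underscore is stripped here.
    """
    params = {"verb": verb}
    for key, value in args.items():
        params[key.rstrip("_")] = value
    # safe=":" leaves the colons in set specs such as mime:video:mpeg unescaped
    return base_url + "?" + urlencode(params, safe=":")


url = build_oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    metadataPrefix="oai_dc",
    from_="2004-09-15",
    set="mime:video:mpeg",
)
print(url)
```

Swapping the verb to `ListRecords` or adding an `until` argument follows the same pattern.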
OAI-PMH data model in mod_oai • resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf • item: OAI-PMH identifier = entry point to all records pertaining to the resource • records: metadata pertaining to the resource (HTTP header metadata, Dublin Core metadata, MPEG-21 DIDL) • OAI-PMH sets: MIME type
Resource Discovery: ListIdentifiers • The harvester: • issues a ListIdentifiers request • finds URLs of updated resources • does HTTP GETs for the updated resources only • can get URLs of resources with specified MIME types
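The harvester side of this slide can be sketched as follows. The element names and namespace come from the OAI-PMH 2.0 response format; the sample response, its URLs, and the assumption that each identifier is the resource URL are made up for illustration and are not actual mod_oai output.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListIdentifiers response, for illustration only
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-10-01T00:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/a.mpg</identifier>
      <datestamp>2004-09-20T12:00:00Z</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/old.mpg</identifier>
      <datestamp>2004-09-21T08:30:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.foo.edu/b.mpg</identifier>
      <datestamp>2004-09-25T09:15:00Z</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_urls(response_xml):
    """Return identifiers of non-deleted records in a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.findall("./oai:ListIdentifiers/oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # nothing to fetch for a deleted record
        urls.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return urls


for resource_url in updated_urls(SAMPLE_RESPONSE):
    # A real harvester would issue an HTTP GET for each URL here.
    print(resource_url)
```

A complete harvester would also follow `resumptionToken` elements to page through large result lists; that step is omitted here.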
Preservation: ListRecords • The harvester: • issues a ListRecords request • gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference) • can get resources with specified MIME types
performance of mod_oai and wget on www.cs.odu.edu
Readings • Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069
Outline (0) The Problem (1) mod_oai (2) Future Research
Issues and Future Work • For a given server, there is a set of URLs, U, and a set of files, F • Apache maps U → F • mod_oai maps F → U • Neither function is 1-1 nor onto • We can easily check if a single u maps to F, but given F we cannot (easily) generate U • Short-term issues: • dynamic files: exporting unprocessed server-side files would be a security hole • IndexIgnore: httpd will “hide” valid URLs • File permissions: httpd will advertise files it cannot read • Long-term issues: • Alias, Location: files can be covered up by the httpd • UserDir: interactions between the httpd and the filesystem
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs http://server/A and http://server/B
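The effect of these two Alias lines can be shown with a toy model. This is a simplified sketch of prefix-based URL-to-file mapping, not Apache's actual mod_alias logic; the DocumentRoot value and both function names are assumptions made for the example.

```python
# Assumed DocumentRoot; the slide only shows the Alias targets.
DOCUMENT_ROOT = "/usr/local/web/htdocs"

# The two Alias directives from the slide, as (URL prefix, directory) pairs.
ALIASES = [
    ("/A", "/usr/local/web/htdocs/B"),
    ("/B", "/usr/local/web/htdocs/A"),
]


def url_path_to_file(url_path):
    """Simplified U -> F mapping: first matching Alias wins, else DocumentRoot."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCUMENT_ROOT + url_path


def naive_file_to_url_path(file_path):
    """Naive F -> U mapping that strips DocumentRoot and ignores Alias."""
    return file_path[len(DOCUMENT_ROOT):]


file_on_disk = "/usr/local/web/htdocs/A/index.html"
guessed_url = naive_file_to_url_path(file_on_disk)
served_file = url_path_to_file(guessed_url)

print(guessed_url)   # URL a filesystem walk would advertise
print(served_file)   # file the server would actually return for that URL
```

The URL guessed from the file under `A` is served from `B`, so walking the filesystem alone would advertise the wrong resource.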
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%
Looking Further Down the Road for mod_oai • “Reverse” the method of URL discovery • cannot look to the files; • listen to incoming requests and build a list of valid URLs • could be seeded with files at start • also the method for handling server processed files / URLs • Plug-ins for descriptive metadata • DC tags in HTML • MS Office formats, PDF • Tags from JPEG, TIFF, MP3, etc. • Additional metadata in the DIDL • technical metadata from JHOVE • estimated change rate • cf. Cho & Garcia-Molina, ACM TOIT 28(4) • http log access as separate metadata formats • cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
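The "reverse" discovery idea can be sketched as follows. The slide describes listening to incoming requests; reading Common Log Format lines is an assumption made here to keep the example self-contained, and the function name, sample log lines, and seed value are hypothetical.

```python
import re

# Matches Common Log Format: host ident user [date] "METHOD path protocol" status bytes
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+'
)


def valid_urls(log_lines, seed=()):
    """Collect URL paths the server answered successfully, starting from a seed set."""
    known = set(seed)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match.group("method") in ("GET", "HEAD") and match.group("status") in ("200", "304"):
            known.add(match.group("path"))
    return known


sample_log = [
    '192.0.2.1 - - [15/Sep/2004:10:00:00 -0400] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.2 - - [15/Sep/2004:10:00:05 -0400] "GET /missing.html HTTP/1.0" 404 209',
    '192.0.2.3 - - [15/Sep/2004:10:00:09 -0400] "GET /cgi-bin/report?id=7 HTTP/1.0" 200 512',
]

# Seed with a path found on disk at start-up, as the slide suggests.
urls = valid_urls(sample_log, seed=["/about.html"])
print(sorted(urls))
```

Because the list is built from what the server actually answered, server-processed URLs such as the `cgi-bin` path are captured without exposing the underlying files.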
Expanding OAI-PMH / Complex Object Access • OAI-PMH / CO access for: • blogs • message boards • native file systems • e.g. Mac OS X “Spotlight” • More aggressive use of OAI-PMH / CO for preservation • recently funded NSF DIGARCH program • use for preservation: • Usenet • Email • Multicasting
OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting • Better web harvesting can be achieved through: • OAI-PMH: structured access to updates • Complex object formats: modeled representation of digital objects • Use cases: • Preservation (ListRecords) • Web crawling (ListIdentifiers) • mod_oai: reference implementation • Better performance than wget • static files only; dynamic files in the future • not a replacement for DSpace, Fedora, eprints.org, etc. • More info: • http://www.modoai.org/ • http://whiskey.cs.odu.edu/