mod_oai: Metadata Harvesting for Everyone
Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango
{mln,aelango}@cs.odu.edu {herbertv,liu_x}@lanl.gov
DLF 2004 Fall Forum, Baltimore, MD, October 25-27, 2004
mod_oai is sponsored by the Andrew Mellon Foundation.
Outline
• mod_oai
• crawling vs. harvesting
• complex objects & OAI-PMH
• how mod_oai works
• scenarios
• demos
• More information
  • http://www.modoai.org/
  • http://www.openarchives.org/
Inefficient Web Crawlers
Crawler to www.getty.edu: what documents have been modified since 2003-11-15?
• doc1; last mod 2003-03-12
• doc2; last mod 2002-07-19
• doc3; last mod 2003-11-29
• doc4; last mod 2002-10-03
• …
• doc100; last mod 2003-09-11
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
A More Efficient Way…
Harvester to www.getty.edu with OAI-PMH: what documents have been modified since 2003-11-15?
• doc1; last mod 2003-03-12
• doc2; last mod 2002-07-19
• doc3; last mod 2003-11-29
• doc4; last mod 2002-10-03
• …
• doc100; last mod 2003-09-11
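The contrast between the two slides can be sketched in a few lines of Python. The dates are the first four from the slide; the `last_modified` table and the `modified_since` helper are illustrative names, not part of mod_oai. A crawler has to touch every document to learn its date, while an OAI-PMH repository applies this filter on the server and returns only the matches.

```python
from datetime import date

# Last-modified dates for the first four documents listed on the slide.
last_modified = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc3": date(2003, 11, 29),
    "doc4": date(2002, 10, 3),
}


def modified_since(docs, since):
    """Return the names of documents whose date is on or after `since`."""
    return sorted(name for name, stamp in docs.items() if stamp >= since)


# The question asked on the slide: what changed since 2003-11-15?
changed = modified_since(last_modified, date(2003, 11, 15))
print(changed)
```

Only doc3 qualifies, so a harvester that can ask the question directly needs one request plus one download instead of one check per document.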
mod_oai
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server
  • written in C
  • respects values in .htaccess, httpd.conf
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
  • http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
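A request like the one above can be assembled programmatically. The following is a minimal sketch, assuming the example host www.foo.edu from the slide; `build_oai_request` and its parameter names are hypothetical helpers, while `verb`, `metadataPrefix`, `from`, `until`, and `set` are the standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, metadata_prefix=None,
                      from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH request URL, leaving out arguments that are None."""
    pairs = [
        ("verb", verb),
        ("metadataPrefix", metadata_prefix),
        ("from", from_date),
        ("until", until_date),
        ("set", set_spec),
    ]
    # Keep ":" unescaped so set specs such as video:mpeg stay readable.
    query = urlencode([(k, v) for k, v in pairs if v is not None], safe=":")
    return f"{base_url}?{query}"


url = build_oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    metadata_prefix="oai_dc",
    from_date="2004-09-15",
    set_spec="video:mpeg",
)
print(url)
```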
OAI-PMH data model
[Figure: resource → item → records]
• resource
• item: OAI-PMH identifier = entry point to all records pertaining to the resource
• records: metadata pertaining to the resource; modeled representation of the resource
  • Dublin Core metadata
  • MARCXML metadata
  • MPEG-21 DIDL
  • METS
• figure labels on the record formats: simple model, more expressive model, complex model (twice)
OAI-PMH and complex models
• OAI-PMH record == modeled representation of the resource
• Can be selectively harvested via OAI-PMH ~ datestamp, set
• Resource can be:
  • simple object (1 file)
  • compound object (multiple files)
• OAI-PMH records can contain:
  • Typical metadata
  • Actual resource(s)
    • By-Value – base64 encoded
    • By-Reference – http address of resource
    • both
  • Identifiers of metadata and resource(s), unambiguously mapped to the identified data
  • A variety of secondary information
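The by-value and by-reference options can be illustrated with a short Python sketch. The dictionary keys (`mimeType`, `encoding`, `data`, `ref`) and both helper functions are assumptions made for illustration; they are not the MPEG-21 DIDL schema or mod_oai's output format.

```python
import base64


def by_value(content: bytes, mime_type: str) -> dict:
    """Embed the resource itself, base64 encoded, in the record."""
    return {
        "mimeType": mime_type,
        "encoding": "base64",
        "data": base64.b64encode(content).decode("ascii"),
    }


def by_reference(url: str, mime_type: str) -> dict:
    """Point at the resource by its HTTP address instead of embedding it."""
    return {"mimeType": mime_type, "ref": url}


embedded = by_value(b"<html>hello</html>", "text/html")
linked = by_reference("http://www.foo.edu/index.html", "text/html")

# A harvester recovers the original bytes from a by-value resource.
restored = base64.b64decode(embedded["data"])
print(restored == b"<html>hello</html>")
```

By-value makes the record self-contained at the cost of size; by-reference keeps the record small but requires a second HTTP request for the content.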
Complex Objects & OAI-PMH
• LANL Repository
  • OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
• APS/LANL/LoC Mirroring
  • OAI-PMH transfer of APS content represented in an application-neutral format (DIDLs)
• LANL DSpace Plug-in
  • Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure
How mod_oai works
• Install on an Apache 2.0 server
  • compile & edit httpd.conf
• http://www.foo.edu/ now has an OAI-PMH baseURL of: http://www.foo.edu/modoai
OAI-PMH Data Model in mod_oai
[Figure: resource → item → records]
• resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
• item
  • OAI identifier == URL of resource
  • set membership == MIME type
• records (modeled representations: DC, HTTP, DIDL)
  • Dublin Core metadata
  • HTTP headers
  • DIDL: base64 or URLs + HTTP headers
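The mapping on this slide can be sketched in Python. Several things here are assumptions: `describe_resource` is a hypothetical helper, Python's `mimetypes` module stands in for whatever MIME type the Apache server would report, and the set spec is written as the MIME type with "/" replaced by ":" by analogy with the `set=video:mpeg` example earlier in the deck. The metadata prefixes are the three used in the demo URLs.

```python
import mimetypes


def describe_resource(url, mime_type=None):
    """Map a web resource to the item described on the slide."""
    if mime_type is None:
        # Stand-in for the server's own MIME lookup (an assumption).
        mime_type, _ = mimetypes.guess_type(url)
    if mime_type is None:
        mime_type = "application/octet-stream"
    return {
        "identifier": url,                     # OAI identifier == URL
        "set": mime_type.replace("/", ":"),    # set membership == MIME type
        "metadataPrefixes": ["oai_dc", "http_header", "oai_didl"],
    }


item = describe_resource(
    "http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf"
)
print(item["identifier"])
print(item["set"])
```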
Use Cases
• Regular Web Crawling
  • use ListIdentifiers to discover URLs
  • add new URLs to the list of URLs to be crawled
• Harvesting Resources w/ OAI-PMH
  • use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP
Regular Crawling: ListIdentifiers
• harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
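The crawling scenario can be sketched as follows. The response below is a hand-written sample in the OAI-PMH 2.0 namespace, not real mod_oai output, and `urls_to_fetch` is a hypothetical helper. Because mod_oai uses the resource URL as the OAI identifier, each identifier in the response is directly a URL to GET.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response (an assumption, for illustration only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/doc3.html</identifier>
      <datestamp>2003-11-29</datestamp>
      <setSpec>text:html</setSpec>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/old.html</identifier>
      <datestamp>2003-11-30</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
""".strip()


def urls_to_fetch(response_xml):
    """Return identifiers of updated, non-deleted items in a response."""
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # nothing to download for a deleted item
        urls.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return urls


print(urls_to_fetch(SAMPLE_RESPONSE))
```

A crawler would append the returned URLs to its queue and issue ordinary HTTP GETs for them.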
Resource Harvesting: ListRecords
• harvester issues a ListRecords, and gets the updates in DIDLs (HTTP headers + by-value or by-ref resources)
Demo
• Repository Explorer
  • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
  • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
• Direct URLs
  • http://whiskey.cs.odu.edu/modoai?verb=Identify
  • http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
  • http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
  • http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
  • http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl
Datestamps and Etags
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf
• Procedure
  • 16 harvests over 1 month of 465,374 .dk domains
  • 5,543,470 possible downloads
  • 5,182,034 successful downloads
  • 599,143 changes
[Figure: Datestamp and Etag Example]
Errors in Datestamps and Etags Indicating Change
• 40.1% of pages without Etags
• 0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf
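How a harvester might use these two headers as change indicators can be sketched as below. This is an illustrative heuristic under stated assumptions, not the procedure from Clausen's paper: `appears_changed` is a hypothetical helper, the header values are invented, and a missing header is treated as "no evidence" rather than as a change.

```python
def appears_changed(previous, current):
    """Compare ETag and Last-Modified headers from two harvests.

    Returns True if any header present in both harvests differs,
    False if all shared headers match, and None if neither header
    is available on both sides.
    """
    verdicts = []
    for name in ("ETag", "Last-Modified"):
        old, new = previous.get(name), current.get(name)
        if old is None or new is None:
            continue  # header missing: it gives no evidence either way
        verdicts.append(old != new)
    if not verdicts:
        return None
    return any(verdicts)


first = {"ETag": '"abc123"', "Last-Modified": "Sat, 29 Nov 2003 10:00:00 GMT"}
second = {"ETag": '"def456"', "Last-Modified": "Mon, 01 Dec 2003 08:30:00 GMT"}

print(appears_changed(first, second))
print(appears_changed(first, first))
print(appears_changed({}, second))
```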
mod_oai…
is:
• a simple way to more efficiently harvest web pages
• a possible impact on robots.txt
• fully OAI-PMH compliant
• works with existing harvesters
is not:
• yet suitable for dynamic files
• a replacement for:
  • DSpace
  • Fedora
  • eprints.org
  • other digital libraries / repositories / CMS