The International Internet Preservation Consortium (IIPC)
Catherine Lupovici, Program Officer, Bibliothèque nationale de France
ELAG, Geneva, 31 May - 4 June 2005
IIPC objectives
• Provide a forum for sharing knowledge about Internet content archiving, both within the Consortium and beyond
• Develop and recommend standards
• Develop interoperable tools and techniques to acquire, archive and provide access to web sites
• Raise awareness of Internet preservation issues and initiatives through conferences, workshops, training events, publications, etc.
IIPC
• Launched in July 2003
• Members:
  • Bibliothèque nationale de France (leader)
  • National Library of Australia
  • Library and Archives Canada
  • National Library of Denmark
  • National Library of Finland
  • National Library of Iceland
  • National Library of Italy
  • National Library of Norway
  • National Library of Sweden
  • British Library (UK)
  • Library of Congress (USA)
  • Internet Archive
Membership
• Currently 12 members, for the duration of the agreement (until July 2006)
• Applications from new members (end of 2005) for the second phase of the IIPC
IIPC: a pragmatic approach
• Two levels of work:
  • working groups
  • projects accepted by the steering committee
• Deliverables expected:
  • tools released under a free, open source license
  • recommendations (methodologies, processes, standards, etc.)
Working Groups
• Framework: architecture and standards
• Metrics and Test Bed: defining and implementing a test ground for crawlers
• Access tools: development of a toolset
• Deep Web: development of tools for deposit of and access to database-driven document gateways
• Content management: common vision of collection coverage and complementarity
• Researchers' requirements: comments and advice on content and access
IIPC Web Archiving Toolset
A full set of tools for the whole chain
• Focused selection and verification
• Acquisition
• Collection storage and maintenance
• Access
Architecture of tools for Web Archives
Acquisition Chain (1)
• Large scale, archive quality crawler: Heritrix
  • Specification in early 2003
  • Jointly developed by IA and the Nordic libraries
• Strengths
  • Good at finding paths to content
  • Site priority implemented
  • Very configurable and modular
• Next steps
  • Incremental crawls
  • Multi-machine
Acquisition Chain (2)
• Smart Archiving Crawler Project
  • Specification in early 2003
  • Joint call for tender by BL and BnF
  • Goal: to implement large scale, automatically focused crawls
  • Priority based on citation linking and thematic assessment
  • Call in October 2004, first prototype mid-2006
Acquisition Chain (3)
• Deep Arc
  • Specification in early 2002
  • Developed by BnF
  • Goal: to allow site producers to easily extract databases to flat XML files
  • Available for testing: http://deeparc.sourceforge.net/
ARC file management tools
• Several tools already exist
• Unify and release an official IIPC toolset to:
  • Generate
  • Parse
  • Search
  • Access ARC files
• Early 2005
Access tools (1)
• URI-based access
  • Display correctly in a controlled environment
  • Make it browsable in extension and time
  • URI canonization
  • Provide appropriate information on the fly
• Start from the achievements of the NWA tools and the experience of IA in this domain
• Demo: http://nwa.nb.no/demo/search.php
• 2005
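The slide does not say which canonization rules the access tools apply. As a minimal sketch, assuming only what the CDX index entry at the end of this deck suggests (a key of the form `10-18.fr/10_18/img/btn/btn_cartes.gif` for a resource on host `www.10-18.fr`), the hypothetical function `canonize` below drops the scheme, lowercases the host and strips a leading `www.`. These three rules are assumptions for illustration, not the IIPC or NWA implementation.

```python
from urllib.parse import urlsplit


def canonize(uri: str) -> str:
    """Return a lookup key for a URI (hypothetical rules, for illustration only)."""
    parts = urlsplit(uri)
    # urlsplit().hostname is already lowercased and excludes any port.
    host = parts.hostname or ""
    # Assumed rule: "www.example.org" and "example.org" share one key.
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + (parts.path or "/")
    # Keep the query string, since it often selects a different document.
    if parts.query:
        key += "?" + parts.query
    return key


key = canonize("http://www.10-18.fr/10_18/img/btn/btn_cartes.gif")
print(key)
```

With such a key, two captures of the same resource made at different dates fall under one index entry, which is what makes browsing over time possible.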
Access tools (2)
• Large scale indexer
  • Basic search (boolean, proximity)
  • Time dimension
  • Distributed indexing and index
  • 100 million documents and more
• Start from existing open source developments: Lucene and Nutch
• 2005
Access tools (3)
• DB query interface generator for databases stored as XML
• Developed by NLA with partial IIPC funding
• Xinq (XML INQuiry): http://www.nla.gov.au/xinq/
IIPC toolkit ready before mid-2006:
• Robust and scalable up to the global web
• Implements IIPC standards (ARC 3.0, metadata, API, etc.)
• Easy to install and use for advanced users (web archiving engineers)
• Open source and available to the community of web archives
Standards for web archiving and preservation
Web archives organization
[Diagram: an access tool (e.g. the Wayback Machine) on top of the CDX, DAT and ARC files]
• ARC: format that will be introduced to ISO TC 46/SC 4
• Characteristics
  • self-describing format
  • extensible
• An ARC file is a 100 MB container of harvested objects
Extract of an ARC file
  http://www.journal-officiel.gouv.fr/accueil.php 213.244.10.170 20050419153406 text/html 13677
  HTTP/1.1 200 OK
  Date: Tue, 19 Apr 2005 15:34:02 GMT
  Server: Apache
  X-Powered-By: PHP/4.3.6
  Set-Cookie: PHPSESSID=eafe3f27abe62044804187392c6b991c; path=/
  Expires: Thu, 19 Nov 1981 08:52:00 GMT
  Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
  Pragma: no-cache
  Connection: close
  Content-Type: text/html

  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <title>Journaux Officiels</title>
  <meta name="author" content="CED">
  <meta name="generator" content="WebExpert 5">
  <META NAME="description" CONTENT="Consultez le JO lois et décrets, les annonces de marchés publics du BOAMP, les créations da ssociations et fondations, les annonces légales obligatoires du BALO, commandez les ouvrages juridiques et conventions collectives.">
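The first line of the extract above is the record header: five space-separated fields that precede the HTTP response. A minimal sketch of reading it follows. The field names (`url`, `ip_address`, `archive_date`, `content_type`, `length`) are assumptions inferred from the values in this one extract, and `parse_arc_header` is a hypothetical helper, not part of any IIPC tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # assumed: size in bytes of the content that follows


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Split one ARC record header line into its five fields."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, stamp, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # 14-digit timestamp: YYYYMMDDhhmmss
        archive_date=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


header = parse_arc_header(
    "http://www.journal-officiel.gouv.fr/accueil.php "
    "213.244.10.170 20050419153406 text/html 13677"
)
print(header.archive_date.isoformat(), header.content_type, header.length)
```

Because the header carries the length of what follows, a reader can skip from one record to the next without parsing the harvested content, which fits the description of ARC as a self-describing container.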
The DAT file: the metadata file
  http://www.bottingourmand.fr/bg_offre_speciale.php?id_article=5159&type=R 212.37.208.33 20041219032127 alexa/dat 1614
  m text/html
  s 200
  c fd085d373704ccc4fb4ace29c5c7c2f3
  k 911e7f15e64eb9e0399a5796a280a2d0
  v 127832757
  V 99374535
  n 16911
  t Bottin Gourmand
  x www.bottingourmand.fr/bg_style.css
  y www.bottingourmand.fr/commun/js/pop/sval_pop.js
  i www.bottingourmand.fr/visu/com/logo.gif
  l www.bottingourmand.com/
  l www.bottingourmand.fr/bg_hotel.php
  l www.bottingourmand.fr/bg_dedie.php?id_article=594&type=CBG
• One DAT file per ARC file
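The DAT record above has a header line in the same five-field shape as an ARC record header, followed by metadata entries introduced by a single-letter key. As a rough sketch, assuming one entry per line as in the original slide layout, the hypothetical helper `parse_dat_record` groups the values by key letter. It deliberately does not interpret the letters, since the slide does not define them; from the sample values alone, `m`, `s`, `t` and `l` look like MIME type, status code, title and outgoing links.

```python
from collections import defaultdict

# Sample record copied from the slide, one metadata entry per line (assumed layout).
SAMPLE_DAT_RECORD = """\
http://www.bottingourmand.fr/bg_offre_speciale.php?id_article=5159&type=R 212.37.208.33 20041219032127 alexa/dat 1614
m text/html
s 200
c fd085d373704ccc4fb4ace29c5c7c2f3
k 911e7f15e64eb9e0399a5796a280a2d0
v 127832757
V 99374535
n 16911
t Bottin Gourmand
x www.bottingourmand.fr/bg_style.css
y www.bottingourmand.fr/commun/js/pop/sval_pop.js
i www.bottingourmand.fr/visu/com/logo.gif
l www.bottingourmand.com/
l www.bottingourmand.fr/bg_hotel.php
l www.bottingourmand.fr/bg_dedie.php?id_article=594&type=CBG
"""


def parse_dat_record(text: str) -> tuple[list[str], dict[str, list[str]]]:
    """Return (header fields, {key letter: [values]}) for one DAT record."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split(" ")
    entries: dict[str, list[str]] = defaultdict(list)
    for line in lines[1:]:
        # Split on the first space only: values such as titles contain spaces.
        key, _, value = line.partition(" ")
        entries[key].append(value)
    return header, dict(entries)


dat_header, dat_entries = parse_dat_record(SAMPLE_DAT_RECORD)
print(dat_entries["t"], len(dat_entries["l"]))
```

Keeping a list per key matters because some keys repeat: the sample has three `l` entries.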
The CDX file: the index file
  10-18.fr/10_18/img/btn/btn_cartes.gif 20050304235200 www.10-18.fr image/gif 200 1670a68d91aa6b3960acdc4ed36e1781 - 57743869 BNF-CRAWL-000-RECRAWL-20050304234553-00040-crawling006.archive.org
• General index
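The CDX entry above has nine space-separated fields. A minimal sketch of reading one entry follows. The field names are assumptions inferred from the sample values (canonized URL, capture timestamp, original host, MIME type, response code, checksum, a redirect slot holding `-`, an offset, and the name of the ARC file); `parse_cdx_line` is a hypothetical helper, and real CDX files may declare a different column order.

```python
from typing import NamedTuple


class CdxEntry(NamedTuple):
    canonized_url: str
    timestamp: str   # 14 digits, YYYYMMDDhhmmss
    host: str
    mime_type: str
    status: int
    checksum: str
    redirect: str    # "-" when empty (assumed meaning)
    offset: int      # assumed: position of the record inside the ARC file
    arc_file: str


def parse_cdx_line(line: str) -> CdxEntry:
    """Split one CDX index line into named fields."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    url, stamp, host, mime, status, checksum, redirect, offset, arc_file = parts
    return CdxEntry(url, stamp, host, mime, int(status), checksum,
                    redirect, int(offset), arc_file)


entry = parse_cdx_line(
    "10-18.fr/10_18/img/btn/btn_cartes.gif 20050304235200 www.10-18.fr "
    "image/gif 200 1670a68d91aa6b3960acdc4ed36e1781 - 57743869 "
    "BNF-CRAWL-000-RECRAWL-20050304234553-00040-crawling006.archive.org"
)
print(entry.arc_file, entry.offset)
```

Under these assumptions, an access tool can resolve a URL and date to an ARC file name and an offset, then read that single record instead of scanning a whole 100 MB container.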