The International Internet Preservation Consortium (IIPC)

Presentation Transcript


  1. The International Internet Preservation Consortium (IIPC) Catherine Lupovici Program Officer Bibliothèque nationale de France ELAG, Geneva, 31 May - 4 June 2005

  2. IIPC objectives
     • Provide a forum for sharing knowledge about Internet content archiving both within the Consortium and beyond
     • Develop and recommend standards
     • Develop interoperable tools and techniques to acquire, archive and provide access to web sites
     • Raise awareness of Internet preservation issues and initiatives through conferences, workshops, training events, publications, ...

  3. IIPC
     • Launched in July 2003
     • Members:
       • Bibliothèque nationale de France, leader
       • National Library of Australia
       • Library and Archives Canada
       • National Library of Denmark
       • National Library of Finland
       • National Library of Iceland
       • National Library of Italy
       • National Library of Norway
       • National Library of Sweden
       • British Library (UK)
       • Library of Congress (USA)
       • Internet Archive

  4. Membership
     • 12 members currently, for the duration of the agreement (until July 2006)
     • Applications from new members (end of 2005) for the second phase of the IIPC

  5. IIPC: a pragmatic approach
     • Two levels of work:
       • working groups
       • projects accepted by the steering committee
     • Deliverables expected:
       • tools released under a free open source license
       • recommendations (methodologies, processes, standards, ...)

  6. Working Groups
     • Framework: architecture and standards
     • Metrics and Test Bed: defining and implementing a test ground for crawlers
     • Access tools: development of a toolset
     • Deep Web: development of tools for deposit of and access to database-driven document gateways
     • Content management: common vision of collection coverage and complementarity
     • Researchers' requirements: comments and advice on content and access

  7. IIPC Web Archiving Toolset

  8. Full set of tools for the whole chain
     • Focused selection and verification
     • Acquisition
     • Collection storage and maintenance
     • Access

  9. Architecture of tools for Web Archives

  10. Acquisition Chain (1)
      • Large scale, archive quality crawler: Heritrix
      • Specification in early 2003
      • Jointly developed by IA and the Nordic libraries
      • Strengths
        • Good at finding paths to content
        • Site priority implemented
        • Very configurable and modular
      • Next steps
        • Incremental crawls
        • Multi-machine

  11. Acquisition Chain (2)
      • Smart Archiving Crawler Project
      • Specification in early 2003
      • Joint call for tender by BL and BnF
      • Goal: to implement large scale, automatically focused crawls
      • Priority based on citation linking and thematic assessment
      • Call in October 2004, first prototype mid-2006

  12. Acquisition Chain (3)
      • Deep Arc
      • Specification in early 2002
      • Developed by BnF
      • Goal: to allow a site producer to easily extract a database to XML flat files
      • Available for testing: http://deeparc.sourceforge.net/
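The goal stated on slide 12 (extracting a database to XML flat files) can be illustrated with a minimal sketch. This is not Deep Arc's actual mapping, which the slide does not describe; the `table_to_xml` helper, the `row` element name, and the `articles` table are hypothetical, and SQLite stands in for whatever database a site producer runs.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> str:
    """Dump every row of one table as a flat XML document (illustrative only)."""
    # The table name is interpolated directly, so it must come from a trusted source.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]

    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, "row")  # hypothetical element name
        for name, value in zip(columns, row):
            field = ET.SubElement(record, name)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample database, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'Journaux Officiels')")
xml_text = table_to_xml(conn, "articles")
print(xml_text)
```

A flat file of this kind no longer depends on the database engine that produced it, which is presumably what makes it suitable for deposit in an archive.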

  13. ARC file management tools
      • Several tools already exist
      • Unify and release an official IIPC toolset to:
        • Generate
        • Parse
        • Search
        • Access ARC files
      • Early 2005

  14. Access tools (1)
      • URI-based access
      • Display correctly in a controlled environment
      • Make it browsable in extension and time
      • URI canonization
      • Give you appropriate information on the fly
      • Start from the achievements of the NWA tools and the experience of IA in this domain
      • Demo: http://nwa.nb.no/demo/search.php
      • 2005
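The "URI canonization" bullet on slide 14 refers to what is more commonly called canonicalization: mapping equivalent spellings of an address to one key so that all captures of a page can be found together. The slide does not say which rules the access tools apply, so the rules in this sketch (lowercase host, strip a leading `www.`, drop default ports and fragments, sort query parameters) are assumptions chosen for illustration, and `canonicalize` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Ports that can be dropped because they are implied by the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return one canonical spelling for equivalent URLs (assumed rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"
    # Sort query parameters so that ?b=2&a=1 and ?a=1&b=2 share one key.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


print(canonicalize("HTTP://WWW.Example.org:80/a?b=2&a=1#top"))
```

Stricter or looser rules are possible; the point is only that the same function must be applied both when indexing captures and when looking up a requested URI.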

  15. Access tools (2)
      • Large scale indexer
      • Basic search (boolean, proximity)
      • Time dimension
      • Distributed indexing & index
      • 100 M documents and more
      • Start from existing open source development: Lucene & Nutch
      • 2005

  16. Access tools (3)
      • DB query interface generator for databases stored as XML
      • Developed by NLA with partial IIPC funding
      • Xinq (XML INQuiry): http://www.nla.gov.au/xinq/

  17. IIPC toolkit ready before mid-2006:
      • Robust and scalable up to the global web
      • Implement IIPC standards (ARC 3.0, metadata, API, ...)
      • Easy to install and use for advanced users (web archiving engineers)
      • Open source and available for the community of web archives

  18. Standards for web archiving and preservation

  19. Web archives organization
      • Diagram labels: access tool (e.g. the Wayback Machine), CDX, DAT, ARC
      • ARC: format that will be introduced to ISO TC 46/SC 4
      • Characteristics
        • self-described format
        • extensible
      • An ARC file is a 100 MB container of harvested objects

  20. Extract of an ARC file
      http://www.journal-officiel.gouv.fr/accueil.php 213.244.10.170 20050419153406 text/html 13677
      HTTP/1.1 200 OK
      Date: Tue, 19 Apr 2005 15:34:02 GMT
      Server: Apache
      X-Powered-By: PHP/4.3.6
      Set-Cookie: PHPSESSID=eafe3f27abe62044804187392c6b991c; path=/
      Expires: Thu, 19 Nov 1981 08:52:00 GMT
      Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
      Pragma: no-cache
      Connection: close
      Content-Type: text/html
      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      <title>Journaux Officiels</title>
      <meta name="author" content="CED">
      <meta name="generator" content="WebExpert 5">
      <META NAME="description" CONTENT="Consultez le JO lois et décrets, les annonces de marchés publics du BOAMP, les créations d'associations et fondations, les annonces légales obligatoires du BALO, commandez les ouvrages juridiques et conventions collectives.">
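The first line of the extract on slide 20 is the record header: URL, IP address, 14-digit capture date, content type, and length in bytes, separated by spaces. A minimal sketch of reading that line follows. The field names in `ArcRecordHeader` are labels chosen here for illustration, and the sketch assumes the five-field layout visible in the extract; it does not handle the file-level header record or other layouts.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the payload that follows the header line


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated ARC record header line (five fields assumed)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


header = parse_arc_header(
    "http://www.journal-officiel.gouv.fr/accueil.php "
    "213.244.10.170 20050419153406 text/html 13677"
)
print(header.archive_date.isoformat(), header.length)
```

A reader would then consume `header.length` bytes (the HTTP response headers plus the body) before expecting the next record header, which is what lets one container hold many harvested objects.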

  21. The DAT file: the metadata file (one DAT file per ARC file)
      http://www.bottingourmand.fr/bg_offre_speciale.php?id_article=5159&type=R 212.37.208.33 20041219032127 alexa/dat 1614
      m text/html
      s 200
      c fd085d373704ccc4fb4ace29c5c7c2f3
      k 911e7f15e64eb9e0399a5796a280a2d0
      v 127832757
      V 99374535
      n 16911
      t Bottin Gourmand
      x www.bottingourmand.fr/bg_style.css
      y www.bottingourmand.fr/commun/js/pop/sval_pop.js
      i www.bottingourmand.fr/visu/com/logo.gif
      l www.bottingourmand.com/
      l www.bottingourmand.fr/bg_hotel.php
      l www.bottingourmand.fr/bg_dedie.php?id_article=594&type=CBG
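The DAT record on slide 21 is a header line followed by metadata entries, each introduced by a single-letter key. A minimal sketch of collecting those entries follows, assuming each key and its value occupy their own line in the actual file. The slide does not define the letters, so the sketch keeps them as opaque keys rather than guessing their meaning; `parse_dat_fields` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_dat_fields(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group DAT metadata lines by their single-letter key.

    Keys such as "l" repeat, so every key maps to a list of values.
    """
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first space only: values like a title may contain spaces.
        key, _, value = line.partition(" ")
        fields[key].append(value)
    return dict(fields)


sample = [
    "m text/html",
    "s 200",
    "t Bottin Gourmand",
    "l www.bottingourmand.com/",
    "l www.bottingourmand.fr/bg_hotel.php",
]
fields = parse_dat_fields(sample)
print(fields["t"], len(fields["l"]))
```

Because this metadata is kept apart from the ARC payload, a tool can read the extracted values without re-parsing the archived HTML.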

  22. The CDX file: the index file
      10-18.fr/10_18/img/btn/btn_cartes.gif 20050304235200 www.10-18.fr image/gif 200 1670a68d91aa6b3960acdc4ed36e1781 - 57743869 BNF-CRAWL-000-RECRAWL-20050304234553-00040-crawling006.archive.org
      • General index
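The CDX line on slide 22 has nine space-separated fields. A minimal sketch of splitting it follows. The slide does not name the columns, and a real CDX file typically declares its own column order in a header line, so the field names in `CdxEntry` are guesses based on position and on what the values look like.

```python
from typing import NamedTuple


class CdxEntry(NamedTuple):
    # All names below are assumptions inferred from the sample line.
    url_key: str
    timestamp: str
    host: str
    mime_type: str
    status: int
    checksum: str
    redirect: str
    offset: int
    arc_file: str


def parse_cdx_line(line: str) -> CdxEntry:
    """Split one CDX index line into nine positional fields."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    return CdxEntry(
        url_key=parts[0],
        timestamp=parts[1],
        host=parts[2],
        mime_type=parts[3],
        status=int(parts[4]),
        checksum=parts[5],
        redirect=parts[6],
        offset=int(parts[7]),
        arc_file=parts[8],
    )


entry = parse_cdx_line(
    "10-18.fr/10_18/img/btn/btn_cartes.gif 20050304235200 www.10-18.fr "
    "image/gif 200 1670a68d91aa6b3960acdc4ed36e1781 - 57743869 "
    "BNF-CRAWL-000-RECRAWL-20050304234553-00040-crawling006.archive.org"
)
print(entry.arc_file, entry.offset)
```

If the last two fields are indeed a position and a container name, an access tool can go from a URL and date straight to the right place in the right ARC file without scanning the collection.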
