
WEB ARCHIVING issues and challenges




  1. WEB ARCHIVING issues and challenges
     Deborah Woodyard, Digital Preservation Coordinator

  2. Where to start?
     • Selection
     • Collection Development Policy
     • Need to be able to find them again
     • Cataloguing issues
     • 404 Not Found
     • Need to capture web sites
     • Who is responsible for capture?
     • Who is responsible for preservation/access?
     • What does this mean?
     • Define a web site - Where are the boundaries:
     • Links
     • Content on other sites / other servers
     • Changes with time – significant change

  3. Technical issues – Capture software
     • Capture software
     • Taking ‘Snapshots’
     • Follow directory structure or links?
     • Where to break links / replace broken links?
     • Relative vs absolute linking
     • No changes to code for authenticity
     • Preserve ‘original’ version, provide ‘access’ version
     • Obey robots.txt exclusions
     • Politeness – server load
     • Quality control checking
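The robots.txt and politeness points on this slide can be illustrated with a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `example-archiver` user agent, the URLs, and the helper name `crawl_plan` are all assumptions made for illustration; none of them come from the presentation or from the capture tools it names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def crawl_plan(robots_txt, urls, user_agent, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for one site.

    URLs excluded by robots.txt are dropped; the delay is the site's
    Crawl-delay if it declares one, otherwise default_delay.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    delay = parser.crawl_delay(user_agent)
    return allowed, (float(delay) if delay is not None else default_delay)

allowed, delay = crawl_plan(
    ROBOTS_TXT,
    ["http://example.org/index.html", "http://example.org/private/a.html"],
    "example-archiver",
)
```

A harvester would then wait `delay` seconds between requests to the same server, which is one simple way to limit server load.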

  4. Technical issues - Web sites
     • File types - HTML, gif, JPEG, Javascript, asp, etc.
     • Software plug-ins
       - permission
       - access
     • Dynamic database driven sites
       - producing static pages
       - producing pages on-the-fly
     • Frequency of capture
     • Extent of capture
       - volume
       - duplication
       - storage and access to partial sites
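One possible response to the volume and duplication points, when a site is captured repeatedly, is to store each distinct file only once and let every capture refer to it by content hash. The sketch below assumes this approach; the `CaptureStore` class and its fields are hypothetical and are not taken from any system mentioned in the presentation.

```python
import hashlib

class CaptureStore:
    """Hypothetical in-memory store that keeps each distinct payload once."""

    def __init__(self):
        self.payloads = {}   # SHA-256 hex digest -> payload bytes
        self.captures = []   # (capture_date, url, digest), one entry per capture

    def add(self, capture_date, url, payload):
        """Record a capture; return True if the payload had not been stored before."""
        digest = hashlib.sha256(payload).hexdigest()
        is_new = digest not in self.payloads
        if is_new:
            self.payloads[digest] = payload
        self.captures.append((capture_date, url, digest))
        return is_new

store = CaptureStore()
first = store.add("2003-01-01", "http://example.org/logo.gif", b"GIF89a...")
second = store.add("2003-02-01", "http://example.org/logo.gif", b"GIF89a...")
```

Both captures remain listed, so the February snapshot can still be reconstructed, but the unchanged file occupies storage only once.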

  5. Technical issues – storage and access
     • Management and storage
       - high volume
       - multiple captures
       - long term, inc. storage system migration
       - disaster recovery
     • Permanent naming
     • Ensuring authenticity
       - trusted digital repository
       - checksums, signatures – long term
     • Signifying access to archived version
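The checksum point can be sketched as a fixity check: a digest is recorded at ingest and recomputed later to detect any change to the stored bits. SHA-256 and the function names below are assumptions chosen for illustration; the presentation does not name an algorithm.

```python
import hashlib

def fixity_digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest to record when an object is ingested."""
    return hashlib.sha256(payload).hexdigest()

def verify_fixity(payload: bytes, recorded_digest: str) -> bool:
    """Recompute the digest and compare it with the value recorded at ingest."""
    return fixity_digest(payload) == recorded_digest

original = b"<html><body>Archived page</body></html>"
recorded = fixity_digest(original)
```

A checksum alone only shows that the bits are unchanged; showing who produced them would need a signature, which this sketch does not cover.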

  6. Technical issues - preservation
     • Preserve bits
     • Preserve intellectual object, + ‘look & feel’
     • Preserve functionality
     • Technology changes
       - physical storage
       - hardware platform
       - operating systems
       - application software
       - HTML

  7. Technical issues – preservation strategies
     • Metadata for preservation
       - describe bits: how and where stored
       - describe how to interpret/use bits
       - describe the context for the bits
     • Migration
       - in part / in whole
       - valid code?
       - keep all versions?
       - manage multiple versions
     • Emulation
       - of software / OS / platform
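The three kinds of preservation metadata listed on this slide can be pictured as one record per archived file. The field names below are hypothetical and follow no particular metadata standard; they only group the fields by the slide's three headings.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PreservationRecord:
    # Describe the bits: how and where they are stored.
    storage_location: str
    size_bytes: int
    sha256: str
    # Describe how to interpret or use the bits.
    media_type: str
    rendering_notes: str
    # Describe the context for the bits.
    source_url: str
    capture_date: str

def to_json(record: PreservationRecord) -> str:
    """Serialise the record so it can be stored alongside the object."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)

record = PreservationRecord(
    storage_location="store/0001/index.html",   # illustrative path
    size_bytes=2048,
    sha256="0" * 64,                             # placeholder digest
    media_type="text/html",
    rendering_notes="HTML 4.01; no plug-ins required",
    source_url="http://example.org/index.html",
    capture_date="2003-02-25",
)
serialised = to_json(record)
```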

  8. LEGAL DISCUSSION
     • Minimise risk
     • Capture non-commercial sites
     • Preserve without providing access
     • Embargo or limit access
     • Document actions taken
     • Maintain ability to remove access

  9. Cost
     • £££ ??
     • to do it
     • of not doing it

  10. PROJECTS
      • General project types:
      • Selective
        - narrow, high quality, low volume
      • Comprehensive
        - broad, lower quality, high volume
      • Combination
        - useful, high quality, high volume

  11. PROJECTS
      • British Library involvement:
      • Domain.UK - selective
      • UK Web Archiving Consortium - selective
      • International Internet Preservation Consortium (IIPC) – comprehensive/combination

  12. Project details
      • Domain.uk
        - WebWhacker, HTTrack
        - Regular captures of simple sites
        - Staff PC (later networked drive), very small
        - No access
      • UK WAC
        - UK partners sharing one system
        - PANDAS management, HTTrack, Oracle
        - Manual selection, cataloguing and quality checking
        - Web interface for capture and public access

  13. Project details
      • IIPC
        - Comprehensive automated selection
          - links in / links out
          - authority / hits
          - rare words
        - Designing new crawler / harvester
        - Developing technical architecture
        - Deep web?
        - Access challenging
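As a rough illustration of the "links in" signal for automated selection, the sketch below counts how many distinct external hosts link to each host. This is an assumed, simplified scoring scheme and not the IIPC's method; the function name and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def inlink_counts(links):
    """Count distinct external hosts linking to each target host.

    links is an iterable of (source_url, target_url) pairs. Links within
    the same host are ignored, and repeated links from one host to
    another are counted once.
    """
    host_pairs = set()
    for source_url, target_url in links:
        source = urlsplit(source_url).hostname
        target = urlsplit(target_url).hostname
        if source and target and source != target:
            host_pairs.add((source, target))
    return Counter(target for _, target in host_pairs)

links = [
    ("http://a.example/x", "http://b.example/"),
    ("http://a.example/y", "http://b.example/z"),      # same host pair, counted once
    ("http://c.example/", "http://b.example/"),
    ("http://b.example/", "http://b.example/about"),   # internal link, ignored
]
counts = inlink_counts(links)
```

Hosts with higher counts could be ranked earlier for capture; the other signals on the slide (hits, rare words) would need their own inputs.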

  14. FUTURE WORK
      • Expand collection
      • Collaborative projects, inc. automated capture and metadata generation
      • Legal deposit instruments for web archiving
      • Provide restricted access

  15. USEFUL REFERENCES
      • http://library.wellcome.ac.uk/projects/archiving_reports.shtml
      • Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust. Michael Day, UKOLN, University of Bath. Version 1.0 - 25 February 2003
      • Legal issues relating to the archiving of Internet resources in the UK, EU, US and Australia. Andrew Charlesworth, University of Bristol, Centre for IT and Law. Version 1.0 - 25 February 2003
      • 2nd ECDL workshop on Web archiving: http://bibnum.bnf.fr/ecdl/2002/index.html
      • Digital Preservation Coalition: http://www.dpconline.org/
