WEB ARCHIVING issues and challenges
Deborah Woodyard, Digital Preservation Coordinator

Where to start?
• Selection
• Collection Development Policy
• Need to be able to find them again
• Cataloguing issues
• 404 Not Found
• Need to capture web sites
• Who is responsible for capture?
• Who is responsible for preservation/access?
• What does this mean?
• Define a web site. Where are the boundaries?
• Links
• Content on other sites / other servers
• Changes with time: what counts as a significant change?

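The boundary question can be made concrete as a scope rule that decides, per URL, whether a link belongs to the site being captured. The sketch below is one hypothetical rule, not one attributed to any project in these slides: a URL is in scope when it is on the same host as the seed URL and under the seed's directory. The function name `in_scope` and the example.org URLs are assumptions for illustration.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, url: str) -> bool:
    """Return True if url is on the seed's host and under the seed's directory."""
    s, u = urlsplit(seed), urlsplit(url)
    # Only web URLs can be captured; skip mailto:, ftp:, javascript:, etc.
    if u.scheme not in ("http", "https"):
        return False
    # Content on other servers is treated as outside the site.
    if (u.hostname or "") != (s.hostname or ""):
        return False
    # Treat the seed path as a directory prefix, e.g. /project/index.html -> /project/
    prefix = s.path if s.path.endswith("/") else s.path.rsplit("/", 1)[0] + "/"
    return (u.path or "/").startswith(prefix)
```

With the seed `http://www.example.org/project/index.html`, a page under `/project/` is in scope, while `http://www.example.org/about.html` and anything on another host are not. A real policy might deliberately include embedded images or stylesheets hosted elsewhere.
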
Technical issues - Capture software
• Capture software
• Taking ‘Snapshots’
• Follow directory structure or links?
• Where to break links / replace broken links?
• Relative vs absolute linking
• No changes to code for authenticity
• Preserve ‘original’ version, provide ‘access’ version
• Obey robots.txt exclusions
• Politeness: server load
• Quality control checking

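The robots.txt and politeness points can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not the behaviour of any capture tool named in these slides. The robots.txt content, the agent name `example-archiver`, and the helper names `build_parser`, `plan_fetches` and `crawl` are all assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def plan_fetches(parser, agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) that respect robots.txt."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay  # fall back when no Crawl-delay is given
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    return allowed, float(delay)


def crawl(urls, fetch, delay):
    """Call fetch(url) for each URL, pausing between requests to limit server load."""
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)
        fetch(url)
```

Here `fetch` is whatever function actually downloads a page; keeping it as a parameter lets the exclusion and delay logic be checked without touching a live server.
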
Technical issues - Web sites
• File types: HTML, GIF, JPEG, JavaScript, ASP, etc.
• Software plug-ins
  - permission
  - access
• Dynamic database-driven sites
  - producing static pages
  - producing pages on the fly
• Frequency of capture
• Extent of capture
  - volume
  - duplication
  - storage and access to partial sites

Technical issues - Storage and access
• Management and storage
  - high volume
  - multiple captures
  - long term, including storage system migration
  - disaster recovery
• Permanent naming
• Ensuring authenticity
  - trusted digital repository
  - checksums, signatures (long term)
• Signifying access to archived version

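The checksum point can be illustrated with a simple fixity check: record a digest for every captured file, then recompute later to detect change or loss. The sketch below assumes SHA-256 and a capture stored as ordinary files in a directory; the function names are hypothetical and no particular repository system is implied.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose content has changed."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

A manifest built at capture time and stored separately from the files gives a baseline; `verify` returning an empty list means every recorded file is still present and unchanged. Checksums alone detect change but do not prove who made the capture, which is where signatures come in.
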
Technical issues - Preservation
• Preserve bits
• Preserve intellectual object, plus ‘look and feel’
• Preserve functionality
• Technology changes
  - physical storage
  - hardware platform
  - operating systems
  - application software
  - HTML

Technical issues - Preservation strategies
• Metadata for preservation
  - describe bits: how and where stored
  - describe how to interpret/use bits
  - describe the context for the bits
• Migration
  - in part / in whole
  - valid code?
  - keep all versions?
  - manage multiple versions
• Emulation
  - of software / OS / platform

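The three kinds of preservation metadata listed above can be pictured as one record per captured file. The sketch below is illustrative only: the field names are hypothetical and do not follow any particular metadata standard, and the values in the usage note are made up.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PreservationRecord:
    # Describe the bits: how and where they are stored.
    storage_location: str
    size_bytes: int
    sha256: str
    # Describe how to interpret or use the bits.
    mime_type: str
    rendering_software: str
    # Describe the context for the bits.
    original_url: str
    capture_date: str  # ISO 8601 date, e.g. "2003-02-25"
    capture_tool: str


def to_json(record: PreservationRecord) -> str:
    """Serialise a record to JSON so it can be stored alongside the capture."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

For example, a record for a captured home page might carry `mime_type="text/html"`, `original_url="http://www.example.org/"` and the name of the capture tool used. Storing the record as plain text next to the file keeps the description readable even if the management system is later migrated.
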
LEGAL DISCUSSION
• Minimise risk
• Capture non-commercial sites
• Preserve without providing access
• Embargo or limit access
• Document actions taken
• Maintain ability to remove access

Cost
• £££ ??
• to do it
• of not doing it

PROJECTS
• General project types:
• Selective
  - narrow, high quality, low volume
• Comprehensive
  - broad, lower quality, high volume
• Combination
  - useful, high quality, high volume

PROJECTS
• British Library involvement:
• Domain.UK: selective
• UK Web Archiving Consortium: selective
• International Internet Preservation Consortium (IIPC): comprehensive/combination

Project details
• Domain.uk
  - WebWhacker, HTTrack
  - Regular captures of simple sites
  - Staff PC (later networked drive), very small
  - No access
• UK WAC
  - UK partners sharing one system
  - PANDAS management, HTTrack, Oracle
  - Manual selection, cataloguing and quality checking
  - Web interface for capture and public access

Project details
• IIPC
  - Comprehensive automated selection: links in / links out, authority / hits, rare words
  - Designing new crawler / harvester
  - Developing technical architecture
  - Deep web?
  - Access challenging

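The "links in / links out" signal can be pictured as a count of cross-site links per host. The slides do not describe how IIPC computes its selection criteria, so the sketch below is only an illustration under that caveat; the function name `link_counts` and the example hosts are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def link_counts(edges):
    """Count cross-host links in and out per host.

    edges is an iterable of (source_url, target_url) pairs found while crawling.
    Links within a single host are ignored.
    """
    links_in, links_out = Counter(), Counter()
    for source, target in edges:
        src = urlsplit(source).hostname
        dst = urlsplit(target).hostname
        if not src or not dst or src == dst:
            continue
        links_out[src] += 1
        links_in[dst] += 1
    return links_in, links_out
```

A host with many links in from other sites could then be ranked higher for capture, for instance by taking `links_in.most_common(n)`. Hit counts and rare-word measures would need separate data and are not covered here.
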
FUTURE WORK
• Expand collection
• Collaborative projects, including automated capture and metadata generation
• Legal deposit instruments for web archiving
• Provide restricted access

USEFUL REFERENCES
• http://library.wellcome.ac.uk/projects/archiving_reports.shtml
• Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust. Michael Day, UKOLN, University of Bath. Version 1.0, 25 February 2003
• Legal issues relating to the archiving of Internet resources in the UK, EU, US and Australia. Andrew Charlesworth, University of Bristol, Centre for IT and Law. Version 1.0, 25 February 2003
• 2nd ECDL workshop on Web archiving: http://bibnum.bnf.fr/ecdl/2002/index.html
• Digital Preservation Coalition: http://www.dpconline.org/