Preserving Web-based digital material Andrea Goethals Harvard University Library Why Books? Site Visit 28 October 2010
Agenda • Why preserve Web content? • A look at the Web • Web archiving • Web archiving at Harvard • Open challenges in Web archiving • Questions?
A few other things on the Web… • TV Shows • Blogs • Images • Scholarly papers • Stores • Discussions • Maps • Virtual worlds • Art exhibits • Documents • Music • Articles • Magazines • Newspapers • Tutorials • Software • Databases • Social networking • Advertising • Courses • Museums • Libraries • Archives • Recipes • Data sets • Oral history • Poetry • Broadcasts • Wikis • Movies • …
May be historically significant • White House web site, March 20, 2003
May be the only version • Harvard Magazine, May/June 2009
May document human behavior • World of Warcraft, Fizzcrank realm, Morc the Orc's view, Oct. 25, 2010
Important to researchers • ABC News, Aug. 2007
Important to researchers • Strangers and friends: collaborative play in world of warcraft • From tree house to barracks: The social life of guilds in World of Warcraft • The life and death of online gaming communities: a look at guilds in World of Warcraft • Learning conversations in World of Warcraft • The ideal elf: Identity exploration in World of Warcraft • Traffic analysis and modeling for world of warcraft • E-collaboration and e-commerce in virtual worlds: The potential of second life and world of warcraft • Understanding social interaction in world of warcraft • Communication, coordination, and camaraderie in World of Warcraft • An online community as a new tribalism: The world of warcraft • A hybrid cultural ecology: world of warcraft in China • … etc.
May be a work of art • YouTube Play. A Biennial of Creative Video (Oct. 2010-)
May be important data for scholarship • NOAA Satellite and Information Service
Remember this? • 1993: “First” graphical Web browser (Mosaic)
Volume of content is immense! • 1998: First Google index has 26 million pages • 2000: Google index has 1 billion pages • 2008: Google processes 1 trillion unique URLs • “… and the number of individual Web pages out there is growing by several billion pages per day” • (Source: the official Google blog)
Prolific self-publishers • "Humanity's total digital output currently stands at 800,000 petabytes … but is expected to pass 1.2 zettabytes this year. One zettabyte is equal to one million petabytes…" • "Around 70 per cent of the world's digital content is generated by individuals, but it is stored by companies on content-sharing websites such as Flickr and YouTube." • Telegraph.co.uk, May 2010, on an IDC study
Ever-increasing # of web sites 96 million out of 233 million web sites are active (Netcraft.com)
A moving target • Flickr (Feb 2004) • Facebook (Feb 2004) • YouTube (Feb 2005) • Twitter (2006)
Anatomy of a web page • Typically • 1 web page = ~35 files • 1 HTML file • 7 text/css • 8 image/gif • 17 image/jpeg • 2 javascript Source: representative samples taken by Internet Archive
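To make the breakdown above concrete, here is a minimal sketch (Python standard library only, with hypothetical sample markup) that tallies the embedded resources one HTML page references. It only sees <img>, <script>, and <link> tags; real pages also pull in files via CSS and scripts, which is part of why the average runs to ~35 files.

```python
# Tally the files a single web page embeds, grouped by file extension.
# The sample HTML is a stand-in for a page fetched during a crawl.
from collections import Counter
from html.parser import HTMLParser

class ResourceTally(HTMLParser):
    """Collects URLs referenced by <img>, <script>, and <link> tags."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):  # stylesheets, icons
            self.resources.append(attrs["href"])

html = ("<html><head><link href='a.css'><script src='b.js'></script></head>"
        "<body><img src='c.gif'><img src='d.jpeg'></body></html>")
tally = ResourceTally()
tally.feed(html)
print(Counter(url.rsplit('.', 1)[-1] for url in tally.resources))
# Counter({'css': 1, 'js': 1, 'gif': 1, 'jpeg': 1})
```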
Web content is transient • The average lifespan of a web site is between 44 and 100 days • (Screenshots: captured April 8, 2009; visited October 13, 2010)
Disappearing web sites • 2000 Sydney Olympics • Most of the Web record is held only by the National Library of Australia • Half of the URLs cited in D-Lib Magazine were inaccessible 10 years after publication (McCown et al., 2005)
Web archiving 101 • Web harvesting • Select and capture it • Preservation of captured Web content • "Digital preservation" • Keep it safe • Keep it usable to people long-term, despite technological changes • [Diagram: web archiving spans the acquisition and preservation of web content, alongside the acquisition and preservation of other digital content]
Web harvesting • Download all files needed to reproduce the Web page • Try to capture the original form of the Web page as it would have been experienced at the time of capture • Also collect information about the capture process • There must be some kind of selection…
Types of harvesting • Domain harvesting • Collect the web space of an entire country • e.g. the French Web, including the .fr domain • Selective harvesting • Collect based on a theme, event, individual, organization, etc. • The London 2012 Olympics • Hurricane Katrina • Women's blogs • President Obama • Any type of regular harvesting results in a large quantity of content to manage.
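A hedged sketch of the two scoping policies, assuming simple URL predicates (the function names are illustrative, not any production crawler's API). Domain harvesting is approximated here by TLD matching; real national-domain crawls also chase in-scope content hosted outside the country's TLD.

```python
# Illustrative scope rules for domain vs. selective harvesting.
from urllib.parse import urlparse

def in_domain_scope(url, tld=".fr"):
    """Domain harvesting: keep any URL whose host falls under one country TLD."""
    host = urlparse(url).hostname or ""
    return host.endswith(tld)

def in_selective_scope(url, seeds):
    """Selective harvesting: keep only URLs under curator-chosen seed sites."""
    return any(url.startswith(seed) for seed in seeds)

print(in_domain_scope("http://www.culture.fr/expositions"))    # True
print(in_selective_scope("http://www.london2012.com/schedule",
                         {"http://www.london2012.com/"}))      # True
```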
The crawl • Pick a location (seed URIs) • Make a request to the Web server • Receive the response from the Web server (document exchange) • Examine the response for URI references • Feed newly discovered URIs back into the crawl
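The same cycle as a minimal sketch in Python's standard library: take a URI from the frontier, fetch it, extract references, and queue the new ones. This is only the skeleton; real archival crawlers such as Heritrix add politeness delays, robots.txt handling, scope rules, and archival output formats, and they record the capture process itself (timestamps, headers), which is what makes the result a record rather than a mirror.

```python
# Skeleton of the crawl cycle: pick a URI, fetch it, extract links, repeat.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seeds, max_pages=10):
    frontier, seen, fetched = deque(seeds), set(seeds), 0
    while frontier and fetched < max_pages:
        uri = frontier.popleft()                    # 1. pick a URI
        try:
            with urlopen(uri, timeout=10) as resp:  # 2. request / 3. response
                body = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                                # skip unreachable URIs
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(body)                        # 4. examine for URI references
        for link in extractor.links:
            absolute = urljoin(uri, link)
            if absolute not in seen:                # 5. queue new discoveries
                seen.add(absolute)
                frontier.append(absolute)
    return seen
```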
Web archiving pioneers: mid-1990s • Internet Archive, with Alexa Internet and collecting partners • NL of Sweden • NL of Denmark • NL of Australia • NL of Finland • NL of Norway • (Adapted from A. Potter's presentation, IIPC GA 2010)
International Internet Preservation Consortium (IIPC): 2003- • Founding members: Internet Archive, Library of Congress, British Library, L and A Canada, and the national libraries of Sweden, Denmark, France, Norway, Finland, Italy, Iceland, and Australia • IIPC: http://netpreserve.org
IIPC goals • Facilitate preservation of a rich body of Internet content from around the world • Develop common tools, techniques and standards • Encourage and support Internet archiving and preservation IIPC: http://netpreserve.org
IIPC: 2010 • National libraries: Australia, Austria, Croatia, Czech Republic, Denmark, Finland, France / INA, Germany, Iceland, Israel, Italy, Japan, Korea, Netherlands, NZ, Norway, Poland, Scotland, Singapore, Slovenia, Spain / Catalunya, Sweden, Switzerland • Other members: Internet Archive (with Archive-It partners), Library of Congress, British Library / UK, WAC (UK), TNA (UK), L and A Canada, BANQ Canada, European Archive, Hanzo Archives, Harvard, GPO (US), UNT (US), CDL (US), NYU (US), AZ AI Lab (US), UIUC (US), OCLC • Several members also bring their own collecting partners • (Adapted from A. Potter's presentation, IIPC GA 2010)
Current methods of harvesting • Contract with another party for crawls • Internet Archive's crawls for the Library of Congress • Use a hosted service • Internet Archive's Archive-It • California Digital Library's Web Archiving Service (WAS) • Set up an institution-specific web archiving system • Harvard's Web Archiving Collection Service (WAX) • Most use IIPC tools like the Heritrix web crawler
Current methods of access • Currently dark – no access (e.g. Norway) • Only on-site to researchers (e.g. BnF, Finland) • Public on-line access (e.g. Harvard, LAC) • What kind of access? • Most common: browse as it was • Sometimes: full text search • Very rare: bulk access for research • Non-existent: cross-web archive access http://netpreserve.org/about/archiveList.php
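As a small illustration of "browse as it was": Wayback-style replay services address each capture by a 14-digit timestamp plus the original URL. The sketch below builds such a URL; the base shown is the public Internet Archive Wayback Machine, and archives running the same replay software follow the same convention on their own hosts.

```python
# Build a Wayback-style replay URL from a capture time and original URL.
from datetime import datetime

def replay_url(original_url, captured_at,
               base="https://web.archive.org/web"):
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss
    return f"{base}/{timestamp}/{original_url}"

print(replay_url("http://www.whitehouse.gov/", datetime(2003, 3, 20)))
# https://web.archive.org/web/20030320000000/http://www.whitehouse.gov/
```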
Web Archiving Collection Service (WAX) • Used by “curators” within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content • Content selection is a local choice • The content is publicly available to current and future users
WAX workflow • A Harvard unit sets up an account (a one-time event) • On an ongoing basis: • Curators within that unit specify and schedule content to crawl • WAX crawlers capture the content • Curators QA the Web harvests • Curators organize the Web harvests into collections • Curators make the collections discoverable • Curators push content to the DRS, where it becomes publicly viewable and searchable
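A hedged sketch of the harvest lifecycle this workflow implies; the state names and transitions are an illustrative reading of the steps above, not WAX's actual internal model.

```python
# Illustrative lifecycle for one WAX harvest (assumed states, not WAX's own).
from enum import Enum, auto

class HarvestState(Enum):
    SCHEDULED = auto()   # curator specifies and schedules content to crawl
    CAPTURED = auto()    # WAX crawlers have fetched the content
    QA_PASSED = auto()   # curator has reviewed the harvest
    ORGANIZED = auto()   # harvest assigned to a collection, made discoverable
    PUBLISHED = auto()   # pushed to the DRS; publicly viewable and searchable

TRANSITIONS = {
    HarvestState.SCHEDULED: HarvestState.CAPTURED,
    HarvestState.CAPTURED: HarvestState.QA_PASSED,
    HarvestState.QA_PASSED: HarvestState.ORGANIZED,
    HarvestState.ORGANIZED: HarvestState.PUBLISHED,
}

def advance(state):
    """Move a harvest to its next stage (terminal states stay put)."""
    return TRANSITIONS.get(state, state)
```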
WAX architecture [diagram] • Back end: WAX temp storage, temp index, back-end services, WAXi curator interface (used by curators), DRS (preservation repository) • Front end: production index, WAX public interface (used by archive users), HOLLIS catalog
Back-end services • WAX crawlers • File Movers • Importer • Deleter • Archiver • Indexers
Catalog record • Minimally at the collection level • Sometimes also at the Web site level
How do we capture…? • Streaming media (e.g. videos) • Non-HTTP protocols (RTMP, etc.), sometimes proprietary • Experiments to capture video content in parallel to regular crawls (e.g. BL's One & Other project) • Complicates playback as well • Still experimental, non-scalable and time-consuming
How do we capture…? • Highly interactive sites (Flash, AJAX) • Experiments to launch Web browsers that can simulate Web clicks (INA, European Archive) • Still experimental and time-consuming • “Walled gardens” • Need help from content hosts • What’s next? The Web keeps changing