Preserving Web Content: Harvard & Worldwide Andrea Goethals | andrea_goethals@harvard.edu | June 23, 2011
Agenda PART 1: The Web PART 2: Web Archiving Today PART 3: Web Archiving at Harvard PART 4: New & Noteworthy
Breadcrumb 1993: “1st” graphical Web browser, Mosaic | UIUC NCSA ftp://ftp.ncsa.uiuc.edu/Web/Mosaic/
“We knew the web was big…” | 7/25/2008, Official Google Blog <http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html> • 1998: 1st Google index • 26 million pages • 2000: Google index • 1 billion pages • 2008: Google’s link-processing systems • 1 trillion unique URIs • “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog
“Are You Ready?” | 2010 Digital Universe Study, IDC <http://gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf> | The Telegraph, May 4, 2010 <http://www.telegraph.co.uk/technology/news/7675214/Digital-universe-to-smash-zettabyte-barrier-for-first-time.html> • 2009: estimated at 0.8 zettabytes (1 ZB = 1 billion terabytes) • 2010: estimated at 1.2 ZB • 2020: estimated to grow by a factor of 44 from 2009, to 35 ZB
Outpacing storage | 2010 Digital Universe Study, IDC <http://gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf>
Some of this content is worth keeping • Citizen journalism • News sites • Personal and professional blogs • Scientific knowledge • Political websites • Government websites • Internet history • Comics • Virtual worlds • Alternative magazines • Multimedia Art • Photography • Performances • Hobby sites • Courses • Social networks • Networked games • References • Podcasts • Maps • Documents • Discussion groups • Personal sites • Organizational websites • Citizens groups • Research results • Religious sites • Sports sites • Social causes • History of major events • Humor • Scholarly papers • TV shows • Music • Poetry • Magazines • Books • Fashion • Movies • Art exhibits
Many believe Web content is permanent | “Digital Natives Explore Digital Preservation”, Library of Congress <http://www.youtube.com/watch?v=6fhu7s0AfmM>
Ever seen this? 404 Not Found
Web 2.0 Companies: 2006 | Ludwig Gatzke, posted to Flickr on Feb. 19, 2006 <http://www.flickr.com/photos/stabilo-boss/93136022/in/set-72057594060779001/>
Web 2.0 Companies: 2009 | Meg Pickard, posted to Flickr on May 16, 2009 <http://www.flickr.com/photos/meg/3528372602/>
Mosaic Hotlist (1995) • 59 URIs • 50 HTTP URIs • 36 Dead • 13 Live • 1 Reused • 9 Gopher URIs • All Dead
A fleeting document of our time period “Every site on the net has its own unique characteristic and if we forget about them then no one in the future will have any idea about what we did and what was important to us in the 21st century.” | “America’s Young Archivists”, Library of Congress <http://www.youtube.com/watch?v=Gob7cjzoX3Y>
Web archiving 101 • Web harvesting • Select and capture it • Preservation of captured Web content • “Digital preservation” • Keep it safe • Keep it usable to people long-term, despite technological changes [Diagram: acquisition and preservation of web content, alongside acquisition and preservation of other digital content]
Anatomy of a Web page • www.harvard.edu (6/8/2011): 58 files • 19 PNG images • 13 JavaScript files • 12 GIF images • 10 JPEG images • 3 CSS files • 1 HTML file • Typically 1 web page = ~35 files: 17 JPEG images, 8 GIF images, 7 CSS files, 2 JavaScript files, 1 HTML file (source: representative samples taken by the Internet Archive)
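To make the file count concrete, here is a minimal sketch that tallies the images, scripts, and stylesheets a single HTML page references, roughly as in the www.harvard.edu count above. It assumes Python with network access, and it only counts resources referenced directly in the HTML (CSS imports and script-injected files are missed):

```python
# Rough tally of the files one HTML page pulls in, grouped by type.
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen

class ResourceCounter(HTMLParser):
    """Count <img>, <script src>, and <link rel=stylesheet> references."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Use the file extension as a crude type indicator.
            ext = attrs["src"].split("?")[0].rsplit(".", 1)[-1].lower()
            self.counts[ext] += 1
        elif tag == "script" and attrs.get("src"):
            self.counts["js"] += 1
        elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
            self.counts["css"] += 1

with urlopen("http://www.harvard.edu") as resp:
    parser = ResourceCounter()
    parser.feed(resp.read().decode("utf-8", errors="replace"))
print(parser.counts)  # e.g. Counter({'png': 19, 'js': 13, 'gif': 12, ...})
```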
Web harvesting • Download all files needed to reproduce the Web page • Try to capture the original form of the Web page as it would have been experienced at the time of capture • Also collect information about the capture process • There must be some kind of content selection…
Types of harvesting • Domain harvesting • Collect the web space of an entire country • The French Web including the .fr domain • Selective harvesting • Collect based on a theme, event, individual, organization, etc. • The London 2012 Olympics • Hurricane Katrina • Women’s blogs • President Obama • Planned vs. event-based Any type of regular harvesting results in a large quantity of content to manage.
The crawl: 1. Pick a location (seed URIs) 2. Make a request to the Web server 3. Receive the response from the Web server 4. Complete the document exchange 5. Examine the document for URI references, and repeat
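The cycle above is essentially the loop that crawlers like Heritrix run at scale. A minimal sketch of that loop, assuming Python and using the seed URI only as an example (a real harvester adds politeness delays, scoping rules, robots.txt handling, and WARC output):

```python
# Minimal crawl loop: seed URIs -> request -> response -> examine for
# URI references -> queue newly discovered URIs.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, limit=25):
    frontier = deque(seeds)              # 1. pick a location (seed URIs)
    seen = set(seeds)
    while frontier and len(seen) < limit:
        uri = frontier.popleft()
        try:
            with urlopen(uri, timeout=10) as resp:   # 2. request / 3. response
                body = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue
        # 4. Document exchange complete; a harvester would archive `body` here.
        parser = LinkExtractor()
        parser.feed(body)                # 5. examine for URI references
        for link in parser.links:
            absolute = urljoin(uri, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

print(crawl(["http://www.harvard.edu"]))  # example seed only
```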
Web archiving pioneers: mid-1990s • Internet Archive / Alexa Internet • NL of Sweden • NL of Denmark • NL of Australia • NL of Finland • NL of Norway • Collecting Partners | Adapted from A. Potter’s presentation, IIPC GA 2010
International Internet Preservation Consortium (IIPC): 2003- Member institutions: L and A Canada • NL of Sweden • NL of Denmark • Internet Archive • NL of France • British Library • NL of Norway • Library of Congress • NL of Finland • NL of Italy • NL of Iceland • NL of Australia | IIPC <http://netpreserve.org>
IIPC goals • Enable collection, preservation and long-term access of a rich body of Internet content from around the world • Foster development and use of common tools, techniques and standards • Be a strong advocate for initiatives and legislation that encourage the collection, preservation and long-term access to Internet content • Encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation | IIPC <http://netpreserve.org>
IIPC: 2011 Member institutions: NL of Netherlands / VKS • Bib. Alexandrina • WAC (UK) • Hanzo Archives • NL of Scotland • NL of China • NL of Austria • TNA (UK) • NL of Israel • NL of Singapore • NL of Spain / Catalunya • NL of Sweden • Denmark • BANQ Canada • Internet Memory Foundation • NL of Korea • NL of Japan • L and A Canada • British Library / UK • NL of Croatia • NL of France / INA • Internet Archive • Archive-It Partners • Harvard Library • NL of Poland • NL of Norway • GPO (US) • NL of NZ • Library of Congress • NL of Germany • NL of Finland • UNT (US) • NL of Iceland • CDL (US) • NL of Australia • NYU (US) • UIUC (US) • NL of Slovenia • AZ AI Lab (US) • NL of Switzerland • NL of Italy • OCLC • NL of Czech Republic • plus Collecting Partners
Current methods of harvesting • Contract with other parties for crawls • Internet Archive’s crawls for the Library of Congress • Use a hosted service • Archive-It (provided by the Internet Archive) • Web Archiving Service (WAS) (provided by California Digital Library) • Set up institutional web archiving systems • Harvard’s Web Archiving Collection Service (WAX) • Most use IIPC tools like the Heritrix web crawler
Current methods of access • Currently dark – no access (Norway, Slovenia) • Only on-site (BnF, Finland, Austria) • Public online access (Harvard, LAC, some LC collections) • What kind of access? • Most common: browse as it was & URL search • Sometimes: also full text search • Very rare: bulk access for research • Nonexistent: cross-institutional web archive discovery/access
Current big challenges • Legal • High-value content locked up in gated communities (Facebook); who owns what? • Technical • The Web keeps morphing; so must our capture tools • Big data requires very scalable infrastructure (indexing, de-duplication, format identification, …) • Organizational • Web archiving is very resource-intensive and competes with other institutional priorities • Preservation • Many different formats; complex interconnected content; high-maintenance rendering requirements
Web Archiving Collection Service (WAX) • Used by “curators” within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content • Content selection is a local choice • The system is managed centrally by OIS • The content is publicly available to current and future users • The content is preserved in the Digital Repository Service (DRS) managed by OIS
WAX workflow • A Harvard unit sets up an account (one-time event) • On an on-going basis: • Curators within that unit specify and schedule content to crawl • WAX crawlers capture the content • Curators QA the resulting Web harvests • Curators organize the Web harvests into collections • Curators make the collections discoverable • Curators push content to the DRS – becomes publicly viewable and searchable
WAX architecture [diagram]: back end (WAX temp storage, temp index, back-end services, DRS preservation repository) and front end (WAXi curator interface, HOLLIS catalog, production index, WAX public interface), connecting curators to archive users
Back-end services • WAX crawlers • File Movers • Importer • Deleter • Archiver • Indexers
Catalog record • Minimally at the collection level • Sometimes also at the Web site level
Web Continuity Project • The problem: 60% of the links cited in British Parliamentary debate transcripts dating from 1996-2006 were broken • The solution: when broken links are found on UK government websites, deliver archived versions • Sites are crawled 3 times/year → UK Government Web Archive • When users click on a dead link on the live government site, it automatically redirects to an archived version of that page * patches the present with the past * | Web Continuity Project <http://www.nationalarchives.gov.uk/information-management/policies/web-continuity.htm>
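A hedged sketch of the redirect idea: serve the live site, and when a request misses, send the visitor to the archived copy instead. The archive URL pattern and domain below are invented placeholders, not the National Archives' actual configuration:

```python
# Toy continuity server: serve files from a local site root; on a miss,
# 302-redirect to the archived copy of the same page.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

ARCHIVE_PREFIX = "http://webarchive.example.org/"  # hypothetical archive
LIVE_DOMAIN = "http://www.example.gov.uk"          # hypothetical live site
SITE_ROOT = "./site"

class ContinuityHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        local = os.path.join(SITE_ROOT, self.path.lstrip("/"))
        if os.path.isfile(local):
            self.send_response(200)
            self.end_headers()
            with open(local, "rb") as f:
                self.wfile.write(f.read())
        else:
            # Patch the present with the past: redirect to the archive.
            self.send_response(302)
            self.send_header("Location",
                             ARCHIVE_PREFIX + LIVE_DOMAIN + self.path)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ContinuityHandler).serve_forever()
```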
Memento • The problem: the Web of the past, even where it exists, is very difficult to access compared to the Web of the present • The solution: leverage Internet protocols and existing stores of past Web resources (Web archives, content management systems, revision control systems) to allow a user to specify a desired past date of the Web resource to return • Give me http://www.ietf.org as it existed around 1997 • LANL, Old Dominion, Harding University; funded by LC’s NDIIPP * a bridge from the present to the past * | Memento<http://www.mementoweb.org>
Memento [screenshots]: viewing the live Web vs. viewing the past Web | Memento <http://www.mementoweb.org>
Memento example • Using the Memento browser plugin • User sends a GET/HEAD request to http://www.ietf.org • A “timegate” is returned • User sends a request to the timegate requesting http://www.ietf.org around 1997 • A new HTTP request header “Accept-Datetime” • The timegate returns http://web.archive.org/web/19970107171109/http://www.ietf.org/ • And a new response header “Memento-Datetime” to indicate the date the URI was captured | Memento <http://www.mementoweb.org>
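In code, the datetime negotiation amounts to one extra request header. A minimal sketch, assuming the Internet Archive's Wayback endpoint serves as the timegate (in the full protocol the timegate is discovered from a Link header on the original resource):

```python
# Ask a timegate for a resource as it existed around a past date.
from urllib.request import Request, urlopen

TIMEGATE = "http://web.archive.org/web/"   # assumed timegate endpoint
original = "http://www.ietf.org"

req = Request(TIMEGATE + original)
# The desired datetime goes in the new request header (RFC 1123 format).
req.add_header("Accept-Datetime", "Tue, 07 Jan 1997 00:00:00 GMT")
with urlopen(req) as resp:                      # follows the redirect
    print(resp.geturl())                        # URI of the chosen memento
    print(resp.headers.get("Memento-Datetime")) # when it was captured
```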
Data mining & analytics • Extract information/insight from a corpus of data (Web archives) • Can help researchers answer interesting questions about society, technology & media use, language, … • This information can enable better UIs for users • Geospatial maps, tag clouds, classification, facets, rate of change • Technical platform & tools for research • Hadoop Distributed File System • MapReduce • Google Refine • Pig Latin (scripting) • IBM BigSheets
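As a toy illustration of the MapReduce pattern that platforms like Hadoop run over web archives, this sketch counts captures per MIME type. The one-line-per-capture log format is an assumed simplification of real capture records:

```python
# MapReduce-style count of captures per content type.
from collections import defaultdict

def map_phase(line):
    uri, mime = line.rsplit(" ", 1)
    return [(mime, 1)]                 # emit (key, value) pairs

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:           # sum values per key
        totals[key] += value
    return dict(totals)

log = [
    "http://www.harvard.edu/ text/html",
    "http://www.harvard.edu/logo.png image/png",
    "http://www.harvard.edu/main.css text/css",
    "http://www.harvard.edu/news/ text/html",
]
pairs = [pair for line in log for pair in map_phase(line)]
print(reduce_phase(pairs))  # {'text/html': 2, 'image/png': 1, 'text/css': 1}
```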
How did a blogosphere form? Esther Weltervrede and Anne Helmond
Where do bloggers blog? Esther Weltervrede and Anne Helmond
Collaborative collections • End of Term (EOT) collection (2008) • Before & after snapshots of the federal government’s public Web presence • UNT, LC, CDL, IA, GPO • Winter Olympics 2010 • IA, LC, CDL, BL, UNT, BnF • EOT and presidential elections (2011-12) • UNT, LC, CDL, IA, Harvard • Olympic & Paralympic Games 2012 • BL, ? | Winter Olympics 2010 <http://webarchives.cdlib.org/a/2010olympics> | Olympic & Paralympic Games 2012 <http://www.webarchive.org.uk/ukwa/collection/4325386/page/1>
Emulation / KEEP Project • Problem: how to preserve access to obsolete Web formats • (One) solution: emulate Web browsers • Related projects: • Keeping Emulation Environments Portable (KEEP) • The emulation infrastructure • Knowledgebase of typical Web client environments by year • What was typical for a given year?
KEEP Emulation Framework • User requests a digital file in an obsolete format • The system selects and runs the best available emulator and sets up the software dependencies (OS, apps, plug-ins, drivers) • The emulators run on a KEEP Virtual Machine (VM) so that only the VM needs to be ported over time, not the emulators • 9 European institutions, led by the KB; EU-funded [Diagram: GUI → EF core engine, drawing on external technical registries, an emulator archive, and a software archive, all running on the KEEP Virtual Machine for portability] | KEEP <http://www.keep-project.eu>
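A purely hypothetical sketch of the selection step: map a file's format to a known-good emulated environment from a registry, then launch the matching emulator. The registry contents and function names are invented for illustration; the real framework consults external technical registries and boots emulators on the KEEP VM:

```python
# Hypothetical emulator selection: format -> (emulator, environment).
REGISTRY = {
    "HTML 3.2": ("Dioscuri", "Windows 95 + period browser"),
    "Flash 7":  ("QEMU", "Windows XP + browser + Flash 7 plug-in"),
}

def render(format_id):
    try:
        emulator, environment = REGISTRY[format_id]
    except KeyError:
        raise LookupError(f"no emulation pathway known for {format_id}")
    # A real framework would boot the emulator here and install the
    # OS, application, and plug-in dependencies the file needs.
    print(f"Launching {emulator} to recreate: {environment}")

render("Flash 7")
```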