
Web Archiving at the Internet Archive - Collection Policies, Challenges, and Strategies

Learn about the Internet Archive, a digital library established in 1996 that contains an extensive collection of publicly accessible digital archival material. Explore their web archiving collection policies, key challenges, and strategies for preservation. Discover the vast landscape of the web today and the challenges of archiving its content.




Presentation Transcript


  1. Web Archiving @ The Internet Archive

  2. Agenda • Brief Introduction to IA • Web Archiving Collection Policies and Strategies • Key Challenges (opportunities for broader collaboration…)

  3. What is the Internet Archive? • A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material • A 501(c)(3) non-profit organization • A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions • Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet… www.archive.org

  4. Data Storage & Preservation

  5. IA’s Web archive spans 1996–present & includes over 150 billion web instances • Develop freely available, open source web archiving & access tools (Heritrix, Wayback, NutchWAX…) • Provide services that enable partners to drive their web archiving programs • Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions www.archive.org/web • www.archiveit.org
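Access to the archive can also be scripted. As a hedged illustration (not part of the slide deck), the sketch below builds a query against the public Wayback CDX search endpoint; the endpoint path and parameter names are assumptions based on its commonly documented form:

```python
from urllib.parse import urlencode

# Public Wayback CDX search endpoint (assumed; verify against current docs).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url, limit=5, output="json"):
    """Build a CDX query URL listing captures of `url`.

    Fetching the result (e.g. with urllib.request) would return one
    row per capture: timestamp, original URL, status code, digest, etc.
    """
    params = {"url": url, "output": output, "limit": str(limit)}
    return CDX_ENDPOINT + "?" + urlencode(params)
```

A caller would pass the resulting URL to any HTTP client; keeping the query construction separate makes it easy to inspect before issuing live requests.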

  6. Today’s Landscape “The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century.” Source: UK Telegraph http://www.telegraph.co.uk/technology/news/7675214/Zettabytes-overtake-petabytes-as-largest-unit-of-digital-measurement.html IDC annual survey, released May 2010

  7. Today’s Web Landscape • Google: “seen well over 1 trillion unique URLs” • Actual indexed pages: • tens of billions+ (~40-50bil?) • Cuil: “127 bil web pages” (July 15, 2010) • Hundreds of millions of “sites” • Site: publishing network endpoint; One page to millions per site • Diversity of content – streamed, social, interactive…

  8. Collection Policies & Strategies • Crawl Strategies 1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces 2) Organic link discovery at all levels of a host/site 3) End-of-life, exhaustive harvests 4) Selective/thematic & resource-specific harvests • Key Inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data • Frequency: usually ongoing but at least yearly…
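Strategy 2 above, organic link discovery, amounts to a breadth-first traversal of the link graph. The slide does not specify an implementation; the following is a minimal stdlib-only sketch in which the fetch function is injected, so the discovery policy stays separate from the transport (and from politeness concerns a real crawler like Heritrix handles):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def discover(seed, fetch, max_pages=100):
    """Breadth-first link discovery from a seed URL.

    `fetch` is any callable mapping a URL to an HTML string.
    Returns URLs in the order they were visited.
    """
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Because `fetch` is a parameter, the same traversal works against a live HTTP client or an in-memory fixture for testing.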

  9. Typical Challenges of Archiving the Web • Harvests are at best samples • Time & expense: can’t get everything • Rate of change: don’t get every version • Rate of collection: issues of “time skew” • User agents / protocols

  10. Typical Challenges, cont. • Publisher right to opt “in” or “out” • Content behind log-ins cannot be archived without credentials • Content can be blocked by robots.txt files (which our crawlers respect by default) • The structure of sites/URLs makes it very hard to capture only the content of interest; each site has its own unique set of challenges • Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash, etc.) • These sites tend to change both their technical structure and policy quickly and often
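The robots.txt exclusion mentioned above can be demonstrated with Python's standard library. This is only an illustration of the honor-robots-by-default behavior, not how Heritrix itself implements it; the user-agent string is an assumption:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (a crawler would normally fetch it
# from http://<host>/robots.txt before harvesting anything else).
rules = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks every candidate URL against the rules.
print(rp.can_fetch("archive.org_bot", "http://example.org/private/x"))  # False
print(rp.can_fetch("archive.org_bot", "http://example.org/public/x"))   # True
```

Any URL under a disallowed prefix is skipped, which is exactly why publishers can use robots.txt to keep content out of the archive.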

  11. Challenges, cont. ~70% of the world’s digital content is now generated by individuals Source: UK Telegraph; IDC annual survey, released May 2010 • Social networks and collaborative/semi-private spaces • Immersive worlds

  12. Web QA & Analysis Daunting scale requires a multi-layered approach • Automated QA to identify missing files used to render pages and prioritize URIs for harvest • Filtering of spam and content farms discovered during harvest and post-harvest • Randomized, representative, human critique of “in” vs. “out” of scope per a given legal mandate • Advanced analyses: web and link graphing, text mining
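The first QA step described on this slide, finding missing files needed to render captured pages and prioritizing them for harvest, can be sketched as a set comparison. The function and ranking heuristic below are illustrative assumptions, not the Archive's actual QA pipeline:

```python
from collections import Counter

def prioritize_missing(page_resources, archived):
    """Rank un-captured embedded resources by how many pages need them.

    `page_resources` maps each archived page URL to the resource URLs
    (CSS, JS, images…) it references; `archived` is the set of URLs
    already captured. Resources blocking the most page renders come
    first, a crude stand-in for real QA heuristics.
    """
    demand = Counter()
    for page, resources in page_resources.items():
        for r in resources:
            if r not in archived:
                demand[r] += 1
    return [url for url, _ in demand.most_common()]
```

The output is a harvest priority list: re-crawling the top entries repairs the largest number of incompletely rendered pages per fetch.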

  13. Key Challenges • Not all data can be crawled; need diverse methods of data collection • Data may be lost no matter how carefully it is managed • Need to keep multiple, distributed copies! • Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale • Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” & to support demands of ongoing operations

  14. Key Challenges • Manageable Costs/Sustainable Approaches • Access to power & other critical operational resources • Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources • Support for on demand assembly of collections from aggregate data sets • Timeliness of collection & access • Intuitive interfaces for discovering & navigating resources over time, including robust APIs • Recruitment of engineering talent • Funding

  15. Thank You! Kris Carpenter Negulescu Director, Web Group Internet Archive kcarpenter [at] archive [dot] org
