
Web Archiving at the Internet Archive - Collection Policies, Challenges, and Strategies

Learn about the Internet Archive, a digital library established in 1996 that contains an extensive collection of publicly accessible digital archival material. Explore their web archiving collection policies, key challenges, and strategies for preservation. Discover the vast landscape of the web today and the challenges of archiving its content.




Presentation Transcript


  1. Web Archiving @ The Internet Archive

  2. Agenda • Brief Introduction to IA • Web Archiving Collection Policies and Strategies • Key Challenges (opportunities for broader collaboration…)

  3. What is the Internet Archive? • A digital library established in 1996 that contains over four and a half petabytes (compressed) of publicly accessible digital archival material • A 501(c)(3) non-profit organization • A technology partner to libraries, archives, museums, universities, research institutes, and memory institutions • Currently archiving books, texts, film, video, audio, images, software, educational content, television, and the Internet… www.archive.org

  4. Data Storage & Preservation

  5. IA’s Web archive spans 1996–present & includes over 150 billion web instances • Develop freely available, open source web archiving & access tools (Heritrix, Wayback, NutchWAX…) • Provide services that enable partners to drive their web archiving programs • Perform crawls & host collections for libraries, archives, universities, museums, & other memory institutions www.archive.org/web • www.archiveit.org
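Access to the archive can also be scripted. As a hedged illustration (not part of the slide deck), the sketch below builds a query against the public Wayback CDX search endpoint; the endpoint path and parameter names are assumptions based on its commonly documented form:

```python
from urllib.parse import urlencode

# Public Wayback CDX search endpoint (assumed; verify against current docs).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url, limit=5, output="json"):
    """Build a CDX query URL listing captures of `url`.

    Fetching the result (e.g. with urllib.request) would return one
    row per capture: timestamp, original URL, status code, digest, etc.
    """
    params = {"url": url, "output": output, "limit": str(limit)}
    return CDX_ENDPOINT + "?" + urlencode(params)
```

A caller would pass the resulting URL to any HTTP client; keeping the query construction separate makes it easy to inspect before issuing live requests.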

  6. Today’s Landscape “The current size of the world’s digital content is equivalent to all the information that could be stored on 75bn Apple iPads, or the amount [of data] that would be generated by everyone in the world posting messages on Twitter constantly for a century.” Source: UK Telegraph http://www.telegraph.co.uk/technology/news/7675214/Zettabytes-overtake-petabytes-as-largest-unit-of-digital-measurement.html IDC annual survey, released May 2010

  7. Today’s Web Landscape • Google: “seen well over 1 trillion unique URLs” • Actual indexed pages: • tens of billions+ (~40-50bil?) • Cuil: “127 bil web pages” (July 15, 2010) • Hundreds of millions of “sites” • Site: publishing network endpoint; One page to millions per site • Diversity of content – streamed, social, interactive…

  8. Collection Policies & Strategies • Crawl Strategies 1) Broad, web-wide surveys from every domain, in every language, including media and text, static and interactive interfaces 2) Organic link discovery at all levels of a host/site 3) End-of-life, exhaustive harvests 4) Selective/thematic & resource-specific harvests • Key Inputs: registry data, trusted directories, Wikipedia, subject matter experts, prior crawl data • Frequency: usually ongoing but at least yearly…
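Strategy 2 above, organic link discovery, amounts to a breadth-first traversal of the link graph. The slide does not specify an implementation; the following is a minimal stdlib-only sketch in which the fetch function is injected, so the discovery policy stays separate from the transport (and from politeness concerns a real crawler like Heritrix handles):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def discover(seed, fetch, max_pages=100):
    """Breadth-first link discovery from a seed URL.

    `fetch` is any callable mapping a URL to an HTML string.
    Returns URLs in the order they were visited.
    """
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Because `fetch` is a parameter, the same traversal works against a live HTTP client or an in-memory fixture for testing.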

  9. Typical Challenges of Archiving the Web • Harvests are at best samples • Time & expense: can’t get everything • Rate of change: don’t get every version • Rate of collection: issues of “time skew” • User agents / protocols

  10. Typical Challenges, cont. • Publisher right to opt “in” or “out” • Content behind log-ins cannot be archived without credentials • Content can be blocked by robots.txt files (which our crawlers respect by default) • The structure of sites/URLs makes it very hard to capture only the content of interest; each site has its own unique set of challenges • Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash, etc.) • These sites tend to change both their technical structure and policy quickly and often
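The robots.txt exclusion mentioned above can be demonstrated with Python's standard library. This is only an illustration of the honor-robots-by-default behavior, not how Heritrix itself implements it; the user-agent string is an assumption:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (a crawler would normally fetch it
# from http://<host>/robots.txt before harvesting anything else).
rules = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks every candidate URL against the rules.
print(rp.can_fetch("archive.org_bot", "http://example.org/private/x"))  # False
print(rp.can_fetch("archive.org_bot", "http://example.org/public/x"))   # True
```

Any URL under a disallowed prefix is skipped, which is exactly why publishers can use robots.txt to keep content out of the archive.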

  11. Challenges, cont. ~70% of the world’s digital content is now generated by individuals Source: UK Telegraph; IDC annual survey, released May 2010 • Social networks and collaborative/semi-private spaces • Immersive worlds

  12. Web QA & Analysis Daunting scale requires a multi-layered approach • Automated QA to identify missing files used to render pages and prioritize URIs for harvest • Filtering of spam and content farms discovered during harvest and post-harvest • Randomized, representative, human critique of “in” vs. “out” of scope per a given legal mandate • Advanced analyses: web and link graphing, text mining
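The first QA step described on this slide, finding missing files needed to render captured pages and prioritizing them for harvest, can be sketched as a set comparison. The function and ranking heuristic below are illustrative assumptions, not the Archive's actual QA pipeline:

```python
from collections import Counter

def prioritize_missing(page_resources, archived):
    """Rank un-captured embedded resources by how many pages need them.

    `page_resources` maps each archived page URL to the resource URLs
    (CSS, JS, images…) it references; `archived` is the set of URLs
    already captured. Resources blocking the most page renders come
    first, a crude stand-in for real QA heuristics.
    """
    demand = Counter()
    for page, resources in page_resources.items():
        for r in resources:
            if r not in archived:
                demand[r] += 1
    return [url for url, _ in demand.most_common()]
```

The output is a harvest priority list: re-crawling the top entries repairs the largest number of incompletely rendered pages per fetch.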

  13. Key Challenges • Not all data can be crawled; need diverse methods of data collection • Data may be lost no matter how carefully it is managed • Need to keep multiple, distributed copies! • Harvested data can be hard to make accessible in a compelling way, on an ongoing basis, at *every* scale • Research and experimentation are essential to keep pace with publisher innovation; partnerships are the only way to “keep up” & to support demands of ongoing operations

  14. Key Challenges • Manageable Costs/Sustainable Approaches • Access to power & other critical operational resources • Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources • Support for on demand assembly of collections from aggregate data sets • Timeliness of collection & access • Intuitive interfaces for discovering & navigating resources over time, including robust APIs • Recruitment of engineering talent • Funding

  15. Thank You! Kris Carpenter Negulescu Director, Web Group Internet Archive kcarpenter [at] archive [dot] org
