What is the Internet Archive?
• We are a digital library
• Mission statement: universal access to human knowledge
• Founded in 1996 by Brewster Kahle in San Francisco, California
• Officially designated a library by the State of California (2007)
Archive-It
www.archive-it.org
First deployed in February 2006
• Web-based application that allows users to create, manage, and preserve collections of digital web content
• Functions include: selection and scoping, harvesting, reports and analysis of captures, cataloging with metadata, full-text search
• Archived content includes: text, HTML, video, audio, images, PDF, online newspapers, social networking, and more
• Includes hosting, access, and storage (primary and back-up)
• Archived content is available for viewing 24 hours after a crawl has completed
The Tools Behind Archive-It
Open source technology, primarily developed by the Internet Archive, the open source community, and the IIPC
• Heritrix: web crawler. Crawls and captures pages.
• Wayback Machine: access tool for rendering and viewing pages. Displays archived web pages: surf the web as it was.
• NutchWAX: open source search engine. Standard full-text search.
Who Uses Archive-It
130 partners in 42 states and 12 countries
• 35% University and College Libraries
• 30% State Archives and Libraries
• 15% Non-Government Nonprofits
• 9% National Libraries/Federal Institutions
• 7% K-12 Schools
• 2% Cities and Public Libraries
• 2% Museums and Art Libraries
http://www.archive-it.org/public/partners
Why Archive Social Networking Sites?
• State agencies and officials: An increasing number have decided that the content on these sites is a record and needs to be archived.
• University libraries: These sites are used to share information with students and alumni, and they contain important records about a school's culture, student body, and campus events.
• Researchers: Archiving these sites preserves valuable social reactions and change on topics of interest.
• Currently about 20 Archive-It partners are archiving content from these sites.
North Carolina State Archives & State Library of North Carolina
• Purpose: Archive state agency websites and publications
• Includes pages in a variety of formats: text, images, audio, video, and social networking sites
• Archive-It partner since 2005 (pilot partner)
Library of Virginia
• Purpose: Preserve websites relating to Virginia government and elections
• Collection on the current Governor includes Twitter and Flickr sites
• Collection on Twitter, Flickr, and Facebook sites of politicians and political organizations in Virginia
Stanford University, Islamic and Middle Eastern Collection
• Purpose: Harvest and preserve Iranian blogs
• Archiving over 300 blogs written by and for Iran and the Iranian people
• Archiving sites from Twitter, Facebook, and YouTube selected by the collection’s curators
• Partner since February 2008, funded by the Library of Congress
University of Texas, San Antonio
• Purpose: Archive university websites, student organizations, academic departments, and other local topics important to the university
• Archiving blogs, Facebook, Twitter, Flickr, MySpace
• Partner since 2008
Typical Challenges
• Content behind log-ins cannot be archived
• Content can be blocked by robots.txt files (which our crawlers respect by default)
• Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash)
• These sites tend to change both their technical structure and their policies quickly and often
• The structure of the sites and their URLs means users need to add scoping rules to capture only the content they are interested in
Each site has its own unique set of challenges.
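To make the robots.txt and scoping challenges concrete, here is a minimal sketch in Python. It is not how Heritrix or Archive-It implements these checks. The host `social.example.com`, the account name `exampleagency`, the robots.txt content, and the `example-crawler` user agent are all hypothetical, chosen only for illustration.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example social networking host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /exampleagency/private/
"""

# Hypothetical scoping rule: capture only one account's pages on a shared host,
# rather than everything the host links to.
IN_SCOPE = re.compile(r"^https?://social\.example\.com/exampleagency(/|$)")


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_capture(url: str, robots: RobotFileParser,
                   user_agent: str = "example-crawler") -> bool:
    """Return True only if the URL is in scope and allowed by robots.txt."""
    if not IN_SCOPE.match(url):
        return False  # outside the collection's scope
    return robots.can_fetch(user_agent, url)  # respect robots.txt


robots = build_robots_parser(ROBOTS_TXT)
candidates = [
    "https://social.example.com/exampleagency/posts/1",
    "https://social.example.com/exampleagency/private/drafts",
    "https://social.example.com/someoneelse/posts/1",
]
for url in candidates:
    print(url, should_capture(url, robots))
```

The scope check runs first, so out-of-scope URLs are rejected before robots.txt is consulted. A real crawl would combine several rules like this per site, which is why each site needs its own tuning.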
Overall Approaches
• Trial and error: Try to harvest with a variety of settings
• Quality review: Review archived content thoroughly
• Collaborate: Compare approaches and results with other Archive-It users
• Document: Record detailed instructions, lessons learned, and best practices for other partners
Thank you!
www.archive-it.org
http://www.facebook.com/ArchiveIt
Kate Odell
Partner Specialist, Internet Archive
kate@archive.org