Creating Digital Web Archives through Collection, Collaboration, and Curation
Kristine Hanna, Director of Archiving Services
For the Fulbright Academy Workshop – Jan 24-25, 2011

What is the Internet Archive?
• We are a Digital Library
• Mission Statement: Universal access to human knowledge
• Largest public web archive in existence
• Founded in 1996 by Brewster Kahle in San Francisco, California
• In 2007, officially designated a library by the state of California

What is in the General Archive?
• 200+ billion web pages, aggregated from 85 million websites in over 40 languages
• Books and Texts
• Films and Videos
• Audio and the Spoken Word
• Still Images
• NASA Images
• Open Library
• Software and Educational Tools

Storage and Preservation
Multiple copies in multiple places
• Data stored on over 4000 servers: standard storage boxes, open source design
• 2 copies online (primary and backup) in San Francisco Bay Area data centers
• 3rd copy at Sun Microsystems in Santa Clara, California
• Partial mirrors in Egypt, France and the Netherlands
• Partners can receive copies of their data

Why are we doing this?
• Hundreds of millions of people around the world have grown accustomed to using the web as their primary resource for acquiring information.
• The availability of this electronic information is taken for granted, but it is a fallacy that if something is on the web it will be there forever.
• There’s an essential need for people to understand that the web represents who we are. It’s our culture and our social fabric, and we don’t want to lose it.

Why are our Partners doing this?
• Construct a historical record of an institution’s web presence over time by archiving its main website and other sites on which the institution is mentioned
• Collaborate with other institutions and share research
• Assemble a comprehensive database of information on a topic, photograph or individual from different perspectives; capture social commentary such as tweets, blogs and comments
• Maintain a strong electronic records management system
• Create a web archive on a specific topic, subject or event
• Capture and archive "at risk" digital content on a spontaneous event

Who are our Partners?
Over 200 partners in 25 countries and 47 U.S. states:
• National Libraries and Federal Institutions
• U.S. National Archives (NARA)
• U.S. State Archives/Libraries
• University Libraries
• Museums and Art Libraries
• Local (city) Institutions and Public Libraries
• Historical Societies

How do we collect the Content?
Open Source Technology developed by the Internet Archive & IIPC
• Heritrix: web crawler that captures pages
• Wayback Machine: renders pages, making it possible to view those pages and surf the web as it was
• NutchWAX: search engine that provides full-text search

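To make "surf the web as it was" concrete, here is a minimal sketch of how a captured page can be addressed by time. It is not taken from the slides: it assumes the URL pattern of the public Wayback Machine (`web.archive.org/web/<14-digit timestamp>/<original URL>`) and its availability lookup endpoint, and the function names are hypothetical helpers chosen for illustration. It only builds URLs and makes no network calls.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed public endpoints (not stated in the slides).
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit YYYYMMDDhhmmss stamp in UTC."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def replay_url(original_url: str, moment: datetime) -> str:
    """Build a replay URL asking for the capture closest to `moment`."""
    return f"{WAYBACK_PREFIX}/{wayback_timestamp(moment)}/{original_url}"


def availability_query(original_url: str, moment: datetime) -> str:
    """Build a lookup URL for the capture closest to `moment` (hypothetical helper)."""
    params = {"url": original_url, "timestamp": wayback_timestamp(moment)}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    workshop_day = datetime(2011, 1, 24, 12, 0, 0, tzinfo=timezone.utc)
    print(replay_url("http://www.archive.org/", workshop_day))
    print(availability_query("archive.org", workshop_day))
```

The timestamp acts as a request for the nearest capture, so a replay URL for a moment with no exact capture would still be expected to resolve to the closest one held in the archive.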
Web Archiving Services/Models
WWW crawls: broad snapshots run in house by crawl engineers
Contract Crawls: focused and curated crawls run in house by crawl engineers
Archive-It: web-based application that allows partners to create, manage and preserve collections of highly curated digital content
• Functions include: selection and scoping, harvesting, cataloging with metadata, full-text search, reports and analysis of collections
• Ability to capture content using ten different crawl frequencies
• Content includes: text, HTML, video, audio, social networking, PDF, still images, newspapers

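The "selection and scoping" function above can be illustrated with a small sketch of a scope rule: given a seed URL chosen by a curator, decide whether a link discovered during a crawl belongs to the collection. This is an assumption-laden illustration, not Archive-It's or Heritrix's actual scoping logic; the `Seed` class, the `in_scope` function and the "same host, same directory" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Seed:
    """A curator-selected starting URL (hypothetical structure)."""
    url: str


def normalize_host(host: Optional[str]) -> str:
    """Lower-case the host and drop a leading 'www.' so both forms match."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(candidate: str, seed: Seed) -> bool:
    """Return True if `candidate` is on the seed's host and under the seed's directory."""
    cand, base = urlsplit(candidate), urlsplit(seed.url)
    if normalize_host(cand.hostname) != normalize_host(base.hostname):
        return False
    seed_path = base.path or "/"
    # Directory prefix of the seed: everything up to and including the last slash.
    prefix = seed_path[: seed_path.rfind("/") + 1]
    return (cand.path or "/").startswith(prefix)


if __name__ == "__main__":
    seed = Seed("http://example.org/blog/")
    for link in (
        "http://www.example.org/blog/2009/post.html",
        "http://example.org/about",
        "http://other.example.com/blog/",
    ):
        print(link, in_scope(link, seed))
```

A rule this narrow keeps a curated collection focused on the selected site; a broad snapshot crawl would instead follow links across hosts.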
Stanford University, Islamic & Middle Eastern Collection
Purpose: harvest and preserve Iranian blogs
• Archiving over 300 blogs written by and for Iran and the Iranian people
• Includes coverage of the 2009 Iranian elections

Stanford University Islamic and Middle Eastern Collection

University of Texas at Austin: LANIC
Purpose: archive documents from 18 different countries and 300 government ministries and presidencies
Content includes:
• Full-text versions of official documents
• Original video and audio recordings of key regional leaders
• Thousands of annual and "state of the nation" reports
• Specific collections for Latin American elections and political parties

American University in Cairo
Collections:
• American University in Cairo website
• Coptic Religion & Culture
• Egyptian Arts, Culture & Society
• Egyptian Business
• Migration and Refugee Studies

Egypt Today is the leading current affairs magazine

Tunisian Unrest 2011
• Archiving blogs, news sites and social media
• Websites suggested by curators as subject matter experts (BnF)
http://www.archive-it.org/public/collection.html?id=2323

Tunisia Watch A website that focuses on Tunisian issues/events

Access to Collections
Partners:
• Can view through a private web application or an access page with login/password
General Public:
• Can view from the Archive-It website or the General Archive website
• Can view from a partner's website, which links back to Archive-It hosted data
• Partners can host data from their own servers
Restricted and private access options are available

What’s next at Internet Archive?
• Collaboration and Partnerships
• Digital Stewardship
• Continue to develop services that help memory institutions and further our mission
• Forge new global partnerships
• Develop a preservation policy/access model
• Digital Archive

Thank You!
Kristine Hanna
Internet Archive
Director, Archiving Services
kristine@archive.org
415 561 6799 x 5