This project aims to develop a model for research libraries to identify, respond to, and mitigate risks to the integrity and longevity of web resources. It includes stages of identification, analysis, appraisal, strategy, detection, and response. The VRC Toolkit provides tools for server-level monitoring, web crawling, and site management.
VRC: Preservation Risk Management for Web Resources Nancy Y. McGovern, ECURE 2004
VRC Funding • Part of a 4(5)-year NSF-funded project • supported by the Digital Libraries Initiative, Phase 2 (Grant No. IIS-9905955, the Prism Project) • Also partially funded by a grant from The Andrew W. Mellon Foundation • Political Communications Web Archiving http://www.crl.edu/content/PolitWeb.htm • For updates: • http://irisresearch.library.cornell.edu/VRC/
Current Team • Anne R. Kenney, Advisor • Nancy Y. McGovern, Project Manager • Richard Entlich, Sr. Researcher • William R. Kehoe, Technology Coordinator • Ellie Buckley, Digital Research Specialist • Erica Olsen (recent) • Carl Lagoze, CIS PI
Research Scope See "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism" by Anne R. Kenney, Nancy Y. McGovern, Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette, D-Lib Magazine, January 2002 http://www.dlib.org/dlib/january02/kenney/01kenney.html
Virtual… • because VRC develops models to represent essential features of selected Web sites • that enable ongoing monitoring over time • to identify, respond to, and mitigate potential risks to the site integrity and longevity
Remote… • because VRC is intended for use by cultural heritage institutions • interested in the longevity of Web resources • residing on remote servers – not owned or managed by the monitoring institution
Control… • because at the most proactive end of the VRC approach • a monitoring organization may act to protect another organization's resources • by agreement or implicit consent • through notification and/or action
Purpose • Develop a model for research libraries (adaptable to other contexts) • Support spectrum from passive monitoring to active capture • Lifecycle support: selection to capture • Understand nature of Web resources • Promulgate good practice
Types of Web Resources Two types of initiatives for monitoring and/or capture of: • Web-based publications [Web site as a means] • All of (or a subset of) a Web site consisting of pages within a boundary defined by a URL (or a portion of one) [Web site as an end] (VRC)
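The "boundary defined by a URL (or a portion of one)" idea can be sketched as a simple membership test. This is a minimal illustration, not the VRC project's implementation; the function name `in_boundary` and the example.org URLs are assumptions made for the example.

```python
from urllib.parse import urlsplit


def in_boundary(url: str, boundary: str) -> bool:
    """Return True if url lies inside the site boundary given by a URL prefix.

    Hypothetical helper: the boundary is a host plus an optional path prefix.
    """
    u, b = urlsplit(url), urlsplit(boundary)
    if u.hostname is None or b.hostname is None:
        return False
    # Same host (urlsplit already lowercases hostnames).
    if u.hostname != b.hostname:
        return False
    # Compare whole path segments so "/vrc" does not match "/vrcx".
    base = b.path.rstrip("/")
    path = u.path or "/"
    return path == base or path.startswith(base + "/")


# Example with a made-up boundary:
print(in_boundary("http://example.org/vrc/tools.html", "http://example.org/vrc/"))
```

Matching on whole path segments rather than raw string prefixes keeps sibling directories with similar names outside the boundary.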
Nature of Risks Two perspectives on Web-based risk: • potential liability of an institution based upon the content of its Web site, or a Web site for which it is responsible • potential threats to the integrity and longevity of a Web resource (VRC)
Types of Risks Include: • technological obsolescence • security weaknesses and breaches • human error in developing/maintaining sites • organizational issues; benign neglect • power and technology failures • inadequate backup and secondary systems
Risk Factors • Organizational Context • Combination of indicators • Monitoring (change/loss over time) • Triggers (events, organizational, upgrades) • Degradation of site management indicators
VRC Stages • Identification • Analysis • Appraisal • Strategy • Detection • Response
Human – Tool Scenario 1. Identification • Human: identify Web resources of interest • Tool: verify list, expand list 2. Analysis • Tool: crawl sites, generate characterizations • Human: accept/revise characterizations 3. Appraisal • Human: define/review attributes of value • Tool: support appraisal, capture results
Human – Tool Scenario 4. Strategy • Human: develop/review strategies • Tool: plot appraisals, compile strategies 5. Detection • Human: define risk parameters • Tool: identify/assess risks; propose responses 6. Response • Tool: propose risk response based on rules; automatic response for some risk categories • Human: monitor automated responses; select response based on recommended actions
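The Response stage (rule-based proposals, automatic for some risk categories, human-selected otherwise) might be sketched as a lookup table. The risk categories, response texts, and names `Rule`, `RULES`, and `propose_response` are all hypothetical; the slides do not specify any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    response: str    # recommended action
    automatic: bool  # True if the tool may act without human selection


# Hypothetical rule table; real categories would come from the Detection stage.
RULES = {
    "broken_link": Rule("notify site maintainer", True),
    "server_outdated": Rule("notify site maintainer of security update", True),
    "content_deleted": Rule("capture remaining pages", False),
}


def propose_response(risk_category: str) -> tuple[str, bool]:
    """Return (proposed response, whether it may run automatically)."""
    rule = RULES.get(risk_category)
    if rule is None:
        # Unknown risks always go to a human reviewer.
        return ("refer to human reviewer", False)
    return (rule.response, rule.automatic)


print(propose_response("content_deleted"))
```

Keeping the `automatic` flag in the rule itself makes the human/tool split explicit: the tool acts only where a rule says it may, and everything else is a recommendation.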
Server-level Monitoring • Potential multi-site impact • Server vulnerabilities put site content at risk • deletion or modification • Patches and new versions of Microsoft IIS and Apache servers are released frequently • Apache HTTP Server 1.3 security updates • to version 1.3.26 on June 18, 2002 • to version 1.3.27 on October 3, 2002
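One way a monitor could flag a server that lags behind a security release is to compare the version in the HTTP `Server` response header against the latest known release. This is a sketch under assumptions: the function names are hypothetical, and servers may suppress or alter the `Server` header, so a missing version is treated as unknown rather than safe.

```python
import re


def parse_apache_version(server_header: str):
    """Extract (major, minor, patch) from a header like 'Apache/1.3.26 (Unix)'."""
    match = re.search(r"Apache/(\d+)\.(\d+)\.(\d+)", server_header)
    if match is None:
        return None  # not Apache, or version not disclosed
    return tuple(int(part) for part in match.groups())


def is_outdated(server_header: str, latest: tuple) -> bool:
    """True only when a version is disclosed and it is older than latest."""
    version = parse_apache_version(server_header)
    return version is not None and version < latest


# 1.3.27 was the October 3, 2002 security update mentioned above.
print(is_outdated("Apache/1.3.26 (Unix)", (1, 3, 27)))
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.3.9" would sort after "1.3.27".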
VRC Toolkit • Identify tools for each stage (adopt, adapt, define, devise) • Leverage existing tools; apply them to longevity • Analyze steps, automated and manual • Formalize protocol • Provide a framework to map existing tools and plug gaps with new development
VRC Toolkit Development steps: • extensive literature review • development of tool categories • definition of categories and test protocols • survey of existing tools for evaluation • selection of representative tools for testing • findings highlighted in category summaries
Web Crawling • traversing Web sites via links • a capability common to most tools, but with different purposes and results • the VRC toolkit needs more than just Web crawlers
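Traversing a site via links can be sketched as a breadth-first walk that extracts `href` values and follows only those inside a URL prefix. This is an illustrative sketch, not one of the surveyed tools: `crawl`, `LinkParser`, the `fetch` callable, and the in-memory example pages are assumptions, and a real crawler would also need HTTP fetching, robots.txt handling, and politeness delays.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, prefix):
    """Breadth-first traversal; fetch(url) returns HTML text or None."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page missing or unreachable
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            target = urldefrag(urljoin(url, href))[0]
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Made-up in-memory site standing in for real HTTP responses.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="http://other.org/">X</a>',
    "http://example.org/a.html": '<a href="/">Home</a> <a href="b.html#top">B</a>',
    "http://example.org/b.html": "",
}
print(crawl("http://example.org/", PAGES.get, "http://example.org/"))
```

Passing `fetch` as a parameter keeps the traversal logic separate from network access, so the same walk can serve link checking, monitoring, or capture.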
Tool Categories • Link checkers • Web site monitors • Web crawlers • Site management • Change management • Site mapping (includes visualization)
OAIS Issues • Pre-Ingest: Selection options • Ingest: Capture • vs. monitoring • Targets, level and frequency • Archival Storage: Formats • Access: Site(s) vs. Page(s) • AIP: Metadata issues
Management Issues • frequency of capture – determined by • nature of sites/pages • events: technological, organizational • resources • well-informed crawling • valuable vs. archival
Mandate • to fully document the site by capturing all changes to the pages/sites • to capture significant changes to pages/sites • to record periodic versions of the site • to capture one-time copy of pages/sites
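For a mandate of capturing changes, a tool first needs to know whether a page changed at all. A minimal sketch, assuming a content digest is an acceptable change signal: `record_if_changed` is a hypothetical helper that detects any byte-level difference, and deciding which differences count as significant remains a separate judgment.

```python
import hashlib


def record_if_changed(history: list, content: bytes) -> bool:
    """Append the SHA-256 digest of content to history if it differs from the last one.

    Returns True when a new version was recorded.
    """
    digest = hashlib.sha256(content).hexdigest()
    if history and history[-1] == digest:
        return False  # unchanged since the previous check
    history.append(digest)
    return True


versions = []
print(record_if_changed(versions, b"<html>v1</html>"))
print(record_if_changed(versions, b"<html>v1</html>"))
```

Run on every check, this approximates capturing all changes; run on a schedule, it supports recording periodic versions while skipping identical copies.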
Current Activities • VRC Preservation Risk Management Program: • Map stages to tool requirements • Apply to potential organizational scenarios • Enable risk/response scenario development • Toolkit: • Revise and populate tool inventory • VRC Control Site
Future Projects • Develop an approach for building a human sexuality collection: capturing blogs and other Internet communications • State government Web site case study • Demonstrators for toolkit scenarios
For Discussion What would the VRC approach have to address to be of interest, value, and/or potential impact for archivists and records managers?