
Building a National Collection of the Historical UK Web for scholarly use

Helen Hockx-Yu, Head of Web Archiving, British Library. IIPC General Assembly, Paris, May 2014.


Presentation Transcript


  1. Building a National Collection of the Historical UK Web for scholarly use Helen Hockx-Yu Head of Web Archiving, British Library IIPC General Assembly, Paris, May 2014

  2. Scholarly interaction with web archives (1) • Archive-driven • Initiated by archival institutions • Aimed at understanding scholarly requirements and improving archival practice • Scholar-driven • Initiated by scholars with research interest related to web archiving or archived web material, including many “unknown” scholars • A number of active research groups emerging: Netlab, WebArt and DMI, IHR, OII, ODU… • Attention from the Web Science community • Project-based • Various scale, scope and funding sources • Developing web archiving or discipline specific solutions • Researchers and archiving institutions work as partners

  3. Scholarly interaction with web archives (2) • Phase 1: Building collections • Scholars’ involvement in scoping collections, selecting and describing websites relevant to research interest • Creation of specific, (narrow) topical collections, e.g. “Religion, politics and law since 2005” in the UK Web Archive • Phase 2: Formulating research questions • Brain-storm sessions, workshops etc. • Shift of focus to web archives in entirety • Lack of awareness & baseline knowledge • Time & resource consuming • Challenging: you don’t know what you don’t know

  4. Scholarly interaction: the “go-to” state • Independent use of web archives • Meet common scholarly requirements, support scholarly workflow • Base-line knowledge is self-explanatory, e.g. scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled • Clear interfaces and jargon-free descriptions in alignment with scholarly requirements • Open access • Including provision of downloadable derived or secondary datasets, e.g. http://data.webarchive.org.uk/opendata/ • Publication of work citing web archives

  5. Selective archiving since 2003 • Permission-based • Open UK Web Archive http://www.webarchive.org.uk/ukwa/ • ~14,000 websites, ~64,000 instances • URL and full-text search • Curated collections • Many websites no longer available on the live web

  6. 6th April 2013… • Legal Deposit Libraries (Non-Print Works) Regulations 2013 • Extension of existing legal framework • Systematic collection of UK’s published output for heritage & preservation • By 6 UK Legal Deposit Libraries

  7. JISC UK Web Domain dataset (1996-2014) • Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library • Extracted copies of UK websites from the Internet Archive's collection • 1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs • 2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated) • Research agreement between JISC and IA, upholding IA’s Terms of Use • Access via IA’s Wayback Machine • Allows replication / extraction of derivative or secondary datasets • BL hosts the dataset on behalf of JISC
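
As a rough sense of scale, the two tranche figures on this slide can be combined. The sketch below is illustrative only: it assumes decimal units (1 TB = 10**12 bytes), treats the estimated second-tranche URL count as exact, and does not know whether the quoted sizes are compressed, so the per-URL mean is indicative at best. The names `TRANCHES` and `summarise` are made up for this example.

```python
# Figures as quoted on the slide; the second tranche's URL count is an estimate.
TRANCHES = {
    "1996-2010": {"size_tb": 30.0, "urls_billion": 2.5},
    "2010-April 2013": {"size_tb": 27.5, "urls_billion": 1.5},
}


def summarise(tranches):
    """Return (total size in TB, total URLs in billions, mean kB per URL).

    Assumes decimal units: 1 TB = 10**12 bytes, 1 kB = 10**3 bytes.
    """
    total_tb = sum(t["size_tb"] for t in tranches.values())
    total_urls = sum(t["urls_billion"] for t in tranches.values())
    mean_kb = (total_tb * 10**12) / (total_urls * 10**9) / 10**3
    return total_tb, total_urls, mean_kb


total_tb, total_urls, mean_kb = summarise(TRANCHES)
print(f"{total_tb} TB, {total_urls} billion URLs, ~{mean_kb:.1f} kB per URL")
```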

  8. Completed work • Analytical Access to the Domain Dark Archive Project • Use cases & experimental UI • Demonstrating the Value of the UK Web Domain Dataset for Social Science Research • Analysis of link graph • Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web • MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing • Secondary datasets under open licence • Format profile, Geoindex, Host Link Graph

  9. Exploring Host Link Graph Courtesy of Peter Webster, Rainer Simon and Jules Mataly

  10. Visualising links (to and from bl.uk) Interactive version How it is done

  11. Visualising links (to and from bl.uk) Interactive version How it is done
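
The slides do not record how the bl.uk link visualisation was built. As a minimal sketch of the underlying idea, the code below tallies links to and from one host in a host-level link graph. It assumes a hypothetical pipe-delimited layout of `year|source_host|target_host|link_count`; the real layout of the published Host Link Graph should be checked against its documentation, and the sample rows are invented purely for illustration.

```python
from collections import Counter

# Invented sample rows in an ASSUMED "year|source_host|target_host|link_count"
# layout. These are not real figures from the dataset.
SAMPLE = """\
2004|bl.uk|nationalarchives.gov.uk|12
2004|ox.ac.uk|bl.uk|7
2005|bl.uk|ox.ac.uk|3
2005|example.co.uk|example.org.uk|40
"""


def links_for_host(lines, host):
    """Tally links from and to `host`, keyed by (year, other_host)."""
    outgoing, incoming = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        year, source, target, count = line.split("|")
        if source == host:
            outgoing[(int(year), target)] += int(count)
        if target == host:
            incoming[(int(year), source)] += int(count)
    return outgoing, incoming


outgoing, incoming = links_for_host(SAMPLE.splitlines(), "bl.uk")
for (year, other), n in sorted(outgoing.items()):
    print(f"{year}: bl.uk -> {other} ({n} links)")
for (year, other), n in sorted(incoming.items()):
    print(f"{year}: {other} -> bl.uk ({n} links)")
```

The two counters could then be passed to whatever plotting or graph library is preferred; that step is left out because the slides do not say which tool was used.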

  12. Big UK Domain Data for Arts and Humanities • Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects • Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University • Develop theoretical and methodological framework for the study of web archives • Build on AADDA: researchers and the BL co-produce access tools • A major study of the history of UK web space from 1996 to 2013 + sub-projects covering a range of disciplines • Also an online training course and peer-reviewed journal articles.

  13. New projects and initiatives • ALEXANDRIA: Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives • 5-year project funded by the European Research Council • Develop new models and algorithms for retrieval, exploration, and analytics of web archives • Collaborate on common issues, e.g. publication dates versus crawl dates • RESAW, a Research Infrastructure for the Study of Archived Web Materials • Currently a coordinated, self-organising, and self-financing open network • Preparing application for EU’s Horizon 2020 framework

  14. Benefits • Helps researchers understand the value of web archives and explore new ways of using these for scholarly research • Allows BL to obtain hands-on experience with indexing and processing large scale web archive datasets • Analytics and visualisations can be applied to our own Legal Deposit collection • Acts as test-bed for research and development projects • Enables BL to participate in various UK, European and international projects • Helps curators understand characteristics of large scale digital corpora • Improves the way we collect and store web archives

  15. Some Issues • Ownership • Data quality • Different formats, ARC and WARCs • Partially de-duplicated • Context • No crawl log or information on data cap applied during crawl time • No detailed information on extraction mechanism • More general issues related to analytical access • Scepticism or suspicion about hidden algorithms behind analysis • Biases in data and how data collection decisions lead to variances in outputs • Need to manage expectations, analysis and visualisation as finished products and first steps • Ethical and privacy issues
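
One practical consequence of the mixed ARC and WARC formats is that processing code has to tell the two apart before parsing. The sketch below is not taken from the presentation; it is one assumed way to sniff the container type from the first bytes of a file, relying on the convention that WARC records begin with a `WARC/` version line and ARC files begin with a `filedesc://` version block, optionally inside gzip. The function name `sniff_container` and the sample bytes are hypothetical.

```python
import gzip
import zlib


def sniff_container(head: bytes) -> str:
    """Guess whether the leading bytes of a file look like ARC or WARC."""
    if head[:2] == b"\x1f\x8b":  # gzip magic number
        # wbits=31 selects gzip framing; a decompress object accepts
        # input that stops partway through the compressed stream.
        head = zlib.decompressobj(wbits=31).decompress(head)
    if head.startswith(b"WARC/"):
        return "warc"
    if head.startswith(b"filedesc://"):
        return "arc"
    return "unknown"


# Hypothetical leading bytes, built in memory for illustration.
warc_head = gzip.compress(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n")
arc_head = b"filedesc://example.arc 0.0.0.0 20140101000000 text/plain 76\n"
print(sniff_container(warc_head), sniff_container(arc_head))
```

For real work a dedicated web archive parsing library would be the safer choice; this only shows the kind of check that the format mix makes necessary.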

  16. Thank you! Questions? Getting in touch: • Twitter: @ukwebarchive • Email: web-archivist@bl.uk • UK Web Archive: http://www.webarchive.org.uk
