Applying records management processes principles to the open government record A Semantic Knowledge Base for the UK Government Web Archive Tom Storrar & Claire Newing
Overview • The National Archives’ Digital Strategy • An overview of the SKB project, including: • The Problem • The Solution • Next Steps
Introducing the UK Government Web Archive • More than 18,000 crawls of over 3,000 websites from 1996-2014 • Approximately 90 TB of data, 3.5 billion resources • More than 875,000 ARC files • More than 20 million page views and 2-3 million visits per month
Who are our users and what do they want? • User surveys on website: all banners and index pages • Established that UKGWA is regularly visited by a great variety of users. • The biggest area for dissatisfaction was found to be the existing search functions. • We constructed user stories so we could test the improvements.
Full Text Search – its limitations Our full text search is very useful and very much used, but is • limited by how the live sites were at crawl time • noisy as it contains much duplicate or near-duplicate material • reliant on keyword matching • most useful when combined with specialist knowledge
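The "near-duplicate" noise mentioned on this slide can be made concrete with a small sketch. The example below is an illustration only, not the method used by the UKGWA search: it assumes word-level shingles compared by Jaccard similarity, and the function names, shingle size, and threshold are all hypothetical choices.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str,
                      threshold: float = 0.7, size: int = 3) -> bool:
    """Treat two texts as near-duplicates when their shingle sets overlap enough.

    The threshold is an assumed value for illustration, not a tuned one.
    """
    return jaccard(shingles(text_a, size), shingles(text_b, size)) >= threshold
```

Two captures of the same page that differ by a single word share most of their shingles, so a keyword search returns both and the result list looks noisy even though little new information is present.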
Semantic Search – What it allows • Aim was to improve access to information in the UKGWA by providing far richer information about what it contains • The semantic web is a start to tackling a limitation of the web • Becomes a dataset in its own right • Borrows from and contributes to the web • The technology is open and machine-readable. APIs allow the data to be easily queried and integrated with other services • The project was awarded to a consortium led by Ontotext AD, the University of Sheffield and System Simulation
UKGWA: a good candidate for semantic search? • Each resource already has a persistent HTTP URI • UKGWA is both limited and diverse • Generic and domain-specific meanings can be attributed to otherwise loose terms, e.g.: • Facts can be modelled and refined to show the linkages between entities and how they change over time • The 2010 general election was an opportunity to demonstrate the concept
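The idea of facts that link entities and change over time can be sketched as time-scoped statements. The example below is a minimal illustration under stated assumptions: the `Fact` class, the `holdsRole` predicate, and the half-open validity interval are hypothetical and are not taken from the SKB ontology. The dates are the public dates on which the two Prime Ministers took and left office around the 2010 general election.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """A subject-predicate-object statement valid over [valid_from, valid_to)."""
    subject: str
    predicate: str          # hypothetical predicate name, not from the SKB ontology
    obj: str
    valid_from: date
    valid_to: Optional[date] = None   # None would mean "no known end date"

    def holds_on(self, day: date) -> bool:
        """True if the fact is valid on the given day (end date is exclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


FACTS = [
    Fact("Gordon Brown", "holdsRole", "Prime Minister",
         date(2007, 6, 27), date(2010, 5, 11)),
    Fact("David Cameron", "holdsRole", "Prime Minister",
         date(2010, 5, 11), date(2016, 7, 13)),
]


def who_held(role: str, day: date, facts: List[Fact] = FACTS) -> List[str]:
    """Return every subject recorded as holding the role on the given day."""
    return [f.subject for f in facts
            if f.predicate == "holdsRole" and f.obj == role and f.holds_on(day)]
```

With facts scoped like this, a page captured before the election and one captured after it can both mention "the Prime Minister" and still be resolved to different people, which keyword matching alone cannot do.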
Making UKGWA semantic – How? Image: Ontotext AD, University of Sheffield and System Simulation.
What we learned and next steps • We will deliver it as an internal system to develop further • It’s not AI! 60-70% annotation accuracy is not bad at this scale! • The concept can be difficult to explain, and even harder for those unfamiliar with computer science to use (SPARQL etc.)
prefix skb: <http://proton.semanticweb.org/skb-ont#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?URL ?title
where {
  ?page <http://ordi.ontotext.com/sar#hasFeature> ?doc_feature .
  ?doc_feature <http://ordi.ontotext.com/sar#hasValue> ?URL .
  ?doc_feature <http://ordi.ontotext.com/sar#hasKey> "WEBARCHIVEURL" .
  ?page <http://proton.semanticweb.org/2006/05/protont#title> ?title .
  FILTER regex(str(?title), "Foot and Mouth", "i") .
  FILTER regex(str(?title), "Prime Minister", "i") .
  ?page <http://proton.semanticweb.org/2006/05/protont#hasDate>
}
• So, integrating the system with other services is a must.
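One way to spare users from writing SPARQL by hand is to generate the query from plain phrases. The sketch below is an assumption about how such an integration layer might look, not part of the SKB system: the function name `build_title_query` and the restriction to letters, digits, and spaces are hypothetical. It reuses the predicate URIs from the query on this slide and omits the truncated `hasDate` pattern, whose object is not shown.

```python
import re
from typing import List

# Namespace URIs copied from the query on the slide.
SAR = "http://ordi.ontotext.com/sar#"
PROTONT = "http://proton.semanticweb.org/2006/05/protont#"

# Assumed safety rule: only plain phrases are accepted, so nothing needs
# escaping inside the SPARQL string literal or the regular expression.
_SAFE_PHRASE = re.compile(r"[A-Za-z0-9 ]+")


def build_title_query(phrases: List[str]) -> str:
    """Build a SPARQL query for pages whose title matches every phrase."""
    if not phrases:
        raise ValueError("at least one phrase is required")
    for phrase in phrases:
        if not _SAFE_PHRASE.fullmatch(phrase):
            raise ValueError(f"unsupported characters in phrase: {phrase!r}")

    # One case-insensitive regex filter per phrase, as on the slide.
    filters = "\n".join(
        f'  FILTER regex(str(?title), "{phrase}", "i")' for phrase in phrases
    )
    return (
        "SELECT DISTINCT ?URL ?title\n"
        "WHERE {\n"
        f"  ?page <{SAR}hasFeature> ?doc_feature .\n"
        f"  ?doc_feature <{SAR}hasValue> ?URL .\n"
        f'  ?doc_feature <{SAR}hasKey> "WEBARCHIVEURL" .\n'
        f"  ?page <{PROTONT}title> ?title .\n"
        f"{filters}\n"
        "}"
    )
```

Calling `build_title_query(["Foot and Mouth", "Prime Minister"])` produces a query with the same title filters as the slide, so a search form or another service could pass in phrases and never expose the SPARQL itself.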
Applying records management processes principles to the open government record Any Questions? Contact us: webarchive@nationalarchives.gsi.gov.uk Visit: nationalarchives.gov.uk/webarchive