Rob Walpole, DeveXe Limited, The National Archives

The National Archives Digital Records Infrastructure Catalogue: First Steps to Creating a Semantic Digital Archive
Rob Walpole, DeveXe Limited, The National Archives

Disclaimer This presentation is in no way intended to express the views or opinions of The National Archives and is solely the work of Rob Walpole, an employee of DeveXe Limited, who are currently contracted to assist in the development of the Digital Records Infrastructure at Kew in London. Apart from providing a case study of developing a semantic digital archive, this presentation discusses the opportunities permitted by such development. It should not be assumed that these developments will occur, and DeveXe Limited take no responsibility for any perceived inaccuracies.

Background

Background – The National Archives The National Archives (TNA): over 11 million historical government and public records, from the Domesday Book to the Agreement on a Referendum on Independence for Scotland. But not births, deaths and marriages; these are held by the General Register Office! Photo by Chris Hill

Background – The National Archives Most of these documents are currently held on paper - or even parchment... Photo by Liz West

Background – The National Archives But soon this will be overtaken by a tsunami of digital files... ...including office documents, emails, images, videos and much more. Photo by Marco Mazzei

Background – Digital Records Infrastructure There are many challenges around digital preservation, including: format recognition; software preservation; compatibility; degradation of media. Many of these issues were highlighted by the BBC Domesday Project (1986).

Background – Digital Records Infrastructure TNA have been at the forefront of meeting this digital preservation challenge: PRONOM – file format registry; DROID – file format identification tool; Legislation.gov.uk – all UK legislation on-line; UK Government Web Archive – http://www.nationalarchives.gov.uk/webarchive/; The London Gazette – published by HMSO (part of TNA).

Background – Digital Records Infrastructure In 2006 TNA deployed the Digital Repository System (DRS), which provided terabyte-scale long-term storage. In 2012 TNA started to build DRI (Digital Records Infrastructure) on the foundations of DRS to deliver extensible storage to the petabyte scale and beyond.

Background – Digital Records Infrastructure 80,000 digitised Home Guard records from World War 2 were ingested into DRI as a proof of concept... ...now many more, including LOCOG (2012 Olympic Games) and the Leveson Inquiry.

Background – Digital Records Infrastructure At its core, this massive storage is provided by a robot tape library, with frequently requested and low-resolution copies of data held in a disk cache. Photo by Cory Doctorow

Background – The DRI Catalogue The DRI Catalogue is essentially an inventory of the items held in the archive. It is distinct from the TNA Catalogue which is a comprehensive catalogue system covering both paper and digital documents. Public access to the TNA Catalogue is provided by Discovery.

Background – The DRI Catalogue Rich XML metadata is stored in the archive itself, alongside the original document, and a copy is sent to Discovery. This comes from a variety of sources: the record provider; the archiving process; document transcription; archivists. And there is a very good reason for using XML... ...it's human readable!

Requirements

Requirements – The DRI Catalogue Apart from being an inventory, the DRI Catalogue is needed to help manage: closure information; record opening; export lists; export status.

Requirements - Closure Closure can be very fine-grained. For example, Home Guard records have an open description (individual's name, battalion, etc.) but: the service record is closed until the individual is deceased; the medical record is closed until the record is 100 years old.
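The closure rules on this slide could be modelled roughly as follows. This is only a minimal sketch and not how DRI implements closure: the class name `ClosureRules`, its method names and the way dates are passed in are all assumptions made for illustration.

```java
import java.time.LocalDate;

// Hypothetical sketch of fine-grained closure rules for a Home Guard record.
// None of these names come from DRI; they only illustrate the rules on the slide.
final class ClosureRules {

    // The description (individual's name, battalion etc.) is always open.
    static boolean isDescriptionOpen() {
        return true;
    }

    // The service record stays closed until the individual is deceased.
    // dateOfDeath is null while the individual is not known to be deceased.
    static boolean isServiceRecordOpen(LocalDate dateOfDeath, LocalDate today) {
        return dateOfDeath != null && !dateOfDeath.isAfter(today);
    }

    // The medical record stays closed until the record itself is 100 years old.
    static boolean isMedicalRecordOpen(LocalDate recordDate, LocalDate today) {
        return !recordDate.plusYears(100).isAfter(today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2013, 6, 15);
        LocalDate recordDate = LocalDate.of(1941, 3, 1);
        System.out.println(isDescriptionOpen());                    // true
        System.out.println(isServiceRecordOpen(null, today));       // false
        System.out.println(isMedicalRecordOpen(recordDate, today)); // false
    }
}
```

Each part of the record gets its own rule, which is what makes closure fine-grained: one record can be open, closed and conditionally closed at the same time.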

Requirements – Record Export The export process itself is in the form of a work-flow with many steps. The DRI Catalogue must maintain the status and other information about the export...

Requirements – The Problem Initially the DRI Catalogue was held in an RDBMS. However, the fine-grained nature of closure meant very slow queries when attempting to export large numbers of records, sometimes taking hours to complete! Another approach was needed...

Requirements – Initial Analysis Three different proposals were made for modelling the catalogue, so a trial was conducted to establish the best approach. The three models trialled were: Relational – optimising the existing SQL queries against a modified table structure; Graph – running SPARQL queries against an RDF store; Hierarchical – running XQuery against an XML database.

Requirements – Analysis Results Relational – reduced query time from hours to minutes 😌 Graph – reduced query time to seconds 😎 Hierarchical – approach abandoned 😞

Requirements – Analysis Results The hierarchical approach was abandoned because: the graph approach provided a good solution; the graph approach offered a path towards Linked Data; cost overheads and deadlines obliged us to move on. A hierarchical approach may have offered comparable performance and opportunity. We simply don't know...

Requirements – Analysis Conclusion The issues of closure and export had led to fundamental questions about the nature of the catalogue. We don't know exactly what information will need to go into DRI, but we know it will be information about people, organisations, their relationships and activities. These things are complex and varied – just like the world around us! A graph approach not only resolved the issues with closure and export but also provides a powerful and flexible tool for discovering information within the archive.

Design

Design - Technology Technologies used during the trial included: D2RQ; the Apache Jena framework (including TDB and Fuseki); Turtle (RDF); SPARQL 1.1 (Query and Update). The Jena framework was chosen because it has an excellent Java API and is open source.

Design - Technology UK Government Service Design Manual states... “...it remains the policy of the government that, where there is no significant overall cost difference between open and non-open source products that fulfil minimum and essential capabilities, open source will be selected on the basis of its inherent flexibility.” “Use open standards and common Government platforms (e.g. Identity Assurance) where available”

Design – The Catalogue Services

Design – DRI Vocabulary W3C recommend re-using vocabularies wherever possible and DRI already does this extensively in the XML metadata. But we needed to be able to talk about things very specific to DRI, such as Closure and Export. So we extended RDF Schema (RDFS) with a few of our own classes and properties, such as:
    <http://nationalarchives.gov.uk/terms/dri#Export> rdf:type rdfs:Class .
    <http://nationalarchives.gov.uk/terms/dri#exportMember> rdf:type rdf:Property .

Design – DRI Vocabulary This allows us to talk about DRI exports such as:
    <http://nationalarchives.gov.uk/dri/catalogue/export/123> a dri:Export ;
        dri:exportMember <http://nationalarchives.gov.uk/dri/catalogue/item/abc> ;
        dri:exportMember <http://nationalarchives.gov.uk/dri/catalogue/item/def> .
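With the export described this way, listing its members becomes a one-pattern SPARQL query. The sketch below only builds the query text. The class and method names are hypothetical, and the `dri:` prefix is assumed to expand to the `http://nationalarchives.gov.uk/terms/dri#` namespace used in the vocabulary definitions.

```java
// Hypothetical helper that builds a SPARQL 1.1 SELECT query
// listing the members of one export.
final class ExportQueries {

    // Namespace taken from the vocabulary terms shown on the slides.
    static final String DRI = "http://nationalarchives.gov.uk/terms/dri#";

    // exportUri is assumed to be a trusted, well-formed absolute URI;
    // real code should validate it before splicing it into a query.
    static String membersOf(String exportUri) {
        return "PREFIX dri: <" + DRI + ">\n"
             + "SELECT ?item\n"
             + "WHERE {\n"
             + "  <" + exportUri + "> a dri:Export ;\n"
             + "      dri:exportMember ?item .\n"
             + "}\n";
    }

    public static void main(String[] args) {
        // Prints the query for the example export on the slide.
        System.out.println(
            membersOf("http://nationalarchives.gov.uk/dri/catalogue/export/123"));
    }
}
```

Run against the example data, a query like this would be expected to bind `?item` to the two catalogue items `abc` and `def`.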

Design – The Catalogue Services The Apache Jena Framework provides a straightforward approach to reading, writing, updating and deleting data using W3C standards... Reading – SPARQL 1.1 Query Language; Writing – creating and persisting new RDF triples (e.g. Turtle), SPARQL 1.1 Graph Store Protocol; Updating and Deleting – SPARQL 1.1 Update Language.
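As an illustration of the update path, a SPARQL 1.1 Update can be sent to an endpoint as a plain HTTP POST with the `application/sparql-update` content type. The following is a sketch under several assumptions: `dri:exportStatus` is a hypothetical property (the slides name only `dri:Export` and `dri:exportMember`), the endpoint URL is a made-up local Fuseki address, and the class and method names are invented. The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: replace the status of an export with SPARQL 1.1 Update.
final class ExportStatusUpdate {

    static final String DRI = "http://nationalarchives.gov.uk/terms/dri#";

    // dri:exportStatus is an assumed property, not one named in the slides.
    // status is assumed to be a simple token containing no quotes or backslashes.
    static String setStatus(String exportUri, String status) {
        return "PREFIX dri: <" + DRI + ">\n"
             + "DELETE { <" + exportUri + "> dri:exportStatus ?old }\n"
             + "INSERT { <" + exportUri + "> dri:exportStatus \"" + status + "\" }\n"
             + "WHERE  { OPTIONAL { <" + exportUri + "> dri:exportStatus ?old } }\n";
    }

    // Wraps the update in an HTTP POST as described by the SPARQL 1.1 Protocol.
    static HttpRequest toRequest(String endpoint, String update) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/sparql-update")
                .POST(HttpRequest.BodyPublishers.ofString(update))
                .build();
    }

    public static void main(String[] args) {
        String update = setStatus(
            "http://nationalarchives.gov.uk/dri/catalogue/export/123", "COMPLETE");
        // Assumed endpoint; a real deployment will use its own dataset name.
        HttpRequest request = toRequest("http://localhost:3030/catalogue/update", update);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(update);
    }
}
```

The `OPTIONAL` in the `WHERE` clause lets the same update work whether or not a previous status exists. To actually send the request, it could be passed to `HttpClient.newHttpClient().send(...)`.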

Design – The Catalogue Services However... ...having to learn SPARQL can be a hurdle to widespread acceptance of this technology! The answer... ...Elda (a Linked Data API implementation) provides RESTful access to pre-configured SPARQL queries:
    spec:collectionList a apivc:ListEndpoint ;
        apivc:uriTemplate "/collection" ;
        apivc:selector [
            apivc:where " ?item a dri:Collection . " ;
        ] ;
        .

Design - Implementation So how did we actually do it...? 1) Create a mapping from RDBMS to vocabulary terms. 2) Export data from RDBMS to N-Quads using D2RQ. 3) Load N-Quads into Jena TDB (embedded version). 4) Write SPARQL transform (CONSTRUCT) queries to refine RDF. 5) Run queries in Fuseki, download results and reload into clean database instances.
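A transform query of the kind mentioned in step 4 might look something like the sketch below. The source-side terms are entirely hypothetical: the `vocab:` namespace and the `export_item_export` / `export_item_item` properties stand in for whatever the D2RQ mapping actually generated from the relational tables, which the slides do not show. Only `dri:Export` and `dri:exportMember` come from the presentation.

```java
// Hypothetical sketch of a SPARQL CONSTRUCT query that rewrites
// table-shaped RDF into the DRI vocabulary.
final class RefineQueries {

    static String exportMembers() {
        return "PREFIX dri:   <http://nationalarchives.gov.uk/terms/dri#>\n"
             // Assumed namespace for terms generated from the RDBMS mapping.
             + "PREFIX vocab: <http://example.org/vocab/>\n"
             + "CONSTRUCT {\n"
             + "  ?export a dri:Export ;\n"
             + "          dri:exportMember ?item .\n"
             + "}\n"
             + "WHERE {\n"
             // Assumed shape: one resource per row of an export/item join table.
             + "  ?row vocab:export_item_export ?export ;\n"
             + "       vocab:export_item_item   ?item .\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.println(exportMembers());
    }
}
```

The output of such a query is a new set of triples using only the target vocabulary, which is what gets downloaded and reloaded into a clean database instance in step 5.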

Design – Catalogue Services API RESTful JAX-RS web application providing a very simple API:
    <result xmlns="http://nationalarchives.gov.uk/dri/catalogue">
        <uuid>e9f3c8e9-e883-4fcf-a9a3-5caf0c808c5d</uuid>
    </result>
Why XML? Why not JSON? The web services are consumed by Java applications. JSON is used in some circumstances, i.e. for a JavaScript tree editor.
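A Java consumer could read a response like this with the XML parser that ships with the JDK. This is a minimal sketch, assuming the response always has the shape shown on the slide. The class name `CatalogueResult` and the method `parseUuid` are hypothetical and are not part of the Catalogue Services API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical client-side helper for reading a catalogue <result> document.
final class CatalogueResult {

    // Namespace taken from the example response on the slide.
    static final String NS = "http://nationalarchives.gov.uk/dri/catalogue";

    static UUID parseUuid(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the namespace-aware lookup below
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS(NS, "uuid");
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("no <uuid> element in result");
        }
        return UUID.fromString(nodes.item(0).getTextContent().trim());
    }

    public static void main(String[] args) throws Exception {
        String xml = "<result xmlns=\"http://nationalarchives.gov.uk/dri/catalogue\">"
                   + "<uuid>e9f3c8e9-e883-4fcf-a9a3-5caf0c808c5d</uuid></result>";
        System.out.println(parseUuid(xml)); // e9f3c8e9-e883-4fcf-a9a3-5caf0c808c5d
    }
}
```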

Design – Insights Issues and Limitations Elda – Linked Data API Implementation

Design – Insights Issues and Limitations Xturtle: there is a shortage of RDF/SPARQL editors and IDEs! Xturtle provides a useful syntax-highlighting plug-in for Eclipse...

Design – Insights Issues and Limitations Scardf - http://code.google.com/p/scardf/
Jena (Java):
    Model model = ModelFactory.createDefaultModel();
    model.createResource( "http://somewhere/JohnSmith" )
         .addProperty( N, model.createResource()
             .addProperty( Given, "John" )
             .addProperty( Family, "Smith" ) );
Scardf (Scala):
    Graph( UriRef( "http://somewhere/JohnSmith" ) -N-> Branch( Given -> "John", Family -> "Smith" ) )

Design – Insights Issues and Limitations Scale and Performance Will the DRI Catalogue cope with the tsunami? We think it will... 1) This solution was chosen because of its performance. 2) We are confident we can scale horizontally. In fact, a catalogue for each collection makes some sense. You could then create a catalogue of catalogues to search everything! 3) If the existing framework fails to scale satisfactorily, the fact that we are using open standards means moving to another framework should be straightforward.
