The iCollections Project and the Art of Digitisation: Reconciling One Messy Dataset with another Messy Dataset Adrian Hine, Natural History Museum, London
iCollections Background • Part of the broader ambition to digitise the NHM collections over the next decade. • iCollections began in March 2013 for 3 years, using 7 full-time digitisers plus involvement from IT, collections, imaging facility & data management staff. • Digitise the British Lepidoptera Collection (Butterflies & Moths) of approximately 500,000 specimens (5,000 drawers). • Why Lepidoptera? It appeals to 3 audiences: the general public, amateur entomologists & researchers. • The focus of digitisation is on capturing the specimen data, not on the specimen image per se.
iCollections Aims • Digitise the UK Lepidoptera Collection. • A prototype project to investigate workflows for mass digitisation of collections at the NHM. • Generate research data: data made available to the NHM climate change research team, which is looking at phenology. • Generate national occurrence data for the NBN. • An opportunity to rehouse specimens in new drawers. • Each specimen is given a unique specimen number (Data Matrix barcode & human readable).
Data Capture & Processing • Data quality is at the heart of the digitisation process: we wish to control the quality of data going into EMu. • We did not want simply to push large quantities of unqualified data into EMu and have to deal with it at a later stage. • A sustainable, consistent, systematic, documented approach to data capture. • Every stage of the digitisation process follows agreed protocols.
Data Capture & Processing • Opted for data capture outside EMu: • poor quality data in EMu makes databasing directly into EMu difficult and slow (sites, taxonomy, parties); • build a highly streamlined data entry interface for core data; • build tools to help with data processing & validation, controlling the data going into EMu (reducing duplication and the generation of poor data); • generate a harmonised set of georeferenced UK site records (the existing ones are poor). • The significant challenge when generating new data is how to reconcile it with the existing data within EMu (taxonomy, sites, parties, specimens).
Digitisation Workflow Overview 1) Preparation & imaging (Digitiser) 2) Record ingestion (Script) 3) Rapid data entry / transcription (Digitiser) 4) Data validation & reconciliation (Specialist) 5) Georeferencing (Specialist) 6) Import into EMu (Data Manager)
2) Record Ingestion • The script uses the application Barcodefiler to search the image for a barcode. If one is found, the script renames the image file with the specimen number. • It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields: • specimen number (from the barcode) • drawer number (from the folder name) • taxon name (from the folder name) • Using the ImageMagick libraries it creates a cropped label derivative image. • It imports an image record for the master and the derivative. A sketch of this step is given below.
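The ingestion step might look roughly like the following minimal sketch. It is illustrative only: it assumes pylibdmtx for the Data Matrix decoding (standing in for the Barcodefiler application), an ImageMagick convert call with an invented crop geometry, sqlite3 as a stand-in for the project's SQL backend, and a drawer/taxon folder layout; none of these details are confirmed by the source.

```python
# Illustrative record-ingestion sketch (assumed tools: pylibdmtx, ImageMagick, sqlite3).
import os
import sqlite3
import subprocess
from PIL import Image
from pylibdmtx.pylibdmtx import decode

def ingest_image(image_path, db_path="rapid_capture.db"):
    # Assumed folder layout: .../<drawer_number>/<taxon_name>/<file>.jpg
    taxon_name = os.path.basename(os.path.dirname(image_path))
    drawer_number = os.path.basename(os.path.dirname(os.path.dirname(image_path)))

    # 1) Search the image for a Data Matrix barcode.
    results = decode(Image.open(image_path))
    if not results:
        return None                          # no read: flag for manual handling
    specimen_number = results[0].data.decode("ascii")

    # 2) Rename the master image to the specimen number.
    folder = os.path.dirname(image_path)
    master = os.path.join(folder, specimen_number + ".jpg")
    os.rename(image_path, master)

    # 3) Create a cropped label derivative with ImageMagick (crop geometry is invented here).
    derivative = os.path.join(folder, specimen_number + "_label.jpg")
    subprocess.run(["convert", master, "-crop", "1200x800+0+2000", derivative], check=True)

    # 4) Create the stub record with the three core data fields (table name assumed).
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "INSERT INTO stub_records (specimen_number, drawer_number, taxon_name) VALUES (?, ?, ?)",
            (specimen_number, drawer_number, taxon_name),
        )
    return specimen_number
```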
4) Data Validation/Reconciliation • The biggest challenge is how we validate/reconcile the data generated by the project with EMu data. • We wish to use appropriate records where they already exist in EMu and not create additional duplicates. • Data concepts we wish to validate against existing EMu records where possible: • Taxonomy (determination) • Parties (collectors) • Locations (drawers) • Data concepts to create as new: • Sites
Taxonomy • In EMu, taxonomy is still a mess! For UK butterflies there are thousands of names: duplicates, erroneous names, different combinations. • We did not have the time to clean the taxonomy for UK Lepidoptera; we have to live with the mess! • Taxonomic expertise is needed to validate the iCollections name against the correct concept in EMu. • Many aberrational names are not present in EMu. • Typos and errors occur when digitisers enter names. • We can't rely on the EMu import algorithms, as matching taxon names is too complex. • Built a mapping tool to reconcile each captured taxon name with an existing EMu name.
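As an illustration of the kind of matching such a mapping tool can offer, the sketch below suggests candidate EMu names for a digitiser-captured name, assuming the EMu names have been exported to a plain list. difflib is used purely for illustration; the actual matching rules of the project's tool are not described in the source.

```python
# Suggest close EMu taxon-name matches for a captured name (illustrative only).
import difflib

def suggest_emu_names(captured_name, emu_names, n=5, cutoff=0.8):
    """Return up to n close EMu name matches for a digitiser-captured name."""
    return difflib.get_close_matches(captured_name.strip(), emu_names, n=n, cutoff=cutoff)

# A typo by a digitiser still surfaces the intended name for a specialist
# to confirm, rather than silently creating a duplicate taxonomy record.
emu_names = ["Pieris brassicae", "Pieris rapae", "Pieris napi"]
print(suggest_emu_names("Pieris brasicae", emu_names))   # ['Pieris brassicae']
```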
Sites • Messy data makes databasing directly into EMu difficult: the Sites data are of poor quality. Very few records are usable, and there is very poor consistency in how data have been captured (diverse data sources). • Mapping site string variants to a master site record (String Thing), e.g. "Box Hill", "Box Hill; Surrey", "Box Hill; Kent", "Box Hill; Surrey; UK", "Box Hill; near Dorking", "Box Hill District". • In the process of building a tool to help harmonise the site data prior to import (see the sketch below).
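A minimal sketch of one way to group raw site strings under a candidate master site is shown below. It assumes variants share a normalised leading place-name token; the real harmonisation tool and its rules are not specified in the source, so the normalisation here is an illustrative assumption only.

```python
# Group raw site strings by a crude normalised key for specialist review (illustrative only).
from collections import defaultdict

def normalise(site_string):
    # Use the first semicolon-delimited token, lower-cased and trimmed,
    # as the grouping key ("Box Hill; Surrey" -> "box hill").
    return site_string.split(";")[0].strip().lower()

def group_site_variants(raw_sites):
    groups = defaultdict(list)
    for s in raw_sites:
        groups[normalise(s)].append(s)
    return groups

variants = [
    "Box Hill",
    "Box Hill; Surrey",
    "Box Hill; Kent",
    "Box Hill; Surrey; UK",
    "Box Hill; near Dorking",
    "Box Hill District",
]
# Near-matches such as "Box Hill District" fall into their own group and
# still need a specialist to map them to the master record.
for key, members in group_site_variants(variants).items():
    print(key, "->", members)
```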
5) Georeferencing • Attempting to automate where possible, using a list of UK & Irish Ordnance Survey centroids for place names. • Where a single match is found, the centroid can be applied automatically. • Decoupled from the other steps in the workflow, so it can be done independently and won't be a bottleneck.
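The automatic rule described above can be sketched as follows, assuming the OS centroid gazetteer has been loaded as a mapping from place name to a list of candidate centroids. The data structure and coordinates here are illustrative assumptions, not the project's actual gazetteer format.

```python
# Apply a centroid automatically only when the gazetteer holds exactly one match.
def auto_georeference(place_name, gazetteer):
    matches = gazetteer.get(place_name.strip().lower(), [])
    if len(matches) == 1:
        return matches[0]        # single match: safe to apply automatically
    return None                  # zero or multiple matches: leave for a specialist

gazetteer = {
    "box hill": [(51.254, -0.313)],                        # single match -> auto-applied
    "newport": [(51.588, -2.998), (50.701, -1.293)],       # ambiguous -> manual review
}
print(auto_georeference("Box Hill", gazetteer))
print(auto_georeference("Newport", gazetteer))
```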
6) Import into EMu • The import takes a phased approach: • 1) Images. KE have built a backend script to ingest multimedia records server-side to speed up ingestion rates; this is much quicker than importing through the client and uses the batch operations module. The script reports out a CSV with the EMu IRN & file name identifier. • 2) Specimen record (basic) • 3) Append the Collection Event data • 4) Append the georeference data
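One use of the reported CSV is to link the multimedia IRNs back to their specimens ahead of the basic specimen import, as sketched below. The column names and the file-naming rule (file name starting with the specimen number) are assumptions for illustration; the source does not detail how the attachment is actually made in EMu.

```python
# Collect multimedia IRNs per specimen from the ingestion report CSV (illustrative only).
import csv
from collections import defaultdict

def multimedia_by_specimen(report_csv):
    links = defaultdict(list)
    with open(report_csv, newline="") as f:
        for row in csv.DictReader(f):                      # assumed columns: irn, filename
            specimen_number = row["filename"].split("_")[0].split(".")[0]
            links[specimen_number].append(row["irn"])
    return links

# Each specimen import row can then carry the IRNs of its master and label-derivative images.
links = multimedia_by_specimen("multimedia_report.csv")
```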
Issues • Barcode non-reads or misreads. • Printing quality of barcodes. • Multiple specimens on one pin. • Conflicting data. • Data that are difficult to interpret. • Specimens with an existing specimen number that is not a barcode. • Specimen records that already exist in EMu.
Progress to Date • Median rates: Preparation 1.15 minutes; Imaging 1.05 minutes; Transcription 0.59 minutes; Total 2.80 minutes. • Totals completed so far: Imaged 97,000 specimens; Transcribed 87,000 specimens. • The import into EMu is beginning.
A Team Effort The project's success is due to a strong team ethic, pulling together museum staff from a wide variety of disciplines.