
Adrian Hine, Natural History Museum, London

The iCollections Project and the Art of Digitisation: Reconciling One Messy Dataset with Another Messy Dataset


Presentation Transcript


  1. The iCollections Project and the Art of Digitisation: Reconciling One Messy Dataset with Another Messy Dataset
      Adrian Hine, Natural History Museum, London

  2. iCollections Background
      • Part of the broader ambition to digitise the NHM collections over the next decade.
      • iCollections began in March 2013 and runs for 3 years, using 7 full-time digitisers plus involvement from IT, collections, imaging facility & data management staff.
      • Digitise the British Lepidoptera Collection (Butterflies & Moths) of approximately 500,000 specimens (5,000 drawers).
      • Why Lepidoptera? Appeal to 3 audiences: general public, amateur entomologists & researchers.
      • Focus of digitisation is on capturing the specimen data, not on the specimen image per se.

  3. Time Series - Butterflies

  4. iCollections Aims
      • Digitise the UK Lepidoptera Collection.
      • A prototype project to investigate workflows for mass digitisation of collections at the NHM.
      • Generate research data: data available to the NHM climate change research team, who are looking at phenology.
      • Generate national occurrence data for the NBN.
      • Opportunity to rehouse specimens in new drawers.
      • Each specimen given a unique specimen number (Data Matrix barcode & human readable).

  5. Data Capture & Processing
      • Data quality is at the heart of the digitisation process. We wish to control the quality of data going into EMu.
      • Didn’t want to simply push large quantities of unqualified data into EMu and have to deal with it at a later stage.
      • Sustainable, consistent, systematic, documented approach to data capture.
      • Every stage of the digitisation process to follow agreed protocols.

  6. Data Capture & Processing
      • Opted for data capture outside EMu:
          • Poor quality data in EMu makes databasing directly into EMu difficult and slow (sites, taxonomy, parties).
          • Build a highly streamlined data entry interface for core data.
          • Build tools to help with data processing & validation. Control data going into EMu (reduce duplication and poor data generation).
          • Generate a harmonised set of georeferenced UK site records (existing ones poor).
      • The significant challenge when generating new data is how to reconcile it with existing data within EMu (taxonomy, sites, parties, specimens).

  7. Digitisation Workflow Overview
      1) Preparation & Imaging (Digitiser)
      2) Record ingestion (Script)
      3) Rapid data entry (transcription) (Digitiser)
      4) Data validation & reconciliation (Specialist)
      5) Georeferencing (Specialist)
      6) Import into EMu (Data Manager)

  8. 1) Preparation

  9. 1) Imaging

  10. 2) Record Ingestion
      • Script uses the application Barcodefiler to search the image for a barcode. If one is found, the script renames the image file with the specimen number.
      • It then creates a stub record in the rapid data capture system (SQL backend) with three core data fields:
          • specimen number (from barcode)
          • drawer number (from folder name)
          • taxon name (from folder name)
      • Using ImageMagick libraries it creates a cropped label derivative image.
      • Imports an image record for the master and derivative.
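The stub-record step can be illustrated with a minimal Python sketch. Everything specific here is an assumption, not the project's actual script: the folder naming pattern `DRAWER__Taxon_name`, the `stub` table layout, the use of SQLite as a stand-in for the SQL backend, and the function names. Barcode decoding is left out; the sketch takes the already-decoded specimen number as input.

```python
import sqlite3
from pathlib import PurePosixPath


def create_schema(db: sqlite3.Connection) -> None:
    """Create a hypothetical stub table holding the three core fields."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS stub ("
        "specimen_number TEXT PRIMARY KEY, drawer_number TEXT, taxon_name TEXT)"
    )


def build_stub(image: PurePosixPath, specimen_number: str):
    """Derive the renamed image path and the stub record for one image.

    Assumes the parent folder is named 'DRAWER__Taxon_name' (hypothetical).
    """
    drawer, _, taxon = image.parent.name.partition("__")
    record = {
        "specimen_number": specimen_number,
        "drawer_number": drawer.strip(),
        "taxon_name": taxon.replace("_", " ").strip(),
    }
    # The image keeps its extension but takes the specimen number as its name.
    new_path = image.with_name(specimen_number + image.suffix)
    return new_path, record


def save_stub(db: sqlite3.Connection, record: dict) -> None:
    """Insert the stub; a duplicate specimen number raises IntegrityError."""
    db.execute(
        "INSERT INTO stub (specimen_number, drawer_number, taxon_name) "
        "VALUES (:specimen_number, :drawer_number, :taxon_name)",
        record,
    )
    db.commit()
```

In a real run the file itself would then be renamed to the returned path (for example with `pathlib.Path.rename`). Making the specimen number the primary key is one way to surface barcode misreads that collide with an existing record.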

  11. 3) Rapid Data Entry (Transcription)

  12. 4) Data Validation/Reconciliation
      • Biggest challenge is how we validate/reconcile data generated from the project with EMu data.
      • Wish to use appropriate records where they exist in EMu and not to create additional duplicates.
      • Data concepts we wish to validate against existing EMu records where possible:
          • Taxonomy (determination)
          • Parties (collectors)
          • Locations (drawers)
      • Data concepts to create as new:
          • Sites

  13. Taxonomy
      • EMu Taxonomy is still a mess! For UK butterflies, 1000s of names: duplicates, erroneous names, different combinations.
      • Did not have the time to clean Taxonomy for UK Lepidoptera. We have to live with the mess!
      • Need taxonomic expertise to validate the iCollections name against the correct concept in EMu.
      • Many aberrational names not present in EMu.
      • Typos and errors when digitisers enter names.
      • Can’t rely on the EMu import algorithms as matching taxon names is too complex.
      • Built a mapping tool to reconcile each captured taxon name with an existing EMu name.
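As an illustration of the kind of help a mapping tool can give, the sketch below proposes candidate EMu names for a captured name and leaves the decision to a specialist. It is not the Taxon Mapper itself: the function name, the status labels and the similarity cutoff are assumptions, and the matching uses Python's standard `difflib` rather than anything taxonomy-aware.

```python
from difflib import get_close_matches


def _key(name: str) -> str:
    """Collapse whitespace and case so trivial variants compare equal."""
    return " ".join(name.split()).lower()


def suggest_taxon_matches(captured, emu_names, n=3, cutoff=0.8):
    """Return candidate EMu names for one captured name.

    status is 'exact' (normalised strings agree), 'review' (near matches
    found, specialist must choose) or 'unmatched' (nothing close enough).
    """
    index = {}
    for name in emu_names:
        # Duplicated names in the existing data share one key.
        index.setdefault(_key(name), []).append(name)

    key = _key(captured)
    if key in index:
        return {"captured": captured, "status": "exact", "candidates": index[key]}

    close = get_close_matches(key, list(index), n=n, cutoff=cutoff)
    candidates = [name for k in close for name in index[k]]
    status = "review" if candidates else "unmatched"
    return {"captured": captured, "status": status, "candidates": candidates}
```

Even an "exact" string match may point at several duplicate records, so every outcome is best treated as a suggestion for the specialist and not as an automatic link.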

  14. Taxon Mapper Tool

  15. Sites
      • Messy data makes databasing directly difficult. Sites has poor quality data: very few records are usable, and the way data have been captured is very inconsistent (diverse data sources).
      • Mapping site variants to a master site record.
      • String Thing
      • Box Hill
      • Box Hill; Surrey
      • Box Hill; Kent
      • Box Hill; Surrey; UK
      • Box Hill; near Dorking
      • Box Hill District
      • In the process of building a tool to help harmonise the site data prior to import.
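One possible first pass at grouping site variants is sketched below. It is an assumption-laden illustration, not the harmonisation tool mentioned on the slide: it only collapses differences in case, spacing and a trailing country token, and the list of country tokens is hypothetical. Variants that differ in substance stay in separate groups for a person to resolve.

```python
# Tokens dropped as redundant for a UK-only collection (an assumption).
COUNTRY_TOKENS = {"uk", "united kingdom"}


def normalise_site(raw: str) -> tuple:
    """Turn 'Box Hill;  Surrey; UK' into ('box hill', 'surrey')."""
    parts = [" ".join(part.split()).lower() for part in raw.split(";")]
    return tuple(p for p in parts if p and p not in COUNTRY_TOKENS)


def group_variants(raw_sites):
    """Group raw site strings that normalise to the same key."""
    groups = {}
    for raw in raw_sites:
        groups.setdefault(normalise_site(raw), []).append(raw)
    return groups
```

With the slide's examples, "Box Hill; Surrey" and "Box Hill; Surrey; UK" fall into one group, while "Box Hill", "Box Hill; Kent" and "Box Hill; near Dorking" each remain separate, since deciding whether they refer to the same master site needs human judgement.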

  16. 5) Georeferencing
      • Attempting to automate where possible, using a list of UK & Irish Ordnance Survey centroids for place names.
      • Where a single match is found, the centroid can be applied automatically.
      • Decoupled from other steps in the workflow, so it can be done independently and won’t be a bottleneck.
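The single-match rule can be sketched as follows. The gazetteer structure, the function names and the place names in the usage example are all hypothetical; the only behaviour taken from the slide is that a centroid is applied automatically when exactly one match exists.

```python
def build_gazetteer(rows):
    """Index (place_name, latitude, longitude) rows by lower-cased name."""
    gazetteer = {}
    for name, lat, lon in rows:
        gazetteer.setdefault(name.strip().lower(), []).append((lat, lon))
    return gazetteer


def auto_georeference(place_name, gazetteer):
    """Return the centroid only when the name matches exactly one entry.

    Zero matches or several matches return None, leaving the site for a
    specialist to georeference by hand.
    """
    matches = gazetteer.get(place_name.strip().lower(), [])
    if len(matches) == 1:
        return matches[0]
    return None
```

A short usage example with placeholder data:

```python
rows = [
    ("Exampleton", 51.0, -0.5),   # placeholder coordinates, not real data
    ("Sampleford", 52.0, -1.0),
    ("Sampleford", 53.5, -2.2),   # same name, second location
]
gaz = build_gazetteer(rows)
auto_georeference("exampleton", gaz)   # one match: centroid returned
auto_georeference("Sampleford", gaz)   # ambiguous: None
```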

  17. 6) Import into EMu
      • Import is a phased approach:
          1) Images. KE have built a backend script to ingest multimedia records server-side to speed up ingestion rates. Much quicker than importing through the client. Uses the batch operations module. Reports out a CSV with the EMu irn & file name identifier.
          2) Specimen record (basic)
          3) Append the Collection Event data
          4) Append georeference data
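A CSV report of irn and file name can serve as a lookup when later phases need to refer back to the image records. The sketch below shows one way to load it; the column names `irn` and `filename` and the function name are assumptions, since the slide does not give the report layout or say how the report is consumed.

```python
import csv
import io


def load_irn_lookup(report_text, name_column="filename", irn_column="irn"):
    """Map each file name identifier in the report to its EMu irn."""
    reader = csv.DictReader(io.StringIO(report_text))
    lookup = {}
    for row in reader:
        lookup[row[name_column].strip()] = int(row[irn_column])
    return lookup
```

Because the images were renamed with the specimen number at ingestion, a lookup of this kind would let a specimen record be paired with its multimedia irn by file name alone.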

  18. Issues
      • Barcode no-reads or misreads.
      • Printing quality of barcodes.
      • Multiple specimens on one pin.
      • Conflicting data.
      • Data difficult to interpret.
      • Specimens with an existing specimen number that is not a barcode.
      • Specimen records that already exist in EMu.

  19. Progress to Date
      • Median rates:
          • Preparation: 1.15 minutes
          • Imaging: 1.05 minutes
          • Transcription: 0.59 minutes
          • Total: 2.80 minutes
      • Total completed so far:
          • Imaged: 97,000 specimens
          • Transcribed: 87,000 specimens
      • Import to EMu beginning.

  20. A Team Effort
      The success is due to the project having a strong team ethic, pulling together museum staff from a wide variety of disciplines.

  21. Questions?
