Data cleansing for Dummies: Google to the rescue!! Dave Smith, Petrology Collections Manager, The Natural History Museum, London.
Architectural wonders • Waterhouse building opened in 1881 • Steel frame and terracotta • Purpose-built for natural history collections
The Museum • 1000 staff • 350 science staff • 72 million specimens (estimated) • Life Sciences: plants, animals, birds, insects • Earth Sciences: minerals & gems, rocks, fossils, meteorites
My role • Geologist by training • Collections Manager for rock collections: 125,000 rocks, 10,000 decorative stones, 37,000 ocean sediments, 16,000 ore specimens • Departmental EMu administrator: registry management, report writing, training & documentation, EMu support & upgrade testing, communication
‘Fingers in lots of pies’ • Have been involved in cross-museum initiatives involving EMu.
Core Information • 89,000 Records (73%) • Identification = 52,100 • Provenance = 64,215 • Acquisition = 38,700 • Storage = 14,300
The Problem • Data sits outside EMu – how to get it in? • Not as easy as it sounds – many barriers… • Notes field used for data whose correct field was uncertain. • Sites data atomised to varying levels, depending on the experience of the digitiser. • Approx. 95% of specimens have a record in EMu with, at a minimum, a registration number. • Once cleaned, how to update records without overwriting enhanced data? • Unfamiliarity with Access • Short time periods for data cleansing.
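One way to think about the "update without overwriting enhanced data" barrier is a fill-only merge keyed on registration number: cleaned values are written only into fields that are still blank. The sketch below is a minimal illustration in plain Python, not EMu's import mechanism; the field names (`reg_number`, `locality`, `rock_name`) and the sample records are hypothetical.

```python
def fill_blanks(existing, cleaned):
    """Return a copy of `existing` with only its empty fields filled from `cleaned`."""
    merged = dict(existing)
    for field, value in cleaned.items():
        current = merged.get(field)
        is_blank = current is None or str(current).strip() == ""
        if is_blank and value not in (None, ""):
            merged[field] = value  # never touches a field that already holds data
    return merged


def merge_by_reg_number(existing_records, cleaned_records, key="reg_number"):
    """Match records on the registration number and fill blanks only."""
    cleaned_by_key = {rec[key]: rec for rec in cleaned_records}
    return [
        fill_blanks(rec, cleaned_by_key.get(rec[key], {}))
        for rec in existing_records
    ]


# Hypothetical sample data: one enhanced field, one blank field.
existing = [{"reg_number": "BM.1234", "locality": "Skye, Scotland", "rock_name": ""}]
cleaned = [{"reg_number": "BM.1234", "locality": "Scotland", "rock_name": "Basalt"}]
result = merge_by_reg_number(existing, cleaned)
```

Here the enhanced locality is left alone and only the blank rock name is filled. Records with no cleaned counterpart pass through unchanged.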
The Solution • Google Refine • OpenRefine (GitHub) • Personal web service • Runs in your browser
Benefits • Intuitive user interface • Powerful editing / data manipulation functions • Can’t make mistakes! Endless undo! • Pick up where you left off • Maintains history • Link to open-data sources to validate your data • Augment your data with free open-data sources.
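As an illustration of the kind of data manipulation meant here, the sketch below groups spelling and word-order variants of the same value using a simplified key-collision "fingerprint", similar in spirit to the clustering OpenRefine offers. It is plain Python written for this example, not OpenRefine's own code, and the locality strings are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key: lowercase, ASCII only, no punctuation, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^\w\s]", "", text)                    # drop punctuation
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Return groups of distinct values that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical locality strings as different digitisers might have typed them.
names = ["Isle of Skye", "Skye, Isle of", "isle of skye ", "Mull"]
clusters = cluster(names)
```

Each group returned is a set of candidate duplicates for a person to review and merge; values with no variants, such as "Mull", are left out.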