Nothing Is Perfect: Error Detection and Data Cleaning. A. Townsend Peterson STOLEN SHAMELESSLY FROM Arthur Chapman … www.gbif.org/prog/digit/data_quality/URL1124374342. Types of Errors in Biodiversity Data. Taxonomic data. Detection of Taxonomic Errors.
Nothing Is Perfect: Error Detection and Data Cleaning A. Townsend Peterson STOLEN SHAMELESSLY FROM Arthur Chapman …
Types of Errors in Biodiversity Data • Taxonomic data
Detection of Taxonomic Errors • Sine qua non – expert checks specimens and associated data • Check names against authority lists • Check names and authorities against authority lists • N.B.: Check out new capabilities for automated detection and extraction of scientific names … http://jbi.nhm.ku.edu
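Checking names against an authority list can be partly automated. The sketch below is only an illustration, not the tool referenced on the slide: the `AUTHORITY` list, the `check_name` function, and the 0.8 similarity cutoff are all assumptions made for this example. It uses Python's standard `difflib` module to suggest the closest authority name for a probable misspelling.

```python
import difflib

# Hypothetical authority list; a real one would come from a curated checklist.
AUTHORITY = ["Crax rubra", "Rauvolfia paraensis", "Penelope purpurascens"]

def check_name(name, authority=AUTHORITY, cutoff=0.8):
    """Return (status, suggestion) for a scientific name.

    status is "match" (found after normalization), "suspect" (close to an
    authority name, probably a misspelling) or "unknown" (nothing similar).
    """
    # Collapse stray whitespace and normalize case: "crax  RUBRA" -> "Crax rubra"
    cleaned = " ".join(name.split()).capitalize()
    if cleaned in authority:
        return ("match", cleaned)
    # Fuzzy comparison; the cutoff is an assumed similarity threshold.
    close = difflib.get_close_matches(cleaned, authority, n=1, cutoff=cutoff)
    if close:
        return ("suspect", close[0])
    return ("unknown", None)

if __name__ == "__main__":
    for candidate in ["Crax rubra", "crax  RUBRA", "Crax ruba", "Homo sapiens"]:
        print(candidate, "->", check_name(candidate))
```

Names reported as "suspect" or "unknown" still need an expert's decision; the fuzzy match only narrows down where to look.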
Spatial Error • Geographic references are invaluable in enabling analysis of biodiversity data, but are also extremely prone to problems
Data Cleaning Procedures • Assemble occurrence points for each species • Eliminate occurrence points one at a time (jackknife), and build a model without each point in turn • Identify points that are (a) included in models only when present in the input data set, or (b) not included in models even when present in the input data set • Flag these points as suspect for further checking
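The jackknife procedure above can be sketched in a few lines. The slides do not say which modeling algorithm was used, so this example substitutes a deliberately simple stand-in: an envelope of mean ± k standard deviations per environmental variable. The functions `envelope`, `predicts`, and `jackknife_flags`, the value k = 2, and the sample data are all hypothetical.

```python
import statistics

def envelope(points, k=2.0):
    """Stand-in 'model': per-variable interval of mean +/- k standard deviations."""
    model = []
    for values in zip(*points):
        mean = statistics.mean(values)
        spread = k * statistics.pstdev(values)
        model.append((mean - spread, mean + spread))
    return model

def predicts(model, point):
    """True if the point falls inside the envelope on every variable."""
    return all(lo <= v <= hi for (lo, hi), v in zip(model, point))

def jackknife_flags(points, k=2.0):
    """Leave each point out in turn and flag the two suspect cases."""
    points = list(points)
    full_model = envelope(points, k)
    flags = {}
    for i, point in enumerate(points):
        reduced_model = envelope(points[:i] + points[i + 1:], k)
        if not predicts(full_model, point):
            flags[i] = "not included even when in the input data"
        elif not predicts(reduced_model, point):
            flags[i] = "included only when in the input data"
    return flags

if __name__ == "__main__":
    # Hypothetical (temperature, precipitation-like) values; the last is odd.
    cluster = [(x, 100.0 + x) for x in (10, 11, 12, 10, 11, 12, 10, 11, 12)]
    print(jackknife_flags(cluster + [(30, 130.0)]))
    print(jackknife_flags(cluster + [(13, 113.0)]))
```

A flag here is a prompt for further checking, not a verdict: a flagged point may be an error, or it may be a genuine record from an unusual site.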
Data Cleaning Test • Distributional data from the Atlas of Mexican Bird Distributions for various species • Select 18 points at random from those available • Add two random points • Simulates a 10% error rate (2 of 20 points) • Apply the data-cleaning procedure to see whether the random points can be identified as ‘erroneous’
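The construction of such a test set can be sketched as follows. How the original test drew its random points is not stated on the slide, so the uniform draw inside a bounding box, the box itself (a rough illustrative rectangle, not a precise outline of Mexico), and the function names `build_test_set` and `error_rate` are assumptions for this example.

```python
import random

def build_test_set(real_points, bounds, n_real=18, n_fake=2, seed=0):
    """Sample real points, add random ones, and label each as (point, is_error)."""
    rng = random.Random(seed)                       # seeded for repeatability
    sample = rng.sample(real_points, n_real)        # 18 genuine records
    (lon_min, lon_max), (lat_min, lat_max) = bounds
    fakes = [(rng.uniform(lon_min, lon_max), rng.uniform(lat_min, lat_max))
             for _ in range(n_fake)]                # 2 simulated errors
    labeled = [(p, False) for p in sample] + [(p, True) for p in fakes]
    rng.shuffle(labeled)                            # hide the errors' positions
    return labeled

def error_rate(labeled):
    """Fraction of records that are simulated errors."""
    return sum(is_error for _, is_error in labeled) / len(labeled)

if __name__ == "__main__":
    # Placeholder coordinates standing in for real occurrence records.
    real = [(-100.0 + 0.1 * i, 18.0 + 0.1 * i) for i in range(40)]
    box = ((-118.0, -86.0), (14.0, 33.0))           # assumed (lon, lat) ranges
    test_set = build_test_set(real, box)
    print(len(test_set), error_rate(test_set))
```

Keeping the `is_error` label alongside each point makes it possible to score the cleaning procedure afterwards by comparing its flags with the known simulated errors.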
Example – Crax rubra • Successfully identified the 2 random points included in the model
Example – Rauvolfia paraensis • Identified one point as an outlier • Proved to be an undescribed species
Error Flagging • Never possible to clean completely; what matters is the signal-to-noise ratio • No substitute for inspection and detailed study by specialists • HOWEVER, we can • Detect records with internal inconsistencies that clearly represent error in some field • Detect records with high probability of including errors owing to unusual characteristics • Flag those records for later checking and correction
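Detecting internal inconsistencies lends itself to simple rule-based checks. The sketch below is illustrative only: the record layout (`lat`, `lon`, `collected`) and the particular rules are assumptions, not a schema or rule set taken from the slides. Each record gets a list of flags for later review rather than being deleted or corrected automatically.

```python
from datetime import date

def flag_record(record, today=None):
    """Return a list of human-readable flags for one occurrence record."""
    today = today or date.today()
    flags = []
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        flags.append("missing coordinates")
    else:
        if not -90 <= lat <= 90:
            flags.append("latitude out of range")
        if not -180 <= lon <= 180:
            flags.append("longitude out of range")
        if lat == 0 and lon == 0:
            # Often a placeholder for unknown coordinates rather than a real site
            flags.append("coordinates are 0,0")
    collected = record.get("collected")
    if collected is not None and collected > today:
        flags.append("collection date in the future")
    return flags

if __name__ == "__main__":
    records = [
        {"lat": 19.4, "lon": -99.1, "collected": date(1985, 6, 2)},
        {"lat": 95.0, "lon": -99.1, "collected": date(1985, 6, 2)},
        {"lat": 0, "lon": 0, "collected": date(2999, 1, 1)},
    ]
    for rec in records:
        print(rec, "->", flag_record(rec))
```

Records with an empty flag list are not thereby proven correct; the checks only catch errors that contradict the record itself.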