
Nothing Is Perfect: Error Detection and Data Cleaning

Nothing Is Perfect: Error Detection and Data Cleaning. A. Townsend Peterson, STOLEN SHAMELESSLY FROM Arthur Chapman … www.gbif.org/prog/digit/data_quality/URL1124374342. Types of Errors in Biodiversity Data. Taxonomic data. Detection of Taxonomic Errors.



Presentation Transcript


  1. Nothing Is Perfect: Error Detection and Data Cleaning A. Townsend Peterson STOLEN SHAMELESSLY FROM Arthur Chapman …

  2. www.gbif.org/prog/digit/data_quality/URL1124374342

  3. Types of Errors in Biodiversity Data • Taxonomic data

  4. Detection of Taxonomic Errors • Sine qua non – expert checks specimens and associated data • Check names against authority lists • Check names and authorities against authority lists • N.B.: Check out new capabilities for automated detection and extraction of scientific names … http://jbi.nhm.ku.edu
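The "check names against authority lists" step on slide 4 can be sketched roughly as follows. This is only a minimal illustration: the `AUTHORITY` set, the `check_name` function, and the 0.85 similarity cutoff are assumptions made for the example, not part of the presentation, and a real authority list would come from a curated taxonomic source.

```python
import difflib

# Hypothetical authority list (assumption); a real one would be far larger
# and maintained by taxonomic specialists.
AUTHORITY = {"Crax rubra", "Rauvolfia paraensis"}


def check_name(name, authority=AUTHORITY, cutoff=0.85):
    """Return (status, suggestion) for a scientific name.

    status is "ok" for an exact match, "suspect" when a close match
    exists (likely misspelling), and "unknown" otherwise.
    """
    # Collapse stray whitespace before comparing.
    cleaned = " ".join(name.split())
    if cleaned in authority:
        return ("ok", cleaned)
    # Fuzzy match against the list to catch likely typos.
    close = difflib.get_close_matches(cleaned, sorted(authority), n=1, cutoff=cutoff)
    if close:
        return ("suspect", close[0])
    return ("unknown", None)
```

A name flagged "suspect" or "unknown" is only a candidate for review; per the slide, the expert check of specimens and associated data remains the sine qua non.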

  5. Spatial Error • Geographic references are invaluable in enabling analysis of biodiversity data, but are also extremely prone to problems

  6. Georeferencing Errors

  7. Georeferencing Error

  8. Collector Itineraries

  9. 100 km

  10. Using Ecological Information

  11. Data Cleaning Procedures • Assemble occurrence points for each species • Eliminate occurrence points one at a time (jackknife), and build a model without each omitted point • Identify points that are either (a) included in models only when they are part of the input data set, or (b) not included in models even when they are part of the input data set • Flag these points as suspect for further checking
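The jackknife procedure on slide 11 can be sketched as below. The slides do not say which modeling algorithm was used, so this example swaps in a deliberately simple stand-in: a per-variable envelope of mean ± k standard deviations over environmental values at each occurrence point. The function names, the envelope model, and `k=2.0` are all assumptions for illustration.

```python
from statistics import mean, stdev


def fit_envelope(points, k=2.0):
    """Stand-in 'model' (assumption): a mean ± k*sd range per variable."""
    columns = list(zip(*points))
    return [(mean(c) - k * stdev(c), mean(c) + k * stdev(c)) for c in columns]


def predicts(envelope, point):
    """True if the point falls inside the envelope on every variable."""
    return all(lo <= v <= hi for (lo, hi), v in zip(envelope, point))


def jackknife_flags(points, k=2.0):
    """Flag suspect points by leaving each one out in turn.

    points: list of tuples of environmental values, one tuple per occurrence.
    Returns {index: reason} for flagged points only.
    """
    full = fit_envelope(points, k)  # model built from all points
    flags = {}
    for i, p in enumerate(points):
        rest = points[:i] + points[i + 1:]  # omit point i
        loo = fit_envelope(rest, k)         # model built without it
        if not predicts(full, p):
            flags[i] = "excluded even when in input"
        elif not predicts(loo, p):
            flags[i] = "included only when in input"
    return flags
```

Both flag types correspond to the two cases listed on the slide; flagged points are candidates for checking, not confirmed errors.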

  12. Data Cleaning Test • Distributional data from the Atlas of Mexican Bird Distributions for various species • Select 18 points at random from those available • Add two random points • Simulates 10% error rate • Use data-cleaning procedure to see if random points could be identified as ‘erroneous’
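The test design on slide 12 (18 real points plus 2 random ones, so 2 / 20 = 10% error) might be set up along these lines. How the random points were actually drawn is not stated in the slides; drawing them uniformly from a bounding box, and the names `make_test_set` and `bbox`, are assumptions for this sketch.

```python
import random


def make_test_set(known_points, bbox, n_real=18, n_fake=2, seed=0):
    """Build a labelled test set of real and random (erroneous) points.

    known_points: list of (lon, lat) tuples for one species.
    bbox: (min_lon, min_lat, max_lon, max_lat) from which random points
          are drawn (assumption; the slides do not specify this).
    Returns a shuffled list of ((lon, lat), is_random) pairs.
    """
    rng = random.Random(seed)
    real = rng.sample(known_points, n_real)  # sample without replacement
    min_lon, min_lat, max_lon, max_lat = bbox
    fake = [
        (rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
        for _ in range(n_fake)
    ]
    records = [(p, False) for p in real] + [(p, True) for p in fake]
    rng.shuffle(records)  # hide the random points among the real ones
    return records
```

Keeping the `is_random` label alongside each point makes it straightforward to score afterwards whether the cleaning procedure picked out the inserted points.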

  13. Example – Crax rubra Successfully identified the 2 random points included in the model

  14. Example – Rauvolfia paraensis Identified one point as outlier. Proved to be an undescribed species

  15. Error Flagging • Never possible to clean completely; what matters is signal-to-noise ratio • No substitute for inspection and detailed study by specialists • HOWEVER, we can: (a) detect records with internal inconsistencies that clearly represent error in some field; (b) detect records with high probability of including errors owing to unusual characteristics; (c) flag those records for later checking and correction
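As one possible illustration of detecting internal inconsistencies, the sketch below flags records whose coordinates are impossible or disagree with the stated country. The record field names (`lat`, `lon`, `country`), the `flag_record` function, and the bounding-box lookup are hypothetical, and the Mexico box used in the usage note is only a rough approximation.

```python
def flag_record(record, country_boxes):
    """Return a list of flags for one occurrence record.

    record: dict with hypothetical keys "lat", "lon", "country".
    country_boxes: {country: (min_lon, min_lat, max_lon, max_lat)}.
    An empty list means no inconsistency was detected, not that the
    record is correct.
    """
    flags = []
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        flags.append("missing coordinates")
        return flags
    # Values outside the valid range are errors in the coordinate fields.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("coordinates out of range")
    # Coordinates that contradict the country field indicate an error
    # in one field or the other.
    box = country_boxes.get(record.get("country"))
    if box is not None:
        min_lon, min_lat, max_lon, max_lat = box
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            flags.append("coordinates outside stated country")
    return flags
```

For example, with `{"Mexico": (-118.5, 14.5, -86.5, 32.8)}` as the lookup, a record at latitude 19.4 and longitude 99.1 (sign dropped from the longitude) would be flagged as outside the stated country and set aside for later checking.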
