
BIS TDWG Conference 28 October 2013, Florence

BIS TDWG Conference 28 October 2013, Florence. Documenting data quality in a global network: the challenge for GBIF. Éamonn Ó Tuama, Andrea Hahn, Markus Döring, Global Biodiversity Information Facility (GBIF). Outline. 1. The GBIF network and the Data Quality challenge.


Presentation Transcript


  1. BIS TDWG Conference 28 October 2013, Florence Documenting data quality in a global network: the challenge for GBIF Éamonn Ó Tuama, Andrea Hahn, Markus Döring Global Biodiversity Information Facility (GBIF)

  2. Outline 1. The GBIF network and the Data Quality challenge 2. Current DQ processes in GBIF Portal 3. DQ and GBIF Nodes 4. Addressing DQ in GBIF work programme 2014-2016

  3. GBIF is … - a connected community - an informatics infrastructure - a window on biodiversity - a tool for science and society http://www.gbif.org/resources/2311

  4. Addressing data quality Meeting the challenge of documenting data quality as the network and volume of data grow …

  5. Current GBIF Network Data Coverage http://tinyurl.com/gbifMap As of August 2013: >405,720,500 indexed records from 10,139 datasets from 493 publishers, spanning a wide range of geospatial, temporal and taxonomic coverages.

  6. DQ processes in GBIF portal • Minimum obligatory metadata • Check geographic values • Check taxonomic values

  7. Packaging metadata with data

  8. Geographic attributes Verbatim data asserted to originate in USA as shared on the network

  9. Geographic attributes • 85% (355/417 mil) georeferenced records • 2.7% (9.4 million) georeferenced with issues • Data following quality check • Coastal regions recognised • Offshore islands recognised

  10. Taxonomic attributes Trochilidae (Hummingbirds) Using verbatim higher classification

  11. Taxonomic attributes • 56% of name usages also found in CoL Trochilidae (Hummingbirds) Classification based on authoritative sources

  12. Authoritative checklists • Fill gaps in the GBIF taxonomic backbone • Increase list of known synonyms • Increase the number of common names known to GBIF

  13. New improved algorithm for GBIF backbone taxonomy • Some taxa (mainly autonyms) do not have stable IDs • Too many accepted species created because of lack of a good database of taxonomic synonyms

  14. Working with Catalogue of Life • Diagram components: GBIF ChecklistBank, DwC-A checklists, GBIF backbone taxonomy, Catalogue of Life, Global Species Databases

  15. Working with Catalogue of Life • Diagram components: GBIF ChecklistBank, DwC-A checklists, GBIF backbone taxonomy, Catalogue of Life, Global Species Databases • First backbone based on CoL feedback loop expected around December 2013 • The first two Global Species Databases (GSDs) have already provided annotations: • Scarabs: World Scarabaeidae Database: 1339 names annotated, 0 rejected names • International Legume Database & Information Service (ILDIS): 8188 names annotated, 6825 rejected names, 541 placed names (added to ILDIS); remaining have syntactical problems (CoL issue, not ILDIS)

  16. Data Quality issues • Non-standardised values • Example: dwc:country (http://rs.tdwg.org/dwc/terms/country) • 29,052 distinct values for country names • Of these, 18,704 (concerning 2.2 mil records) could not be mapped to an ISO country code. • Typical issues: • Variants: 126 different values for “Italy” • Mismappings: taxon names instead of country names • Incorrect level of detail: sub-national units, non-country geographical entities
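
The country-name problem described on this slide can be illustrated with a minimal sketch of variant mapping. The lookup table below is a hypothetical sample and not GBIF's actual mapping; the function names `normalise` and `map_country` are likewise assumptions made for illustration. Values that cannot be mapped (for example a taxon name placed in the country field) come back as `None` so they can be counted and reviewed.

```python
import unicodedata

# Hypothetical sample table: normalised variants -> ISO 3166-1 alpha-2 codes.
# A production table would be far larger (the slide reports 126 variants for "Italy").
COUNTRY_VARIANTS = {
    "italy": "IT",
    "italia": "IT",
    "italie": "IT",
    "italien": "IT",
    "denmark": "DK",
    "danmark": "DK",
    "usa": "US",
    "united states": "US",
    "united states of america": "US",
}


def normalise(value):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def map_country(verbatim):
    """Return an ISO code, or None when the verbatim value cannot be mapped."""
    if verbatim is None:
        return None
    return COUNTRY_VARIANTS.get(normalise(verbatim))


if __name__ == "__main__":
    for raw in ["  ITALIA ", "Itália", "Danmark", "Trochilidae"]:
        print(repr(raw), "->", map_country(raw))
```

Returning `None` rather than guessing keeps mismappings and sub-national units visible as unmapped values instead of silently assigning them to a country.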

  17. Data Quality issues • Non-standardised values • Example: dwc:basisOfRecord (http://rs.tdwg.org/dwc/terms/basisOfRecord) • 625 values that cannot be interpreted at all (accounting for 13.3 mil records) • Typical issues: • Spelling variants / language variants • Mismappings • Misunderstanding definition • 30 mil records with no value or “unknown” • Interpretable values quite varied, e.g. 31 values mapped to “observation”, 146 to “specimen”
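
As a rough sketch of how verbatim basisOfRecord values might be sorted into the groups this slide reports (interpretable, missing or “unknown”, and uninterpretable), the snippet below uses a small synonym table. The table entries, the category labels and the function names are assumptions for illustration only, not GBIF's actual interpretation rules.

```python
from collections import Counter

# Hypothetical synonym table: normalised verbatim value -> interpreted category.
SYNONYMS = {
    "observation": "observation",
    "humanobservation": "observation",
    "obs": "observation",
    "observacion": "observation",
    "specimen": "specimen",
    "preservedspecimen": "specimen",
    "especimen": "specimen",
}


def interpret(value):
    """Map one verbatim basisOfRecord value to a category."""
    # Keep only letters and digits so "Preserved Specimen" matches "preservedspecimen".
    key = "".join(ch for ch in (value or "").lower() if ch.isalnum())
    if key in ("", "unknown"):
        return "unknown"
    return SYNONYMS.get(key, "uninterpretable")


def summarise(values):
    """Count how many records fall into each category."""
    return Counter(interpret(v) for v in values)


if __name__ == "__main__":
    sample = ["Preserved Specimen", "obs", "Observation", "", "Unknown", "Trochilidae"]
    print(summarise(sample))
```

Keeping "unknown" separate from "uninterpretable" mirrors the distinction on the slide between records with no usable value and records whose value is present but cannot be understood.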

  18. DQ and GBIF Nodes • Desirable improvements • Better metadata • Persistent IDs • Controlled vocabularies • Annotations • Independently validated datasets • Genetic validation of taxonomy

  19. DQ and GBIF Nodes • Implementing improvements • Collate experiences of all Nodes and share best practices • Build reusable DQ components (e.g., tools, vocabularies, workflows)

  20. DQ and GBIF Nodes • Next steps • Expand Data Quality Interest Group • Establish a collaboration platform

  21. Addressing Data Quality in GBIF Work Programme 2014-2016

  22. GBIF Work Programme 2014-2016 Essential Infrastructure to support Data Quality • Ensure stable identifiers for datasets and records • Provide a method for citation of data sets • Enable annotation of data

  23. GBIF Work Programme 2014-2016 Engagement of expert communities to form fitness-for-use working groups • enhancements to data standards and classes of data in use in GBIF • criteria and algorithms for evaluating data quality, fitness-for-use, coverage and completeness • content mobilisation priorities (incl. improving already mobilised data) • identification and curation of reference data sets

  24. GBIF Work Programme 2014-2016 Guidelines and supporting tools to assess and improve metadata completeness for all data • Criteria from fitness-for-use working groups • Evaluation and reporting on metadata completeness and quality • Seeking to ensure that the basis of record is clear for each data record
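
One possible way to report metadata completeness is a simple score: the share of required fields that are actually filled in. The sketch below assumes a hypothetical list of required fields; the real criteria are meant to come from the fitness-for-use working groups and are not specified in this presentation.

```python
# Hypothetical required fields; the actual criteria are not defined in the slides.
REQUIRED_FIELDS = ("title", "description", "contact", "licence", "basis_of_record")


def completeness(metadata, required=REQUIRED_FIELDS):
    """Return (score, missing): the fraction of required fields present, and the gaps."""
    missing = [f for f in required if not str(metadata.get(f) or "").strip()]
    score = (len(required) - len(missing)) / len(required)
    return score, missing


if __name__ == "__main__":
    dataset = {"title": "Hummingbird records", "contact": "curator@example.org", "licence": " "}
    score, missing = completeness(dataset)
    print(f"completeness: {score:.0%}, missing: {missing}")
```

Reporting the list of missing fields alongside the score gives publishers something concrete to fix, rather than only a number.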

  25. GBIF Work Programme 2014-2016 GBIF portal upgrades to report data quality and fitness-for-use for each data set and species • Criteria from fitness-for-use working groups • Standards compliance • Metadata completeness • Presence of key data elements • Automated checks for issues and outliers • Endorsements of data publishers and data sets by Nodes, fitness-for-use working groups and other stakeholders

  26. Thank you GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark • www.gbif.org • E-mail: info@gbif.org • Phone: +45 3532 1470 • Fax: +45 3532 1480
