Data Quality Challenges in the Canadensys Network: Examples, Tools, and Solutions

This presentation discusses data quality challenges in the Canadensys network of occurrence records and provides examples, tools, and solutions. It covers data entry, aggregation, and processing, as well as the importance of data usability and the roles of the various actors within the network. It also highlights hopes and expectations for the Canadensys network, including the use of specialized resources and services, shared resources for multilingual data, and improved reporting and coordination.

Presentation Transcript


  1. Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions. Christian Gendreau, David Shorthouse & Peter Desmet

  2. Game plan • Introduction to Canadensys • Data quality @ Canadensys • Canadensys processing solutions • Numbers from Canadensys • Hopes and expectations

  3. A network of people and collections

  4. Canadensys Headquarters: Université de Montréal Biodiversity Centre

  5. data.canadensys.net/vascan

  6. data.canadensys.net/ipt

  7. data.canadensys.net/explorer

  8. Data quality related activities: from an aggregator perspective

  9. During data entry • Help to avoid typographical errors • Help to convert verbatim data • Actor: data entry person
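
As an illustration of what "converting verbatim data" can involve, here is a minimal Python sketch that turns a verbatim degrees/minutes/seconds coordinate into decimal degrees. It is not the code behind the Canadensys tools; the function name `dms_to_decimal` and the accepted input pattern are assumptions made for this example.

```python
import re

# Hypothetical helper, not taken from the Canadensys tools.
# Accepts strings such as 45°30'15"N or 73°34'W (minutes and seconds optional).
_DMS = re.compile(
    r"""^\s*(?P<deg>\d{1,3})\s*°\s*
        (?:(?P<min>\d{1,2})\s*['′]\s*)?
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(text):
    """Return decimal degrees, or None when the text cannot be interpreted."""
    match = _DMS.match(text)
    if match is None:
        return None
    degrees = int(match["deg"])
    minutes = int(match["min"] or 0)
    seconds = float(match["sec"] or 0)
    if minutes >= 60 or seconds >= 60:
        return None
    value = degrees + minutes / 60 + seconds / 3600
    hemisphere = match["hem"].upper()
    # Latitudes are limited to 90 degrees, longitudes to 180 degrees.
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        return None
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value
```

Returning `None` instead of raising keeps the verbatim text untouched when it cannot be parsed, so the data entry person can review it by hand.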

  10. Before publication • Detect file character encoding issues • Detect duplicate or missing IDs • Actor: data publisher • Previous activity: Data entry
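
The two pre-publication checks can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Canadensys implementation: it treats UTF-8 as the expected encoding, and the names `is_valid_utf8` and `check_ids` are hypothetical.

```python
from collections import Counter


def is_valid_utf8(raw: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8 (assumed target encoding)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_ids(ids):
    """Return (sorted duplicate IDs, row indexes whose ID is missing or blank)."""
    cleaned = [(value.strip() if value else "") for value in ids]
    missing = [index for index, value in enumerate(cleaned) if not value]
    counts = Counter(value for value in cleaned if value)
    duplicates = sorted(value for value, count in counts.items() if count > 1)
    return duplicates, missing
```

A clean UTF-8 decode only shows that the bytes are consistent with UTF-8. It cannot prove that the text was not mangled earlier, so a check like this is a first filter rather than a guarantee.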

  11. During aggregation • Process data: validation, cleaning • Produce structured reports: quality control • Actor: data aggregator • Previous activity: Before publication

  12. After aggregation • Allow and facilitate community feedback • Help data publisher to integrate corrections • Actor: users and community • Previous activity: Aggregation

  13. Canadensys tools during data entry: data.canadensys.net/tools

  14. Why do we process data? • Enrich our Explorer, http://data.canadensys.net • Provide structured reports to data providers • Help identify records that need re-examination • Help to improve data entry procedures

  15. Data processing

  16. Processing solutions: Narwhals to the rescue (Narwhal image: Public Domain)

  17. The narwhal-processor approach • Single field processing to allow complex processing (combined fields) • Processors with a common interface ease integration and usage • Collaboration: https://github.com/Canadensys/narwhal-processor
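
The idea of single-field processors behind a common interface, combined afterwards for multi-field checks, can be sketched as follows. This is Python written for illustration only and does not reproduce the narwhal-processor API; `Result`, `int_in_range`, `process_record`, and `combine_date` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Optional


@dataclass
class Result:
    """Outcome of processing one field: a cleaned value or an error message."""
    value: Optional[object] = None
    error: Optional[str] = None


# The shared interface: every processor takes raw text and returns a Result.
Processor = Callable[[str], Result]


def int_in_range(low: int, high: int) -> Processor:
    """Build a single-field processor that parses an integer within bounds."""
    def process(raw: str) -> Result:
        try:
            number = int(raw.strip())
        except ValueError:
            return Result(error=f"not an integer: {raw!r}")
        if not low <= number <= high:
            return Result(error=f"{number} outside {low}..{high}")
        return Result(value=number)
    return process


FIELD_PROCESSORS: Dict[str, Processor] = {
    "year": int_in_range(1, 9999),
    "month": int_in_range(1, 12),
    "day": int_in_range(1, 31),
}


def process_record(record: Dict[str, str]) -> Dict[str, Result]:
    """Run each single-field processor independently on its own field."""
    return {name: proc(record.get(name, "")) for name, proc in FIELD_PROCESSORS.items()}


def combine_date(results: Dict[str, Result]) -> Result:
    """Combined-field step: the three parts must also form a real calendar date."""
    parts = [results[name] for name in ("year", "month", "day")]
    if any(part.error for part in parts):
        return Result(error="one or more date fields failed")
    try:
        return Result(value=date(*(part.value for part in parts)))
    except ValueError as exc:
        return Result(error=str(exc))
```

Because every processor returns the same `Result` shape, a caller can build a structured report by iterating over the results without knowing which processor produced them. A record with day 30 in February passes each single-field check but fails the combined check.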

  18. Data usability before processing

  19. Data usability after processing • 7% of provided country text

  20. Data usability after processing • 7% of provided country text • 16% of provided state/province text

  21. Data usability after processing • 7% of provided country text • 16% of provided state/province text • 4% of provided coordinates

  22. Data usability after processing • 7% of provided country text • 16% of provided state/province text • 4% of provided coordinates • 42% of provided dates

  23. Data usability including processed data

  24. Projects with data quality tools • Atlas of Living Australia • GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL … • GBIF libraries • Most nodes have their own data quality routine

  25. Hopes and expectations

  26. We do not want to • Maintain taxonomic authority files • Maintain country, province and city lists

  27. We prefer to • Efficiently use specialized resources/services • Provide reports and quality indices

  28. Help from Semantic Web • Data in other languages (French, Spanish, …) should not be flagged as errors • Misspellings should be shared as a common resource (e.g. SKOS) • Understand historical data (e.g. collected in USSR in 1980)
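
One way to picture a shared vocabulary of translations and misspellings is a lookup modeled on the SKOS label properties: `skos:prefLabel` for the preferred name, `skos:altLabel` for alternatives such as other languages, and `skos:hiddenLabel` for known misspellings. The Python sketch below uses a small hand-written vocabulary; the entries, the dictionary layout, and the function names are assumptions for illustration, not an existing Canadensys or SKOS resource.

```python
import unicodedata

# Hypothetical vocabulary entries, loosely modeled on SKOS label types.
CONCEPTS = [
    {
        "pref": "Canada",
        "alt": {"es": ["Canadá"]},
        "hidden": ["Cananda", "Canda"],  # known misspellings
    },
    {
        "pref": "United States",
        "alt": {"fr": ["États-Unis"], "es": ["Estados Unidos"]},
        "hidden": ["Untied States"],
    },
]


def _key(label: str) -> str:
    """Normalize Unicode form, surrounding whitespace, and case for matching."""
    return unicodedata.normalize("NFC", label).strip().casefold()


def build_index(concepts):
    """Map every known label to (preferred label, label type)."""
    index = {}
    for concept in concepts:
        pref = concept["pref"]
        index.setdefault(_key(pref), (pref, "prefLabel"))
        for labels in concept["alt"].values():
            for label in labels:
                index.setdefault(_key(label), (pref, "altLabel"))
        for label in concept["hidden"]:
            index.setdefault(_key(label), (pref, "hiddenLabel"))
    return index


def resolve(label: str, index):
    """Return (preferred label, label type), or None when the label is unknown."""
    return index.get(_key(label))
```

Keeping the label type in the result lets a report distinguish a valid name in another language (`altLabel`, not an error) from a corrected misspelling (`hiddenLabel`), which is the distinction the slide asks for.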

  29. Reporting and log • Darwin Core annotations for processed data • Shared vocabulary for structured reports and quality indices

  30. Summary • Tools available for sharing • Use, review, contribute • Opportunity for broad coordination and increased efficiencies

  31. Thanks: Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal

  32. Contact: http://www.canadensys.net • http://github.com/Canadensys • @Canadensys • Gulo gulo, Larry Master (www.masterimages.org)