This presentation discusses the data quality challenges in the Canadensys network of occurrence records and provides examples, tools, and solutions. It covers data entry, aggregation, and processing, as well as data usability and the actors involved at each stage within the network. It also highlights the hopes and expectations for the Canadensys network, including the use of specialized resources and services, shared resources for multilingual data, and improved reporting and coordination.
Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions Christian Gendreau, David Shorthouse & Peter Desmet
Game plan • Introduction to Canadensys • Data quality @ Canadensys • Canadensys processing solutions • Numbers from Canadensys • Hopes and expectations
A network of people and collections
Canadensys Headquarters Université de Montréal Biodiversity Centre
Data quality related activities: from an aggregator perspective
During data entry • Help to avoid typographical errors • Help to convert verbatim data Actor: data entry person
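As an illustration of what "converting verbatim data" can mean at data entry, the sketch below turns a verbatim degrees/minutes/seconds coordinate into decimal degrees. It is a minimal example under stated assumptions: the function name `dms_to_decimal` and the accepted input pattern are hypothetical and are not taken from the Canadensys tools.

```python
import re

# Hypothetical helper for illustration; not the Canadensys implementation.
# Accepts strings such as 45°30'00" N or 73°34' W (minutes and seconds optional).
_DMS = re.compile(
    r"""^\s*(?P<deg>\d{1,3})\s*°\s*
        (?:(?P<min>\d{1,2}(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(verbatim):
    """Return decimal degrees, or None when the text cannot be interpreted."""
    match = _DMS.match(verbatim)
    if match is None:
        return None
    degrees = int(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    if minutes >= 60 or seconds >= 60:
        return None
    value = degrees + minutes / 60 + seconds / 3600
    hemisphere = match.group("hem").upper()
    # Latitudes are limited to 90 degrees, longitudes to 180 degrees.
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        return None
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value
```

Returning `None` instead of raising keeps the verbatim value untouched, so the data entry person can decide how to handle text the converter does not understand.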
Before publication • Detect file character encoding issues • Detect duplicate or missing IDs Actor: data publisher Previous activity: Data entry
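The two pre-publication checks above could be sketched as follows. This is only an illustration: it assumes the file is expected to be UTF-8 comma-separated text and that the identifier column is named `occurrenceID` (the Darwin Core term); the function names `check_encoding` and `check_ids` are hypothetical.

```python
import csv
import io
from collections import Counter


def check_encoding(raw):
    """Return True if the raw bytes decode cleanly as UTF-8 (assumed target encoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_ids(text, id_column="occurrenceID"):
    """Report rows with a missing ID and IDs used more than once."""
    reader = csv.DictReader(io.StringIO(text))
    missing_rows = []
    counts = Counter()
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(id_column) or "").strip()
        if not value:
            missing_rows.append(row_number)
        else:
            counts[value] += 1
    duplicate_ids = sorted(v for v, n in counts.items() if n > 1)
    return {"missing_rows": missing_rows, "duplicate_ids": duplicate_ids}
```

A publisher could run both checks on an export before handing it to the aggregator; a file saved as Latin-1 with accented names, for example, would fail `check_encoding`.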
During aggregation • Process data: validation, cleaning • Produce structured reports: quality control Actor: data aggregator Previous activity: Before publication
After aggregation • Allow and facilitate community feedback • Help data publishers to integrate corrections Actor: users and community Previous activity: Aggregation
Canadensys tools during data entry data.canadensys.net/tools
Why do we process data? • Enrich our Explorer, http://data.canadensys.net • Provide structured reports to data providers • Help identify records that need re-examination • Help to improve data entry procedures
Processing solutions: Narwhals to the rescue (Narwhal image: Public Domain)
The narwhal-processor approach • Single field processing to allow complex processing (combined fields) • Processors with a common interface ease integration and usage • Collaboration https://github.com/Canadensys/narwhal-processor
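The idea of single-field processors behind a common interface can be sketched as below. This is not the narwhal-processor API (that library is a separate project with its own classes); every name here (`Processor`, `Result`, `YearProcessor`, `LatitudeProcessor`, `process_record`) is hypothetical and only illustrates the pattern.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    raw: str                     # value as provided
    value: Optional[object]      # processed value, None when processing failed
    error: Optional[str] = None  # short reason for a failure


class Processor(ABC):
    """Common interface: each processor handles exactly one field."""

    @abstractmethod
    def process(self, raw: str) -> Result:
        ...


class YearProcessor(Processor):
    def process(self, raw):
        text = raw.strip()
        if len(text) == 4 and text.isascii() and text.isdigit():
            return Result(raw, int(text))
        return Result(raw, None, "not a 4-digit year")


class LatitudeProcessor(Processor):
    def process(self, raw):
        try:
            value = float(raw)
        except ValueError:
            return Result(raw, None, "not a number")
        if -90 <= value <= 90:
            return Result(raw, value)
        return Result(raw, None, "latitude out of range")


def process_record(record, processors):
    """Apply each field's processor; more complex checks can combine the results."""
    return {name: p.process(record.get(name, "")) for name, p in processors.items()}
```

Because every processor returns the same `Result` shape, a caller can add or swap processors without changing the code that builds reports from the results.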
Data usability after processing • 7% of provided country text • 16% of provided state/province text • 4% of provided coordinates • 42% of provided dates
Projects with data quality tools • Atlas of Living Australia • GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL … • GBIF libraries • Most nodes have their own data quality routine
We do not want to • Maintain taxonomic authority files • Maintain country, province and city lists
We prefer to • Efficiently use specialized resources/services • Provide reports and quality indices
Help from Semantic Web • Data in other languages (French, Spanish, …) should not be flagged as errors • Misspellings should be shared as a common resource (e.g. SKOS) • Understand historical data (e.g. collected in USSR in 1980)
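One way to picture the first two points is a shared vocabulary in which each concept carries a preferred label, alternative labels (valid names in other languages), and hidden labels (known misspellings), mirroring the SKOS `prefLabel`, `altLabel`, and `hiddenLabel` properties. The sketch below is an assumption-laden illustration: the `Concept` class, the functions, and the sample entries are made up for the example and do not come from a published Canadensys resource.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    code: str
    pref_label: str
    alt_labels: set = field(default_factory=set)     # valid names in other languages
    hidden_labels: set = field(default_factory=set)  # known misspellings


# Made-up sample entries for illustration only.
CONCEPTS = [
    Concept("CA", "Canada", {"Kanada", "Canadá"}, {"Cnada", "Canda"}),
    Concept("ES", "Spain", {"España", "Espagne"}, {"Spian"}),
]


def build_index(concepts):
    """Map every known label (case-insensitive) to its concept code and label kind."""
    index = {}
    for concept in concepts:
        index[concept.pref_label.casefold()] = (concept.code, "preferred")
        for label in concept.alt_labels:
            index[label.casefold()] = (concept.code, "alternative")
        for label in concept.hidden_labels:
            index[label.casefold()] = (concept.code, "misspelling")
    return index


def resolve(text, index):
    """Return (code, kind); only 'unknown' values would need to be flagged."""
    return index.get(text.strip().casefold(), (None, "unknown"))
```

With such an index, a Spanish country name resolves as an alternative label rather than an error, and a known misspelling still resolves to the right concept while being reportable to the data publisher.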
Reporting and log • Darwin Core annotations for processed data • Shared vocabulary for structured reports and quality indices
Summary • Tools available for sharing • Use, review, contribute • Opportunity for broad coordination and increased efficiencies
Thanks Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
Contact http://www.canadensys.net http://github.com/Canadensys @Canadensys Gulo gulo, Larry Master (www.masterimages.org)