Data Quality Tools for use in Georeferencing Natural History Location Data
Renato De Giovanni, Arthur Chapman, Robert Hijmans, John Wieczorek, Alexandre Marino, Sidnei de Souza, and other BioGeomancer consortium members
www.biogeomancer.org
BioGeomancer
• Introduction - Web Site
Aim of the presentation
To provide an overview of how BioGeomancer plans to address data quality issues in georeferenced records.
Interaction Between Components (Scenario 1)
[Diagram. Components: BG Controller; Georeference Manager with geoparsing engines (Parser 1, Parser 2); Data Tester Framework with Validation Manager (Test 1, Test 2, Test 3). Data labels: localities, record set (rs), interpretations, georeferenced record set, validated georeferenced record set.]
Interaction Between Components (Scenario 2)
[Diagram. Components: BG Controller; Data Tester Framework with Validation Manager (Test 1, Test 2, Test 3). Data labels: record set (rs), georeferenced record set, validated georeferenced record set.]
Data Tester Framework
Open source Java framework jointly funded by BioGeomancer (Gordon and Betty Moore Foundation) and GBIF. Main features include:
• Configurable: tests can be plugged in, unplugged, and parameterised.
• Extensible: new tests can always be implemented.
• Accepts record sets in different formats from different sources (XML, relational databases, etc.).
Data Tester Framework
• Results come out in the form of “tags” and can be handled programmatically.
• Tags can be “attached” to parts of a record, to an entire record, or to the whole record set.
• Tags have an associated level (the same ones known from log4j: error, warn, info, debug, etc.) and an optional value or message.
• The framework tries to be generic enough to be used in different contexts.
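The tag model and pluggable tests described above could be sketched as follows. This is a minimal illustration only: the slides do not show the DataTester API, so every name here (`TagSketch`, `Level`, `Scope`, `Tag`, `RecordTest`, `LATITUDE_RANGE`) is hypothetical, and a record is assumed to be a simple map of field names to string values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are taken from the DataTester code base.
class TagSketch {
    // Levels mirror the log4j-style levels named on the slide.
    enum Level { ERROR, WARN, INFO, DEBUG }

    // What a tag is attached to: part of a record, a whole record, or the record set.
    enum Scope { FIELD, RECORD, RECORD_SET }

    static final class Tag {
        final Level level;
        final Scope scope;
        final String target;   // field name or record id; null for RECORD_SET
        final String message;  // optional, may be null

        Tag(Level level, Scope scope, String target, String message) {
            this.level = level;
            this.scope = scope;
            this.target = target;
            this.message = message;
        }
    }

    // A pluggable test inspects one record and returns zero or more tags.
    interface RecordTest {
        List<Tag> apply(Map<String, String> record);
    }

    // Example test: latitude must be present, numeric, and within [-90, 90].
    static final RecordTest LATITUDE_RANGE = record -> {
        List<Tag> tags = new ArrayList<>();
        String raw = record.get("latitude");
        if (raw == null) {
            tags.add(new Tag(Level.WARN, Scope.FIELD, "latitude", "missing value"));
            return tags;
        }
        try {
            double lat = Double.parseDouble(raw);
            // Written as a negated range check so that NaN is also rejected.
            if (!(lat >= -90.0 && lat <= 90.0)) {
                tags.add(new Tag(Level.ERROR, Scope.FIELD, "latitude", "out of range: " + raw));
            }
        } catch (NumberFormatException e) {
            tags.add(new Tag(Level.ERROR, Scope.FIELD, "latitude", "not a number: " + raw));
        }
        return tags;
    };

    // Runs every plugged-in test against a record and collects the resulting tags.
    static List<Tag> run(List<RecordTest> tests, Map<String, String> record) {
        List<Tag> all = new ArrayList<>();
        for (RecordTest test : tests) {
            all.addAll(test.apply(record));
        }
        return all;
    }
}
```

Plugging or unplugging a test then amounts to adding it to, or removing it from, the list passed to `run`; callers can filter the returned tags by level, much as they would filter log messages.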
Data Tester Framework
[Figure: tags shown at the ERROR, WARNING, INFO, and DEBUG levels.]
Planned Tests
Error Detection:
• Detect inconsistencies between coordinates and administrative regions.
• Detect inconsistencies between coordinates and elevation.
• Detect collector itinerary inconsistencies.
Outlier Detection:
• Detect statistical outliers in geographic and environmental space.
• Detect outliers in ecological space.
Geographic Error Detection
• Check that the coordinates are consistent with the administrative regions provided by the original records.
• Accept country plus four levels of administrative regions.
• An additional test can be performed to check that the coordinates are consistent with the species habitat (marine vs. terrestrial).
• Will also take uncertainty values into account.
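As a rough illustration of how uncertainty can enter such a check, the sketch below tests a point against a region's bounding box padded by the coordinate uncertainty. It rests on several assumptions that are not in the slides: the region is reduced to a bounding box (a real test would use the region's polygon), the box does not cross the antimeridian, uncertainty is given in metres, and one degree of latitude is approximated as 111,320 m. The class and method names are hypothetical.

```java
// Hypothetical sketch: a bounding-box stand-in for a real administrative-region lookup.
class RegionCheckSketch {
    // Approximate length of one degree of latitude, in metres (an assumption).
    static final double METERS_PER_DEGREE = 111_320.0;

    // Returns true if the point, allowing for its uncertainty radius, could lie
    // inside the bounding box [minLat, maxLat] x [minLon, maxLon].
    static boolean consistent(double lat, double lon, double uncertaintyM,
                              double minLat, double maxLat,
                              double minLon, double maxLon) {
        // Convert the uncertainty radius to degrees of latitude and longitude.
        double padLat = uncertaintyM / METERS_PER_DEGREE;
        // Degrees of longitude shrink with cos(latitude); guard against the poles.
        double cosLat = Math.max(Math.cos(Math.toRadians(lat)), 1e-12);
        double padLon = uncertaintyM / (METERS_PER_DEGREE * cosLat);

        // Padding the box is slightly permissive near its corners.
        return lat >= minLat - padLat && lat <= maxLat + padLat
            && lon >= minLon - padLon && lon <= maxLon + padLon;
    }
}
```

A point 0.1 degrees (about 11 km) outside the box is rejected with a 5 km uncertainty but accepted with a 20 km uncertainty, which is the behaviour the last bullet calls for.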
Elevation Error Detection
• Check that the coordinates correspond to an elevation consistent with the one provided by the original records.
• Uncertainty values will be considered.
[Diagram labels: e-col, e-var, e-map; axes x (longitude), y (latitude), elevation.]
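One possible reading of this check, sketched under stated assumptions: the elevation map is a small in-memory grid (`double[][]`), the coordinate uncertainty has already been converted to a radius in grid cells, and the record is consistent when its elevation interval overlaps the range of map elevations within that radius. None of these names or choices come from the slides.

```java
// Hypothetical sketch: the grid layout and the overlap rule are assumptions.
class ElevationCheckSketch {
    // Returns {min, max} elevation over the cells within radiusCells of (row, col),
    // clipped to the edges of the grid.
    static double[] demRange(double[][] dem, int row, int col, int radiusCells) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        int r0 = Math.max(0, row - radiusCells);
        int r1 = Math.min(dem.length - 1, row + radiusCells);
        for (int r = r0; r <= r1; r++) {
            int c0 = Math.max(0, col - radiusCells);
            int c1 = Math.min(dem[r].length - 1, col + radiusCells);
            for (int c = c0; c <= c1; c++) {
                min = Math.min(min, dem[r][c]);
                max = Math.max(max, dem[r][c]);
            }
        }
        return new double[] {min, max};
    }

    // True if [recorded - uncertainty, recorded + uncertainty] overlaps [range[0], range[1]].
    static boolean consistent(double recordedM, double elevUncertaintyM, double[] range) {
        return recordedM + elevUncertaintyM >= range[0]
            && recordedM - elevUncertaintyM <= range[1];
    }
}
```

With a larger coordinate uncertainty the window covers more cells, so the acceptable elevation range widens and fewer records are flagged.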
Collector Itinerary Error Detection
• Check that all collecting events from each collector are geographically consistent with the original collecting dates.
• The test will need to know the acceptable distance that could be travelled in one day during different historical periods.
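A minimal sketch of such an itinerary check, assuming one collector's events are already sorted by date and that a single `maxKmPerDay` value applies to the whole itinerary (in practice it would vary with the historical period, as noted above). The class, the `Event` type, and the spherical-Earth radius of 6371 km are illustrative assumptions.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flags legs of an itinerary that exceed a daily travel limit.
class ItineraryCheckSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    static final class Event {
        final LocalDate date;
        final double lat;
        final double lon;

        Event(LocalDate date, double lat, double lon) {
            this.date = date;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Great-circle distance between two points using the haversine formula.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // Returns each index i for which the leg from events[i-1] to events[i]
    // is longer than maxKmPerDay times the number of days elapsed.
    static List<Integer> suspectLegs(List<Event> events, double maxKmPerDay) {
        List<Integer> suspects = new ArrayList<>();
        for (int i = 1; i < events.size(); i++) {
            Event a = events.get(i - 1);
            Event b = events.get(i);
            // Same-day events are given one day of travel time.
            long days = Math.max(1, ChronoUnit.DAYS.between(a.date, b.date));
            if (haversineKm(a.lat, a.lon, b.lat, b.lon) > maxKmPerDay * days) {
                suspects.add(i);
            }
        }
        return suspects;
    }
}
```

A flagged leg does not say which of its two events is wrong; either the date or the coordinates of either record may be at fault, so both deserve inspection.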
Geographic Outlier Detection
• Statistical method based on a reverse-jackknifing procedure.
• Will run on latitude and longitude values.
• Separate analysis for each taxon.
Environmental Outlier Detection
• Uses the same reverse-jackknifing statistical method.
• Uses environmental data associated with the localities.
• Separate analysis for each taxon.
Ecological Space Outlier Detection
• Check that the coordinates fall into a suitable region according to a distribution map for the respective species.
• Distribution map repository?
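If a distribution map is available as a polygon, the check reduces to a point-in-polygon test. The sketch below uses the standard `java.awt.geom.Path2D` class and assumes the map is a single ring of longitude/latitude vertices that does not cross the antimeridian; the class and method names are hypothetical, and how such maps would actually be stored is the open question raised above.

```java
import java.awt.geom.Path2D;

// Hypothetical sketch: a distribution map reduced to one polygon in lon/lat space.
class DistributionMapSketch {
    // Builds a closed polygon from parallel arrays of longitudes (x) and latitudes (y).
    static Path2D.Double polygon(double[] lons, double[] lats) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(lons[0], lats[0]);
        for (int i = 1; i < lons.length; i++) {
            path.lineTo(lons[i], lats[i]);
        }
        path.closePath();
        return path;
    }

    // True if the coordinates fall inside the distribution polygon.
    static boolean inRange(Path2D.Double distribution, double lat, double lon) {
        return distribution.contains(lon, lat);
    }
}
```

Planar containment in degree space is an approximation that is reasonable for modest regions away from the poles; points exactly on the boundary should not be relied on either way.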
Important Notes
• These methods can only highlight problems, not solve them.
• The long-term solution is inspection, editing, and annotation of the suspect places at the data provider side.
Current Status
• Data Tester framework ready to be used and publicly available: www.sf.net/projects/gbif (CVS module called “DataTester”).
• Implemented an initial version of a generic outlier test using the reverse-jackknifing procedure.
• Next step is to integrate it with BioGeomancer.
DIVA-GIS www.diva-gis.org