From Data to Uncertainty: Principles of Data Quality
Arthur D. Chapman, Australian Biodiversity Information Services
Ocean Biodiversity Informatics, Hamburg, 29 Nov 2004
Photo: Albatrosses, Kaikoura, New Zealand
The Data Equation
• Oceans of Data (photo: Praia de Forte, Brazil)
• Rivers of Information (photo: Doubtful Sound, New Zealand)
• Streams of Knowledge (photo: Wasatch, Utah, USA)
• Drops of Understanding (Nix 1984)
Taking Data to Information
Diagram labels: Species Data; Environmental Data (Temp Range, Rain Jan, Rain June, GIS Data); Models; Decision Support; Information; Decisions; Policy; Conservation; Management.
Photos: Crab, Florianopolis, Brazil; Armeria maritima, Argentina; Brown Algae, Argentina; Rock Cormorants, Argentina; Algae, New Zealand; Wandering Albatross, NZ; Orca, San Francisco; Corals, Australia.
Why do we need to use models?
• Cont. USA: Population 292 million; Collections 0.5-1 billion; Plants 18,000; Reptiles 350; Mammals 428; Insects ?150,000
• Brazil: Population 172 million; Collections 50 million; Plants 70,000; Reptiles 470-650; Mammals 394; Insects ?1.7 million
• Australia: Population 20 million; Collections 35 million; Plants 20,000; Reptiles 850-890; Mammals 305; Insects ?220,000
From OBIS 2004
The Need for Modelling
• Oceans: Population ~0; Collections ?10 million; Plants ??; Vertebrates ??; Invertebrates ??
What do we mean by ‘Data Quality’?
“An essential or distinguishing characteristic necessary for [spatial] data to be fit for use.” (SDTS 02/92)
“The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data.” (Chrisman, 1991)
Data quality - fitness for use?
• Fitness for use
  • Does species ‘A’ occur in Tasmania?
  • Does species ‘A’ occur in National Park ‘y’?
The Biological Data Domains
Plant and animal specimen data held in museums are a vast information resource, providing not only present-day information on the locations of these entities, but also historic information going back several hundred years (Chapman and Busby 1994).
Errors can occur in any one of these domains.
Loss of data quality
Loss of data quality can occur at many stages:
• At the time of collection
• During digitisation
• During documentation
• During storage and archiving
• During analysis and manipulation
• At time of presentation
• And through the use to which they are put
“Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor.” (Redman 2001)
Principles of data quality
The Vision:
• It is important for organizations to have a vision with respect to having good quality data.
• As well as a vision, an organization needs a policy to implement that vision.
• And a strategy for implementation.
“Experience has shown that treating data as a long-term asset and managing it within a coordinated framework produces considerable savings and ongoing value.” (NLWRA 2003)
The data quality vision
A Vision may involve:
• Not reinventing information management wheels
• Looking for efficiencies in data collection and quality control procedures
• Sharing of data, information and tools
• Using existing standards or developing new, robust standards
• Fostering the development of networks and partnerships
• Presenting a sound business case for data collection and management
• Reducing duplication in data collection and data quality control
• Looking beyond immediate use and examining requirements of users
• Ensuring good documentation and metadata procedures
Strategies
Short term
- Data that can be assembled and checked over a 6-12 month period
Intermediate
- Data that can be entered over about an 18-month period with small investment of resources
- Data that can be checked using simple in-house methods
Long term
- Data that can be entered and/or checked over a longer time frame, using collaborative arrangements
Information management chain
“Assign responsibility for the quality of data to those who create them. If this is not possible, assign responsibility as close to data creation as possible.” (Redman 2001)
From: Chapman 2004
Data Cleaning Principles - 1
“Data ownership and custodianship not only confers rights to manage and control access to data, it confers responsibilities for its management, quality control and maintenance. Custodians also have a moral responsibility to superintend the data for use by future generations.” (Chapman 2004)
• Planning is essential
  • develop a vision, a policy and strategy
• Total Data Quality Management Cycle 1
Data Cleaning Principles - 2
• Organising data improves efficiency
  • Organising data prior to data checking, validation and correction can improve efficiency and considerably reduce the time and costs of data cleaning.
  • For example, by sorting data on location, efficiency gains can be achieved through checking all records pertaining to the one location at the same time, rather than going back and forth to key references.
  • Similarly, by sorting records by collector and date, it is possible to spot errors where a record may be at an unlikely location for that collector on that day.
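The collector-and-date check can be sketched in a few lines of Python. This is only an illustration, not a method from the talk: the field names (`collector`, `date`, `lat`, `lon`), the sample records and the 500 km same-day threshold are all assumptions chosen for the example.

```python
from itertools import groupby
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, adequate for a rough check


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def flag_unlikely_pairs(records, max_km=500.0):
    """Sort by collector and date, then flag consecutive same-day records
    by one collector that lie more than max_km apart (assumed threshold)."""
    key = lambda r: (r["collector"], r["date"])
    flagged = []
    for _, day in groupby(sorted(records, key=key), key=key):
        day = list(day)
        for prev, cur in zip(day, day[1:]):
            km = haversine_km(prev["lat"], prev["lon"], cur["lat"], cur["lon"])
            if km > max_km:
                # Either record may be the wrong one, so report the pair.
                flagged.append((prev["id"], cur["id"], round(km)))
    return flagged


# Hypothetical records: three by one collector on the same day, one by another.
records = [
    {"id": 1, "collector": "A. Smith", "date": "1998-03-02", "lat": -42.88, "lon": 147.33},
    {"id": 2, "collector": "A. Smith", "date": "1998-03-02", "lat": -42.90, "lon": 147.30},
    {"id": 3, "collector": "A. Smith", "date": "1998-03-02", "lat": -12.46, "lon": 130.84},
    {"id": 4, "collector": "B. Jones", "date": "1998-03-02", "lat": -12.46, "lon": 130.84},
]

for a, b, km in flag_unlikely_pairs(records):
    print(f"check records {a} and {b}: {km} km apart on the same day")
```

A flagged pair is only a prompt for a person to look at the label or field notes; the code cannot tell which of the two records is wrong.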
Data Cleaning Principles - 3
• Prevention is better than cure
  • It is far cheaper and more efficient to prevent an error from happening than to have to detect it and correct it later. It is also important that, when errors are detected, feedback mechanisms ensure that the error does not occur again during data entry, or that there is a much lower likelihood of it recurring.
Photo: Asplenium bulbiferum, New Zealand
Data Cleaning Principles - 4
• Responsibility belongs to everyone (collector, custodian and user)
  • The principal responsibility belongs to the data custodian.
  • The collector has a responsibility to respond to the custodian’s questions when the custodian finds errors or ambiguities that may refer back to the original information supplied by the collector. These may relate to ambiguities on the label, errors in the date or location, etc.
  • The user also has a key responsibility to feed back to custodians information on any errors or omissions they may come across, including errors in the documentation associated with the data.
Data Cleaning Principles - 5
Yours is not the only organization that is dealing with data quality.
• Partnerships improve efficiency
  • By developing partnerships, many data validation processes won’t need to be duplicated, errors will more likely be documented and corrected, and new errors won’t be incorporated by inadvertent “correction” of suspect records that are not in error.
• Partnerships with:
  • Data collectors
  • Other institutions with duplicate collections
  • Like-minded institutions developing tools, standards and software
  • Key data brokers (e.g. OBIS, GBIF)
  • Data users (good feedback mechanisms)
  • Statisticians and data auditors
Data Cleaning Principles - 6
• Prioritisation reduces duplication
  • Prioritisation helps reduce costs and improves efficiency. It is often of value to concentrate on those records where lots of data can be cleaned at the lowest cost.
  • For example, those that can be examined using batch processing or automated methods, before working on the more difficult records.
  • By concentrating on those data that are of most value to users, there is also a greater likelihood of errors being detected and corrected.
Photo: Tierra del Fuego, Argentina
Prioritising data quality procedures
“Not all data are created equal, so focus on the most important, and if data cleaning is required, make sure it never has to be repeated.” (Chapman 2004)
• Focus on most critical data first
• Concentrate on discrete units (taxonomic, geographic, etc.)
• Ignore data that are not used or for which data quality cannot be guaranteed
• Consider data that are of broadest value, are of greatest benefit to the majority of users and are of value to the most diverse of uses
• Work on those areas whereby lots of data can be cleaned at the lowest cost (e.g. through use of batch processing)
Data Cleaning Principles - 7
• Set targets and performance measures
  • Performance measures are a valuable addition to quality control procedures.
  • They help an organization manage its data cleaning processes.
  • Performance measures may include statistical checks on the data (for example, 95% of all records are within 1,000 meters of their reported position), or on the level of quality control (for example, 65% of all records have been checked by a qualified taxonomist within the previous 5 years; 90% have been checked by a qualified taxonomist within the previous 10 years).
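Measures like these are easy to compute once each record carries an uncertainty value and a verification date. The sketch below is a hedged illustration: the field names `uncertainty_m` and `last_verified`, the sample records and the helper names are assumptions, and the 95% and 65% targets are simply the example figures quoted on the slide.

```python
from datetime import date


def share(records, predicate):
    """Fraction of records satisfying predicate (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


def performance_report(records, today):
    """Compare two example measures against the example targets."""
    within_1000_m = share(
        records,
        lambda r: r["uncertainty_m"] is not None and r["uncertainty_m"] <= 1000,
    )
    verified_5_years = share(
        records,
        # 5 * 365 days is an approximation of five years.
        lambda r: r["last_verified"] is not None
        and (today - r["last_verified"]).days <= 5 * 365,
    )
    return {
        "within_1000_m": within_1000_m,
        "verified_5_years": verified_5_years,
        "meets_position_target": within_1000_m >= 0.95,
        "meets_verification_target": verified_5_years >= 0.65,
    }


# Hypothetical records for illustration only.
sample = [
    {"uncertainty_m": 100, "last_verified": date(2003, 5, 1)},
    {"uncertainty_m": 500, "last_verified": date(1990, 1, 1)},
    {"uncertainty_m": 5000, "last_verified": date(2001, 6, 30)},
    {"uncertainty_m": 900, "last_verified": None},
]

report = performance_report(sample, today=date(2004, 11, 29))
print(report)
```

Records with no uncertainty value or no verification date count as failing the measure, which keeps undocumented records from inflating the score.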
Data Cleaning Principles - 8
• Minimise duplication and re-working of data
  • Duplication is a major factor with data cleaning in most organizations.
  • Many organizations add the geocode at the same time as they database the record. As records are seldom sorted geographically, this means that the same locations will be chased up a number of times.
  • By carrying out the georeferencing as a special operation, records from similar locations can be sorted, and the appropriate map-sheet then only has to be extracted once.
  • Some institutions also use the database itself to help reduce duplication by searching to see if the location may already have been georeferenced.
Photo: Nothofagus antarctica, Argentina
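One way to picture this is a lookup keyed on a normalised locality string, so that each distinct locality is georeferenced once. The sketch below is an assumption-laden illustration: the normalisation rule, the `known` lookup table and its approximate coordinates, and the record fields are all hypothetical, and real locality matching is considerably fuzzier.

```python
from collections import defaultdict


def normalise(locality):
    """Crude normalisation: lower-case, drop commas, collapse whitespace."""
    return " ".join(locality.lower().replace(",", " ").split())


def plan_georeferencing(records, known):
    """Split records into those whose locality is already georeferenced
    and a to-do list grouped by locality, so each place is looked up once.

    known maps a normalised locality to a (lat, lon) tuple.
    """
    resolved = {}
    todo = defaultdict(list)
    for r in records:
        key = normalise(r["locality"])
        if key in known:
            resolved[r["id"]] = known[key]
        else:
            todo[key].append(r["id"])
    return resolved, dict(todo)


# Hypothetical data; the coordinates are approximate and for illustration only.
known = {"kaikoura new zealand": (-42.40, 173.68)}
records = [
    {"id": 1, "locality": "Doubtful Sound, New Zealand"},
    {"id": 2, "locality": "doubtful sound  new zealand"},
    {"id": 3, "locality": "Kaikoura, New Zealand"},
    {"id": 4, "locality": "Kaikoura,  New Zealand"},
]

resolved, todo = plan_georeferencing(records, known)
print(resolved)
print(todo)
```

Here two records reuse an existing georeference and the remaining two collapse into a single locality to look up.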
Data Cleaning Principles - 9
• Feedback is a two-way street
  • Users of the data will inevitably carry out error detection, and it is important that they feed back the results to the custodians.
  • It is essential that data custodians encourage feedback from users of their data, and take the feedback that they receive seriously.
  • Data custodians also need to feed back information on errors to the collectors and data suppliers where relevant.
  • In this way there is a much higher likelihood that the incidence of future errors will be reduced and the overall data quality improved.
Data Cleaning Principles - 10
• Education and training improve techniques
  • Poor training, especially at the data collection and data entry stages of the Information Quality Chain, is the cause of a large proportion of the errors in primary species data.
  • Good training of data entry operators can reduce the error associated with data entry considerably, reduce data entry costs and improve overall data quality.
Photo: Brown Algae, Argentina
Data Cleaning Principles - 11
• Accountability, transparency and auditability are important
  • Haphazard and unplanned data cleaning exercises are very inefficient and generally unproductive.
  • Within data quality policies and strategies, clear lines of accountability for data cleaning need to be established.
  • To improve the “fitness for use” of the data and thus their quality, data cleaning processes need to be transparent and well documented, with a good audit trail to reduce duplication and to ensure that, once corrected, errors never recur.
Data Cleaning Principles - 12
• Documentation is the key to good data quality
  • Without good documentation, it is difficult for users to determine the fitness for use of the data, and difficult for custodians to know what data quality checks have been carried out and by whom.
  • Documentation is generally of two types.
  • The first is tied to each record and records what data checks have been done, what changes have been made and by whom.
  • The second is the metadata that records information at the dataset level.
  • Both are important, and without them, good data quality is compromised.
Recording Accuracy and Error
• Additional Accuracy Fields
  • Preferably in meters (Point-Radius)
Documenting Validation tests
• Who
• What
• How
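As a minimal sketch of what such a record might look like, the Python below stores a point, its radius of uncertainty in meters, and a log of validation tests with who ran each one, what was tested and how. The class and field names are hypothetical, not taken from any particular database schema or standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValidationTest:
    who: str   # person or process that ran the check
    what: str  # what was tested
    how: str   # method or tool used


@dataclass
class GeocodedRecord:
    record_id: str
    latitude: float       # decimal degrees
    longitude: float      # decimal degrees
    uncertainty_m: float  # point-radius: radius of uncertainty in meters
    tests: List[ValidationTest] = field(default_factory=list)

    def log_test(self, who, what, how):
        """Append a validation test to this record's audit trail."""
        self.tests.append(ValidationTest(who, what, how))


# Hypothetical record and check, for illustration only.
rec = GeocodedRecord("X1", -42.88, 147.33, uncertainty_m=1000.0)
rec.log_test(who="curator", what="geocode vs stated state", how="GIS overlay")
print(rec)
```

Keeping the test log on the record itself means a later user can see both the stated accuracy and the evidence behind it.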
Methods for geocode validation
• Internal Database Checks
• External Database Checks
• Outliers in Geographic Space - GIS
• Outliers in Environmental Space - Models
• Statistical outliers
Photo: Butterfly, Florida, USA
Internal/External Database Checks
• Logical inconsistencies within the database
• Checking one field against another
  • Text location vs geocode or District/State
• Checking one database against another
  • Gazetteers
  • DEM
  • Collectors
Photo: Magellanic Penguin, Argentina
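A field-against-field check of this kind can be sketched as follows: the geocode is tested against a bounding box for the state named in the text fields. The bounding box values are rough illustrative figures, not authoritative boundaries, and the function and field names are assumptions. A production check would use real polygons from a gazetteer or GIS layer.

```python
# Rough (min_lat, max_lat, min_lon, max_lon) boxes; illustrative values only.
STATE_BOXES = {
    "Tasmania": (-43.7, -39.5, 143.8, 148.5),
}


def check_state_consistency(record, boxes=STATE_BOXES):
    """Compare a record's geocode with the state named in its text fields."""
    box = boxes.get(record["state"])
    if box is None:
        return "no reference box"
    min_lat, max_lat, min_lon, max_lon = box

    def inside(lat, lon):
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

    lat, lon = record["lat"], record["lon"]
    if inside(lat, lon):
        return "consistent"
    if inside(-lat, lon):
        # Falls inside the box once the sign is flipped: a likely entry error.
        return "inconsistent: latitude sign may be wrong"
    return "inconsistent"


# Hypothetical records for illustration only.
print(check_state_consistency({"state": "Tasmania", "lat": -42.88, "lon": 147.33}))
print(check_state_consistency({"state": "Tasmania", "lat": 42.88, "lon": 147.33}))
print(check_state_consistency({"state": "Tasmania", "lat": -12.46, "lon": 130.84}))
```

The result is a flag for review rather than a correction, since the text location may be the field that is wrong.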
Error
“Error is inescapable and it should be recognised as a fundamental dimension of data.” (Chrisman 1991)
Geographic outliers - GIS
Country, State, named district, etc.
Gazetteer of Brazilian localities
Diva-GIS - Outlier
• Reverse jack-knifing technique
• Threshold value t = 0.95 √n + 0.2, where n is the number of records
www.diva-gis.org
CRIA - Data Cleaning
http://splink.cria.org.br/dc/
Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.
Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5 percentile.
Environmental Outliers
• Cumulative Frequency Curves
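The idea behind a cumulative frequency check can be sketched without any GIS software: rank the records on one climate variable and flag those whose cumulative frequency falls outside a percentile envelope. The Python below is a simplified stand-in, not the BIOCLIM or Diva-GIS implementation. The 2.5 and 97.5 percentile limits, the mid-rank plotting position and the temperature values are assumptions for illustration.

```python
def percentile_outliers(values, lower=2.5, upper=97.5):
    """Return indices of values whose cumulative frequency (percent)
    lies outside the [lower, upper] envelope for a single variable."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    flagged = []
    for rank, i in enumerate(order, start=1):
        cumulative = 100.0 * (rank - 0.5) / n  # mid-rank plotting position
        if cumulative < lower or cumulative > upper:
            flagged.append(i)
    return sorted(flagged)


# Hypothetical annual mean temperatures (degrees C) at 40 collection sites:
# 38 plausible values plus one very cold and one very hot record.
temperatures = [20.0 + 0.1 * i for i in range(38)] + [5.0, 35.0]

for i in percentile_outliers(temperatures):
    print(f"record {i}: {temperatures[i]} falls outside the percentile envelope")
```

A percentile envelope always places the most extreme records outside it once there are enough records, so a flagged record is a candidate for inspection on the map, not a confirmed geocoding error.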
Errors in data
“Although most data gathering disciplines treat error as an embarrassing issue to be expunged, the error inherent in (spatial) data deserves closer attention and public understanding.” (Chrisman, 1991)
Errors in data - 2
“In general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use.” (Chrisman, 1991)
Photo: Mizodendrum sp., Argentina
Future Challenges
• Improved data quality
• Improved documentation of data
• Improved access to distributed data
• Improved methods for modelling in aquatic (including marine) environments
• Decision Support Systems
• Enlightened Policy / Decision Makers!!!
Thank You… Questions?
Acknowledgements
• Brazilian Biota/FAPESP Virtual Biodiversity Institute Program
• Reference Centre for Environmental Information, Brazil (CRIA)
• Global Biodiversity Information Facility (GBIF)
• UNESCO
• Wesleyan University, Connecticut, USA
• Peabody Museum, Yale University, USA
• ETI, Holland
• UN Food and Agriculture Organization (FAO)
• Environmental Resources Information Network, Australia (ERIN)
• Commission on Data for Science and Technology (CODATA)