1. November 1: Uncertainty, error and sensitivity. Reading: Longley, Ch. 15
2. Data Quality Measurement and Assessment
1. Data Quality
What is quality?
Quality is commonly used to indicate the superiority of a manufactured good or to indicate a high degree of craftsmanship or artistry. We might define it as the degree of excellence in a product, service or performance.
In manufacturing, quality is a desirable goal achieved through management and control of the production process (statistical quality control). (Redman, 1992)
Many of the same issues apply to the quality of databases, since a database is the result of a production process, and the reliability of the process imparts value and utility to the database.
3. Why is there a concern for DQ?
Increased data production by the private sector, where there are no required quality standards. In contrast, production of data by national mapping agencies (e.g., US Geological Survey, British Ordnance Survey) has long been required to conform to national accuracy standards (i.e., mandated quality control).
Increased use of GIS for decision support, such that the implications of using low-quality data are becoming more widespread (including the possibility of litigation if minimum standards of quality are not attained).
Increased reliance on secondary data sources, due to the growth of the Internet, data translators and data transfer standards. Thus, poor-quality data is ever easier to get.
4. Who assesses DQ?
Model 1. Minimum Quality Standards.
This is a form of quality control where DQ assessment is the responsibility of the data producer. It is based on compliance testing strategies to identify databases that meet quality thresholds defined a priori.
An example is NMAS, the National Map Accuracy Standards adopted by the US Geological Survey in 1946.
This approach lacks flexibility; in some cases a particular test may be too lax while in others it may be too restrictive.
5. Model 2. Metadata Standards.
This model views error as inevitable and does not impose a minimum quality standard a priori. Instead, it is the consumer who is responsible for assessing fitness-for-use; the producer’s responsibility is documentation, i.e., “truth-in-labeling.”
An example is SDTS, the Spatial Data Transfer Standard.
This approach is flexible, but there is still no feedback from the consumer, i.e., there is a one-way information flow that inhibits the producer’s ability to correct mistakes.
6. Model 3. Market Standards.
This model uses a two-way information flow to obtain feedback from users on data quality problems. Consumer feedback is processed and analyzed to identify significant problems and prioritize repairs.
An example is Microsoft’s Feedback Wizard, a software utility that lets users email reports of map errors.
This model is useful in a market context in order to ensure that databases match users’ needs and expectations.
7. A better definition of geographical data includes the three dimensions of space, time and theme (where-when-what). These three dimensions are the basis for all geographical observation. (Berry, 1964; Sinton, 1978)
Geographical data quality is likewise defined by space-time-theme. Data quality also comprises several components, such as
accuracy,
precision,
consistency, and
completeness.
8. Accuracy
Accuracy is the inverse of error. Many people equate accuracy with quality, but in fact accuracy is just one component of quality.
The definition of accuracy is based on the entity-attribute-value model:
Entities = real-world phenomena
Attributes = relevant properties
Values = quantitative/qualitative measurements
An error is a discrepancy between the encoded and actual value of a particular attribute for a given entity. “Actual value” implies the existence of an objective, observable reality. However, reality may be:
Unobservable (e.g., historical data)
Impractical to observe (e.g., too costly)
Perceived rather than real (e.g., subjective entities such as “neighborhoods”)
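The entity-attribute-value definition of error above can be made concrete with a minimal sketch. The record type and values below are hypothetical illustrations, not part of the original slides.

```python
from dataclasses import dataclass

@dataclass
class AttributeObservation:
    """One entity-attribute-value record plus an independent reference value."""
    entity: str           # real-world phenomenon, e.g. a survey benchmark
    attribute: str        # relevant property, e.g. "elevation_m"
    encoded_value: float  # value stored in the database
    actual_value: float   # independently observed reference value

    def error(self) -> float:
        """Error = discrepancy between the encoded and actual value."""
        return self.encoded_value - self.actual_value

obs = AttributeObservation("benchmark_17", "elevation_m", 252.4, 251.9)
print(obs.error())  # roughly 0.5 m discrepancy
```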
9. Spatial Accuracy
Spatial accuracy is the accuracy of the spatial component of the database. The metrics used depend on the dimensionality of the entities under consideration.
For points, accuracy is defined in terms of the distance between the encoded location and “actual” location.
Error can be defined in various dimensions: x, y, z, horizontal, vertical, total
Metrics of error are extensions of classical statistical measures (mean error, RMSE or root mean squared error, inference tests, confidence limits, etc.) (American Society of Civil Engineers 1983; American Society of Photogrammetry 1985; Goodchild 1991a).
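A minimal sketch of these point-accuracy metrics, assuming encoded and reference coordinates are available for a set of check points (the coordinate values below are illustrative, not taken from any cited source):

```python
import numpy as np

# Encoded (database) and reference ("actual") coordinates of check points, in metres.
encoded = np.array([[101.2, 200.5], [354.8, 412.1], [620.0, 98.7]])
reference = np.array([[101.0, 200.9], [355.1, 411.8], [619.6, 99.0]])

dx = encoded[:, 0] - reference[:, 0]   # error in x
dy = encoded[:, 1] - reference[:, 1]   # error in y

mean_error_x = dx.mean()               # mean (systematic) error in x
mean_error_y = dy.mean()               # mean (systematic) error in y
rmse_x = np.sqrt(np.mean(dx ** 2))     # RMSE in x
rmse_y = np.sqrt(np.mean(dy ** 2))     # RMSE in y
rmse_horizontal = np.sqrt(np.mean(dx ** 2 + dy ** 2))  # total horizontal RMSE

print(mean_error_x, mean_error_y, rmse_x, rmse_y, rmse_horizontal)
```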
For lines and areas, the situation is more complex. This is because error is a mixture of positional error (error in locating well-defined points along the line) and generalization error (error in the points selected to represent the line) (Goodchild 1991b).
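There is no single standard metric for line error. One possible summary measure (an assumption here, not prescribed by the references above) is the discrete Hausdorff distance between the encoded and reference vertex sets; note how vertex omission inflates the result, reflecting generalization error as well as positional error.

```python
import numpy as np

def discrete_hausdorff(a, b):
    """Discrete Hausdorff distance between two point sets (n x 2 and m x 2 arrays)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Encoded (digitized) and reference vertices of the same line, in metres (illustrative).
encoded_line = np.array([[0.0, 0.0], [50.0, 5.0], [100.0, 0.0]])
reference_line = np.array([[0.0, 0.0], [25.0, 4.0], [50.0, 6.0], [75.0, 4.0], [100.0, 0.0]])

# The encoded line omits intermediate vertices, so the distance mixes
# positional error with generalization error.
print(discrete_hausdorff(encoded_line, reference_line))
```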
10. Precision
Surveyors and others concerned with measuring instruments tend to define precision through the performance of an instrument in making repeated measurements of the same phenomenon.
A measuring instrument is precise according to this definition if it repeatedly gives similar measurements, whether or not these are actually accurate.
So a GPS receiver might make successive measurements of the same elevation, and if these are similar the instrument is said to be precise.
Precision in this case can be measured by the variability among repeated measurements.
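A sketch of precision as variability among repeated measurements; the readings below are made-up GPS elevations of a single point, used only for illustration.

```python
import statistics

# Repeated GPS measurements of the same elevation, in metres (illustrative values).
readings = [251.8, 252.1, 251.9, 252.0, 251.7, 252.2]

precision = statistics.stdev(readings)    # spread of the repeated measurements
mean_reading = statistics.mean(readings)  # best estimate from the instrument

# The instrument can be precise (small standard deviation) yet inaccurate,
# if mean_reading is far from the true elevation.
print(f"mean = {mean_reading:.2f} m, precision (std dev) = {precision:.3f} m")
```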
12. Consistency
Consistency refers to the absence of apparent contradictions in a database. Consistency is a measure of the internal validity of a database, and is assessed using information that is contained within the database.
Consistency can be defined with reference to the three dimensions of geographical data.
Spatial consistency includes topological consistency, or conformance to topological rules, e.g., all one-dimensional objects must intersect at a zero-dimensional object (Kainz, 1995).
Temporal consistency is related to temporal topology, e.g., the constraint that only one event can occur at a given location at a given time (Langran, 1992).
Thematic consistency refers to a lack of contradictions in redundant thematic attributes. For example, attribute values for population, area, and population density must agree for all entities.
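A minimal consistency check for the redundant population/area/density example; the record layout and field names are hypothetical assumptions.

```python
def thematically_consistent(record, tolerance=1e-6):
    """Flag records whose stored density contradicts population / area."""
    implied_density = record["population"] / record["area_km2"]
    return abs(implied_density - record["density_per_km2"]) <= tolerance

tracts = [
    {"id": "A", "population": 12000, "area_km2": 24.0, "density_per_km2": 500.0},
    {"id": "B", "population": 8000,  "area_km2": 10.0, "density_per_km2": 920.0},  # contradiction
]

inconsistent = [t["id"] for t in tracts if not thematically_consistent(t)]
print(inconsistent)  # ['B']
```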
13. Completeness
Completeness refers to a lack of errors of omission in a database. It is assessed relative to the database specification, which defines the desired degree of generalization and abstraction (selective omission).
14. Incompleteness can be measured in space, time or theme (a small audit sketch follows the examples below). Consider a database of buildings in Minnesota that have been placed on the National Register of Historic Places as of the end of 1995.
Spatial incompleteness: The list contains only buildings in Hennepin County (one county in Minnesota, rather than all of Minnesota).
Temporal incompleteness: The list contains only buildings placed on the Register by June 30, 1995.
Thematic incompleteness: The list contains only residential buildings.
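One way to surface these three kinds of omission is to compare simple summaries of the delivered data against the specification. The sketch below assumes each record carries county, listing date, and building-type fields; the field names and spec values are hypothetical.

```python
from datetime import date

# Specification: all Minnesota counties, listings through 1995-12-31, all building types.
SPEC_COUNTIES = {"Hennepin", "Ramsey", "St. Louis"}   # abbreviated list for the example
SPEC_CUTOFF = date(1995, 12, 31)
SPEC_TYPES = {"residential", "commercial", "public"}

records = [
    {"county": "Hennepin", "listed": date(1995, 5, 2),  "type": "residential"},
    {"county": "Hennepin", "listed": date(1995, 6, 30), "type": "residential"},
]

missing_counties = SPEC_COUNTIES - {r["county"] for r in records}   # spatial omission
latest_listing = max(r["listed"] for r in records)                  # temporal coverage
missing_types = SPEC_TYPES - {r["type"] for r in records}           # thematic omission

print(missing_counties)              # counties with no records at all
print(latest_listing < SPEC_CUTOFF)  # True -> possible temporal incompleteness
print(missing_types)                 # building types never represented
```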
15. Managing Uncertainty in GIS
Increasingly, concern is being expressed within the geographic information industry about our inability to deal effectively with uncertainty and to manage the quality of information outputs
16. This concern has resulted from
the requirement in some jurisdictions for mandatory data quality reports when transferring data
the need to protect individual and agency reputations, particularly when geographic information is used to support administrative decisions subject to appeal
the need to safeguard against possible litigation by those who allege to have suffered harm through the use of products that were of insufficient quality to meet their needs
the basic scientific requirement of being able to describe how close our information is to the truth it represents
17. We often forget that traditional hardcopy maps contained valuable forms of accuracy statements such as reliability diagrams and estimates of positional error
Although these descriptors were imperfect, they at least represented an attempt by map makers to convey product limitations to map users; however,
this approach assumed a knowledge on the part of users as to how far the maps could be trusted, and
new users of this information are often unaware of the potential traps that can lie in misuse of their data and the associated technology
The lack of accuracy estimates for digital data has the potential to harm reputations of both individuals and agencies - particularly where administrative decisions are subject to judicial review
18. The era of consumer protection also has an impact upon the issue
While we would not think of purchasing a microwave oven or video recorder without an instruction booklet and a warranty against defects, it is still common for organizations to spend thousands of dollars purchasing geographic data without receiving any quality documentation
Finally, if the collection, manipulation, analysis and presentation of geographic information is to be recognized as a valid field of scientific endeavor, then it is inappropriate that GIS users remain unable to describe how close their information is to the truth it represents
20. A Strategy for Managing Uncertainty
In developing a strategy for managing uncertainty we need to take into consideration the core components of the uncertainty research agenda.
developing formal, rigorous models of uncertainty
understanding how uncertainty propagates through spatial processing and decision making (a propagation sketch follows this list)
communicating uncertainty to different levels of users in more meaningful ways
designing techniques to assess the fitness for use of geographic information and reducing uncertainty to manageable levels for any given application
learning how to make decisions when uncertainty is present in geographic information, i.e. being able to absorb uncertainty and cope with it in our everyday lives
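As a sketch of the propagation item above, the example below propagates positional uncertainty in digitized vertices into uncertainty in a computed polygon area by Monte Carlo simulation; the coordinates and the assumed standard deviation are illustrative, and Monte Carlo is only one of several possible propagation techniques.

```python
import numpy as np

rng = np.random.default_rng(42)

# Digitized polygon vertices (metres) and an assumed positional std dev per coordinate.
vertices = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 60.0], [0.0, 60.0]])
sigma_xy = 2.0  # metres

def shoelace_area(pts):
    """Planar polygon area via the shoelace formula."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Monte Carlo: perturb the vertices, recompute the area, summarize the spread.
areas = np.array([
    shoelace_area(vertices + rng.normal(0.0, sigma_xy, vertices.shape))
    for _ in range(5000)
])

print(f"nominal area = {shoelace_area(vertices):.0f} m^2")
print(f"mean = {areas.mean():.0f} m^2, std dev = {areas.std():.0f} m^2")
```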
22. Ideally, this prior knowledge permits an assessment of the final product quality specifications to be made before a project is undertaken; however, this may have to be decided later, when the level of uncertainty becomes known
Data, software, hardware and spatial processes are combined to provide the necessary information products
Assuming that uncertainty in a product can be detected and modeled, the next consideration is how the various uncertainties may best be communicated to the user
Finally, the user must decide what product quality is acceptable for the application and whether the uncertainty present is appropriate for the given task.
23. There are two choices available here
either reject the product as unsuitable and select uncertainty reduction techniques to create a more accurate product
or, absorb (accept) the uncertainty present and use the product for its intended purpose
24. Determining Product Quality
In determining the significant forms of uncertainty in a product, better accuracy in one area may have to be achieved at the expense of others; for example
it may be acceptable that adjacent objects - such as a road, a railway and a river - are shown in their correct relative positions even though they have been displaced for cartographic enhancement
attribute accuracy or logical consistency may be the dominant considerations, such as in emergency dispatch applications where missing street addresses or non-intersecting road segments can cost lives
25. A well-known case described by Blakemore (1985) provides a good example of the lack of understanding of geographic data accuracy requirements
A file of boundary coordinates for British administrative districts was collected for a government department as a background for its thematic maps
The agency did not require highly accurate positional recording of the boundaries and emphasis was placed more on attribute accuracy
However, when the data became extremely popular because of its extent and convenient format, secondary users soon started to experience problems with point-in-polygon searches after some locations '... seemed suddenly to be 2 or 3 km out into the North Sea'
Clearly, convenience can sometimes override data quality concerns
26. Where the data quality requirements are unknown, it can be a useful exercise to estimate costs for each combination of data collection and conversion technologies and procedures
users often take a pragmatic approach to the 'cost vs. accuracy' issue when forced to do so
Having decided which forms of uncertainty will have significant impact upon the quality of a product, some form of uncertainty assessment and communication must be undertaken
27. Ultimately, however, it is the user who must decide whether the level of uncertainty present in a product is acceptable or not
which implies there is really no such thing as bad data - just data that does not meet a specific need. For example
a road centerline product digitized from source material at a scale of 1:25000 would probably have poor positional accuracy for an urban utility manager, yet may be quite acceptable for a marketing agency planning advertising catchments
similarly, street addresses associated with the road centerline segments would probably need to be error free for an emergency dispatch system, whereas the marketer would probably be content if 80% were correct
So quality relates to the purpose for which the product is intended to be used - the essence of the 'fitness for use' concept
28. Uncertainty Reduction
Uncertainty is reduced by acquiring more information and/or improving the quality of the information available
however, it needs only to be reduced to a level tolerable to the decision maker
Methods for reducing uncertainty include
defining and standardizing technical procedures
improving education and training
collecting more or different data
increasing the spatial/temporal data resolution
field checking of observations
better data processing methods
using better models and improving procedures for model calibration