
Data Quality Management


Presentation Transcript


  1. Data Quality Management Geospatial errors can cause real-life problems! http://www.brownsmarina.com/fun.html CS 128/ES 228 - Lecture 14a

  2. One management strategy …

  3. Murphy’s Law: Ignoring data quality issues usually doesn’t work very well.

  4. Some geospatial goofs

  5. This one’s worse… Mars Climate Orbiter (MCO) was lost on 23 Sep 1999 when it failed to enter orbit around Mars and instead crashed into the planet, destroying the $125 million craft, part of a $328 million mission. http://www.boeing.com/companyoffices/gallery/images/space/d2_mars_climate_orbiter_01.htm The root cause of the failure was a computer program that was supposed to provide its output in newton-seconds (N·s) but instead provided pound-force seconds (lbf·s). http://lamar.colostate.edu/~hillger/unit-mixups.html#mco
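A minimal sketch of how carrying the unit along with the number can catch this kind of mix-up. The function name, the unit labels, and the sample reading are hypothetical illustrations, not the actual MCO software; the only fixed fact is that 1 lbf is defined as exactly 4.4482216152605 N.

```python
# 1 lbf is defined as exactly 4.4482216152605 N, so the same factor
# converts an impulse in lbf*s to N*s.
LBF_S_TO_N_S = 4.4482216152605


def impulse_to_newton_seconds(value, unit):
    """Convert an impulse reading to N*s, refusing unknown unit labels."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown impulse unit: {unit!r}")


reading = 10.0  # hypothetical thruster impulse, reported in lbf*s
as_si = impulse_to_newton_seconds(reading, "lbf*s")
print(f"{reading} lbf*s = {as_si:.2f} N*s")
```

Treating the same number as if it were already in N·s would understate the impulse by a factor of about 4.45, which is why the unit label travels with the value here.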

  6. And these are really bad! Just a 'map error'? The China Daily website carries a cartoon of the damaged US plane at Hainan Island's airbase and asks sarcastically if Sunday's collision "might be due to another map error" - a reference to the US bombing of the Chinese embassy in Belgrade in 1999. "Last time it's due to a map error, and this time another map error? What about the next?" http://news.bbc.co.uk/1/hi/world/monitoring/media_reports/1260185.stm

  7. What is error? • “Error is the physical difference between the real world and the GIS facsimile” - Heywood, Cornelius, & Carver, p. 178 • Errors are impossible to avoid, but can be managed

  8. A Data Management Model • Data acquisition • Data representation & analysis • Data outputs

  9. Data acquisition errors Scientists use the term “error” for two very different concepts: • natural variability • actual mistakes

  10. Take a sidewalk … What’s its width? 1.77, 1.82, 1.69 … meters • “Error” (natural variability): mean width = 1.76 m, range 1.69 - 1.82 m • “Error” (actual mistake): mean = 1.67 ft
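The sidewalk numbers can be checked with a few lines of Python. This is only a sketch: it uses the three widths shown on the slide (the “…” suggests there were more, which are not available here), and the `plausible` helper with its 0.5 m tolerance is a hypothetical illustration of separating natural variability from a probable mistake.

```python
import statistics

FT_TO_M = 0.3048  # exact definition of the international foot

# The three measurements listed on the slide, in meters.
widths_m = [1.77, 1.82, 1.69]

mean_m = statistics.mean(widths_m)
low_m, high_m = min(widths_m), max(widths_m)
print(f"mean = {mean_m:.2f} m, range {low_m:.2f} - {high_m:.2f} m")


def plausible(value_m, low, high, tolerance=0.5):
    """Return True if a value lies within `tolerance` meters of the observed range."""
    return (low - tolerance) <= value_m <= (high + tolerance)


# The slide's "actual mistake": a mean reported as 1.67 ft.
suspect_m = 1.67 * FT_TO_M
print(f"1.67 ft = {suspect_m:.2f} m, plausible: {plausible(suspect_m, low_m, high_m)}")
```

The spread between 1.69 m and 1.82 m is ordinary variability; a value of about half a meter is far enough outside it to suggest a units or transcription mistake.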

  11. Accuracy vs. Precision Figure 10.1, An Introduction to Geographic Information Systems by Heywood, Cornelius, and Carver

  12. Random error vs. Bias

  13. Where does lack of precision come from? • Natural variability • Poor input assumptions • Imprecise equipment • Sloppy measurement • Accumulated error

  14. Random error is often “normal” (figure labels: mean, standard deviation)

  15. 95% of observations fall within ±2 s.d. of the mean (figure labels: mean − 2 s.d., mean, mean + 2 s.d.)
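The “95% within ±2 s.d.” rule of thumb can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and assumes the errors really are normally distributed.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of a normal population lying within 2 standard deviations of the mean.
within_2sd = std_normal.cdf(2.0) - std_normal.cdf(-2.0)

# Number of standard deviations that encloses exactly 95% of the population.
z_95 = std_normal.inv_cdf(0.975)

print(f"within +/-2 s.d.: {within_2sd:.4f}")
print(f"exact 95% multiplier: {z_95:.3f}")
```

Two standard deviations actually cover slightly more than 95%; the exact multiplier is a little under 2, so “±2 s.d.” is a convenient rounding.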

  16. Means have smaller variability than single measurements. S.E. (mean) = standard deviation / √n. If n = 4, √n = ?
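A short sketch of the standard error formula for the n = 4 case. The four widths are hypothetical sample values chosen for illustration; with n = 4, √n = 2, so the standard error of the mean is half the sample standard deviation.

```python
import math
import statistics


def standard_error(samples):
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


widths_m = [1.77, 1.82, 1.69, 1.76]  # hypothetical measurements, n = 4

sd = statistics.stdev(widths_m)
se = standard_error(widths_m)
print(f"s.d. = {sd:.4f} m, S.E. of the mean = {se:.4f} m")
```

Quadrupling the number of measurements halves the uncertainty of the mean, which is why the mean is less variable than any single measurement.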

  17. Where does lack of accuracy come from? • Dubious source data • Incompatible source data (data collected at different times through different methods, possibly in different formats) • Bias

  18. How can we fix it? • Benchmarks, e.g. the National Geodetic Survey maintains a database of survey “monuments” at http://www.ngs.noaa.gov/cgi-bin/datasheet.prl • Otherwise, just measure variability http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/USCGS-E134.jpg/617px-USCGS-E134.jpg

  19. Data representation errors • Transference error • Data storage errors • Analysis errors

  20. Where does transference error come from? • Typos, etc. • Less likely with automated data collection and transformation • Can be prevented through diligence and software “sanity” checks • Format conversion • Many inter-format conversions cause loss/corruption of data/information
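One form a software “sanity” check might take is a range test on coordinates as they are entered. The function name and the sample values below are hypothetical; the limits are simply the valid ranges for latitude and longitude in decimal degrees.

```python
def check_lat_lon(lat, lon):
    """Return a list of problems found in a latitude/longitude pair (decimal degrees)."""
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append(f"latitude {lat} is outside -90..90")
    if not -180.0 <= lon <= 180.0:
        problems.append(f"longitude {lon} is outside -180..180")
    return problems


good = check_lat_lon(42.08, -78.43)   # a plausible point
typo = check_lat_lon(420.8, -78.43)   # decimal point typed in the wrong place

print("good point problems:", good)
print("typo point problems:", typo)
```

A check like this only catches values that are impossible, not values that are merely wrong (a swapped latitude and longitude can still fall inside both ranges), so it complements diligence rather than replacing it.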

  21. Something got lost in the translation • “geographic information systems is an interesting course” • “지리적인 정보 시스템은 재미있는 과정 이다” • “The geography information system is the process which is fun” Thanks to http://babelfish.altavista.com/babelfish/tr

  22. Raster/vector conversions: Aliasing is an intrinsic problem of GISs

  23. Digitization errors

  24. Topology errors Figure 10.5, An Introduction to Geographic Information Systems by Heywood, Cornelius, and Carver

  25. Data storage/retrieval errors • Hardware failure • Hardware limitations

  26. What is a hardware limitation? • Numbers in a computer are stored in a finite number of bits. • Using too few bits can cause round-off error. Box 9.2, Principles of Geographic Information Systems by Burrough and McDonnell
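Round-off from too few bits is easy to demonstrate by squeezing a coordinate through a 32-bit float. This is a sketch using the standard `struct` module; the easting value is a hypothetical coordinate in meters, not taken from the cited textbook box.

```python
import struct


def to_float32(x):
    """Round-trip a Python float (64-bit) through 32-bit IEEE 754 storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


easting_m = 4567890.123  # hypothetical projected coordinate, in meters
stored_m = to_float32(easting_m)
error_m = stored_m - easting_m

print(f"original: {easting_m!r}")
print(f"stored as 32-bit float: {stored_m!r}")
print(f"round-off error: {error_m:.3f} m")
```

A 32-bit float carries a 24-bit significand, so for values between 2^22 and 2^23 (about 4.2 to 8.4 million) representable numbers are 0.5 apart. Millimeter detail in a coordinate of this size cannot survive 32-bit storage, whereas a 64-bit float keeps it.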

  27. Where do errors of data rot come from? • Link rot, e.g. “Not Found: The requested URL /cs/dlevine/ was not found on this server. Apache/1.3.27 Server at www.xxx.edu Port 80” • Poor “style”, e.g. “Employees may appeal to Sr. Carney” as opposed to “Employees may appeal to the President of the University”

  28. Where do errors of analysis come from? How long do you have? … • Mistaken queries • Analyzing layers with different datums or coordinate systems • Comparing attributes with incompatible units

  29. More errors of analysis … • Inappropriate resolution • Combining rasters/vectors with different resolutions • Using exact/abrupt surface fits when approx./gradual is appropriate (or vice versa)

  30. Data output errors • Maps • Reports

  31. Junket at taxpayers’ expense? Did a politician misuse federal funds to visit Alaska on the way to official business in Japan? Muehrcke, Map Use, 2nd ed., p. 395

  32. No - Intentional map error* *More like lying with maps! Muehrcke, Map Use, 2nd ed., p. 395

  33. Should maps be as accurate as possible? • Map simplification • Features are omitted • Area features become lines or points • Exaggeration • Features’ apparent size is “increased” (e.g. hydrants) • Features’ separation is increased on the map for visibility Must Mapquest be accurate?

  34. Reporting significance of findings • Hypothesis testing • What does the term “significant” mean to scientists?

  35. Are two means really different? These two normal distributions have a very large overlap. The means of the two populations are not significantly different, because the overlap is > 5% of the area under the curves. t would be very small. http://www.steve.gb.com/science/statistics.html#t

  36. What about these two means? http://www.steve.gb.com/science/statistics.html#t

  37. These means are also significantly different - why? http://www.steve.gb.com/science/statistics.html#t

  38. How do we actually test for statistical differences? Student’s t-test: t = (difference in means) / (measure of variability)
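A minimal sketch of the t statistic as “difference in means over a measure of variability”. The slide does not say which measure of variability is used; this example assumes the unequal-variance (Welch) standard error, and all three samples are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """t statistic: difference in sample means divided by its standard error.

    Assumes the Welch form, sqrt(var_a / n_a + var_b / n_b), as the
    'measure of variability'.
    """
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


site_a = [1.77, 1.82, 1.69, 1.76]  # hypothetical widths, in meters
site_b = [1.75, 1.80, 1.70, 1.78]  # heavily overlapping with site_a
site_c = [2.40, 2.45, 2.38, 2.43]  # clearly separated from site_a

t_ab = welch_t(site_a, site_b)
t_ac = welch_t(site_a, site_c)
print(f"overlapping samples: t = {t_ab:.2f}")
print(f"separated samples:   t = {t_ac:.2f}")
```

Heavily overlapping samples give a t near zero, while well-separated samples with small spread give a large |t|. Turning t into a significance statement still requires comparing it with the t distribution for the appropriate degrees of freedom, which this sketch does not do.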

  39. Three Commandments of Data Reporting • Thou Shalt Not … • Report insignificant digits (or omit significant trailing zeros) • Report means without also reporting sample sizes and variability • Report results as “significant” (or even worth talking about) without doing the appropriate statistical tests.

  40. How do we minimize (NOT avoid) error? “CONSTANT VIGILANCE” http://news.bbc.co.uk/1/shared/spl/hi/pop_ups/05/entertainment_goblet_of_fire/html/3.stm -- “Mad Eye” Moody, Defense Against the Dark Arts Instructor, Hogwarts School of Witchcraft and Wizardry
