
The identification of exceptional values in the ESPON database

This seminar discusses the identification of exceptional values in the ESPON database through two case studies: detecting logical input errors and statistical outliers.


Presentation Transcript


  1. The identification of exceptional values in the ESPON database
     Paul Harris, Martin Charlton
     National Centre for Geocomputation, NUIM, Maynooth, Ireland
     Madrid seminar, 10/6/10

  2. Outline
     • ESPON DB data
     • Identifying exceptional values
     • Case study 1 (detecting logical input errors)
     • Case study 2 (detecting statistical outliers)
     • Next things to do…

  3. Section 1: ESPON DB data
     • Socio-economic, land cover, …
     • Continuous, categorical, nominal, ordinal, …
     • Spatial support:
       • Area units: NUTS 0/1/2/23/3 (whose boundaries may also change over time)
     • Temporal support:
       • Commonly, yearly units (with only a short time series)

  4. Section 2: Identifying exceptional values
     • Define two types:
       • Logical input errors (e.g. a negative unemployment rate)
       • Statistical outliers (e.g. an unusually high unemployment rate)
     • Two-stage identification algorithm:
       • Stage 1: identify input errors via mechanical techniques
       • Stage 2: identify outliers via statistical techniques

  5. Stage 1: Identify logical input errors

  6. Logical input errors…
     • Usually detected using some logical, mathematical approach
     • Statistical detection may also help…
     • Typical input errors:
       • Impossible values (e.g. negatives, fractions…)
       • Repeated data for different variables
       • Data displaced between or within columns
       • Data swapped between or within columns
       • Wrong NUTS code or name
       • Wrong NUTS regions used (e.g. for 1999 instead of 2006)
       • Missing value code (e.g. 9999 treated as a true value)
       • Etc.
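A few of the checks listed on this slide can be sketched as simple rules. This is only an illustration, not the authors' implementation: the record layout, the function name `flag_input_errors`, the upper bound of 100% and all sample values are assumptions made for the example; only the 9999 missing-value code and the kinds of error come from the slide.

```python
MISSING_CODE = 9999  # missing-value code mentioned on the slide


def flag_input_errors(records, known_codes):
    """Return {region code: [reasons]} for records failing a logical check.

    `records` is a hypothetical list of dicts with a NUTS code and an
    unemployment rate in percent; `known_codes` is the set of valid codes.
    """
    flags = {}
    for rec in records:
        reasons = []
        if rec["code"] not in known_codes:
            reasons.append("unknown NUTS code")
        rate = rec["unemployment_rate"]
        if rate == MISSING_CODE:
            reasons.append("missing-value code treated as data")
        elif rate < 0:
            reasons.append("negative rate")
        elif rate > 100:  # assumed bound: a percentage cannot exceed 100
            reasons.append("rate above 100%")
        if reasons:
            flags[rec["code"]] = reasons
    return flags


# Invented sample records, for illustration only.
records = [
    {"code": "ES300", "unemployment_rate": 6.3},
    {"code": "ITC11", "unemployment_rate": -4.1},
    {"code": "IE021", "unemployment_rate": 9999},
    {"code": "XX999", "unemployment_rate": 5.0},
]
known_codes = {"ES300", "ITC11", "IE021"}
flags = flag_input_errors(records, known_codes)
```

Rules like these only flag candidates; as the next slide says, correcting them usually means consulting an expert on the data.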

  7. Our approach…
     • Detect input errors mathematically (& statistically)
     • Flag observations if they are likely input errors
     • If possible, correct them
     • More likely, consult an expert on the data
     • Once happy, go to stage 2 and assume the data is error-free

  8. Stage 2: Identify statistical outliers

  9. Types of outliers…

  10. Our approach…
      • There is no single ‘best’ outlier detection technique, so…
      • Apply a representative selection of outlier detection techniques (which are simple & robust)
      • Flag an observation if it is a likely outlier according to each technique
      • Build up a weight of evidence for the likelihood of a given observation being statistically outlying
      • Suggest what type of outlier it is likely to be: aspatial, spatial, temporal, relationship, some mixture…
      • Consult an expert on the data to decide on the appropriate course of action
      • Here’s an example using nine techniques & three observations…
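The weight-of-evidence idea can be sketched as a tally of how many techniques flag each observation. This is a minimal sketch under assumptions: the function name, the three technique labels and the observations "A", "B" and "C" are hypothetical, and the slide does not say how the authors actually combine the flags.

```python
def weight_of_evidence(flags_by_technique):
    """Map each observation to the list of techniques that flagged it."""
    evidence = {}
    for technique, flagged in flags_by_technique.items():
        for obs in flagged:
            evidence.setdefault(obs, []).append(technique)
    return evidence


# Hypothetical flags from three techniques for observations A, B and C.
flags_by_technique = {
    "boxplot": {"A", "B"},
    "hawkins": {"A"},
    "mlr_residuals": {"A", "C"},
}
evidence = weight_of_evidence(flags_by_technique)

# Rank observations by number of flags (ties broken alphabetically).
ranked = sorted(evidence, key=lambda obs: (-len(evidence[obs]), obs))
```

Keeping the technique names, rather than only a count, leaves room to suggest the type of outlier: an observation flagged only by spatial techniques is a candidate spatial outlier, for instance.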

  11. Section 3: Case study 1 (detecting logical input errors)
      Data
      • Data at NUTS3 level (1351 observations/regions)
      • Variables:
        • GDP evolution (2000 to 2005) (%)
        • Calculated using 4 other variables
      • 205 logical input errors deliberately introduced:
        • To the NUTS codes & the 4 variables used to calculate GDP evolution only
        • ~ 15% of data infected

  12. Performance results
      • False negatives: 13.2% (e.g. in Italy)
      • False positives: 2.0% (e.g. in Spain)
      • Overall misclassification rate: 3.7%
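The slide does not define the three rates. A common reading, assumed here, is: false negatives as a share of the truly erroneous regions, false positives as a share of the clean regions, and overall misclassification as a share of all regions. The counts below (27 missed errors, 23 wrongly flagged regions) are hypothetical, chosen only because, with the 1351 regions and 205 introduced errors from the previous slide, they round to the reported percentages.

```python
def misclassification_rates(truth, flagged):
    """Compute error rates from parallel lists of booleans.

    truth[i]   -- region i really contains an input error
    flagged[i] -- region i was flagged by the detection stage
    """
    false_neg = sum(t and not f for t, f in zip(truth, flagged))
    false_pos = sum(f and not t for t, f in zip(truth, flagged))
    n_errors = sum(truth)
    n_clean = len(truth) - n_errors
    return {
        "false_negative_rate": false_neg / n_errors,
        "false_positive_rate": false_pos / n_clean,
        "overall": (false_neg + false_pos) / len(truth),
    }


# Assumed breakdown: 205 erroneous regions (178 caught, 27 missed)
# and 1146 clean regions (23 wrongly flagged, 1123 correctly passed).
truth = [True] * 205 + [False] * 1146
flagged = [True] * 178 + [False] * 27 + [True] * 23 + [False] * 1123
rates = misclassification_rates(truth, flagged)
```

Under these assumptions the three rates come out at roughly 13.2%, 2.0% and 3.7%.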

  13. Consequences if we had ignored input errors…

  14. Section 4: Case study 2 (detecting statistical outliers)
      Data
      • Data at NUTS23 level for eight years: 2000-2007
      • For each year, ‘unemployment rate’ calculated as (Unemployed population)/(Active population)
      • 8 variables at each of 790 regions = 6320 obs.
      • Data checked for input errors, i.e. stage 1 done

  15. Presentation of results…
      • For brevity…
      • Let’s say we only need at least one of the 8 time-specific unemployment values in a region to be outlying…
      • (But we can identify outliers by year too)

  16. Results 1: boxplot statistics (aspatial & univariate)
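A boxplot rule flags values that lie far outside the interquartile range. The sketch below uses the conventional fences at 1.5 times the IQR; that multiplier, the quartile method and the sample rates are assumptions, since the slide does not state which boxplot statistics were used.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Invented unemployment rates (%) for eight regions in a single year.
rates = [4.1, 5.0, 5.6, 6.2, 6.8, 7.3, 8.0, 24.5]
outlier_indices = boxplot_outliers(rates)
```

The rule is aspatial and univariate: it ignores where a region is and how its rate relates to other variables, which is why the later techniques add spatial, temporal and relationship-based views.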

  17. Results 2: Hawkins’ test (spatial & univariate)

  18. Results 3: time series statistics (temporal & univariate)

  19. Results 4: MLR residuals (aspatial linear relationships)
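Residual-based detection fits a regression and flags observations whose residuals are unusually large. The sketch below is deliberately simplified: it uses a single predictor rather than a full multiple linear regression, scales residuals by their sample standard deviation, and uses a cut-off of 2. The function name, the cut-off and the data are all assumptions for illustration, not the authors' settings.

```python
import statistics


def regression_residual_outliers(x, y, threshold=2.0):
    """Flag indices whose scaled least-squares residual exceeds threshold."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    scale = statistics.stdev(residuals)  # simple scaling, not studentised
    return [i for i, r in enumerate(residuals) if abs(r) / scale > threshold]


# Invented data: y follows 2*x + 1 except for the fourth observation.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 5, 7, 19, 11, 13, 15, 17]
relationship_outliers = regression_residual_outliers(x, y)
```

An observation flagged this way need not be extreme on its own; it is unusual relative to the fitted relationship, which is what distinguishes a relationship outlier from a univariate one.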

  20. Results 5: LWR residuals (aspatial nonlinear relationships)

  21. Results 6: GWR residuals (spatial nonlinear relationships)

  22. Results 7: PCA residuals (aspatial linear relationships & model-free)

  23. Results 8: LWPCA residuals (aspatial nonlinear relationships & model-free)

  24. Results 9: GWPCA residuals (spatial nonlinear relationships & model-free)

  25. Summary of results: weight of evidence

  26. Preliminary performance results
      • Infected ~ 5% of the data with ‘outliers’ & repeated the analysis on this ‘infected’ data…
      • False negatives: 10.3%
      • False positives: 34.3%
      • Overall misclassification rate: 26.1%
      • Problems:
        • Difficult to guarantee that our infections actually produce outliers…
        • The data already contains outliers (as shown)

  27. Section 5: Next things to do…
      1. Other ways of performance testing our approach
         • Simulated data with known properties?
         • Statistical theory (or properties)?
      2. Refining each of our nine chosen techniques
         • Robust extensions

  28. Thank You!
