This seminar discusses the identification of exceptional values in the ESPON database through two case studies: detecting logical input errors and statistical outliers.
The identification of exceptional values in the ESPON database
Paul Harris and Martin Charlton
National Centre for Geocomputation, NUIM Maynooth, Ireland
Madrid seminar, 10/6/10
Outline
• ESPON DB data
• Identifying exceptional values
• Case study 1 (detecting logical input errors)
• Case study 2 (detecting statistical outliers)
• Next things to do…
1. ESPON DB data
• Socio-economic, land cover, …
• Continuous, categorical, nominal, ordinal, …
• Spatial support:
  • Area units at NUTS 0/1/2/3 (plus the mixed NUTS 2/3 level)
  • (whose boundaries may also change over time)
• Temporal support:
  • Commonly, yearly units (with only a short time series)
2. Identifying exceptional values
• Define two types:
  • Logical input errors (e.g. a negative unemployment rate)
  • Statistical outliers (e.g. an unusually high unemployment rate)
• Two-stage identification algorithm:
  • Stage 1: identify input errors via mechanical techniques
  • Stage 2: identify outliers via statistical techniques
Stage 1: Identify logical input errors
Logical input errors…
• Usually detected using some logical, mathematical approach
• Statistical detection may also help…
• Typical input errors:
  • Impossible values (e.g. negatives, fractions…)
  • Repeated data for different variables
  • Data displaced between or within columns
  • Data swapped between or within columns
  • Wrong NUTS code or name
  • Wrong NUTS regions used (e.g. for 1999 instead of 2006)
  • Missing-value code treated as a true value (e.g. 9999)
  • Etc.
Our approach…
• Detect input errors mathematically (and statistically)
• Flag observations that are likely input errors
• If possible, correct them
• More likely, consult an expert on the data
• Once happy, go to Stage 2 and assume the data are error-free
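A minimal sketch of the Stage 1 mechanical checks, assuming a simple list-of-dicts table; the field names and the 9999 missing-value code are illustrative assumptions, not the actual ESPON DB schema.

```python
# Sketch of Stage 1: flag likely logical input errors mechanically.
# Field names and the missing-value code (9999) are assumptions.

def flag_input_errors(records, rate_vars, missing_codes=(9999,)):
    """Return {row_index: [reasons]} for rows that look like input errors."""
    flags = {}
    for i, row in enumerate(records):
        reasons = []
        for var in rate_vars:
            value = row.get(var)
            if value is None:
                continue
            if value < 0:                # impossible negative rate
                reasons.append(f"{var}: negative value")
            if value in missing_codes:   # missing code used as real data
                reasons.append(f"{var}: missing-value code treated as data")
        if reasons:
            flags[i] = reasons
    return flags

rows = [
    {"nuts_code": "ES300", "unemp_rate": 6.2},
    {"nuts_code": "ITC11", "unemp_rate": -1.5},   # logical input error
    {"nuts_code": "DE212", "unemp_rate": 9999},   # missing code as value
]
print(flag_input_errors(rows, ["unemp_rate"]))
```

Flagged rows would then go to an expert on the data for correction, per the approach above.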
Stage 2: Identify statistical outliers
Our approach…
• There is no single ‘best’ outlier detection technique, so…
• Apply a representative selection of simple and robust outlier detection techniques
• Flag an observation if it is a likely outlier according to each technique
• Build up a weight of evidence for the likelihood of a given observation being statistically outlying
• Suggest what type of outlier it is likely to be: aspatial, spatial, temporal, relationship, some mixture…
• Consult an expert on the data to decide on the appropriate course of action
• Here’s an example using nine techniques and three observations…
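The weight-of-evidence idea can be sketched as a vote: each technique flags observations independently, and the vote total per observation suggests how likely it is to be a genuine outlier. The three detectors below (z-score, MAD, boxplot fence) are stand-ins for the nine techniques used in the study, not the study's own implementations.

```python
# Weight-of-evidence sketch: each detector casts one vote per observation.
import statistics

def zscore_flags(x, k=3.0):
    m, s = statistics.mean(x), statistics.stdev(x)
    return [abs(v - m) > k * s for v in x]

def mad_flags(x, k=3.0):
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x)
    return [abs(v - med) > k * 1.4826 * mad for v in x]

def fence_flags(x, k=1.5):
    q = statistics.quantiles(x, n=4)          # [Q1, Q2, Q3]
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in x]

def evidence(x):
    votes = [zscore_flags(x), mad_flags(x), fence_flags(x)]
    return [sum(col) for col in zip(*votes)]  # votes per observation

data = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 21.0]   # last value is extreme
print(evidence(data))                         # → [0, 0, 0, 0, 0, 0, 2]
```

Note the last value collects two votes (MAD and fence) but escapes the plain z-score, whose mean and standard deviation it inflates; this masking effect is one reason to prefer a panel of robust techniques over any single one.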
3. Case study 1 (detecting logical input errors): data
• Data at NUTS3 level (1351 observations/regions)
• Variable: GDP evolution, 2000 to 2005 (percentage), calculated using 4 other variables
• 205 logical input errors deliberately introduced, restricted to the NUTS codes and the 4 variables used to calculate GDP evolution
• ~15% of the data infected
Performance results
• False negatives: 13.2% (e.g. in Italy)
• False positives: 2.0% (e.g. in Spain)
• Overall misclassification rate: 3.7%
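As a quick arithmetic check, the three quoted rates are mutually consistent with the case-study setup of 205 introduced errors among 1351 regions, assuming the false-negative rate is taken over true errors and the false-positive rate over clean records:

```python
# Consistency check of the quoted rates (assumed denominators noted above).
n_total, n_errors = 1351, 205
n_clean = n_total - n_errors

false_negatives = round(0.132 * n_errors)   # errors the method missed
false_positives = round(0.020 * n_clean)    # clean rows wrongly flagged

overall = (false_negatives + false_positives) / n_total
print(f"{overall:.1%}")                     # → 3.7%, matching the slide
```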
4. Case study 2 (detecting statistical outliers): data
• Data at NUTS23 level for eight years: 2000-2007
• For each year, an ‘unemployment rate’ is calculated: (Unemployed population)/(Active population)
• 8 variables at each of 790 regions = 6320 observations
• Data checked for input errors, i.e. Stage 1 done
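The rate calculation is straightforward; a minimal sketch with assumed field names follows. Note that a rate outside [0, 100] would already count as a Stage 1 logical input error rather than a statistical outlier.

```python
# Unemployment rate as defined on this slide (expressed as a percentage).
def unemployment_rate(unemployed, active):
    if active <= 0:
        raise ValueError("active population must be positive")
    return 100.0 * unemployed / active

print(unemployment_rate(50_000, 400_000))   # → 12.5 (per cent)
```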
Presentation of results…
• For brevity, let’s say we only need at least one of the 8 time-specific unemployment values in a region to be outlying…
• (But we can identify outliers by year too)
Results: 7. PCA residuals (aspatial linear relationships & model-free)
Results: 8. LWPCA residuals (aspatial nonlinear relationships & model-free)
Results: 9. GWPCA residuals (spatial nonlinear relationships & model-free)
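The PCA-residual idea behind techniques 7-9 can be sketched in two dimensions: fit the first principal component and score each observation by its perpendicular distance from that axis. An observation that breaks the linear relationship gets a large residual even when each of its coordinates looks unremarkable on its own. This pure-Python toy (with made-up data) illustrates only the aspatial linear case of technique 7; LWPCA and GWPCA additionally localise the decomposition.

```python
# PCA residuals in 2-D: distance of each point from the first principal axis.
import math

def pca_residuals(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # 2x2 covariance terms
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # angle of the first principal axis (closed form in 2-D)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # residual = perpendicular distance from the principal axis
    return [abs(-(p[0] - mx) * uy + (p[1] - my) * ux) for p in points]

pts = [(1, 1.1), (2, 2.0), (3, 3.1), (4, 3.9), (5, 5.0), (3, 0.5)]
res = pca_residuals(pts)
print(res.index(max(res)))   # → 5: the point off the y ≈ x trend
```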
Preliminary performance results
• Infected ~5% of the data with ‘outliers’ and repeated the analysis on this ‘infected’ data…
• False negatives: 10.3%
• False positives: 34.3%
• Overall misclassification rate: 26.1%
• Problems:
  • Difficult to guarantee that our infections actually produce outliers…
  • The data already contain outliers (as shown)
5. Next things to do…
• Other ways of performance-testing our approach:
  • Simulated data with known properties?
  • Statistical theory (or properties)?
• Refining each of our nine chosen techniques:
  • Robust extensions