Deliverable 2.8: Outliers

Gary Brown, Office for National Statistics, UK

Outliers = Outlier detection and treatment aspects of combining data (survey/administrative) including options for various hierarchies

Overview
• Introduction
• Definitions
• Identification
• Treatment
• Recommendations

Introduction
• Deliverable 2.8 is led by the UK
• The UK lead has worked in methodology for over 14 years
• Expert in Sample Design and Estimation for Business Surveys
• ... also expert in Small Area Estimation, Quality, Editing and Imputation, Time Series Analysis
• Quality assurance (QA) by Italy

Definitions
• Outliers
• Errors
• Outliers in survey data
• Outliers in administrative data
• Outliers in modelling
• ... two glossaries considered: ONS and OECD

Definitions – outliers
• OECD: “A data value that lies in the tail of the statistical distribution of a set of data values”
• ONS: “A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”
• Question 1: is an outlier extreme (1), influential (2), or both (3)?

Definitions – errors
• Errors are incorrect values identified by edit rules
• OECD definition of an edit rule: “A logical condition or a restriction which must be met if the data is to be considered correct”
• ONS definition of an edit rule: “A rule designed to detect specific errors in data for potential subsequent correction”
• Errors are corrected before outliers are considered
• Question 2: outliers = errors (1) or outliers ≠ errors (2)?

Definitions – survey outliers
• In the survey context, an outlier is an unrepresentative value, i.e. influential
• A unit sampled with probability 1/n is assumed to represent n-1 unsampled units in the population
• If the unit is unique, the assumption is invalid
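The representativeness assumption can be illustrated with a minimal Python sketch. The sample values, the sampling probability of 1/20 and the function name `expansion_estimate` are all hypothetical, chosen only to show how one unique unit carrying a full design weight can dominate a weighted total. Setting its weight to 1 is an illustration of "the unit represents only itself", not a recommended treatment.

```python
def expansion_estimate(values, weights):
    """Weighted (expansion) total: each value stands for `weight` population units."""
    return sum(w * y for y, w in zip(values, weights))


# Hypothetical sample: five units, each sampled with probability 1/20,
# so each carries a design weight of 20 (itself plus 19 unsampled units).
values = [10.0, 12.0, 11.0, 9.0, 500.0]
weights = [20.0] * 5

naive_total = expansion_estimate(values, weights)

# If the unit reporting 500 is in fact unique, it represents only itself,
# so for illustration its weight is set to 1.
adjusted_weights = [20.0, 20.0, 20.0, 20.0, 1.0]
adjusted_total = expansion_estimate(values, adjusted_weights)

print(naive_total)     # 10840.0
print(adjusted_total)  # 1340.0
```

In this toy case the single unit accounts for most of the naive estimate, which is what makes it influential in the survey sense.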

Definitions – administrative outliers
• In the administrative context, an outlier is an atypical value, i.e. extreme
• Administrative data represent a census, so each unit is treated as unique
• No assumptions about representing other units are needed

Definitions – modelling outliers
• In the modelling context, an outlier is an influential value
• ONS definition of influence: “The amount of effect a particular point has on the parameters of a regression equation”
• Influence applies to both processing and statistical modelling

Definitions – modelling outliers
• Processing – editing, e.g. “fail if > 60% of maximum over past 5 years”
• Processing – imputation, e.g. “uplift last return by average growth in domain”
• Statistical modelling
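As a rough illustration of how a single value can influence processing, the Python sketch below applies one literal reading of the two quoted rules. The interpretation of the edit rule (fail when a return exceeds 60% of the largest of the past five values), the function names and all figures are assumptions made for this example, not the rules as implemented in any production system.

```python
from statistics import mean


def edit_fails(value, past_values, fraction=0.6):
    """Assumed reading of the edit rule: fail if value > fraction * max of the last five values."""
    return value > fraction * max(past_values[-5:])


def impute(last_return, domain_growth_rates):
    """Uplift the last return by the average growth rate observed in the domain."""
    return last_return * (1 + mean(domain_growth_rates))


# Hypothetical five-year histories for one unit, with and without an extreme year.
history = [100, 110, 105, 120, 115]
history_with_outlier = [100, 110, 105, 900, 115]

print(edit_fails(200, history))               # True: 200 > 0.6 * 120
print(edit_fails(200, history_with_outlier))  # False: 200 < 0.6 * 900

# Hypothetical growth rates for the units in one domain.
growth = [0.02, 0.03, 0.01, 0.02]
growth_with_outlier = [0.02, 0.03, 0.01, 2.00]

print(impute(100, growth))               # about 102
print(impute(100, growth_with_outlier))  # about 151.5
```

In both cases one extreme entry changes the outcome for other values: it loosens the edit threshold and inflates every imputed return in the domain.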

Identification – units
• A data warehouse stores data once for repeated use
• Each unit will have multiple values (variables/time periods), and whether any value is
   • extreme depends on which other data are used
   • influential depends on what process/model is estimated
• Given repeated use, it is impossible to know in advance how data domains will be defined or which models will be fitted
• Therefore every unit in a data warehouse is a potential outlier
• Question 3: is every unit a potential outlier? Yes (1), no (2), unsure (3)
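The dependence of "extreme" on the comparison data can be sketched in Python as follows. The detection rule (more than k median absolute deviations from the median), the threshold k = 3 and both domains are hypothetical choices made for illustration; the slides do not prescribe a particular rule.

```python
from statistics import median


def is_extreme(value, reference, k=3.0):
    """Flag `value` if it lies more than k median absolute deviations from the median of `reference`."""
    centre = median(reference)
    mad = median([abs(x - centre) for x in reference])
    return abs(value - centre) > k * mad


# The same stored value (95) is compared against two hypothetical domains.
domain_a = [8, 10, 11, 9, 12, 10, 95]       # mostly small values
domain_b = [80, 90, 95, 110, 120, 150, 100]  # mostly large values

print(is_extreme(95, domain_a))  # True
print(is_extreme(95, domain_b))  # False
```

The value itself never changes; only the domain it is compared with does, which is why a flag stored permanently against the unit could mislead later users.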

Identification – uses
• Assuming all units are potential outliers
   • identification becomes use dependent
   • outliers are recorded as part of the metadata of an output
   • outliers are not otherwise recorded in the data warehouse
• Expected data uses and examples of identification methods
   • processing, e.g. comparing observed and expected edit failures
   • updating the business register, e.g. comparing different sources
   • survey (estimating variables and calibration weights), e.g. winsorisation and setting acceptable ranges
   • survey/admin (modelling relationship and estimating survey), e.g. Cook’s distance and winsorisation
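For the survey/admin modelling use, Cook's distance can be computed directly for a simple linear regression. The sketch below is a minimal pure-Python version under stated assumptions: one explanatory variable (a hypothetical administrative value `x`) and one response (a hypothetical survey value `y`), invented data, and the common rule of thumb of flagging points with distance above 4/n. The threshold is a convention, not something taken from the slides.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression of y on x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)            # residual variance
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical admin (x) and survey (y) values; the last unit sits far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.0]

distances = cooks_distances(x, y)
threshold = 4 / len(x)  # rule-of-thumb cut-off, an assumption
flagged = [i for i, d in enumerate(distances) if d > threshold]
print(flagged)  # [7]
```

Only the last unit is flagged here, and the flag belongs to this particular model fit: a different model or domain could flag different units.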

Treatment – units in uses
• Identified outliers need to be treated during use
   • to prevent distortion
   • by adjusting the weight of the unit to P, with 0 < P < 100%
   • balancing reduced variance against increased bias (i.e. in terms of MSE)
• Expected data uses and examples of treatment methods
   • processing, e.g. use medians rather than means
   • updating the business register, e.g. delete one source
   • survey (estimating variables and calibration weights), e.g. winsorisation and restricting to acceptable ranges
   • survey/admin (modelling relationship and estimating survey), e.g. delete from modelling process and winsorisation
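Winsorisation, named above as a treatment, can be sketched in Python using a commonly described one-sided form: a value above a cut-off K is replaced by y/w + (1 - 1/w)K, so the unit keeps its own value for itself and contributes K for the other units it represents. The cut-off, weights and data below are hypothetical, and how K is chosen (usually to minimise estimated MSE) is outside this sketch.

```python
def winsorise(value, weight, cutoff):
    """One-sided winsorisation: pull a value above `cutoff` towards the cut-off.

    The unit represents itself with its reported value and the remaining
    (weight - 1) units with the cut-off value.
    """
    if value <= cutoff:
        return value
    return value / weight + (1 - 1 / weight) * cutoff


# Hypothetical sample with a common design weight of 20 and cut-off of 50.
values = [10.0, 12.0, 11.0, 9.0, 500.0]
weight = 20.0
cutoff = 50.0

treated = [winsorise(v, weight, cutoff) for v in values]
untreated_total = sum(weight * v for v in values)
treated_total = sum(weight * v for v in treated)

print(treated[-1])      # about 72.5
print(untreated_total)  # 10840.0
print(treated_total)    # about 2290
```

The treated total is biased downwards relative to the untreated one but is far less sensitive to whether the extreme unit happens to be sampled, which is the variance and bias trade-off the slide refers to.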

Recommendations
• Neither data units nor their entries in a data warehouse should be labelled as outliers
