Deliverable 2.8: Outliers
Gary Brown, Office for National Statistics, UK
Outliers = Outlier detection and treatment aspects of combining data (survey/administrative) including options for various hierarchies
Overview
• Introduction
• Definitions
• Identification
• Treatment
• Recommendations
Introduction
• Deliverable 2.8 is led by the UK
• The UK lead has worked in methodology for over 14 years
• Expert in Sample Design and Estimation for Business Surveys
• ... also expert in Small Area Estimation, Quality, Editing and Imputation, and Time Series Analysis
• Quality assurance (QA) by Italy
Definitions
• Outliers
• Errors
• Outliers in survey data
• Outliers in administrative data
• Outliers in modelling
• ... two glossaries considered: ONS and OECD
Definitions – outliers
• OECD: “A data value that lies in the tail of the statistical distribution of a set of data values”
• ONS: “A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”
• Question 1: extreme (1), influential (2), both (3)
Definitions – errors
• Errors are incorrect values identified by edit rules
• Edit rule, OECD: “A logical condition or a restriction which must be met if the data is to be considered correct”
• Edit rule, ONS: “A rule designed to detect specific errors in data for potential subsequent correction”
• Errors are corrected before outliers are considered
• Question 2: outliers = errors (1), outliers ≠ errors (2)
Definitions – survey outliers
• In the survey context, an outlier is an unrepresentative value (influential)
• A unit sampled with probability 1/n is assumed to represent itself and n-1 unsampled units in the population
• If the unit is unique, that assumption is invalid
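A minimal sketch of why a unique unit becomes influential once it is weighted up. The sample values, the common selection probability of 1/10 and the function name `expansion_estimate` are illustrative assumptions, not taken from the deliverable.

```python
def expansion_estimate(values, probs):
    """Weighted total: each sampled value is multiplied by 1 / selection probability."""
    return sum(y / p for y, p in zip(values, probs))

# Hypothetical sample: four ordinary units and one unique, very large unit.
# Each was sampled with probability 1/10, so each is assumed to stand for 10 units.
values = [12.0, 15.0, 11.0, 14.0, 900.0]
probs = [0.1] * 5

total = expansion_estimate(values, probs)

# Share of the estimated total that comes from the single large unit.
share_of_large_unit = (values[-1] / probs[-1]) / total
print(round(total, 1), round(share_of_large_unit, 3))
```

If the large unit is in fact the only one of its kind, the nine "copies" implied by its weight do not exist, and most of the estimated total rests on an invalid assumption.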
Definitions – administrative outliers
• In the administrative context, an outlier is an atypical value (extreme)
• Administrative data represent a census, so each unit is treated as unique
• No assumptions are made about a unit representing other units
Definitions – modelling outliers
• In the modelling context, an outlier is an influential value
• Influence, ONS: “The amount of effect a particular point has on the parameters of a regression equation”
• Influence applies to both processing and statistical modelling
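One simple way to make the ONS notion of influence concrete is to refit a regression with each point left out and see how far the slope moves. This is only a sketch: the data and the function names are made up for illustration, and the slides do not prescribe this particular measure.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

def leave_one_out_slope_change(x, y):
    """Change in the fitted slope caused by including each point in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    full_slope = ols_slope(x, y)
    changes = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i  # drop point i
        changes.append(full_slope - ols_slope(x[keep], y[keep]))
    return np.array(changes)

# Hypothetical data: five points close to y = 2x and one point far off that line.
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 10.0]

changes = leave_one_out_slope_change(x, y)
most_influential = int(np.argmax(np.abs(changes)))
print(most_influential, np.round(changes, 2))
```

The last point is not an error in this example. It is simply the one whose presence changes the fitted parameters most.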
Definitions – modelling outliers
• Processing, editing: “fail if > 60% of maximum over past 5 years”
• Processing, imputation: “uplift last return by average growth in domain”
• Statistical modelling
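The two processing rules quoted above can be sketched to show how a single outlying value influences them. The slides do not give operational definitions, so the code takes the edit rule literally (fail if the value exceeds 60% of the five-year maximum) and reads "average growth" as the mean of growth ratios in the domain. Both readings, the function names and all the numbers are assumptions.

```python
from statistics import mean

def fails_edit(value, history, ratio=0.6):
    """Literal reading of the quoted rule: fail if value > ratio * max(history)."""
    return value > ratio * max(history)

def impute_by_growth(last_return, growth_ratios):
    """Uplift the last return by the mean growth ratio observed in the domain."""
    return last_return * mean(growth_ratios)

# Editing: one outlying year in the history raises the threshold,
# so the same new value stops failing the edit.
clean_history = [100, 105, 98, 110, 102]
contaminated_history = [100, 105, 98, 1000, 102]
flag_clean = fails_edit(300, clean_history)
flag_contaminated = fails_edit(300, contaminated_history)

# Imputation: one outlying growth ratio inflates every imputed value in the domain.
typical_growth = [1.02, 0.98, 1.05, 1.03]
growth_with_outlier = [1.02, 0.98, 1.05, 4.00]
imputed_typical = impute_by_growth(200, typical_growth)
imputed_distorted = impute_by_growth(200, growth_with_outlier)

print(flag_clean, flag_contaminated)
print(round(imputed_typical, 1), round(imputed_distorted, 1))
```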
Identification – units
• A data warehouse stores data once for repeated use
• Each unit will have multiple values (variables/time periods), and whether any value is
  • extreme depends on which other data are used
  • influential depends on what process/model is estimated
• Given repeated use, it is impossible to know in advance how data domains will be defined or which models will be fitted
• So every unit in a data warehouse is a potential outlier
• Question 3: yes (1), no (2), unsure (3)
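The point that "extreme" depends on the comparison data can be illustrated with a simple robust rule applied to the same value in two different domains. The median/MAD rule, the multiplier k = 3 and the two domains are illustrative assumptions. The slides do not specify a detection rule here.

```python
from statistics import median

def is_extreme(value, reference, k=3.0):
    """Flag value if it lies more than k scaled MADs from the reference median."""
    med = median(reference)
    mad = median([abs(r - med) for r in reference])
    if mad == 0:
        return value != med
    # 1.4826 scales the MAD to be comparable with a standard deviation
    # when the reference data are roughly normal.
    return abs(value - med) > k * 1.4826 * mad

# Hypothetical turnover values for two possible domain definitions.
small_firms = [40, 45, 50, 55, 60, 48, 52]
large_firms = [300, 450, 520, 610, 700, 480, 550]

value = 500
extreme_among_small = is_extreme(value, small_firms)
extreme_among_large = is_extreme(value, large_firms)
print(extreme_among_small, extreme_among_large)
```

The value itself never changes. Only the domain it is compared against does, which is why a flag stored once in the warehouse could not be valid for every later use.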
Identification – uses
• Assuming all units are potential outliers
  • identification becomes use dependent
  • outliers are recorded as part of the metadata of an output
  • outliers are not otherwise recorded in the data warehouse
• Expected data uses and examples of identification methods
  • processing, e.g. comparing observed and expected edit failures
  • updating the business register, e.g. comparing different sources
  • survey (estimating variables and calibration weights), e.g. winsorisation and setting acceptable ranges
  • survey/admin (modelling relationship and estimating survey), e.g. Cook’s distance and winsorisation
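For the survey/admin use, Cook's distance can be computed directly from a fitted regression. The sketch below assumes a simple linear model of a survey variable on an administrative variable. The data, the variable names and the 4/n cut-off (a common rule of thumb, not something stated in the slides) are all assumptions.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each point in a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept and slope columns
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage h_ii = x_i' (X'X)^-1 x_i for each row x_i of X.
    leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

# Hypothetical matched values: administrative turnover and survey turnover.
admin = [10, 20, 30, 40, 50, 60, 200]
survey = [11, 19, 32, 41, 49, 62, 90]

distances = cooks_distance(admin, survey)
threshold = 4 / len(admin)  # rule-of-thumb cut-off, an assumption
flagged = [i for i, d in enumerate(distances) if d > threshold]
print(np.round(distances, 3), flagged)
```

Any flags produced this way belong to this particular model. A different model fitted to the same units could flag different ones.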
Treatment – units in uses
• Identified outliers need to be treated during use
  • to prevent distortion
  • by adjusting the weight of the unit to P% of its original value, where 0 < P < 100
  • balancing the reduction in variance against the increase in bias (i.e. in terms of mean squared error, MSE)
• Expected data uses and examples of treatment methods
  • processing, e.g. use medians rather than means
  • updating the business register, e.g. delete one source
  • survey (estimating variables and calibration weights), e.g. winsorisation and restricting to acceptable ranges
  • survey/admin (modelling relationship and estimating survey), e.g. delete from the modelling process and winsorisation
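As a sketch of the survey treatment, one-sided winsorisation can be written so that its effect is visible as a reduced weight. The specific form used here (the unit keeps its own value in full but represents the other units only at the cut-off) is one common variant and is an assumption, as are the cut-off and the numbers. In practice the cut-off would be chosen with the variance/bias trade-off in mind rather than fixed arbitrarily as below.

```python
def winsorise(y, weight, cutoff):
    """One-sided winsorisation of a value above a cutoff.

    Values at or below the cutoff are unchanged. Above it, the weighted
    contribution becomes y + (weight - 1) * cutoff: the unit counts its own
    value in full but stands in for the other units only at the cutoff.
    """
    if y <= cutoff:
        return y
    return y / weight + cutoff * (1.0 - 1.0 / weight)

# Hypothetical outlying unit: value 900, design weight 10, cutoff 100.
y, weight, cutoff = 900.0, 10.0, 100.0

y_star = winsorise(y, weight, cutoff)
contribution_before = weight * y
contribution_after = weight * y_star

# Equivalent view: keep y as reported and scale the weight to a fraction P.
p_fraction = y_star / y
print(y_star, contribution_before, contribution_after, round(p_fraction, 2))
```

The same treated contribution can therefore be described either as a changed value with the original weight or as the original value with the weight reduced to a fraction P of itself.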
Recommendations
• Neither data units nor their entries in a data warehouse should be labelled as outliers
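One way this recommendation could look in practice is to keep warehouse records free of any outlier flag and attach the identified units to the metadata of each output instead. The class names, field names, example records and the threshold rule below are all hypothetical. This is a sketch of the idea, not a design from the deliverable.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass(frozen=True)
class WarehouseRecord:
    """A stored value. Deliberately carries no outlier flag."""
    unit_id: str
    variable: str
    period: str
    value: float

@dataclass
class OutputMetadata:
    """Metadata attached to one output, i.e. one use of the data."""
    output_name: str
    method: str
    outlier_unit_ids: List[str] = field(default_factory=list)

def identify_for_output(
    records: List[WarehouseRecord],
    output_name: str,
    method: str,
    rule: Callable[[WarehouseRecord], bool],
) -> OutputMetadata:
    """Apply a use-specific rule and record the result only in the output metadata."""
    meta = OutputMetadata(output_name=output_name, method=method)
    meta.outlier_unit_ids = [r.unit_id for r in records if rule(r)]
    return meta

# Hypothetical records and a hypothetical fixed-threshold rule for one output.
records = [
    WarehouseRecord("A1", "turnover", "2011Q1", 120.0),
    WarehouseRecord("B2", "turnover", "2011Q1", 95.0),
    WarehouseRecord("C3", "turnover", "2011Q1", 9000.0),
]
meta = identify_for_output(
    records,
    output_name="hypothetical_turnover_estimate",
    method="fixed threshold (illustrative)",
    rule=lambda r: r.value > 1000,
)
print(meta.outlier_unit_ids)
```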