Deliverable 2.8: Outliers

Deliverable 2.8: Outliers. Gary Brown Office for National Statistics UK. Outliers = Outlier detection and treatment aspects of combining data (survey/administrative) including options for various hierarchies. Overview. Introduction Definitions Identification Treatment Recommendations.

Presentation Transcript


  1. Deliverable 2.8: Outliers Gary Brown Office for National Statistics UK

  2. Outliers = Outlier detection and treatment aspects of combining data (survey/administrative) including options for various hierarchies

  3. Overview • Introduction • Definitions • Identification • Treatment • Recommendations

  4. Introduction • Deliverable 2.8 led by UK • UK leader worked in methodology over 14 years • Expert in Sample Design and Estimation for Business Surveys • ... also expert in Small Area Estimation, Quality, Editing and Imputation, Time Series Analysis • QA by Italy

  5. Definitions • Outliers • Errors • Outliers in survey data • Outliers in administrative data • Outliers in modelling • ... two glossaries considered: ONS and OECD

  6. Definitions – outliers • OECD “A data value that lies in the tail of the statistical distribution of a set of data values”

  7. Definitions – outliers • OECD “A data value that lies in the tail of the statistical distribution of a set of data values” • ONS “A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”

  10. Definitions – outliers • OECD “A data value that lies in the tail of the statistical distribution of a set of data values” • ONS “A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate” • Question 1: extreme (1) influential (2) both (3)

  11. Definitions – errors • Errors are incorrect values identified by edit rules

  13. Definitions – errors • Errors are incorrect values identified by edit rules • OECD “A logical condition or a restriction which must be met if the data is to be considered correct”

  14. Definitions – errors • Errors are incorrect values identified by edit rules • OECD “A logical condition or a restriction which must be met if the data is to be considered correct” • ONS “A rule designed to detect specific errors in data for potential subsequent correction”

  16. Definitions – errors • Errors are incorrect values identified by edit rules • OECD “A logical condition or a restriction which must be met if the data is to be considered correct” • ONS “A rule designed to detect specific errors in data for potential subsequent correction” • Errors are corrected before outliers are considered

  17. Definitions – errors • Errors are incorrect values identified by edit rules • OECD “A logical condition or a restriction which must be met if the data is to be considered correct” • ONS “A rule designed to detect specific errors in data for potential subsequent correction” • Errors are corrected before outliers are considered • Question 2: outliers = errors (1) outliers ≠ errors (2)

  18. Definitions – survey outliers • In the survey context, an outlier is an unrepresentative value

  19. Definitions – survey outliers • In the survey context, an outlier is an unrepresentative value → influential

  20. Definitions – survey outliers • In the survey context, an outlier is an unrepresentative value → influential • A unit sampled with probability 1/n is assumed to represent n-1 unsampled units in the population • If the unit is unique, the assumption is invalid
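
As a rough illustration of why a unique unit breaks this assumption, the sketch below expands a small sample to a population total using design weights. The values and the weight of 10 are made-up numbers chosen for the example, not figures from the deliverable.

```python
# Hypothetical sample: each unit was selected with probability 1/10,
# so each carries a design weight of 10 (itself plus 9 unsampled units).
values = [12.0, 15.0, 11.0, 14.0, 900.0]  # the last unit is genuinely unique
weight = 10.0


def weighted_total(sample, w):
    """Expansion estimate of the population total: sum of w * y."""
    return sum(w * y for y in sample)


# Every unit, including the unique one, is assumed to represent 9 others.
estimate_all = weighted_total(values, weight)

# If the unique unit represents only itself, its weight should be 1.
typical = values[:-1]
estimate_adjusted = weighted_total(typical, weight) + 1.0 * values[-1]

print(estimate_all)       # 9520.0
print(estimate_adjusted)  # 1420.0
```

The gap between the two totals comes entirely from treating one unrepresentative value as if nine similar units existed in the population.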

  21. Definitions – administrative outliers • In the administrative context, an outlier is an atypical value

  22. Definitions – administrative outliers • In the administrative context, an outlier is an atypical value → extreme

  23. Definitions – administrative outliers • In the administrative context, an outlier is an atypical value → extreme • Administrative data represent a census, so each unit is treated as unique • No assumptions

  24. Definitions – modelling outliers • In the modelling context, an outlier is an influential value

  25. Definitions – modelling outliers • In the modelling context, an outlier is an influential value → influential

  26. Definitions – modelling outliers • In the modelling context, an outlier is an influential value → influential • ONS “The amount of effect a particular point has on the parameters of a regression equation” • Influence on processing and statistical modelling

  27. Definitions – modelling outliers • Processing – editing “fail if > 60% of maximum over past 5 years”

  28. Definitions – modelling outliers • Processing – editing “fail if > 60% of maximum over past 5 years” • Processing – imputation “uplift last return by average growth in domain”

  29. Definitions – modelling outliers • Processing – editing “fail if > 60% of maximum over past 5 years” • Processing – imputation “uplift last return by average growth in domain” • Statistical modelling
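
The two processing rules quoted above can be sketched as follows, to show how a single outlying value changes their behaviour. This is only one literal reading of the slide text: the function names, the use of growth ratios, and all numbers are assumptions made for illustration.

```python
def edit_fails(value, past_values, fraction=0.6):
    """Editing rule as quoted: fail if value > 60% of the maximum over the past 5 periods."""
    return value > fraction * max(past_values[-5:])


def impute(last_return, domain_growth):
    """Imputation rule as quoted: uplift the last return by the average growth in the domain."""
    average_growth = sum(domain_growth) / len(domain_growth)
    return last_return * average_growth


# Hypothetical five-period histories; the second contains one outlying value.
clean_history = [100.0, 110.0, 105.0, 120.0, 115.0]
outlier_history = [100.0, 110.0, 105.0, 1000.0, 115.0]
print(edit_fails(80.0, clean_history))    # True: 80 exceeds 60% of 120
print(edit_fails(80.0, outlier_history))  # False: 80 is below 60% of 1000

# Hypothetical growth ratios for units in one domain; the last is outlying.
clean_growth = [1.02, 0.98, 1.00, 1.04]
outlier_growth = clean_growth + [3.00]
print(impute(200.0, clean_growth))    # roughly 202
print(impute(200.0, outlier_growth))  # roughly 281.6
```

In both rules the outlier is influential in the modelling sense: it moves the threshold or the uplift applied to every other unit.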

  33. Identification – units • A data warehouse stores data once for repeated use

  34. Identification – units • A data warehouse stores data once for repeated use • Each unit will have multiple values (variables/time periods), and whether any value is • extreme depends on which other data are used • influential depends on what process/model is estimated

  35. Identification – units • A data warehouse stores data once for repeated use • Each unit will have multiple values (variables/time periods), and whether any value is • extreme depends on which other data are used • influential depends on what process/model is estimated • Given repeated use, it is impossible to know how data domains will be defined or which models will be fitted

  36. Identification – units • A data warehouse stores data once for repeated use • Each unit will have multiple values (variables/time periods), and whether any value is • extreme depends on which other data are used • influential depends on what process/model is estimated • Given repeated use, it is impossible to know how data domains will be defined or which models will be fitted, so every unit in a data warehouse is a potential outlier

  37. Identification – units • A data warehouse stores data once for repeated use • Each unit will have multiple values (variables/time periods), and whether any value is • extreme depends on which other data are used • influential depends on what process/model is estimated • Given repeated use, it is impossible to know how data domains will be defined or which models will be fitted, so every unit in a data warehouse is a potential outlier • Question 3: yes (1) no (2) unsure (3)
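
A minimal sketch of the point that "extreme" depends on which other data are used: the same value is flagged in one domain and not in another. The median/MAD rule, the cut-off k = 3 and the data are assumptions chosen for illustration; the presentation does not prescribe a specific rule here.

```python
import statistics


def is_extreme(value, domain, k=3.0):
    """Flag value if it lies more than k median absolute deviations from the domain median."""
    med = statistics.median(domain)
    mad = statistics.median([abs(x - med) for x in domain])
    return abs(value - med) > k * mad


# Hypothetical domains drawn from the same warehouse.
small_firms = [10, 12, 11, 13, 9, 12]
all_firms = small_firms + [40, 55, 60, 45, 70, 52]

print(is_extreme(50, small_firms))  # True
print(is_extreme(50, all_firms))    # False
```

Because the domain is only fixed when an output is defined, the flag cannot be decided once and stored with the unit.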

  38. Identification – uses • Assuming all units are potential outliers • identification becomes use dependent • outliers are recorded as part of the metadata of an output • outliers are not otherwise recorded in the data warehouse

  39. Identification – uses • Assuming all units are potential outliers • identification becomes use dependent • outliers are recorded as part of the metadata of an output • outliers are not otherwise recorded in the data warehouse • Expected data uses and examples of identification methods

  40. Identification – uses • Assuming all units are potential outliers • identification becomes use dependent • outliers are recorded as part of the metadata of an output • outliers are not otherwise recorded in the data warehouse • Expected data uses and examples of identification methods • processing, e.g. comparing observed and expected edit failures

  41. Identification – uses • Assuming all units are potential outliers • identification becomes use dependent • outliers are recorded as part of the metadata of an output • outliers are not otherwise recorded in the data warehouse • Expected data uses and examples of identification methods • processing, e.g. comparing observed and expected edit failures • updating the business register, e.g. comparing different sources

  42. Identification – uses • Assuming all units are potential outliers • identification becomes use dependent • outliers are recorded as part of the metadata of an output • outliers are not otherwise recorded in the data warehouse • Expected data uses and examples of identification methods • processing, e.g. comparing observed and expected edit failures • updating the business register, e.g. comparing different sources • survey (estimating variables & calibration weights), e.g. winsorisation & setting acceptable ranges

  43. Identification – uses • Assuming all units are potential outliers • identification becomes use dependent • outliers are recorded as part of the metadata of an output • outliers are not otherwise recorded in the data warehouse • Expected data uses and examples of identification methods • processing, e.g. comparing observed and expected edit failures • updating the business register, e.g. comparing different sources • survey (estimating variables & calibration weights), e.g. winsorisation & setting acceptable ranges • survey/admin (modelling relationship & estimating survey), e.g. Cook’s distance & winsorisation
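
For the survey/admin use, Cook's distance can be sketched for a simple linear regression of a survey variable on an administrative variable. The data and the 4/n flagging threshold (a common rule of thumb) are assumptions for illustration; the presentation does not specify either.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical data: administrative values (x) and survey values (y).
admin = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
survey = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 40.0]

distances = cooks_distances(admin, survey)
threshold = 4 / len(admin)  # rule-of-thumb cut-off, an assumption here
flagged = [i for i, d in enumerate(distances) if d > threshold]
print(flagged)  # [7]
```

Only the last unit is flagged: it combines a large residual with high leverage, so removing it would noticeably change the fitted relationship.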

  44. Treatment – units in uses • Identified outliers need to be treated during use • to prevent distortion • by adjusting the weight of the unit to P, with 0 < P < 100% • balancing reduced variance against increased bias (i.e. MSE)

  45. Treatment – units in uses • Identified outliers need to be treated during use • to prevent distortion • by adjusting the weight of the unit to P, with 0 < P < 100% • balancing reduced variance against increased bias (i.e. MSE) • Expected data uses and examples of treatment methods

  46. Treatment – units in uses • Identified outliers need to be treated during use • to prevent distortion • by adjusting the weight of the unit to P, with 0 < P < 100% • balancing reduced variance against increased bias (i.e. MSE) • Expected data uses and examples of treatment methods • processing, e.g. use medians rather than means

  47. Treatment – units in uses • Identified outliers need to be treated during use • to prevent distortion • by adjusting the weight of the unit to P, with 0 < P < 100% • balancing reduced variance against increased bias (i.e. MSE) • Expected data uses and examples of treatment methods • processing, e.g. use medians rather than means • updating the business register, e.g. delete one source

  48. Treatment – units in uses • Identified outliers need to be treated during use • to prevent distortion • by adjusting the weight of the unit to P, with 0 < P < 100% • balancing reduced variance against increased bias (i.e. MSE) • Expected data uses and examples of treatment methods • processing, e.g. use medians rather than means • updating the business register, e.g. delete one source • survey (estimating variables & calibration weights), e.g. winsorisation & restrict to acceptable ranges

  49. Treatment – units in uses • Identified outliers need to be treated during use • to prevent distortion • by adjusting the weight of the unit to P, with 0 < P < 100% • balancing reduced variance against increased bias (i.e. MSE) • Expected data uses and examples of treatment methods • processing, e.g. use medians rather than means • updating the business register, e.g. delete one source • survey (estimating variables & calibration weights), e.g. winsorisation & restrict to acceptable ranges • survey/admin (modelling relationship & estimating survey), e.g. delete from modelling process & winsorisation
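
Winsorisation as a treatment can be sketched as below. This uses one common one-sided, weight-aware form, in which a value above a cut-off K is pulled back to K + (y - K)/w; whether the deliverable intends this exact variant is an assumption, and the cut-off, weight and data are made up for the example.

```python
def winsorise(value, weight, cutoff):
    """One-sided winsorisation: values above the cut-off are pulled towards it.

    The treated unit then contributes its own excess over the cut-off once,
    rather than 'weight' times, to the weighted total.
    """
    if value <= cutoff:
        return value
    return cutoff + (value - cutoff) / weight


# Hypothetical sample with a common design weight and an assumed cut-off K.
values = [12.0, 15.0, 11.0, 14.0, 900.0]
weight = 10.0
K = 50.0

treated = [winsorise(y, weight, K) for y in values]
untreated_total = sum(weight * y for y in values)
treated_total = sum(weight * y for y in treated)

print(treated)          # [12.0, 15.0, 11.0, 14.0, 135.0]
print(untreated_total)  # 9520.0
print(treated_total)    # 1870.0
```

The treated estimate is biased downwards but far less variable, which is the variance/bias trade-off the slide summarises as MSE; in practice the cut-off would be chosen to balance the two.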

  50. Recommendations • Neither data units nor their entries in a data warehouse should be labelled as outliers
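
One way to picture this recommendation, together with the earlier point that outliers are recorded as part of the metadata of an output, is sketched below. All class, field and function names are hypothetical, and dropping flagged units stands in for whatever treatment a given use applies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WarehouseRecord:
    """A stored unit value; deliberately carries no outlier flag."""
    unit_id: str
    value: float


@dataclass
class Output:
    """A statistical output with use-specific outlier metadata."""
    name: str
    estimate: float
    metadata: dict = field(default_factory=dict)


def produce_output(name, records, is_outlier):
    """Build an output, recording which units this particular use treated as outliers."""
    flagged = [r.unit_id for r in records if is_outlier(r)]
    kept = [r.value for r in records if r.unit_id not in flagged]
    return Output(name, sum(kept), {"outlier_unit_ids": flagged})


# Hypothetical warehouse contents, reused by two outputs with different rules.
warehouse = [
    WarehouseRecord("A", 10.0),
    WarehouseRecord("B", 12.0),
    WarehouseRecord("C", 900.0),
]

output_1 = produce_output("use 1", warehouse, lambda r: r.value > 100)
output_2 = produce_output("use 2", warehouse, lambda r: r.value > 1000)

print(output_1.metadata, output_1.estimate)  # {'outlier_unit_ids': ['C']} 22.0
print(output_2.metadata, output_2.estimate)  # {'outlier_unit_ids': []} 922.0
```

The warehouse records are identical for both outputs; only the output metadata differs, so a unit flagged for one use remains available untouched for another.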
