
Outlier Treatment in HCSO



Presentation Transcript


  1. Outlier Treatment in HCSO: Present and future

  2. Outline
     • Outlier detection – types, editing, estimation
     • Description of the current method
     • Alternatives
     • Future work
     • Introduction of a new tool: R and RStudio
     UNECE Statistical Data Editing 2014

  3. Outlier detection and treatment
     Purpose of outlier detection:
     • Estimation
       • Representative outliers
       • Non-representative outliers
       • Decreasing weights
       • Changing the values
       • Using robust estimations
     • Editing
       • Identify errors
     Source: MEMOBUST

  4. Monthly Survey of Manufacturing
     • Take-all part
     • Survey part: fewer than 50 employees (and more than 5, because the smallest businesses are outside the scope of the survey)
     • The sampling frame is based on the Register of Enterprises (~10 thousand units)
     • The sampling ratio is about 15%
     • Stratified sample (many NACE categories, categories by number of employees, and two territorial strata: the capital and everything else) (Telegdi 2004)

  5. Monthly Survey of Manufacturing: data
     • Distribution of some variables
     • Skewed distribution
     • Visible outliers

  6. Current method of outlier detection
     The aim of the outlier treatment is to improve the estimation (Csereháti 2004).
     Steps of the method:
     1. Computing the outlier indicators
     2. Manual outlier detection by the methodologist/expert
     3. Transfer of the result to the subject matter statistician
     4. Discussion of the result by the subject matter statistician (modifications are possible); this resembles the process of selective editing

  7. Outlier indicators
     • LNSQRT: main indicator
       • Grubbs critical value
       • Standardized value of the variables
     • SQUARED: identifies the highest values
     • MEANX: the ratio of the observed value of the unit to the weighted mean of the stratum computed without this unit's value
     • VALOUT: the difference between the estimate of the total with and without the given value in a given stratum
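
The verbal definitions of MEANX and VALOUT on this slide can be illustrated with a minimal Python sketch. This is not the HCSO implementation. The function name `meanx_valout` and the example data are hypothetical. The slide also does not say how the total "without the given value" is estimated; the sketch assumes the remaining weights are rescaled to the full stratum weight.

```python
def meanx_valout(values, weights):
    """Return (MEANX, VALOUT) lists for the units of one stratum.

    Illustrative definitions only, reconstructed from the slide text.
    """
    if len(values) != len(weights) or len(values) < 2:
        raise ValueError("need at least two units with matching weights")
    total_weight = sum(weights)
    total_est = sum(w * y for y, w in zip(values, weights))
    meanx, valout = [], []
    for y, w in zip(values, weights):
        # Weighted mean of the stratum with this unit left out.
        mean_without = (total_est - w * y) / (total_weight - w)
        # MEANX: observed value relative to that leave-one-out mean
        # (undefined if the leave-one-out mean is zero).
        meanx.append(y / mean_without)
        # VALOUT, under the ASSUMPTION that the total without the unit
        # is estimated by rescaling the remaining weights to the full
        # stratum weight.
        valout.append(total_est - total_weight * mean_without)
    return meanx, valout


# Hypothetical stratum: three ordinary units and one very large one.
values = [10.0, 12.0, 11.0, 90.0]
weights = [5.0, 5.0, 5.0, 5.0]
meanx, valout = meanx_valout(values, weights)
```

In this made-up stratum the last unit has by far the largest value on both indicators, which is the kind of case the expert would then inspect manually.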

  8. The main indicator: LNSQRT

  9. Outlier treatment
     • Weight trimming: the weights of the outliers are changed to 1
     • Number of outliers: on average 2% of the cases
     • Change in the estimates:
       • Mean: -15% (on average)
       • Variance: serious decrease
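
Weight trimming as described on this slide can be sketched in a few lines of Python. The helper names and the data are hypothetical. The sketch does not rescale the remaining weights after trimming, because the slide does not say whether that is done.

```python
def trim_weights(weights, is_outlier):
    """Set the sampling weight of every flagged unit to 1."""
    return [1.0 if flag else w for w, flag in zip(weights, is_outlier)]


def weighted_total(values, weights):
    """Weighted estimate of the total: sum of weight * value."""
    return sum(w * y for y, w in zip(values, weights))


# Hypothetical stratum in which the last unit was flagged as an outlier.
values = [10.0, 12.0, 11.0, 90.0]
weights = [5.0, 5.0, 5.0, 5.0]
flags = [False, False, False, True]

trimmed = trim_weights(weights, flags)
before = weighted_total(values, weights)
after = weighted_total(values, trimmed)
```

After trimming, the outlier represents only itself in the estimate, so a large value no longer gets multiplied by its sampling weight.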

  10. Alternative methods
     • One-dimensional methods
       • Median absolute deviation
       • Custom indicator: share in total
       • Quantile
       • Disadvantage: must be applied separately to each of many variables
     • Multidimensional method
       • Mahalanobis distance based outlier detection

  11. Share in total, a custom indicator
     • Considers the individual value and the size of the stratum in the same formula
     • Inspired by the current indicators
     • The possible outlier:
       • accounts for a considerably large share of the total
       • in a big stratum
     • The indicator is computed for each stratum

  12. Results
     • Quantile method
       • Threshold: 99%
       • The method can identify almost the same outliers as the current one
       • Easy to implement
     • MAD
       • Problem of choosing k (the threshold)
       • Too many cases were selected
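
The two one-dimensional rules compared on this slide can be sketched with the Python standard library. The function names and the data are hypothetical. The cutoff `k = 3` and the 1.4826 scaling constant are common textbook choices, not values taken from the presentation, which only notes that choosing k was a problem.

```python
import statistics


def quantile_flags(data, pct=99):
    """Flag values strictly above the pct-th percentile (pct in 1..99)."""
    cut = statistics.quantiles(data, n=100, method="inclusive")[pct - 1]
    return [x > cut for x in data]


def mad_flags(data, k=3.0):
    """Flag values whose scaled distance from the median exceeds k.

    k = 3 is an ASSUMED default; 1.4826 makes the MAD comparable to a
    standard deviation under normality.
    """
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    if mad == 0:
        raise ValueError("MAD is zero; the rule is undefined for this data")
    return [abs(x - med) / (1.4826 * mad) > k for x in data]


# Hypothetical variable with one visibly extreme value.
data = [10, 11, 12, 11, 10, 13, 12, 11, 90]
by_quantile = quantile_flags(data)
by_mad = mad_flags(data)
by_mad_strict = mad_flags(data, k=1.0)
```

Lowering `k` makes the MAD rule flag more units. This is consistent with the slide's remark that the choice of threshold is the difficult part of the method.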

  13. Results (2)
     • Share in total
       • Threshold value: 0.5
       • Smaller number of outliers
     • Mahalanobis distance
       • We used the robust Mahalanobis distance
       • 3 key variables (Total revenue etc.)
       • These are not involved in the current method
       • Avoiding missing values
       • Similar results (2/3 of the current outliers are detected)


  15. Future plans
     • Development of methodology:
       • More analysis of the effect on estimates
       • Winsorization
     • Development of the process:
       • Automation and reproducibility
       • More informative report on the process, to help better understand and analyse the process steps

  16. Experimental tools
     • Outlier treatment is separate from the other steps of data processing; it belongs to the methodology
     • Possible new tool: R (with RStudio)
       • Advantages: ease of development; ready-to-use functions for outlier detection
       • Disadvantage: needs an "expert" user; not a commonly used tool

  17. Thank you for your attention!
