
Jörg Drechsler Competence Center for Empirical Methods

A New Approach for Disclosure Control in the IAB Establishment Panel – Multiple Imputation for a Better Data Access. Jörg Drechsler Competence Center for Empirical Methods Institute for Employment Research of the Federal Employment Agency, Germany


Presentation Transcript


  1. A New Approach for Disclosure Control in the IAB Establishment Panel – Multiple Imputation for a Better Data Access • Jörg Drechsler, Competence Center for Empirical Methods, Institute for Employment Research of the Federal Employment Agency, Germany • UNECE Work Session on Statistical Data Editing, Bonn, 25-27 September 2006

  2. Overview • The IAB Establishment Panel • Three approaches for disclosure control via multiple imputation • Application of the full MI approach to the IAB Establishment Panel • First results • Proceedings/open questions

  3. The IAB Establishment Panel • Annually conducted Establishment Survey (generally face-to-face interviews) • Since 1993 in Western Germany, since 1996 in Eastern Germany • Population: All establishments with at least one employee covered by social security • Source: Official Employment Statistics • Response rate of repeatedly interviewed establishments more than 80%

  4. The IAB Establishment Panel: Sample/Weighting • Sample of more than 16,000 establishments in the last wave • Stratified sample: 20 economic branches × 10 size classes • Oversampling of large establishments • Yearly additional samples: newly founded firms and replacements for panel attrition • Weighting: inverse sampling probabilities; adjustment to exogenous values; probabilities of staying in the sample
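
The first weighting component, inverse sampling probabilities, can be illustrated with a minimal sketch. It assumes simple random sampling within each stratum, so that the weight of a unit in stratum h is N_h / n_h; the function name `design_weights` and the toy frame are hypothetical, and the adjustment to exogenous values and the staying probabilities are not covered.

```python
from collections import Counter

def design_weights(frame_strata, sample_strata):
    """Return one inverse-probability weight per sampled unit.

    frame_strata:  stratum label of every unit in the sampling frame
    sample_strata: stratum label of every sampled unit
    Assumes simple random sampling within strata (weight = N_h / n_h).
    """
    frame_counts = Counter(frame_strata)    # N_h: frame size per stratum
    sample_counts = Counter(sample_strata)  # n_h: sample size per stratum
    return [frame_counts[h] / sample_counts[h] for h in sample_strata]

# Hypothetical frame: many small establishments, few large ones.
frame = ["small"] * 900 + ["large"] * 100
# Large establishments are oversampled: 50 of 100 versus 90 of 900.
sample = ["small"] * 90 + ["large"] * 50

weights = design_weights(frame, sample)
print(weights[0], weights[-1], sum(weights))
```

Oversampled large establishments receive smaller weights, and the weights add up to the frame size.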

  5. The IAB Establishment Panel: Contents • Annual: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils • Bi- or triennial: innovations, government aid, further training, flexibility of working hours, business activities, contact with employment offices • Focus topics: 2001: innovation and modern technologies; 2002: elderly employees and contact with the labour offices • Kölling, A. (2000): The IAB-Establishment Panel, Journal of Appl. Social Science Studies, 120: 2, 291-300.

  6. Overview • The IAB Establishment Panel • Three approaches for disclosure control via multiple imputation • Application of the full MI approach to the IAB Establishment Panel • First results • Proceedings/open questions

  7. Fully Synthetic Data • Proposed by Rubin (1993) • Idea: treat all units of the population that are not included in the sample as missing data and impute them multiply; then take random samples from the imputed population and release these samples to the public • Notation: X = variables available for all units in the population; Y = variables available only for units in the survey; $Y_{inc}$ = Y for units included in the survey; $Y_{exc}$ = Y for units not included in the survey
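
A minimal sketch of the idea on this slide, assuming a single X variable, a single Y variable, and a normal linear model for Y given X with a flat prior. The function names (`fit_centered_ols`, `synthesize`) and the toy data are hypothetical; the presentation itself does not specify the imputation model at this point.

```python
import random
import statistics

def fit_centered_ols(x, y):
    """Least-squares fit of y = a + b * (x - mean(x)); returns what the draws need."""
    n = len(x)
    x_bar = statistics.fmean(x)
    a_hat = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    sse = sum((yi - a_hat - b_hat * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    return n, x_bar, a_hat, b_hat, sxx, sse

def synthesize(pop_x, sample_idx, sample_y, m, n_release, rng):
    """Return m synthetic samples, each a list of (x, y) pairs."""
    xs = [pop_x[i] for i in sample_idx]
    n, x_bar, a_hat, b_hat, sxx, sse = fit_centered_ols(xs, sample_y)
    observed = dict(zip(sample_idx, sample_y))
    released = []
    for _ in range(m):
        # Draw the model parameters instead of plugging in the estimates.
        sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)  # SSE / chi-square(n - 2)
        a = rng.gauss(a_hat, (sigma2 / n) ** 0.5)
        b = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)
        # Impute Y for every unit outside the survey; keep observed Y as is.
        pop_y = [
            observed[i] if i in observed
            else rng.gauss(a + b * (x - x_bar), sigma2 ** 0.5)
            for i, x in enumerate(pop_x)
        ]
        # Release a simple random sample from the completed population.
        picks = rng.sample(range(len(pop_x)), n_release)
        released.append([(pop_x[i], pop_y[i]) for i in picks])
    return released

# Toy population of 500 units with X known for all, Y observed for 50.
rng = random.Random(0)
pop_x = [rng.uniform(0, 10) for _ in range(500)]
sample_idx = rng.sample(range(500), 50)
sample_y = [2.0 + 3.0 * pop_x[i] + rng.gauss(0, 1) for i in sample_idx]

synthetic = synthesize(pop_x, sample_idx, sample_y, m=5, n_release=50, rng=rng)
print(len(synthetic), len(synthetic[0]))
```

Drawing the parameters anew for each of the m data sets is what lets the between-data-set variance carry the model uncertainty.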

  8. Imputation of Selected Variables • Observed values are replaced by imputed values only for variables that bear a high risk of disclosure (key variables) • Proposal: replace only part of each key variable in every imputation round and combine the imputed parts to obtain fully imputed variables • Example: 3 variables and 3 imputation rounds

  9. Selective Multiple Imputation of Key Variables (SMIKe) • Suggested by Liu and Little (2002) • Only selected units of key variables are multiply imputed • Assume the data set can be divided into a set of categorical key variables X and a set of continuous variables Y • Cross-tabulating X yields the vector x of cell counts for all combinations of the key variables • Cell counts below a previously defined sensitivity threshold may allow re-identification • These sensitive cells, together with some non-sensitive cells that are closely related to them with regard to Y, are replaced by imputed values
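
The cross-tabulation step can be sketched as follows. This covers only the flagging of sensitive cells, not the choice of related non-sensitive cells or the imputation itself; the function name `sensitive_cells`, the key variables, and the threshold of 3 are hypothetical choices for illustration.

```python
from collections import Counter

def sensitive_cells(key_rows, threshold):
    """Return the key-variable combinations whose cell count is below threshold."""
    counts = Counter(key_rows)  # cross-tabulation of the categorical keys
    return {cell for cell, n in counts.items() if n < threshold}

# Hypothetical categorical key variables per unit: (branch, size class).
key_rows = (
    [("retail", "small")] * 40
    + [("retail", "large")] * 12
    + [("mining", "large")] * 2
    + [("mining", "small")] * 1
)

flagged = sensitive_cells(key_rows, threshold=3)
# Units that fall into a flagged cell are candidates for imputation.
flagged_units = [i for i, row in enumerate(key_rows) if row in flagged]
print(sorted(flagged), len(flagged_units))
```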

  10. Overview • The IAB Establishment Panel • Three approaches for disclosure control via multiple imputation • Application of the full MI approach to the IAB Establishment Panel • First results • Proceedings/open questions

  11. Generating a synthetic data set • Create a synthetic data set for selected variables from the 1997 wave of the Establishment Panel • Imputation for the whole population is not feasible • Draw a new sample from the Official Employment Statistics using the same sampling design as for the Establishment Panel (stratification by economic branch, size, and region) • Each stratum cell contains the same number of observations as in the 1997 wave of the Establishment Panel • Additional information from the German Social Security Data (GSSD) is used for the imputation • Diagram: X is available for both the new sample and the IAB Establishment Panel; $Y_{inc}$ is the data from the IAB Establishment Panel; $Y_{exc}$ is the missing data for the new sample
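
Drawing a new sample whose stratum cells match the panel can be sketched like this. It assumes simple random sampling without replacement within each cell; the function name `draw_matching_sample`, the unit identifiers, and the stratum labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def draw_matching_sample(frame, panel_strata, rng):
    """Draw from `frame` so each stratum has as many units as in the panel.

    frame:        list of (unit_id, stratum) pairs for the whole frame
    panel_strata: stratum label of each panel unit
    Raises ValueError if a frame stratum holds fewer units than the panel needs.
    """
    by_stratum = defaultdict(list)
    for unit_id, stratum in frame:
        by_stratum[stratum].append(unit_id)
    sample = []
    for stratum, n_h in Counter(panel_strata).items():
        drawn = rng.sample(by_stratum[stratum], n_h)  # without replacement
        sample.extend((unit_id, stratum) for unit_id in drawn)
    return sample

# Hypothetical strata: (economic branch, size class, region).
small_west = ("manufacturing", "small", "west")
large_east = ("manufacturing", "large", "east")
frame = [(i, small_west) for i in range(100)] + [(i, large_east) for i in range(100, 130)]
panel_strata = [small_west] * 5 + [large_east] * 10

new_sample = draw_matching_sample(frame, panel_strata, random.Random(1))
print(Counter(stratum for _, stratum in new_sample))
```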

  12. The German Social Security Data (GSSD) • Contains information on all employees covered by social security • Since 1973, all employers have been required to notify the social security agencies about all employees covered by social security • The GSSD represents about 80% of the German workforce • Information from the GSSD is aggregated at the establishment level and matched to the IAB Establishment Panel via the establishment identification number • Information on: number of employees by gender and schooling, mean age of the employees, mean wage of the employees, …

  13. Imputation procedure • For simplicity, newly founded establishments are excluded from the sampling frame and from the panel • 8 new samples are drawn • The number of observations in each sample equals the number of observations in the panel: $n_s = n_p = 7332$ • Every sample is imputed five times using chained equations • Number of variables in X: 24 • Number of variables in Y: 48 • Imputations are generated using IVEware by Raghunathan, Solenberger and Van Hoewyk (2001)

  14. Overview • The IAB Establishment Panel • Three approaches for disclosure control via multiple imputation • Application of the full MI approach to the IAB Establishment Panel • First results • Proceedings/open questions

  15. A regression by T. Zwick (2005) as a means of evaluation • Zwick analyses the productivity effects of different forms of continuing vocational training in Germany • Result: vocational training is one of the most important measures to gain and keep productivity • Probit regression to explain why firms offer vocational training • 13 explanatory variables, including: share of qualified employees, establishment size, region, collective wage agreement, high qualification needs expected, … • 2 variables, based on the 1998 wave of the panel, are dropped for the evaluation

  16. Binary variables in the original and in the synthetic data set

  17. Continuous variables in the original and in the synthetic dataset

  18. Results from the regression

  19. Complete data set and synthetic data set

  20. Overview • The IAB Establishment Panel • Three approaches for disclosure control via multiple imputation • Application of the full MI approach to the IAB Establishment Panel • First results • Proceedings/open questions

  21. Proceedings/Open Questions • Use nonparametric approaches • Replace only selected variables • Measure the disclosure risk after imputation • Generate weights for the synthetic sample?

  22. Thank you for your attention!

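  23. Rubin’s adjusted combining rules • Imputation yields m different data sets • Information from the data sets has to be combined to get valid estimates • Point estimate: average of the point estimates from the different data sets, $\bar{q}_m = \frac{1}{m}\sum_{i=1}^{m} q_i$ • Variance estimate: a combination of the variance within the data sets (W) and the variance between the data sets (B), $T = \left(1 + \frac{1}{m}\right) B - W$ (not $T = W + \left(1 + \frac{1}{m}\right) B$), with $W = \frac{1}{m}\sum_{i=1}^{m} u_i$ and $B = \frac{1}{m-1}\sum_{i=1}^{m} (q_i - \bar{q}_m)^2$, where $q_i$ and $u_i$ are the point estimate and its estimated variance in data set i • Reason: an additional sampling step is necessary when creating synthetic data sets, so the variance B already reflects the variance within each population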
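
The combining step can be sketched in a few lines. The sketch assumes the adjusted rule meant here is the one commonly used for fully synthetic data, T = (1 + 1/m) B - W, with W the average within-data-set variance and B the between-data-set variance; the function name `combine_fully_synthetic` and the numbers are hypothetical.

```python
import statistics

def combine_fully_synthetic(estimates, variances):
    """Combine m point estimates and their variance estimates.

    estimates: point estimate q_i from each synthetic data set
    variances: estimated variance u_i of q_i from each synthetic data set
    Returns (q_bar, B, W, T) with T = (1 + 1/m) * B - W (assumed adjusted rule).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)  # combined point estimate
    b = statistics.variance(estimates)   # between variance, divisor m - 1
    w = statistics.fmean(variances)      # average within variance
    t = (1 + 1 / m) * b - w
    return q_bar, b, w, t

# Hypothetical results from m = 3 synthetic data sets.
q_bar, b, w, t = combine_fully_synthetic([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
print(q_bar, b, w, t)
```

Because W is subtracted, T can come out negative in small examples; how to handle that case is left aside here.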

  24. Information from the two data sets
