European Conference on Quality in Official Statistics 2008 Rome, 08.-11. July 2008

Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control in the German IAB Establishment Panel Jörg Drechsler, Stefan Bender (Institute for Employment Research, Germany)& Susanne Rässler (University of Bamberg) European Conference on Quality in Official Statistics 2008 Rome, 08.-11. July 2008

Overview • Multiple Imputation for Statistical Disclosure Control • The IAB Establishment Panel • Application of The Two Approaches • Comparison of The Results • Conclusion

Fully synthetic data sets (Rubin 1993) X • advantages: - data are fully synthetic • - re-identification of single units almost impossible - all variables are still fully available • disadvantages: - strong dependence on the imputation model • - setting up a model might be difficult/impossible Ynot observed Ysynthetisch Ysynthetisch Ysynthetisch Ysynthetisch Ysynthetic Yobserved

Partially synthetic data sets (Little 1993) • only potentially identifying or sensitive variables are replaced

Partially synthetic data sets (Little 1993) • only potentially identifying or sensitive variables are replaced • advantages: - model dependence decreases • - models are easier to set up • disadvantages: - true values remain in the data set • - disclosure might still be possible

Overview • Multiple Imputation for Statistical Disclosure Control • The IAB Establishment Panel • Application of The Two Approaches • Comparison of the Results • Conclusions

The IAB Establishment Panel • Annually conducted Establishment Survey • Since 1993 in Western Germany, since 1996 in Eastern Germany • Population: All establishments with at least one employee covered by social security • Source: Official Employment Statistics • Response rate of repeatedly interviewed establishments more than 80% • Sample of more than 16.000 establishments in the last wave • Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils

Overview • Multiple Imputation for Statistical Disclosure Control • The IAB Establishment Panel • Application of the Two Approaches • Comparison of the Results • Conclusions

Generating fully synthetic data sets for the IAB Establishment Panel • Create a synthetic data set for selected variables from the wave 1997 from the Establishment Panel • Draw 10 new sample from the Official Employment Statistics using the same sampling design as for the Establishment Panel (Stratification by industry, size, and region) • The number of observations in each sample equals the number of observations in the panel ns=np=7332 • Every sample is imputed ten times using sequential regression • Number of variables from the establishment panel: 48 • Imputations are generated using IVEware by Raghunathan, Solenberger and Hoewyk (2001)

Imputation procedure for partially synthetic data • Only two variables are synthesized: - number of employees • - industry (16 categories) • Same variables for the imputation models • Imputation by sequential regression • Imputation model: - multinomial logit for the industry • - linear model for the cubic root of the nb of employees • - 4 independent linear models defined by quartiles for the establishment size • Imputations based on own coding in R.

Overview • Multiple Imputation for Statistical Disclosure Control • The IAB Establishment Panel • Application of The Two Approaches • Comparison of the Results • Conclusion

Analytical validity • Compare regression results from the original data with results from the synthetic data • First regression: • Zwick (2005) analyses the productivity effects of different continuing vocational training forms in Germany • Probit regression to explain, why firms offer vocational training • 13 Explanatory variables including: Share of qualified employees, establishment size, industry, collective wage agreement, high qualification needs expected… • Second regression: • Log(number of employees) on 15 industry dummies • Two data utility measures: • - Comparison of the beta coefficients from the original data set and the synthetic data sets • - confidence interval overlap

Suggested by Karr et al. (2006) Measure the overlap of CIs from the original data and CIs from the synthetic data The higher the overlap, the higher the data utility Compute the average relative CI overlap for any Confidence interval overlap CI for the synthetic data CI for the original data

Significant at the 0,1 % level Significant at the 1 % level Significant at the 5 % level Results from the first regression (Zwick 2005)

Average confidence interval (CI) overlap for the estimates from the first regression 0,808 0,926 Average overlap

Results from the second regression (log(nb. of employees) on industry) = Significant at the 0,1 % level = Significant at the 1 % level = Significant at the 5 % level = insignificant

Average confidence interval (CI) overlap for the estimates from the second regression 0,699 0,839 Average overlap

Disclosure risk • Difficult to compare between partially and fully synthetic data sets • Disclosure risk is low for fully synthetic data sets, although not zero • DR is higher for partially synthetic data sets, because: • True values remain in the data set • Only survey respondents are included • For partially synthetic data sets a careful disclosure risk evaluation is necessary

Overview • Multiple Imputation for Statistical Disclosure Control • The IAB Establishment Panel • Application of The Two Approaches • Comparison of the Results • Conclusions

Conclusions • Generating synthetic data sets can be a useful method for SDC • Advantages for partially synthetic data sets: • Higher data validity • Imputation models easier to set up • Lower risk of biased imputations • Disadvantages for partially synthetic data sets: • Higher risk of disclosure • Careful disclosure risk evaluation necessary • Agencies will have to decide depending on the complexity of the survey and the expected risk of disclosure

Thank you for your attention

European Conference on Quality in Official Statistics 2008 Rome, 08.-11. July 2008

European Conference on Quality in Official Statistics 2008 Rome, 08.-11. July 2008

Presentation Transcript

Conference on Climate Change, Development and Official Statistics Seoul, 11-12 December 2008

Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

European Conference on Quality in Official Statistics (Q2014 ) Vienna, 5 June 2014

European Conference on Quality in Official Statistics (Q2014)

The European Conference on Quality in Official Statistics Rome, 8-11 July 2008

European Conference on Quality in Official Statistics 2008 July 8 – 11, 2008

Q2014-European Conference on Quality in Official Statistics, Vienna, 2-5 June 2014

Q 2010 European Conference on Quality in Official Statistics Helsinki, May 2010

European Conference on Quality in Official Statistics Session on Quality reporting

European Conference on Quality 2008 in Official Statistics Session on Administrative data.

Q2008 Conference Rome 2008

European Conference on Quality in Official Statistics, Q2010

11 July 2008

Q2008 – European Conference on Quality in Official Statistics Rome • July 9-11, 2008

European Conference on Quality in Official Statistics (Q2014)

Q2010 EUROPEAN CONFERENCE ON QUALITY IN STATISTICS

European conference on quality in official statistics Rome, 8-10 July 2008

European Conference on Quality in Official Statistics Session 26: Quality Issues in Census

European Conference on Quality in Survey Statistics

IAOS Conference on Reshaping Official Statistics Shanghai, 14-16 October 2008

IAOS Conference on Reshaping Official Statistics Shanghai, 14-16 October 2008