Improved Variance Estimation for Fully Synthetic Datasets. UNECE Work Session on Statistical Data Confidentiality, 27 October 2011, Tarragona. Jörg Drechsler, Institute for Employment Research.
Fully synthetic datasets • Originally proposed by Rubin (1993) • Closely related to the idea of multiple imputation for nonresponse • All values of the original dataset are replaced by synthetic values • Offer a very high level of data protection • Attractive for very sensitive data such as healthcare data
Fully synthetic datasets in theory [Figure: schematic with the labels X, Y_not observed, Y_observed, and several repeated Y_synthetic blocks]
Fully synthetic datasets in practice • Based on the original design, the synthetic populations consist of a large number of synthetic records and a small number of original records. • There is a small chance that the released samples from these populations also contain original records. • Main advantage of fully synthetic datasets is lost • In practice, intermediate step of generating populations is omitted • Synthetic samples are generated directly • All records are synthetic
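Combining rules for fully synthetic datasets • Raghunathan et al. (2003) developed the combining rules necessary to obtain valid inferences from fully synthetic datasets • Let $q^{(i)}$ be the point estimate obtained from synthetic dataset $i$, $i = 1, \dots, m$ • Let $u^{(i)}$ be the estimated variance of $q^{(i)}$ • The following quantities are needed for inference: $\bar{q}_m = \frac{1}{m} \sum_{i=1}^{m} q^{(i)}$, $b_m = \frac{1}{m-1} \sum_{i=1}^{m} (q^{(i)} - \bar{q}_m)^2$, $\bar{u}_m = \frac{1}{m} \sum_{i=1}^{m} u^{(i)}$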
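Combining rules for fully synthetic datasets • Final point estimate: $\bar{q}_m$, the average of the $m$ point estimates • Final variance estimate: $T_f = (1 + 1/m)\, b_m - \bar{u}_m$, where $b_m$ is the variance of the point estimates across the synthetic datasets and $\bar{u}_m$ is the average of their estimated variances • Two major disadvantages: • Variance estimate is strictly valid only for the original synthesis design • Variance estimate can be negative • Reiter (2003) suggested an adjusted variance estimate that is always positive but conservative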
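The combining rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard forms from Raghunathan et al. (2003): the final point estimate is the mean of the per-dataset estimates and the variance estimate is $(1 + 1/m)\, b_m - \bar{u}_m$. The function name `combine_fully_synthetic` and the toy inputs are hypothetical and chosen only to show how the estimate can turn negative.

```python
import numpy as np


def combine_fully_synthetic(q, u):
    """Combine m point estimates q and their estimated variances u.

    Assumes the combining rules of Raghunathan et al. (2003);
    returns (q_bar, b_m, u_bar, t_f).
    """
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")

    q_bar = q.mean()        # final point estimate
    b_m = q.var(ddof=1)     # between-dataset variance of the estimates
    u_bar = u.mean()        # average within-dataset variance
    t_f = (1.0 + 1.0 / m) * b_m - u_bar  # may be negative
    return q_bar, b_m, u_bar, t_f


# Toy inputs (illustrative only): little spread between datasets,
# comparatively large within-dataset variances.
q_bar, b_m, u_bar, t_f = combine_fully_synthetic(
    [1.0, 1.1, 0.9, 1.0, 1.0], [0.02] * 5
)
print(q_bar, b_m, u_bar, t_f)
```

With these toy numbers the subtraction of the average within-dataset variance outweighs the between-dataset term, so the variance estimate comes out below zero, which is the second disadvantage listed on the slide.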
Alternative variance estimate • Closely related to the variance estimate for partially synthetic datasets • Only need to adjust for the potentially different sample sizes between the original sample and the synthetic sample • The adjustment involves the finite population correction factor for the original sample • Advantages: • Can never be negative • Valid even if all records are synthesized • Disadvantages: • Only valid for $\sqrt{n}$-consistent estimators • Only valid under simple random sampling
Illustrative simulations • Repeated simulation design • One standard normal variable • Population size N = 10,000 • Repeatedly draw simple random samples (SRS) of different sizes (1%, 5%, 10%, 20%) • Generate two versions of synthetic data with n_syn = 2 n_org and m = 5, 20, 100 • Based on the original synthesis design (RRR approach) • Synthesizing all records directly (practical approach) • Quantity of interest • Compute the original and the alternative variance estimates under both synthesis designs • Replicate 5,000 times
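A rough sketch of the "practical approach" arm of such a simulation is given below. Several pieces are assumptions not stated on the slide: the quantity of interest is taken to be the population mean, the synthesis model is a normal model with the usual noninformative-prior posterior draws, the within-dataset variance omits a finite population correction, and only the original variance estimate $(1 + 1/m)\, b_m - \bar{u}_m$ is computed. The function name `simulate_direct_synthesis` and the default of 500 replications (instead of 5,000) are hypothetical choices that keep the run short.

```python
import numpy as np


def simulate_direct_synthesis(N=10_000, rate=0.10, m=5, reps=500, seed=0):
    """Sketch: synthesize all records directly and track the original
    variance estimate T_f for the mean (assumed quantity of interest)."""
    rng = np.random.default_rng(seed)
    population = rng.standard_normal(N)   # one standard normal variable
    n_org = int(round(rate * N))          # original sample size
    n_syn = 2 * n_org                     # n_syn = 2 * n_org as on the slide

    q_bars = np.empty(reps)
    t_fs = np.empty(reps)
    for r in range(reps):
        # Simple random sample without replacement from the population.
        sample = rng.choice(population, size=n_org, replace=False)
        y_bar, s2 = sample.mean(), sample.var(ddof=1)

        # Assumed synthesis model: one posterior draw of (mu, sigma^2)
        # per synthetic dataset, then n_syn values from N(mu, sigma^2).
        sigma2 = (n_org - 1) * s2 / rng.chisquare(n_org - 1, size=m)
        mu = rng.normal(y_bar, np.sqrt(sigma2 / n_org))
        syn = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None],
                         size=(m, n_syn))

        q = syn.mean(axis=1)                  # point estimate per dataset
        u = syn.var(axis=1, ddof=1) / n_syn   # its variance, no fpc (assumption)

        q_bars[r] = q.mean()
        t_fs[r] = (1.0 + 1.0 / m) * q.var(ddof=1) - u.mean()

    return {
        "empirical_var": q_bars.var(ddof=1),  # variance of the final estimate
        "mean_T_f": t_fs.mean(),              # average of the variance estimate
        "share_negative": float(np.mean(t_fs < 0)),
    }


result = simulate_direct_synthesis(reps=200)
print(result)
```

Comparing `mean_T_f` with `empirical_var` for several values of `rate` and `m` is one way to look for the behaviour the slides describe. The sketch does not attempt to reproduce the figures from the talk.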
Conclusions • Originally proposed variance estimate can be biased if all records are synthesized and the sampling rate is larger than 1%. • Alternative variance estimate • shows less variability than the original variance estimate • can never be negative • is always unbiased irrespective of the synthesis design • Alternative variance estimate is valid only • for $\sqrt{n}$-consistent estimates • under simple random sampling • Future work: adjustments for complex sampling designs