130 likes | 149 Views
Explore the theory and practice of fully synthetic datasets, combining rules for inferences, alternative variance estimation, and illustrative simulations for valid conclusions.
E N D
Improved Variance Estimation for Fully Synthetic Datasets UNECE Work Session on Statistical Data Confidentiality 27. October 2011, Tarragona Jörg Drechsler Institute for Employment Research
Fully synthetic datasets • Originally proposed by Rubin (1993) • Closely related to the idea of multiple imputation for nonresponse • All values of the original dataset are replaced by synthetic values • Offer a very high level of data protection • Attractive for very sensitive data such as healthcare data
Fully synthetic datasets in theory X Ynot observed Ysynthetisch Ysynthetisch Ysynthetisch Ysynthetisch Ysynthetic Yobserved
Fully synthetic datasets in practice • Based on the original design, the synthetic populations consist of a large number of synthetic records and a small number of original records. • There is a small chance that the released samples from these populations also contain original records. • Main advantage of fully synthetic datasets is lost • In practice, intermediate step of generating populations is omitted • Synthetic samples are generated directly • All records are synthetic
Combining rules for fully synthetic datasets • Raghunathan et al. (2003) developed the combining rules necessary to obtain valid inferences from fully synthetic datasets • Let be the point estimate obtained from dataset • Let be the estimated variance of • The following quantities are needed for inference
Combining rules for fully synthetic datasets • Final point estimate • Final variance estimate • Two major disadvantages: • Variance estimate strictly valid only for the original synthesis design • Variance estimate can be negative • Reiter (2003) suggested an adjusted variance estimate that is always positive but conservative
Alternative variance estimate • Closely related to the variance estimate for partially synthetic datasets • Only need to adjust for the potentially different sample sizes between the original sample and the synthetic sample where is the finite population correction factor for the original sample • Advantages • Can never be negative • Valid even if all records are synthesized • Disadvantages: • Only valid for - consistent estimators • Only valid under simple random sampling
Illustrative simulations • Repeated simulation design • One standard normal variable • Population size N=10,000 • Repeatedly draw SRS of different sizes (1%, 5%, 10%, 20%) • Generate two versions of synthetic data with nsyn=2norgand m=5,20,100 • Based on original synthesis design (RRR approach) • Synthesizing all records directly (practical approach) • Quantity of interest • Compute the variance estimates and under both synthesis designs • Replicate 5,000 times
Conclusions • Originally proposed variance estimate can be biased if all records are synthesized and the sampling rate is larger than 1%. • Alternative variance estimate • shows less variability than the original variance estimate • can never be negative • is always unbiased irrespective of the synthesis design • Alternative variance estimate is valid only • for –consistent estimates • under simple random sampling • Future work: Think about adjustments for complex sampling designs