Improved Variance Estimation for Fully Synthetic Datasets

Improved Variance Estimation for Fully Synthetic Datasets UNECE Work Session on Statistical Data Confidentiality 27. October 2011, Tarragona Jörg Drechsler Institute for Employment Research

Fully synthetic datasets • Originally proposed by Rubin (1993) • Closely related to the idea of multiple imputation for nonresponse • All values of the original dataset are replaced by synthetic values • Offer a very high level of data protection • Attractive for very sensitive data such as healthcare data

Fully synthetic datasets in theory X Ynot observed Ysynthetisch Ysynthetisch Ysynthetisch Ysynthetisch Ysynthetic Yobserved

Fully synthetic datasets in practice • Based on the original design, the synthetic populations consist of a large number of synthetic records and a small number of original records. • There is a small chance that the released samples from these populations also contain original records. • Main advantage of fully synthetic datasets is lost • In practice, intermediate step of generating populations is omitted • Synthetic samples are generated directly • All records are synthetic

Combining rules for fully synthetic datasets • Raghunathan et al. (2003) developed the combining rules necessary to obtain valid inferences from fully synthetic datasets • Let be the point estimate obtained from dataset • Let be the estimated variance of • The following quantities are needed for inference

Combining rules for fully synthetic datasets • Final point estimate • Final variance estimate • Two major disadvantages: • Variance estimate strictly valid only for the original synthesis design • Variance estimate can be negative • Reiter (2003) suggested an adjusted variance estimate that is always positive but conservative

Alternative variance estimate • Closely related to the variance estimate for partially synthetic datasets • Only need to adjust for the potentially different sample sizes between the original sample and the synthetic sample where is the finite population correction factor for the original sample • Advantages • Can never be negative • Valid even if all records are synthesized • Disadvantages: • Only valid for - consistent estimators • Only valid under simple random sampling

Illustrative simulations • Repeated simulation design • One standard normal variable • Population size N=10,000 • Repeatedly draw SRS of different sizes (1%, 5%, 10%, 20%) • Generate two versions of synthetic data with nsyn=2norgand m=5,20,100 • Based on original synthesis design (RRR approach) • Synthesizing all records directly (practical approach) • Quantity of interest • Compute the variance estimates and under both synthesis designs • Replicate 5,000 times

Conclusions • Originally proposed variance estimate can be biased if all records are synthesized and the sampling rate is larger than 1%. • Alternative variance estimate • shows less variability than the original variance estimate • can never be negative • is always unbiased irrespective of the synthesis design • Alternative variance estimate is valid only • for –consistent estimates • under simple random sampling • Future work: Think about adjustments for complex sampling designs

Thank you for your attention

Improved Variance Estimation for Fully Synthetic Datasets

Improved Variance Estimation for Fully Synthetic Datasets

Presentation Transcript

Variance Estimation with Imputed Data

Improved Orbit Estimation Using GPS Measurements for Conjunction Analysis

Variance Estimation in Complex Surveys

Improved estimation of covariance matrix for portfolio optimization

Improved Estimation of Forestry Edge Effects Accounting for Detection Probability

Chapter 2 Minimum Variance Unbiased estimation

Technology for the production of fully synthetic aviation fuels, diesel and gasoline

NLS Estimation of the General Variance Model

Improved estimation of covariance matrix for portfolio optimization

Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR

Variance Estimation in EU-SILC Survey

Datasets

Replicate Variance Estimation and High Entropy Variance Approximation

Sigma for Variance

Variance Estimation

8.3 MINIMUM VARIANCE SPECTRUM ESTIMATION

Why need for improved estimation of line reliability?

Variance Estimation in Complex Surveys

Output Analysis: Variance Estimation

Section 7.5 Estimation of a Population Variance

SMOK Nord 2 Review- Fully Improved!