A statistical approach to surrogate data
Li-Chun Zhang, Statistics Norway
E-mail: lcz@ssb.no
A setting of surrogate data
• Target data
  • Directly collected, such as in sample surveys
  • For a subset of the population, if available
• Surrogate data
  • To replace target data for statistical purposes, hence “surrogate”
  • For reasons such as cost, burden, scope, etc.
  • Secondary data by nature: re-use of data collected for other purposes
  • Typically from multiple sources
  • Often for the entire population or a major part of it
• Two examples:
  • Register-based employment statistics (surrogate) & LFS (target)
  • Self-administered census (surrogate) & post-census survey (target)
• Issues of concern:
  • Conditions for valid substitution
  • Associated statistical accuracy
Unit-specific approach: Equality vs. equivalence
• Unit-specific approach
  • Scheme
    • Link surrogate and target data at the micro level
    • Estimate the relevant unit-specific misclassification rates
    • Propagate the uncertainty to the statistics of interest
  • Two shortcomings
    • Requires micro-level linkage
    • Unit-specific consistency may be irrelevant or misleading for uses
• Equality vs. equivalence – An example
  • Two binary data sets of the same size
  • Equal mean without the values being equal for all the units
  • Identical empirical CDF => identical statistical inference
  • Unit-level inequality may fail to reveal statistical equivalence
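The equality vs. equivalence example can be made concrete with a minimal sketch. The two lists below are hypothetical toy data, not taken from the presentation: the values disagree for most units, yet the means and the empirical distributions coincide.

```python
from collections import Counter

# Hypothetical toy data: two binary data sets for the same five units.
target = [1, 0, 1, 0, 0]     # e.g. status from the target source
surrogate = [0, 1, 0, 1, 0]  # e.g. status from the surrogate source


def empirical_distribution(values):
    """Return the relative frequency of each distinct value."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


# Share of units whose two values disagree (unit-specific inequality).
disagreement = sum(t != s for t, s in zip(target, surrogate)) / len(target)

# Aggregate comparisons that ignore which unit carries which value.
same_mean = sum(target) / len(target) == sum(surrogate) / len(surrogate)
same_distribution = (
    empirical_distribution(target) == empirical_distribution(surrogate)
)

print(disagreement, same_mean, same_distribution)
```

A unit-by-unit comparison would flag four of the five units as inconsistent, while any statistic computed from the empirical distribution is the same for both data sets.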
Other relevant situations
• Some settings
  • Indirect (proxy) interview
  • Unstable reporting
  • Mode effects
  • Public micro data
• Some observations
  • Unit-specific approach may not be applicable
  • Unit-specific equality may not even be desirable
• To use surrogate data (Z) in place of target data (Y), together with additional data (X)
  • The joint distribution of (Z, Y) given X is not of primary interest for users
  • The issue is the distribution of (Z, X) in place of the distribution of (Y, X)
Validity and equivalence
• Valid surrogate data
  • Denote by f(x,z) and f(x,y) the distribution functions of (X,Z) and (X,Y); Z is valid if, for all (x,y):
    f(X=x, Z=y) = f(X=x) f(Z=y | X=x) = f(X=x) f(Y=y | X=x) = f(X=x, Y=y)
  • Example: X = age-sex grouping, Z = register-employment status, Y = LFS-employment status according to the ILO definition
  • Example: Z = proxy interview in LFS, Y = direct interview in LFS
  • Equality of distribution can be assessed without linked / linkable data
• Empirically equivalent surrogate data
  • Denote by p(x,z; s1) and p(x,y; s2) the empirical distribution functions on s1 and s2; Z is empirically equivalent if, for all (x,y):
    p(X=x; s1) p(Z=y | X=x; s1) = p(X=x; s2) p(Y=y | X=x; s2)
  • Equality on the micro level is not necessary & s1 may differ from s2
  • Parametric analogy: statistical equivalence by the Sufficiency Principle
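As a rough illustration of how empirical equivalence might be checked without linking units, the sketch below compares the empirical joint distribution of (x, z) in one data set with that of (x, y) in another. The data, the function names and the use of a maximum cell difference are all assumptions made for illustration; with real samples, sampling variation would have to be taken into account rather than demanding exact equality.

```python
from collections import Counter


def joint_distribution(pairs):
    """Empirical joint distribution of (x, value) pairs as relative frequencies."""
    n = len(pairs)
    return {key: count / n for key, count in Counter(pairs).items()}


def max_abs_difference(p, q):
    """Largest absolute cell difference between two empirical distributions."""
    cells = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Hypothetical data: (x, z) observed in s1 and (x, y) observed in s2.
# The two data sets cover different units and have different sizes,
# so no micro-level linkage is possible or needed.
s1 = [("young", 1), ("young", 0), ("old", 1), ("old", 1)]
s2 = [("young", 1), ("young", 0), ("old", 1), ("old", 1),
      ("young", 0), ("young", 1), ("old", 1), ("old", 1)]

p_xz = joint_distribution(s1)
p_xy = joint_distribution(s2)
gap = max_abs_difference(p_xz, p_xy)

print(gap)
```

Only the cell frequencies enter the comparison, which is why s1 and s2 need not contain the same units.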
Similar ideas in disclosure control literature
• Fienberg et al. (1998)
  • Random generation of “pseudo” micro data Z conditional on {x; s}
  • Parametric f(y | x) or empirical p(y | x)
  • Conditional validity in expectation, provided unbiased estimation
• Rubin (1993)
  • Synthetic data & Bayesian multiple-imputation framework
  • Random generation of population data + sampling
  • No particular emphasis on conditioning & validity instead of equivalence
• SARs
  • Sample of Anonymised Records from census data
  • Real data, albeit anonymised
  • Valid surrogate data
• Micro simulation
  • Based on sample instead of census data
  • Random generation of “imaginary” micro data
  • Validity in expectation, provided unbiased estimation of the distribution
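The idea of generating pseudo micro data conditional on x from an empirical p(y | x) can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any of the cited authors: the sample, the function names and the choice of drawing with replacement within each x-group are hypothetical.

```python
import random
from collections import defaultdict


def conditional_pools(sample):
    """Group observed y-values by x: the empirical p(y | x) in the sample."""
    pools = defaultdict(list)
    for x, y in sample:
        pools[x].append(y)
    return pools


def generate_pseudo_values(xs, pools, rng):
    """Draw one pseudo value per x from the observed y-values sharing that x."""
    return [rng.choice(pools[x]) for x in xs]


# Hypothetical sample of (x, y) pairs.
sample = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0)]
pools = conditional_pools(sample)

rng = random.Random(0)  # fixed seed so that the run is reproducible
xs = [x for x, _ in sample]
z = generate_pseudo_values(xs, pools, rng)

print(list(zip(xs, z)))
```

Each pseudo value is drawn only from values observed for the same x, so the generated data reproduce p(y | x) in expectation while no record is tied to an actual unit's y-value.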
Some applications / implications?
• Statistics and inference based on surrogate data
  • Validity (or equivalence) vs. efficiency (or accuracy)
  • Example: Employment register (ER) vs. LFS
    • Deterministic ER-status by editing rules vs. valid ER-status for specific purposes
    • Bias of invalid ER-status vs. variance of valid ER-status
    • Balance in the trade-off may change direction on more detailed levels
• Micro data for public use
  • Targeting full empirical equivalence, followed by disclosure control (DC)
  • Equivalent data targeting coarsened information (embedded DC)
• Micro calibration of surrogate data
  • Secondary population (U) data (X, Z1, Z2, …, Zk; U) & target sample data (X, Y1; s1), (X, Y2; s2), …, (X, Yk; sk), with different units in general
  • Surrogate data (X*, Z1*, Z2*, …, Zk*) with marginal validity between (X; U) and (X*; U), (X*, Z1*; U) and (X, Y1; s1), …, (X*, Zk*; U) and (X, Yk; sk)
  • Conditional surrogate data (X, Z1*, Z2*, …, Zk*) with marginal validity between (X, Z1*; U) and (X, Y1; s1), …, (X, Zk*; U) and (X, Yk; sk)?
  • Alternative to statistical matching by the Conditional Independence Assumption
  • Uncertainty?
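The bias vs. variance trade-off mentioned for the ER example can be illustrated with a small calculation. Everything here is an assumption made for illustration: the numbers are hypothetical, the deterministic status is treated as having a fixed bias and negligible variance, and the valid status is treated as unbiased with a binomial-type variance p(1-p)/n that grows as the domain gets smaller.

```python
def mse_biased(bias):
    """MSE of an estimator with the given bias and negligible variance."""
    return bias ** 2


def mse_unbiased(p, n):
    """MSE of an unbiased proportion estimator with variance p(1-p)/n."""
    return p * (1 - p) / n


# Hypothetical numbers, chosen only for illustration.
p = 0.5      # true proportion
bias = 0.02  # bias of the invalid (deterministic) status

for n in (10000, 100):  # aggregate level vs. a detailed domain
    if mse_unbiased(p, n) < mse_biased(bias):
        better = "valid"
    else:
        better = "deterministic"
    print(n, mse_biased(bias), mse_unbiased(p, n), better)
```

Under these assumptions the valid status has the smaller mean squared error at the aggregate level, and the comparison reverses for the small domain.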