
A statistical approach to surrogate data

Li-Chun Zhang, Statistics Norway. E-mail: lcz@ssb.no


Presentation Transcript


  1. A statistical approach to surrogate data
     Li-Chun Zhang, Statistics Norway
     E-mail: lcz@ssb.no

  2. A setting of surrogate data
     • Target data
       • Directly collected, such as in sample surveys
       • For a subset of the population, if available
     • Surrogate data
       • To replace target data for statistical purposes, hence “surrogate”
       • For reasons such as cost, burden, scope, etc.
       • Secondary in nature: re-use of data collected for other purposes
       • Typically from multiple sources
       • Often for the entire population or a major part of it
     • Two examples:
       • Register-based employment statistics (surrogate) & LFS (target)
       • Self-administered census (surrogate) & post-census survey (target)
     • Issues of concern:
       • Conditions for valid substitution
       • Associated statistical accuracy

  3. Unit-specific approach: Equality vs. equivalence
     • Unit-specific approach
       • Scheme
         • Link surrogate and target data at the micro level
         • Estimate the relevant unit-specific misclassification rates
         • Propagate the uncertainty to the statistics of interest
       • Two shortcomings
         • Requires micro-level linkage
         • Unit-specific consistency may be irrelevant or misleading for uses
     • Equality vs. equivalence – An example
       • Two binary data sets of the same size
       • Equal mean without the values being equal for all the units
       • Identical empirical CDF => identical statistical inference
       • Unit-level inequality may fail to reveal statistical equivalence
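
The equality vs. equivalence example on this slide can be illustrated with a minimal Python sketch. The two lists below are hypothetical values invented for illustration, not data from the presentation. They disagree on most units, yet their empirical distributions coincide.

```python
from collections import Counter

# Hypothetical binary values for the same five units (illustrative only).
target = [1, 0, 1, 0, 0]      # e.g. status from the target source
surrogate = [0, 1, 0, 1, 0]   # e.g. status from the surrogate source


def empirical_distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}


# Unit-level agreement: share of units whose two values coincide.
agreement = sum(t == s for t, s in zip(target, surrogate)) / len(target)

# Distribution-level comparison: the two empirical distributions as a whole.
same_distribution = empirical_distribution(target) == empirical_distribution(surrogate)

print("unit-level agreement:", agreement)
print("identical empirical distribution:", same_distribution)
```

Here only one unit in five has equal values, while any statistic that depends on the data only through the empirical distribution (such as the mean) is the same for both lists.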

  4. Other relevant situations
     • Some settings
       • Indirect (proxy) interview
       • Unstable reporting
       • Mode effects
       • Public micro data
     • Some observations
       • Unit-specific approach may not be applicable
       • Unit-specific equality may not even be desirable
     • To use surrogate data (Z) in place of target data (Y), together with additional data (X)
       • Joint distribution of (Z,Y|X) is not of primary interest for users
       • Distribution of (Z,X) instead of distribution of (Y,X) is the issue

  5. Validity and equivalence
     • Valid surrogate data
       • Denote by f(x,y) and f(x,z) the distribution functions:
         f(X=x, Z=y) = f(X=x) f(Z=y | X=x) = f(X=x) f(Y=y | X=x) = f(X=x, Y=y)
       • Example: X = age-sex grouping, Z = register-employment status, Y = LFS-employment status according to ILO-definition
       • Example: Z = proxy-interview in LFS, Y = direct-interview in LFS
       • Equality of distribution can be assessed without linked / linkable data
     • Empirically equivalent surrogate data
       • Denote by p(x,z; s1) and p(x,y; s2) the empirical distribution functions:
         p(X=x; s1) p(Z=y | X=x; s1) = p(X=x; s2) p(Y=y | X=x; s2)
       • Equality on micro level not necessary & s1 may differ from s2
       • Parametric analogy: Statistical equivalence by Sufficiency Principle
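
The empirical equivalence condition on this slide compares p(x,z; s1) with p(x,y; s2) cell by cell, which needs no linkage between s1 and s2. The sketch below shows one way to do that comparison in Python. The records, the group labels and the helper names `empirical_joint` and `max_abs_difference` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def empirical_joint(pairs):
    """Empirical joint distribution p(x, v) from a list of (x, v) records."""
    counts = Counter(pairs)
    n = len(pairs)
    return {cell: count / n for cell, count in counts.items()}


def max_abs_difference(p, q):
    """Largest absolute cell-wise gap between two empirical distributions."""
    cells = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Hypothetical records; s1 and s2 hold different units and have different sizes.
s1 = [("young", 1), ("young", 0), ("old", 1), ("old", 1)]        # (x, z)
s2 = [("young", 0), ("old", 1), ("young", 1), ("old", 1)] * 2    # (x, y)

p_xz = empirical_joint(s1)
p_xy = empirical_joint(s2)
gap = max_abs_difference(p_xz, p_xy)
print("max cell-wise gap:", gap)
```

A gap of zero corresponds to exact empirical equivalence. With real samples some sampling variation would be expected, so how large a gap is tolerable is a separate question that this sketch does not address.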

  6. Similar ideas in disclosure control literature
     • Fienberg et al. (1998)
       • Random generation of “pseudo” micro data Z conditional on {x; s}
       • Parametric f(y | x) or empirical p(y | x)
       • Conditional validity in expectation, provided unbiased estimation
     • Rubin (1993)
       • Synthetic data & Bayesian multiple-imputation framework
       • Random generation of population data + sampling
       • No particular emphasis on conditioning & validity instead of equivalence
     • SARs
       • Sample of Anonymised Records from census data
       • Real data albeit anonymised
       • Valid surrogate data
     • Micro simulation
       • Based on sample instead of census data
       • Random generation of “imaginary” micro data
       • Validity in expectation provided unbiased estimation of distribution
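
The idea of generating pseudo micro data Z conditional on the observed x, using the empirical p(y | x), can be sketched as follows. This is a simplified illustration under stated assumptions and not the procedure of any of the cited authors. The sample records and the function names `fit_conditional` and `generate_pseudo` are hypothetical.

```python
import random
from collections import defaultdict


def fit_conditional(records):
    """Collect the observed y values within each x group: empirical p(y | x)."""
    by_x = defaultdict(list)
    for x, y in records:
        by_x[x].append(y)
    return by_x


def generate_pseudo(records, rng):
    """Keep each x and draw z at random from the observed y values with that x."""
    by_x = fit_conditional(records)
    return [(x, rng.choice(by_x[x])) for x, _ in records]


# Hypothetical sample s of (x, y) records (illustrative only).
sample = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
pseudo = generate_pseudo(sample, rng)
print(pseudo)
```

Drawing uniformly from the observed y values within an x group is the same as sampling from the empirical p(y | x), so the x values are preserved exactly while the generated z values match the conditional distribution only in expectation.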

  7. Some applications / implications?
     • Statistics and inference based on surrogate data
       • Validity (or equivalence) vs. efficiency (or accuracy)
       • Example: Employment register (ER) vs. LFS
         • Deterministic ER-status by editing rules vs. valid ER-status for specific purposes
         • Bias of invalid ER-status vs. variance of valid ER-status
         • Balance in trade-off may change direction on more detailed levels
     • Micro data for public use
       • Targeting full empirical equivalence followed by disclosure control (DC)
       • Equivalent data targeting at coarsened information (embedded DC)
     • Micro calibration of surrogate data
       • Secondary population (U) data (X, Z1, Z2, …, Zk; U) & target sample data (X, Y1; s1), (X, Y2; s2), …, (X, Yk; sk), with different units in general
       • Surrogate data (X*, Z1*, Z2*, …, Zk*) with marginal validity between (X; U) and (X*; U), (X*, Z1*; U) and (X, Y1; s1), …, (X*, Zk*; U) and (X, Yk; sk)
       • Conditional surrogate data (X, Z1*, Z2*, …, Zk*) with marginal validity between (X, Z1*; U) and (X, Y1; s1), …, (X, Zk*; U) and (X, Yk; sk)?
       • Alternative to statistical matching by Conditional Independence Assumption
       • Uncertainty?
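
The bias vs. variance point in the employment register example can be made concrete with the standard decomposition of mean squared error into squared bias plus variance. The numbers below are hypothetical and chosen only to show one way the balance could change direction between an aggregate level and a more detailed level. They are not results from the presentation.

```python
def mse(bias, variance):
    """Mean squared error as squared bias plus variance."""
    return bias ** 2 + variance


# Hypothetical values (illustrative assumptions, not estimates from real data).
# A biased status with no random variation, assumed the same at both levels.
biased_status = mse(bias=0.02, variance=0.0)

# An unbiased status whose variance is assumed to grow at the detailed level.
unbiased_aggregate = mse(bias=0.0, variance=0.0001)
unbiased_detailed = mse(bias=0.0, variance=0.0025)

print("aggregate level favours the unbiased status:", unbiased_aggregate < biased_status)
print("detailed level favours the biased status:", biased_status < unbiased_detailed)
```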
