Marcello D’Orazio ( madorazi@istat.it ) UNECE - Work Session on Statistical Data Editing

Statistical Matching and Imputation of Survey Data with the Package “StatMatch” for the R Environment
Marcello D’Orazio (madorazi@istat.it)
UNECE - Work Session on Statistical Data Editing
Ljubljana, Slovenia, 9-11 May 2011

What is Statistical Matching?
Statistical Matching (SM, also called data fusion or synthetic matching) comprises a set of methods for integrating two or more data sources that refer to the same target population.
Basic SM framework:
• the X variables are common to both data sources, A and B
• Y and Z are NOT jointly observed
• the chance of observing the same unit in both A and B is close to zero

Objectives of Statistical Matching
• micro: derive a “synthetic” data set containing X, Y and Z
• macro: estimate parameters of interest, such as the correlation coefficient between Y and Z or the frequencies in the contingency table Y vs. Z

The package StatMatch for the R environment
• “StatMatch” provides R functions that implement several Statistical Matching methods.
• It generalizes and optimizes the code provided with the monograph on SM by D’Orazio et al. (2006).
• The first version of StatMatch (version 0.4) was released on CRAN (Comprehensive R Archive Network) in 2008.
• Version 1.0.1 was released at the beginning of 2011; it significantly improves on the functionality of the previous version (0.8, released in 2009).
• http://cran.at.r-project.org/web/packages/StatMatch/index.html
• Package available for: MS Windows (32 and 64 bit), Linux, Mac

Functions in StatMatch
Five main groups of functions:
• functions to perform nonparametric SM at micro level by means of hot deck imputation (NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck);
• a function to perform mixed SM at macro or micro level for continuous variables (mixed.mtc);
• functions to integrate data from complex sample surveys through calibration of weights, as proposed by Renssen (1998) (harmonize.x and comb.samples);
• functions to explore uncertainty on the contingency table Y vs. Z (Frechet.bounds.cat and Fbwidths.by.x);
• other functions to compute distances (gower.dist and maximum.dist), to create the synthetic data set (create.fused), etc.

SM via hot deck imputation
• NND.hotdeck(): nearest neighbour distance hot deck
  - many distance functions
  - imputation classes
  - constrained or unconstrained
• RANDwNND.hotdeck(): random hot deck and some variants
  - random hot deck in classes
  - random hot deck in “moving” classes
  - it is possible to use weights
• rankNND.hotdeck(): nearest neighbour hot deck with the distance computed on the percentage points of the empirical cumulative distribution function of X
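The idea behind an unconstrained nearest neighbour distance hot deck can be sketched in a few lines of Python. This is not StatMatch code: the function names, the Manhattan distance, the tie-breaking rule and the toy data are all assumptions made for illustration, and A is assumed to be the recipient file (lacking Z) while B is the donor file.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length numeric tuples."""
    return sum(abs(u - v) for u, v in zip(a, b))


def nnd_hotdeck(recipients, donors, dist=manhattan):
    """For each recipient record, return the index of its closest donor.

    Unconstrained version: the same donor may be selected more than once.
    Ties are resolved by taking the first closest donor (an assumption).
    """
    matches = []
    for rec in recipients:
        distances = [dist(rec, don) for don in donors]
        matches.append(distances.index(min(distances)))
    return matches


# Hypothetical matching variables X = (age, household size), observed in both files.
rec_x = [(25, 1), (40, 3), (63, 2)]            # recipient file A (Z is missing)
don_x = [(60, 2), (27, 1), (38, 4), (45, 3)]   # donor file B
don_z = [1200, 800, 1500, 1700]                # Z, observed only in B

ids = nnd_hotdeck(rec_x, don_x)
fused_z = [don_z[i] for i in ids]              # live Z values donated to A
```

A constrained variant would additionally limit how often each donor can be used, which turns the search into an assignment problem.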

Mixed SM methods
• mixed.mtc(): mixed SM methods for continuous variables, consisting of two steps:
  - regression step: fits the regression models Y vs. X and Z vs. X
  - matching step: fills A with units chosen by means of a constrained distance hot deck, with distances computed on intermediate (regression-based) and live (observed) values of Y and Z
• two methods to estimate the regression parameters: ML and Moriarity & Scheuren (2001)
• possibility of introducing auxiliary information about the correlation coefficient between Y and Z
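A heavily simplified Python sketch of the two-step logic follows. It is not the mixed.mtc() algorithm: it assumes a single X and a single Z, omits the random residual and the Y variable, and uses a one-dimensional constrained match between files of equal size. All names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def constrained_match_1d(intermediate, live):
    """Pair every intermediate value with a distinct live value.

    Assumes both lists have the same length. With absolute difference as
    the distance, pairing the values by rank minimises the total distance.
    Returns donor indices aligned with `intermediate`.
    """
    rec_order = sorted(range(len(intermediate)), key=lambda i: intermediate[i])
    don_order = sorted(range(len(live)), key=lambda j: live[j])
    match = [None] * len(intermediate)
    for i, j in zip(rec_order, don_order):
        match[i] = j
    return match


# Hypothetical data: X in both files, Z observed only in B.
a_x = [2.5, 0.5, 1.5]
b_x = [3.0, 1.0, 2.0]
b_z = [6.5, 2.0, 4.5]

# Step 1 (regression): fit Z vs. X on B, then predict intermediate Z values for A.
intercept, slope = fit_line(b_x, b_z)
a_z_intermediate = [intercept + slope * x for x in a_x]

# Step 2 (matching): each record of A receives a live Z value from a distinct donor.
match = constrained_match_1d(a_z_intermediate, b_z)
a_z_live = [b_z[j] for j in match]
```

The point of the second step is that A ends up with values actually observed in B rather than with model predictions.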

SM of data from complex sample surveys
Renssen’s (1998) approach is based on a series of calibration steps applied to the survey weights of A and B and, if available, of a third data source C (C may contain Y and Z, or X, Y and Z).
• harmonize.x(): harmonizes the joint/marginal distribution of the X variables in A and B
• comb.samples(): estimates the contingency table Y vs. Z, using the auxiliary information in C when available:
  - Conditional Independence Assumption
  - incomplete two-way stratification
  - synthetic two-way stratification
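Under the Conditional Independence Assumption, the Y vs. Z table follows from P(y, z) = Σ_x P(x) P(y | x) P(z | x). The Python sketch below illustrates only this formula with weighted records. It is not comb.samples() and performs no calibration: it assumes the weights of A and B are already harmonized, that every X category occurs in both files, and that P(x) is taken from A. The record layout and data are hypothetical.

```python
from collections import defaultdict


def weighted_conditional(records):
    """Weighted estimate of P(value | x) from (x, value, weight) records.

    Returns (conditional distributions, weighted total per x category).
    """
    totals = defaultdict(float)
    cells = defaultdict(float)
    for x, value, w in records:
        totals[x] += w
        cells[(x, value)] += w
    cond = defaultdict(dict)
    for (x, value), w in cells.items():
        cond[x][value] = w / totals[x]
    return cond, totals


def cia_table(sample_a, sample_b):
    """Estimate P(Y=y, Z=z) as the sum over x of P(x) * P(y|x) * P(z|x)."""
    p_y_x, totals_a = weighted_conditional(sample_a)
    p_z_x, _ = weighted_conditional(sample_b)
    grand_total = sum(totals_a.values())
    table = defaultdict(float)
    for x, total in totals_a.items():
        p_x = total / grand_total          # P(x) estimated from A (assumption)
        for y, py in p_y_x[x].items():
            for z, pz in p_z_x[x].items():
                table[(y, z)] += p_x * py * pz
    return dict(table)


# Hypothetical weighted records: (x, y, weight) in A and (x, z, weight) in B.
sample_a = [("urban", "employed", 60.0), ("urban", "not_employed", 40.0),
            ("rural", "employed", 50.0), ("rural", "not_employed", 50.0)]
sample_b = [("urban", "high", 30.0), ("urban", "low", 70.0),
            ("rural", "high", 20.0), ("rural", "low", 80.0)]

table_yz = cia_table(sample_a, sample_b)
```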

Exploring uncertainty due to the basic SM framework
• Frechet.bounds.cat(): derives the uncertainty bounds for the frequencies in the contingency table Y vs. Z, starting from the marginal tables X vs. Y and X vs. Z
• Fbwidths.by.x(): explores how the various possible subsets of the X variables contribute to reducing the uncertainty on the cells of Y vs. Z
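The bounds in question are the Fréchet bounds, applied within each category of X and then averaged over the distribution of X: max(0, P(y|x) + P(z|x) - 1) ≤ P(y, z | x) ≤ min(P(y|x), P(z|x)). The Python sketch below illustrates that computation only. It is not the Frechet.bounds.cat() implementation, and the input layout and probabilities are hypothetical.

```python
def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Lower and upper bounds for the joint probabilities P(Y=y, Z=z).

    p_x:          {x: P(X=x)}
    p_y_given_x:  {x: {y: P(Y=y | X=x)}}
    p_z_given_x:  {x: {z: P(Z=z | X=x)}}
    Returns {(y, z): (lower, upper)}.
    """
    bounds = {}
    for x, px in p_x.items():
        for y, py in p_y_given_x[x].items():
            for z, pz in p_z_given_x[x].items():
                low, up = bounds.get((y, z), (0.0, 0.0))
                low += px * max(0.0, py + pz - 1.0)   # conditional lower bound
                up += px * min(py, pz)                # conditional upper bound
                bounds[(y, z)] = (low, up)
    return bounds


# Hypothetical distributions estimated from A (X vs. Y) and B (X vs. Z).
p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {"y1": 0.8, "y2": 0.2}, "b": {"y1": 0.4, "y2": 0.6}}
p_z_given_x = {"a": {"z1": 0.7, "z2": 0.3}, "b": {"z1": 0.5, "z2": 0.5}}

bounds = frechet_bounds(p_x, p_y_given_x, p_z_given_x)
lower_y1z1, upper_y1z1 = bounds[("y1", "z1")]
```

In this toy example the interval for the cell (y1, z1) is narrower than the one obtained from the marginal distributions of Y and Z alone, which is the kind of reduction in uncertainty that conditioning on X is meant to achieve.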

Computational Efficiency
• All the functions in StatMatch are written in R code, with no calls to external compiled code (C or Fortran):
  “Interpreted languages (Matlab, R, Python, Lisp) are fun ... but slow. Compiled languages (machine code, assembly, FORTRAN, C, Java) are fast… but are work (= no fun)” Mizera (2006)
• Artificial data: A contains 14,000 obs.; B contains about 54,000 obs.
• PC with CPU Pentium IV 3GHz, 3GB RAM, MS Windows XP Prof. (SP 3; 32bit)

Warning!
“Although abusing R was not proved to be addictive, it should be noted that it often leads to harder stuff” Mizera (2006)
Thank You for Your attention!

Some References
D’Orazio, M. (2009) StatMatch: Statistical Matching. R package version 1.0.1. http://CRAN.R-project.org/package=StatMatch
D’Orazio, M., Di Zio, M., and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley and Sons, Chichester.
Mizera, I. (2006) “Graphical Exploratory Analysis Using Halfspace Depth”. Presentation at “useR!, The R User Conference 2006”, Wien, 15-17 June 2006.
Moriarity, C., and Scheuren, F. (2001) “Statistical matching: a paradigm for assessing the uncertainty in the procedure”. Journal of Official Statistics, 17, pp. 407-422.
Renssen, R.H. (1998) “Use of statistical matching techniques in calibration estimation”. Survey Methodology, 24, pp. 171-183.
