Micro-Fusion. Presented by. Marco Di Zio Istat – Italian National Institute of Statistics. Outline. Micro-Fusion in Memobust Objectives characterising Micro-Fusion settings Focus on some methods Structure of the Memobust handbook section concerning micro-fusion. Micro-Fusion in Memobust.
Presented by • Marco Di Zio • Istat – Italian National Institute of Statistics
Outline • Micro-Fusion in Memobust • Objectives characterising Micro-Fusion settings • Focus on some methods • Structure of the Memobust handbook section concerning micro-fusion
Micro-Fusion in Memobust • Integration of data sources composed of units (input: micro) in order to obtain a data set that is still composed of units (output: micro). • The focus is on statistical techniques
Main settings of M-F in Memobust • Integration of data sources composed of the same units (Record linkage-Object matching) • Integration of sources composed of different units (Statistical matching) • Making the integrated data consistent (Microintegration)
Example of integration with the same units (Record linkage-object matching) • A register holds the main variables; we want to integrate it with information from administrative sources and sample surveys • Register of businesses with main characteristics: NUTS, NACE, number of employees, .. • Financial statements and Tax Authority sources • Small and medium enterprise survey
Frameworks for record linkage • A unique unit identifier is available, without error • A unit identifier has to be created from the available variables, which are without error • A unit identifier has to be created from the available variables, which are affected by errors
The Fellegi-Sunter decision rule • Data sources A and B (with NA and NB observations) • Choose k matching variables X1,…,Xk common to both sources • Compare them (e.g. Ci=1 if Xi in A equals Xi in B, Ci=0 otherwise) and obtain the comparison vector C=(C1,…,Ck) for each pair of units (a,b)
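Fellegi-Sunter (1969) • For each pair (a,b) compute the ratio $r = \frac{m(c)}{u(c)} = \frac{P(C = c \mid (a,b) \in M)}{P(C = c \mid (a,b) \in U)}$, where c is the observed comparison vector, M is the set of true matches and U the set of true non-matches • Pairs (a,b) can be ordered according to r and classified into the sets M* (declared matches), U* (declared non-matches) or Q* (undecided)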
Fellegi-Sunter • The thresholds on r are obtained by solving equations that minimize the size of the undecided set Q* while keeping the false match rate and the false non-match rate at the chosen acceptable levels
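The decision rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the comparisons are treated as conditionally independent (a common simplification that the slides do not state), and the variable names, the m- and u-probabilities and the two thresholds are made-up values, not estimates from real data.

```python
from math import log2

def comparison_vector(rec_a, rec_b, match_vars):
    """Ci = 1 if the two records agree on matching variable i, else 0."""
    return tuple(1 if rec_a[v] == rec_b[v] else 0 for v in match_vars)

def match_weight(c, m, u):
    """log2 of r = m(c)/u(c), ASSUMING the k comparisons are independent.

    m[i] = P(Ci = 1 | pair is a match), u[i] = P(Ci = 1 | pair is a non-match).
    """
    weight = 0.0
    for ci, mi, ui in zip(c, m, u):
        weight += log2(mi / ui) if ci else log2((1 - mi) / (1 - ui))
    return weight

def classify(weight, lower, upper):
    """Assign a pair to M*, U* or the undecided set Q*."""
    if weight >= upper:
        return "M*"
    if weight <= lower:
        return "U*"
    return "Q*"

# Hypothetical matching variables and probabilities (illustrative only).
MATCH_VARS = ("nace", "nuts", "name")
M_PROB = (0.90, 0.95, 0.80)
U_PROB = (0.10, 0.20, 0.01)

a = {"nace": "10.1", "nuts": "ITC4", "name": "ROSSI SRL"}
candidates = [
    {"nace": "10.1", "nuts": "ITC4", "name": "ROSSI SRL"},    # full agreement
    {"nace": "10.1", "nuts": "ITC4", "name": "ROSSI S.R.L."}, # name differs
    {"nace": "47.1", "nuts": "ITF3", "name": "BIANCHI SPA"},  # no agreement
]
labels = []
for b in candidates:
    c = comparison_vector(a, b, MATCH_VARS)
    labels.append(classify(match_weight(c, M_PROB, U_PROB), lower=-5.0, upper=5.0))
print(labels)  # ['M*', 'Q*', 'U*']
```

In practice the m- and u-probabilities are estimated from the data and the thresholds are derived from the accepted error rates rather than fixed by hand as here.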
Modules for record linkage-object matching in the handbook • Object matching (record linkage) • Object identifier matching • Unweighted matching of object characteristics • Weighted matching of object characteristics • Probabilistic record linkage • Fellegi-Sunter method for record linkage
Example of integration of data sources with different units (Statistical matching) • Combining Farms • Farm Structure Survey • Farm Accountancy Data survey • Combining Income - Consumption • Household Budget Survey • Bank of Italy survey on Income
Statistical matching methods • Imputation methods • Parametric methods • Non-parametric methods (donor imputation) • Mixed methods
Mixed methods • Estimate a parametric model (e.g. regression) • Use the model estimated in the first step to predict values in both data sets (e.g. recipient A, donor B) • Use the predicted values to find a donor in B whose observed value is imputed in the recipient A (e.g. find the nearest neighbour in B according to a distance computed on the regressed values)
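The three steps can be sketched as follows. This is a minimal illustration, assuming a single common variable X, a simple linear regression of Z on X fitted on the donor file B, and absolute difference between predicted values as the distance; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mixed_match(recipient_x, donor_x, donor_z):
    """Impute Z in the recipient file using donors chosen on predicted values."""
    # Step 1: estimate the parametric model on the donor file B.
    intercept, slope = fit_line(donor_x, donor_z)
    # Step 2: predict Z in both files.
    donor_pred = [intercept + slope * x for x in donor_x]
    recipient_pred = [intercept + slope * x for x in recipient_x]
    # Step 3: nearest neighbour on predicted values; the donor's OBSERVED
    # Z is imputed, so every imputed value really occurs in B.
    imputed = []
    for p in recipient_pred:
        j = min(range(len(donor_pred)), key=lambda k: abs(donor_pred[k] - p))
        imputed.append(donor_z[j])
    return imputed

# Illustrative data: B observes (X, Z), A observes X only.
donor_x = [1.0, 2.0, 3.0, 4.0]
donor_z = [10.0, 19.0, 32.0, 39.0]
recipient_x = [1.2, 3.6]
print(mixed_match(recipient_x, donor_x, donor_z))  # [10.0, 39.0]
```

Imputing observed donor values rather than the regression predictions is what distinguishes this from a purely parametric imputation.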
Limitations and alternatives in SM • Naïve methods are implicitly based on the conditional independence assumption (Y and Z are independent given the common variables X). • To overcome this limitation, use auxiliary information on Y and Z, e.g., outdated data, proxy variables… • Alternatively, compute uncertainty bounds, i.e., the bounds of the parameters that cannot be identified (e.g., the correlation of Y and Z).
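For the correlation example, the bounds have a closed form in the simplest case. The sketch below assumes a single common variable X and known correlations of X with Y and with Z; the bounds then follow from requiring the 3×3 correlation matrix to be positive semidefinite, and the midpoint is the value implied by conditional independence. The function name and the input values are illustrative.

```python
from math import sqrt

def corr_yz_bounds(r_xy, r_xz):
    """Range of corr(Y, Z) compatible with corr(X, Y) and corr(X, Z).

    Returns (lower, cia_value, upper); cia_value is the correlation
    implied by the conditional independence assumption.
    """
    cia_value = r_xy * r_xz
    half_width = sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))
    return cia_value - half_width, cia_value, cia_value + half_width

lower, cia, upper = corr_yz_bounds(0.8, 0.6)
print(round(lower, 3), round(cia, 3), round(upper, 3))
```

With these inputs the interval runs from about 0 to about 0.96, so the data alone say very little about the correlation of Y and Z; the stronger the common variable predicts Y and Z, the narrower the interval becomes.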
Modules for statistical matching in the handbook • Matching different observations from different sources (statistical matching) • Statistical matching methods
Obtaining consistent data - Example • Key variables are available from reliable administrative data (e.g., Turnover, number of Employees, total wages paid - Wages). • The SBS requires more detail, so a sample survey is conducted to obtain the additional variables. • For Turnover and the other key variables the register values are used; survey values are used for the remaining variables.
Obtaining consistent data - Example • Business records have to satisfy a number of accounting rules and logical constraints, e.g. • e1: x1 – x5 + x8 = 0 (Profit = Turnover – Total Costs) • e2: –x3 + x5 – x4 = 0 (Turnover = Turnover main + Turnover other) • e3: –x6 – x7 + x8 = 0 (Total Costs = Wages + Other costs)
Integration of data sources with different units - Example and… • Violation of the edit rules • To obtain a consistent record, some of the values have to be changed or “adjusted”.
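A small check of the edit rules e1 to e3 shows how such violations are detected. The record below is invented for illustration (x2 does not appear in any edit and is given an arbitrary value); it mimics the situation where Turnover (x5) comes from the register and the other values from the survey.

```python
# Rows of E for the edits e1-e3, written as E x = 0 with x = (x1, ..., x8).
E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # e1: x1 - x5 + x8 = 0
    [0, 0, -1, -1, 1, 0, 0, 0],   # e2: -x3 - x4 + x5 = 0
    [0, 0, 0, 0, 0, -1, -1, 1],   # e3: -x6 - x7 + x8 = 0
]

def violated_edits(x, tol=1e-9):
    """Return the names of the edit rules that the record x fails."""
    failed = []
    for i, row in enumerate(E, start=1):
        residual = sum(e * v for e, v in zip(row, x))
        if abs(residual) > tol:
            failed.append(f"e{i}")
    return failed

# Hypothetical record: x1 profit, x3/x4 turnover main/other, x5 turnover,
# x6 wages, x7 other costs, x8 total costs (x2 is a placeholder).
record = [330, 20, 900, 100, 950, 500, 200, 700]
print(violated_edits(record))  # ['e1', 'e2']
```

Here the register turnover of 950 agrees neither with the surveyed turnover components nor with the surveyed profit and costs, while the cost breakdown e3 is internally consistent.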
Adjusting methods • Prorating • Minimum adjustment methods • Generalised ratio adjustment
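As a rough illustration of the first method: prorating, in its usual sense, rescales the components of a total by a common factor so that they add up to a total that is held fixed. The slides do not spell out the details, so the function and the figures below are assumptions for illustration only.

```python
def prorate(components, fixed_total):
    """Scale all components by the same factor so they sum to fixed_total."""
    current_total = sum(components)
    if current_total == 0:
        raise ValueError("cannot prorate components that sum to zero")
    factor = fixed_total / current_total
    return [value * factor for value in components]

# Surveyed turnover components (main, other) against a register turnover of 950.
adjusted = prorate([900, 100], fixed_total=950)
print([round(v, 2) for v in adjusted])  # [855.0, 95.0]
```

The ratio between the components is preserved, which is the defining property of this kind of adjustment.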
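Minimum adjustment methods • e1: x1 – x5 + x8 = 0 (Profit = Turnover – Total Costs) • e2: –x3 + x5 – x4 = 0 (Turnover = Turnover main + Turnover other) • e3: –x6 – x7 + x8 = 0 (Total Costs = Wages + Other costs) • These edits can be expressed in the form Ex = c, with $x = (x_1, \ldots, x_8)^T$ and
$$E = \begin{pmatrix} 1 & 0 & 0 & 0 & -1 & 0 & 0 & 1 \\ 0 & 0 & -1 & -1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & -1 & -1 & 1 \end{pmatrix}, \qquad c = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$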
Minimum adjustment methods • More generally, edits can be expressed as Ax >= b • The minimum adjustment method consists in finding the solution of $\min_x D(x, x_0)$ subject to $Ax \ge b$, where D is a distance function • x0: observed values of the variables that can be modified
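For equality edits Ex = c and a plain least-squares distance, the minimum adjustment has a closed form: the adjustable values are corrected by $E_F^T (E_F E_F^T)^{-1} (E x_0 - c)$, where $E_F$ keeps only the columns of the adjustable variables. The sketch below assumes exactly that setting (squared Euclidean distance, equality edits only, some variables held fixed); inequality edits Ax >= b and other distance functions need a proper optimisation routine. The record and helper names are illustrative.

```python
def solve(a, b):
    """Solve a small dense linear system a y = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def min_adjust(E, c, x0, free):
    """Least-squares change of the entries x0[j], j in free, so that E x = c."""
    residual = [sum(e * v for e, v in zip(row, x0)) - ci for row, ci in zip(E, c)]
    e_free = [[row[j] for j in free] for row in E]
    gram = [[sum(p * q for p, q in zip(r1, r2)) for r2 in e_free] for r1 in e_free]
    lam = solve(gram, residual)
    x = list(x0)
    for k, j in enumerate(free):
        x[j] -= sum(lam[i] * e_free[i][k] for i in range(len(E)))
    return x

E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # e1: x1 - x5 + x8 = 0
    [0, 0, -1, -1, 1, 0, 0, 0],   # e2: -x3 - x4 + x5 = 0
    [0, 0, 0, 0, 0, -1, -1, 1],   # e3: -x6 - x7 + x8 = 0
]
c = [0, 0, 0]
x0 = [330, 20, 900, 100, 950, 500, 200, 700]   # hypothetical record
free = [0, 2, 3, 5, 6, 7]   # 0-based positions; x5 (register turnover) is fixed
x = min_adjust(E, c, x0, free)
print([round(v, 6) for v in x])
```

For this record the adjusted values come out at roughly (282, 20, 875, 75, 950, 484, 184, 668): the turnover discrepancy is split equally over the two turnover components, and the profit discrepancy is spread over profit, wages, other costs and total costs.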
Microintegration modules in the handbook • Reconciling conflicting micro-data • Prorating • Minimum adjustment methods • Generalised ratio adjustments
Authors of the modules • Introductory module • Di Zio M. (Istat) • Modules on record linkage - object matching • Willenborg L., Van de Laar R. (CBS) • Tuoto T., Cibella N. (Istat) • Modules on statistical matching • Scanu M., D’Orazio M. (Istat) • Modules on microintegration • Pannekoek J. (CBS)