Micro-Fusion Techniques in Data Integration Handbook

Micro-Fusion

Presented by • Marco Di Zio • Istat – Italian National Institute of Statistics

Outline • Micro-Fusion in Memobust • Objectives characterising Micro-Fusion settings • Focus on some methods • Structure of the Memobust handbook section concerning micro-fusion

Micro-Fusion in Memobust • Integration of data sources composed of units (input: micro) to obtain a data set that is still composed of units (output: micro). • The focus is on statistical techniques.

Main settings of Micro-Fusion in Memobust • Integration of data sources composed of the same units (record linkage, object matching) • Integration of data sources composed of different units (statistical matching) • Making the integrated data consistent (microintegration)

Example of integration with the same units (record linkage, object matching) • A register contains the main variables; we want to integrate it with information from administrative sources and sample surveys • Register of businesses with main characteristics: NUTS, NACE, number of employees, ... • Financial statements and Tax Authority sources • Small and medium enterprise survey

Frameworks for record linkage • A unique unit identifier is available, without error • The unit identifier has to be built from the available variables, which are free of error • The unit identifier has to be built from the available variables, which are affected by errors

The Fellegi-Sunter decision rule • Data sources A and B (with NA and NB observations) • Choose k common matching variables X1,…,Xk • Compare them for each pair of units (a,b), e.g. ci=1 if Xi in A equals Xi in B and ci=0 otherwise, to obtain the comparison vector C=(C1,…,Ck)
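The comparison step above can be sketched in a few lines of Python. This is only an illustration of the exact-agreement comparison (ci=1 if the values agree, 0 otherwise); the record layout and the variable names `name`, `nuts` and `nace` are hypothetical and not taken from the handbook.

```python
def comparison_vector(rec_a, rec_b, match_vars):
    """Return (c1, ..., ck): ci = 1 if the i-th matching variable agrees, else 0."""
    return tuple(1 if rec_a[v] == rec_b[v] else 0 for v in match_vars)


# Hypothetical records, one from source A and one from source B
a = {"name": "ROSSI SRL", "nuts": "ITI4", "nace": "47.11"}
b = {"name": "ROSSI SRL", "nuts": "ITI4", "nace": "47.19"}

# Agreement on name and NUTS, disagreement on NACE
c = comparison_vector(a, b, ["name", "nuts", "nace"])
```

In practice the comparison is often softer than strict equality (for example a string similarity above a cut-off), but the output is still a vector with one component per matching variable.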

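Fellegi-Sunter (1969) • Compute the ratio r = P(C | (a,b) in M) / P(C | (a,b) in U), where M is the set of true matches and U the set of true non-matches • Couples (a,b) can be ordered according to r and classified into the sets M* (declared matches), U* (declared non-matches) or Q* (undecided)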

Fellegi-Sunter • The thresholds on r are chosen so as to minimise the size of the undecided set Q* for given (acceptable) levels of the false match rate and of the false non-match rate
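A minimal sketch of how a couple could be classified once r is available. It assumes, as is common in applications but not stated on the slides, that the components of the comparison vector are conditionally independent, so that r factorises into per-variable terms. The probabilities `m` and `u` and the two thresholds are hypothetical values chosen for illustration; in a real application they are estimated and derived from the target error rates.

```python
def fs_ratio(c, m, u):
    """Likelihood ratio r for comparison vector c.

    m[i] = P(ci = 1 | match), u[i] = P(ci = 1 | non-match).
    Assumes conditional independence of the components of c.
    """
    r = 1.0
    for ci, mi, ui in zip(c, m, u):
        r *= (mi / ui) if ci == 1 else ((1 - mi) / (1 - ui))
    return r


def classify(r, t_lower, t_upper):
    """Assign a couple to M* (match), U* (non-match) or Q* (undecided)."""
    if r >= t_upper:
        return "M*"
    if r <= t_lower:
        return "U*"
    return "Q*"


# Hypothetical parameters for three matching variables
m = [0.9, 0.8, 0.7]
u = [0.1, 0.2, 0.3]
t_lower, t_upper = 0.1, 20.0

labels = {c: classify(fs_ratio(c, m, u), t_lower, t_upper)
          for c in [(1, 1, 1), (1, 1, 0), (0, 0, 0)]}
```

With these illustrative numbers, full agreement falls in M*, full disagreement in U*, and the mixed pattern (1, 1, 0) lands between the thresholds and is left to Q*.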

Modules for record linkage-object matching in the handbook • Object matching (record linkage) • Object identifier matching • Unweighted matching of object characteristics • Weighted matching of object characteristics • Probabilistic record linkage • Fellegi-Sunter method for record linkage

Example of integration of data sources with different units (statistical matching) • Combining farm data • Farm Structure Survey • Farm Accountancy Data survey • Combining income and consumption • Household Budget Survey • Bank of Italy survey on income

Statistical matching

Statistical matching methods • Imputation methods • Parametric methods • Non-parametric methods (donor imputation) • Mixed methods

Mixed methods • 1. Estimate a parametric model (e.g. regression) • 2. Use the model from step 1 to predict values in both data sets (e.g. recipient A, donor B) • 3. Use the predicted values to find a donor in B for each record in the recipient A (e.g. the nearest neighbour in B according to a distance computed on the predicted values) and impute the donor's observed value
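The three steps of a mixed method can be sketched as follows. This is a simplified illustration, not the handbook's procedure: it assumes a single common variable X, a simple linear regression of Z on X fitted on the donor file B, and an absolute-difference distance on the predicted values. All function names and data are hypothetical.

```python
def fit_simple_regression(xs, zs):
    """Ordinary least squares fit of z = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope


def mixed_match(recipient_x, donor_x, donor_z):
    """Impute Z in the recipient file A using donors from file B."""
    # Step 1: estimate the parametric model on the donor file
    intercept, slope = fit_simple_regression(donor_x, donor_z)
    # Step 2: predict Z in both files
    pred_a = [intercept + slope * x for x in recipient_x]
    pred_b = [intercept + slope * x for x in donor_x]
    # Step 3: nearest neighbour on predicted values, impute the observed Z
    imputed = []
    for pa in pred_a:
        j = min(range(len(pred_b)), key=lambda k: abs(pred_b[k] - pa))
        imputed.append(donor_z[j])
    return imputed


# Hypothetical data: B observes (X, Z), A observes X (and Y, not needed here)
donor_x = [1.0, 2.0, 3.0, 4.0]
donor_z = [2.1, 3.9, 6.2, 7.8]
recipient_x = [1.2, 3.6]

imputed_z = mixed_match(recipient_x, donor_x, donor_z)
```

Because the imputed values are observed ("live") donor values rather than model predictions, the imputed Z keeps a realistic distribution, which is the usual motivation for combining the parametric and the donor step.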

Limitations and alternatives in statistical matching • 1. Naïve methods are implicitly based on the conditional independence assumption (Y and Z are independent given the common variables X). • 2. To overcome limitation 1, use auxiliary information on Y and Z, e.g., outdated data, proxy variables… • 3. Alternatively, compute uncertainty bounds, i.e., the bounds of the parameters that cannot be identified from the available data (e.g., the correlation of Y and Z).
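As an illustration of uncertainty bounds, consider the simplest case of a single common variable X, with the correlations of (X, Y) and (X, Z) estimated from the two files. The correlation of Y and Z is not identified, but the requirement that the 3×3 correlation matrix be positive semidefinite restricts it to an interval centred on the value implied by conditional independence. The function name and the input values below are hypothetical.

```python
import math


def correlation_bounds(rho_xy, rho_xz):
    """Admissible interval for corr(Y, Z) given corr(X, Y) and corr(X, Z).

    Follows from positive semidefiniteness of the correlation matrix
    of (X, Y, Z). The midpoint rho_xy * rho_xz is the value obtained
    under conditional independence of Y and Z given X.
    """
    centre = rho_xy * rho_xz
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre + half_width


# Hypothetical estimates from file A (X, Y) and file B (X, Z)
lower, upper = correlation_bounds(0.6, 0.8)  # roughly (0.0, 0.96)
```

The wider the interval, the less the common variable tells us about the joint behaviour of Y and Z, and the more a point estimate based on conditional independence should be treated with caution.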

Modules for statistical matching in the handbook • Matching different observations from different sources (statistical matching) • Statistical matching methods

Obtaining consistent data - Example • Key variables come from reliable administrative data (e.g., Turnover, number of Employees, total wages paid: Wages). • The SBS (Structural Business Statistics) requires more detail, so a sample survey is conducted to obtain the additional variables. • Register values are used for Turnover and the other key variables, survey values for the remaining variables.

Integration of data sources with different units - Example

Obtaining consistent data - Example • Business records have to adhere to a number of accounting rules and logical constraints (edit rules), e.g.: • e1: x1 – x5 + x8 = 0 (Profit = Turnover – Total Costs) • e2: –x3 + x5 – x4 = 0 (Turnover = Turnover main + Turnover other) • e3: –x6 – x7 + x8 = 0 (Total Costs = Wages + Other costs)

Integration of data sources with different units - Example and… • Violation of the edit rules • To obtain a consistent record, some of the values have to be changed or "adjusted".
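Detecting which edit rules a combined record violates is a matter of evaluating each linear edit. The sketch below encodes the rules e1 to e3 as coefficient dictionaries and reports the ones whose left-hand side is not zero. The record values are hypothetical and only serve to show a violation.

```python
# Edit rules as {variable: coefficient}; each must sum to zero
EDITS = {
    "e1": {"x1": 1, "x5": -1, "x8": 1},   # Profit = Turnover - Total costs
    "e2": {"x3": -1, "x4": -1, "x5": 1},  # Turnover = Turnover main + other
    "e3": {"x6": -1, "x7": -1, "x8": 1},  # Total costs = Wages + Other costs
}


def violated_edits(record, tol=1e-9):
    """Return the names of the edit rules that the record does not satisfy."""
    violated = []
    for name, coeffs in EDITS.items():
        residual = sum(coef * record[var] for var, coef in coeffs.items())
        if abs(residual) > tol:
            violated.append(name)
    return violated


# Hypothetical record: Turnover (x5) taken from the register,
# the other values taken from the survey
record = {"x1": 330, "x3": 950, "x4": 20, "x5": 1000,
          "x6": 500, "x7": 200, "x8": 700}

failed = violated_edits(record)
```

Here the register turnover of 1000 is inconsistent with the surveyed components (950 + 20) and with the surveyed profit, so e1 and e2 fail while e3 holds.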

Adjusting methods • Prorating • Minimum adjustment methods • Generalised ratio adjustment
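Of the methods listed, prorating is the simplest to illustrate. A minimal sketch, assuming the common reading of prorating for a single sum constraint: the total is kept fixed (for example because it comes from the register) and the components are multiplied by one common factor so that they add up to it. The function name and the figures are hypothetical.

```python
def prorate(components, total):
    """Scale the components by a common factor so that they sum to total."""
    current = sum(components)
    if current == 0:
        raise ValueError("cannot prorate components that sum to zero")
    factor = total / current
    return [value * factor for value in components]


# Hypothetical case for edit e2: register Turnover = 1000,
# surveyed Turnover main = 950 and Turnover other = 20
turnover_main, turnover_other = prorate([950, 20], 1000)
```

The relative shares of the components are preserved, which is the main appeal of the method. When a variable appears in several edit rules at once, adjusting one rule at a time can break another, which is where minimum adjustment methods come in.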

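Minimum adjustment methods • e1: x1 – x5 + x8 = 0 (Profit = Turnover – Total Costs) • e2: –x3 + x5 – x4 = 0 (Turnover = Turnover main + Turnover other) • e3: –x6 – x7 + x8 = 0 (Total Costs = Wages + Other costs) • These edits can be expressed in the form Ex = c with x = (x1,…,x8)', c = (0, 0, 0)' and E the 3×8 matrix with rows (1, 0, 0, 0, –1, 0, 0, 1), (0, 0, –1, –1, 1, 0, 0, 0) and (0, 0, 0, 0, 0, –1, –1, 1)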

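Minimum adjustment methods • More generally, edits can be expressed as Ax >= b • The minimum adjustment method consists in finding the x that satisfies the edits and is as close as possible to x0 according to a chosen distance function, i.e. minimising D(x, x0) subject to Ax >= b • x0: observed values of the variables that can be modified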
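A minimal sketch of a minimum adjustment under strong simplifying assumptions: only equality edits Ex = c are handled (no inequalities), the distance is the unweighted sum of squared changes, and every variable is treated as modifiable, whereas in practice reliable register values would be held fixed. Under these assumptions the solution has the closed form x = x0 – E'(EE')^(-1)(Ex0 – c). The data are hypothetical, and a real implementation would use a numerical library rather than the small hand-written solver below.

```python
def solve(a, b):
    """Solve the small dense system a y = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def min_adjust(x0, E, c):
    """Least squares adjustment of x0 so that E x = c holds exactly."""
    resid = [sum(e * x for e, x in zip(row, x0)) - ci for row, ci in zip(E, c)]
    EEt = [[sum(p * q for p, q in zip(r1, r2)) for r2 in E] for r1 in E]
    lam = solve(EEt, resid)  # Lagrange multipliers
    return [x - sum(lam[i] * E[i][j] for i in range(len(E)))
            for j, x in enumerate(x0)]


# Edit rules e1-e3 over x = (x1, ..., x8); x2 does not appear in any edit
E = [[1, 0, 0, 0, -1, 0, 0, 1],
     [0, 0, -1, -1, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, -1, -1, 1]]
c = [0, 0, 0]
x0 = [330, 25, 950, 20, 1000, 500, 200, 700]  # hypothetical observed record

x_adj = min_adjust(x0, E, c)
residuals = [sum(e * x for e, x in zip(row, x_adj)) for row in E]
```

Unlike prorating, the adjustment is spread over all variables involved in the violated edits simultaneously, so all rules hold at once; variables that appear in no edit (x2 here) are left unchanged.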

Microintegration modules in the handbook • Reconciling conflicting micro-data • Prorating • Minimum adjustment methods • Generalised ratio adjustments

Authors of the modules • Introductory module • Di Zio M. (Istat) • Modules on record linkage - object matching • Willenborg L., Van de Laar R. (CBS) • Tuoto T., Cibella N. (Istat) • Modules on statistical matching • Scanu M., D’Orazio M. (Istat) • Modules on microintegration • Pannekoek J. (CBS)
