
Micro-Fusion Techniques in Data Integration Handbook

Explore methods for integrating data sources in Micro-Fusion settings with Memobust, focusing on techniques like record linkage and statistical matching. Learn about key variables and limitations in statistical matching for consistent data output.


Presentation Transcript


  1. Micro-Fusion

  2. Presented by • Marco Di Zio • Istat – Italian National Institute of Statistics

  3. Outline • Micro-Fusion in Memobust • Objectives characterising Micro-Fusion settings • Focus on some methods • Structure of the Memobust handbook section concerning micro-fusion

  4. Micro-Fusion in Memobust • Integration of data sources composed of units (input: micro) to obtain a data set that is still composed of units (output: micro). • The focus is on statistical techniques.

  5. Main settings of M-F in Memobust • Integration of data sources composed of the same units (record linkage, object matching) • Integration of sources composed of different units (statistical matching) • Making the integrated data consistent (microintegration)

  6. Example of integration with the same units (record linkage, object matching) • A register contains the main variables; we want to integrate it with information from administrative sources and sample surveys • Register of businesses with main characteristics: NUTS, NACE, number of employees, etc. • Financial statements and Tax Authority sources • Small and medium enterprise survey

  7. Frameworks for record linkage • A unique unit identifier is available, without error • The unit identifier has to be constructed from the available variables, which are free of error • The unit identifier has to be constructed from the available variables, which are affected by errors

  8. The Fellegi-Sunter decision rule • Data sources A and B (with NA and NB observations) • Choose k common match variables X1,…,Xk • Compare them (e.g. Ci=1 if Xi in A equals Xi in B, Ci=0 otherwise) and obtain the comparison vector C=(C1,…,Ck) for each pair of units (a,b)
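The comparison step on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy sources, the record fields (`nuts`, `nace`, `size`) and the helper name `comparison_vector` are hypothetical and do not come from the presentation.

```python
from itertools import product


def comparison_vector(rec_a, rec_b, match_vars):
    """Return (c_1, ..., c_k): c_i is 1 if the i-th match variable agrees, else 0."""
    return tuple(int(rec_a[v] == rec_b[v]) for v in match_vars)


# Hypothetical toy sources A and B; field names and values are illustrative only.
A = [
    {"id": "a1", "nuts": "ITC1", "nace": "10", "size": "small"},
    {"id": "a2", "nuts": "ITF3", "nace": "47", "size": "medium"},
]
B = [
    {"id": "b1", "nuts": "ITC1", "nace": "10", "size": "small"},
    {"id": "b2", "nuts": "ITF3", "nace": "46", "size": "medium"},
]
MATCH_VARS = ("nuts", "nace", "size")

# One comparison vector for each of the N_A x N_B pairs (a, b).
vectors = {
    (a["id"], b["id"]): comparison_vector(a, b, MATCH_VARS)
    for a, b in product(A, B)
}

for pair, c in vectors.items():
    print(pair, c)
```

Comparing all NA x NB pairs is only feasible for small files; the sketch ignores any reduction of the search space.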

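  9. Fellegi-Sunter (1969) • For each comparison vector, compute the ratio r = P(C | (a,b) in M) / P(C | (a,b) in U), where M is the set of true matches and U the set of true non-matches • Pairs (a,b) can then be ordered and classified into the sets M* (declared matches) and U* (declared non-matches), or left undecided in Q*, according to r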

  10. Fellegi-Sunter • The thresholds are obtained by solving equations that minimise the size of the undecided set Q* for given acceptable levels of the false match rate and the false non-match rate
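A minimal sketch of the decision rule follows. It assumes, as a common simplification not stated on the slides, that the k comparisons are conditionally independent, so the ratio r factorises over the match variables. The probabilities `M_PROBS` and `U_PROBS` and the two thresholds are made-up values for illustration; in practice they would have to be estimated and tuned to the accepted error rates.

```python
def likelihood_ratio(c, m, u):
    """Ratio r = P(c | match) / P(c | non-match) under conditional independence.

    m[i] = P(C_i = 1 | pair is a match), u[i] = P(C_i = 1 | pair is a non-match).
    """
    r = 1.0
    for ci, mi, ui in zip(c, m, u):
        r *= (mi / ui) if ci == 1 else ((1 - mi) / (1 - ui))
    return r


def classify(r, t_lower, t_upper):
    """Assign a pair to M* (match), U* (non-match) or Q* (undecided)."""
    if r >= t_upper:
        return "M*"
    if r <= t_lower:
        return "U*"
    return "Q*"


# Assumed (hypothetical) agreement probabilities and thresholds.
M_PROBS = (0.9, 0.8, 0.7)
U_PROBS = (0.1, 0.2, 0.3)
T_LOWER, T_UPPER = 0.1, 10.0

decisions = {
    c: classify(likelihood_ratio(c, M_PROBS, U_PROBS), T_LOWER, T_UPPER)
    for c in [(1, 1, 1), (1, 0, 1), (0, 0, 0)]
}
print(decisions)
```

Pairs that land in Q* are the ones that would typically need clerical review, which is why the size of that set matters when choosing the thresholds.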

  11. Modules for record linkage-object matching in the handbook • Object matching (record linkage) • Object identifier matching • Unweighted matching of object characteristics • Weighted matching of object characteristics • Probabilistic record linkage • Fellegi-Sunter method for record linkage

  12. Example of integration of data sources with different units (statistical matching) • Combining farm surveys • Farm Structure Survey • Farm Accountancy Data survey • Combining income and consumption • Household Budget Survey • Bank of Italy survey on income

  13. Statistical matching

  14. Statistical matching methods • Imputation methods • Parametric methods • Non-parametric methods (donor imputation) • Mixed methods

  15. Mixed methods • Estimate a parametric model (e.g. regression) • Use the model in step 1 to predict values in both the data sets (e.g. recipient A, donor B) • Use predicted values for finding a donor to impute in the recipient A (e.g. find the nearest neighbour in B according to a distance computed on regressed values)
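The three steps of the mixed method can be sketched as below. The sketch makes several simplifying assumptions that are not in the presentation: a single common variable X, a simple linear regression of Z on X fitted on the donor file B, and an absolute-difference distance on the predicted values. All data values and function names are hypothetical.

```python
def fit_simple_regression(xs, zs):
    """Ordinary least squares fit of z = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope


# Hypothetical toy files: donor B observes (x, z), recipient A observes (x, y).
donor_B = [{"x": 1.0, "z": 10.0}, {"x": 2.0, "z": 21.0},
           {"x": 3.0, "z": 29.0}, {"x": 4.0, "z": 41.0}]
recipient_A = [{"x": 1.2, "y": 5.0}, {"x": 3.6, "y": 7.0}]

# Step 1: estimate the parametric model on the donor file.
intercept, slope = fit_simple_regression(
    [d["x"] for d in donor_B], [d["z"] for d in donor_B]
)

# Step 2: predict z in both files.
def predict(record):
    return intercept + slope * record["x"]

# Step 3: for each recipient, take the donor whose predicted value is nearest
# and impute that donor's *observed* z (not the predicted one).
imputed_z = []
for rec in recipient_A:
    donor = min(donor_B, key=lambda d: abs(predict(d) - predict(rec)))
    imputed_z.append(donor["z"])

print(imputed_z)
```

With only one X the nearest neighbour on predicted values coincides with the nearest neighbour on X itself; the predicted-value distance becomes useful when several common variables are combined by the model.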

  16. Limitations and alternatives in SM • Naïve methods implicitly rely on the conditional independence assumption (Y and Z are independent given the common variables X). • To overcome this limitation, use auxiliary information on Y and Z, e.g., outdated data, proxy variables… • Alternatively, compute uncertainty bounds, i.e., the bounds of the unidentifiable parameters (e.g., the correlation of Y and Z).
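As an illustration of uncertainty bounds, consider the simplest case of a single common variable X, with Y observed only in one file and Z only in the other. The two observable correlations restrict the unobserved correlation of Y and Z to an interval, because the 3 x 3 correlation matrix must be positive semidefinite. The sketch below computes that interval; the input correlations are made-up numbers, and the single-X setting is an assumption chosen to keep the formula short.

```python
import math


def correlation_bounds(rho_xy, rho_xz):
    """Interval of values of corr(Y, Z) compatible with corr(X, Y) and corr(X, Z).

    The midpoint rho_xy * rho_xz is the value implied when the partial
    correlation of Y and Z given X is zero.
    """
    centre = rho_xy * rho_xz
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre + half_width


# Hypothetical correlations estimated from the two files.
lower, upper = correlation_bounds(0.6, 0.8)
print(round(lower, 6), round(upper, 6))
```

The stronger the common variable is correlated with Y and Z, the narrower the interval, which is one way to judge how much the matching result depends on the conditional independence assumption.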

  17. Modules for statistical matching in the handbook • Matching different observations from different sources (statistical matching) • Statistical matching methods

  18. Obtaining consistent data - Example • Key variables are available from reliable administrative data (e.g., Turnover, number of Employees, total wages paid (Wages)). • The SBS requires more detail, so a sample survey is conducted to obtain the additional variables. • For Turnover and the other key variables the register values are used; for the remaining variables the survey values are used.

  19. Integration of data sources with different units - Example

  20. Obtaining consistent data - Example • Business records have to adhere to a number of accounting rules and logical constraints, e.g.: • e1: x1 – x5 + x8 = 0 (Profit = Turnover – Total Costs) • e2: –x3 + x5 – x4 = 0 (Turnover = Turnover main + Turnover other) • e3: –x6 – x7 + x8 = 0 (Total Costs = Wages + Other costs)

  21. Integration of data sources with different units - Example and… • Combining the sources can lead to violations of the edit rules • To obtain a consistent record, some of the values have to be changed or “adjusted”.

  22. Adjusting methods • Prorating • Minimum adjustment methods • Generalised ratio adjustment
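Of the three adjusting methods, prorating is the easiest to illustrate. The sketch below shows one simple form of it for a single additive edit: the total is treated as fixed (for example a register value) and the components are scaled by a common factor so that they add up to it. The function name, variable names and figures are hypothetical, and this is not necessarily the exact variant described in the handbook module.

```python
def prorate(components, total):
    """Scale all components by one common factor so that they sum to the total."""
    current = sum(components.values())
    if current == 0:
        raise ValueError("cannot prorate components that sum to zero")
    factor = total / current
    return {name: value * factor for name, value in components.items()}


# Hypothetical record: the survey components sum to 400 but the register
# turnover, which is kept fixed, is 500.
adjusted = prorate({"turnover_main": 300.0, "turnover_other": 100.0}, total=500.0)
print(adjusted)
```

Prorating preserves the ratios between the components, which is its main appeal, but it handles one edit at a time.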

  23. Minimum adjustment methods • The edit rules • e1: x1 – x5 + x8 = 0 (Profit = Turnover – Total Costs) • e2: –x3 + x5 – x4 = 0 (Turnover = Turnover main + Turnover other) • e3: –x6 – x7 + x8 = 0 (Total Costs = Wages + Other costs) • can be expressed in the form Ex = c, where x = (x1,…,x8) is the record, c = (0, 0, 0) and each row of E holds the coefficients of one edit
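Reading the coefficients off e1 to e3 gives one possible explicit form of E and c. The sketch below builds them and checks a record against the edits. The role of x2 is not given by the three edits (it does not appear in any of them, so its column is zero), and the record values are invented for illustration.

```python
# Rows are e1, e2, e3; columns are x1..x8.
# x1 Profit, x3 Turnover main, x4 Turnover other, x5 Turnover,
# x6 Wages, x7 Other costs, x8 Total costs; x2 is not used by these edits.
E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # e1:  x1 - x5 + x8 = 0
    [0, 0, -1, -1, 1, 0, 0, 0],   # e2: -x3 + x5 - x4 = 0
    [0, 0, 0, 0, 0, -1, -1, 1],   # e3: -x6 - x7 + x8 = 0
]
c = [0, 0, 0]


def edit_residuals(E, x, c):
    """Return Ex - c; a record satisfies the edits when every entry is zero."""
    return [sum(e * xi for e, xi in zip(row, x)) - ci for row, ci in zip(E, c)]


# Hypothetical records: the second differs from the first only in Turnover (x5).
consistent = edit_residuals(E, [100, 5, 300, 100, 400, 200, 100, 300], c)
inconsistent = edit_residuals(E, [100, 5, 300, 100, 420, 200, 100, 300], c)
print(consistent, inconsistent)
```

Changing a single value can violate several edits at once, here e1 and e2, because Turnover appears in both.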

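  24. Minimum adjustment methods • More generally, edits can be expressed as Ax >= b • The minimum adjustment method consists in finding the vector x that satisfies the edits and is as close as possible, under a chosen distance, to x0 • x0: observed values of the variables that can be modified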
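For the special case of equality edits Ex = c and an unweighted least-squares distance, the minimum adjustment has a closed form: x = x0 - E'(EE')^{-1}(E x0 - c). The sketch below implements that case only, treating every variable as adjustable. The distance function, the record values and the helper names are assumptions made for illustration; inequality edits of the form Ax >= b would need a constrained optimiser instead.

```python
def solve(A, b):
    """Solve the square system A y = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def minimum_adjustment(E, c, x0):
    """Least-squares adjustment of x0 so that E x = c (rows of E independent)."""
    resid = [sum(e * x for e, x in zip(row, x0)) - ci for row, ci in zip(E, c)]
    EEt = [[sum(p * q for p, q in zip(ri, rj)) for rj in E] for ri in E]
    lam = solve(EEt, resid)
    return [x - sum(lam[i] * E[i][j] for i in range(len(E)))
            for j, x in enumerate(x0)]


# Edits e1-e3 from the slides, columns x1..x8 (x2 is not involved).
E = [[1, 0, 0, 0, -1, 0, 0, 1],
     [0, 0, -1, -1, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, -1, -1, 1]]
c = [0, 0, 0]
x0 = [100, 5, 300, 100, 420, 200, 100, 300]   # hypothetical inconsistent record

adjusted = minimum_adjustment(E, c, x0)
print([round(v, 2) for v in adjusted])
```

Variables that appear in no edit, such as x2 here, are left untouched, while the discrepancy is spread over all variables that do appear in a violated edit.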

  25. Microintegration modules in the handbook • Reconciling conflicting micro-data • Prorating • Minimum adjustment methods • Generalised ratio adjustments

  26. Authors of the modules • Introductory module • Di Zio M. (Istat) • Modules on record linkage - object matching • Willenborg L., Van de Laar R. (CBS) • Tuoto T., Cibella N. (Istat) • Modules on statistical matching • Scanu M., D’Orazio M. (Istat) • Modules on microintegration • Pannekoek J. (CBS)
