Explore the importance of linking data from different sources, software tools for data linkage, key categories of data linkage, and various types of matches. Learn how to append and aggregate data for in-depth analysis.
Linking data resources
Paul Lambert, University of Stirling
Presentation to the Scottish Civil Society Data Partnership Project (S-CSDP), Webinar 3 on ‘Dealing with data: Using standard measures and variables and linking together datasets’
www.thinkdata.org.uk, 10 Mar 2016
Linking data resources?
• …In the ‘big data’ tradition and era of ‘datafication’ we increasingly recognise the potential of bringing data together from different sources…
  • Social survey data plus administrative data
  • Different sources of by-product data
• Social science data analysis has always benefitted from linking (quantitative) datasets (‘data management’)
  • Linking ‘microdata’ and ‘macrodata’
  • Comparative analysis linking records from different years/countries/surveys
• The importance of ‘identifiers’
• Software tools for linking data
• Key categories of data linkage
S-CSDP, 10 Mar 2016
1) The importance of identifiers
• ‘id’ variable(s)
• Numeric or string format
• …Should uniquely identify each row in at least one of the data files…
• Value of standard categories!
• Post-processing to adapt formats?
• Reconstruction based on combined characteristics?
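The two checks implied by this slide (does the id uniquely identify rows, and do the formats agree across files) can be sketched in plain Python. This is only an illustration of the logic, not the SPSS or Stata route: the records, the helper names `normalise_id` and `duplicated_ids`, and the choice of `pid` as the key are assumptions made for the example.

```python
from collections import Counter

def normalise_id(value):
    """Coerce an identifier to a trimmed string so 7, '7' and ' 7 ' compare equal."""
    return str(value).strip()

def duplicated_ids(rows, key="pid"):
    """Return the identifier values that occur on more than one row."""
    counts = Counter(normalise_id(row[key]) for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Hypothetical records: one file stores pid as numbers, the other as strings.
file1 = [{"pid": 101, "age": 34}, {"pid": 102, "age": 51}]
file2 = [{"pid": "101 ", "income": 20000},
         {"pid": "102", "income": 31000},
         {"pid": "102", "income": 29500}]

print(duplicated_ids(file1))  # empty list: pid uniquely identifies rows here
print(duplicated_ids(file2))  # lists the ids that are repeated in file2
```

In this made-up case `pid` is unique in `file1` but not in `file2`, which is enough for linkage because the id only has to be unique in at least one of the files.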
2) Software tools for data linkage
Some popular software can be used to link data ‘on the fly’ (e.g. MS Excel, Access). Software designed for research data analysis has the attraction of purpose-built match-merge commands and their syntactical documentation.
• Appending data
  SPSS: add files /file="file1.sav" /file="file2.sav".
  Stata: use file1.dta, clear
         append using file2.dta
• Aggregating data
  SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
  Stata: collapse (mean) meaninc=income, by(pid)
• One-to-one matching
  SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
  Stata: merge 1:1 pid using file2.dta
• One-to-many matching (‘table distribution’)
  SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
  Stata: merge m:1 pid using file2.dta
• Many-to-many matches (‘joinby’)
• Related cases matching (see also www.dames.org.uk/workshops/)
Collectively known as ‘match-merge’ functions or ‘deterministic matching’
3) Key categories of data linkage
• Probabilistic linkage versus deterministic linkage
  • Algorithmic approximations versus ‘match-merge’ operations
• Linked data providers versus your own data processing
  • E.g. www.ipums.org (‘attach characteristics’)
• Linked data in a secure/restricted setting versus linking accessible data
  • E.g. Scottish Longitudinal Study, see http://sls.lscs.ac.uk/
  • E.g. British Household Panel Study, see https://www.iser.essex.ac.uk/bhps/documentation/volb/allrecs.html
  • E.g. Linking aggregate occupational data to survey microdata on occupations (talk 4)
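The first contrast on this slide can be illustrated with a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a simple stand-in for a similarity score; genuine probabilistic linkage methods are considerably more elaborate. The organisation names, the function names and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

def deterministic_match(a, b):
    """Link only when the identifying strings agree exactly."""
    return a == b

def approximate_match(a, b, threshold=0.85):
    """Link when a string-similarity score reaches an (arbitrary) threshold."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= threshold

name_a = "Stirling Community Trust"
name_b = "Stirling Comunity Trust"   # same hypothetical organisation, misspelt

print(deterministic_match(name_a, name_b))  # exact comparison fails on the typo
print(approximate_match(name_a, name_b))    # similarity score still links them
```

The trade-off is that an approximate rule can also create erroneous links, so the threshold needs checking against the data at hand.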
Appending data
• Add one or more datasets ‘on top of’ each other
• Usually full or partial overlap of variables
• Metadata preserved (but metadata from one file can overwrite that of another)
• Typically used for ‘repeated cross-section’ surveys
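As a rough picture of what appending does when the variables only partly overlap, here is a plain-Python sketch. The function `append_files` and the two small survey extracts are hypothetical; the statistical packages handle this (and the metadata) with their own commands.

```python
def append_files(*files):
    """Stack datasets on top of each other, keeping the union of variables."""
    variables = []
    for rows in files:
        for row in rows:
            for name in row:
                if name not in variables:
                    variables.append(name)
    # Variables absent from a file are filled with None (a missing value).
    return [{name: row.get(name) for name in variables}
            for rows in files for row in rows]

# Hypothetical repeated cross-sections: the 2015 file has an extra variable.
survey_2014 = [{"pid": 1, "year": 2014, "income": 180},
               {"pid": 2, "year": 2014, "income": 240}]
survey_2015 = [{"pid": 3, "year": 2015, "income": 210, "region": "Fife"}]

stacked = append_files(survey_2014, survey_2015)
for row in stacked:
    print(row)
```

The 2014 cases end up with a missing value on `region`, which is the usual outcome of a partial overlap of variables.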
Aggregating data
• Refers to generating a new dataset of summary statistics about the original data (‘macrodata’)
• Often we then want to link the aggregated data back to the original records (‘microdata’)
• Most stats packages also allow generation of summary values and/or variables without aggregating the cases
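The two steps described here, aggregating and then linking the summary back to the original records, can be sketched in plain Python. The records are invented; `pid`, `income` and `meaninc` follow the names used in the syntax examples earlier in the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical microdata: repeated income records per person.
records = [{"pid": 1, "year": 2014, "income": 100},
           {"pid": 1, "year": 2015, "income": 140},
           {"pid": 2, "year": 2014, "income": 200}]

# Step 1, aggregate: one summary value per pid (the 'macrodata').
incomes = defaultdict(list)
for row in records:
    incomes[row["pid"]].append(row["income"])
meaninc = {pid: mean(values) for pid, values in incomes.items()}

# Step 2, link back: attach the summary to every original record.
linked = [{**row, "meaninc": meaninc[row["pid"]]} for row in records]

print(meaninc)
for row in linked:
    print(row)
```

Step 2 on its own corresponds to the last bullet: a summary variable is added while the number of cases stays the same.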
One-to-one match
• Using a shared identifier to link records from different sources on the same units
• Here, responses from the same person at different time points (using ‘pid’)
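A minimal sketch of the one-to-one logic in plain Python, assuming two hypothetical waves of a panel keyed on `pid`. The function name `merge_one_to_one` and the variables `job_w1` and `job_w2` are illustrative only.

```python
def merge_one_to_one(left, right, key="pid"):
    """Join two files in which each key value appears at most once per file."""
    for rows in (left, right):
        keys = [row[key] for row in rows]
        if len(keys) != len(set(keys)):
            raise ValueError(f"{key} is not unique, so this is not a 1:1 match")
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    merged = []
    for row in left:
        # Cases found only in the left file are kept, without the new variables.
        merged.append({**row, **right_by_key.get(row[key], {})})
    # Cases found only in the right file are kept as well.
    merged.extend(row for row in right if row[key] not in left_keys)
    return merged

wave1 = [{"pid": 1, "job_w1": "nurse"}, {"pid": 2, "job_w1": "clerk"}]
wave2 = [{"pid": 1, "job_w2": "nurse"}, {"pid": 3, "job_w2": "driver"}]

merged = merge_one_to_one(wave1, wave2)
for row in merged:
    print(row)
```

Only person 1 appears in both waves; persons 2 and 3 are retained with information from one wave each, so the analyst still has to decide which cases to keep.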
One-to-many match
• Use a shared identifier to send values from a unit to multiple relevant records
• Common examples include using occupational data and macro-level cross-national data
• Often called ‘table’ distribution
• In Stata, take care to retain only suitable cases (check the ‘_merge’ variable)
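The ‘table’ distribution idea can be sketched in plain Python as a lookup. The occupation labels and scores are invented, `distribute_table` is a hypothetical helper, and the `matched` flag is only loosely analogous to Stata's `_merge` indicator. Unlike a default Stata merge, this sketch does not add lookup rows that have no matching record.

```python
def distribute_table(records, table, key):
    """Copy values from a lookup table (one row per key) onto many records."""
    lookup = {row[key]: row for row in table}
    if len(lookup) != len(table):
        raise ValueError(f"{key} must be unique in the lookup table")
    linked = []
    for row in records:
        match = lookup.get(row[key])
        # Flag whether a lookup row was found for this record.
        linked.append({**row, **(match or {}), "matched": match is not None})
    return linked

# Hypothetical occupation-level table and survey microdata.
occupations = [{"occ": "teacher", "score": 70}, {"occ": "cleaner", "score": 30}]
survey = [{"pid": 1, "occ": "teacher"}, {"pid": 2, "occ": "teacher"},
          {"pid": 3, "occ": "cleaner"}, {"pid": 4, "occ": "pilot"}]

linked = distribute_table(survey, occupations, "occ")
kept = [row for row in linked if row["matched"]]  # retain suitable cases only
print(len(linked), len(kept))
```

Both teachers receive the same score, and the record with no entry in the lookup table is flagged so it can be dropped or investigated.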
Many-to-many match
• Special scenario – want to distribute data to all permutations of linked records
• E.g. within-household links; data on events; data on financial patterns; data on illnesses
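A plain-Python sketch of forming every pairing of records that share a key, in the spirit of the ‘joinby’ approach mentioned earlier. The household members, the events, the key `hid` and the function `join_by` are all hypothetical.

```python
from collections import defaultdict

def join_by(left, right, key):
    """Form every pairing of left and right rows that share a key value."""
    right_groups = defaultdict(list)
    for row in right:
        right_groups[row[key]].append(row)
    # Left rows with no partner in the right file produce no output rows.
    return [{**l, **r} for l in left for r in right_groups.get(l[key], [])]

# Hypothetical data: people within households, and events recorded per household.
members = [{"hid": 1, "pid": 11}, {"hid": 1, "pid": 12}, {"hid": 2, "pid": 21}]
events = [{"hid": 1, "event": "moved house"},
          {"hid": 1, "event": "new baby"},
          {"hid": 2, "event": "moved house"}]

pairs = join_by(members, events, "hid")
for row in pairs:
    print(row)
```

Household 1 has two members and two events, so it contributes four rows. The output can grow quickly, which is one reason this is treated as a special scenario.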
Related cases matching
• A version of one-to-one match-merge, where specific relations between units are defined and exploited
• Use a purpose-built ‘alter’ identifier (e.g. ‘sppid’) or derive one from data
• E.g., link data on a husband to a wife; data on a father to a daughter
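The use of an ‘alter’ identifier can be sketched in plain Python by looking up each person's spouse within the same file. The records are invented; `sppid` follows the slide, while `sp_income` is a hypothetical name for the attached variable.

```python
# Hypothetical person file: sppid holds the pid of the respondent's spouse.
people = [{"pid": 1, "sppid": 2, "income": 300},
          {"pid": 2, "sppid": 1, "income": 350},
          {"pid": 3, "sppid": None, "income": 280}]

# Index the file by pid so each spouse's record can be found directly.
by_pid = {row["pid"]: row for row in people}

linked = []
for row in people:
    spouse = by_pid.get(row["sppid"])  # None when there is no spouse record
    linked.append({**row, "sp_income": spouse["income"] if spouse else None})

for row in linked:
    print(row)
```

Each person keeps a single row, so this remains a one-to-one match; the only difference is that the matching key on one side is the alter identifier rather than the person's own id.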
Summary: Liking linking?
• People often under-exploit their data by not implementing linkages when they might be helpful
  • Technical/software challenges
• Sometimes, erroneous links are made
• Clear documentation of data files can help
• Syntactical documentation of tasks will help
  • E.g. Long 2009; Boslaugh 2005
• Standard measures (identifiers) will make things easier

References cited
• Boslaugh, S. (2005). An intermediate guide to SPSS programming: Using syntax for data management. London: Sage.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.