Scottish Social Survey Network: Master Class 1, Data Analysis with Stata. Dr Vernon Gayle and Dr Paul Lambert, 23rd January 2008, University of Stirling. The SSSN is funded under Phase II of the ESRC Research Development Initiative. Data Analysis and Data Management with Stata.
Background: Integrating data management and data analysis “A programme like SPSS … has two main components: the statistical routines, that do the numerical calculations…, and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” (Procter, 2001:253). By Data management we mean: • Matching data files together • ‘Cleaning’ data • Operationalising variables • Accessing and reviewing data
Research interests, data analysis and data management (1) • Research-led pressures for large and complex survey data • Longitudinal surveys • Linked data projects • e.g. administrative data; health data; GIS • Comparative research • e.g. x-national, historical • social survey researchers enjoy access to a vast array of micro-data resources, many of which have (sometimes hidden) complexity
Check: what is large and complex social survey data? • Array of variables / operationalisations • Competing measures; interaction effects; latent variables • Multiple related data files • Linked component datasets • External data (e.g. aggregate and micro-data) • {Large volumes of cases} • Relations between cases • Multiple hierarchies of measurement • Multiple points of measurement • Unbalanced repeated contacts • {Censored} duration data • International comparative survey designs • Sample collection and weighting data
Example: Multiple measurement points (BHPS Unbalanced panel)
E.g.: array of variables and sample selection (BHPS occ data)
Check: Variable operationalisations? • processes by which survey measures are defined and subsequently interpreted by research analysts • Some prescriptive advice (e.g. ONS, EU) • Variable operationalisations in longitudinal research • http://www.longitudinal.stir.ac.uk/variables/ • Themes from comparative research • ‘universality’ and ‘specificity’ • Importance of documentation / metadata • {See Scottish Social Survey Network seminar tomorrow 24th Jan} • {See example on occupations this afternoon} Student’s Law:…In survey data analysis, somebody else has already struggled through the variable constructions you are working on right now…
Research interests, data analysis and data management (2) • Availability and advocacy of complex methods of data analysis • Complex statistical approaches • Multi-process models (CQeSS, http://e-science.lancs.ac.uk/cqess/) • Latent variable and Multilevel analysis • Missing data analysis (e.g. www.missingdata.org.uk) • See the SSSN Master Class programme..!! • Challenging methodological approaches • Mixed methods research • See esp. the ESRC NCRM (http://www.ncrm.ac.uk/ ) • The daily work of survey researchers straddles social science and statistical traditions
A research capacity shortfall? • Concern that UK lacks sufficient trained social researchers with quantitative analytical skills • Criticism that social scientists don’t sufficiently exploit empirical survey data • Insufficient impact of published analyses • Published analyses are too simple and crude • {this doesn’t really apply to economics!} • This is in some ways a puzzle, given dramatic progress in the availability of survey data (e.g. www.data-archive.ac.uk) and in resources for statistical analysis
Returning to survey data management… • Simple survey data management • Short recodes; selecting cases; one small data file • taught in many textbooks and reasonably widely understood by most users of SPSS, Stata, etc • Complex survey data management • Matching multiple data files; complex variable operationalisations; complex relations between cases • Is rarely taught in textbooks/courses • Is usually required at some stage • Often puts off non-specialists
A substantial social science need for improved standards and resources in data management • In practice, social researchers often spend more time on data management than any other part of the research process • A ‘methodology’ of data management is relevant to social science literatures on ‘harmonisation’, ‘comparability’
[Diagram. Stages shown: Data access / collection; Data Management; Data Analysis. Labels shown: DAMES; ONS support; ESDS support; UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS]
Confronting complex data management… There are two related possibilities • Generic resources and services for (survey) data management • Format independence • Computer science research (e-science) • Specialist support for key social survey data management approaches • Directed to specific software formats • Directed to specific example datasets
(i) DAMES – Data Management through e-Social Science ESRC National Centre for e-Social Science research Node, University of Stirling / University of Glasgow, 2008-2011 Case studies, provision and support for data management in the social sciences 4 social science themes • Grid Enabled Specialist Data Environments • occupations; education; ethnicity • Micro-simulation on social care data • Linking e-Health and social science databases • Training and interfaces for data management support Underlying computer science research themes • Linking heterogeneous and distributed data; metadata; data abstraction and data fusion; workflow modelling; data security
(ii) Specialist support for survey research communities • Scottish Social Survey Network • Focussed advice on a smallish range of • Key surveys • Key variables • Stata and survey data management • Stata combines extensive routines for data analysis with extensive routines for data management
Stata and its competitors (1) Claim: Stata offers unparalleled convenience in combining pre-programmed data analytical and data management functionality • Ease of data access, manipulation and review • Conditional processing (‘if’, ‘by’) • Succinct command syntax • Ability to read online files • Exporting / saving results and graphs • Regression model outputs • Matrix manipulation of model results • Development of new analytical routines • Research community posting new models (researcher driven) • Complex data estimators (svy; cluster; xt; xtmixed)
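The conditional processing mentioned on this slide can be sketched as follows. This is a minimal illustration on made-up data, not an example from the lab files; the variable names (pid, wave, inc, nwaves, firstinc) are assumptions chosen for the sketch.

```stata
* Toy data: three people observed over one or two waves (all values invented)
clear
input pid wave inc
1 1 100
1 2 120
2 1 200
2 2 .
3 1 150
end

* 'if' restricts a command to the cases that satisfy a condition
summarize inc if wave == 1

* 'by' repeats a command within groups; bysort sorts the data first,
* here by person and then by wave within person
bysort pid (wave): generate nwaves = _N
bysort pid (wave): generate firstinc = inc[1]

list, sepby(pid)
```

Within a by-group, _N is the number of records in the group and inc[1] is the value on the group's first record, which is why the sort order given in brackets matters.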
Stata and its competitors (2) Claim: Stata is ultimately much more powerful, but it is not always well designed • Batch files / interactive syntax / programs: • Stata has more flexibility, but SPSS interactive syntax is easier (e.g. delimiters) • Direct data entry / browsing • Stata is clumsy – easier to use SPSS or another package • Variable and value labels and presenting outputs • SPSS quicker and better presentation; Stata needs more effort • Computing / recoding / conditional processing • Stata more extensive (eg ‘by’ and ‘if’); SPSS easier to use – eg Stata won’t allow overwriting an existing variable • Missing values / weighting data • Stata’s default settings cause more confusion than SPSS • Stata has some restrictions on its weights / SPSS easier • Complex data estimators (svy; cluster; xt; xtmixed) • Unique and advantageous feature of Stata • But many Stata models are very slow to estimate – e.g. GLLAMM
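Two of the points on this slide can be illustrated with a short sketch on invented data. It is only an assumption that these are the cases the slide has in mind: generate refuses to overwrite an existing variable (replace is needed instead), and numeric missing values are treated as larger than any number in comparisons.

```stata
* Toy data, all values invented
clear
input pid age
1 25
2 70
3 .
end

* generate only creates new variables; repeating it for an existing name
* stops with an error, so later changes go through replace
generate older = 0
replace older = 1 if age >= 65

* Pitfall: person 3 is now flagged as older, because numeric missing (.)
* counts as larger than any number in a comparison
list

* One way to guard against it: set the flag to missing where age is missing
replace older = . if missing(age)
list
```

Writing the condition as `age >= 65 & !missing(age)` in the first replace is an alternative that leaves the flag at 0 rather than missing; which is appropriate depends on the analysis.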
Some existing resources on data management • Stata’s files: http://www.stata.com/support/faqs/data/ • LDA WebCT site www.longitudinal.stir.ac.uk, worked examples of data management on complex survey data using SPSS and Stata: • ‘introductory training in data analysis’ • ‘longitudinal research resources’ • Model – ‘learn by doing’… • Researcher input: • Importance of logging your work (‘syntax’ / ‘do’ files) • Consistent use of file paths / annotation of command files
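A do-file that follows the advice on logging, file paths and annotation might be laid out roughly as below. This is a sketch under stated assumptions: the log name analysis1.log and the global macro path are hypothetical, and the auto dataset shipped with Stata stands in for a real survey file.

```stata
* Hypothetical do-file skeleton (say, analysis1.do)
clear
set more off

* One macro holds the working folder so that every path is written the same way.
* Here it is the current directory; a project folder would normally go here instead.
global path "`c(pwd)'/"

* Open a plain-text log, closing any log left open by an earlier run
capture log close
log using "${path}analysis1.log", replace text

* Data management and analysis commands follow, annotated as they are written
sysuse auto, clear
describe
summarize price mpg

log close
```

Keeping the folder in one macro means that moving the project to another machine requires changing a single line.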
Stata lab 23/1/08: illustrating integrated data management and analysis • Example files from ‘Longitudinal data analysis’ www.longitudinal.stir.ac.uk • 4 LDA files with extended examples • {Data (from UKDA) should be in place on machines for today} • First lab: a selective summary file • Concentrates on matching data and manipulating variables
Variable management in Stata • Painful text value label processes.. • Recoding data examples • Use of ‘do’ and ‘ado’ batch files • Matching with aggregate datasets • Further resources on operationalising variables: see talk on ‘Handling occupational data’
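As a rough illustration of recoding and of the value-label steps listed on this slide, the sketch below bands an invented age variable and attaches text labels. The names agegrp and agegrp_lbl and the band boundaries are assumptions made for the example.

```stata
* Toy data, all values invented
clear
input pid age
1 17
2 34
3 52
4 71
5 .
end

* Recode a continuous variable into bands, keeping the original intact
recode age (min/24 = 1) (25/44 = 2) (45/64 = 3) (65/max = 4), generate(agegrp)

* Text labels are defined once as a named set, then attached to the variable
label define agegrp_lbl 1 "Under 25" 2 "25-44" 3 "45-64" 4 "65 and over"
label values agegrp agegrp_lbl
label variable agegrp "Age group (banded)"

tabulate agegrp, missing
```

The two-step label define / label values process is what makes value labelling feel laborious, but a defined label set can be reused on any number of variables.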
Matching files • Complex data inevitably involves more than one related data file • Multiple related files are almost inevitable with longitudinal data collections • A vital data analysis skill!! • Link data between files by connecting them according to key linking variable(s) • Eg, ‘person identifier’ variable ‘pid’ • Eg : iserwww.essex.ac.uk/ulsc/bhps/doc/ See SPSS and Stata example command files within LDA Website
Types of file matching
• Addition of files
  • E.g. two files with the same variables for different people
  • Stata: append using file2.dta
  • SPSS: add files file="file1.sav" /file="file2.sav" .
• Case-to-case matching
  • One-to-one link, e.g. two files with different sets of variables for the same people
  • Stata: merge pid using file2.dta
  • SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
• Table distribution
  • One-to-many link, e.g. one file has individuals, another has households, and household info is matched to the individuals
  • Stata: merge pid using file2.dta
  • SPSS: match files file="file1.sav" /table="file2.sav" /by=pid .
  • Note: when the table file is at household level, the linking variable needs to be the household identifier rather than ‘pid’
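The three types of match can be tried out on invented data as sketched below. Two assumptions should be flagged: the sketch uses the merge 1:1 / m:1 syntax, which requires Stata 11 or later (the slide shows the older form, which also expects both files to be sorted on the key), and the household identifier hid and all file contents are hypothetical.

```stata
* Household-level file (invented values)
clear
input hid hhsize
10 2
20 1
end
tempfile households
save `households'

* A second batch of people, with the same variables as the first batch
clear
input pid hid age
3 20 61
end
tempfile people2
save `people2'

* Extra variables for the same people
clear
input pid inc
1 100
2 250
3 90
end
tempfile incomes
save `incomes'

* First batch of people: this is the master file held in memory
clear
input pid hid age
1 10 34
2 10 36
end

* Addition of files: stack the second batch underneath the first
append using `people2'

* Case-to-case matching: one record per person in both files
merge 1:1 pid using `incomes'
drop _merge

* Table distribution: many people per household, one record per household
merge m:1 hid using `households'
drop _merge

list
```

Stating 1:1 or m:1 makes Stata stop with an error if the key is not unique where it is expected to be, which catches many matching mistakes early.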
Types of file matching, ctd.
• Aggregating
  • Summarise over multiple cases
  • Stata: collapse (mean) inc, by(pid) or egen avinc=mean(inc), by(pid)
  • SPSS: aggregate outfile="file2.sav" /break=pid /avinc=mean(inc).
  • Output files from aggregate / collapse are often linked back into the micro-data from which they are derived
• Related cases matching
  • Link info from one related case to another case, e.g. info on spouse put on own case
  • Stata: merge pid using file2.dta or joinby …
  • SPSS: match files file="file1.sav" /file="file2.sav" /by=pid.
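Both ideas on this slide can be sketched on invented data. The first part aggregates income to person level and links the result back; the second copies a spouse's income onto a person's own record. The spouse identifier sppid, the new variable names and the Stata 11+ merge syntax are assumptions of the sketch, and this is one possible approach rather than the method used in the lab files.

```stata
* Part 1: aggregating (invented person-wave records)
clear
input pid wave inc
1 1 100
1 2 120
2 1 200
2 2 240
end

* egen adds the person-level mean to every record directly
egen avinc = mean(inc), by(pid)

* collapse replaces the data with one record per person,
* so the result is saved and then linked back into the micro-data
preserve
collapse (mean) avinc2 = inc, by(pid)
tempfile personmeans
save `personmeans'
restore
merge m:1 pid using `personmeans'
drop _merge
list

* Part 2: related cases matching (invented person records)
* sppid is a hypothetical variable holding the spouse's pid
clear
input pid sppid inc
1 2 100
2 1 200
3 . 150
end

* Build a lookup file in which each person is keyed as somebody's spouse
preserve
keep pid inc
rename pid sppid
rename inc spinc
tempfile spouses
save `spouses'
restore

* Attach spouse income; people without a spouse keep a missing spinc
merge m:1 sppid using `spouses', keep(master match)
drop _merge
list
```

The renaming step is the core of the technique: the same file is matched to itself, with the spouse identifier playing the role of the linking variable.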
File matching crib: Stata
_merge = indicator of cases present for:
  1 = Master file but not input file
  2 = Input file but not Master file
  3 = Master and input file
Remember to drop auto-generated _merge before performing next merge command
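A small sketch of how the _merge indicator might be inspected and then dropped is given below. The data are invented, the "input" file of the crib is the file named after using, and the merge 1:1 syntax assumes Stata 11 or later.

```stata
* Input ('using') file: incomes for persons 2, 3 and 4 (invented values)
clear
input pid inc
2 250
3 90
4 40
end
tempfile incomes
save `incomes'

* Master file: ages for persons 1, 2 and 3
clear
input pid age
1 34
2 36
3 61
end

merge 1:1 pid using `incomes'

* Review the match before going further:
* person 1 is master only (1), person 4 is using only (2),
* persons 2 and 3 are in both files (3)
tabulate _merge
list

* Keep the fully matched cases, then drop _merge so that
* the next merge command can create it afresh
keep if _merge == 3
drop _merge
```

Tabulating _merge after every match is a cheap check that the linkage behaved as expected before any cases are discarded.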