Discipline and Data Management: Good practice

Professor Vernon Gayle, University of Stirling

Overall Message
• Working with a programming “syntax” is essential
• Do NOT use menus to operate data management software

Some Practical Thoughts… “The best habit you can get into is to get into good habits”

The Paper Trail
• Ensure that all serious work can be reproduced, i.e. have a clear ‘paper trail’ in place
• You might have to reproduce or extend work much later (months or even years), for example after referees’ comments, external examiners’ remarks etc.

The Paper Trail
• The platinum standard is that if a research assistant/fellow was killed in a freak accident the professor could complete the project
See Long (2009) Section 1.1, Replication: The guiding principle for workflow

The Paper Trail
• The gold standard is that all files and notes are correctly and clearly set out so that they can be passed on to someone without much explanation
• This will mean that you and the other members of the research team can follow the paper trail and therefore subsequently reproduce and augment material if required
• This is particularly important as referees can often ask for minor, and in the case of some of my work major, amendments to statistical analysis
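One way to keep the paper trail inside the work itself is to let every job write its own log. The do-file skeleton below is a minimal sketch only: the log file name, the `version 10` line and the use of the `auto` example dataset shipped with Stata are assumptions made for illustration and do not come from the slides.

```stata
* Hypothetical do-file skeleton; all file names are placeholders
version 10                      // record which Stata version the job was written for
clear
set more off
capture log close               // close any log left open by an earlier run
log using "paper_trail_demo.log", replace text

sysuse auto, clear              // example dataset shipped with Stata
* Note (initials, date): inspect price before any modelling
summarize price

log close
```

Running the whole file with `do` leaves a plain-text log that can be filed alongside the do-file and passed on without further explanation.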

Why Bother? The payoff from discipline…
• Work can be replicated
• Work can be transferred (e.g. personnel changes)
• Work ultimately becomes quicker
  • Getting into a muddle less often
  • Getting out of a muddle quicker
• Increasingly important in research governance, openness and ethics

Making A Start
• IT IS ESSENTIAL TO KNOW YOUR DATA
• This includes understanding how concepts have been operationalised (e.g. via the survey instrument). It is worth thinking about how the survey instrument has been applied. Think about all the tiny nuts and bolts, for example the rubric of questions and how the routing has been worked out. These minor issues may have a major impact on your data
• Understanding how variables have been measured and coded is OBVIOUSLY essential. It is also worth getting to know the distribution of variables and some simple measures of central tendency (e.g. means and modes)
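Getting to know coding and distributions can be done with a handful of standard commands. This is a sketch under the assumption that the `auto` example dataset shipped with Stata stands in for a real survey file; the variables named are from that dataset, not from the BHPS.

```stata
sysuse auto, clear              // stand-in for a real survey dataset
describe                        // variable names, storage types and labels
codebook rep78 foreign          // coding, value labels and missing values
summarize price mpg, detail     // mean, median and percentiles
tabulate rep78, missing         // frequencies; the mode is the largest cell
```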

Making A Start
• Make sure that you are working with the best data available. In the case of the BHPS this will be the most recent release of the data
• ALWAYS MAKE BACK-UP FILES - Work with as clean a set of data as possible
• Always start with exploratory analysis
• EVERY recode, compute, re-labelling task should be documented and be traceable in the paper trail
• DON’T START ANALYSES TOO SOON!
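A documented, traceable recode might look like the sketch below. The back-up file name, the new variable `rep3` and its category labels are hypothetical, and the `auto` example dataset is again only a stand-in.

```stata
sysuse auto, clear
save "auto_backup.dta", replace        // hypothetical back-up copy, never edited

* Recode into a NEW variable so that the original is left intact
recode rep78 (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good"), generate(rep3)
label variable rep3 "Repair record 1978, 3 categories"
notes rep3: Collapsed from rep78 (1-2, 3, 4-5). Missing values left as missing.
tabulate rep78 rep3, missing           // check the recode against its source
```

The cross-tabulation at the end is the check: every source value should land in exactly one new category, and missing values should stay missing.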

Some Little Tricks
• Always “guesstimate” the output before you formally estimate (i.e. fit) your model. This will help trap errors or indicate when your data is “behaving badly”
• Always have a notebook handy (or use notepad or your word processor) to help with the paper trail
• Keep a calculator handy
• If a job is incomplete keep a record. For example I frequently e-mail myself at the end of the day so that I am reminded the next time I log on
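The “guesstimate” habit can be illustrated with a deliberately simple case. In the sketch below, which assumes the `auto` example dataset shipped with Stata, the raw group means are inspected first; in a regression on a single 0/1 indicator the coefficient equals the difference between the two group means, so the fitted model can be checked against the guess.

```stata
sysuse auto, clear
tabulate foreign, summarize(mpg)    // raw group means: the basis for the guess
regress mpg foreign                 // coefficient on foreign = gap between the two means
```

If the coefficient and the gap in means disagree, something is wrong with the data or the command, and it is caught before any interpretation starts.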

A ‘Take-Home’ Message
• REMEMBER – REAL DATA IS MUCH MORE MESSY, BADLY BEHAVED, HARD TO INTERPRET ETC. THAN THE DATA USED IN BOOKS AND AT WORKSHOPS

Which Software For Data Handling?
• You need software that can:
  • Help you keep track, through a clear and concise programming syntax
  • Merge data sets (e.g. waves of a panel survey)
  • Match data (e.g. individuals to households)
  • Recode variables easily (often repeatedly)
  • Support exploratory data analysis
  • Store, report and compare results flexibly
  • Allow you to re-run jobs later (sometimes months later)
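Matching individuals to households can be sketched as follows. The variable names (`pid`, `hid`, `region`, `age`), the toy values and the file name `hh_demo.dta` are all hypothetical, and the `merge m:1` syntax assumes Stata 11 or later; earlier versions used a different merge syntax.

```stata
* Household-level file: one row per household (toy data)
clear
input hid region
1 1
2 2
end
save "hh_demo.dta", replace

* Individual-level file: several people can share a household
clear
input pid hid age
101 1 34
102 1 36
201 2 52
end

* Many individuals to one household record
merge m:1 hid using "hh_demo.dta"
tabulate _merge                 // inspect the match before trusting it
assert _merge == 3              // stop the job if any record failed to match
drop _merge
```

The `assert` line turns the check into part of the paper trail: a later re-run fails loudly if the files stop matching.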

Which Software For Data Management?
Stata    Very good
SPSS     Suitable (when syntax is used)
Excel    Inadequate for large-scale datasets
R        Needs too much programming
SAS      Needs too much programming
MLwiN    Difficult to do data management
Mplus    Difficult to do data management

Which Software For Data Management & Survey Research?
• In our view, Stata has excellent data management facilities, and the majority of mainstream social science data analysis techniques can be undertaken in Stata

Which Software For Data Management?
• If you need to use a specialist data analysis package (e.g. SABRE, MLwiN) you might first need to do your data preparation (i.e. data management) in Stata
• This is a common approach but it can still be problematic, because data analysis is usually iterative and each change to the data means moving between packages again
• Many successful researchers are happy to settle for the functionality of Stata to avoid this problem

Main Window – Where main activity takes place, and output is reported

Review Window – We do not recommend using Stata in an ‘interactive mode’ so this window will not report too much

Variables Window – This window contains information about your dataset

Viewing data – Data is in a spreadsheet format. Some people prefer to use the list command (list in 1/10); others view the data in a spreadsheet using the data edit or data browse buttons

The Stata Do-File Editor – Some people prefer to use a text editor or programming tool

*******************************************************.
*** Longitudinal Data Analysis for Social Science Researchers
**
** ESRC Researcher Development Initiative training programme:
**
** Training materials lab 1:
** INTRODUCTORY LONGITUDINAL DATA ANALYSIS AND DATA MANAGEMENT -
** 5 APPROACHES TO QUANTITATIVE LONGITUDINAL DATA ANALYSIS .
**
** www.longitudinal.stir.ac.uk
** Paul Lambert / Vernon Gayle, 3 August 2008
*******************************************************.
**** Stata VERSION **************************************
**********************************************************.
*****************************************************.
** The file below covers introductory examples of five approaches to
** quantitative longitudinal data analysis:
**
*** Section 1: Repeated cross-sectional survey data
*** Section 2: Panel survey data
*** Section 3: Cohort study survey data
*** Section 4: Event history survey data
*** Section 5: Time series statistical data
**
**********************************************************.
*******************************************************.
** GENERAL INSTRUCTIONS ON THESE FILES
**
** Work through this file in the interactive do-file editor, replicating
** the Stata do-file commands. Further help on working with Stata is
** available from the LDA web site.
**
*** This lab file assumes you have a number of files downloaded to your
** machine. You will need the following:
*
** 1) Downloadable from the LDA site :
* - gb91soc2000.dat (this is used during variable constructions for the LFS exercise)
*
** 2) Downloadable from the UK Data Archive:
* - ssa02.dta, ssa01.dta, ssa00.dta and ssa99.dta
* (Scottish social attitudes 2002, 2001, 2000, 1999,
* Stata datasets for study numbers 4808, 4804, 4503, 4346, Stata format files)
*
* - lfs1991.dta, qlfsja96.dta, qlfsja01.dta
* (Labour Force Surveys mid 1991, 96, 2001 respectively,
* Stata datasets from study numbers 2875, 3647, 4448)
* (in previous editions of this exercise, the 1991 data was called f87511.dta)
*
* -All BHPS Waves 1-15 component files in Stata format (UK Data Archive Study number
* 5151 (June 2007 release) (extracted from the zip file 5151Stata8.ZIP)
* (warning - these are a large volume of files, ~153 different files, ~ 600MB)
*
* - The six Stata format 'episode' files from the BHPS Derived life history files
* (UKDA study number 3954, 5th Edition) (covering waves 1-14 only) (you want to access
* the 3 files data files on the top directory of the 3954*.zip archive,
* called newpan.dta, xlempe.dta, and xljobe.dta, plus the 3 files in the
* 'episode' subfolder of the 3954*.zip archive, called l*.dta)
*
* - 2364a.dta (National Child Development Study teaching dataset 1958-1981,
* UKDA study number 2364, Stata data files from the zip archive 2364Stata6.ZIP)
*
*******************************************************.

Example of a (.do) syntax file – This file has good clear annotation

Stata: Some more general points

Stata Software – good points
• Does all the simple stuff (like SPSS)
• Fits many more models than standard software (esp. longitudinal)
• Specialist survey analysis functions (svy)
• Increasingly important for complex datasets (e.g. UKHLS)
• You can get started easily (menus and help)
• Strong documentation
• There is a growing user community (lists etc.)
• New features emerge almost daily
• There are good labour market opportunities (little known in the UK; well known in the USA)

Stata Software – bad points
• People are more used to the look of SPSS
• Stata syntax has some quirks (e.g. set more off, defining labels, missing values etc.) and a few unexpected limitations (e.g. constraints on display of tables of means)
• There are some esoteric models that can’t be fitted (also some critiques of estimation procedures and their speed)
• There is a growing user community, but they are generally GEEKBOYS (like myself!)
• New features emerge almost daily; these are sometimes tricky to get to grips with

Using Stata – some user tips (see Treiman (2009) p.84)
• Session settings: set more off; set mem 64M
• ‘Capture’ command: capture drop varname before gen varname
• See values + labels: numlabel _all, add
• File info: codebook
• Data overwrite: use data1.dta, clear; save data2.dta, replace
• Processing commands: whole batch file with do (don’t double-click ‘do’ files); interactive do-file editor (like the SPSS syntax window)
• Line breaks with ///
• Define file locations: global path1 "c:\data\lda" then use "$path1\file1.dta", clear
• Browsing data: list in ???
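Several of these tips can be seen working together in one short do-file. The sketch below is illustrative only: the folder name `demo_data`, the variable `price_k` and the use of the `auto` example dataset are assumptions, and forward slashes are used in paths because Stata accepts them on all platforms.

```stata
set more off
global path1 "demo_data"                 // hypothetical folder for this job
capture mkdir "$path1"                   // capture: no error if it already exists

sysuse auto, clear
save "$path1/data1.dta", replace

use "$path1/data1.dta", clear
capture drop price_k                     // safe to re-run: drops only if present
generate price_k = price / 1000
numlabel _all, add                       // show numeric codes next to value labels
tabulate foreign
summarize price_k mpg ///
    if foreign == 1                      // the command continues from the line above
save "$path1/data2.dta", replace
```

Because of `capture drop` and the `replace` options, the file can be run repeatedly from top to bottom with `do` and gives the same result each time.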

Taking Stata further
• Online resources
  • Stata website for FAQs, manual, training
  • Net use and update
• Specialist modelling suites
  • XT – Cross-sectional time-series (panel) data
  • ST – Survival data
  • SVY – Survey data
  • xtmixed – Multilevel models (v9)
  • GLLAMM
• Programming: .do; .ado; macros
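As a first step into the XT suite, a dataset has to be declared as panel data before the xt commands can be used. The following is a minimal sketch with made-up values; the variable names `pid`, `wave` and `income` are hypothetical.

```stata
* Toy long-format panel: two people observed over three waves
clear
input pid wave income
1 1 10
1 2 12
1 3 15
2 1 20
2 2 19
2 3 23
end

xtset pid wave          // declare the panel identifier and the time variable
xtdescribe              // pattern of participation across waves
xtsum income            // overall, between-person and within-person variation
```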
