Manipulating data: Deriving variables, handling missing data, and cleaning data - practices, services and standards Paul Lambert (Dept. Applied Social Science, Univ. Stirling) Vernon Gayle (Dept. Applied Social Science, Univ. Stirling and ISER, Univ. Essex) 27th January 2009 Presented to the workshop ‘The significance of data management for social survey research’, University of Essex, organised by the Economic and Social Data Service (www.esds.ac.uk) and the ‘Data Management through e-Social Science’ research Node of the National Centre for e-Social Science (www.dames.org.uk).
Manipulating data • Operations performed on datasets by researchers and/or data distributors • At any stage of the research lifecycle • Of considerable consequence to analytical results • DAMES Node: • ‘Data Management’ = manipulation of data, and documenting/assisting the processes of manipulation • E-Social Science approach to facilitating data manipulation (metadata resources; data access facilities; ‘workflow models’)
Deriving variables, handling missing data, and cleaning data …Especially common types of data manipulation… • Deriving variables = computing new measures for purposes of analysis • E.g. recoding complex categorical variables; standardising scores; linking micro- and macro-data • {Creating composite vars., e.g. selection model hazards, propensity scores, weights} • Handling missing data = strategies for item or case non-response • E.g. imputation approaches; listwise/pairwise deletion • {deriving ‘missing variables’ via ‘data fusion’} • Clarifying, stating & documenting assumptions (see www.missingdata.org.uk) • Cleaning data = monitoring and adjusting responses across a given set of variables • E.g. extreme values; erroneous values; re-scaling distributions
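To make the three types concrete, a minimal Stata sketch (Stata being the package used for the authors' own RAE syntax later in this talk); the variable names educ, income and age are invented purely for illustration:

// Deriving variables: collapse a detailed education code into three bands,
// and standardise a score to mean 0, sd 1
recode educ (1/3 = 1 "Low") (4/6 = 2 "Mid") (7/9 = 3 "High"), gen(educ3)
egen income_z = std(income)

// Handling missing data: most estimation commands apply listwise deletion
// by default; flagging the affected cases makes that assumption explicit
egen nmiss = rowmiss(income educ age)
gen complete = (nmiss == 0)

// Cleaning data: inspect the distribution, then cap extreme incomes at the
// 99th percentile (one of several defensible treatments of outliers)
summarize income, detail
replace income = r(p99) if income > r(p99) & !missing(income)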
In this talk…Practices, services and standards …For deriving variables, handling missing data, and cleaning data… • Practices • Key, or common, features of current approaches • Services • Resources available/conceivable • Standards • Preliminary thoughts on standards setting
(i) A brief illustrative example from the UK RAE 2008 • Research Assessment Exercise data published Dec 2008 • Extended reporting on basic data by media/within HE sector, e.g. • Cambridge leads the way • Nursing raises its status • Numerous enhancements/amendments to the data & analysis can be easily generated, and often lead to a different story • Lambert, P.S. and Gayle, V. (2008). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. University of Stirling: Technical Paper 2008-3 of the Data Management through e-Social Science Research Node (www.dames.org.uk).
…Extending analysis of the 2008 RAE using data manipulations... • Deriving variables • Commonly used RAE ‘Grade point average’ • [4×(%4*) + 3×(%3*) + 2×(%2*) + 1×(%1*)] / 100 • Calculate alternative GPA measures • Standardise GPA within Units of Assessment • Rate Units of Assessment by external measures of relative ‘prestige’ • Link with 2001 standard thresholds • Other external data – e.g. Univ. typologies; RAE panel membership • Cleaning data • Of 159 HEIs, 27 have only 1 UoA • cf. mean of 15 UoAs per HEI, max 53 (Manchester) • The single-UoA HEIs often have outlying GPAs • Analyses of averages might exclude these HEIs • Handling missing data • Missing data in a less conventional sense (an administrative dataset) • But not all HEI staff were included within the RAE; consider analyses accounting for the number of excluded staff?
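The GPA calculation and the within-UoA standardisation translate into a short Stata sketch; the variable names pct4–pct1 (the published percentage profiles) and uoa are assumptions for illustration, since the released RAE files use their own naming:

// RAE grade point average: [4×(%4*) + 3×(%3*) + 2×(%2*) + 1×(%1*)] / 100
gen gpa = (4*pct4 + 3*pct3 + 2*pct2 + 1*pct1) / 100

// Standardise GPA within Units of Assessment, so each submission is scored
// relative to the mean and spread of its own subject area
egen gpa_mean = mean(gpa), by(uoa)
egen gpa_sd = sd(gpa), by(uoa)
gen gpa_std = (gpa - gpa_mean) / gpa_sd

Standardising within UoA is what allows rankings to compare HEIs across subject areas with very different grade distributions.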
Alternative RAE 2008 measures for Univ. Essex (within- and between-subject standardisations)
RAE data manipulations example – practices, services and standards • Practices • Media/HEI announcements concentrate upon simplistic, unweighted, unstandardised rankings/averages • Various alternative measures tell different stories – we found… • LSE outranks Cambridge • Nursing ranks as the 6th least prestigious UoA of 67 • Services • Raw data available online: www.rae.ac.uk • Relevant supplementary data: www.hesa.ac.uk ; www.dames.org.uk • Standards • RAE-level documentation on grading criteria and approach, www.rae.ac.uk • Software-based workflow approach (cf. Scott Long, 2009) • In our paper we show Stata syntax for derived variables (www.dames.org.uk)
(ii) Some wider thoughts on data manipulation practices, services and standards Currently… • Practices are messy and painful • Lack of replication and consistency in data manipulation tasks with complex survey data • Few people relish data manipulations! • Services exist but are under-exploited • Standards are not agreed • Ignoring standards is no barrier to publication
Practices: apparent trends Deriving variables, handling missing data, cleaning data • More interest in harmonisation and comparability • Longitudinal and cross-national data • Documentation challenges encourage simplifying approaches • New data and analytical opportunities • Increasing opportunities for enhancing data by linking at micro- or aggregate level • Increasing availability of routines for missing values, extreme values • Raising standards in secondary analysis of large scale surveys • Inadequacy of simple analyses which ignore multivariate relations, missing data, multiprocess systems, hierarchical structures • Data manipulations often conducted outside these considerations • Desirability of replication
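As one concrete instance of linking at micro or aggregate level, a hedged Stata sketch; the file names (survey_micro, region_macro_indicators) and variables are invented, and the m:1 merge syntax is that of more recent Stata releases:

// Enhance individual records with regional context derived from the
// micro data itself...
use survey_micro, clear
egen region_unemp = mean(unemployed), by(region)

// ...or merge in indicators from an external macro-level file keyed on region
merge m:1 region using region_macro_indicators
drop if _merge == 2 // discard macro records with no matching micro cases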
Services: key challenges Deriving variables, handling missing data, cleaning data • Software issues • Dominance of major proprietary database packages • Other specialist/minority packages (e.g. MLwiN) • Documentation / replication between packages..? • Data security • Few services can offer to let experts take over a dataset • Approaches to reviewing data ought to avoid inspecting individual cases or making duplicate copies • Keeping up-to-date? • Finding data – need for search facilities [via metadata] • Updating specialist advice • E.g. for GEODE, occupational data were out of date before completion • NSIs’ strict focus on contemporary data
Standards: key requirements Deriving variables, handling missing data, cleaning data • Need for documentation for replication • Detailed accounts of process • Citation of sources • DAMES – to facilitate this with metadata and process tools • Resolving some difficult debates • Approaches to comparative research (measurement equivalence vs. meaning equivalence) • Necessary standards for analysis/reporting on missing data • Appropriate approaches to extreme values, e.g. robust regressions
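Two of the approaches named above can be sketched in Stata, with invented variable names and an illustrative imputation model (the mi commands shown belong to later Stata releases than those current at the time of this talk):

// Multiple imputation, with the imputation model stated explicitly so it
// can be documented and replicated (cf. www.missingdata.org.uk)
mi set wide
mi register imputed income educ
mi impute chained (regress) income (ologit) educ = age female, add(10)
mi estimate: regress outcome income educ age female

// Robust regression as one principled response to extreme values
rreg outcome income educ age female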
Forthcoming DAMES contributions • Summer workshops on documenting manipulation and analysis of complex survey data • ‘To Stata and beyond..’ • Services for improving data manipulation activities • Specialist data on occupations, ethnicity, education • Specialist data on social care, mental health • Tools for performing data manipulations (linking data and operationalising variables) • Services for recording data manipulation activities • Workflow modelling tools • Metadata records for data linkages and variables • Citation information