Session 27: Resources for Data Management and Handling Social Science Data

3rd ESRC Research Methods Festival, Oxford, 1 July 2008
Workshop organised by the ‘Data Management through e-Social Science’ (DAMES) research Node of the National Centre for e-Social Science
www.dames.org.uk / www.ncess.ac.uk
NCRM, Session 27, 1 July 2008

Data management & handling social science data: Key issues, concerns, & the relevance of e-Science
• The nature of data management
• Key issues and concerns
  • good habits and principles
  • challenges
• The contributions of…
  • e-Social Science
  • the DAMES Node (www.dames.org.uk)

‘Data management’ means…
• ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ [DAMES research Node]
• Usually performed by social scientists themselves
• Most overt in quantitative survey data analysis
  • ‘variable constructions’, ‘data manipulations’
  • navigating abundance of data – thousands of variables
• Usually a substantial component of the work process

Some components…
• Manipulating data
  • Recoding categories / ‘operationalising’ variables
• Linking data
  • Linking related data (e.g. longitudinal studies)
  • Combining / enhancing data (e.g. linking micro- and macro-data)
• Secure access to data
  • Linking data with different levels of access permission
  • Detailed access to micro-data cf. access restrictions
• Harmonisation standards
  • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
  • Recommendations on particular ‘variable constructions’
• Cleaning data
  • ‘missing values’; implausible responses; extreme values
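The ‘cleaning data’ component can be illustrated with a minimal sketch. It assumes, hypothetically, that negative codes such as -9 mark missing answers and that each variable has a plausible range; the variable names, codes and ranges are invented for illustration and would in practice come from the study documentation.

```python
# Assumed convention: these negative codes mean "missing" (check the codebook).
MISSING_CODES = {-9, -8, -1}

# Hypothetical survey records, not taken from any real dataset.
records = [
    {"pid": 1, "age": 34, "income": 1200.0},
    {"pid": 2, "age": -9, "income": 950.0},
    {"pid": 3, "age": 131, "income": -1},
    {"pid": 4, "age": 58, "income": 2100.0},
]


def clean_value(value, low, high):
    """Return (cleaned value, flag); the cleaned value is None when unusable."""
    if value in MISSING_CODES:
        return None, "missing"
    if not (low <= value <= high):
        return None, "implausible"
    return value, "ok"


def clean_records(rows):
    """Clean age and income, keeping a flag that records why a value was dropped."""
    cleaned = []
    for row in rows:
        age, age_flag = clean_value(row["age"], 0, 110)
        income, income_flag = clean_value(row["income"], 0, 1_000_000)
        cleaned.append({
            "pid": row["pid"],
            "age": age, "age_flag": age_flag,
            "income": income, "income_flag": income_flag,
        })
    return cleaned


cleaned = clean_records(records)
for row in cleaned:
    print(row)
```

Keeping a flag next to each cleaned value means the decision to discard a response stays visible in the data rather than being silently applied.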

Example – recoding data
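The original slide illustration is not preserved in this transcript. As a stand-in, the following is a minimal Python sketch of a recode, assuming a hypothetical marital-status variable with invented codes 1 to 6. It keeps the original variable, builds a derived one, and cross-checks old against new values.

```python
from collections import Counter

# Hypothetical coding frame, invented for illustration only.
RECODE_MAP = {
    1: "partnered",             # married
    2: "partnered",             # living as a couple
    3: "previously partnered",  # widowed
    4: "previously partnered",  # divorced
    5: "previously partnered",  # separated
    6: "never partnered",       # never married
}


def recode(value, mapping, default=None):
    """Map an original code to its derived category; unknown codes get `default`."""
    return mapping.get(value, default)


# Original values are kept untouched; the derived variable is a new list.
original = [1, 2, 6, 4, 3, 1, 9]
derived = [recode(v, RECODE_MAP) for v in original]

# Cross-tabulate old by new values to check that the recode did what was intended.
crosstab = Counter(zip(original, derived))
for (old, new), n in sorted(crosstab.items(), key=lambda item: item[0][0]):
    print(old, new, n)
```

The code 9 is not in the mapping, so it shows up in the cross-tabulation as an unassigned value that needs a decision.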

Example – Linking data
Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
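A linkage of this kind can be sketched in Python as a lookup on the occupation code. This is only an illustration under stated assumptions: the occupation codes and scores below are made up (they are not real CAMSIS values), and `occ_score` is a hypothetical name for the derived column.

```python
# Made-up lookup table: occupational unit code -> scale score.
score_lookup = {1111: 62.5, 2222: 71.0, 3333: 48.3}

# Hypothetical survey cases carrying the linking variable 'ojbsoc00'.
survey = [
    {"pid": 101, "ojbsoc00": 1111},
    {"pid": 102, "ojbsoc00": 3333},
    {"pid": 103, "ojbsoc00": 1111},
    {"pid": 104, "ojbsoc00": 9999},  # code absent from the lookup table
]


def link_scores(rows, lookup, key="ojbsoc00", new_var="occ_score"):
    """Attach the looked-up score to each case and report codes with no match."""
    linked = []
    unmatched = set()
    for row in rows:
        out = dict(row)              # copy, so the original records are not altered
        out[new_var] = lookup.get(row[key])
        if out[new_var] is None:
            unmatched.add(row[key])
        linked.append(out)
    return linked, sorted(unmatched)


linked, unmatched = link_scores(survey, score_lookup)
for row in linked:
    print(row)
print("codes without a match:", unmatched)
```

Reporting the unmatched codes is the useful part: a silent failure to link is one of the easiest data management errors to miss.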

A bit of focus…
I tend to emphasise two data management activities:
• Variable constructions
  • Coding and re-coding values
• Linking datasets
  • Internal and external linkages

So why this workshop?
• DM is a big part of the research process
  • ..but receives limited methodological attention
• Poor practice in soc. sci. DM is easily observed
  • Not keeping adequate records
  • Not linking relevant data
  • Not trying out relevant variable operationalisations
• Even though..
  • There are plenty of existing resources and standards relevant to data management activities
  • There are suitable software and internet facilities
  • People are working on DM support (e.g. DAMES)

DAMES research Node
• Social researchers often spend more time on data management than any other part of the research process
[Diagram: three stages of the research process (data access / collection; data management; data analysis) with the support available for each. Labels shown: DAMES; ONS support; ESDS support; UK Data Archive; Qualidata; flagship social surveys; Office for National Statistics; administrative data; specialist academic outputs; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS]

DM: Some further considerations
• DM as stumbling block in research conduct
  • UK has ample data, ample analytical resources, but low levels of exploitation (esp. of complex data)
  • Capacity building aims in DAMES
• Lots of previous work in this field
  • ..See below..
• ‘Data management’ also sometimes means..
  • Data distributors supplying and monitoring use of particular datasets (e.g. UK Data Archive DM guides)

2. Key issues and concerns
• (4) good habits and principles
• (3) challenges
..Not solely about survey research..

(2.1) Good habit: Keep clear records of your DM activities
Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle
Cf. Dale 2006; Freese 2007
• In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax examples: www.longitudinal.stir.ac.uk

Stata syntax example (‘do file’)
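The Stata do file shown on the slide is not preserved in this transcript. The sketch below is not that file; it is a Python analogue of the same habit, under invented assumptions (the variable names, the income threshold of 1000 and the data are all hypothetical). Each step is commented and also written to a plain-text log, so the script and its log together form the paper trail.

```python
"""prepare_income.py: derive an income-band variable (hypothetical example).

Input : a few in-memory rows standing in for a survey file
Output: the complete cases plus a derived variable 'incband', and a step log
"""

log = []  # plain-text paper trail; in practice this could be written to a .log file


def note(message):
    """Record one line describing what a data management step did."""
    log.append(message)


# Step 1: load the data (invented cases for illustration).
rows = [
    {"pid": 1, "income": 800},
    {"pid": 2, "income": 2500},
    {"pid": 3, "income": None},
]
note(f"Step 1: loaded {len(rows)} cases")

# Step 2: keep only cases with a valid income, and record how many were dropped.
complete = [dict(r) for r in rows if r["income"] is not None]
note(f"Step 2: dropped {len(rows) - len(complete)} cases with missing income")

# Step 3: derive the income band; the rule itself goes into the log.
for r in complete:
    r["incband"] = "low" if r["income"] < 1000 else "high"
note("Step 3: created incband (low: income < 1000; high: otherwise)")

print("\n".join(log))
```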

Some comments on survey analysis software..
• “A program like SPSS .. has two main components: the statistical routines, .. and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” [Procter, 2001: 253]
• “Socio-economic processes require comprehensive approaches as they are very complex (‘everything depends on everything else’). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available.” [Crouchley and Fligelstone 2004]

Some personal comments on survey analysis software..
• Data management and data analysis must be seen as integrated processes
• Stata is the most effective software, as it achieves advanced DM and DA functionality and makes good documentation easy
• Others argue that more advanced analytical techniques necessitate other packages – I’m not convinced

(2.2) Principle: Use existing standards and previous research
• Variable operationalisations
  Use recognised recodes / standard classifications
  • ONS harmonisation standards
  • [Shaw et al. 2007]
  • Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003]
  Use reproducible recodes / classifications (paper trail)
• Other data file manipulations
  • Missing data treatments
  • Matching data files (finding the right data)

(2.3) Principle: Do something, not nothing
• We currently put much more effort into data collection and data analysis, and neglect data manipulation
• Survey research – the influence of ‘what was on the archive version’
…In my experience, a common reason why people didn’t do more DM was that they were frightened to…

(2.4) Principle: Learn how to match files
Complex data (complex research) is distributed across different files
In surveys, use a key linking variable for...
• One-to-one matching
  SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
  Stata: merge pid using file2.dta
• One-to-many matching (‘table distribution’)
  SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
  Stata: merge pid using file2.dta
• Many-to-one matching (‘aggregation’)
  SPSS: aggregate outfile="file3.sav" /meaninc=mean(income) /break=pid.
  Stata: collapse (mean) meaninc=income, by(pid)
Note: the Stata merge commands above use the syntax current in 2008; from Stata 11 onwards the same operations are written merge 1:1 pid using file2.dta and merge m:1 pid using file2.dta.
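The three operations can also be sketched in plain Python, which makes the logic of each match explicit. This is an illustration only: the records, the household key `hid` and the function names are hypothetical and are not part of the SPSS or Stata examples above.

```python
from collections import defaultdict
from statistics import mean

# Invented example files: one row per person, per household, or per income spell.
persons = [{"pid": 1, "hid": 10, "age": 34}, {"pid": 2, "hid": 10, "age": 36},
           {"pid": 3, "hid": 20, "age": 51}]
attitudes = [{"pid": 1, "trust": 4}, {"pid": 2, "trust": 2}, {"pid": 3, "trust": 5}]
households = [{"hid": 10, "region": "North"}, {"hid": 20, "region": "South"}]
spells = [{"pid": 1, "income": 1000}, {"pid": 1, "income": 1400},
          {"pid": 3, "income": 900}]


def index_unique(rows, key):
    """Index rows by key, refusing duplicates so a bad match fails loudly."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate {key}={row[key]}")
        index[row[key]] = row
    return index


def match_one_to_one(left, right, key):
    """One-to-one matching: the key must be unique in both files."""
    index_unique(left, key)
    right_index = index_unique(right, key)
    return [{**row, **right_index.get(row[key], {})} for row in left]


def distribute_table(many, table, key):
    """'Table distribution': copy each table row onto every case sharing its key."""
    table_index = index_unique(table, key)
    return [{**row, **table_index.get(row[key], {})} for row in many]


def aggregate_mean(rows, key, var, new_var):
    """'Aggregation': collapse many rows per key into one row holding the mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[var])
    return [{key: k, new_var: mean(v)} for k, v in sorted(groups.items())]


one_to_one = match_one_to_one(persons, attitudes, "pid")
distributed = distribute_table(persons, households, "hid")
aggregated = aggregate_mean(spells, "pid", "income", "meaninc")
print(one_to_one[0], distributed[1], aggregated, sep="\n")
```

Checking that the key is unique where it should be is a deliberate design choice here: most matching errors come from a key that identifies fewer or more cases than the analyst assumed.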

Some challenges for data management..
(2.5) Agreeing about variable constructions
• Unresolved debates about optimal measures and variables
• Esp. in comparative research such as across time, between countries
• http://www.longitudinal.stir.ac.uk/variables/

Some challenges for data management..
(2.6) Worrying about data security
• DM activities could challenge data security
  • Inspecting individual cases
  • Multiple copies of related data files
  • Ability to link with other datasets
  ‘Hands-on’ model of data review
• New and exciting data resources
  • have more individual information
  • are more likely to be released with stringent conditions
  • may jeopardize traditional DM approaches

Some challenges for data management..
(2.7) Incentivising documentation / replicability
• There is little to press researchers to better document DM, but much to press them not to
• Make DM and its documentation easier?
• Reward documentation (e.g. citations)?

3) The relevance of e-Science
• ‘Data management through e-Social Science’
• ‘E-Science’ refers to adopting a number of particular approaches and standards from computing science in applied research areas
• These approaches include ‘the Grid’; distributed computing; data and computing standardisation; metadata; security; research infrastructures
• DAMES (2008-11) – developing services / resources using e-Science approaches which will help social scientists in undertaking data management tasks

E-Science and Data Management
E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…
• Concern with standards setting in communication and enhancement of data
• Linking distributed/heterogeneous/dynamic data
  Coordinating disparate resources; interrogating live resources
• Contribution of metadata tools/standards for variable harmonisation and standardisation
• Linking data subject to different security levels
• The workflow nature of many DM tasks

E.g. of GEODE: Organising and distributing specialist data resources (on occupations)

The contribution of DAMES: 8 project themes

DAMES agenda
• Useful social science provisions
  • Specialist data topics – occupations; education qualifications; ethnicity; social care; health
  • Mainstream packages and accessible resources
• To exploit / engage with existing DM resources
  • In social science – e.g. CESSDA
  • In e-Science – e.g. OGSA-DAI; OMII

..End of talk 1..

Appendix
Existing resources – sources and types of support for data management in the social sciences:

Existing resources (i): Data providers
a) Documentation and metadata files

Existing resources (i): Data providers
• Resources for variables
  • CESSDA PPP on key variables http://www.nsd.uib.no/cessda/project/
  • UK Question Bank http://qb.soc.surrey.ac.uk/
  • ONS Harmonisation http://www.statistics.gov.uk/about/data/
• Resources for datasets
  • UK Census data portal, http://census.ac.uk/
  • IPUMS international census data facilities, www.ipums.org
  • European Social Survey, www.europeansocialsurvey.org
• Data manipulations prior to data release
  • Missing data imputation / documentation
  • Survey design / weighting information
  • Influential – most analysts use ‘the archive version’

Existing resources (ii) Resource projects / infrastructures
• UK ESDS www.esds.ac.uk
  ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
  • Helpdesks; online instructions; user support..
• UK ESRC NCRM / NCeSS / RDI initiatives
  • Longitudinal data – www.longitudinal.stir.ac.uk
  • Linking micro/macro - www.mimas.ac.uk/limmd/
• Other resources / projects / initiatives
  • EDACwowe - http://recwowe.vitamib.com/datacentre
  • ….

Existing resources (iii) Analytical and software support
• Textbooks featuring data management
  • [Levesque 2008] [Sarantakos 2007]
• Software training covering DM
  • Stata’s ‘data management’ manual
  • SPSS user group course on syntax and data management, www.spssusers.co.uk
But generally, sustained marginalisation of DM as a topic
• Advanced methods texts use simplistic data
• Advanced software for analysis isn’t usually combined with extended DM requirements

Existing resources (iv) Data analysts’ contributions
• Academic researchers often generate and publish their own DM resources
  e.g. Harry Ganzeboom on education and occupations, http://home.fsw.vu.nl/~ganzeboom/pisa/
  Provision of whole or partial syntax programming examples
• Analysts often drive wider resource provisions related to DM
  CAMSIS project on occupational scales, www.camsis.stir.ac.uk
  CASMIN project on education and social class

Existing resources (v) Literatures on harmonisation and standardisation
• National Statistics Institutes’ principles and practices
  E.g. ONS www.statistics.gov.uk/about/data/harmonisation/
• Cross-national organisations
  E.g. UNSTATS - http://unstats.un.org/unsd/class/
• Academic studies
  E.g. [Harkness et al. 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]

References
• Blossfeld, H. P., & Rohwer, G. (2002). Techniques of Event History Modelling: New Approaches to Causal Analysis, 2nd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.
• Crouchley, R., & Fligelstone, R. (2004). The Potential for High End Computing in the Social Sciences. Lancaster: Centre for Applied Statistics, Lancaster University, and http://redress.lancs.ac.uk/document-pool/hecsspotential.pdf.
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).
• Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.
• Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.
• Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.
• Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS Users. Chicago, IL: SPSS Inc.
• Procter, M. (2001). Analysing Survey Data. In G. N. Gilbert (Ed.), Researching Social Life, Second Edition (pp. 252-268). London: Sage.
• Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave Macmillan.
• Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press.
