
Linking the DAMES & e-Stat Nodes

This article discusses the ‘Data Management through e-Social Science’ research node and its aim to provide useful social science provisions through specialist data topics. It explores the tasks associated with linking, coding, and accessing data resources within the analysis process. The article also highlights the importance of reproducibility and keeping clear records of data management activities.


Presentation Transcript


  1. Linking the DAMES & e-Stat Nodes • Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting • DAMES is the ‘Data Management through e-Social Science’ research Node, www.dames.org.uk

  2. 1) Data Management through e-Social Science • DAMES – www.dames.org.uk • ESRC Node funded 2008-2011 • Aim: Useful social science provisions • Specialist data topics – occupations; education qualifications; ethnicity; social care; health • Mainstream packages and accessible resources • Aim: To exploit/engage with existing DM resources • In social science – e.g. ESDS, CESSDA • In e-Science – e.g. OGSA-DAI; OMII

  3. To us ‘Data management’ means… • ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..] • Usually performed by social scientists themselves • Pre-analysis tasks (though often revised/updated) • Inputs also from data providers • Usually a substantial component of the work process • But may not be explicitly rewarded (and sometimes penalised) • Distinct from archiving / controlling the data itself

  4. Some components… • Manipulating data • Recoding categories / ‘operationalising’ variables • Linking data • Linking related data (e.g. longitudinal studies) • combining / enhancing data (e.g. linking micro- and macro-data) • Secure access to data • Linking data with different levels of access permission • Detailed access to micro-data cf. access restrictions • Harmonisation standards • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’) • Recommendations on particular ‘variable constructions’ • Cleaning data • ‘missing values’; implausible responses; extreme values

  5. Example – recoding data
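As an illustration of the kind of recoding this slide refers to, here is a minimal Python sketch. The category codes, labels and missing-value codes are all hypothetical and are not taken from any DAMES resource or survey codebook.

```python
# Hypothetical recode of a detailed qualification variable into three
# broad categories; codes and labels are invented for illustration.
RECODE_MAP = {
    1: "degree",  # higher degree
    2: "degree",  # first degree
    3: "school",  # upper secondary
    4: "school",  # lower secondary
    5: "none",    # no qualifications
}
MISSING_CODES = {-9, -8, -1}  # assumed survey missing-value codes


def recode(value):
    """Return the broad category, or None for missing or unknown codes."""
    if value in MISSING_CODES:
        return None
    return RECODE_MAP.get(value)  # codes absent from the map become None


raw = [1, 3, 5, -9, 2, 4, 7]
recoded = [recode(v) for v in raw]
print(recoded)
```

Keeping the mapping in one named table, rather than scattered through conditional statements, makes the recode easier to document and to re-run.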

  6. Example – Linking data • Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk

  7. Matching files (‘deterministic’) Complex data (complex research) is distributed across different files. In surveys, use key linking variables for... • One-to-one matching SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid. Stata: merge pid using file2.dta • One-to-many matching (‘table distribution’) SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid. Stata: merge pid using file2.dta • Many-to-one matching (‘aggregation’) SPSS: aggregate outfile="file3.sav" /meaninc=mean(income) /break=pid. Stata: collapse (mean) meaninc=income, by(pid) • Many-to-many matches • Related cases matching
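The first three matching types above can be sketched in plain Python. This is an illustration under stated assumptions, not a translation of the SPSS or Stata commands: the records, variable names other than `pid`, `income` and `meaninc`, and the helper functions are hypothetical, and the one-to-one helper keeps only cases found in both files.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; 'pid' is the person identifier used as the key.
file1 = [{"pid": 1, "age": 34}, {"pid": 2, "age": 51}]
file2 = [{"pid": 1, "sex": "F"}, {"pid": 2, "sex": "M"}]


def index_unique(rows, key):
    """Index rows by key, refusing duplicate keys (a deterministic link)."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate key {row[key]!r}")
        index[row[key]] = row
    return index


def match_one_to_one(left, right, key):
    """Keep cases present in both files, combining their variables."""
    index_unique(left, key)  # check that the left file has unique keys too
    lookup = index_unique(right, key)
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def match_table(cases, table, key):
    """Distribute one lookup row to every case sharing its key."""
    lookup = index_unique(table, key)
    return [{**row, **lookup.get(row[key], {})} for row in cases]


def aggregate_mean(rows, key, var, new_name):
    """Collapse many rows per key into one row holding the mean of var."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[var])
    return [{key: k, new_name: mean(v)} for k, v in groups.items()]


# Several income records per person, as in a long-format panel file.
spells = [
    {"pid": 1, "income": 100},
    {"pid": 1, "income": 300},
    {"pid": 2, "income": 250},
]

one_to_one = match_one_to_one(file1, file2, "pid")
one_to_many = match_table(spells, file2, "pid")
many_to_one = aggregate_mean(spells, "pid", "income", "meaninc")
```

Refusing duplicate keys in the lookup file is a deliberate choice: a silent many-to-many match is a common source of error when linking survey files.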

  8. A bit of focus… • I tend to emphasise two data management activities: • Variable constructions • Coding and re-coding values • Linking datasets • Internal and external linkages

  9. ..plus the centrality of keeping clear records of DM activities • Reproducible (for self) • Replicable (for all) • Paper trail for whole lifecycle • Cf. Dale 2006; Freese 2007 • In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata) • Syntax examples: www.longitudinal.stir.ac.uk

  10. Principal DAMES services (current status) • GESDE specialist data environments (prototypes) Occupations, educational qualifications, ethnicity • Data curation tool (prototype) • Data fusion tool (prototype) • Secure data demonstrator for e-Health research (complete) • Micro-simulation model for social care data (prototype) • Training workshops and events (in progress)

  11. GESDE – Grid Enabled Specialist Data Environments

  12. GEODE – Occupational data

  13. Data curation tool The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

  14. Data fusion tool

  15. 2. Linking DAMES and e-Stat High level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities • Using/adapting DAMES contributions • DAMES services for data linking • DAMES resources for recoding variables • Making replication central to the data story

  16. Data and variables • DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data) • Anything on educational qualifications, occupations, ethnicity is of particular interest • Generic tools for merging micro-data • Generic tools for other variable processes

  17. Data oriented review • Applied research perspective • Range of data resources • Accessing and documenting data resource options

  18. The implementation for e-Stat • This is mostly a blank space… • …and we’ve not hitherto used Python • Data curation tool and GEODE/GEEDE use iRODS • GEMDE uses a bespoke SQL database • Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal

  19. 3. A pitch for specific e-Stat facilities ..harvest the best of data analysis packages from applied data perspective • Replication in ‘human readable syntax’ • Something like Stata’s ‘est store’ for multiple model comparisons • Fluency in data oriented options • Training resources in data
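To make the ‘est store’ request concrete, the sketch below imitates the idea behind Stata’s `estimates store` and `estimates table` in plain Python: fit several models, keep each set of estimates under a name, then print them side by side. It is not an e-Stat or DAMES facility; the data, model names and the helpers `fit_line`, `est_store` and `est_table` are hypothetical, and the models are simple one-predictor least-squares fits.

```python
from statistics import mean

# Hypothetical data: income and two candidate predictors.
income = [20.0, 24.0, 31.0, 35.0, 42.0]
age = [25, 30, 35, 40, 45]
years_education = [10, 12, 12, 16, 18]

stored = {}  # named estimates, playing the role of stored results


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def est_store(name, x, y):
    """Fit the model and keep its estimates under the given name."""
    intercept, slope = fit_line(x, y)
    stored[name] = {"intercept": intercept, "slope": slope, "n": len(y)}


def est_table(names):
    """Return a side-by-side text table of the stored estimates."""
    lines = [f"{'':<10}" + "".join(f"{n:>12}" for n in names)]
    for stat in ("intercept", "slope", "n"):
        cells = "".join(f"{stored[n][stat]:>12.3f}" for n in names)
        lines.append(f"{stat:<10}" + cells)
    return "\n".join(lines)


est_store("m_age", age, income)
est_store("m_educ", years_education, income)
print(est_table(["m_age", "m_educ"]))
```

Because every model is stored by name, the comparison table can be regenerated from the script at any time, which fits the slide's emphasis on replication in human-readable syntax.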

  20. Est store demo here

  21. Appendix items

  22. [Diagram: ‘Model1’ analytical file • Data sources: BHPS wave A individuals; BHPS wave B individuals; Wave C • Variables: Spouse CAMSIS; Spouse SOC; Current job RGSC; Gender; Age (yrs); Age bands • Interfaces: Graphics; Text interface • Invoked manually or in response to manipulating graphs]

  23. ‘The significance of data management for social survey research’ (see http://www.esds.ac.uk/news/eventdetail.asp?id=2151) • The data manipulations described above are a major component of the social survey research workload • Pre-release manipulations performed by distributors / archivists • Coding measures into standard categories • Dealing with missing records • Post-release manipulations performed by researchers • Re-coding measures into simple categories • We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently • So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

  24. Some provocative examples for the UK… • Social mobility is increasing, not decreasing! • Popularity of controversial findings associated with Blanden et al (2004) • Contradicted by wider ranging datasets and/or better measures of stratification position • DM: researchers ought to be able to more easily access wider data and better variables • Degrees, MScs and PhDs are getting easier! • {or at least, more people are getting such qualifications} • Correlates with measures of education are changing over time • DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread • ‘Black-Caribbeans’ are not disappearing! • As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants • Data collectors under pressure to measure large groups only • DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
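One possible way to standardise the relative value of a qualification within a cohort is a ridit-style score: the share of the cohort holding a lower qualification plus half the share holding the same one. The Python sketch below illustrates that idea only; it is not presented as the DAMES method, and the cohorts, levels and counts are invented.

```python
from collections import Counter

# Hypothetical data: (birth cohort, qualification level), where a higher
# level means a higher qualification. Values are invented for illustration.
people = [
    ("1940s", 0), ("1940s", 0), ("1940s", 0), ("1940s", 2),
    ("1970s", 0), ("1970s", 2), ("1970s", 2), ("1970s", 2),
]


def cohort_scores(records):
    """Map (cohort, level) to a relative position between 0 and 1.

    The score is the share of the cohort below the level plus half the
    share at the level, so the same qualification scores lower in a
    cohort where it is more common.
    """
    by_cohort = {}
    for cohort, level in records:
        by_cohort.setdefault(cohort, Counter())[level] += 1
    scores = {}
    for cohort, counts in by_cohort.items():
        total = sum(counts.values())
        below = 0
        for level in sorted(counts):
            scores[(cohort, level)] = (below + counts[level] / 2) / total
            below += counts[level]
    return scores


scores = cohort_scores(people)
print(scores[("1940s", 2)], scores[("1970s", 2)])
```

In this invented example the level-2 qualification places its holder higher in the 1940s cohort, where it is rare, than in the 1970s cohort, where it is common. The same grouping could be extended to gender or age bands by widening the cohort key.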

  25. Comment – growing interest in data management..? • Historically, references covering DM were few and far between • Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd. • Recently, there’s been a small burst of relevant references • Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics 17.0. Chicago, IL: SPSS Inc. • Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press. • Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass. • http://www.esds.ac.uk/support/onlineguides.asp • http://www.longitudinal.stir.ac.uk/ • ..and growing interest re. ‘documentation for replication’ • Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158. • Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).

  26. E-Science and Data Management E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM… • Concern with standards setting in communication and enhancement of data • Linking distributed/heterogeneous/dynamic data Coordinating disparate resources; interrogating live resources • Contribution of metadata tools/standards for variable harmonisation and standardisation • Linking data subject to different security levels • The workflow nature of many DM tasks
