4th ESRC Research Methods Festival St Catherine’s College, Oxford. 5-8 July 2010. Dealing with variables: Resources and topics in enhancing secondary survey data. Paul Lambert University of Stirling DAMES research Node, www.dames.org.uk
Part of session 17 ‘Resources (i): Resources for data management’, 6 July 2010
Dealing with variables: Resources and topics in enhancing secondary survey data • ‘Rigorous and vigorous’ approaches to dealing with variables • Three specialist topics: The GESDE services for data on occupations, ethnicity and educational qualifications
‘Data management’ applied to variables refers to… • ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’[…DAMES Node..] • Usually performed by social scientists themselves • Pre-analysis tasks (though often revised/updated) • Inputs also from data providers • Usually a substantial component of the work process • But may not be explicitly rewarded (sometimes even penalised..) • a little different from archiving / controlling data itself
Some components in secondary survey research… • Manipulating data • Recoding categories / ‘operationalising’ variables • Linking data • Linking related data (e.g. longitudinal studies) • Combining / enhancing data (e.g. linking micro- and macro-data) • Secure access to data • Linking data with different levels of access permission • Full or restricted access to detailed micro-data • Harmonisation standards • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’) • Recommendations on particular ‘variable constructions’ • Cleaning data • ‘missing values’; implausible responses; extreme values
Example – recoding data [use a ‘recode’ or file matching routine]
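A minimal sketch of the ‘recode’ idea, written in Python rather than the SPSS/Stata syntax the talk has in mind. The category labels and the three-way scheme are hypothetical, chosen only for illustration; the point is that the mapping is held in one place and any category it does not cover is reported rather than silently dropped.

```python
# Hypothetical detailed education categories mapped to a 3-category scheme.
# Labels are illustrative only, not an official harmonisation.
EDUC_RECODE = {
    "degree": "high",
    "postgraduate": "high",
    "a_level": "medium",
    "gcse": "medium",
    "none": "low",
}

def recode(values, mapping, missing=None):
    """Return (recoded, unmapped): the recoded values, plus the set of
    source categories that the mapping did not cover."""
    recoded, unmapped = [], set()
    for v in values:
        if v in mapping:
            recoded.append(mapping[v])
        else:
            recoded.append(missing)   # flag as missing instead of guessing
            unmapped.add(v)
    return recoded, unmapped

raw = ["degree", "gcse", "none", "nvq3", "a_level"]
educ3, unmapped = recode(raw, EDUC_RECODE)
print(educ3)
print(sorted(unmapped))
```

Returning the unmapped categories alongside the result is a design choice: it gives a prompt to review the recode before analysis.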
..plus the centrality of keeping clear records of DM activities • Reproducible (for self) • Replicable (for all) • Paper trail for whole lifecycle • Cf. Dale 2006; Freese 2007 • In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata) • Syntax examples: www.dames.org.uk/workshops/ and www.longitudinal.stir.ac.uk
Some provocative examples for the UK… • Social mobility is increasing, not decreasing!! • Popularity of controversial findings associated with Blanden et al (2004) • Contradicted by wider ranging datasets and/or better measures of stratification position • DM: researchers ought to be able to more easily access wider data and better variables • Degrees, MScs and PhDs are getting easier • {or at least, more people are getting such qualifications} • Correlates with measures of education are changing over time • DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread • ‘Black-Caribbeans’ are not disappearing • As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants • Data collectors are under pressure to measure large groups only • DM: It ought to be possible to harmonise measures of ethnicity over time, and to build richer data resources with more cases (e.g. by merging survey data) • People interpreted the RAE wrongly! • Most responses to the RAE 2008 involved comparing GPA scores between subject areas within and/or across institutions; but standardising relative to subject area distribution, or scaling by subject area, often gives very different results. • DM: see Lambert and Gayle (2008) for a demo of alternative uses of RAE data
What might a rigorous and vigorous variable analysis look like? ..open to debate but I’d nominate: • Replicability • Features a pro-active review of variables • Review a full set of alternative measures • Review alternative functional forms • Attention to distribution/standardisation • Attention to harmonisation
How should I make my work replicable? • The concept of a ‘workflow’ is a useful device for documenting a survey research project • Workflows involve organising materials as a series of interrelated but distinctive components • In survey research, software syntax files make excellent templates for documenting our work in component elements [Long, 2009; Treiman, 2009; Altman & Franklin, 2010; Kulas, 2008] • Computer science researchers have developed workflow depositories [e.g. MyExperiment] and workflow capture tools [e.g. Taverna]
[Figure: ad hoc organisation of a workflow as a ‘master file’ in Stata] • Forthcoming workshop: ‘Documentation and workflows for social survey research’, University of Stirling, 1-2 September 2010, see www.dames.org.uk
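The ‘master file’ idea can be sketched as a short script that calls each component of the project in a fixed order and keeps a log of what ran. This is an illustration in Python, not the Stata master file shown on the slide; the step names, the toy records and the rule that negative codes mean ‘missing’ are all assumptions.

```python
def read_data(state):
    # Stand-in for opening the survey file (hypothetical records)
    state["rows"] = [
        {"id": 1, "age": 34},
        {"id": 2, "age": -9},   # -9 assumed to be a missing-value code
        {"id": 3, "age": 51},
    ]

def clean_data(state):
    # Treat negative codes as missing values
    for row in state["rows"]:
        if row["age"] < 0:
            row["age"] = None

def analyse(state):
    ages = [r["age"] for r in state["rows"] if r["age"] is not None]
    state["mean_age"] = sum(ages) / len(ages)

# The 'master file': an ordered list of named components
STEPS = [("01_read", read_data), ("02_clean", clean_data), ("03_analyse", analyse)]

def run_workflow(steps):
    """Run each step in order and return the final state and a run log."""
    state, log = {}, []
    for name, step in steps:
        step(state)
        log.append(name)
    return state, log

state, log = run_workflow(STEPS)
print(log, state["mean_age"])
```

Keeping the ordering in one list means the whole project can be re-run from raw data with a single call, which is the property that makes the work reproducible.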
How should I review variables/functional forms/distributions/harmonisations? • We tend to rely on personal expertise in particular subject domains • Expertise of the depositor of the data • Expertise of the analyst • Some textbooks and other capacity building events cover these topics generically [e.g. Treiman 2009], but by and large they are unduly neglected in methodological training …Something called ‘e-Science’ can help with both variable reviews and replication…
The ‘e-Social Science’ endeavour (see http://www.merc.ac.uk/ for up-to-date links) • A number of UK projects seeking to improve social science research by capitalising on emerging computer science techniques • Handling distributed data; collaborative technologies; large and complex data; secure data • The ‘Grid’ embodies these technologies, but more generic terms like ‘e-Social Science’ & ‘Digital Social Research’ are increasingly preferred • GESDE: ‘Grid Enabled Specialist Data Environments’
Example: Understanding New Forms of Digital Records (DReSS) http://web.mac.com/andy.crabtree/NCeSS_Digital_Records_Node/DReSS.html • transcribed talk • audio • video • digital records • system logs • location [Figure panels: video, code tree, transcript, system log. e-Social Science, BSA2009]
This session part-organised by the ‘Data Management through e-Social Science’ node • DAMES – www.dames.org.uk • ESRC Node funded 2008-2011 • Aim: Useful social science provisions by exploiting tools for data management developed in computer science. Core components are: • Data curation tool • Data fusion tool • Portals for access to data and data resources
Data curation tool collects metadata and allows data resources of different formats to be organised in an accessible depository
Data fusion tool supports merging of data files through shared variables (e.g. for recodes, aggregations, pooling data, linking related data, probabilistic linkages)
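As a rough illustration of merging through a shared variable (this is not the DAMES fusion tool itself), the Python sketch below links individual records to area-level data on a common ‘region’ key, and also aggregates the micro-data to region means before linking them back. All records, variable names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical micro-data and macro-data sharing the variable 'region'
people = [
    {"id": 1, "region": "Scotland", "income": 20},
    {"id": 2, "region": "Scotland", "income": 30},
    {"id": 3, "region": "Wales", "income": 25},
    {"id": 4, "region": "London", "income": 40},
]
region_stats = {"Scotland": {"unemp_rate": 7.0}, "Wales": {"unemp_rate": 8.0}}

def link(micro, macro, key):
    """Attach macro-level fields to each micro record via a shared key.
    Returns the linked records and the ids that found no match."""
    linked, unmatched = [], []
    for row in micro:
        extra = macro.get(row[key])
        if extra is None:
            unmatched.append(row["id"])
        linked.append({**row, **(extra or {})})
    return linked, unmatched

def group_means(micro, key, var):
    """Aggregate a variable to its mean within each level of the key."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in micro:
        totals[row[key]][0] += row[var]
        totals[row[key]][1] += 1
    return {g: {"mean_" + var: s / n} for g, (s, n) in totals.items()}

linked, unmatched = link(people, region_stats, "region")
region_means = group_means(people, "region", "income")
linked2, _ = link(people, region_means, "region")
print(unmatched)
print(linked2[0])
```

Reporting the unmatched ids is deliberate: a merge that silently loses or mislabels cases is one of the commoner data management errors.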
GEMDE – Example of a ‘portal’ for distributing and accessing supplementary data related to ethnicity
2) Special Topics: The GESDE services for sociological classifications • ‘Key variables’ in social science research are not just for sociology, but are much debated there • Complex categorical measures and ‘variable operationalisation’ recommendations/debates • Individual level measures of social positioning… • ‘GESDE’ = 3 related online services which are “Grid Enabled Specialist Data Environments” • GEODE: the ‘o’ is for data on Occupations • GEEDE: the ‘e’ is for data on Educational qualifications • GEMDE: the ‘m’ is for data on ethnic Minorities
Our contribution in GESDE.. • Many existing resources on these topics [See app.] • Academic reviews and projects • [e.g. Rose & Harrison 2010; Ganzeboom, 2008; Schneider, 2008; Guveli, 2006] • Service providers • [e.g. ESDS variable guides; CESSDA-PPP] • National Statistics Institutes’ guidelines • [e.g. www.ons.gov.uk/about-statistics/harmonisation/] • It’d be good if more people were engaging with and exploiting these resources to enhance their own data..!
At the centre of this are problems of standardizing categorical data • ‘Measurement equivalence’ (e.g. van Deth, 2003) is often not feasible for complex categorical measures • For categorical data, equivalence for comparisons is often best approached in terms of meaning equivalence, even if measurement equivalence seems possible (because of non-linear relations between categories and shifting underlying distributions) • Arithmetic standardisation offers a convenient form of meaning equivalence by indicating relative position within the structure defined by the current context • For categorical data, this can be achieved/approximated by scaling categories in one or more dimensions of difference
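One simple way to read ‘arithmetic standardisation’ is as a z-score computed within each context: z = (x - mean of context) / (standard deviation of context). The Python sketch below assumes that reading and uses hypothetical qualification scores in two birth cohorts; the scores, cohort labels and variable names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardise_within(rows, group, var):
    """Return z-scores of `var`, computed separately within each `group`."""
    by_group = defaultdict(list)
    for r in rows:
        by_group[r[group]].append(r[var])
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    out = []
    for r in rows:
        m, s = stats[r[group]]
        # A group with no variation gets 0.0 rather than a division error
        out.append((r[var] - m) / s if s > 0 else 0.0)
    return out

# Hypothetical qualification scores: higher = more advanced qualification
rows = [
    {"cohort": "1950", "qual": 1},
    {"cohort": "1950", "qual": 3},
    {"cohort": "1980", "qual": 3},
    {"cohort": "1980", "qual": 5},
]
z = standardise_within(rows, "cohort", "qual")
print(z)
```

In this toy data the same raw score of 3 sits above the average in the 1950 cohort and below it in the 1980 cohort, which is the sense in which relative position carries the meaning.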
‘Effect proportional scaling’ using parents’ occupational advantage
What was that then? • We can represent categories through positions on a scale • In turn, we can use position in the dimension as a category score which then plugs into a further analysis (e.g. regression main and interaction effects) ..E.g. some options for data on ethnicity.. • Stereotyped Ordered Logistic Regression (SOR) models, summarize dimensions of difference according to regression predictor values [e.g. Lambert and Penn, 2001] • Geometric data analysis for distances between people, or things [cf. Prandy, 1979; Bennett et al., 2009] • Assign category scores by hand (a priori or by selected average)
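The ‘selected average’ option can be sketched as follows: each category is given a score equal to the mean of a chosen criterion variable among its members, and that score then stands in for the category in later models. This is a minimal Python illustration assuming parents' occupational advantage as the criterion; the group labels and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def category_scores(rows, category, criterion):
    """Score each category by the mean of the criterion among its members."""
    by_cat = defaultdict(list)
    for r in rows:
        by_cat[r[category]].append(r[criterion])
    return {c: mean(v) for c, v in by_cat.items()}

# Hypothetical records: ethnic group label and parental occupational score
rows = [
    {"group": "A", "parent_occ": 40},
    {"group": "A", "parent_occ": 50},
    {"group": "B", "parent_occ": 60},
    {"group": "B", "parent_occ": 70},
    {"group": "C", "parent_occ": 30},
]
scores = category_scores(rows, "group", "parent_occ")

# Replace the categorical label with its score, ready for use as a
# single metric covariate (e.g. in regression main or interaction effects)
scaled = [scores[r["group"]] for r in rows]
print(scores)
print(scaled)
```

A single scaled covariate uses one degree of freedom where a full set of category dummies would use many, which can help when categories are sparse.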
2(a) Data on occupations • Occupational unit groups = standardised lists of occupational titles • E.g. via CASCOT, www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/
..data on occupations.. • find ways of attaching summary information about occupations to occupational unit groups
Comparability problems => value of documenting methods & comparing alternatives
GEODE: Our contribution • GEODE acts as a library style service for access to ‘occupational information resources’ • We encourage people to supply data they’ve produced, and we upload data ourselves • Researchers are encouraged to use the portal to find and exploit suitable data • Services: search, browse, deposit data, link data, user ratings
Using occupational data: Example as a measure of marked social disadvantage Lambert & Gayle (2009) Survey Network 4 June 2009
2(b) Data on educational qualifications • Similar issues arise with the use of educational data • Specialist resources exist which can enhance measures of educational data • Many users aren’t aware of alternative coding schemes or harmonised approaches • GEEDE acts as a service for bringing together and disseminating relevant data resources on educational measures
2(c) Data on ethnicity • We can conceive of similar information resources and data analysis requirements for measures of ethnicity • There are generally fewer published resources / agreed standards in this domain • GEMDE publishes resources but puts more emphasis on understanding complex ethnicity data
…working with ethnicity data in surveys is hard…! - It’s sparse - It’s collinear (e.g. to age, location) - It’s dynamic (cf. comparative research)
EFFNATIS sample (1999): Subjective ethnic identity [Heckmann et al., 2001]
A ‘data management’ contribution • Preserve information on what was done with categorical data • Communicate information on what should/could be done
GEMDE seeks to promote replicability / transparency… • Document your own recodes • Access somebody else’s recodes • Identify commonly used recodes (& use them..!)
..and making complex analysis of ethnicity data easier.. • Organising complex categorical data • Labelling, recoding, etc • Effect proportional scaling • Standardisation • Interaction terms
The GEODE model for GEMDE? • ….A service for MUGs and MIRs… • Define/register ‘Minority Unit Groups’ • Define/register ‘Minority Information Resources’ • Explore data resources and obtain help in approaching analysis of complex, sparse data
What's a MIR? • 'Minority Information Resource'. • This is our own terminology. By a MIR, we mean any piece of information which supplies systematic data on a minority unit group (MUG) classification. We've used this term to be deliberately similar to the phrase 'Occupational Information Resources' that we used on GEODE • E.g. summary statistical data about the categories, and documentation or other information • E.g. recodings which have been used in a particular study • Social scientists are not in general aware of the existence of MIRs (cf. the wide use of popular Occupational Information Resources). In GEMDE we seek to publicise little-known resources and promote their uptake: we argue that better communication and dissemination of MIRs is in fact an important step towards better scientific practice of replication and standardisation of research. • In our terms, every MIR necessarily links to a MUG (but not every MUG has a MIR).
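The MUG/MIR relationship can be pictured as a small data structure. The sketch below is purely hypothetical and says nothing about how the GEMDE portal is actually implemented: the class names, fields and example classification are assumptions. It only enforces the rule stated above, that a MIR must point at a registered MUG while a MUG may exist without any MIR.

```python
from dataclasses import dataclass, field

@dataclass
class MUG:
    """A 'Minority Unit Group' classification: a named list of categories."""
    name: str
    categories: list

@dataclass
class MIR:
    """A 'Minority Information Resource', e.g. a recode used in a study."""
    name: str
    mug: str                                   # name of the MUG it describes
    recode: dict = field(default_factory=dict)

class Registry:
    def __init__(self):
        self.mugs, self.mirs = {}, []

    def add_mug(self, mug):
        self.mugs[mug.name] = mug

    def add_mir(self, mir):
        # Every MIR must link to a registered MUG
        if mir.mug not in self.mugs:
            raise ValueError(f"unknown MUG: {mir.mug}")
        # A recode may only refer to categories of that MUG
        unknown = set(mir.recode) - set(self.mugs[mir.mug].categories)
        if unknown:
            raise ValueError(f"categories not in MUG: {sorted(unknown)}")
        self.mirs.append(mir)

    def mirs_for(self, mug_name):
        return [m for m in self.mirs if m.mug == mug_name]

reg = Registry()
reg.add_mug(MUG("example_5cat", ["a", "b", "c", "d", "e"]))   # hypothetical
reg.add_mug(MUG("example_no_resources", ["x", "y"]))          # MUG with no MIR
reg.add_mir(MIR("study_recode", "example_5cat", {"a": 1, "b": 1, "c": 2}))
print([m.name for m in reg.mirs_for("example_5cat")])
```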
The GEMDE portal: ‘Liferay portal’ with access to MUGs and MIRs, first release Jan 2010, now available for general use (www.dames.org.uk/gemde) • Shibboleth access for registered users • Guest level access • Deposit MUGs/MIRs • Search/browse deposited resources • Feedback on resources (user ratings) • Review live data (e.g. pooled LFS records) • Expert and user quality ratings
Summary: Remind me how these topics enhance survey data..? • Variable operationalisations can ordinarily be improved by more ‘rigour and vigour’ • More transparent operationalisation/documentation • Better use of detailed data • Better ability to include measures in suitably complex models/analysis • The GESDE approach has been to seek technological solutions to the organisation and distribution of complex variable-related information
Data used • Department for Education and Employment. (1997). Family and Working Lives Survey, 1994-1995 [computer file]. Colchester, Essex: UK Data Archive [distributor], SN: 3704. • Heckmann, F., Penn, R. D., & Schnapper, D. (Eds.). (2001). Effectiveness of National Integration Strategies Towards Second Generation Migrant Youth in a Comparative Perspective - EFFNATIS. Bamberg: European Forum for Migration Studies, University of Bamberg. • Li, Y., & Heath, A. F. (2008). Socio-Economic Position and Political Support of Black and Ethnic Minority Groups in the United Kingdom, 1972-2005 [computer file]. 2nd Edition. Colchester, Essex: UK Data Archive [distributor], SN: 5666. • Office for National Statistics. Social and Vital Statistics Division and Northern Ireland Statistics and Research Agency. Central Survey Unit, Quarterly Labour Force Survey, January - March, 2008 [computer file]. 4th Edition. Colchester, Essex: UK Data Archive [distributor], March 2010. SN: 5851. • University of Essex, & Institute for Social and Economic Research. (2009). British Household Panel Survey: Waves 1-17, 1991-2008 [computer file], 5th Edition. Colchester, Essex: UK Data Archive [distributor], March 2009, SN 5151.
References • Altman, M., & Franklin, C. H. (2010). Managing Social Science Research Data. London: Chapman and Hall. • Bennett, T., Savage, M., Silva, E. B., Warde, A., Gayo-Cal, M., Wright, D., et al. (2009). Culture, Class, Distinction. London: Routledge. • Blanden, J., Goodman, A., Gregg, P., & Machin, S. (2004). Changes in generational mobility in Britain. In M. Corak (Ed.), Generational Income Mobility in North America and Europe (pp. 147-189). Cambridge: Cambridge University Press. • Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158. • Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-171. • Ganzeboom, H. B. G. (2008). Tools for deriving status measures from ISCO-88 and ISCO-68. Retrieved 1 March, 2008, from http://home.fsw.vu.nl/~ganzeboom/PISA/ • Guveli, A. (2006). New Social Classes within the Service Class in the Netherlands and Britain: Adjusting the EGP class schema for the technocrats and the social and cultural specialists. Nijmegen: Radboud U. Nijmegen. • Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. NY: Wiley. • Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers. • Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage. • Kulas, J. T. (2008). SPSS Essentials: Managing and Analyzing Social Sciences Data. New York: Jossey Bass. • Lambert, P. S., & Gayle, V. (2009). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. Stirling: University of Stirling, Technical paper 2008-3 of the Data Management through e-Social Science research Node (www.dames.org.uk).
• Lambert, P. S., & Gayle, V. (2009). 'Escape from Poverty' and Occupations. Colchester, Essex: BHPS Research Conference, 9-11 July 2009, and www.iser.essex.ac.uk/events/conferences/bhps-2009-conference/overview • Lambert, P. S., & Penn, R. D. (2001). SOR models and Ethnicity data in LIS and LES: Country by Country Report. Syracuse University, Syracuse, New York 13244-1020: Luxembourg Income Study Paper No. 260. • Levesque, R., & SPSS Inc. (2010). Programming and Data Management for IBM SPSS Statistics 18: A Guide for PASW Statistics and SAS users. Chicago: SPSS Inc. • Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press. • Penn, R. D., & Lambert, P. S. (2009). Children of International Migrants in Europe: Comparative Perspectives. Basingstoke: Palgrave. • Prandy, K. (1979). Ethnic discrimination in employment and housing. Ethnic and Racial Studies, 2(1), 66-79. • Schneider, S. L. (2008). The International Standard Classification of Education (ISCED-97). An Evaluation of Content and Criterion Validity for 15 European Countries. Mannheim: MZES. • Simpson, L., & Akinwale, B. (2006). Quantifying Stability and Change in Ethnic Group. Manchester: University of Manchester, CCSR Working Paper 2006-05. • Rose, D., & Harrison, E. (Eds.). (2010). Social Class in Europe: An Introduction to the European Socio-economic Classification. London: Routledge. • Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass. • van Deth, J. W. (2003). Using Published Survey Data. In J. A. Harkness et al. (2003) (pp. 329-346).
Appendix Existing resources – sources and types of support for data management in the social sciences: