Dr Paul Lambert and Professor Vernon Gayle University of Stirling

Key Variables: Social Science Measurement and Functional Form
Presentation to: ‘Interpreting results from statistical modelling – A seminar for Scottish Government Social Researchers’, Edinburgh, 1 April 2009
Dr Paul Lambert and Professor Vernon Gayle, University of Stirling


‘Betas in Society’ and ‘Demystifying Coefficients’
• Dorling, D., & Simpson, S. (Eds.). (1999). Statistics in Society: The Arithmetic of Politics. London: Arnold.
• Irvine, J., Miles, I., & Evans, J. (Eds.). (1979). Demystifying Social Statistics. London: Pluto Press.
• Famous works on the critical interpretation of social statistics tend to have a univariate / bivariate focus
  • Measuring unemployment; averaging income; bivariate significance tests; correlation vs. causation
• But social survey analysts usually argue that complex multivariate analyses are more appropriate
  • Critical interpretation of joint relative effects
  • Attention to effects of ‘key variables’ in multivariate analysis

• “A program like SPSS .. has two main components: the statistical routines, .. and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” [Procter, 2001: 253]
• “Socio-economic processes require comprehensive approaches as they are very complex (‘everything depends on everything else’). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available.” [Crouchley and Fligelstone 2004]

Large scale survey data: 2 technological themes
• We’re data rich (but analyst poor)
  • Plenty of variables (a thousand is common)
  • Plenty of cases
• We work overwhelmingly through individual analysts’ micro-computing
  • Impact of mainstream software
  • Pressure for simple / accessible / popular analytical techniques (whatever happened to loglinear models?)
  • Propensity for simple ‘data management’
  • Specialist development of very complex analytical packages for very simple sets of variables

Survey research: Access, manipulate & analyse patterns in variables (‘variable by case matrix’)

Critical role of syntactical records in working with data & variables
• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007
• In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax examples: www.longitudinal.stir.ac.uk

Stata syntax example (‘do file’)

Some comments on survey analysis software for analysing variables
• Data management and data analysis must be seen as integrated processes
• Stata is the most effective software, as it combines advanced data management and data analysis functionality and makes good documentation easy
• For an extended example of using Stata, concentrating on variable operationalisations and standardisations:
  • Lambert, P. S., & Gayle, V. (2009). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. Stirling: University of Stirling, Technical paper 2008-3 of the Data Management through e-Social Science research Node (www.dames.org.uk)
  • E.g. “do http://www.dames.org.uk/rae2008/uoa0108recode.do”
  • E.g. “use http://www.dames.org.uk/rae2008/rae2008_3.dta, clear”

Working with variables and understanding ‘variable constructions’
• Meaning?
  • Coding frames; re-coding decisions; metric transformations and functional forms; relative effects in multivariate models
  • Data collection and data analysis
  • Cf. www.longitudinal.stir.ac.uk/variables/
• ‘Variable constructions’: the processes by which survey measures are defined and subsequently interpreted by research analysts

β’s - Where’s the action?
• If we have lots of variables, lots of cases, yet often quite simple techniques and software, the action is primarily in the variable constructions…
  • The example of social mobility research – see Lambert et al. (2007)
• How we choose between alternative measures
• How much data management we try (or bother with)
Plus other issues in how we analyse & interpret the coefficients from the models we use (..elsewhere today..)

i) Choosing measures
See (2) below:
• A sensible starting point is with ‘key variables’
• Approaches to standardisation / harmonisation
• {Lack of} awareness of existing resources
See (3) below:
• Influence of functional form

ii) Data management – e.g. recoding data
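The recoding step can be sketched in a few lines. This is a minimal illustration in Python rather than the syntax shown in the presentation; the variable names (`educ`, `educ3`), the codes and the category labels are all hypothetical. The derived variable is added alongside the source variable, so the recoding decision stays visible and can be revisited.

```python
# Hypothetical coding frame: detailed source codes -> collapsed categories.
# Codes and labels are invented for illustration only.
EDUCATION_RECODE = {
    1: "degree",
    2: "degree",
    3: "school-level",
    4: "school-level",
    5: "none",
}


def recode(records, source, target, mapping, missing=None):
    """Return copies of the records with a derived variable added.

    The source variable is kept; codes absent from the mapping are set
    to `missing` rather than silently dropped.
    """
    recoded = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the input data are not modified
        new_rec[target] = mapping.get(rec.get(source), missing)
        recoded.append(new_rec)
    return recoded


cases = [{"id": 1, "educ": 1}, {"id": 2, "educ": 4}, {"id": 3, "educ": 9}]
recoded_cases = recode(cases, "educ", "educ3", EDUCATION_RECODE)
for row in recoded_cases:
    print(row)
```

Keeping the mapping in a named dictionary plays the same role as an annotated syntax file: the coding frame is documented in one place and can be shared or replaced.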

ii) Data management – e.g. Missing data / case selection
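A hedged sketch of case selection on missing data follows. The missing-value codes (-9, -8, -1) and the variable names are assumptions made for the example; real datasets document their own codes. The function reports how many cases are missing on each variable, so the consequences of the selection can be recorded.

```python
# Hypothetical missing-value codes; check the dataset documentation for real ones.
MISSING_CODES = {-9, -8, -1}


def select_complete_cases(records, variables):
    """Keep cases with valid values on every listed variable.

    Returns the retained cases plus, for each variable, the number of
    cases missing on it (a case can count under several variables).
    """
    kept = []
    missing_counts = {v: 0 for v in variables}
    for rec in records:
        invalid = [
            v for v in variables
            if rec.get(v) is None or rec.get(v) in MISSING_CODES
        ]
        for v in invalid:
            missing_counts[v] += 1
        if not invalid:
            kept.append(rec)
    return kept, missing_counts


cases = [
    {"id": 1, "income": 21000, "age": 34},
    {"id": 2, "income": -9, "age": 51},
    {"id": 3, "income": 15500, "age": -1},
    {"id": 4, "income": 30250, "age": 42},
]
analysis_sample, missing_counts = select_complete_cases(cases, ["income", "age"])
print(len(analysis_sample), "cases kept;", missing_counts)
```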

ii) Data management – e.g. Linking data
Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
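Linking of this kind amounts to a keyed lookup. The sketch below, in Python, attaches a score from a small occupational lookup table to survey records using the occupation code as the key. Only the variable name `ojbsoc00` comes from the slide; the codes, the field name `scale_score` and the score values are invented placeholders, not real CAMSIS values.

```python
# Hypothetical occupational lookup table keyed by occupational unit group.
# Scores are placeholders for illustration, not published CAMSIS scores.
OCCUPATION_SCORES = {
    1121: {"scale_score": 60.0},
    2315: {"scale_score": 70.0},
    9233: {"scale_score": 30.0},
}


def link_by_code(records, key, resource, fields):
    """Attach `fields` from `resource` to each record, matching on `key`.

    Unmatched codes are returned separately so they can be inspected
    instead of disappearing from the analysis.
    """
    linked, unmatched = [], []
    for rec in records:
        match = resource.get(rec[key])
        if match is None:
            unmatched.append(rec[key])
        row = dict(rec)
        for field in fields:
            row[field] = match[field] if match is not None else None
        linked.append(row)
    return linked, unmatched


survey = [
    {"pid": 1, "ojbsoc00": 2315},
    {"pid": 2, "ojbsoc00": 9233},
    {"pid": 3, "ojbsoc00": 9999},  # code absent from the lookup table
]
linked_survey, unmatched_codes = link_by_code(
    survey, "ojbsoc00", OCCUPATION_SCORES, ["scale_score"]
)
print(linked_survey)
print("unmatched codes:", unmatched_codes)
```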

Aspects of data management…
• Manipulating data
  • Recoding categories / ‘operationalising’ variables
• Linking data
  • Linking related data (e.g. longitudinal studies)
  • Combining / enhancing data (e.g. linking micro- and macro-data)
• Secure access to data
  • Linking data with different levels of access permission
  • Detailed access to micro-data cf. access restrictions
• Harmonisation standards
  • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
  • Recommendations on particular ‘variable constructions’
• Cleaning data
  • ‘missing values’; implausible responses; extreme values

‘The significance of data management for social survey research’
See http://www.esds.ac.uk/news/eventdetail.asp?id=2151 and www.dames.org.uk
• The data manipulations described above are a major component of the social survey research workload
  • Pre-release manipulations performed by distributors / archivists
    • Coding measures into standard categories
    • Dealing with missing records
  • Post-release manipulations performed by researchers
    • Re-coding measures into simple categories
• We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently
• So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

Data Management through e-Social Science (DAMES – www.dames.org.uk)
• Supporting operations on data widely performed by social science researchers
  • Matching data files together
  • ‘Cleaning’ data
  • Operationalising variables
  • Specialist data resources (occupations; education; ethnicity)
• Why is e-Social Science relevant?
  • Dealing with distributed, heterogeneous datasets
  • Generic data requirements / provisions
  • Lack of previous systematic standards (e.g. metadata; security; citation procedures; resources to review/obtain suitable data)

Working with variables – further issues
• Re-inventing the wheel
  • …In survey data analysis, somebody else has already struggled through the variable constructions you are working on right now…
  • Increasing attention to documentation and replicability [cf. Dale 2006; Freese 2007]
• Guidance and support
  • In the UK, use www.esds.ac.uk
  • Most guidance concerns collecting & harmonising data
  • Less is directed to analytically exploiting measures

Key variables and social science measurement
Defining ‘key variables’
• Commonly used concepts with numerous previous examples
• Methodological research on best practice / best measurement [cf. Stacey 1969; Burgess 1986]
ONS harmonisation ‘primary standards’: http://www.statistics.gov.uk/about/data/harmonisation/primary_standards.asp

Key variables: concepts and measures

Key variables – Standardisation
• Much attention to key variables involves proposing optimum / standard measures
  • UK – ONS Harmonisation
  • EU – Eurostat standards
  • Studies of ‘criterion’ and ‘construct’ validity
• Standardisation impacts other analyses
  • Affects available data
  • Affects popular interpretations of data

“a method for equating conceptually similar but operationally different variables..” [Harkness et al 2003, p352]
• Input harmonisation [esp. Harkness et al 2003]: ‘harmonising measurement instruments’ [HZ and Wolf 2003, p394]
  • unlikely / impossible in longer-term longitudinal studies
  • common in small cross-national and short-term longitudinal studies
• Output harmonisation (‘ex-post harmonisation’): ‘harmonising measurement products’ [HZ and Wolf 2003, p394]

More on harmonisation [esp. HZ and Wolf 2003, p393ff]
• Numerous practical resources to help with input and output harmonisation
  • [e.g. ONS www.statistics.gov.uk/about/data/harmonisation ; UN / EU / NSIs; LIS project www.lisproject.org; IPUMS www.ipums.org]
  • [Cross-national e.g.: HZ & Wolf 2003; Jowell et al. 2007]
• Room for more work in justifying / understanding interpretations after harmonisation

“the degree to which survey measures or questions are able to assess identical phenomena across two or more cultures” [Harkness et al 2003, p351]

“Equivalence is the only meaningful criterion if data is to be compared from one context to another. However, equivalence of measures does not necessarily mean that the measurement instruments used in different countries are all the same. Instead it is essential that they measure the same dimension. Thus, functional equivalence is more precisely what is required” [HZ and Wolf 2003, p389]

Harmonisation & equivalence combined
• ‘Universality’ or ‘specificity’ in variable constructions
  • Universality: collect harmonised measures, analyse standardised schemes
  • Specificity: collect localised measures, analyse functionally equivalent schemes
• Most prescriptions aim for universality
• But specificity is theoretically better
• Specificity is more easily obtained than is often realised
  • Especially for well-known ‘key variables’

Working with key variables – speculation
a) Data manipulation skills and inertia
• I would speculate that around 80% of applications using key variables don’t consult the literature and evaluate alternative measures, but choose the first convenient and/or accessible variable in the dataset
• Data supply decisions (‘what is on the archive version’) are critical
• Much of the explanation lies with lack of confidence in data manipulation / linking data
• Too many under-used resources – cf. www.esds.ac.uk

Working with key variables – speculation
b) Endogeneity and key variables
• ‘everything depends on everything else’ [Crouchley and Fligelstone 2004]
• We know a lot about simple properties of key variables
  • Key variables often change the main effects of other variables
  • Simple decisions about contrast categories can influence interpretations
  • Interaction terms are often significant and influential
• We have only scratched the surface of understanding key variables in multivariate context and interpretation
  • Key variables are often endogenous (because they are ‘key’!)
  • Work on standards / techniques for multi-process systems and/or comparing structural breaks involving key variables is attractive
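The point about contrast categories can be illustrated with a small sketch. In a linear regression whose only predictor is a set of dummy variables, each dummy coefficient equals that category's mean minus the reference category's mean, so the same data yield different-looking coefficients under different reference categories. The category labels and outcome values below are hypothetical.

```python
from statistics import mean


def dummy_coefficients(outcomes_by_category, reference):
    """Coefficients of a dummy-coded one-factor linear model.

    Each coefficient is the category mean minus the reference category
    mean; the reference category itself has no coefficient.
    """
    means = {cat: mean(values) for cat, values in outcomes_by_category.items()}
    base = means[reference]
    return {cat: m - base for cat, m in means.items() if cat != reference}


# Hypothetical outcome values grouped by a three-category key variable.
outcome = {
    "professional": [30.0, 34.0],
    "intermediate": [24.0, 26.0],
    "routine": [18.0, 20.0],
}

vs_routine = dummy_coefficients(outcome, reference="routine")
vs_professional = dummy_coefficients(outcome, reference="professional")
print(vs_routine)        # positive contrasts against 'routine'
print(vs_professional)   # negative contrasts against 'professional'
```

The fitted values are identical in both versions; only the presentation of the effects changes, which is why the choice of contrast category deserves to be reported.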

An example: Occupations
• In the social sciences, occupation is seen as one of the most important things to know about a person
  • Direct indicator of economic circumstances
  • Proxy indicator of ‘social class’ or ‘stratification’
• Projects at Stirling (www.dames.org.uk)
  • GEODE – how social scientists use data on occupations
  • DAMES – extending GEODE resources

Stage 1 - Collecting Occupational Data (and making a mess)

www.geode.stir.ac.uk/ougs.html

Occupations: we agree on what we should do:
• Preserve two levels of data
  • Source data: Occupational unit groups, employment status
  • Social classifications and other outputs
• Use transparent (published) methods [i.e. Occupational Information Resources, OIRs]
  • for classifying index units
  • for translating index units into social classifications
For instance:
• Bechhofer, F. 1969. 'Occupations' in Stacey, M. (ed.) Comparability in Social Research. London: Heinemann.
• Jacoby, A. 1986. 'The Measurement of Social Class' Proceedings from the Social Research Association seminar on "Measuring Employment Status and Social Class". London: Social Research Association.
• Lambert, P.S. 2002. 'Handling Occupational Information'. Building Research Capacity 4: 9-12.
• Rose, D. and Pevalin, D.J. 2003. 'A Researcher's Guide to the National Statistics Socio-economic Classification'. London: Sage.

…in practice we don’t keep to this...
Inconsistent preservation of source data
• Alternative OUG schemes
  • SOC-90; SOC-2000; ISCO; SOC-90 (my special version)
• Inconsistencies in other index factors
  • ‘employment status’; supervisory status; number of employees
  • Individual or household; current job or career
Inconsistent exploitation of Occupational Information
• Numerous alternative occupational information files
  • (time; country; format)
• Substantive choices over social classifications
• Inconsistent translations to social classifications – ‘by file or by fiat’
• Dynamic updates to occupational information resources
• Strict security constraints on users’ micro-social survey data
• Low uptake of existing occupational information resources

GEODE provides services to help social scientists deal with occupational information resources
• Disseminate, and access other, Occupational Information Resources (OIRs)
• Link together their (secure) micro-data with OIRs

Occupational information resources: small electronic files about OUGs…

For example: ISCO-88 Skill levels classification

and: UK 1980 CAMSIS scales and CAMCON classes

Existing resources on occupations
Popular websites:
• http://www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/
• http://home.fsw.vu.nl/~ganzeboom/pisa/
• www.iser.essex.ac.uk/esec/
• www.camsis.stir.ac.uk/occunits/distribution.html
Emerging resource: http://www.geode.stir.ac.uk/
Some papers:
• Chan, T. W., & Goldthorpe, J. H. (2007). Class and Status: The Conceptual Distinction and its Empirical Relevance. American Sociological Review, 72, 512-532.
• Rose, D., & Harrison, E. (2007). The European Socio-economic Classification: A New Social Class Scheme for Comparative European Research. European Societies, 9(3), 459-490.
• Lambert, P. S., Tan, K. L. L., Gayle, V., Prandy, K., & Bergman, M. M. (2008). The importance of specificity in occupation-based social classifications. International Journal of Sociology and Social Policy, 28(5/6), 179-192.

Using data on occupations – further speculation
• Growing interest in longitudinal analysis and use of longitudinal summary data on occupations
  • Intuitive measures (e.g. ever in Class I)
    • Lampard, R. (2007). Is Social Mobility an Echo of Educational Mobility? Sociological Research Online, 12(5).
  • Empirical career trajectories / sequences
    • Halpin, B., & Chan, T. W. (1998). Class Careers as Sequences. European Sociological Review, 14(2), 111-130.
• Growing cross-national comparisons
  • Ganzeboom, H. B. G. (2005). On the Cost of Being Crude: A Comparison of Detailed and Coarse Occupational Coding. In J. H. P. Hoffmeyer-Zlotnik & J. Harkness (Eds.), Methodological Aspects in Cross-National Research (pp. 241-257). Mannheim: ZUMA, Nachrichten Spezial.
• Treatment of the non-working populations
  • Seldom adequate to treat non-working as a category
  • ‘Selection modelling’ approaches expanding

Occupations as key variables
• Extensive debate about occupation-based social classifications
  • Document your procedures..
  • ..as you may be asked to do something different..
• When choosing between occupation-based measures…
  • They all measure, mostly, the same things
  • Don’t assume measures measure concepts
  • Lambert, P. S., & Bihagen, E. (2007). Concepts and Measures: Empirical evidence on the interpretation of ESeC and other occupation-based social classifications. Paper presented at the ISA RC28 conference, Montreal (14-17 August), www.camsis.stir.ac.uk/stratif/archive/lambert_bihagen_2007_version1.pdf .

‘Functional form’
The way in which measures are arithmetically incorporated in analysis
• Level of measurement (nominal, ordinal, interval, ratio)
• Alternative models and link functions
• Other variables and interaction effects

a) Levels of measurement and the desire to categorise
• Categories are easier to envisage / communicate
• Much harmonisation work ≡ locating into categories
  • Appearance of measurement equivalence
  • But functional equivalence is seldom achieved
• Metrics are better for functional equivalence
  • E.g. Standardised income
• How to deal with categorisations?
  • The qualitative foundation of quantity [Prandy 2002a]
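As a hedged illustration of standardised income, the sketch below converts incomes to z-scores within each context (here, a hypothetical `country` variable), so that a value expresses relative position in its own distribution rather than a raw amount. The data and variable names are invented, and the sketch assumes every context has some variation in income.

```python
from statistics import mean, pstdev


def standardise_within(records, value, context):
    """Add a z-score of `value`, computed separately within each `context`.

    Assumes each context has non-zero variation in `value`
    (otherwise the standard deviation is zero and division fails).
    """
    grouped = {}
    for rec in records:
        grouped.setdefault(rec[context], []).append(rec[value])
    stats = {c: (mean(vals), pstdev(vals)) for c, vals in grouped.items()}

    standardised = []
    for rec in records:
        centre, spread = stats[rec[context]]
        standardised.append({**rec, "z_" + value: (rec[value] - centre) / spread})
    return standardised


# Hypothetical incomes in two countries with different currencies.
people = [
    {"country": "A", "income": 10000.0},
    {"country": "A", "income": 20000.0},
    {"country": "A", "income": 30000.0},
    {"country": "B", "income": 100000.0},
    {"country": "B", "income": 200000.0},
    {"country": "B", "income": 300000.0},
]
standardised_people = standardise_within(people, "income", "country")
for row in standardised_people:
    print(row["country"], round(row["z_income"], 3))
```

In this toy example the lowest earner in each country receives the same standardised score, although the raw amounts differ by a factor of ten.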

Example: categorisation and the scandalous use of collapsed EGP/NS-SEC…!
• Ignores heterogeneity within occupations
• Defines and hinges on arbitrary boundaries
• Creates artefactual gender differences
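One way to see what collapsing does is to compare how much of the variation in a finer-grained measure is retained by a detailed and by a collapsed classification. The sketch below computes the between-category share of variance (eta-squared) for each. The scores and class labels are hypothetical and chosen only to show the mechanics, not to reproduce results for EGP or NS-SEC.

```python
from statistics import mean


def variance_explained(scores, categories):
    """Share of the total variance in `scores` lying between categories."""
    grand_mean = mean(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    grouped = {}
    for score, cat in zip(scores, categories):
        grouped.setdefault(cat, []).append(score)
    between_ss = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in grouped.values()
    )
    return between_ss / total_ss


# Hypothetical occupational scale scores for six people.
scores = [80.0, 70.0, 60.0, 50.0, 30.0, 20.0]
detailed = ["I", "I", "II", "II", "III", "III"]                 # three classes
collapsed = ["upper", "upper", "upper", "upper", "lower", "lower"]  # two classes

eta_detailed = variance_explained(scores, detailed)
eta_collapsed = variance_explained(scores, collapsed)
print(round(eta_detailed, 3), round(eta_collapsed, 3))
```

The collapsed scheme retains less of the variation because differences between the merged categories are now treated as within-category noise.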

The scaling alternative…
• Many concepts can be reasonably regarded as metric
  • cf. simplified / dichotomised categorisations
• Comparability / standardisation is easier with scales
• Complex / multi-process systems are easier with scales
  • Structural Equation Models
  • Interaction effects
• Growing availability / use of distance score techniques
  • Stereotyped ordered logit [‘slogit’ in Stata]
  • Correspondence Analysis
  • Latent variable models
• …But scaling seems to be seen by some as a wicked, positivistic activity..!

Practical suggestions on the level of measurement
• It’s rare not to have a few alternative measures of the same concepts at different levels of measurement
Good practice would be to
• try alternative measures and see what difference they make
• consider treatment of missing values in relation to measurement instrument choice
• engage as much as possible with other studies

b) Alternative models and link functions
• The functional form of the outcome variable(s) is of greatest importance (it influences which model is used)
• ‘Link functions’ perform the maths to allow for alternative functional forms of the outcome variable
• See [Talk 1] for popular alternative models
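A brief sketch of what a link function does: the model is linear on the scale of the link, and the inverse link maps the linear predictor back to the scale of the outcome. The three links below are standard (identity for metric outcomes in linear regression, logit for binary outcomes, log for counts); the intercept, slope and x value are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
import math

# Each entry pairs a link function with its inverse.
LINKS = {
    "identity": {"link": lambda mu: mu, "inverse": lambda eta: eta},
    "log": {"link": math.log, "inverse": math.exp},
    "logit": {
        "link": lambda mu: math.log(mu / (1 - mu)),
        "inverse": lambda eta: 1 / (1 + math.exp(-eta)),
    },
}


def expected_outcome(intercept, slope, x, link_name):
    """Expected outcome for one predictor under the named link function."""
    linear_predictor = intercept + slope * x
    return LINKS[link_name]["inverse"](linear_predictor)


# Hypothetical coefficients: the linear predictor is -1.0 + 0.5 * 2 = 0.0.
for name in LINKS:
    print(name, expected_outcome(-1.0, 0.5, 2.0, name))
```

The same linear predictor of zero corresponds to an expected value of 0 under the identity link, an expected count of 1 under the log link, and a probability of 0.5 under the logit link, which is why coefficients cannot be read across model types without reference to the link.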

Practical observations on link functions
• Social scientists are unduly conservative in choosing between alternative models
  • [We tend to favour binary or metric outcomes and single process systems]
• Substantively, this isn’t ideal
• Pragmatically, it’s no longer necessary
