Exploring configurational causation in large datasets with QCA: possibilities and problems

Barry Cooper & Judith Glaesser, School of Education, Durham University. 3rd ESRC Research Methods Festival, St Catherine’s College, Oxford, 30 June – 3 July 2008

A note re these slides. • Some of these slides will be used in our presentation itself but some have been written to provide, as a context for the tables, etc., a pre- and post-festival web-based sketch of the method we have employed (Ragin’s Qualitative Comparative Analysis, or QCA) for any readers new to it. • After a brief description of the background to Ragin’s development of the set theoretic approach, and a list of what we see as its strengths, we will illustrate its use with large n data, drawing on our experience of using QCA (Cooper, 2005, 2006; Cooper & Glaesser, 2007, 2008, in press; Glaesser, forthcoming). • To keep things less complex than they would otherwise become, we will not draw attention, during this part of our presentation, to the more problematic issues that we wish to mention. • Instead, we deal with this aspect of our presentation after the illustration of the use of QCA in a large n context.

Concerns about the dominant regression approach in quantitative analysis have a long history. Here, for example, are various remarks taken from Peter Abell’s 1971 book, Model Building in Sociology: • It is often (perhaps more often than not) the case that the covariation between sociological variables is not linear (p.174). • It was argued ... that interaction is a characteristic feature of sociological covariation (p.183). • Multicollinearity is pervasive in sociology; it is more often than not the case that explanatory variables are intercorrelated (p.189). • But from what was said earlier it might be expected that … (cardinal) variables will be of relatively rare occurrence in sociology. One is much more likely to encounter the situation where nominal and ordinal variables are related (p.197). • We have noted earlier that the typical causal situation in social science is one of over-determination – many different clusters of variables are sufficient for a given effect (p.236). Abell’s book also includes considerable discussion of the logic of necessary and sufficient conditions alongside his discussion of linear modelling.

Several authors, from various perspectives, have raised important concerns about regression and its uses. For example (see attached bibliography for details): • Boudon (1974a,b) • Byrne (1998, 2002) • Freedman (1987, 1997) • Hedström (2005) • Lieberson (1985) • Morgan and Winship (2007) • Ormerod (1998) • Pawson & Tilley (1997) • Pearl (2000) • Ron (2002) • Sörensen (1998) • Taagepera (2005).

Andrew Abbott (2001) has summarised some of the key assumptions of the linear model normally used in regression: • The social world is made up of fixed entities with varying attributes (demographic assumption). • Some attributes determine (cause) others (attribute causality assumption). • What happens to one case doesn't constrain what happens to others, temporally or spatially (casewise independence assumption). • Attributes have one and only one causal meaning within a given study (univocal meaning assumption). • Attributes determine each other principally as independent scales rather than as constellations of attributes; main effects are more important than interactions (which are complex types) (main effects assumption).

Charles Ragin’s work. Ragin (1987) shared many of the concerns of these various writers, but, in particular perhaps, focussed on Abbott’s third and fourth points, the relative neglect of causal heterogeneity and complex interaction in regression models when used in practice[1]. Using set theory rather than regression’s linear algebra as the basis for developing a configurational approach to causal modelling, he began to explore ways in which (i) complex interaction between causal factors and (ii) causal heterogeneity (i.e. the existence of several distinct types of cases in a ‘population’[2] and therefore of possible multiple pathways to an outcome) could be described in Boolean or configurational terms (Ragin, 1987, 2000, 2006a). In doing so, he also aimed to shift researchers’ practices away from a focus on the net average effects of variables (i.e. on which variables win the race to explain most variance) and towards an approach that recognised that events in the world are often caused by conjunctions of factors (Ragin, 2006b). It is his Qualitative Comparative Analysis (QCA) on which we focus in this paper. [1] On Abbott’s second point, see Hedström (2005). [2] The returns to cognitive capacity, for example, might differ systematically between social classes.

Before introducing QCA in more detail, we might set out what we regard as the strengths of Ragin’s approach: • A focus on cases and their constituent features rather than, as in regression, on abstracted variables (and therefore net – and often average – effects). • Analysis of multiple and conjunctural causation in terms of necessary and/or sufficient conditions rather than in terms of the linear additive model. • The recognition, up front, of the possibility of causal heterogeneity. • The offer of a rigorous approach, drawing on set theory and logic, to the analysis of these features of social reality. • Through a focus on INUS[1] conditions, the allowing, up front, of complex interactions between causes. • The recognition of the problems resulting from limited diversity in social datasets. [1] An INUS condition is “an insufficient but non-redundant part of an unnecessary but sufficient condition” (Mackie, 1974).

Boolean functional form: an example • Ragin’s QCA and its associated software use Boolean algebra to address conjunctural causation. Boolean equations have a different functional form to the regression equations with which social scientists are familiar. Here is an example taken from a paper contrasting the approaches (Mahoney & Goertz, 2006): • Y = (A*B*c) + (A*C*D*E) • In this equation the symbol * indicates Logical AND (set intersection), + indicates Logical OR (set union), upper case letters indicate the presence of factors, and lower case letters indicate their absence. In this fictional example of causal heterogeneity, the equation indicates that there are two causal paths to the outcome Y. The first, captured by the causal configuration A*B*c, involves the presence in the case of features A and B, combined with the absence of C. The second, captured by A*C*D*E, requires the joint presence of A, C, D and E. Either of these causal configurations is sufficient for the outcome to occur, but neither is necessary, considered alone. A is necessary but not sufficient. The factor C behaves differently in the two configurations. This non-probabilistic - or veristic - example, of course, assumes no empirical exceptions to these relations.
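The claims made about this equation can be checked by enumerating every combination of the five factors. The following is a minimal Python sketch written for this purpose; the function and variable names are our own illustrative choices and are not part of fs/QCA.

```python
from itertools import product

def path1(A, B, C, D, E):
    # A*B*c: A and B present, C absent
    return A and B and not C

def path2(A, B, C, D, E):
    # A*C*D*E: A, C, D and E all present
    return A and C and D and E

def outcome(A, B, C, D, E):
    # Y = (A*B*c) + (A*C*D*E), with + read as logical OR
    return path1(A, B, C, D, E) or path2(A, B, C, D, E)

rows = list(product([False, True], repeat=5))
positive = [r for r in rows if outcome(*r)]

# A is necessary: it is present in every configuration showing the outcome.
a_is_necessary = all(r[0] for r in positive)
# A is not sufficient: some configurations with A present lack the outcome.
a_is_sufficient = all(outcome(*r) for r in rows if r[0])
# Neither path is necessary on its own: each misses some positive rows.
path1_is_necessary = all(path1(*r) for r in positive)
path2_is_necessary = all(path2(*r) for r in positive)

print(len(positive), a_is_necessary, a_is_sufficient,
      path1_is_necessary, path2_is_necessary)
```

Six of the 32 possible configurations display the outcome: four via the first path and two via the second.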

QCA: Sufficiency and quasi-sufficiency Sufficiency, understood causally or logically, involves a subset relation. If, for example, a single condition is always sufficient for an outcome to occur, the set of cases with the condition will be a subset of the set of cases with the outcome. This is shown in Figure 1 (next slide) based on a hypothetical relation between being of service class origin and achieving a degree. Given the condition, we obtain the outcome. In applications to real large n data, perfect sufficiency is unlikely to be found, and a situation like Figure 2 (next slide) will often be found, where most but not all of the set of cases with the condition also are members of the outcome set. Using conventional crisp sets, the proportion of the members of the condition set who are also members of the outcome set can be used as a measure of the degree of consistency of the empirical relation with a relation of perfect sufficiency (here: the number in the yellow subset divided by the number in the yellow and green subsets taken together). Figure 2 illustrates a relation that might be described as only ‘nearly always sufficient’. Alternatively, using a probabilistic view of causation, being of service class origin here could be said to be a sufficient condition, all else being equal, for raising the probability of achieving the outcome to a level equal to this “consistency” proportion.
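The crisp-set consistency measure described here is a simple proportion. As a hedged illustration, the sketch below computes it in Python for ten invented cases; the data and the function name are hypothetical and are not taken from the NCDS.

```python
def consistency_with_sufficiency(condition, outcome):
    """Share of cases in the condition set that are also in the outcome set."""
    outcomes_given_condition = [y for x, y in zip(condition, outcome) if x]
    if not outcomes_given_condition:
        raise ValueError("no cases display the condition")
    return sum(outcomes_given_condition) / len(outcomes_given_condition)

# Hypothetical crisp-set memberships: 1 = member, 0 = non-member.
service_class = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
degree        = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

# Four of the five service class cases hold a degree.
print(consistency_with_sufficiency(service_class, degree))  # 0.8
```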

Figure 1: Perfect Sufficiency Figure 2: Quasi-Sufficiency

QCA: Necessity & Coverage In Figure 3 (next slide), another hypothetical relation between being of service class origin and achieving a degree is shown. This is another example of less than perfect sufficiency. Here the members of the yellow fringe of the service class origin set are not also members of the outcome set. However, most members of this condition set are. This example is also, in fact, a special case in that being of service class origin is a necessary condition for achieving a degree (and in the case of necessity the outcome set is, as can be seen, a subset of the condition set, reversing the direction of the subsethood relation that characterises sufficiency). Venn diagrams can also illustrate Ragin’s concept of explanatory coverage (Ragin, 2006a). The proportion of the outcome set that is overlapped by the condition set can be used as a measure of the degree to which the outcome is covered (‘explained’) by the condition. In Figure 1 (previous slide), the coverage of the outcome of having a degree by the condition of being of service class origin can be seen to be low, with only around 40% of the (blue) outcome set covered by the (yellow) condition set. In Figure 3 (next slide), on the other hand, it can be seen that the whole of the outcome set (again in blue) is covered by the (yellow) condition set, and coverage is 100% (the arithmetic mark of a necessary condition in this simple case).
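Coverage reverses the denominator used for consistency: the overlap is divided by the size of the outcome set rather than the condition set. The Python sketch below uses invented case identifiers arranged like the situation described for Figure 3; the sets and function names are assumptions made for illustration only.

```python
def coverage(condition, outcome):
    """Share of the outcome set that lies inside the condition set."""
    return len(condition & outcome) / len(outcome)

def consistency(condition, outcome):
    """Share of the condition set that lies inside the outcome set."""
    return len(condition & outcome) / len(condition)

# Hypothetical case identifiers: the outcome set sits wholly inside the condition set.
service_class = set(range(1, 11))  # cases 1 to 10
degree = set(range(1, 9))          # cases 1 to 8

cov = coverage(service_class, degree)      # 1.0: every degree holder is covered
cons = consistency(service_class, degree)  # 0.8: less than perfect sufficiency
is_necessary = degree <= service_class     # subset test: outcome within condition
print(cov, cons, is_necessary)
```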

QCA: Multiple conditions and the partitioning of coverage: I In more complex set theoretic models with more than one condition, coverage can be partitioned in a manner analogous to the partitioning of variance explained in regression-based approaches (Ragin, 2006a). The partitioning of coverage into raw and unique components can be illustrated, again using imaginary data, by reference to a more complex Venn diagram (Figure 4, next slide). Here we have added the condition of being of high ability. In this fictional case we now have two crisp sets representing the conditions, ‘SERVICE CLASS ORIGIN’ and ‘HIGH ABILITY’, and the outcome is the achievement of a degree. The Boolean solution can be written as DEGREE = SERVICE CLASS ORIGIN + HIGH ABILITY. Either being of service class origin or of high ability is sufficient for the outcome (since both condition sets, considered separately, are subsets of the outcome set). Greater coverage of the outcome is achieved by having both of these factors in the analysis rather than either alone.

QCA: Multiple conditions and the partitioning of coverage: II We can also see here how coverage can be partitioned straightforwardly in the case of crisp sets. In the case of the relations illustrated in Figure 4 (previous slide) it is easy to see that the total coverage can be broken into three components: • That due to being of service class origin while not being of high ability (the yellow subset as a proportion of the blue outcome set) • That due to being of high ability while not being of service class origin (the orange subset as a proportion of the blue outcome set) • That due to being of service class origin and being of high ability (the red subset as a proportion of the blue outcome set). If we take service class origin as an example, Ragin (2006a) would describe the first of these three (the yellow subset as a proportion of the outcome set) as the unique coverage due to being from this social class background. On the other hand, the coverage due to being of this class origin, whether or not this is conjoined with other causal conditions in the model (the yellow and red subsets taken together as a proportion of the outcome set), he would describe as the raw coverage due to membership in this set (being of service class origin). Parallel arguments apply to being of high ability.
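The partition into unique and raw coverage can be sketched with Python sets. The case identifiers below are invented to mimic the arrangement described for Figure 4, in which both condition sets are subsets of the outcome set; none of the numbers come from real data.

```python
from fractions import Fraction

# Hypothetical case identifiers; both conditions are subsets of the outcome.
service = {1, 2, 3, 4}
ability = {3, 4, 5, 6, 7}
degree = set(range(1, 11))

def share(subset):
    """Proportion of the outcome set falling inside the given subset."""
    return Fraction(len(subset & degree), len(degree))

raw_service = share(service)               # yellow and red together
unique_service = share(service - ability)  # yellow only
raw_ability = share(ability)               # orange and red together
unique_ability = share(ability - service)  # orange only
joint = share(service & ability)           # red only
total = share(service | ability)           # coverage of the whole solution

print(raw_service, unique_service, raw_ability, unique_ability, joint, total)
```

With these invented sets, the three components (1/5, 3/10 and 1/5) sum to the total coverage of 7/10.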

From this point on we employ real large n data in illustrating QCA in use. We can use data from the National Child Development Study (NCDS), comprising children born in one week in March 1958, to illustrate a multifactor conjunctural explanation[1]. Of course, we will not expect to find perfect sufficiency in the empirical world and our example will show how the method embodied in the software addresses this problem. We explore the relations between highest qualifications achieved by age 33 and a number of factors which might be seen as either causal or as summarising possible causes of achievement. To begin with we will take, as our outcome measure, having a highest level of qualification of at least ‘A’ level or its equivalent (HQUAL_ADVANCED). We wish to capture something more, when referring to social class origin, than one point in time, and so, for illustrative purposes, we will take father’s[2] social class at two points. We also include a measure of mother’s education and sex of the respondent. We will not include any measure of ability in this first example, in order to keep things simpler. [1] We will begin by using a subset of the data containing 3826 cases chosen to include no missing values on four measures of father’s class at different times and on mother’s education as well as other key variables. [2] We use father’s class because there are many more cases of missing/not-applicable data for mother’s class. However, we include a maternal influence via mother’s education.

An illustrative Boolean analysis. We will address the Boolean equation: HQUAL_ADVANCED = function(MALE, PMT_FATHER_AT_BIRTH[1], PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED) where: HQUAL_ADVANCED refers to having qualifications of at least ‘A’ level standard by age 33. MOTHER_POST_16_EDUCATED refers to the mother having stayed on in education after age 16. MALE refers to being male rather than female. PMT_FATHER_AT_BIRTH refers to the mother’s husband being in a professional, managerial or technical position[2] at the time of the respondent’s birth. PMT_FATHER_AT_AGE_11 refers to the respondent’s father being in a professional, managerial or technical position when the respondent was aged 11. We should stress that we are not claiming that we have anything like a properly specified model of educational achievement here. Our purpose here is to illustrate QCA in use with large n data. [1] This is actually a measure of the mother’s husband in 1958, but to avoid unnecessary complexity (and given that this is usually the respondent’s father) we have used this description. [2] The PMT grouping used here comprises Classes I and II of the contemporary Registrar General’s scheme.

Table 1: Proportions achieving HQUAL_ADVANCED by class origin, sex and mother’s education (NCDS data; n=3826) : a crosstabulation

QCA: Moving from the crosstab via a truth table to a Boolean solution The first step required is to reconfigure this as a truth table (next slide) where a “1” is entered to indicate the presence of a condition and a “0” to indicate its absence. In this table, where the rows are ordered by the measure of consistency with sufficiency, the first row (1101), for example, represents the causal configuration: MALE*PMT_FATHER_AT_BIRTH*pmt_father_at_age_11*MOTHER_POST_16_EDUCATED with the upper case letters indicating membership in a set and lower case letters non-membership. The proportion of the 34 cases in this configuration who achieve the outcome, i.e. 0.824, appears in the consistency column. The second step is to determine a threshold for quasi-sufficiency and, in the light of this decision, to enter a “1” into the empty outcome (HQUAL_ADVANCED) column against each row (or causal configuration) for which the consistency proportion in the final column passes the threshold set. This decision determines which configurations are allowed into the final solution.
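The two steps can be sketched in Python as follows. This is an illustrative outline, not the fs/QCA implementation. The first configuration reproduces the 34 cases and consistency of 0.824 mentioned above, with the count of 28 achievers inferred from that proportion; the second configuration is entirely hypothetical, and treating the threshold as inclusive is our assumption.

```python
from collections import defaultdict

def build_truth_table(cases, threshold):
    """Group cases by configuration, compute consistency, flag rows at or above threshold."""
    tally = defaultdict(lambda: [0, 0])  # configuration -> [cases, cases with outcome]
    for config, achieved in cases:
        tally[config][0] += 1
        tally[config][1] += int(achieved)
    rows = [
        {"config": config, "n": n, "consistency": hits / n,
         "outcome": int(hits / n >= threshold)}  # inclusive threshold: an assumption
        for config, (n, hits) in tally.items()
    ]
    # Order rows by consistency, highest first, as in Table 2.
    return sorted(rows, key=lambda row: row["consistency"], reverse=True)

# Order of conditions: MALE, PMT_FATHER_AT_BIRTH, PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED
# Row 1101: 34 cases; 28 achievers inferred from the reported consistency of 0.824.
cases = [((1, 1, 0, 1), True)] * 28 + [((1, 1, 0, 1), False)] * 6
# Hypothetical second configuration, invented for illustration.
cases += [((0, 0, 0, 0), True)] * 10 + [((0, 0, 0, 0), False)] * 40

table = build_truth_table(cases, threshold=0.6)
for row in table:
    print(row["config"], row["n"], round(row["consistency"], 3), row["outcome"])
```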

Table 2: Truth table for achieving HQUAL_ADVANCED (NCDS data, n=3826)

Three types of cases? The decision re a threshold also effectively determines which cases, seen as captured by configurations of conditions, will be grouped together in the final solution. In this illustration we will assume that there are three levels of outcome that we wish to understand in configurational terms: • Those configurations – or sets of cases – in which more than 60% of the cases achieve the outcome. Passing this consistency level might be argued to be consistent with this level of outcome approaching being more or less the norm for these configurations. These configurations are also those we might want to allow forward into a solution for quasi-sufficiency. • Those configurations (sets of cases) in which fewer than 40% of the cases achieve the outcome. This level might be seen as making not achieving this level of outcome more or less the norm for these configurations. • The remaining configurations (sets of cases) in which 40% - 60% of the cases achieve the outcome. In these configurations neither achieving nor not achieving the outcome is the norm. Clearly, these decisions require judgements to be made. The reader will see that it is easy to explore other analyses based on other boundaries.
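The three-way grouping can be written as a small Python function, which also makes it easy to try other boundaries. The labels and the handling of values falling exactly on a boundary are our own assumptions.

```python
def classify(consistency, low=0.40, high=0.60):
    """Assign a configuration to one of the three groups by its consistency."""
    if consistency > high:
        return "outcome is the norm"
    if consistency < low:
        return "non-outcome is the norm"
    # Values from low to high inclusive fall in the middle group (an assumption).
    return "neither is the norm"

print(classify(0.824))  # outcome is the norm
print(classify(0.50))   # neither is the norm
print(classify(0.20))   # non-outcome is the norm
```

Passing different values for `low` and `high` shows how configurations move between groups as the boundaries shift.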

The first group of cases. Let us turn to the first group. These configurations have been picked out by entering 1s and 0s in Table 2 in the HQUAL_ADVANCED column. Table 3a (next slide) shows the solution that results when fs/QCA is asked to minimise the configurations picked out by these 1s. These eight rows (‘causal configurations’) are subjected to an algebraic process of Boolean minimisation[1] (Quine, 1952; Ragin, 1987) in order to create the final simplest solution: MALE*PMT_FATHER_AT_BIRTH + PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED + PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED. The two final expressions pick out cases whose mothers had stayed on after 16 and who had a father figure in the PMT class at one of the two points in their childhood. Both males and females are included in these expressions. The first expression picks out just males who were born into a family setting with a father in the PMT class at birth. [1] This proceeds as follows. Taking the first two rows as an example, we have 1101 and 1111. Clearly, at the level of quasi-sufficiency we have chosen, the presence or absence of the third element makes no difference. We can therefore replace it with a dash to indicate this, giving 11-1. A similar argument can be applied to the fourth and fifth rows (0111 and 0101) to give 01-1. Taking 11-1 and 01-1 together, and continuing the process, we arrive at -1-1. This is PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED, one of the terms in our final solution.
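The pairwise merging described in the footnote can be sketched in Python as the first stage of the Quine-McCluskey procedure, which finds the prime implicants. This is an illustrative outline and not the fs/QCA code. The eight input rows are an assumption: they are reconstructed from the three terms of the stated solution rather than read from Table 2. The sketch omits the second stage, selecting a minimal cover, which is not needed here because each of the three prime implicants covers a row that no other covers.

```python
from itertools import combinations

NAMES = ["MALE", "PMT_FATHER_AT_BIRTH", "PMT_FATHER_AT_AGE_11",
         "MOTHER_POST_16_EDUCATED"]

def merge(a, b):
    """Merge two terms differing in exactly one non-dash position, else None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def prime_implicants(rows):
    """Repeatedly merge terms; terms that can no longer merge are prime."""
    terms, primes = set(rows), set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= terms - used
        terms = merged
    return primes

def label(term):
    """Translate a term such as '-1-1' into QCA notation."""
    return "*".join(name if bit == "1" else name.lower()
                    for name, bit in zip(NAMES, term) if bit != "-")

# Assumed rows, reconstructed from the stated solution (not taken from Table 2).
rows = ["1100", "1101", "1110", "1111", "0101", "0111", "0011", "1011"]
solution = prime_implicants(rows)
print(" + ".join(sorted(label(t) for t in solution)))
```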

QCA: an example of a quasi-necessary condition: I It might be thought, at least for some hypothesised meritocracy, that were academic ability to be appropriately defined and measured then some minimum level of this factor ought to be a necessary condition for anyone to achieve a degree. Table 4a illustrates this, where one cell should be empty if the chosen level of ability (X) is a strictly necessary condition for a degree to be achieved. Here, we might be seen as assuming causal homogeneity for the factor of ability. Table 4a: Strict necessity of some level of ability (X) for achieving a degree

QCA: an example of a quasi-necessary condition: II An examination by eye of the NCDS distribution of the proportions achieving a degree at each point of the ability scale allows us to estimate what such a level of ability might be empirically, for all respondents taken together. It is, in fact, around the mean ability score and if we create a factor setting ability as either over or under the mean score for our subset of 3826, we obtain Table 4b, showing that the proportion of those obtaining a degree whose ability score is below the mean is only 10.4%. Especially given that this proportion may include cases where the measurement was low through either error or chance factors, we might be willing to say that a score above the mean approaches being a necessary condition for achieving a degree in this sample and is therefore a quasi-necessary condition. Table 4b: Achieving a degree by ability below and above the mean row (column %)[1] [1] As it happens this test only has discrete scores, from 0 to 80. The mean lies between two of these scores.

QCA: an example of a quasi-necessary condition: III However, we cannot be satisfied with this conclusion which, as we said, effectively assumes causal homogeneity, with ability operating in the same way across all types of cases and, of course, leaves us wondering about the features of the cases amongst the 10.4%. We obviously want to know whether there are sets of cases – perhaps, for example, differentiated by social class - for whom being either above or below the mean, when conjoined with other factors, is either necessary and/or sufficient or not for achieving a degree (or quasi-necessary or quasi-sufficient), especially as apparent returns to ability vary by class, as Figure 5 (next slide), produced using a slightly different class origin categorisation, clearly shows.

Figure 5: Proportions gaining a degree by ability at age 11 and social class

QCA: an example of a quasi-necessary condition: IV • To explore these questions, we might undertake an analysis that includes a measure of ability being over the mean, given what we found in Table 4b. Let us undertake an analysis of: HQUAL_DEGREE = function (ABILITY_ABOVE_MEAN, MALE, PMT_FATHER_AT_BIRTH, PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED). • The relevant truth table is shown in Table 5 (next slide), with the rows ordered by consistency. We can see that the first five rows have a consistency level of 0.40 or above, which we might label as implying that for these cases, gaining a degree is, all else being equal, a definite possibility, something that is a pretty common occurrence in their milieus. Each of these configurations is characterised by having ability above the mean, but conjoined with several supportive paternal and maternal ascriptive factors, and, in most cases, with male sex. The minimised solution for these rows is shown in Table 6 (two slides on) where ABILITY_ABOVE_MEAN appears, as a necessary condition should, in each expression. • We will return to the somewhat paradoxical threshold-dependent sense which the term “necessary” has in this claim after a subsequent example.

Table 5

Table 6: Minimised solution for Table 5, for first five rows

--- TRUTH TABLE SOLUTION ---
frequency cutoff: 9.000
consistency cutoff: 0.417

                                                                                      raw        unique
                                                                                      coverage   coverage   consistency
                                                                                      --------   --------   -----------
ABILITY_ABOVE_MEAN*MALE*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11 +                    0.184      0.065      0.485
ABILITY_ABOVE_MEAN*MALE*PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED +                 0.141      0.022      0.477
ABILITY_ABOVE_MEAN*MALE*PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED +                0.159      0.039      0.466
ABILITY_ABOVE_MEAN*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED   0.239      0.120      0.452

solution coverage: 0.365
solution consistency: 0.453
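As a check on how the four terms of Table 6 relate to truth table rows, the Python sketch below expands each term into the fully specified rows it matches. The bit-string encoding and the ordering of the conditions are our own conventions for illustration.

```python
from itertools import product

def expand(term):
    """Return every fully specified row matched by a term such as '1111-'."""
    options = [("0", "1") if ch == "-" else (ch,) for ch in term]
    return {"".join(bits) for bits in product(*options)}

# Assumed bit order: ABILITY_ABOVE_MEAN, MALE, PMT_FATHER_AT_BIRTH,
# PMT_FATHER_AT_AGE_11, MOTHER_POST_16_EDUCATED. A dash marks an omitted condition.
terms = ["1111-", "111-1", "11-11", "1-111"]

rows = set().union(*(expand(t) for t in terms))
ability_in_every_row = all(row[0] == "1" for row in rows)

print(sorted(rows), ability_in_every_row)
```

The four terms expand to five distinct rows, each with ABILITY_ABOVE_MEAN present, and the fully positive row 11111 is matched by all four terms.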

QCA: an example of a quasi-necessary condition: V A further inspection of Table 5 shows, as we might expect, that having this level of ability characterises the top half of the ordered table (14 out of the 16 rows). However, there are exceptions. The first, in the twelfth row, is the following configuration, with only 34 cases: ability_above_mean*MALE*PMT_FATHER_AT_BIRTH*PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED. This conjunction of lower ability with supportive ascriptive factors is associated with some 20.6% achieving a degree, some way above the mean of 13.3%.

QCA: an example of a quasi-necessary condition: VI We might be especially interested in exploring what it is about those with lower than mean ability that might explain their achieving proportionally more degrees than expected. It is likely, as we can see from this example, to be the presence of supporting ascriptive factors. However, the numbers become very small in some of the relevant rows in Table 5. For this reason, we will explore this question using a different boundary within the ability scale. Sixty-one percent of those achieving degrees in the 3826 have ability in the top 20% of the overall distribution in the NCDS (see Table 7). We can use the remaining 39% to explore what factors, conjoined with being outside the top 20% are associated with raising the proportion gaining a degree. We will define, for current purposes, ability in the top 20% as “high ability”. Table 7: Degrees by High Ability (i.e. ability in top 20%) (column %)

QCA: an example of a quasi-necessary condition: VII Therefore let us undertake a Boolean analysis parallel to the earlier one but that excludes the top 20% of the ability range. Table 8 (next slide) is the relevant truth table, ordered by consistency. A glance at this shows that, for these cases, mother’s education is a key factor in raising the likelihood of a degree. If we set a 0.20 threshold to explore this (having noted the jump from 0.16 to 0.20 in the consistency column), we obtain the solution in Table 9 (two slides on). Within the confines of this analysis, i.e. for those not of high ability as defined, MOTHER_POST_16_EDUCATED is necessary to raise the proportion obtaining a degree to 20%, as is also a father’s class position in the PMT classes for at least one of the two points included. However, the low coverage figure for the solution should be noted (0.296). Amongst those not of high ability as defined, more degrees (140) are gained by individuals outside of the configurations included in this solution than by those within them (59). It must therefore be stressed that the sense of necessary here is necessary to raise the proportion for a configuration to 0.2 or better and not the sense that it is not possible for an individual to gain a degree without a suitably educated mother. Many do precisely the latter.
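The solution coverage quoted above follows directly from the two counts of degrees. A short Python check, using only the figures given in the text:

```python
degrees_inside_solution = 59    # degrees gained within the solution's configurations
degrees_outside_solution = 140  # degrees gained outside them

solution_coverage = degrees_inside_solution / (
    degrees_inside_solution + degrees_outside_solution)

print(round(solution_coverage, 3))  # 0.296
```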

Table 8: Degree by sex, class and mother’s education (only for those whose ability is outside the top 20%)

Table 9: Degree by sex, class and mother’s education (only for those whose ability is outside the top 20%)

--- TRUTH TABLE SOLUTION ---
frequency cutoff: 17.000
consistency cutoff: 0.200

                                                   raw        unique
                                                   coverage   coverage   consistency
                                                   --------   --------   -----------
PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED +     0.276      0.201      0.239
male*PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED   0.095      0.020      0.202

solution coverage: 0.296
solution consistency: 0.236
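The coverage figures in Table 9 can be checked for internal consistency. With two terms, raw minus unique coverage gives the part of the outcome covered by both, and the two unique components plus this shared component should equal the solution coverage. The Python sketch below uses only the published figures; the variable names are our own.

```python
from decimal import Decimal as D

# (raw coverage, unique coverage) for each term, as reported in Table 9.
terms = {
    "PMT_FATHER_AT_AGE_11*MOTHER_POST_16_EDUCATED": (D("0.276"), D("0.201")),
    "male*PMT_FATHER_AT_BIRTH*MOTHER_POST_16_EDUCATED": (D("0.095"), D("0.020")),
}

# Raw minus unique is the coverage each term shares with the other.
shared = {name: raw - unique for name, (raw, unique) in terms.items()}
shared_values = set(shared.values())

# Solution coverage is the sum of the unique parts plus the shared part, counted once.
unique_total = sum(unique for _, unique in terms.values())
solution_coverage = unique_total + next(iter(shared_values))

print(shared_values, solution_coverage)
```

Both terms give the same shared component (0.075), and the total matches the reported solution coverage of 0.296.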

QCA: Limited Diversity in Datasets and Counterfactual Reasoning In the examples we have used above, and with the number of conditions employed in those models, we did not experience the problem of very small numbers in some rows of the truth table that can arise with more conditions as a consequence of (i) the exponential increase in the number of rows as more conditions are included and (ii) the relations – or correlations - between conditions in the empirical world (Ragin & Sonnett, 2005). Small numbers of cases in some configurations constitute a problem because it is difficult to make a valid statement about a group of cases who, empirically, only appear in small numbers. In regression analyses, since the weight of the various combinations of scores on variables is taken into account in calculating average net effects, this problem is effectively dealt with mechanically, partly via the use of significance tests. Ragin has suggested a range of ways of using counterfactual reasoning to address the problems caused by limited diversity. For our use of these approaches with the NCDS data, which we will not have time to discuss, see Cooper & Glaesser (2008).
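The exponential growth in rows can be made concrete with a short Python sketch. It takes the 3826 cases used above and shows the number of truth table rows and the mean number of cases per row for different numbers of dichotomous conditions. The mean assumes cases are spread evenly, so, given correlated conditions, many rows will in practice hold far fewer cases.

```python
def rows_and_mean_cases(n_cases, n_conditions):
    """Number of truth table rows and mean cases per row for crisp conditions."""
    n_rows = 2 ** n_conditions
    return n_rows, n_cases / n_rows

for k in (4, 5, 8, 10):
    n_rows, mean_cases = rows_and_mean_cases(3826, k)
    print(k, n_rows, round(mean_cases, 1))
```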

QCA: Some Problems in its Use With Large Datasets We will introduce here some of the problems and issues that arise for us in using QCA with large n data. We will begin with problems that are not peculiar to QCA since they parallel the correlation / causation problem in conventional quantitative analyses. We will then discuss some problems that are more QCA-specific, though, to some extent, it must be remembered, these may be a consequence of its relatively recent development. Unlike regression, QCA has not been under development for more than a century!

Although we may, and certainly should, have inserted some ‘cautious’ words (‘potentially’, ‘possible’, etc.) before the word causal at various places in this talk, we have not yet addressed the question of whether QCA, as an analytic tool, is able to avoid analogous problems to those associated with moving from correlations to causal claims in the regression approach. Clearly, we might enter into a Boolean model a ‘condition’ that we then found to be logically necessary, for example, for some outcome, but which we would not want to regard as truly causal. • Two types of such conditions are worth distinguishing.

QCA: non-causal conditions I Alcohol might be a necessary (and causal) condition for drunkenness, but, in a society in which it was always mixed with tonic water, we would want to be able to reject a claim (which QCA could obviously deliver, if used mechanically) that tonic water was a necessary causal condition for drunkenness. We would do this, presumably, by reference to existing theoretical knowledge, preferably of the mechanisms and processes involved in the production of drunkenness and/or by comparisons with other sets of findings where tonic water was not mixed with alcohol, etc[1]. [1] Cartwright (2007) provides a formal treatment of this correlation/causation problem in the context of QCA.

QCA: non-causal conditions II To avoid problems of infinite regress, we would want to be able to distinguish some types of causal necessary conditions from others. It may well be necessary for oxygen to be present in order for degrees to be achieved, but we wouldn’t normally expect to address this in an analysis of educational achievement. Mackie’s (1974) concept of the “causal field” provides a way of addressing this potential problem. This field acts as a background context which absorbs the causal factors we would not expect to see referred to as part of an explanation of some particular outcome under examination.

QCA: non-causal conditions III • Having noted these problems, we would nevertheless want to argue that, in our earlier analyses, there are plausible mechanisms implied by such summarising conditions as social class. These conditions (class, ability, etc.) or, at least, the more specific factors they summarise, are plausible causal factors. • Furthermore, when addressing some evaluative questions (e.g. is Britain a meritocracy?), the question itself, once its constituent terms are defined, usually points to the relevant factors to include in a configurational analysis (Cooper, 2005, 2006).

QCA: Underdetermination of theory by data, etc. • We might find in some population that being in the set male*WORKING_CLASS is perfectly sufficient for NOT achieving a given level of educational qualification. • However, whether this is due to working class females lacking some capacity or disposition required to cope with the appropriate curriculum or whether, on the other hand, some form of educational apartheid ensures that no working class female is allowed to enter the institution offering the curriculum, clearly cannot be read off from the Boolean expression. • Of course, other Boolean models perhaps could be used to provide part of the answer (exploring what happens to other females, to working class males; including dispositional factors) but, ideally, we need knowledge of the processes and mechanisms that generate the observed outcomes. Nothing in Ragin’s work, we should note, suggests that he thinks otherwise.

QCA: problems to do with randomness We might find that the configuration HIGH_ABILITY * SERVICE_CLASS has a consistency with sufficiency of, say, 0.90, for achieving some outcome, thereby reaching a level that Ragin would regard as indicating quasi-sufficiency. However, is this gap between 1.00 and 0.90 to be explained by our having the equivalent of an underspecified model in a regression analysis (e.g. perhaps some missing ascriptive factors or a lack of factors concerning ‘choice’) or by the existence of stochastic elements in the social world (and/or measurement or sampling error)? In the former case, there exists some causal heterogeneity yet to be picked out by the conditions entered in the model. It might be that HIGH_ABILITY * SERVICE_CLASS * MALE has perfect consistency with sufficiency, for example. This would leave us, however, with HIGH_ABILITY * SERVICE_CLASS * male having a lower consistency than 0.90 and return us to the same question again, but this time just for females.
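The arithmetic behind this last point can be sketched in Python. The overall consistency of a configuration is a weighted average of the consistencies of its sub-configurations, so if the male cases reach 1.00 the female cases must fall below 0.90. The 50% share of males used below is purely hypothetical.

```python
from fractions import Fraction

def remaining_consistency(overall, share_a, consistency_a):
    """Consistency of the other sub-configuration, given the overall figure.

    overall = share_a * consistency_a + (1 - share_a) * consistency_b
    """
    return (overall - share_a * consistency_a) / (1 - share_a)

# Hypothetical: half the HIGH_ABILITY * SERVICE_CLASS cases are male.
female_consistency = remaining_consistency(
    overall=Fraction(9, 10), share_a=Fraction(1, 2), consistency_a=Fraction(1))

print(female_consistency)  # 4/5
```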

QCA and counterfactualist perspectives on causation A counterfactualist perspective on causation (e.g. Morgan & Winship, 2007) could be used to raise questions about some QCA-derived claims re causality in the same way that it raises questions about some regression-based forms of analysis that basically use a branch of mathematics to describe relations in datasets[1]. On the other hand, a move from a net effects perspective (one assuming independently manipulable independent variables) to one emphasising conjunctural causation might be expected to make it less likely that unjustified counterfactual claims are made by policy makers on the basis of research findings, especially about the effects of intervening to change a single factor without taking account of its context. [1] For a relevant and interesting exchange of views, see Ragin & Rihoux, 2004a,b; Lieberson, 2004; Seawright, 2004; Mahoney, 2004.

More QCA-specific issues: inference from samples to populations I • The first point concerns work that uses samples from some population. This is usually the situation we find ourselves in when working with large datasets. Although attempts have been made (e.g. in earlier versions of the fs/QCA software) to incorporate significance testing (see also Ragin, 2000, and Smithson & Verkuilen, 2006), this is an area requiring more work. Especially when numbers become small in some rows of a truth table, and especially when survey data are being used, a critic will always be able to ask whether sampling (or measurement) error has been taken into account. Although we have considerable sympathy with the view that judgement should play a role in these situations – especially as significance tests are frequently employed when the conditions for their use are not met – we also recognise that more work on incorporating significance testing into QCA would be useful, simply because chance always offers a potential threat to any analytic claim we might make. • But note that Ragin (1987, 2000) has a different perspective on ‘populations’ to the one implied here.
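One simple form such a test could take is an exact one-sided binomial test of a truth table row against a benchmark proportion. The sketch below is an illustration under our own assumptions, not a description of what the fs/QCA software does: the benchmark of 0.65, the row counts and the function name `binomial_upper_tail` are all chosen for the example, and simple random sampling is assumed, which survey data will often violate.

```python
from math import comb

def binomial_upper_tail(successes, n, benchmark):
    """Exact P(X >= successes) for X ~ Binomial(n, benchmark).

    A small value suggests the row's observed consistency is unlikely
    to have arisen by chance if the true proportion were only
    `benchmark`."""
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    return sum(
        comb(n, k) * benchmark**k * (1 - benchmark) ** (n - k)
        for k in range(successes, n + 1)
    )

BENCHMARK = 0.65  # assumed benchmark proportion, for illustration only

# Two invented rows with the same observed consistency of 0.90.
p_small_row = binomial_upper_tail(9, 10, BENCHMARK)    # 9 of 10 cases
p_large_row = binomial_upper_tail(90, 100, BENCHMARK)  # 90 of 100 cases

print(f"small row p = {p_small_row:.4f}")
print(f"large row p = {p_large_row:.2e}")
```

Both rows have a consistency of 0.90, but only the larger one clearly exceeds the benchmark at conventional significance levels, which is the sense in which small rows leave claims open to the charge that chance has not been ruled out.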

More QCA-specific issues: inference from samples to populations II • A related problem we have ignored during the talk so far is that of missing data. Can we assume that the Boolean solutions we have presented, often based on smallish subsets of the whole NCDS (because of the missing data problem), would hold for the NCDS as a whole? This would seem unlikely unless the missing data have been generated by random rather than systematic processes. • Of course, it is possible to undertake some simple checks to see whether any bias is likely to have been introduced. It is also possible to use sophisticated techniques (multiple imputation, etc.) to replace missing data, but such approaches require considerable faith in the very linear models that Ragin and others have argued are often unhelpful in the social world. This is a difficult problem to which we intend to give further thought.
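One of the simple checks mentioned above might look like the following sketch: for a condition that is observed for nearly everyone, compare its distribution among the complete cases used in the analysis with its distribution among the cases dropped because another variable is missing. The records, variable names and function names here are invented for illustration and are not drawn from the NCDS.

```python
def share_in_set(records, condition):
    """Proportion of records in the set `condition`, ignoring missing values."""
    observed = [r[condition] for r in records if r[condition] is not None]
    if not observed:
        raise ValueError("condition is missing for every record")
    return sum(observed) / len(observed)

def complete_case_check(records, condition, required):
    """Compare membership in `condition` between complete cases and
    cases dropped because any variable in `required` is missing."""
    complete = [r for r in records if all(r[k] is not None for k in required)]
    dropped = [r for r in records if any(r[k] is None for k in required)]
    return share_in_set(complete, condition), share_in_set(dropped, condition)

# Invented records: None marks a missing ability score.
records = (
    [{"SERVICE_CLASS": 1, "HIGH_ABILITY": 1}] * 4
    + [{"SERVICE_CLASS": 0, "HIGH_ABILITY": 0}] * 2
    + [{"SERVICE_CLASS": 1, "HIGH_ABILITY": None}] * 1
    + [{"SERVICE_CLASS": 0, "HIGH_ABILITY": None}] * 3
)

share_complete, share_dropped = complete_case_check(
    records, "SERVICE_CLASS", required=["SERVICE_CLASS", "HIGH_ABILITY"]
)
print(f"complete cases: {share_complete:.2f}, dropped cases: {share_dropped:.2f}")
```

A marked difference between the two proportions, as in this toy example, would suggest that the missing data were generated by systematic rather than random processes; a similar pair of proportions would not, of course, prove the opposite.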

More QCA-specific issues: case knowledge (or its lack) in large n contexts • We lack, in the traditional sense, the detailed case knowledge that Ragin argues is required to undertake QCA. • The NCDS, in one sense, does contain a mass of data on each individual respondent but, for example, • (i) it is collected via techniques that are likely to generate considerable error and, • (ii) it is not possible for us to return to the respondent to correct likely errors or to seek new data from earlier periods as analyses develop.

More QCA-specific issues: quasi as opposed to perfect necessity and sufficiency • To repeat what we said earlier, there is the question of whether and when it ever makes sense to stop at quasi- levels of consistency, i.e. to ignore the deviant cases in a row (or to allow a ceteris paribus clause). • More generally, the use of weak implication (quasi-sufficiency and quasi-necessity as opposed to sufficiency and necessity) deserves more discussion (but see Abell, 1971, and also Goertz, 2005; Waldner, 2005; Sekhon, 2005 for a recent exchange).
