Effects in Experiments: Simulating the Counterfactual.
Effects in Experiments: Simulating the Counterfactual • Effects: We define an effect in terms of the counterfactual. If in an experiment you observe what does happen as a result of the application of a treatment (e.g., a condition of the independent variable), the counterfactual is what would have happened if the treatment had simultaneously not been applied (for example, if the same subjects had simultaneously heard and not heard the message, seen and not seen the video, etc.) • Obviously you can’t expose the same subjects to both conditions simultaneously, so through randomization or matching of cases you attempt to simulate the counterfactual by approximating the “not treated” under conditions as close as possible to the “treated.” In short, you try to simulate in the experiment the effect of observing the presence and the absence of the condition simultaneously • An effect of the treatment is the difference between what did happen when the treatment was applied and what would have happened had it not been applied
Simulating the Counterfactual, cont’d • Another approach to simulating the counterfactual: case-control studies • While the “gold standard” of research is said to be the prospective, double-blind design with random assignment of subjects to conditions of the IV, in some cases such a design is not ethically justifiable, nor practically feasible in terms of how long it might take to accumulate an adequate number of cases • A case-control study is an alternative in which you find cases already “assigned” to a level of the independent variable (people who have diabetes, people who have multiple DUIs, etc.), match them (on possible confounding variables) with controls who do not have these conditions, and compare the two groups, treating the “nots” as the counterfactual: what would have happened if the people who have the condition did not have it (counter to fact), e.g., the very same people both with and without diabetes • This type of design is retrospective, not double blind, and does not involve random assignment, but it can still be very powerful. It looks backwards to try to identify the causes of effects that have already occurred
Causal Relationship • A causal relationship is said to exist if • the putative cause (X) preceded the effect (Y), • the putative cause (X) is associated with the effect (Y), • and other plausible explanations have been ruled out (not just explanations which explain Y, but explanations Z which may explain X, Y, or the XY relationship) • Thus, given our data from the employment2.sav file, although gender precedes employment category, we could not conclude that gender (X) was the cause of an individual’s job category (Y), despite a strong XY association, because we could not rule out the impact of gender on educational attainment (Z) and the subsequent impact of educational attainment on job category
Causal Relationship, cont’d • In an experiment we select variables which are logically/chronologically prior to the dependent variable to be treatment variables (IVs) • We observe the effect on the DV of variation or manipulation of the IV (we note their association) • We attempt to rule out (control for) competing explanations, e.g., by identifying confounds which might explain the observed association between the IV and the DV
Nonmanipulable Variables, Analogue Experiments, Causal Description vs. Causal Explanation • The Shadish et al. book would not consider gender to be a ‘cause’ in a proper experiment because it can’t be manipulated to see what happens • They argue that naturally occurring IVs like gender have so many covariates due to life experience that it is a different order of problem to try to find and attribute causes to them • Much stronger inference is possible if you can manipulate IVs, for example, dosage in a medical study, word choice in media messages, etc. • Analogue experiments: taking a nonmanipulable variable like gender and simulating it, such as dressing a confederate of the experimenter as male or female, or even simulating finer gradations of the “femininity” variable • Causal description (being able to show that systematic manipulation of the IV produces consequences for the DV) is different from being able to explain why this effect occurs
Molar vs. Molecular Causation • Molar causation: IV conceptualized and measured at a macro level encompassing all constituent elements, for example, exposure to graphic violence • Enables descriptive causation • Molecular causation: IV is conceptualized and measured at the micro level of its constituent elements, such as level of nervous system arousal, empathy, social learning, etc. • Enables explanatory causation, in part by virtue of enabling the detection of interaction effects and limiting conditions
Moderator vs. Mediator Variables • Moderator variable: one which determines the conditions under which a described causal relationship holds (increasing the frequency of broadcast of car commercials in which the dealer himself appears increases his sales among low-income prospective buyers but not among high-income prospective buyers). Effect of the IV depends on the value of the ModV • Mediator variable: a link in the causal chain between IV and DV. For example, educational attainment might be called a mediator variable between gender and job category. A variable is a strong mediator if it has a strong association with both IV and DV, but the relationship between the IV and the DV reduces to zero when the MedV is entered into the relationship model
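To make the mediator check concrete, here is a minimal sketch of the regression logic just described, on simulated data (the variable names and coefficients are illustrative, not from an actual dataset; Python with numpy and statsmodels is assumed): regress the DV on the IV alone, then add the candidate mediator and watch the IV coefficient shrink toward zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
gender = rng.integers(0, 2, n)                         # IV (coded 0/1)
education = 12 + 2 * gender + rng.normal(0, 1, n)      # mediator: depends on IV
job_level = 1 + 0.8 * education + rng.normal(0, 1, n)  # DV: depends only on mediator

# Step 1: DV on IV alone -- shows a sizeable "effect" of the IV
m1 = sm.OLS(job_level, sm.add_constant(gender)).fit()

# Step 2: add the mediator -- the IV coefficient collapses toward zero
X = sm.add_constant(np.column_stack([gender, education]))
m2 = sm.OLS(job_level, X).fit()

print(m1.params[1], m2.params[1])  # roughly 1.6 vs. roughly 0
```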
Review of Types of Experimentation • Experiment: Manipulate levels of an IV (treatments) to observe its effects • Randomized Experiment: Assign cases to levels of the treatment by some random process, such as using a random number table or generator • Quasi-experiments involve comparisons between naturally occurring treatment groups (formed by self-selection or administrative selection). The researcher does not control group assignment or treatment, but has control over when/what to observe (DV) • An example might be people who work regular daytime hours vs. the night shift • The researcher must rely on statistical controls to rule out extraneous variables, such as other ways in which the treatment groups differ besides the IV of interest • The search for counterexamples and competing explanations is inherently falsificationist, as is searching for moderators and limiting conditions
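For the random-assignment bullet above, here is a minimal sketch of assignment by a random process, assuming Python’s standard library; the function name and condition labels are invented for illustration.

```python
import random

def randomize(subject_ids, conditions=("treatment", "control"), seed=None):
    """Shuffle subjects, then deal them round-robin so cell sizes stay equal."""
    rng = random.Random(seed)            # stands in for a random number table
    ids = list(subject_ids)
    rng.shuffle(ids)
    return {sid: conditions[i % len(conditions)] for i, sid in enumerate(ids)}

assignment = randomize(range(1, 41), seed=42)   # 40 subjects, 20 per condition
```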
Types of Experiments, cont’d • “Natural experiments” typically involve before-and-after designs where you look at a DV of interest before and after some phenomenon that has occurred, for example, tying Presidential approval ratings to revelations about bailed-out bank excesses • Non-experimental designs (correlational studies) are basically cross-sectional studies in which the researcher makes an effort to establish causal influence through measurement and statistical control of competing explanations
Construct Validity and External Validity • Suppose I am doing a study on the impact of font size and face on usability of Web pages by the elderly. If I conduct a study in which I vary Web page default font size (10 pt, 12 pt, 14 pt, 16 pt) and face (serif, sans-serif) and then measure the time from first page pull to 1 minute after last page pull for a group of people in an assisted living facility, I have two sorts of generalizability concerns. • One, called construct validity, is how I get from the particular units, treatments, and observations of my study to the more general constructs they represent. That is, is this study useful in answering the question I really want to get at, which is: if we make adaptations to Web pages that take into account the physical limitations associated with aging, will people spend more time on a Web site? Do these specific operationalizations tap the actual constructs (page design, time spent on the site) whose causal relationship we are seeking to understand?
External validity • The other, called external validity, is whether the causal relationships observed in this particular study hold across variations in units, treatments, observations, and settings. Would elderly still living at home respond in the same way? Would the results apply to cases where the elderly were allowed to set their own font size? Can we generalize our results to unstudied persons, web sites, page designs, etc.?
Improving the Ability to Generalize • What are some ways to improve construct and external validity (e.g., to overcome the “local” nature of the typical experiment)? • Most obvious, and most difficult to achieve in practice, is some form of probability sampling (random, cluster, stratified random, etc.) of units (subjects or cases), treatments, observations, and settings • In practice, where do you get the sampling frame (the list of all units making up the population of treatments, settings, etc.) from which to randomly draw? • More likely, the researcher will seek in the selection of units, treatments, settings, and observations to emphasize diversity (heterogeneity) and/or representativeness (typicality)
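As a rough illustration of the sampling ideas above, here is a sketch of stratified random sampling from a frame, assuming Python’s standard library; the strata and field names are hypothetical.

```python
import random

def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample (without replacement) from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return [u for units in strata.values()
            for u in rng.sample(units, min(n_per_stratum, len(units)))]

# Hypothetical frame: elderly users in two settings
frame = [{"id": i, "setting": "home" if i % 3 else "assisted"} for i in range(300)]
sample = stratified_sample(frame, lambda u: u["setting"], 20, seed=7)
```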
Validity • Notion of validity as a property of inferences, rather than a property of the experiment • The truth of any claim made about the results of an experiment is assessed by various standards: • Correspondence between the claim and the “external world” of empirical evidence • Embedding of the claim within a network of relevant theory and claims; internal consistency, fidelity • Pragmatic utility of the claim in explaining that which is difficult to understand and in ruling out alternative explanations • Acceptance by other scientists; truth as a social construction
Types of Validity • Statistical conclusion validity: proper use of statistics to make inferences about • The nature of the covariation between variables • The strength of that relationship • Internal validity: extent to which a causal inference can be reasonably made about the observed covariation given the particulars of the manipulation (treatments) and measurements
Types of validity, cont’d • Construct validity: extent to which experimental operationalizations and procedures are valid indicators of the higher order constructs they represent • External validity: extent to which inferences about the causal relationship hold up under other UTOS (units, treatments, observations, settings) • Validity analysis is about the process of identifying potential threats to these four types of validity and controlling them or eliminating them wherever possible and directly assessing their impact if not • This is a theory-laden process and in practice it is difficult to identify all relevant threats, particularly all plausible alternatives
Statistical Conclusion Validity • Threats to statistical conclusion validity: improper use of statistics to make inferences about the nature of the covariation between variables (e.g., making a Type I or Type II error) or the strength of that relationship (mistakenly estimating the magnitude of covariation or the degree of confidence we can have in it) • It is recommended that statistical hypothesis test reporting be supplemented with reporting of effect sizes (r² or partial eta²), power, and confidence intervals around the effect sizes
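A sketch of what that supplemented reporting might look like, on simulated data, assuming Python with numpy/scipy; Cohen’s d is used as the effect size here, with a common large-sample approximation for its standard error (an assumption for illustration, not the only valid choice).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treat = rng.normal(52, 5, 40)
ctrl = rng.normal(50, 5, 40)

t, p = stats.ttest_ind(treat, ctrl)
sp = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)  # pooled SD (equal n)
d = (treat.mean() - ctrl.mean()) / sp                     # Cohen's d
se_d = np.sqrt(2 / 40 + d**2 / (4 * 40))                  # approximate SE of d
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}, "
      f"95% CI [{d - 1.96*se_d:.2f}, {d + 1.96*se_d:.2f}]")
```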
Threats to Statistical Conclusion Validity • Identification of several specific threats to statistical conclusion validity • Low statistical power • Power analysis serves two purposes: deciding how large a sample you need to get reliable results, and determining how much power you have to detect a covariation among variables if it in fact exists • Beyond a certain sample size the law of diminishing returns applies, and in fact a very large sample can “detect” an effect that is of little real-world significance (i.e., you will obtain statistical significance but the amount of variation in the DV explained by the IV will be very small)
Statistical conclusion validity, cont’d • Statistical power usually should be .80 or higher • Example of a low-power problem: failing to reject the null hypothesis when it is false because your sample size is too small. Suppose there is in fact a real increase in side effects associated with higher doses of a drug, but you did not detect it in your sample of size 40 because your power was too low; doctors will then go ahead and prescribe the higher dose without warning their patients that they could experience an increase in side effects. You could deal with this problem by increasing the sample size and/or setting your alpha (Type I error rate) to a larger value than .05, for example .10 or .20
Calculating Statistical Power • Power can vary as a function of the robustness of the statistical test, the sample size, and anything that could make an effect “hard to detect,” such as measurement error or the fact that it really is a small effect • Online power calculators will tell you, for various sample sizes, alpha levels, and expected mean differences and standard deviations in the populations, what level of statistical power you can expect. Let’s suppose you anticipate a small effect (a difference of means of only two points between your two populations), but you believe that this is not just the result of sampling error. Will you be able to detect this effect?
Calculating statistical power, cont’d • Try out the calculator with these values: the mean of population 1 = 50, the mean of population 2 = 52, their standard deviations each = 5, you’re doing a one-tailed test, the significance level (the likelihood of rejecting the null hypothesis when it is in fact true, or Type I error rate) is .05, and the sample sizes from your two populations are 20 and 20, respectively. What is your power level? .352. Now increase your sample sizes to 40 and 40. How does that affect your power to detect the effect? Still not very good. How about decreasing your measurement error (SDs = 2 for both samples)? That helps a lot (99.8%). Now set your SDs back to 5 but change the significance level from .05 to .20. That also improves the power with a small sample and large error, but probably won’t impress journal editors.
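The slide’s numbers can be reproduced in code if we assume the calculator uses the normal (z) approximation for a two-sample mean comparison; a t-based calculation (e.g., statsmodels’ TTestIndPower) would give slightly smaller values. A minimal sketch:

```python
from scipy.stats import norm

def power_two_sample(m1, m2, sd, n1, n2, alpha=0.05, one_tailed=True):
    """Power of a two-sample mean comparison via the normal approximation."""
    se = sd * (1 / n1 + 1 / n2) ** 0.5        # SE of the mean difference
    ncp = abs(m2 - m1) / se                   # standardized expected difference
    z_crit = norm.ppf(1 - alpha) if one_tailed else norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - ncp)         # P(reject H0 | H1 is true)

print(power_two_sample(50, 52, 5, 20, 20))              # ~0.352, as on the slide
print(power_two_sample(50, 52, 5, 40, 40))              # ~0.56: still not great
print(power_two_sample(50, 52, 2, 40, 40))              # ~0.998: less error helps a lot
print(power_two_sample(50, 52, 5, 20, 20, alpha=0.20))  # ~0.66: looser alpha helps too
```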
Threats to Statistical Conclusion Validity, cont’d • The power to detect an effect is a complicated product of several interacting factors such as measurement error, size of the predicted effect, sample size, and Type I error rate • Shadish et al. provide a comprehensive list of strategies for improving power to detect an effect, i.e., to find differences between treatments or levels of the IV (Table 2.3). These include increasing the reliability of measures, increasing treatment strength, measuring and correcting for covariates, using homogeneous participants, and equalizing cell sizes (N of subjects assigned to conditions). Many of these have to do with reducing possible sources of random and measurement error • In addition to inadequate power, there are further threats to statistical conclusion validity:
Threats to Statistical Conclusion Validity, cont’d • Failing to meet the assumptions of the test statistic, for example the assumption in a t-test that observations within a sample are independent. Violating independence might yield significant differences between two samples where the real difference is attributable to factors the subjects had in common, such as being from the same neighborhood or SES, rather than to the treatment they were exposed to. Other assumptions that can be violated include equality of population variances, interval-level data, normality of the populations with respect to the variable of interest, etc.
Statistical Conclusion Validity, cont’d • Inflation of the Type I error rate when there are multiple statistical tests. What starts out as .05 with one test becomes a very large probability of rejecting the null hypothesis when it is in fact true after repeated consultations of the table of the underlying distribution (normal table, t, etc.). It’s not the done thing to correlate 20 variables with each other (or to do multiple post-hoc comparisons after an ANOVA), see what turns up significant, and then go back and write your paper about that “relationship”
Protecting the error rate when there are multiple tests • The Bonferroni correction divides the alpha error rate by the number of tests and then uses the corrected value in all the tests. This is the way to “play fair” • If there are 10 correlations in our fishing expedition we would set the significance level required for rejection of the null hypothesis at alpha = .05/10, or .005. Then even if we conduct ten tests our experimentwise error rate is still under .05 (see the sketch below) • Not everybody agrees with this practice, arguing that we already obsess so much about keeping alpha levels small that many interesting effects go undetected and get tossed in the trash can, and this correction just makes it worse
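A quick numerical check of the Bonferroni arithmetic for the ten-test example (plain Python; the familywise rates shown assume the tests are independent):

```python
alpha, n_tests = 0.05, 10
alpha_per_test = alpha / n_tests                  # 0.005 for each of the 10 tests
fwer = 1 - (1 - alpha_per_test) ** n_tests        # familywise Type I error rate
uncorrected = 1 - (1 - alpha) ** n_tests          # familywise rate with no correction
print(alpha_per_test, round(fwer, 4), round(uncorrected, 4))
# 0.005, ~0.0489 (still under .05), vs. ~0.4013 if every test uses .05
```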
Threats to Statistical Conclusion Validity, cont’d • Unreliability of measures • Restriction of range: avoid dichotomizing continuous measures (for example, substituting “tall” and “short” for actual height); avoid dependent variables whose distribution is highly skewed, with only a few cases at one or the other end of the scale • Lack of standardized implementation of the treatment or level of the independent variable (we talked about this before in terms of things like instructions being memorized over time, experimenter effects, etc.), unless adaptive application of the treatment is a more valid instantiation of how the treatment would occur in the real world
Threats to statistical conclusion validity, cont’d • Within-subjects variability: In most analyses that look at effects of treatments you want your between-treatment variability to be large, in accordance with your research hypothesis; if there is a lot of variability among the subjects within a treatment, that may make it more difficult to detect the predicted effect. There is a trade-off between ensuring subject homogeneity within treatments, which increases power to detect the effect, and possible loss of external validity • Inaccurate effect-size estimation: recall how the mean is affected by outliers. Sometimes a few extreme cases or outliers can adversely affect and perhaps inflate the estimates of effect sizes (differences on the DV attributable to the treatment or levels of the IV)
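A tiny made-up illustration of the outlier point: one extreme case inflates a mean-based difference that a robust statistic (here the median) does not show.

```python
import numpy as np

treat = np.array([51, 52, 50, 53, 49, 52, 51, 50, 95.0])  # one extreme score
ctrl = np.array([50, 49, 51, 50, 48, 50, 49, 51, 50.0])

print(treat.mean() - ctrl.mean())          # ~6.1: inflated by the single outlier
print(np.median(treat) - np.median(ctrl))  # 1.0: the robust difference is small
```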
Internal Validity • Does the observed covariation between the IV and the DV constitute a causal relationship, given the way in which the variables were manipulated/measured (its local or molar circumstances)? To qualify, it must be the case that • The IV is chronologically prior to the DV • No other explanation is plausible • What are the threats to internal validity?
Threats to Internal Validity • Lack of clarity about causal ordering (more of a problem in correlational studies than in experiments, in which you expose respondents to the treatment and then measure the outcome) • Systematic differences in respondent characteristics on variables other than the IV of interest: people in the treatment condition already have more of the DV property for some unknown, unmeasured reason. Random assignment and pre-testing can reduce this threat from confounding variables • History: any events which intervene between the treatment and the outcome measure. Example: subjects are presented with anti-smoking messages but are allowed a break before completing the post-test, and various events happen during their break, such as seeing smokers who are/are not attractive role models. More of a problem in studies which assess effects over long periods of time • Maturation: Both history and maturation are problems for reliability of measure as well as for causal attribution. Could naturally occurring changes in the units of analysis (people, or elements such as neighborhoods or organizations) be responsible for changes in the outcome which the experimenter is trying to attribute to the treatment?
Threats to Internal Validity, cont’d • Regression to the mean: likely to be a problem in quasi-experiments when members of the group were selected (self- or administratively) based on having high or low scores on the DV of interest. Testing on a subsequent occasion may exhibit “regression to the mean,” where the once-high scorers score lower, or the once-low scorers score higher, and a treatment effect might appear when there really isn’t one. Having a really high score on something (like weight, cholesterol, or blood sugar) might be sufficient to motivate a person to self-select into a treatment, but the score might fall back to a lower level just naturally, or through simply deciding to “get help,” although the drop could be attributed to the effects of the treatment • Attrition: selective dropping out of a particular condition or level of the independent variable by people who had the most extreme pre-test scores on the DV; when they drop out it makes the post-test mean for that condition “look better,” as if that treatment had a stronger effect, since its mean would be lower without the extreme people
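The regression-to-the-mean threat is easy to demonstrate by simulation (illustrative parameters, assuming numpy): select cases on an extreme first measurement, re-measure with fresh error, and the group mean drifts back toward the population mean with no treatment at all.

```python
import numpy as np

rng = np.random.default_rng(2)
true_score = rng.normal(100, 10, 10_000)          # stable underlying level
test1 = true_score + rng.normal(0, 10, 10_000)    # first measurement + noise
test2 = true_score + rng.normal(0, 10, 10_000)    # re-test with fresh noise

selected = test1 > 120                            # "self-selected" high scorers
print(test1[selected].mean())   # roughly 126-127
print(test2[selected].mean())   # roughly 113-114: looks like a "treatment effect"
```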
Threats to Internal Validity, cont’d • Testing: as mentioned before, simply taking a test can create change which can be mistaken for a treatment effect; it can increase awareness of the DV and induce a desire to change independent of what the treatment can produce (called test reactivity) • Instrumentation: changes in a measure over time (for example, coders may become more skilled, or may develop favorite categories as they code more samples) or changes in its meaning over time • Random assignment can eliminate many potential threats to internal validity in experiments; in quasi-experiments where that is not possible, the experimenter should try to identify as many threats as possible and eliminate or control for them
Construct Validity • Construct validity has to do with the process of making inferences from the particulars of an experiment, for example its measuring instruments, to the higher-order constructs they represent • It applies to the units (the subjects or cases), the treatments, the outcomes (measurements of the DVs), and the settings • Two problems: understanding the construct, and measuring it • With respect to understanding the construct, the most difficult task is a definitional one, in which the researcher decides what the central or prototypical features of the construct are and delimits it such that it is also clear what the construct is not • Example: recent research on frustration based on an understanding of the concept grounded in the inability to reach a specific desired goal owing to circumstances over which the individual has little or no control. This is a very circumscribed notion, placing the locus of frustration squarely in the particulars of a problematic situation and not, say, in the more popular use of the term when we describe a person as frustrated, meaning that they have a general life issue of being unable to meet goals • Under this notion of the construct, prototypical features would include (a) a specific, short-term desired goal, (b) barriers to goal attainment, and (c) little to no ability to remove the barriers
Practices that Promote Construct Validity • Clear description of the units (subjects, cases), treatments, outcomes (DV measures), and settings at the desired level of generalization • For example, clearly describe the outcome construct “parents’ beliefs about the role of internal and external factors in their children’s health care,” identifying the prototypical feature of interest as parental beliefs about the role of luck, own agency, and experts in children’s health outcomes. Distinguish it from related constructs such as “parents’ fears for their children’s health,” “parents’ trust of medical professionals,” or “parents’ approach/avoidance with respect to health issues” • Select specific instantiations of the construct, such as the “Health Locus of Control-Parent/Child” scale, which has items like the following: • If my child feels sick, I have to wait for other people to tell me what to do. • Whenever my child feels sick, I take my child to the doctor right away. • There is nothing I can do to make sure my child has healthy teeth. • I can do many things to prevent my child from having accidents.
Practices that Promote Construct Validity, cont’d • In an iterative process, compare the specific instantiation (the measuring instrument) to the construct and assess goodness-of-fit. Note points of departure and make adjustments (to the measure, or to the description of the higher-order domain covered by the construct) as appropriate • Realize that the match will never be perfect because both the construct and its operational definition are socially constructed; definitions are consensually arrived-at constructions, and the consensus can and will change
Practices to promote construct validity, cont’d • Think about a concept such as “middle class” and all the political baggage that attaches to it. If one wants to make a case that the middle class is better off today, or worse off today, the construct can be defined and operationalized in such a way as to support one’s preferred outcome • Similarly, “middle class” can refer as much to behaviors, practices, values, or perceived social standing as it does to economic indicators like household income • If you ask people what class they belong to, most people will select “middle class,” regardless of their SES • Realize also that constructs and the way they are used to classify people can have major social ramifications if social research is taken up and used to justify workplace/policy decisions, e.g., who is “needy,” who is an “employment risk,” etc.
Threats to Construct Validity • Inadequate construct explication: the researcher hasn’t done an adequate job of describing the prototypical characteristics of the construct and distinguishing it from other related constructs • Too general, too specific, inaccurate, doesn’t incorporate method • Example: women and “spatial reasoning”: how well or poorly women perform compared to men is a function of the testing environment (paper and pencil vs. 3D immersive) • Explicating a construct like “jurors” in jury research: People who volunteer for jury studies in exchange for a free meal are different from people who resentfully show up for jury duty and try to get out of it but are impaneled anyhow
More Threats to Construct Validity • Construct confounding: the measurements may tap extraneous constructs not part of the construct of interest; subjects in the sample are thought to represent “impoverished urban elderly” because of their participation in free/low-cost meal programs, but they may also be the healthy/ambulatory/psychologically sturdy elderly who can walk to the centers for their meals, or who can afford to pay but come for the company. They may differ from other urban seniors on a host of factors • “Mono-operation bias”: using only one measure of a construct, say only one example of a “pro-safe-sex” message to represent the larger construct, or one dependent measure of, say, loneliness, where several different measures of the same construct (for example, several subsets of items from the same “item universe”) would lend weight to results
Threats to Construct Validity, cont’d • “Mono-method bias”: this is a tricky one, as using multiple methods and getting similar outcomes (a similar IV>DV relationship) may improve construct validity, but may also introduce method variance, which can result in nonsignificant findings due to different sources of random error that would not have been present with a single method • Confounding constructs with levels of constructs: failing to adequately calibrate, at a conceptual and operational level, the independent variable such that variations in its levels can be observed independently and assessed for effect • May be particularly problematic for effects whose impact on the DV is curvilinear, such that there is impact at very high or very low levels of the IV, but not at intermediate levels. Example: satisfaction with a day’s shopping may be greatest when the expenditures were very high (I have made an investment purchase that will last for years) or very low (I am a smart shopper who got a great bargain), but lowest with a medium level of expenditure
Threats to Construct Validity, cont’d • Reactive self-report changes: Reactivity refers to the property of treatments and experimental settings that produces changes in DVs which can be confused with the intended effects of a treatment or level of the IV • For instance, people wanting acceptance into a clinical program may over-report or under-report their symptoms depending on what they think is required. Similar to the regression-to-the-mean issue, but the notion here is not of statistical regression, where the need for the treatment peaks before assignment and then declines due to relief after treatment is provided, but of a change from behavior designed to induce/avoid assignment (like playing sick to get out of work) to more representative behavior after the assignment is made
Threats to Construct Validity, cont’d • Reactivity to the experimental situation: refers to demand characteristics of the treatments or settings which induce subjects to form hypotheses about what’s going on and act accordingly. There are various solutions, including separating treatments and outcome measurement as far as possible in time and space • Use the Solomon four-group design if sensitivity to the pre-test is an issue. In the Solomon design, one experimental and one control group receive the pre-test, then the treatment (or control equivalent), then the post-test, while another pair of experimental and control groups receive the treatment (or non-treatment) and post-test only, with no pre-test. Then do a 2 x 2 between-groups analysis of variance (Treatment/No Treatment x Pre-test/No Pre-test) on the post-test scores, as sketched below. A significant interaction between the Treatment and Pre-test factors would indicate that subjects were sensitized by the pre-test • Experimenter expectancies: the experimenter may convey his/her hopes and expectations without being aware of it. A principal motivation for double-blind procedures, or for the use of experimenters who are unfamiliar with the hypotheses
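A sketch of the Solomon four-group analysis just described, on simulated data (Python with pandas/statsmodels assumed; the group sizes and effect values are invented): fit the 2 x 2 factorial model and inspect the treatment-by-pretest interaction term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "treated": np.repeat([0, 1, 0, 1], 25),     # four groups of 25
    "pretested": np.repeat([0, 0, 1, 1], 25),
})
# Simulated post-test scores with a small sensitization (interaction) built in
df["post"] = (50 + 5 * df["treated"] + 1 * df["pretested"]
              + 3 * df["treated"] * df["pretested"] + rng.normal(0, 4, 100))

model = smf.ols("post ~ treated * pretested", data=df).fit()
print(anova_lm(model, typ=2))  # a significant treated:pretested row = sensitization
```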
Threats to Construct Validity, cont’d • Novelty and disruption effects: the Hawthorne effect; positive change may be due not to the treatment but to the excitement of being in an experiment (being paid attention to, something to liven up the workday) • Compensatory equalization: refusal by managers or other nonresearchers who administer the treatment to cooperate with the random assignment schedule, because they want to make the benefits available to all • Compensatory rivalry: occurs when people in a control or less-favorable treatment condition are aware of the other, more favorable condition and put forth extra effort to score “high” (or low if required) on the outcome measures
Threats to Construct Validity, cont’d • Resentful demoralization: the opposite effect, where receiving the less favorable treatment can cause scores on the outcome measure to be lower than they would have been without the knowledge that others were getting a “better” treatment. These problems are likely to occur in quasi-experimental, real-world designs but can usually be controlled in the laboratory setting • Treatment diffusion: when participants in the control group somehow receive all or part of the treatment. For example, in the Shanghai study of the effect of breast self-examination (BSE) training on breast cancer mortality, no reduction of mortality was found for the trained groups as opposed to the untrained groups. There was some speculation afterwards (just speculation) that women who received the training actually trained their friends and neighbors, some of whom were in the control group • Wow, what a long list of things that can go wrong! • But there’s more…
External Validity • External validity is the extent to which a causal relationship, obtained (or not found) under specific units, treatments, observations, and settings, would hold over variations in those units, treatments, outcomes, and settings. In short, it is mostly concerned with the UTOS which were *not* in the study • For example, with respect to the aforementioned Shanghai study, would the obtained lack of a BSE training > decreased mortality relationship apply to women in Western societies, where mammography might be more readily available, where BSE had been preached for years as good practice, where the typical woman might have a higher BMI, etc.? • If you were an administrator at the NCI and you were charged with making a recommendation to the nation’s women about BSE, would you recommend against it on the basis of such a study? It ran for nearly ten years and featured thousands of women, so there was ample time and statistical power to detect an effect for BSE if one were present • What about the recent research suggesting that HRT increases the risk of stroke? Knowing that many of the study respondents were women who first began taking HRT when they were in their 60s and older, would you recommend that women who started taking HRT in their 50s now stop taking it?
Threats to External Validity • Here are ways a causal relationship might not hold across UTOS • Interaction of the causal relationship with units: an effect found with women might not hold for men; it might apply in some zip codes but not others; it might apply to guinea pigs but not people • Interaction of the causal relationship with treatment variations: for example, the relationship between training in using Blackboard and likelihood of using it in class may vary if one group is taught with PowerPoint slides or handouts and another has to take notes • Interaction of the causal relationship with observations: variations in measurement of the outcome variable will affect whether or not the obtained causal relationship holds, for example, measuring hours spent watching TV with self-report vs. a usage monitor attached to the TV • Interaction of the causal relationship with settings: in the seniors-and-Internet study, the treatment effect (availability of computer classes in increasing Internet-based social support networks) may vary considerably between the beautiful new cybercafe at one meal site and the unglamorous, less well-furnished premises available at another
Threats to External Validity, cont’d • One final threat is the effect of a mediating variable which has an impact in one setting or with one class of subjects, but not with another • Example: effects of gender on job category may be mediated by educational attainment for certain types of industry but not for others (e.g., for creative industries vs. traditional, conservative fields where even women with advanced degrees may not have equal job status) • In general we can’t hope to get the same effect sizes for a causal relationship across moderators (different UTOS), but we could at least hope that the *direction* of the causal impact is consistent, e.g., more X leads to more Y • The only place a consistent effect size (at least consistent with respect to small or large) would be really important would be in medical applications or in research that has clear implications for policy • Random sampling can reduce some but not all threats to external validity; purposive sampling of groups known to be diverse on unit or setting variables can improve the ability to generalize about causal relations
So Many Threats to Validity • Given all of these possible ways in which your experimental design can be compromised, why do experiments? Because they provide the closest match to the process of causal reasoning (cause before effect, plausible alternatives accounted for) • Most of the threats happen infrequently • If they happen frequently, they can usually be anticipated and controlled for, with physical controls or statistically • They may be discovered after the fact, but it may be possible to re-analyze data, collapse categories, etc., to deal with the problem
So Why do Experiments? • Studies may be extended to evaluate the role of moderator variables which might threaten external validity • External validity is the most obvious source of lay criticism but perhaps the last concern of an extended research program which must address the other validity issues first • Reporting of research should lay out the possible threats to the four types of validity and say how they were addressed. What wasn’t addressed goes under “limitations” • Your greatest attribute as an experimenter is your common sense and your everyday understanding of how the world works; gaps in this knowledge can be filled in by reading and communicating with others