Chapter 6: Basics of Experimentation • Experiment—A test designed to arrive at a causal explanation (Cook & Campbell, 1979) • Mill (1843)—Joint method of agreement and difference: causation can be inferred if some result, X, follows an event, A, if A and X vary together and it can be shown that event A produces result X • If A occurs, then so will X, and if A does not occur, then neither will X • If event B occurs, then X does not occur
Chapter 6 continued: • Tip-of-the-Tongue (TOT) example: X = correct resolution of the TOT state, A = presenting letter initials, B = repeating the question, C = presenting a picture of the celebrity • Subjects were instructed to name celebrities, and 10.5 instances per subject resulted in TOT states (the name of the celebrity was on the “tip of the subject’s tongue,” but they could not actually recall it) • Subjects showed significantly better resolution of TOT states with letter initials as a cue compared to either repeating the question cue or presenting a picture of the celebrity • This suggests that memory for celebrities is coded using letter-level orthographic information rather than visual or “data warehouse” related information codes • However, we are not told whether conditions B or C differed significantly from chance—if they are above chance, then this suggests that this type of coding occurs, but is less common than orthographic coding • Also, in so-called TOT states, it could have been that the unresolved cases were actually due to subjects not knowing the name of the celebrity
Chapter 6 continued: • Joint Method of Agreement and Difference continued: • Note that in the real world of science, A does not always produce X, and the absence of event A does not always fail to produce X (because science is inductive, or probabilistic rather than deductive) • Thus, the inductive version of Mill’s joint method of agreement and difference is that Event A (presenting letter initials) produces significantly more resolution of X than event B (repeating the question) or event C (presenting a picture of the celebrity) • So, “more” is defined by statistical significance • Statistical significance tests whether two (or more) means differ even when we consider error variance (noise) • This is why we refer to statistics as the “language of science”
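To make "significantly more" concrete, here is a minimal sketch of one way to test whether two condition means differ beyond noise: an exact permutation test. This is an illustration, not the analysis used in the TOT study. The function name and the scores are hypothetical, and a t-test or ANOVA would be the more usual choice in practice.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact one-sided permutation test.

    Returns the proportion of all possible relabelings of the pooled
    scores whose mean difference (A minus B) is at least as large as
    the difference actually observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = mean(group_a) - mean(group_b)
    total = sum(pooled)
    at_least_as_large = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = a_sum / n_a - (total - a_sum) / n_b
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_large += 1
        n_splits += 1
    return at_least_as_large / n_splits

# Hypothetical percent-resolved scores, invented for illustration only.
initials_cue = [62, 58, 71, 66, 69, 60]   # event A
repeat_cue = [41, 47, 39, 52, 44, 50]     # event B

p = permutation_p_value(initials_cue, repeat_cue)
print(f"one-sided p = {p:.4f}")
```

A small p says that a difference this large would rarely arise from error variance alone, which is the inductive sense in which A "produces more" X than B.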
Chapter 6 continued: • In many experiments in psychology, you compare an experimental condition against a neutral baseline (e.g., repeating a question in our TOT example—although if this question was coded as a contextual cue with the memory for the celebrity’s name, then this would not have been a neutral baseline!) • The experimental condition (presenting celebrity’s initials) should show a significantly larger effect on the DV (percent recall of the celebrity’s name) than the control condition(s) • Experimental control is central to an experiment because it allows the production of a comparison by controlling the occurrence or nonoccurrence of a variable (while holding all other possible causes constant so they cannot affect the outcome) • Control has three components: • Comparison (the control condition is used as a comparison) • Production (levels or values of the IV can be produced) • Constancy (the experimental setting can be controlled by holding certain aspects constant)
Chapter 6 continued: • Advantages of Experimentation: • Using animal models can sometimes save money and is considered to be more ethical by many • E.g., cosmetics are frequently tested on rabbits • But this has been very controversial! • Experimentation has more control than ex post facto research in which levels of the IV are selected after the fact (selected rather than manipulated with control) • E.g., research on the health consequences of smoking has been almost entirely correlational • At trial, the tobacco lobby actually argued in its defense that cigarette smokers were more likely to develop lung cancer than non-smokers because smokers were more neurotic—and that it was really the higher levels of neuroticism that were causing the cancer risk! • One cannot rule this out because neuroticism was not controlled
Chapter 6 continued: • Variables in Experimentation: • IVs—are manipulated by the experimenter because they are hypothesized to cause changes on the DV • Failure to find an effect of the IV on the DV is termed a “null result” • This can be due to either a lack of an effect, an invalid manipulation, or a lack of statistical power • DV—the performance variable observed and recorded by the experimenter. A good DV (e.g., RT or accuracy) should be reliable and should not be overly sensitive to floor or ceiling effects • Floor effect—when it is impossible to do any worse on a task because you are already at the bottom • Ceiling effect—when it is impossible to improve because you already are at perfect performance • CV—potential IVs that are held constant during an experiment • This is usually because one can only manipulate a small number of variables (usually five or fewer) in any given experiment • If a potential IV that is not manipulated is not controlled, then it can become a confounded variable
Chapter 6 continued: • Review four examples of experiments from the text
Chapter 6 continued: • More than one IV: A typical experiment will manipulate between 2 and 4 IVs • This is done because it is more efficient—experimental control is typically superior with multiple IVs, and the results can be generalized across a group of IVs rather than just a single IV • Multiple IVs also allow a researcher to examine both main effects (an effect of just one IV in isolation) and interactions (when the effects produced by one IV are not the same across the levels of a second IV) • Interactions allow us to examine joint effects of multiple IVs and add increased precision • An interaction takes precedence over main effects
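The difference between a main effect and an interaction can be shown with simple arithmetic on cell means. This is a minimal sketch for a hypothetical 2 x 2 design. The factor labels (a1/a2, b1/b2) and the RT values are invented for illustration, and a real analysis would test these contrasts with ANOVA rather than just compute them.

```python
from statistics import mean

# Hypothetical cell means (RT in ms) for a 2 x 2 factorial design.
cells = {
    ("a1", "b1"): 500, ("a1", "b2"): 520,
    ("a2", "b1"): 540, ("a2", "b2"): 640,
}

def marginal_mean(factor_index, level):
    """Mean of all cells at one level of one IV, collapsed across the other IV."""
    return mean(v for key, v in cells.items() if key[factor_index] == level)

# Main effects: difference between marginal means of a single IV.
main_effect_a = marginal_mean(0, "a2") - marginal_mean(0, "a1")
main_effect_b = marginal_mean(1, "b2") - marginal_mean(1, "b1")

# Interaction: does the effect of B differ across the levels of A?
effect_of_b_at_a1 = cells[("a1", "b2")] - cells[("a1", "b1")]
effect_of_b_at_a2 = cells[("a2", "b2")] - cells[("a2", "b1")]
interaction = effect_of_b_at_a2 - effect_of_b_at_a1

print(main_effect_a, main_effect_b, interaction)
```

Here the effect of B is 20 ms at a1 but 100 ms at a2, so the main effect of B (an average of the two) describes neither level of A well. That is the sense in which an interaction takes precedence over main effects.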
Chapter 6 continued: • More than one DV: we analyze just one DV at a time in univariate statistics • If we truly wish to analyze two or more DVs at a time, this is a multivariate statistical technique (such as MANOVA) • In a MANOVA, we form a composite DV from multiple DVs • But MANOVAs do not tell us whether the pattern of effects is consistent across DVs • We can use correlations across trial blocks, diffusion models, or entropy/RT models to look at the overall pattern of results across multiple DVs (e.g., RT and errors) • However, these techniques are complicated • Consequently, most experiments in psychology use a single DV and are analyzed using ANOVA
Chapter 6 continued: • Possible sources of experimental error: • Demand characteristics or reactivity—Hawthorne effect (Homans, 1965) • Deception can be used to prevent demand characteristics • Because subjects do not know what is being tested, they cannot be biased through reactivity • However, if an experimenter uses deception, they typically need to debrief participants after the study
Chapter 6 continued: • External validity of the research procedure: • Representativeness of subjects—the ability to generalize across different participant populations • Are rats really representative of humans? • E.g., rats’ basal ganglia system is probably different from that of humans • Variable representativeness—the ability to generalize across different experimental manipulations • E.g., the relationship of background noise to studying efficiency (do noise and music both impair performance?) • Setting representativeness—the representativeness of the experimental setting (or ecological validity) • Realism is not the same as generalizability, though
Chapter 7: Validity and Reliability in Psychological Research • Validity—the truth of an observation • Types of Validity: • Predictive validity—checking the truth of an observation by comparing it to another criterion that is thought to measure the same thing • We will use SAT I as an example • Criterion—another measurement of behavior that serves as a standard for the measurement in question (e.g., ACT, college freshman GPA) • In predictive validity, the relation between two scores is typically assessed by a statistic termed the correlation coefficient (e.g., Pearson’s product-moment correlation coefficient) • The better the prediction of the observation (e.g., SAT I score predicting college freshman GPA), the greater the predictive validity of the predictor score • However, predictive validity does not define a measure or construct • E.g., We cannot assume that a person with a higher SAT I score than another person is smarter than the other person because predictive validity does not allow us to do this unless our criterion is that sort of measurement • E.g., an intelligence test score rather than freshman GPA
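As a minimal sketch of how predictive validity is quantified, the following computes Pearson's product-moment correlation between a predictor and a criterion. The SAT and GPA values are hypothetical, invented for illustration only, and the function name is an assumption rather than anything from the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictor (SAT I) and criterion (freshman GPA) scores.
sat_scores = [1010, 1150, 1230, 1340, 1420, 1500]
freshman_gpa = [2.4, 2.9, 2.7, 3.3, 3.6, 3.5]

r = pearson_r(sat_scores, freshman_gpa)
print(f"r = {r:.2f}")
```

The closer r is to 1 (or to -1 for an inverse relation), the better the predictor tracks the criterion. As the slide notes, a high r says nothing about what construct either score measures.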
Chapter 7 continued: • Types of Validity continued: • Construct Validity—the degree to which the independent and dependent variables accurately reflect or measure what they are intended to measure (Cook & Campbell, 1979; Judd et al., 1991)—really, are the names accurate? • In our Stroop experiment from Chapter 1, did our tasks really reflect reading and does scan really reflect reading performance? • Counting the number of digits in a row is probably not a good measure of reading • Extraneous Variables—confounding variables that may be a source of invalidity can threaten construct validity • Reading aloud requires speech production processes that are not required in reading, and Tasks 2 and 3 required counting, which is not the same as reading • Katz et al. (1990) have also claimed that the SAT I is not construct valid • Freedle and Kostin (1994) found that SAT test takers did use the passages to respond, so they found some construct validity • Reactivity and Random Error • Subjects could have been afraid of looking like a poor reader on a Stroop task • Some subjects could have been tested with a second hand on a watch, and others could have been timed with a chronograph (a stopwatch); this could have led to random error in timing precision
Chapter 7 continued: • Construct Validity continued: • We can improve construct validity by using an operational definition (a recipe for specifying how a construct, such as reading, is produced and measured) • This is because operational definitions allow the conditions that produce the concept to be measured and defined • In our Stroop example, reading is reduced to the independent variables that produce it and the dependent variable(s) that are used to measure it • Protocols—the specification of how the measurement and procedures are to be undertaken—also reduce the risk of construct invalidity because they reduce the likelihood of random error • Circular reasoning is a potential problem when using an operational definition, though • We need to have a method of defining something independent of how we measure it • Some have claimed that the concept of processing resources suffers from this problem (circularity, Navon, 1979) • However, we can use PRP and coactivation methods
Chapter 7 continued: • Construct validity is usually demonstrated using psychometric methods: • Factor Analysis • A data reduction method in which you determine which measured variables are related to which constructs • You can also show that your constructs from factor analysis are related in the manner predicted by your theory using causal analysis: • Path analysis or Structural Equation Modeling (or covariance structure modeling) • Item Response Theory (or IRT)—a mathematical technique for determining which items on a test measure the same construct
Chapter 7 continued: • Types of Validity continued: • External Validity—the extent to which we can generalize our research results (in this setting measured on this sample) to other settings and other populations or samples • To demonstrate external validity, we need to replicate our initial results in other settings and on different people • Hypertension, gender and race • Our experimental setting needs to be representative of the typical situation (e.g., reading is typically tested using a reading out-loud method in elementary school even though this is not an accurate measure of reading comprehension—it is more of a measure of speech perception or production)
Chapter 7 continued: • Internal Validity—when we can make causal statements about the relationship between IVs and DVs • Specifically, when your IV causes an effect on the DV (are we testing what we claim to be testing—although this can be similar to construct validity) • Without internal validity, we are not doing science • Internal validity requires good experimental control • This is at odds with external validity because as we increase experimental control, our results become less generalizable! • A major challenge in science is to maximize both internal and external validity even though they are negatively correlated • We can do this by keeping good experimental control and by comparing our results across multiple samples with large sample sizes
Chapter 7 continued: • Reliability—the consistency of behavioral measures • Types of Reliability: • Test-Retest: giving the same test twice in succession over a short time interval in order to measure consistency (using a correlation coefficient to measure consistency) • Parallel Forms: giving two versions of a test on two testing occasions to determine whether they result in consistent scores • Split-Half: dividing test items from a single test into two arbitrary groups and correlating the resulting scores after administration—if the correlation is sufficiently high, then test reliability is confirmed (this also establishes the equivalency of your test items)
Chapter 7 continued: • Statistical Reliability and Validity: • Statistical Reliability determines whether findings are the result of chance • If not, we assume that the results occur because of the effect of the IV(s) on the DV • Statistical validity is whether we are measuring what we claim to be measuring • We sample subjects from a population when we use inferential statistics • The sample size needs to be large enough in order for the sample to estimate its underlying population(s) • The Central Limit Theorem states that the sampling distribution of the mean approaches a normal distribution as sample size increases, largely regardless of the shape of the population; as a rule of thumb, samples of 20-30 are usually treated as large enough for this approximation • Increasing sample size typically increases statistical power—the ability to reject a false null hypothesis • Random Sampling increases the likelihood that the obtained sample accurately estimates the characteristics of the population that it is attempting to estimate
Chapter 7 continued: • Types of errors in inferential statistics: • Type I error—rejecting a true null hypothesis (the probability of doing so is the alpha level) • Type II error—failing to reject a false null hypothesis • Power = 1 - probability of a Type II error
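The relation between alpha, Type II error, power, and sample size can be worked through numerically. This is a minimal sketch that assumes the simplest textbook case: a one-sided, one-sample z-test with a known standard deviation. The function name and the effect size of 0.5 SD are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test (known SD assumed).

    effect_size is the true mean shift in standard-deviation units.
    """
    z = NormalDist()                      # standard normal distribution
    critical = z.inv_cdf(1 - alpha)       # cutoff for rejecting the null
    # Probability that the test statistic exceeds the cutoff when the
    # true shift is effect_size * sqrt(n) standard errors.
    return 1 - z.cdf(critical - effect_size * sqrt(n))

for n in (10, 20, 30):
    power = z_test_power(0.5, n)
    type_ii = 1 - power                   # probability of a Type II error
    print(f"n = {n:2d}  power = {power:.3f}  Type II = {type_ii:.3f}")
```

When the null hypothesis is actually true (an effect size of 0), the same function returns alpha, the Type I error rate. With a real effect, power rises as n grows.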
Chapter 7 continued: • Measurement procedures—a systematic method of assigning numbers or names to objects and their attributes: • Nominal scale—labels with no quantitative significance • Ordinal scale—measures differences in magnitude (ranks), but not how much • Interval scale—measures differences in magnitude as well as how much • Ratio scale—same as interval except with an added absolute zero—so you can determine how many times greater something is
Chapter 8: Experimental Design • Internal Validity in Experiments—by using experimental control, the researcher can rule out confounding variables as a cause, so that one’s results really do reflect an effect of the IV on the DV • Internal validity requires careful selection of IVs and a well thought-out experimental design • You can never “fix” design problems at the analysis stage • Although you can use “statistical control” through the use of ANCOVA • In this chapter, we will discuss two main types of experimental designs—between subjects and within subjects • Between Subjects—independent groups of subjects receive the different levels of the IV • Within Subjects—all subjects receive all levels of the IV
Chapter 8 continued: • Crossed versus Nested designs: • A crossed design is a factorial design—there are no empty cells • A nested design is when subjects receive different levels of the IV • You have empty cells • You only use this design in special situations because you cannot interpret interactions • You might use a nested design to save money when only certain cells are of interest • A placebo design is nested • But you can treat this as a crossed design—see example
Chapter 8 continued: • Why experimental design matters and how even with the best of intentions you must be very careful in interpreting your results • Example of a between subjects design: Executive Monkeys—Brady (1958) found that “executive monkeys” in control of when they were shocked were more likely to develop ulcers than “blue-collar” monkeys that had no control over when they were shocked • However, Weiss (1968, 1971) found that executive rats that had control over when an electric shock was administered were less likely to develop ulcers than helpless rats that had no control over when electric shocks were administered (this is an example of learned helplessness) • The discrepancy occurred because Brady assigned high response-rate monkeys (“neurotic monkeys”) to the executive monkey condition—an individual difference • The moral of the story is that individual differences are ALWAYS confounded with IV effects in a between subjects design • With random assignment and large sample sizes, hopefully this would not occur • Also, replication is essential to catch these errant results
Chapter 8 continued: • To see if the animal results of the effect of unavoidable stress on performance generalize to humans, many researchers look at the effect of different stressors on cortisol (a stress hormone) • Meta-analysis (Dickerson & Kemeny, 2004) has shown that cognitive tasks (e.g., mental arithmetic) and public speaking cause cortisol levels to rise, but that noise exposure and emotion induction do not • So, stress does increase cortisol levels in humans as well as non-human animals • Chronically high levels of cortisol can cause cell death in the hippocampus and amygdala
Chapter 8 continued: • Example of a within subjects design: experiments with LSD • Jarrard (1963) looked at the dose response curve of LSD on rats (by looking at the rate of lever pressing with salt water being the control) • Jarrard counterbalanced the order of the doses (.05, .10, .20, .40, .80 milligrams per kilogram of body weight) • Jarrard found that the two smallest doses slightly enhanced the response rate but that the two highest doses severely impaired response rate • One problem with drug studies using a within subjects design is that carryover effects may be so strong that counterbalancing cannot correct them • So, you may need to use a between subjects design for this type of study
Chapter 8 continued: • Types of Experimental Designs: • Between-subjects—a conservative design that prevents carryover effects (by using different subjects for different levels of the IV) • However, this design is extremely susceptible to individual differences confounding results • In order to minimize individual differences confounding one’s results, one can use matching (important subject characteristics are matched in the various treatment conditions) and randomization (random assignment) • However, subject attrition can make matching difficult, although newer mixed models can be used to analyze the data with missing data points
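Matching and randomization can be combined in one procedure: rank subjects on the matching variable, pair neighbors, and then randomly assign one member of each pair to each condition. This is a minimal sketch under that assumption. The function name, subject IDs, and scores are hypothetical, and with an odd number of subjects the last one is simply left unassigned.

```python
import random

def matched_random_assignment(scores, seed=None):
    """Pair subjects with adjacent scores, then randomize within each pair.

    scores maps subject ID -> value on the matching variable.
    Returns (treatment_group, control_group) as lists of subject IDs.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)   # subject IDs, low to high score
    treatment, control = [], []
    # Step through the ranking two subjects at a time.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)                     # random assignment within the pair
        treatment.append(pair[0])
        control.append(pair[1])
    return treatment, control

# Hypothetical pretest scores used as the matching variable.
pretest = {"s1": 90, "s2": 110, "s3": 95, "s4": 130, "s5": 105, "s6": 125}
treatment_group, control_group = matched_random_assignment(pretest, seed=1)
print(treatment_group, control_group)
```

Because each pair contributes one subject to each group, the groups are similar on the matched characteristic, while the coin flip within pairs keeps the experimenter from choosing who gets the treatment.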
Chapter 8 continued: • Within subjects designs—are more efficient and control for individual differences (because each subject serves as their own control), but this design is sensitive to carryover effects (e.g., practice and fatigue effects) • Counterbalancing can help minimize carryover effects • Factorial counterbalancing is the most comprehensive method, although it may not be practical (go over factorials) • A Latin square design can simplify counterbalancing • Balanced Latin square: for an even number of conditions, the first row is 1, 2, n, 3, n-1, 4, n-2 …, and each later row adds 1 to every entry of the row above (n wraps around to 1) • For an odd number of conditions, two squares are needed (the one above and a second reversed square) • Another option is to use a modular counterbalancing scheme (n-1)
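The balanced Latin square rule can be sketched in a few lines. The first row follows the 1, 2, n, 3, n-1, … pattern from the slide. The rule for the remaining rows is the usual construction and is an assumption here, since the slide does not spell it out: each later row adds 1 to the row above, wrapping n back to 1. The function name is hypothetical.

```python
def balanced_latin_square(n):
    """Return condition orders (one list per subject) for n conditions.

    For odd n, the reversed square is appended, giving 2n orders.
    """
    # Build the first row: 1, 2, n, 3, n-1, 4, n-2, ...
    first = [1]
    low, high = 2, n
    take_low = True
    while len(first) < n:
        if take_low:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1
        take_low = not take_low
    # Each later row adds 1 to the previous row, wrapping n back to 1.
    rows = [[(c - 1 + shift) % n + 1 for c in first] for shift in range(n)]
    if n % 2 == 1:
        # Odd n needs a second, reversed square to balance order effects.
        rows += [list(reversed(row)) for row in rows]
    return rows

for order in balanced_latin_square(4):
    print(order)
```

With four conditions this gives four orders in which every condition appears once in each serial position and immediately follows every other condition exactly once. Full factorial counterbalancing would need 4! = 24 orders.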
Chapter 8 continued: • Control condition—in its simplest form, a group that does not receive a treatment • It is a baseline against which some other variable in the experiment can be compared • Mixed designs—when you have at least one between subjects variable and at least one within subjects variable • Choosing an experimental design: Issues to consider • Carryover effects in a within subjects design • Individual differences in a between subjects design
Chapter 9: Complex Designs • Factorial Designs—we use these complex designs because real-world information processing is complex and requires multiple IVs • As we begin to understand a phenomenon better, the complexity of our experiments tends to increase from single IVs to many IVs
Chapter 9 continued: • Main Effects and Interactions • Color (hue), Case Type, and Spacing in visual word recognition • If we use a fast achromatic (magnocellular) channel and two slower parvocellular (one chromatic and one achromatic) channels to recognize words on a lexical decision task, then we should see a different pattern of hue effects for consistent lowercase versus mixed-case presentation • If this effect is due to the channel dynamics mentioned above, it should be relatively consistent for spaced and unspaced words
Chapter 9 continued: • Main Effects: when we look at the effect of one IV collapsed across all other IVs • In our case, a main effect for case type • Interaction: when the effects of one IV depend upon the levels of another IV • In our case, a Case Type x Hue Type interaction, but no three-way interaction • Because interactions typically qualify main effects, if you have an interaction, then you need to make sure that the interaction does not attenuate or eliminate your main effects • Control in between subjects designs: random-groups and matched-groups designs
Chapter 9 continued: • Complex within subjects designs (such as our example above) • Block randomization or complete randomization (we used complete randomization) • Mixed designs: when you have at least one between subjects variable and at least one within subjects variable
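The difference between block and complete randomization is easiest to see in code. This is a minimal sketch of block randomization, in which every block presents each condition exactly once in a fresh random order. The function name and condition labels are hypothetical. Complete randomization would instead shuffle the entire trial list at once, so runs of the same condition are possible.

```python
import random

def block_randomize(conditions, n_blocks, seed=None):
    """Return a trial order made of n_blocks blocks.

    Each block contains every condition once, independently shuffled.
    """
    rng = random.Random(seed)
    order = []
    for _ in range(n_blocks):
        block = list(conditions)   # copy so the input is not modified
        rng.shuffle(block)
        order.extend(block)
    return order

# Hypothetical condition labels for a within subjects experiment.
conditions = ["lower-red", "lower-gray", "mixed-red", "mixed-gray"]
trial_order = block_randomize(conditions, n_blocks=3, seed=42)
print(trial_order)
```

Block randomization guarantees that conditions are spread evenly across the session, which helps keep practice and fatigue effects from piling up on any one condition.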
Chapter 10: Small-n Experimentation • Small-n Experimentation—when very few subjects are studied intensively • This design framework is often used for non-human animal research (because of the expense and logistic complexity of testing large numbers of, say, rats) • It is also used for special populations of humans that are difficult to obtain (e.g., progeria cases) and for clinical populations (e.g., ADHD children or patients with peculiar brain damage, such as H.M., who can be difficult to obtain because of privacy issues), although this is of questionable validity because there are costs to using this approach • The main cost in using small-n designs is that you are using descriptive statistics • That is, you are not obtaining a sample and assuming that this sample estimates a population—you are simply describing this small group of individuals • You must be very careful in assuming that these results generalize to a population as a whole
Chapter 10 continued: • Types of Small-n designs: • The AB design—A represents a baseline condition before, say, therapy (the control condition of the IV) and B represents the condition after the introduction of therapy (the treatment condition of the IV) • This design is used in some research—although it is a very poor design because changes that occur during treatment in the B phase may be caused by uncontrolled variables that are confounded with therapy and are the real cause of the change on the DV • E.g., development (the passage of time during which we mature)
Chapter 10 continued: • Small-n designs continued: • ABA (or ABAB) or reversal design—a design in which there are interspersed baseline (A) and treatment (B) phases of manipulation: • This design rules out maturation, so it is superior to an AB design
Chapter 10 continued: • Small-n designs continued: • Before an ABA design is used, usually researchers use a “functional analysis of behavior” (a la Skinner) approach to better understand the phenomenon of interest • In a functional analysis of behavior study, one attempts to discover the antecedents and consequences of a given behavior in considerable detail • Functional relationship—the functional relation between what leads to the target behavior and the consequences that it produces • The Contingency—the relationship between the behavior and the outcome (includes reinforcement, punishment, escape, and avoidance) • The Discriminative Stimulus—the controlling stimulus or stimuli that cause the unwanted behavior
Chapter 10 continued: • Small-n designs continued: • Alternating Treatments Design (ACABCBCB) (A = no treatment, B = cookie with no dye, C = cookie with dye that potentially causes hyperactivity)—more than one IV is used, and there may be numerous baseline periods • This design extends the ABAB design because it allows multiple IVs (or at least a control condition) • However, it does not work well when carryover effects are present with some or all of the IVs (but the same holds for an ABAB design) • In the Rose (1978) study, the two hyperactive girls showed no difference between A and B, but they did show more hyperactivity in the C condition—suggesting that the dye caused the increase in hyperactivity rather than the cookie, per se
Chapter 10 continued: • Small-n designs continued: • The Multiple-Baseline design—can be used with a between-subjects design to overcome carryover effects—several behaviors (within subjects) or several people (between subjects) receive baseline periods of varying length, after which the IV is introduced (you can also look across settings) • One behavior is allowed to occur under baseline conditions (e.g., crying) and then the experimenter switches to the treatment • The timing of the onset of treatment is varied across subjects—if the treatment consistently is associated with a change in behavior (when other potential causes are held constant), then it is assumed that the treatment caused the change in behavior • You can use this same approach with the same subjects across different behaviors with different timing of the onset of the treatment—if the treatment for crying reduces crying but it does not affect fighting (and vice versa), then you can assume that your treatment caused the change in behavior
Chapter 10 continued: • Small-n designs continued: • The changing-criterion design—a method in which the researcher changes the behavior necessary to obtain reinforcement • If the behavior changes systematically with the changing criteria (e.g., you have to ride 5 miles instead of 3 miles on a stationary bike to get bonus points), then one assumes that the reinforcement criteria are producing the change • That is, if the experimenter removes the incentive completely (e.g., points that can be used to buy video games if 11-year-old boys exercise, DeLuca & Holborn, 1992), the level of exercise decreases back to zero • Note that if people base behavior on just external rewards, then this is not a good situation (e.g., if children do not clean their rooms unless they get paid to do so, then their houses will be in bad shape when they are adults)
Chapter 10 continued: • Clinical Psychology—case studies: typically based on one patient with a disorder (e.g., H.M.) • Nissen et al.’s (1988) study of a dissociative identity disorder (multiple personality disorder) patient using memory tasks • This study is interesting because the explicit task showed an effect but the implicit task did not—contrary to the authors’ interpretation, it could have been that the DID patient was simply not able to catch the automatic processing but could catch the processing of which he was consciously aware