Correlation and Simple Linear Regression PSY440 June 10, 2008
A few points of clarification • For the chi-squared test, the results are unreliable if the expected frequency in too many of your cells is too low. • A rule of thumb is that the minimum expected frequency should be 5 (i.e., no cells with expected counts less than 5). A more conservative rule recommended by some is a minimum expected frequency of 10. If your minimum is too low, you need a larger sample! The more categories you have the larger your sample must be. • SPSS will warn you if you have any cells with expected frequency less than 5.
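The expected-frequency rule of thumb above can be checked by hand. Here is a minimal Python sketch (the function name and the example contingency table are made up for illustration): expected counts come from row total × column total / grand total, and the rule flags any cell below 5.

```python
def expected_frequencies(table):
    """Expected cell counts for a chi-squared test of independence.

    table: list of rows of observed counts. The expected count for cell
    (i, j) is row_total[i] * col_total[j] / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# Hypothetical 2x3 table of observed counts
observed = [[12, 7, 9],
            [30, 22, 20]]

expected = expected_frequencies(observed)
min_expected = min(min(row) for row in expected)

# Rule of thumb from the slide: no cell with an expected count below 5
print(round(min_expected, 2), min_expected >= 5)
```

If the smallest expected count falls below 5 (or 10, by the more conservative rule), the remedy is a larger sample, as the slide notes.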
Regarding threats to internal validity • One of the strengths of well-designed single-subject research is the use of repeated observations during each phase. • Repeated observations during baseline and intervention (e.g., during an AB study) help rule out testing, instrumentation (somewhat), and regression. These effects would be unlikely to produce a marked change between experimental phases that is not also apparent during the repeated observations before and after the phase change.
Regarding histograms The difference between a histogram and a bar graph: in a histogram, the variable on the x axis (which represents the score on the variable being graphed, as opposed to the frequency of observations) is conceptualized as continuous, whereas a bar graph represents discrete categories along the x axis.
About the exam…. Exam on Thursday will cover material from the first three weeks of class (lectures 1-6, or everything through Chi-Squared tests). Emphasis of exam will be on generating results with computers (calculations by hand will not be emphasized), and interpreting the results. Exam questions will be based mainly on lecture material and modeled on previous active learning experiences (homework and in-class demonstrations and exercises). Knowledge of material on qualitative methods and experimental & single-subject design is expected.
Before we move on….. Any questions?
Today’s lecture and next homework Today’s lecture will cover correlation and simple (bivariate) regression. Homework based on today’s lecture will be distributed on Thursday and due on Tuesday (June 17).
Correlation • A correlation is the association between scores on two variables • age and coordination skills in children: as kids get older, their motor coordination tends to improve • price and quality: generally, the more expensive something is, the higher its quality
Correlation and Causality Correlational research • Correlation as a statistical procedure is generally used to measure the association between two (or more) continuous variables • Correlation as a kind of research design refers to observational studies in which there is no experimental manipulation.
Correlation and Causality Correlational research • Not all “correlational” (i.e., observational) research designs use correlation as the statistical procedure for analyzing the data (example: comparison of verbal abilities between boys and girls - an observational study - we don’t manipulate gender - but we would probably analyze mean differences with t-tests). • But: Virtually all of the inferential statistical methods (including t-tests, ANOVA, ANCOVA) covered in 440 can be represented in terms of correlational/regression models (the general linear model - we’ll talk more about this later). • Bottom line: Don’t confuse design with analytic strategy.
Correlation and Causality • Correlations (like other linear statistical models) describe relationships between variables, but DO NOT explain why the variables are related. Suppose that Dr. Steward finds that rates of spilled coffee and severity of plane turbulence are strongly positively correlated. One might argue that turbulence causes coffee spills; one might equally argue that spilling coffee causes turbulence.
Correlation and Causation Suppose that Dr. Cranium finds a positive correlation between head size and digit span (roughly the number of digits you can remember). One might argue that the bigger your head, the larger your digit span. One might instead argue that head size and digit span both increase with AGE (and that head size and digit span aren’t directly related).
Correlation and Causation Observational research and correlational statistical methods (including regression and path analysis) can be used to compare competing models of causation, to see which model fits the data best.
Relationships between variables • Properties of a statistical correlation • Form (linear or non-linear) • Direction (positive or negative) • Strength (none, weak, strong, perfect) • To examine this relationship you should: • Make a scatterplot - a picture of the relationship • Compute the Correlation Coefficient - a numerical description of the relationship
Graphing Correlations • Steps for making a scatterplot (scatter diagram) • Draw axes and assign variables to them • Determine range of values for each variable and mark on axes • Mark a dot for each person’s pair of scores
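The same three steps can be sketched in Python with matplotlib (assuming it is installed; the data pairs are the A-E points used on the following slides):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# One (X, Y) pair of scores per individual (A-E from the example)
scores = {"A": (6, 6), "B": (1, 2), "C": (5, 6), "D": (3, 4), "E": (3, 2)}

xs = [x for x, _ in scores.values()]
ys = [y for _, y in scores.values()]

fig, ax = plt.subplots()
ax.scatter(xs, ys)       # mark a dot for each person's pair of scores
ax.set_xlim(0, 7)        # axis ranges chosen to cover the observed values
ax.set_ylim(0, 7)
ax.set_xlabel("X")       # axes drawn and variables assigned to them
ax.set_ylabel("Y")
fig.savefig("scatterplot.png")
```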
Scatterplot • Plots one variable against the other • Each point corresponds to a different individual • Data for this example, as (X, Y) pairs: A (6, 6), B (1, 2), C (5, 6), D (3, 4), E (3, 2) • Imagine a line through the data points • Useful for “seeing” the relationship: Form, Direction, and Strength
Scatterplots with Excel and SPSS In SPSS: Graphs menu => Legacy Dialogs => Scatter/Dot => Simple Scatter. Click Define, and select which variable you want on the x axis and which on the y axis. In Excel: Insert menu => Chart => XY (Scatter). Specify whether the variables are arranged in rows or columns and select the cells with the relevant data.
Form • Linear • Non-linear
Direction • Positive: X & Y vary in the same direction (as X goes up, Y goes up); positive Pearson’s r • Negative: X & Y vary in opposite directions (as X goes up, Y goes down); negative Pearson’s r
Strength • The strength of the relationship • Spread around the line (note the axis scales) • Correlation coefficient will range from -1 to +1 • Zero means “no relationship”. • The farther the r is from zero, the stronger the relationship • In general when we talk about correlation coefficients: Correlation coefficient = Pearson’s product moment coefficient = Pearson’s r = r.
Strength • r = +1.0: “perfect positive corr.” (r² = 100%) • r = 0.0: “no relationship” (r² = 0%) • r = -1.0: “perfect negative corr.” (r² = 100%) • The farther r is from zero, the stronger the relationship
The Correlation Coefficient • Formulas for the correlation coefficient: • Conceptual formula: r = SP / √(SSX × SSY) • Common alternative (using z-scores): r = Σ(zX × zY) / N
Computing Pearson’s r (using SP)

• Step 1: SP (Sum of the Products)

X      Y      X - MX    Y - MY    (X - MX)(Y - MY)
6      6       2.4       2.0        4.8
1      2      -2.6      -2.0        5.2
5      6       1.4       2.0        2.8
3      4      -0.6       0.0        0.0
3      2      -0.6      -2.0        1.2
mean: 3.6    4.0       0.0       0.0       SP = 14.0

Quick check: the deviations in each column sum to 0.

• Step 2: SSX & SSY

SSX = Σ(X - MX)² = 5.76 + 6.76 + 1.96 + 0.36 + 0.36 = 15.20
SSY = Σ(Y - MY)² = 4.0 + 4.0 + 4.0 + 0.0 + 4.0 = 16.0

• Step 3: compute r

r = SP / √(SSX × SSY) = 14.0 / √(15.20 × 16.0) ≈ 0.89

From the scatterplot and r: • Appears linear • Positive relationship • Fairly strong relationship (.89 is far from 0, near +1)
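The three SP-based steps can be written out in a few lines of Python, using the example data from the slides:

```python
from math import sqrt

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mx = sum(X) / n          # 3.6
my = sum(Y) / n          # 4.0

# Step 1: SP, the sum of products of paired deviations
SP = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 14.0

# Step 2: sums of squared deviations
SSX = sum((x - mx) ** 2 for x in X)                   # 15.20
SSY = sum((y - my) ** 2 for y in Y)                   # 16.0

# Step 3: r = SP / sqrt(SSX * SSY)
r = SP / sqrt(SSX * SSY)
print(round(r, 3))   # ~.898, the .89 on the slide
```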
The Correlation Coefficient • Common alternative formula (using z-scores): r = Σ(zX × zY) / N
Computing Pearson’s r (using z-scores)

• Step 1: compute the standard deviation for X and Y (note: keep track of sample vs. population). For this example we will assume the data are from a population.

SSX = 15.20, so std dev of X = √(15.20 / 5) = 1.74
SSY = 16.0, so std dev of Y = √(16.0 / 5) = 1.79

• Step 2: compute z-scores (z = (score - mean) / std dev)

X      Y      zX       zY
6      6      1.38     1.1
1      2     -1.49    -1.1
5      6      0.8      1.1
3      4     -0.34     0.0
3      2     -0.34    -1.1

Quick check: each column of z-scores sums to 0.
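The z-score route gives the same r: standardize each variable with the population standard deviation, then average the products of paired z-scores (the formula’s Σ(zX × zY) / N, which Step 2 above leads up to). A short Python sketch:

```python
from math import sqrt

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

def z_scores(values):
    """Population z-scores: (score - mean) / population SD."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

zx = z_scores(X)   # ~1.38, -1.49, 0.80, -0.34, -0.34
zy = z_scores(Y)   # ~1.12, -1.12, 1.12, 0.00, -1.12

# r is the mean of the z-score products
r = sum(a * b for a, b in zip(zx, zy)) / n
print(round(r, 3))
```

This agrees with the SP method (r ≈ .89); the two formulas are algebraically equivalent when the same (population) standard deviations are used.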