RMTD 404 Lecture 9
Correlation & Regression
In the two-independent-samples t-test, the difference between the group means on the dependent variable is itself a measure of association: if the group means differ, then there is a relationship between the independent and dependent variables. Still, it is useful to distinguish between statistical tests that evaluate differences and statistical tests that evaluate association. We have already seen that difference statistics and association statistics provide similar types of information (recall that d and its association-based counterpart are both effect size indicators). This chapter deals with two general topics:
• Correlation: Statistics that depict the strength of a relationship between two variables.
• Regression: Applications of correlational statistics to prediction problems.
We typically begin depicting relationships between two variables using a scatterplot, a bivariate plot that depicts three key characteristics of the relationship between two variables.
• Strength: How closely related are the two variables? (Weak vs strong)
• Direction: Which values of each variable are associated with values of the other variable? (Positive vs negative)
• Shape: What is the general structure of the relationship? (Linear vs non-linear)
By convention, when we want to use one variable as a predictor of the other variable (called the criterion variable), we put the predictor on the X axis and the criterion on the Y axis. (NOTE: we can still think of these as IVs and DVs.)
Scatterplots
[Figure: four scatterplot panels: Strong Positive Linear, Strong Negative Linear, Weak Positive Linear, Strong Curvilinear]
When we want to show that a certain function can describe the relationship and that that function is useful as a predictor of the Y variable based on X, we include a regression line—the line that best fits the observed data.
Covariance
An important concept relating to correlation is the covariance of two variables (covXY or sXY; notice that the latter designates the covariance as a measure of dispersion between X and Y). The covariance reflects the degree to which two variables vary together, or covary.
$$s_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$
Notice that the equation for the covariance is very similar to the equation for the variance, only the covariance has two variables. When the covariance is a large, positive number, Y tends to be above its mean when X is above its mean (and below when X is below), so the products of the deviations are mostly positive. When the covariance is a large, negative number, Y tends to be above its mean when X is below its mean, so the products are mostly negative. When the covariance is near zero, there is no clear pattern like this: positive products tend to be cancelled by negative products.
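As a rough illustration of the covariance computation, the sketch below uses a small made-up data set (the vectors `x` and `y` are hypothetical, not the lecture's reading and math scores) and assumes the usual sample formula with N - 1 in the denominator, which is what R's `cov()` uses.

```r
# Hypothetical toy data for illustration (not the lecture's reading/math scores)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Covariance by hand: sum of cross-products of deviations over N - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

cov_by_hand      # hand computation
cov(x, y)        # built-in function, should agree
cov(10 * x, y)   # rescaling X by 10 rescales the covariance by 10
```

The last line shows why the raw covariance is hard to interpret: it changes whenever the measurement scale changes.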
Correlation Coefficient
A problem occurs with the covariance: it is in raw score units, so we cannot tell much about whether the covariance is large enough to be important by looking at it. It changes as the scales used to measure the variables change. The solution to this problem is to standardize the statistic by dividing by a measure of the spread of the relevant distributions. Thus, the correlation coefficient is defined as:
$$r = \frac{s_{XY}}{s_X s_Y}$$
Because |sXY| cannot exceed sXsY, the limit of |r| is 1.00. Hence, one way to interpret r is as a measure of the degree to which the covariance reaches its maximum possible value: when the two variables covary as much as they possibly could, the absolute value of the correlation coefficient equals 1.00.
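A minimal sketch of this standardization, again on hypothetical toy vectors `x` and `y` (assumed for illustration only): dividing the covariance by the product of the two standard deviations gives the same value as R's `cor()`, and that value no longer depends on the measurement scale.

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation as a standardized covariance
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

r_by_hand        # hand computation
cor(x, y)        # built-in Pearson correlation, should agree
cor(10 * x, y)   # unchanged when X is rescaled
```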
Here is an example from SPSS. Variable 1: Reading scores. Variable 2: Math scores. Covariance sRM = 68.069; r = .714. A strong, positive linear relationship is found between the two variables.
In R:
    plot(bytxrstd, bytxmstd)
    cor(bytxrstd, bytxmstd, use="pairwise.complete")
    [1] 0.7142943
Adjusted r
Although we usually report the value of the Pearson Product Moment correlation, there is a problem with that statistic: it is a biased estimate of the population correlation (ρ, rho). When the number of observations is small, the sample correlation will tend to be larger in absolute value than the population correlation. To compensate for this problem, we can compute the adjusted correlation coefficient (radj), which is a relatively unbiased estimate of the population correlation coefficient.
$$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}$$
For our reading & math example (r rounded to .71, N = 270), the computation gives us the following. Because our sample size is large, the correction does little to change the value of r.
    radj <- sqrt(1 - (((1 - .71^2) * 269) / (268)))
    radj
    [1] 0.7086957
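The adjustment can be wrapped in a small helper so the effect of sample size is easy to see. This is only a sketch: `adj_r` is a hypothetical function name, and the formula is assumed to be the one implied by the R line in the lecture, sqrt(1 - (1 - r^2)(N - 1)/(N - 2)).

```r
# Hypothetical helper: adjusted correlation for a sample r and sample size n
# (assumes the quantity under the square root is non-negative)
adj_r <- function(r, n) {
  sqrt(1 - (1 - r^2) * (n - 1) / (n - 2))
}

adj_r(0.71, 270)   # lecture example: r rounded to .71, N = 270
adj_r(0.71, 10)    # same r with a hypothetical small sample shrinks more
```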
Hypothesis Testing for r
Occasionally, we want to perform a hypothesis test on r. That is, we want to determine the probability of observing an r at least as extreme as ours if the sample came from a population with a hypothesized null parameter (ρ). The most common use of hypothesis testing relating to r is the test of the null hypothesis, Ho: ρ = 0. When N is large and ρ = 0, the sampling distribution of r is approximately normal in shape and is centered on 0. The following t statistic can be formed, which is distributed as t with N - 2 degrees of freedom.
$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$
Returning to the reading and math scores example, our r was .714. We can test the null hypothesis that this correlation came from a population in which reading scores and math scores are unrelated. Since we had 270 participants in our study, the t statistic would be computed as follows.
$$t = \frac{.714\sqrt{268}}{\sqrt{1 - .714^2}} \approx 16.69$$
With 268 degrees of freedom, the p-value for this statistic is less than .0001 (the critical value for a two-tailed test at α = .05 is approximately 1.97). Hence, we would reject the null hypothesis and conclude that there is a non-zero correlation between reading and math scores in the population.
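The test can be reproduced from the summary values alone. The sketch below assumes the standard statistic t = r·sqrt(N - 2)/sqrt(1 - r^2) and uses only the lecture's r = .714 and N = 270; the variable names are illustrative.

```r
# Summary values from the lecture example
r <- 0.714
n <- 270
df <- n - 2

# t statistic for H0: rho = 0
t_r <- r * sqrt(df) / sqrt(1 - r^2)

# Two-tailed p-value and critical value at alpha = .05
p_two_tailed <- 2 * pt(-abs(t_r), df)
t_crit <- qt(0.975, df)

t_r
p_two_tailed
t_crit
```

With the raw scores available, `cor.test()` reports the same t statistic along with a confidence interval for the correlation.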
Regression Line So, the correlation coefficient tells us the strength of the relationship between two variables. If this relationship is strong, then we can use knowledge about the values of one variable to predict the values of the other variable. Recall that the shape of the relationship being modeled by the correlation coefficient is linear. Hence, r describes the degree to which a straight line describes the values of the Y variable across the range of X values. If the absolute value of r is close to 1, then the observed Y points all lie close to the best-fitting line. As a result, we can use the best-fitting line to predict what the values of the Y variable will be for any given value of X. To make such a prediction, we obviously need to know how to create the best-fitting (a.k.a. regression) line.
Recall that the equation for a line takes the form Y = bX + a. We will put a hat (^) over the Y to indicate that, for our purposes, we are using the linear equation to estimate Y. Note that all elements of this equation are estimated (from data).
$$\hat{Y}_i = bX_i + a$$
where
• $\hat{Y}_i$ is the value of Y predicted by the linear model for the ith value of X.
• b is the estimated slope of the regression line (the difference in $\hat{Y}$ associated with a one-unit difference in X).
• a is the estimated intercept (the value of $\hat{Y}$ when X = 0).
• Xi is the ith value of the predictor variable.
Our task is to identify the values of a and b that produce the best-fitting linear function. That is, we use the observed data to identify the values of a and b that minimize the distances between the observed values (Y) and the predicted values ($\hat{Y}$). But we can't simply minimize the difference between Y and $\hat{Y}$ (called the residual from the linear model) because any line that passes through ($\bar{X}$, $\bar{Y}$) on the coordinate plane will result in an average residual equal to 0. To solve this problem, we take the same approach used in the computation of the variance: we find the values of a and b that minimize the sum of the squared residuals. This solution is called the least squares solution.
Fortunately, the least squares solution is simple to find, given statistics that you already know how to compute.
$$b = \frac{s_{XY}}{s_X^2} \qquad a = \bar{Y} - b\bar{X}$$
These values minimize $\sum (Y - \hat{Y})^2$ (the sum of the squared residuals).
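To see the least squares solution at work, the sketch below computes the slope as the covariance over the variance of X and the intercept as the mean of Y minus the slope times the mean of X, then checks the result against `lm()`. The data are hypothetical toy values, not the lecture's data set.

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Least squares slope and intercept from descriptive statistics
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)

# The same estimates from R's linear model function
fit <- lm(y ~ x)
coef(fit)

# Sum of squared residuals that the solution minimizes
sum((y - (a + b * x))^2)
```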
As an example, consider the data on reading and math. We are interested in determining whether reading scores would be useful in predicting math scores. We got the following descriptive statistics for the two variables using the student dataset. From this, we can easily compute sXY. And from this, we can compute a and b.
The predicted model (regression line) can be written as:
$$\hat{Y} = 0.76X + 12.32$$
So what do this regression line and its parameters tell us?
• The intercept tells us that the best prediction of math score when reading score = 0 equals 12.32.
• The slope tells us that, for every 1-point increase in reading scores, we predict an increase in math of .76 points.
• The covariance and correlation (as well as the slope) tell us that the relationship between reading and math is positive. That is, reading score tends to increase when math score increases.
• Note, however, that it is incorrect to ascribe a causal relationship between reading and math in this context.
Standard Errors
An important question in regression is "does the regression line do a good job of explaining the observed data?" One way to address this question is to state how much confidence we have in the predicted value. That is, how precise is our prediction? Let's begin with a simple case to demonstrate how we already know a good bit about estimate precision. Suppose that you don't know anything about reading scores and you want to estimate what a student's math score is. The only things that you know are the mean math score and the standard deviation of math scores. If you were to randomly choose a student from that population, what is the best guess at what that student's math score will be?
When we know nothing, our best guess of the value of the criterion variable is the mean of the criterion variable. And you would expect 95% of the observed cases to lie within 1.96 standard deviations of the mean (assuming the population of math scores is normally distributed). This statement would not change if you had knowledge about a student’s reading score and r = 0 described the relationship between reading and math in the population. That is, if there is no relationship between X and Y (knowing something about X gives you no additional information about Y), then your best predicted value of Y is the mean of Y and the precision of that estimate is dictated by the standard deviation of Y (or the sample variance of Y).
Let's simplify the equation for the sample variance so that we can extend the equation to describe the precision of predictions when r does not equal 0 on the next slide. That is, let's denote the numerator of the equation as the sum of squares (i.e., the sum of the squared deviations of the observed values around their mean, SStotal). It is called the "total" sum of squares because it accounts for the entire difference between the observations. The denominator is the degrees of freedom.
$$s_Y^2 = \frac{\sum (Y - \bar{Y})^2}{N - 1} = \frac{SS_{total}}{df}$$
Now, let's look at the standard error (the standard deviation of a sampling distribution) in regression. We can define the standard error of the estimate (sY.X) as the standard deviation of observed Y values around the value of Y predicted based on our knowledge of X. That is, sY.X is the standard deviation of Y around $\hat{Y}$ for any given value of X. Computationally, sY.X is defined as the square root of the sum of the squared residuals over their degrees of freedom (called the residual because it is the deviation of the observations from the predictions), also known as the root mean square error (RMSE).
$$s_{Y \cdot X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{df}}$$
Obviously, from its form, we can see that this is a standard deviation. The only difference is that we are computing deviations of observed Ys from predicted values ($\hat{Y}$) rather than from the mean. That is, the standard error is the standard deviation of the conditional distributions of Y at each level of X. The square of the standard error of estimate (a variance) is also known as the residual variance or the error variance.
So, the square of the standard error of the estimate is another special type of variance (like the square of the standard error of the mean): a sum of squared residuals divided by its degrees of freedom.
$$s_{Y \cdot X}^2 = \frac{\sum (Y - \hat{Y})^2}{N - 2} = \frac{SS_{residual}}{df}$$
We can also state the error or residual variance as a function of the correlation coefficient (r).
$$s_{Y \cdot X}^2 = s_Y^2 (1 - r^2) \frac{N - 1}{N - 2}$$
Note that when the sample size is very large, $(N - 1)/(N - 2)$ approaches one, so the equation simplifies to $s_{Y \cdot X} \approx s_Y \sqrt{1 - r^2}$. Hence, we can estimate the value of the standard error of the estimate if we know the correlation coefficient between X and Y and the standard deviation of Y.
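The two routes to the standard error of the estimate can be compared numerically. The sketch below uses hypothetical toy data and assumes the relationship s²(Y.X) = s²(Y)(1 - r²)(N - 1)/(N - 2); with only five cases the large-sample shortcut is visibly off, which is the point of the (N - 1)/(N - 2) term.

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
fit <- lm(y ~ x)
r <- cor(x, y)

# Route 1: square root of SS_residual over its degrees of freedom
se_from_resid <- sqrt(sum(resid(fit)^2) / (n - 2))

# Route 2: from the standard deviation of Y and the correlation
se_from_r <- sd(y) * sqrt((1 - r^2) * (n - 1) / (n - 2))

# Large-sample shortcut (drops the (N - 1)/(N - 2) term)
se_large_n <- sd(y) * sqrt(1 - r^2)

c(se_from_resid, se_from_r, se_large_n)
summary(fit)$sigma   # R's "Residual standard error"
```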
For our previous example, we can obtain the error variance for the regression of math scores on reading scores; the computation yields a standard error of the estimate of 7.06 points. Hence, if we were to assume that the observed Y values were normally distributed around the prediction line, we would expect 95% of the observed Y values to lie within ±1.96 × 7.06 points (that is, within 13.84 points) of the predicted values.
The first part of the SPSS output gives us the error variance as the mean square residual. The SSresidual is designated as the SSerror and MSerror = SSerror/dferror. Also, the root mean square residual equals the standard error of the estimate. In this case, it equals 7.07 (the square root of the mean square residual). Hence, we know that the observed math scores typically fall about 7.07 points from the prediction line. Again, this differs only slightly from what we computed by hand due to rounding error.
This points out an important thing about the relationship between r and the standard error of the estimate: the standard error equals the standard deviation of Y multiplied by a function of r, the strength of the relationship between X and Y.
• When r = 0, the value of the standard error of the estimate equals the value of the standard deviation of Y (sY.X = sY).
• When r = 1, the value of the standard error of the estimate equals 0 (sY.X = 0).
• When r is between 0 and 1, $\sqrt{1 - r^2}$ is (for large samples) the relative size of the standard error of the estimate versus the standard deviation of the sample.
r²
Another important way of stating the relationship between variability in the sample and variability in the estimates as a function of r relates the sums of squares for these two measures.
$$SS_{residual} = SS_{total}(1 - r^2)$$
This can be solved for r² so that
$$r^2 = \frac{SS_{total} - SS_{residual}}{SS_{total}}$$
Hence, r² is a proportion: the proportion of observed variability that is explained by the relationship between X and Y. When the residual variability is small, r² approaches 1. When the residual variability is large, r² approaches 0.
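A short check of this proportion on hypothetical toy data: the sketch computes the total and residual sums of squares directly and compares the resulting ratio with the squared correlation and with the R-squared that `summary()` reports.

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)   # variability of Y around its mean
ss_resid <- sum(resid(fit)^2)      # variability of Y around the line

# Proportion of variability explained by the linear relationship
r2 <- (ss_total - ss_resid) / ss_total

r2
cor(x, y)^2              # squared correlation, should agree
summary(fit)$r.squared   # "Multiple R-squared" in the lm summary
```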
In our reading and math example, about 51% (.714² ≈ .51) of the variability in math is explained by its relationship with reading scores. Another way of stating this is that 51% of the variance in math scores is accounted for by its covariance with reading scores.
r²
To summarize, there are several sources of variance in the regression equation between X and Y.
• Variability of the predictor variable
• Variability of the outcome
• Variability explained by the model
• Variability not explained by the model
Note: Given that r² tells you the amount of variability in the Y variable that is explained by its relationship with the X variable, we can use r² as an effect size indicator. In fact, Cohen (1988) suggests the following values as a rule of thumb: small = .01, medium = .09, large = .25.
In SPSS, r and r² are reported, along with the adjusted r².
In R:
    summary(result1)

    Call:
    lm(formula = bytxmstd ~ bytxrstd)

    Residuals:
          Min        1Q    Median        3Q       Max
    -17.01118  -4.98226   0.07116   4.73102  16.43518

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 12.17730    2.40505   5.063 7.68e-07 ***
    bytxrstd     0.76211    0.04561  16.709  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7.07 on 268 degrees of freedom
      (30 observations deleted due to missingness)
    Multiple R-squared:  0.5102, Adjusted R-squared:  0.5084
    F-statistic: 279.2 on 1 and 268 DF,  p-value: < 2.2e-16
Residual Analysis
One way to evaluate the quality of the regression model is to examine the residuals. By examining residuals, you'll be able to tell whether your linear model is appropriate for the data, the degree to which the data conform to the linear model, and which specific cases do not fit the linear model. The plot that is often used when performing a residual analysis is a scatter plot of the residuals against the predicted values. This scatter plot (aka a residual plot) allows us to identify patterns in the residuals. The scatter of the residuals should be of roughly equal magnitude across the range of the predicted values, and the points should be densest near a residual of zero.
Residual Analysis
Cohen & Cohen suggest that non-random patterns within residual plots may indicate specific types of problems in the data. These plots show the predicted values ($\hat{Y}$) on the X-axis and the residuals ($Y - \hat{Y}$) on the Y-axis.
• Curvilinear = Non-linear relationship
• Outliers = Special cases or data errors
• Heteroscedasticity = Invalid inferences
• Slope = Omitted time IV
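A residual plot of this kind takes only a few lines in R. The sketch below simulates hypothetical data from a linear model with constant error variance (the seed, sample size, and coefficients are arbitrary assumptions), so the plot should show no pattern; curvature, a funnel shape, or isolated points would suggest the problems listed above.

```r
# Hypothetical simulated data: linear relationship with constant error variance
set.seed(404)
x <- seq(1, 20)
y <- 2 + 0.5 * x + rnorm(length(x))

fit <- lm(y ~ x)

# Residual plot: predicted values on the X-axis, residuals on the Y-axis
plot(fitted(fit), resid(fit),
     xlab = "Predicted Y", ylab = "Residual",
     main = "Residual plot")
abline(h = 0, lty = 2)   # reference line at a residual of zero
```

By construction, least squares residuals sum to zero and are uncorrelated with the predicted values, so any visible trend in this plot reflects non-linearity rather than a simple tilt.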
Hypothesis Testing of b
The test of the null hypothesis that ρ = 0 is the same as the test of the null hypothesis that β (the parameter estimated by the slope, b) equals 0. That is, if there is no relationship between X and Y, then the correlation equals zero. This is the same thing as the slope of the regression line equaling zero, which also translates to a situation in which the mean of Y is the best predictor and the standard error of the estimate equals the standard deviation of Y. Recall that the t-test compares an observed statistic to a hypothetical (null) parameter value, dividing the difference by the standard error. Hence, we need a standard error for b.
$$s_b = \frac{s_{Y \cdot X}}{s_X \sqrt{N - 1}}$$
Hence, the t-test for comparing b to a null parameter (typically set to 0) is:
$$t = \frac{b - \beta}{s_b}$$
where t has N - 2 df.
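For the math and reading data, the standard error of b is computed below, using b = 0.76, sY.X = 7.06 (the standard error of the estimate computed by hand earlier), and sX = 9.47. With 270 participants, sb is computed as follows.
$$s_b = \frac{7.06}{9.47\sqrt{269}} \approx 0.0455$$
So the t statistic to test the null hypothesis that β = 0 equals
$$t = \frac{0.76}{0.0455} \approx 16.7$$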
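The same computation can be checked against the coefficient table that `lm()` produces. This sketch uses hypothetical toy data and assumes the standard error formula s(b) = s(Y.X) / (s(X)·sqrt(N - 1)).

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
fit <- lm(y ~ x)

b <- unname(coef(fit)["x"])    # estimated slope
s_yx <- summary(fit)$sigma     # standard error of the estimate

# Standard error of the slope and the t statistic for H0: beta = 0
s_b <- s_yx / (sd(x) * sqrt(n - 1))
t_b <- b / s_b

c(s_b = s_b, t = t_b)
summary(fit)$coefficients      # compare with the "x" row
```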
Hypothesis Testing for b in SPSS
The SPSS output contains this statistical test. In this example, we see that the slope for math scores regressed on reading scores is statistically significant (t = 16.71, p < .0001). That is, the slope is non-zero.
In R:
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 12.17730    2.40505   5.063 7.68e-07 ***
    bytxrstd     0.76211    0.04561  16.709  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Similarly, we can create a confidence interval around the observed b using the following extension of the t-test formula.
$$CI = b \pm t_{\alpha/2}\, s_b$$
For the reading & math example, the two-tailed critical t for α = .05 with 268 degrees of freedom (N - 2) is approximately 1.97, so the confidence interval would be computed as follows.
$$0.76 \pm 1.97(0.046) \approx (0.67, 0.85)$$
Given that 0 does not fall within these limits, we would reject the null hypothesis that b came from a population in which β = 0.
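A minimal sketch of the interval computation, on hypothetical toy data: the slope plus or minus the critical t (with N - 2 degrees of freedom) times the standard error of the slope, compared with R's `confint()`.

```r
# Hypothetical toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
fit <- lm(y ~ x)

b <- unname(coef(fit)["x"])
s_b <- unname(summary(fit)$coefficients["x", "Std. Error"])

# 95% confidence interval: b +/- critical t times the standard error of b
t_crit <- qt(0.975, df = n - 2)
ci <- c(lower = b - t_crit * s_b, upper = b + t_crit * s_b)

ci
confint(fit, "x", level = 0.95)   # built-in interval, should agree
```

With so few cases the interval is wide and includes 0, unlike the lecture's example with 270 participants.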
Assumptions In both correlation and regression, it is assumed that the relationship between X and Y is linear rather than curvilinear. Different procedures are available for modeling curvilinear data. If we simply wish to describe the relationship in the observed data or express the proportion of the variance of Y that is accounted for by its linear relationship with X, we need no additional assumptions.
However, inferential procedures for regression (i.e., issues relating to b and $\hat{Y}$) rely on two additional assumptions about the data being modeled.
• Homogeneity of variance in arrays: The residual variance of Y conditioned on X at each level of X is assumed to be equal. This is equivalent to the homogeneous variance assumption we made with the t-test. The observed variances do not have to be equal, but they have to be close enough. We can examine the residual plot to get a sense of whether this assumption holds.
• Normality of conditional arrays: The distribution of observed Y values around the predicted Y value at each level of X is assumed to be normally distributed. This is necessary because we use the standard normal distribution in testing hypotheses. Histograms and Q-Q plots can be used to check for normality.
If we wish to draw inferences about the correlation coefficient, on the other hand, we need to make only one assumption (albeit a rather demanding assumption): Bivariate Normality: If we wish to test hypotheses about r or establish confidence limits on r, we must assume that the joint distribution of Xs and Ys is normal.
Factors that Influence the Correlation
The correlation coefficient can be substantially affected by characteristics of the sample. Specifically, there are three potential problems that might lead to spuriously high or low correlation coefficients.
• Range Restriction: If the range of Xs or Ys is restricted (e.g., a ceiling or floor effect or omission of data from certain sections of the population or sample), the correlation statistic (r) will likely underestimate ρ (although it is possible that it will overestimate it, as when restriction of range eliminates a portion of a curvilinear relationship).
[Figure: scatterplots illustrating range restriction, r = .72 and r = .63]
• Heterogeneous Samples: A second problem, one that is more likely to make r an overestimate of ρ, arises when you compare two variables (X and Y), but there are large differences on one of these variables with respect to a third variable (Z). For example, suppose that we are interested in the relationship between comfort with technology and scores on a technology-based test. Also suppose that males and females exhibit very large differences in comfort with technology. The joint distribution of males and females could give us a false impression of the relationship between comfort and scores.
[Figure: scatterplots illustrating heterogeneous samples, r = .34, r = .50, r = .34]
• Outliers: Extreme values of Y or X can artificially inflate or deflate r as an estimate of ρ. In most cases, these outliers are substantively interesting cases or are error-laden cases.
[Figure: scatterplots illustrating outliers, r = .52, r = .36, r = .36, r = .01]
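Two of these influences are easy to demonstrate with made-up numbers. The sketch below is illustrative only: the simulated population correlation of .7, the seed, the cut at the median, and the single outlying case are all assumptions chosen for the demonstration, not values from the lecture's figures.

```r
# --- Range restriction (hypothetical simulated data) ---
set.seed(404)
n <- 2000
x_sim <- rnorm(n)
y_sim <- 0.7 * x_sim + sqrt(1 - 0.7^2) * rnorm(n)   # population rho = .7

r_full <- cor(x_sim, y_sim)

# Keep only cases in the upper half of X (a restricted range)
keep <- x_sim > median(x_sim)
r_restricted <- cor(x_sim[keep], y_sim[keep])

c(full = r_full, restricted = r_restricted)

# --- Outliers (hypothetical toy data) ---
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- c(2, 4, 5, 4, 5)
r_without <- cor(x_toy, y_toy)

# Add one extreme case far from the rest of the data
r_with <- cor(c(x_toy, 20), c(y_toy, 0))

c(without_outlier = r_without, with_outlier = r_with)
```

In this example a single extreme case is enough to reverse the sign of the correlation in the small sample, which is why scatterplots should be inspected before r is interpreted.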
Regression steps in SPSS:
• Analyze
• Regression
• Linear
• Input dependent (outcome) variable
• Input independent (predictor) variable
• Press OK
• The Model Summary table provides R² and adjusted R².
• The ANOVA table provides the SSR, SSE, MSE, F, and overall predictive capacity of the IV.
• The Coefficients table is most important: it provides the parameter estimates for b and a along with tests of their statistical significance.