Psychology 10 Analysis of Psychological Data April 30, 2014
The plan for today • Review correlation. • Inference for correlation. • Another example of correlation. • Introduction to regression. • Where do regression estimates come from? • Residuals and predicted values. • Understanding correlation in the context of regression.
Correlation and causation • Note that the existence of a correlation does not necessarily imply a causal relationship. • Example: fire damage and number of units responding to fire. • Example: rum consumption and religious activity.
Blood pressure (cont.) • We calculated that the correlation coefficient was .66. • Let’s try another example using the same data set.
Computing correlation • Computational formula for correlation: • r = SP / √(SSX × SSY), where • SP = ΣXY – (ΣX)(ΣY) / N, • SSX = ΣX² – (ΣX)² / N, • SSY = ΣY² – (ΣY)² / N.
Useful intermediate values • ΣXY = 93,045. • ΣX² = 51,493. • ΣY² = 172,985. • ΣX = 703. • ΣY = 1,311.
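As a check on the hand computation, the intermediate sums above can be combined in a few lines of Python. Note that N = 10 is an assumption here, inferred from ΣX = 703 and the pulse mean of 70.3 reported later in the lecture; the slide itself does not state the sample size.

```python
import math

# Raw sums from the slide; N = 10 is an inferred assumption, not given here.
N = 10
sum_xy = 93045.0
sum_x2 = 51493.0
sum_y2 = 172985.0
sum_x = 703.0
sum_y = 1311.0

# Computational formula: r = SP / sqrt(SSX * SSY)
sp = sum_xy - sum_x * sum_y / N     # SP  = sum XY - (sum X)(sum Y)/N
ssx = sum_x2 - sum_x**2 / N         # SSX = sum X^2 - (sum X)^2/N
ssy = sum_y2 - sum_y**2 / N         # SSY = sum Y^2 - (sum Y)^2/N
r = sp / math.sqrt(ssx * ssy)
print(round(r, 2))
```

The intermediate quantities SP and SSX computed this way match the values used in the regression slides below.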
Inference for correlation • We have calculated two correlations now that are related to our blood pressure data set: one large and one small. • We might wonder in each case whether the population correlation is different from zero. • H0 : ρ = 0. • Table 6 (df = N – 2).
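The slide's test works by looking up a critical value in Table 6. An equivalent check is to compute the test statistic t = r√(N – 2) / √(1 – r²) directly and compare it with a t table at df = N – 2. The sketch below uses r = .66 from the first example and assumes N = 10 (the sample size is inferred, not stated on this slide).

```python
import math

r = 0.66   # correlation from the blood-pressure example
N = 10     # assumed sample size (inferred from the summary sums)
df = N - 2

# Test statistic for H0: rho = 0
t = r * math.sqrt(df) / math.sqrt(1 - r**2)
print(df, round(t, 2))
```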
Assumptions for inference • The relationship must be linear. • The pairs of observations must be independent. • The variables must be bivariate normal. • (This last assumption can be relaxed, but it will be easier to discuss how when we talk about inference for regression.)
Simple linear regression • Recall that there were two questions about a linear relationship that might interest us: • How strong is the relationship? • What is the relationship? • Regression addresses the second question.
The equation of a line • Several weeks ago, we talked about linear transformations of the form Y = a + bX. • Such transformations were called linear because equations of that form define lines. • The first parameter, a, describes the value of Y when X equals zero. • The second parameter, b, describes the slope of the line (increase in Y for each unit increase in X).
Regression estimates • Estimating the regression line, then, amounts to estimating the intercept and slope. • This is very easy if we have already calculated the correlation. • b = SP / SSX. • a = MY – bMX.
Regression estimates (cont.) • For the regression in which pulse rate predicts systolic blood pressure, SP was 881.7 and SSX was 2,072.1, so the estimated slope is b = 881.7 / 2,072.1 = 0.4255104. • The means of pulse and blood pressure were 70.3 and 131.1, so the estimated intercept is a = 131.1 – 0.4255104 × 70.3 = 101.1866.
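The two formulas, applied to the slide's summary quantities, can be sketched in a few lines:

```python
# Least-squares estimates from the summary quantities on the slide.
sp = 881.7       # SP  (sum of cross-products)
ssx = 2072.1     # SSX (sum of squares for pulse, the predictor)
mean_x = 70.3    # mean pulse rate
mean_y = 131.1   # mean systolic blood pressure

b = sp / ssx               # slope:     b = SP / SSX
a = mean_y - b * mean_x    # intercept: a = MY - b * MX
print(round(b, 7), round(a, 4))
```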
Adding a regression line to a scatterplot • Choose two values of X near the left and right edges of the plot. • Use the regression equation to predict a Y for each of those Xs. • Plot those two points, and connect them with a line. • Example: for X = 50 and 90, the predicted Ys are 101.1866 + 0.4255 × 50 = 122.46 and 101.1866 + 0.4255 × 90 = 139.48.
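The two endpoint predictions follow directly from the estimated equation:

```python
# Predicted Y at two X values near the edges of the scatterplot.
a, b = 101.1866, 0.4255   # intercept and slope from the slides

def predict(x):
    """Predicted systolic blood pressure for a given pulse rate."""
    return a + b * x

# Endpoints for drawing the line:
print(round(predict(50), 2))   # near the left edge
print(round(predict(90), 2))   # near the right edge
```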
Important note about regression • The formula for correlation involves X and Y in exactly the same way. • Hence, the correlation of X and Y is the same as the correlation of Y and X. • The same is not true for regression. • We describe the equation Y = a + bX as “the regression of Y on X.” • This is not the same as the regression of X on Y.
Where do regression estimates come from? • It will be easier to understand the reason for these different roles of Y and X if we understand where regression estimates come from. • The criterion used for estimating the line is called “least squares.” • Imagine that we are considering a candidate line. We can calculate a predicted value for each case in our data set.
Where do regression estimates come from? (cont.) • For each of those cases, calculate the difference between the observed Y and the predicted Y. • If we square those differences to make them all positive, the “best fitting” line is the one that minimizes the sum of the squared differences. • This is called the “least squares criterion” or “ordinary least squares.”
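A small demonstration of the criterion, using a hypothetical five-point data set (not from the lecture): the line given by the formulas should have a smaller sum of squared residuals than any perturbed candidate line.

```python
# Hypothetical toy data (not from the lecture) to illustrate least squares.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
ssx = sum((x - mx) ** 2 for x in xs)
b = sp / ssx       # least-squares slope
a = my - b * mx    # least-squares intercept

def ssr(a_cand, b_cand):
    """Sum of squared residuals for the candidate line Y = a + bX."""
    return sum((y - (a_cand + b_cand * x)) ** 2 for x, y in zip(xs, ys))

# The least-squares line beats nearby candidate lines:
print(ssr(a, b) <= ssr(a + 0.5, b))  # True
print(ssr(a, b) <= ssr(a, b + 0.2))  # True
```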
Where do regression estimates come from? (cont.) • The formulas we use produce least squares estimates of the regression line. • The term “residual” is shorthand for the difference between an observed value of Y and its predicted value.
Where do regression estimates come from? (cont.) • Note that the line for the regression of Y on X minimizes residuals in the vertical direction on the scatterplot. • The line that minimizes residuals in the horizontal direction is the line for the regression of X on Y.
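The asymmetry can be checked numerically with the blood-pressure sums (again assuming N = 10, which is inferred rather than stated): the two slopes differ, and their product equals r².

```python
import math

# Sums from the pulse / blood-pressure slides; N = 10 is an assumption
# inferred from sum X = 703 and the reported pulse mean of 70.3.
N = 10
sum_x, sum_y = 703.0, 1311.0
sum_xy, sum_x2, sum_y2 = 93045.0, 51493.0, 172985.0

sp = sum_xy - sum_x * sum_y / N
ssx = sum_x2 - sum_x ** 2 / N
ssy = sum_y2 - sum_y ** 2 / N

b_y_on_x = sp / ssx   # slope for regressing Y on X (vertical residuals)
b_x_on_y = sp / ssy   # slope for regressing X on Y (horizontal residuals)
print(round(b_y_on_x, 4), round(b_x_on_y, 4))   # different slopes

# The product of the two slopes equals r squared:
r = sp / math.sqrt(ssx * ssy)
print(abs(b_y_on_x * b_x_on_y - r ** 2) < 1e-9)  # True
```

This is one way to see how correlation connects to regression: the two regression lines coincide only when r = ±1, and the product of their slopes is r².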
Example (cont.) • Correlation • Regression • Draw line on scatterplot
Useful values • Sum for midterm 1 = 2,363.5. • Sum for midterm 2 = 2,890.5. • Sum(midterm 1 squared) = 48,193.75. • Sum(midterm 2 squared) = 78,730.75. • Sum(midterm 1 * midterm 2) = 58,697.5. • N = 118.
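With these sums, the correlation and regression steps from the blood-pressure example can be repeated for the midterm data. The slide does not report the resulting values, so treat this sketch as a template for the hand computation rather than an answer key.

```python
import math

# Summary sums for the two midterms, from the slide.
N = 118
sum_x, sum_y = 2363.5, 2890.5        # midterm 1 (X), midterm 2 (Y)
sum_x2, sum_y2 = 48193.75, 78730.75
sum_xy = 58697.5

sp = sum_xy - sum_x * sum_y / N      # SP
ssx = sum_x2 - sum_x ** 2 / N        # SSX
ssy = sum_y2 - sum_y ** 2 / N        # SSY

r = sp / math.sqrt(ssx * ssy)        # correlation
b = sp / ssx                         # slope (midterm 2 regressed on midterm 1)
a = sum_y / N - b * sum_x / N        # intercept: a = MY - b * MX
print(round(r, 3), round(b, 3), round(a, 3))
```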