Regression
• Correlation and regression are closely related in use and in math.
• Correlation summarizes the relation between 2 variables.
• Regression is used to predict values of one variable from values of the other (e.g., SAT to predict GPA).
Basic Ideas (2)
• Sample value: Y = a + bX + e
• Intercept (a) – the value of Y where X = 0.
• Slope (b) – change in Y if X changes 1 unit. Rise over run.
• If error is removed, we have a predicted value for each person at X (the line): Y’ = a + bX
Suppose that, on average, houses are worth about $75 a square foot. Then the equation relating price to size would be Y’ = 0 + 75X. The predicted price for a 2,000 square foot house would be $150,000.
Linear Transformation
• 1 to 1 mapping of variables via a line
• Permissible operations are addition and multiplication (interval data):
  • Add a constant
  • Multiply by a constant
Linear Transformation (2)
• Centigrade to Fahrenheit
• Note 1 to 1 map
• Intercept?
• Slope?
[Figure: line plot of Degrees F (vertical axis) against Degrees C (horizontal axis), with the points (0 degrees C, 32 degrees F) and (100 degrees C, 212 degrees F) labeled]
Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32. Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212-32 = 180. Then 180/100 = 1.8 is the rise over run, which is the slope. Y = 32+1.8X. F = 32+1.8C.
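The slope and intercept of the Centigrade-to-Fahrenheit line can be recovered from the two labeled points. A minimal Python sketch follows; the function names `line_through` and `c_to_f` are illustrative and do not come from the slides.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # value of Y when X = 0
    return intercept, slope


# The two labeled points: (0 C, 32 F) and (100 C, 212 F).
a, b = line_through((0, 32), (100, 212))


def c_to_f(c):
    """Apply the linear transformation F = a + b*C."""
    return a + b * c


print(a, b)         # intercept and slope
print(c_to_f(37))   # 37 degrees C expressed in degrees F
```

Because the transformation only adds a constant and multiplies by a constant, every Centigrade value maps to exactly one Fahrenheit value.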
Regression Line (1) Basics
1. Passes through both means.
2. Passes close to the points. Note errors.
3. Described by an equation.
Regression Line (2) Slope
The equation for a line is Y = mX + b in algebra. In regression, the equation is usually written Y = a + bX, where Y is the DV (weight), X is the IV (height), a is the intercept (-327), and b is the slope (7.15).
The slope, b, indicates rise over run. It tells how many units Y changes for a 1 unit change in X. In our example, the slope is a bit over 7, so a change of 1 inch in height is expected to go with a change of a bit more than 7 pounds in weight.
Regression Line (3) Intercept
The Y intercept, a, tells where the line crosses the Y axis; it’s the value of Y when X is zero. The intercept is calculated by: a = M_Y - b*M_X (the mean of Y minus the slope times the mean of X).
Sometimes the intercept has meaning; sometimes not. It depends on the meaning of X = 0. In our example, the intercept is -327. This means that if a person were 0 inches tall, we would expect them to weigh -327 lbs. Nonsense. But if X were the number of smiles, then a would have meaning.
Correlation & Regression
Correlation & regression are closely related.
1. The correlation coefficient is the slope of the regression line if X and Y are measured as z scores. It is interpreted as the change in Y, in SD units, for a change of 1 SD in X.
2. For raw scores, the slope is: b = r*(SD_Y/SD_X). The slope for raw scores is the correlation times the ratio of the 2 standard deviations. (These SDs are computed with N-1, not N.)
In our example, the correlation was .96, so the slope can be found by b = .96*(33.95/4.54) = .96*7.45 = 7.15. Recall that a = M_Y - b*M_X. Our intercept is 150.7 - 7.15*66.8 ≈ -327.
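The raw-score slope and the intercept can be sketched in a few lines of Python from the summary statistics quoted on this slide. The variable names are illustrative. Because the quoted correlation, SDs, and means are rounded, the computed values land near, but not exactly on, the slide's 7.15 and -327.

```python
r = 0.96                      # correlation between height and weight
sd_x, sd_y = 4.54, 33.95      # SDs of height (X) and weight (Y), computed with N-1
mean_x, mean_y = 66.8, 150.7  # means of height and weight

b = r * (sd_y / sd_x)         # raw-score slope: correlation times ratio of SDs
a = mean_y - b * mean_x       # intercept: forces the line through both means

print(round(b, 2), round(a, 1))

# With z scores both SDs equal 1, so the slope reduces to r itself.
b_z = r * (1 / 1)
print(b_z)
```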
Correlation & Regression (2)
• The regression equation is used to make predictions.
• The formula to do so is just: Y’ = a + bX
• Suppose someone is 68 inches tall. Predicted weight is -327 + 7.15*68 = 159.2 pounds.
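A minimal sketch of the prediction step, assuming the intercept (-327) and slope (7.15) from the height and weight example. The function name `predict` and the extra heights are illustrative.

```python
def predict(x, a=-327.0, b=7.15):
    """Predicted weight in pounds (Y') for a height in inches (X)."""
    return a + b * x


for height in (60, 68, 72):
    print(height, round(predict(height), 1))
```

The same function works for any fitted line: pass a different `a` and `b` to predict from another regression equation.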
Review
• What is the slope? What does it tell or mean?
• What is the intercept? What does it tell or mean?
• How are the slope of the regression line and the correlation coefficient related?
• What is the main use of the regression line?
Test Questions
[Figure: four panels labeled A, B, C, and D]
• What is the approximate value of the intercept for Figure C?
  • 0
  • 10
  • 15
  • 20
Test Questions
• In a regression line, the equation used is typically Y = a + bX. What does the value a stand for?
  • independent variable
  • intercept
  • predicted value (DV)
  • slope
Regression of Weight on Height
Correlation (r) = .94. Regression equation: Y’ = -316.86 + 6.97X
Predicted Values & Errors

N         Ht     Wt       Y’       Error
1         61     105      108.19   -3.19
2         62     120      115.16   4.84
3         63     120      122.13   -2.13
4         65     160      136.06   23.94
5         65     120      136.06   -16.06
6         68     145      156.97   -11.97
7         69     175      163.94   11.06
8         70     160      170.91   -10.91
9         72     185      184.84   0.16
10        75     210      205.75   4.25
M         67     150      150.00   0.00
SD        4.57   33.99    31.85    11.89
Variance  20.89  1155.56  1014.37  141.32

The Y’ column is the linear part and the Error column is the residual (Y - Y’). Note that the mean of Y’ equals the mean of Y and the mean of the residuals is 0. Note that the variance of Y is V(Y’) + V(res).
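The table can be reproduced from the ten height and weight pairs with an ordinary least-squares fit. The following Python sketch uses only the standard library; the variable names are illustrative, and `statistics.variance` divides by N-1, which matches the SDs in the table.

```python
from statistics import mean, variance

ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]            # X: height (inches)
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]  # Y: weight (pounds)

mx, my = mean(ht), mean(wt)

# Least-squares slope and intercept.
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx

pred = [a + b * x for x in ht]            # Y': the linear part
err = [y - p for y, p in zip(wt, pred)]   # residuals: Y - Y'

print(round(a, 2), round(b, 2))                      # intercept and slope
print(round(mean(pred), 2), round(mean(err), 2))     # means of Y' and of the errors
print(round(variance(wt), 2),                        # V(Y) ...
      round(variance(pred) + variance(err), 2))      # ... equals V(Y') + V(res)
```

Computed from the unrounded predictions, the two variance components add up to the variance of Y exactly; the small gap between 1014.37 + 141.32 and 1155.56 in the table comes from rounding.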
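Error variance
The variance of the errors (residuals), Y - Y’. In our example, the Error column of the predicted-values table has a variance of 141.32. (Heiman’s notation for error is not standard.)
Standard error of the estimate – the square root of the error variance; roughly the average distance of the observed scores from the prediction. In our example, the Error column has an SD of 11.89.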
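As a rough check, the error variance and the standard error of the estimate can be computed from the Error column of the predicted-values table. The sketch below assumes the table's N-1 denominator. The N-2 version is not from the slides; it is shown only because regression software commonly reports it.

```python
from math import sqrt

# Error column (Y - Y') from the predicted-values table.
errors = [-3.19, 4.84, -2.13, 23.94, -16.06, -11.97, 11.06, -10.91, 0.16, 4.25]
n = len(errors)

# The errors average to about zero, so squaring them directly gives the
# sum of squared deviations.
ss_error = sum(e ** 2 for e in errors)

var_n1 = ss_error / (n - 1)   # N-1 denominator, as in the table
var_n2 = ss_error / (n - 2)   # N-2 denominator (assumption: not used in the slides)

print(round(var_n1, 2), round(sqrt(var_n1), 2))   # error variance, standard error
print(round(var_n2, 2), round(sqrt(var_n2), 2))   # slightly larger with N-2
```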
Variance Accounted for
The proportion of the variance of Y that is predicted from X: r-square = V(Y’)/V(Y). (Heiman’s notation for error is not standard.)
The basic idea is to try to maximize r-square, the variance accounted for. The closer this value is to 1.0, the more accurate the predictions will be.
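The variance accounted for can be worked out from the Variance row of the predicted-values table in two equivalent ways. This is a small Python sketch using the rounded table values; the variable names are illustrative.

```python
var_y = 1155.56      # variance of Y (weight)
var_pred = 1014.37   # variance of Y' (predicted weight)
var_err = 141.32     # variance of the errors

r2_from_pred = var_pred / var_y     # share of variance that is predicted
r2_from_err = 1 - var_err / var_y   # same quantity, via what is left over

print(round(r2_from_pred, 3), round(r2_from_err, 3))
print(round(r2_from_pred ** 0.5, 2))   # square root recovers r
```

Both routes give about .88, and the square root is about .94, the correlation reported for this example.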
Sample Exam Data from Previous Class

Exam 1   Exam 2
86.00    56.00
98.00    70.00
70.00    76.00
84.00    82.00
82.00    74.00
92.00    94.00
92.00    78.00
72.00    56.00
96.00    66.00
82.00    72.00

A sample of 10 scores from both exams. Assuming these are representative, what can you say about the exams? The students?
Scatterplot & Boxplots of 2 Exams
[Figure: scatterplot and boxplots for Exam 1 and Exam 2]
Scatterplot with means and regression line
Note that the correlation, r, is .42 and the squared correlation, R², is .177. R² is also the variance accounted for. We can predict a bit less than 20 percent of the variance in Exam 2 from Exam 1.
Predicted Scores
Predicted Exam 2 = 20.895 + .598*Exam 1
For example, if I got 85 on Exam 1, then my predicted score for Exam 2 is 20.895 + .598*85 = 71.73, or about 72 percent.