Learn how to fit a line to bivariate data using the method of least squares regression. Understand the meaning of slope, intercept, deviations, residuals, and the sum of squared differences. Evaluate the usefulness of the line through r² and residual plots.
Chapters 8, 9, 10: Least Squares Regression Line. Fitting a Line to Bivariate Data
Suppose there is a relationship between two numerical variables, with data (x1, y1), (x2, y2), …, (xn, yn). Let x be the amount spent on advertising and y be the amount of sales for the product during a given period. You might want to predict product sales for a month (y) when the amount spent on advertising is $10,000 (x). The letter y denotes the variable you want to predict, called the response variable. The other variable, denoted by x, is the explanatory variable.
Simplest Relationship • The simplest equation that describes the dependence of variable y on variable x is the linear equation y = b0 + b1x. • b1 is the slope: the amount by which y changes when x increases by 1 unit. • b0 is the y-intercept: where the line crosses the y-axis, that is, the value of y when x = 0.
[Figure: graph of the line y = b0 + b1x. The slope is b1 = rise/run, and the line crosses the y-axis at b0.]
How do you find an appropriate line for describing a bivariate data set? To assess the fit of a line, we look at how the points deviate vertically from it. Two candidate lines are shown: y = 4 + 2.5x and y = 10 + 2x. Looking only at the line y = 10 + 2x: the point (15, 44) has a deviation of +4, since the predicted value at x = 15 is 10 + 2(15) = 40. For the point (20, 45), the predicted value when x = 20 is 10 + 2(20) = 50, so the deviation is 45 - 50 = -5. A negative deviation means the observed point lies below the line. To assess the overall fit, we need a way to combine the n deviations into a single measure of fit.
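To make the arithmetic concrete, here is a minimal Python sketch (our own illustration, not from the slides) that reproduces the deviation of (20, 45) from the line y = 10 + 2x:

```python
# Deviation of an observed point from a candidate line y = b0 + b1*x.
b0, b1 = 10, 2
x_obs, y_obs = 20, 45

y_pred = b0 + b1 * x_obs      # predicted value: 10 + 2*20 = 50
deviation = y_obs - y_pred    # 45 - 50 = -5: the point lies below the line
print(y_pred, deviation)
```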
The Least Squares (Regression) Line: a good line is one that minimizes the sum of squared differences between the points and the line.
The Least Squares (Regression) Line: let us compare two lines through the points (1, 2), (2, 4), (3, 1.5), (4, 3.2). The first line predicts the values 1, 2, 3, 4 at x = 1, 2, 3, 4 (that is, y = x); the second line is horizontal at y = 2.5. For the first line, the sum of squared differences = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89. For the horizontal line, the sum of squared differences = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99. The smaller the sum of squared differences, the better the fit of the line to the data.
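A short Python sketch (ours; the helper name sse is our own) reproduces both sums of squared differences:

```python
# Compare two candidate lines by their sum of squared vertical deviations
# (smaller is better).
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sse(b0, b1):
    # Sum of squared differences between each observed y and b0 + b1*x.
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

print(sse(0, 1))    # line y = x:           approx. 7.89
print(sse(2.5, 0))  # horizontal y = 2.5:   approx. 3.99
```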
Criterion for choosing what line to draw: the method of least squares • The method of least squares chooses the line that makes the sum of squares of the residuals as small as possible • This line has slope b1 and intercept b0 that minimize Σ (yi - (b0 + b1xi))²; the closed-form solution is given below.
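For reference, this minimization has the standard closed-form solution (a well-known result, written out in LaTeX here):

$$b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x}.$$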
Scatterplot with least squares prediction line. Data (xi, yi): (3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6), (2.9, 4.6), (2, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)
Observed y vs. predicted y: the predicted y when x = 2.7 is ŷ = b0 + b1(2.7).
Car Weight, Fuel Consumption Example, cont. The data (xi, yi) above are car weight x (in 1000s of pounds) and fuel consumption y for n = 10 cars.
The least squares line always goes through (x̄, ȳ). For this data, (x̄, ȳ) = (2.9, 4.39).
Using the least squares line for prediction: what is the fuel consumption of a 3,000 lb car? (x = 3)
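A minimal end-to-end sketch in Python (our illustration), fitting the car data with the closed-form formulas above and then predicting at x = 3; the printed values are approximate:

```python
# Least squares fit of fuel consumption (y) on car weight in 1000s lb (x).
x = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
y = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

n = len(x)
xbar = sum(x) / n                       # 2.9
ybar = sum(y) / n                       # 4.39
Sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
Sxx = sum((a - xbar) ** 2 for a in x)

b1 = Sxy / Sxx                          # slope, approx. 1.64
b0 = ybar - b1 * xbar                   # intercept, approx. -0.36
print(f"yhat = {b0:.3f} + {b1:.3f} x")  # line passes through (2.9, 4.39)

# Prediction for a 3,000 lb car (x = 3): approx. 4.55.
print(f"prediction at x = 3: {b0 + b1 * 3:.2f}")
```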
Be Careful! Fuel consumption of a 500 lb car? (x = .5) Here x = .5 is outside the range of the x-data that we used to determine the least squares line, so the prediction is an extrapolation and should not be trusted.
Avoid GIGO! Evaluating the least squares line: • Create a scatterplot. Is the pattern approximately linear? • Calculate r², the square of the correlation coefficient • Examine the residual plot
r²: The Variation Accounted For • The square of the correlation coefficient r gives important information about the usefulness of the least squares line
r²: important information for evaluating the usefulness of the least squares line. Since -1 ≤ r ≤ 1, we have 0 ≤ r² ≤ 1. The square of the correlation coefficient, r², is the fraction of the variation in y that is explained by the least squares regression of y on x, that is, by differences in x.
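Continuing the Python sketch for the car data (ours), r and r² can be computed directly from the sums of squares:

```python
# r and r^2 for the car weight / fuel consumption data.
x = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
y = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
Sxx = sum((a - xbar) ** 2 for a in x)
Syy = sum((b - ybar) ** 2 for b in y)

r = Sxy / (Sxx * Syy) ** 0.5
print(r, r ** 2)   # approx. 0.977 and 0.954: about 95% of the variation
                   # in fuel consumption is explained by weight
```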
March Madness: let S(k) be the Sagarin rating of the kth-seeded team, and let Yij be the Vegas point spread between seed i and seed j, i < j. Here r² = .948: 94.8% of the variation in point spreads is explained by the variation in Sagarin ratings.
SAT scores result: r² = (-.86845)² = .7542. Approximately 75.4% of the variation in mean SAT math scores is explained by differences in the percent of seniors taking the SAT.
Residuals • residual = observed y - predicted y = y - ŷ • Properties of residuals • The residuals always sum to 0 (therefore the mean of the residuals is 0); see the numeric check below • The least squares line always goes through the point (x̄, ȳ)
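A quick numeric check of the first property on the car data (our sketch):

```python
# Least squares residuals sum to zero (up to floating point rounding).
x = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
y = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
print(sum(residuals))   # approx. 0
```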
[Figure: the residual shown graphically. ei = yi - ŷi is the vertical distance from the observed point (xi, yi) to the fitted line.]
Residual plots. A careful look at the residuals can reveal many potential problems. • A residual plot is a scatterplot of the (x, residual) pairs • Residuals can also be graphed against the predicted y-values • We make a scatterplot of the residuals in the hope of finding…NOTHING! • Isolated points or a pattern of points in the residual plot indicate potential problems
Car weight, fuel consumption, continued: plot the residuals against the weight (x).
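A matplotlib version of that residual plot (our sketch; any plotting library works):

```python
# Residual plot for the car data: weight (x) vs. residuals.
# A patternless band around zero suggests the linear model is adequate.
import matplotlib.pyplot as plt

x = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
y = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Weight (1000s of lb)")
plt.ylabel("Residual")
plt.show()
```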
Residuals: Sagarin Ratings and Point Spreads (values rounded to 3 decimals)

Yij    Predicted Yij   Residual        Yij    Predicted Yij   Residual
20     23.486          -3.486          25     23.589           1.411
24     21.372           2.628          18.5   18.344           0.156
18     13.967           4.033          10.5   12.859          -2.359
11     11.522          -0.522          11.5   10.951           0.549
6       5.774           0.226          4.5     2.598           1.902
8.5     7.614           0.886          5       6.631          -1.631
4       1.683           2.317          4       3.203           0.797
4       2.186           1.814          -3.5    0.095          -3.595
28     27.268           0.732          23     24.160          -1.160
16     15.533           0.467          20.5   21.246          -0.746
11.5   10.562           0.938          18     20.092          -2.092
12     10.116           1.884          10.5   11.625          -1.125
4       5.397          -1.397          9       6.837           2.163
7       6.837           0.163          7       5.980           1.020
-1.5    1.501          -3.001          2       3.283          -1.283
2       1.946           0.054          5       6.745          -1.745
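As a sanity check (our code, not part of the original slides), the 94.8% figure can be recovered from the tabulated spreads and residuals via r² = 1 - SSE/SST, an identity that holds for least squares fits:

```python
# Verify r^2 = 1 - SSE/SST for the point-spread regression.
y = [20, 24, 18, 11, 6, 8.5, 4, 4, 28, 16, 11.5, 12, 4, 7, -1.5, 2,
     25, 18.5, 10.5, 11.5, 4.5, 5, 4, -3.5, 23, 20.5, 18, 10.5, 9, 7, 2, 5]
e = [-3.486, 2.628, 4.033, -0.522, 0.226, 0.886, 2.317, 1.814,
     0.732, 0.467, 0.938, 1.884, -1.397, 0.163, -3.001, 0.054,
     1.411, 0.156, -2.359, 0.549, 1.902, -1.631, 0.797, -3.595,
     -1.160, -0.746, -2.092, -1.125, 2.163, 1.020, -1.283, -1.745]

n = len(y)
ybar = sum(y) / n
sse = sum(ei ** 2 for ei in e)             # unexplained (residual) variation
sst = sum((yi - ybar) ** 2 for yi in y)    # total variation in y
print(1 - sse / sst)                       # approx. 0.948
```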