Learn how to make predictions using linear regression and understand coefficients of determination, residuals, and the limitations of correlation and extrapolation.
Linear Regression: Chapter 10, Part 2
Predictions with Scatterplots • Last Time: A scatterplot gives a picture of the relationship between two quantitative variables. • One variable is explanatory, and the other is the response. • Today: If we know the value of the explanatory variable, can we predict the value of the response variable?
The Regression Line • To make predictions, we’ll find a straight line that is the “best fit” for the points in the scatterplot. This is not so simple…
Regression Line in JMP • Start by making a scatterplot. • Red Triangle menu -> “Fit Line.” • The equation of the regression line appears under the “Linear Fit” group. • JMP uses column headings as variable names (instead of x and y). • Example from the Cars 1993 file: • MaxPrice = 2.3139014 + 1.1435971*MinPrice
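Outside of JMP, the same kind of line can be fit in a few lines of Python. A minimal sketch using scipy, with made-up stand-ins for the MinPrice and MaxPrice columns (the Cars 1993 data file is not reproduced here):

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical stand-ins for the MinPrice and MaxPrice columns ($1000s).
min_price = np.array([7.9, 11.4, 15.9, 24.4, 30.0])
max_price = np.array([13.2, 17.3, 19.0, 32.3, 36.2])

# Fit the least-squares line and report it in JMP's format.
fit = linregress(min_price, max_price)
print(f"MaxPrice = {fit.intercept:.4f} + {fit.slope:.4f}*MinPrice")
```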
Predicted Values • We use the equation of the regression line to make predictions about… • Individuals not in the original data set. • Later measurements of the same individuals. • Example: In 1994, a vehicle had a Min. Price of $15,000. Use the previous data to predict the Max. Price. • You can do this by hand from the equation: MaxPrice = 2.3139014 + 1.1435971*MinPrice • 2.3139014 + 1.1435971*(15) = 19.4678579, a predicted Max. Price of about $19,468 (prices in this data set are in thousands of dollars).
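The same arithmetic, scripted; the coefficients are copied from the JMP output above, and prices are in thousands of dollars:

```python
# Coefficients from the fitted line (JMP's "Linear Fit" output).
intercept = 2.3139014
slope = 1.1435971

min_price = 15  # $15,000, expressed in thousands of dollars
max_price = intercept + slope * min_price
print(f"Predicted Max Price: {max_price:.4f} (about ${max_price * 1000:,.0f})")
```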
Are the Predictions Useful? • For some pairs of variables, the regression line is much more useful for prediction than for others. Consider the following examples (from Cars 1993):
Coefficient of Determination • If the scatterplot is well-approximated by a straight line, the regression equation is more useful for making predictions. • Correlation is one measure of this. • The square of the correlation has a more intuitive meaning: What proportion of variation in the Response Variable is explained by variation in the Explanatory Variable? • JMP: “RSquare” under “Summary of Fit”
Coefficient of Determination • In predicting Max. Price from Min. Price, we had RSquare = 0.822202. • About 82% of the variation in Max. Price is explained by variation in Min. Price. • In predicting Highway MPG from Engine Size, we have RSquare = 0.392871. • Only 39% of the variation in Highway MPG is explained by variation in Engine Size.
Coefficient of Determination • RSquare takes values from 0 to 1. • For values close to 0, the regression line is not very useful for predictions. • For values close to 1, the regression line is more useful for making predictions. • RSquare makes no distinction between positive and negative association of variables.
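A quick way to see that RSquare is just the squared correlation; the data here are made-up stand-ins, not the actual Cars 1993 columns:

```python
import numpy as np

# Hypothetical explanatory (x) and response (y) values.
x = np.array([7.9, 11.4, 15.9, 24.4, 30.0])
y = np.array([13.2, 17.3, 19.0, 32.3, 36.2])

r = np.corrcoef(x, y)[0, 1]           # Pearson correlation
print(f"correlation r = {r:.4f}")
print(f"RSquare = r^2 = {r**2:.4f}")  # proportion of variation explained
```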
Residuals • For each individual in the data set, we can compute the difference (error) between the actual and predicted values of the response variable. This difference is called a residual: Residual = (actual value) – (predicted value) • In JMP: Click the red triangle by “Linear Fit” and select “Save Residuals” from the drop-down menu. You can also “Plot Residuals.”
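In code, the residuals fall out directly from the fitted line; the data are again illustrative stand-ins:

```python
import numpy as np

x = np.array([7.9, 11.4, 15.9, 24.4, 30.0])
y = np.array([13.2, 17.3, 19.0, 32.3, 36.2])

slope, intercept = np.polyfit(x, y, 1)  # straight-line (degree 1) fit
predicted = intercept + slope * x
residuals = y - predicted               # (actual value) - (predicted value)
print(np.round(residuals, 2))
```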
How does JMP find the Regression Line? • JMP uses the most popular method, Ordinary Least Squares (OLS). • To measure how a given line fits the data: • Compute all residuals, take the square of each. • Add up the results to get a “total error.” • The closer this total is to zero, the better the line fits the data. Choose the line with the smallest “total error.” • (Thankfully) JMP takes care of the details.
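A sketch of the OLS criterion itself: the “total error” is the sum of squared residuals, and the fitted line beats any other candidate line on that measure (data again hypothetical):

```python
import numpy as np

x = np.array([7.9, 11.4, 15.9, 24.4, 30.0])
y = np.array([13.2, 17.3, 19.0, 32.3, 36.2])

def total_squared_error(slope, intercept):
    """Sum of squared residuals for the line y = intercept + slope*x."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

slope, intercept = np.polyfit(x, y, 1)              # the OLS line
print(total_squared_error(slope, intercept))        # smallest possible total
print(total_squared_error(slope + 0.1, intercept))  # any other line does worse
```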
Limitations of Correlation and Linear Regression • Both describe linear relationships only. • Both are sensitive to outliers. • Beware of extrapolation: predicting outside the given range of the explanatory variable. • Beware of lurking variables: other factors that may explain a strong correlation. • Correlation does not imply causation!
Beware Extrapolation! • A child’s height was plotted against her age... • Can you predict her height at age 8 (96 months)? • Can you predict her height at age 30 (360 months)?
Beware Extrapolation! • Regression line: y = 71.95 + 0.383x (age x in months, height y in cm) • Height at 96 months? y = 108.7 cm (about 3' 7'') • Height at 360 months? y = 209.8 cm (about 6' 10'') • Height at birth (x = 0)? y = 71.95 cm (about 2' 4'') • The last two predictions are clearly unrealistic: very few adults are 6' 10'', and newborns average about 50 cm. The line describes growth only over the ages actually observed.
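The failure is easy to see by evaluating the line inside and far outside the observed ages:

```python
def height_cm(age_months):
    """The child-height regression line from the slide."""
    return 71.95 + 0.383 * age_months

print(height_cm(96))   # 108.7 cm (~3' 7''): plausible for an 8-year-old
print(height_cm(360))  # 209.8 cm (~6' 10''): almost certainly wrong at age 30
print(height_cm(0))    # 71.95 cm (~2' 4''): far too tall for a real newborn
```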
Beware Lurking Variables! • Although there may be a strong correlation (statistical relationship) between two variables, there might not be a direct practical (cause-and-effect) relationship. • A lurking variable is a third variable (not in the scatterplot) that might cause the apparent relationship between explanatory and response variables.
Example: Pizza vs. Subway Fare • A scatterplot of historical prices shows a strong correlation (0.9878) between the cost of: • A slice of pizza • Subway fare • Q: Does the price of pizza affect the price of the subway? • A: Almost certainly not. Both prices rise together over time; overall inflation is a lurking variable that explains the correlation.
Caution: Correlation Does Not Imply Causation • In a study of emergency services, it was noted that larger fires tend to have more firefighters present. • Suppose we used: • Explanatory Variable: Number of firefighters • Response Variable: Size of the fire • We would expect a strong correlation. • But it’s ludicrous to conclude that having more firefighters present causes the fire to be larger.