Section 7.3 ~ Best-Fit Lines and Prediction Introduction to Probability and Statistics Ms. Young
Objective Sec. 7.3 • After this section you will become familiar with the concept of a best-fit line for a correlation, recognize when such lines have predictive value and when they may not, understand how the square of the correlation coefficient is related to the quality of the fit, and qualitatively understand the use of multiple regression.
Sec. 7.3 Line of Best-Fit • The best-fit line (or regression line) on a scatterplot is the line that lies closer to the data points than any other possible line • It can be used to make predictions based on existing data • The line of best-fit should have approximately the same number of points above it as below it, and it does not have to pass through the origin • The precise line of best-fit can be calculated by hand, but the calculation is tedious, so the line is often estimated “by eye” or found with a calculator
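The slides do not give the formula for the precise line. A minimal sketch, assuming the usual least-squares meaning of "closest" (the line that minimizes the sum of squared vertical distances to the points), is shown here. The function name `best_fit_line` and the data are hypothetical, made up only for illustration.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = best_fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

Note that the intercept here is not zero, which matches the point that the line does not have to pass through the origin.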
Sec. 7.3 Cautions in Making Predictions from Best-Fit Lines • Don’t expect a best-fit line to give a good prediction unless the correlation is strong and there are many data points • If the sample points lie very close to the best-fit line, the correlation is very strong and the prediction is more likely to be accurate • If the sample points lie away from the best-fit line by substantial amounts, the correlation is weak and predictions tend to be much less accurate
Sec. 7.3 Cautions in Making Predictions from Best-Fit Lines • Don’t use a best-fit line to make predictions beyond the bounds of the data points to which the line was fit • Ex. ~ The diagram below represents the relationship between candle length and burning time. The data were collected only for candles between 2 in. and 4 in. long. Using the line of best-fit to make a prediction for lengths far outside this range would most likely be inappropriate. • According to the line of best-fit, a candle with a length of 0 in. burns for 2 minutes, which is impossible
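One way to enforce this caution in practice is to refuse predictions outside the range of the fitted data. The sketch below is only an illustration: the function name `predict_within_range` and the slope of 10 minutes per inch are assumptions, while the 2-minute intercept and the 2 in. to 4 in. range come from the candle example.

```python
def predict_within_range(x, slope, intercept, x_min, x_max):
    """Predict y from a best-fit line, but only inside the fitted x range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the fitted range [{x_min}, {x_max}]; "
            "the best-fit line should not be trusted there"
        )
    return slope * x + intercept


# Hypothetical line: the slope is an assumption; the intercept (2 minutes)
# and the data range (2 in. to 4 in.) follow the candle example.
SLOPE, INTERCEPT = 10.0, 2.0
X_MIN, X_MAX = 2.0, 4.0

print(predict_within_range(3.0, SLOPE, INTERCEPT, X_MIN, X_MAX))

try:
    # A 0 in. candle is far outside the data, so the guard rejects it
    # instead of returning the nonsensical 2-minute burn time.
    predict_within_range(0.0, SLOPE, INTERCEPT, X_MIN, X_MAX)
except ValueError as err:
    print("refused:", err)
```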
Sec. 7.3 Cautions in Making Predictions from Best-Fit Lines • A best-fit line based on past data is not necessarily valid now and might not result in valid predictions of the future • Ex. ~ Economists studying historical data found a strong correlation between unemployment and the rate of inflation. According to this correlation, inflation should have risen dramatically in recent years when the unemployment rate fell below 6%. But inflation remained low, showing that the correlation from old data did not continue to hold. • Don’t make predictions about a population that is different from the population from which the sample data were drawn • Ex. ~ You cannot expect that the correlation between aspirin consumption and heart attacks in an experiment involving only men will also apply to women • Remember that a best-fit line is meaningless when there is no significant correlation or when the relationship is nonlinear • Ex. ~ There is no correlation between shoe size and IQ, so even though you can draw a line of best-fit, it is useless for drawing any conclusions
Sec. 7.3 Example 1 State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not. • You’ve found a best-fit line for a correlation between the number of hours per day that people exercise and the number of calories they consume each day. You’ve used this correlation to predict that a person who exercises 18 hours per day would consume 15,000 calories per day. • This prediction is beyond the bounds of the data collected and should therefore not be trusted • There is a well-known but weak correlation between SAT scores and college grades. You use this correlation to predict the college grades of your best friend from her SAT scores. • Since the correlation is weak, there is much scatter in the data, and you should not expect great accuracy in the prediction • Historical data have shown a strong negative correlation between national birth rates and affluence. That is, countries with greater affluence tend to have lower birth rates. These data predict a high birth rate in Russia. • We cannot automatically assume that the historical data still apply today. In fact, Russia currently has a very low birth rate, despite also having a low level of affluence.
Sec. 7.3 Example 1 Cont’d… • A study in China has discovered correlations that are useful in designing museum exhibits that Chinese children enjoy. A curator suggests using this information to design a new museum exhibit for Atlanta-area school children. • The suggestion to use information from the Chinese study for an Atlanta exhibit assumes that predictions made from correlations in China also apply to Atlanta. However, given the cultural differences between China and Atlanta, the curator’s suggestion should not be accepted without more information to back it up. • Scientific studies have shown a very strong correlation between children’s ingestion of lead and mental retardation. Based on this correlation, paints containing lead were banned. • Given the strength of the correlation and the severity of the consequences, this prediction and the ban that followed seem quite reasonable. In fact, later studies established lead as an actual cause of mental retardation, making the rationale behind the ban even stronger.
Sec. 7.3 The Correlation Coefficient and Best-Fit Lines • Recall that the correlation coefficient (r) measures the strength of a correlation • The correlation coefficient can also be used to say something about the validity of predictions with best-fit lines • The coefficient of determination, r², is the proportion of the variation in a variable that is accounted for by the best-fit line • Ex. ~ The correlation coefficient for the diamond weight and price from the scatterplot on p. 307 is r = 0.777, so r² ≈ 0.604. This means that about 60% of the variation in the diamond prices is accounted for by the best-fit line relating weight and price, and the remaining 40% of the variation in price must be due to other factors.
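To make "proportion of the variation accounted for" concrete, the sketch below computes r directly and also computes the fraction of variation explained by a least-squares line, then checks that the second equals r². The function names and the data are hypothetical (the diamond data are not reproduced here), and least squares is assumed as the fitting method.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def fraction_explained(xs, ys):
    """1 - (unexplained variation / total variation) for the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after the line has made its predictions.
    unexplained = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total


# Hypothetical data, not the diamond data from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, explained = {fraction_explained(xs, ys):.3f}")

# The arithmetic from the diamond example: r = 0.777 gives r^2 of about 0.604.
print(round(0.777 ** 2, 3))
```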
Sec. 7.3 Example 2 • You are the manager of a large department store. Over the years, you’ve found a reasonably strong positive correlation between your September sales and the number of employees you’ll need to hire for peak efficiency during the holiday season. The correlation coefficient is 0.950. This year your September sales are fairly strong. Should you start advertising for help based on the best-fit line? • r² = (0.950)² ≈ 0.903, which means that about 90% of the variation in the number of peak employees can be accounted for by a linear relationship with September sales, leaving only about 10% unaccounted for • Because 90% is so high, it is a good idea to predict the number of employees you’ll need using the best-fit line
Sec. 7.3 Multiple Regression • Multiple regression is a technique that allows us to find a best-fit equation relating one variable to more than one other variable • Ex. ~ Price of diamonds as a function of carat, cut, clarity, and color • The coefficient of determination (R²) is the most common measure of fit in a multiple regression • It tells us how much of the scatter in the data is accounted for by the best-fit equation • If R² is close to 1, the best-fit equation should be very useful for making predictions within the range of the data • If R² is close to 0, the predictions are essentially useless
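The section treats multiple regression only qualitatively. For readers who want to see the idea in action, here is a minimal sketch that assumes a least-squares fit obtained by solving the normal equations in plain Python. All function names are hypothetical, and the data are invented so that y = 1 + 2·x1 + 3·x2 holds exactly, which means R² should come out as 1 up to rounding.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique best-fit equation")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bk] for y ≈ b0 + b1·x1 + ... + bk·xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(rows, ys, coeffs):
    """Fraction of the variation in ys accounted for by the fitted equation."""
    mean_y = sum(ys) / len(ys)
    preds = [coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row)) for row in rows]
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, preds))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total


# Hypothetical data built to satisfy y = 1 + 2*x1 + 3*x2 exactly.
rows = [[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]]
ys = [6, 8, 13, 18, 26]
coeffs = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coeffs], round(r_squared(rows, ys, coeffs), 6))
```

With real data such as diamond prices, the points would not fall exactly on the fitted equation, and R² would land somewhere between 0 and 1.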