This lesson explores the dangers of extrapolating data using regression analysis. It examines examples such as predicting murder rates and marathon times far into the future and highlights the risks associated with making unreliable predictions.
Extrapolating Murder • The data and regression line for U.S. states relating x = percentage of single-parent families to y = annual murder rate (number of murders per 100,000 people in the population) are given below. Using the equation for the regression line stated in the figure, find the predicted murder rate at x = 0. • Questions. a) What is the problem with this prediction? b) What is the general lesson to be learned from this example?
Extrapolating Murder: Conclusions • Questions. a) What is the problem with this prediction? The predicted murder rate is -8.25 murders per 100,000 people, which is nonsense because a murder rate cannot be negative. b) What is the general lesson to be learned from this example? In the data, the percentage of single-parent families ranged from 14% to 30%. Following the trend down to 0% gives a prediction that is too far from the data to be reliable.
Winning Marathon Times • Winning times in the Boston marathon have followed a straight line decreasing trend from 160 minutes in 1927 to 130 minutes in 2004. After fitting a regression line to the winning times, it can be predicted that the winning time in the year 2300 will be about 13 minutes. • Questions. a) What is the problem with this prediction? b) What is the general lesson to be learned from this example?
Winning Marathon Times: Conclusions • Questions. a) What is the problem with this prediction? It is not feasible for a person to run a marathon in 13 minutes. Other factors, such as biological limitations, will become more dominant. b) What is the general lesson to be learned from this example? We cannot make reliable predictions too far from the range of the original data.
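The arithmetic behind the marathon prediction can be sketched in a few lines. The slide does not give the fitted regression equation, so this sketch assumes a straight line through the two quoted values (160 minutes in 1927, 130 minutes in 2004) as a stand-in. It therefore lands near, but not exactly on, the slide's figure of about 13 minutes. The function names are hypothetical.

```python
# Two winning times quoted on the slide: (year, minutes).
START = (1927, 160.0)
END = (2004, 130.0)

# Assumption: a line through the two quoted values stands in for the
# fitted regression line, whose equation the slide does not give.
slope = (END[1] - START[1]) / (END[0] - START[0])  # minutes per year


def predicted_time(year):
    """Winning time in minutes predicted by the stand-in line."""
    return END[1] + slope * (year - END[0])


def year_line_hits_zero():
    """Year at which the stand-in line predicts a winning time of 0 minutes."""
    return END[0] - END[1] / slope


for year in (1927, 2004, 2100, 2300):
    print(year, round(predicted_time(year), 1))
print("line reaches 0 minutes around", round(year_line_hits_zero()))
```

Inside 1927 to 2004 the line reproduces the quoted times. Far outside that range it keeps falling at the same rate and eventually crosses zero, which no real race can do.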
Summary: Extrapolation is Dangerous • Extrapolation refers to using a regression line to predict y-values for x-values outside of the range of data. • Predictions become riskier the farther x moves from that range.
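One way to make "how far outside the range" concrete is to measure the distance in units of the observed range width. The helper below is a minimal sketch, not a standard statistic. The name `extrapolation_distance` is hypothetical, and the 14% to 30% range comes from the murder-rate example.

```python
def extrapolation_distance(x, x_min, x_max):
    """Return how far x lies outside [x_min, x_max], in range widths.

    0.0 means x is inside the observed range, so the prediction is an
    interpolation rather than an extrapolation.
    """
    width = x_max - x_min
    if width <= 0:
        raise ValueError("x_max must be greater than x_min")
    if x < x_min:
        return (x_min - x) / width
    if x > x_max:
        return (x - x_max) / width
    return 0.0


# Murder-rate example: observed single-parent percentages ran from 14 to 30.
print(extrapolation_distance(0, 14, 30))   # x = 0 lies below the data
print(extrapolation_distance(20, 14, 30))  # x = 20 lies inside the data
```

A prediction at x = 0 sits almost a full range width below the lowest observed value, which is a quick signal that the regression line is being pushed well beyond the data.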
TV Watching and the Birth Rate • The plot below shows recent data on x = the number of televisions per 100 people and y = the birth rate (number of births per 1000 people) for six African and Asian nations. The regression line y = 29.8 – 0.024x applies to the data for these six countries. If the data for the United States (81, 15.2) is added, the regression line for all seven points is y = 31.2 – 0.195x. In addition, the correlation is r = -0.051 without the United States and r = -0.935 with the United States. • Questions. a) In what ways does the United States appear to be an outlier? b) What would you conclude about the strength of the linear association without the United States, and with the United States? c) What is the general lesson to be learned from this example?
TV Watching and the Birth Rate: Conclusions • Questions. a) In what ways does the United States appear to be an outlier? With respect to both the x variable and the y variable, as well as the trend. b) What would you conclude about the strength of the linear association without the United States, and with the United States? It is not a strong association without the United States, but is quite strong with the United States. c) What is the general lesson to be learned from this example? Outliers can have quite an effect on the regression line and the correlation.
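The effect of a single far-away point on the correlation can be sketched as follows. The six-country values here are made up for illustration and are NOT the slide's data, which is shown only in the missing plot. Only the United States point (81, 15.2) comes from the slide. The numbers printed will therefore differ from the slide's r = -0.051 and r = -0.935, but the pattern is the same.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical values for six low-TV countries (illustration only).
tvs = [1, 2, 3, 4, 5, 6]              # televisions per 100 people
births = [30, 28, 32, 29, 31, 30]     # births per 1000 people

# The one point taken from the slide: the United States.
us_tvs, us_births = 81, 15.2

r_without = pearson_r(tvs, births)
r_with = pearson_r(tvs + [us_tvs], births + [us_births])

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```

With these made-up values the six clustered points show almost no linear pattern, and adding the one distant point produces a strong negative correlation, so a single observation ends up driving the summary.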
Education and Expensive Homes • A regression line for the data x = number of years of education and y = annual income for 100 people shows a modest positive trend, except for one person who dropped out after the 10th grade but is now a multi-millionaire. The correlation for the data including all 100 people is r = -0.28. • Questions. a) What can we say about the association if r = -0.28? b) Is your answer to part a) what we would want to report in our findings? c) What is the general lesson to be learned from this example?
Education and Expensive Homes: Conclusions • Questions. a) What can we say about the association if r = -0.28? It has a negative direction. b) Is your answer to part a) what we would want to report in our findings? No, the trend for the data, with the exception of this one outlier, had a positive direction. This is a better reflection of the data. c) What is the general lesson to be learned from this example? Outliers can change the direction of the association.
Summary: Be Cautious of Influential Outliers • An observation can be an outlier in its x-value, or in its y-value, or with respect to the trend of the other observations. • Regression outliers are outliers with respect to the trend of the other observations. • An observation has a large effect on the regression line and/or correlation when both of the following hold: • Its x-value is relatively low or high compared to the rest of the data. • The observation is a regression outlier.
Eating Ice Cream and Drowning • The “Gold Coast” of Australia is famous for its magnificent beaches. Because of strong rip tides, however, each year many people drown. Data collected each month for x = number of gallons of ice cream sold in refreshment stands along the beach that month and y = number of people who drowned that month shows a positive correlation. • Questions. a) Do you think the following is a good conclusion: Eating ice cream at the beach is a contributing factor in deaths from drowning? b) Can you identify another variable that could be responsible for this association? c) What is the general lesson to be learned from this example?
Eating Ice Cream and Drowning: Conclusions • Questions. a) Do you think the following is a good conclusion: Eating ice cream at the beach is a contributing factor in deaths from drowning? No, there is an association between the variables, but eating ice cream does not necessarily cause someone to drown. b) Can you identify another variable that could be responsible for this association? Mean temperature for the month. c) What is the general lesson to be learned from this example? Association is not the same as causation, and other variables may be influencing the association.
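A small synthetic example can show how a third variable produces an association on its own. Everything below is constructed for illustration and is not data from the Gold Coast. Ice cream sales and drownings are each built from monthly temperature plus an unrelated wiggle, so neither depends on the other. The sketch then removes the temperature trend from both series and correlates what is left, which is one common way to "control for" a variable.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def residuals(ys, xs):
    """Residuals of ys after removing its least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Synthetic monthly mean temperatures (illustration only).
temps = [15, 15, 19, 19, 23, 23, 27, 27, 31, 31, 35, 35]
# Two wiggles built to be unrelated to temperature and to each other.
wiggle_a = [1, -1] * 6
wiggle_b = [1, -1, -1, 1] * 3

# Each series depends on temperature only, never on the other series.
ice_cream = [40 * t + 50 * a for t, a in zip(temps, wiggle_a)]
drownings = [0.2 * t + b for t, b in zip(temps, wiggle_b)]

raw_r = pearson_r(ice_cream, drownings)
controlled_r = pearson_r(residuals(ice_cream, temps),
                         residuals(drownings, temps))

print("raw correlation:", round(raw_r, 3))
print("after removing the temperature trend:", round(controlled_r, 3))
```

The raw correlation is clearly positive even though the two series were built independently. Once the temperature trend is taken out of both, essentially nothing remains.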
Smoking May Be Beneficial To Your Health • A survey of 1,314 women in the United Kingdom during 1972-1974 asked each woman whether she was a smoker. Twenty years later, a follow-up survey observed whether each woman was dead or still alive. The explanatory variable was smoker/non-smoker and the response variable was survival status after 20 years. We find that 24% of smokers died and 31% of non-smokers died. There was a greater survival rate for the smokers. • Questions. a) Do you think the following is a good conclusion: People who smoke tend to have a better survival rate? b) Can you identify another variable that could be responsible for this association? c) What is the general lesson to be learned from this example?
Smoking May Be Beneficial To Your Health: Conclusions • Questions. a) Do you think the following is a good conclusion: People who smoke tend to have a better survival rate? No, in this study there was a positive association between smoking and survival rate, but smoking was not necessarily responsible for the better survival rate for smokers. b) Can you identify another variable that could be responsible for this association? Age. c) What is the general lesson to be learned from this example? Association is not the same as causation, and other variables may be influencing the association.
Summary: Association Does Not Imply Causation • A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest. • A confounding variable is observed and possibly influences the variables of primary interest. • Eating ice cream and drowning • Lurking variable might be mean temperature for the month • Smoking may be beneficial to your health • Lurking variable was age
Smoking May Be Beneficial To Your Health • The study found • 24% of smokers died • 31% of non-smokers died
Simpson’s Paradox • The direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of the third variable. • This is called Simpson’s paradox.
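The reversal can be checked numerically by splitting the smoking data by age. The age-group counts below are not on the slides. They follow the breakdown commonly reported for this study and add up to the 1,314 women and the 24% and 31% death rates quoted above, but they should be treated as an assumption here. The helper name `death_rate` is hypothetical.

```python
# (deaths, total women) per age group at the first survey.
# Assumption: these counts are not given on the slides.
smokers = {
    "18-34": (5, 179),
    "35-54": (41, 239),
    "55-64": (51, 115),
    "65+": (42, 49),
}
non_smokers = {
    "18-34": (6, 219),
    "35-54": (19, 199),
    "55-64": (40, 121),
    "65+": (165, 193),
}


def death_rate(groups):
    """Overall death rate across all age groups combined."""
    deaths = sum(d for d, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return deaths / total


print("overall: smokers", round(death_rate(smokers), 2),
      "non-smokers", round(death_rate(non_smokers), 2))

for age in smokers:
    s_deaths, s_total = smokers[age]
    n_deaths, n_total = non_smokers[age]
    print(age, "smokers", round(s_deaths / s_total, 3),
          "non-smokers", round(n_deaths / n_total, 3))
```

Under these counts the smokers have the lower death rate overall but the higher death rate within every age group. The non-smokers include far more women aged 65 and over, and that older group has a high death rate regardless of smoking status.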