Week 5 Lecture 2
Chapter 8. Regression Wisdom
Percentage of Men Smokers (18 – 24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How has the percentage of people who smoke changed since the danger became clear during the last half of the 20th century?
Percentage of Men Smokers (18 – 24 years of age) from 1965 through 2009
The scatterplot shows the percentage of smokers among men 18–24 years of age, as estimated by surveys, from 1965 through 2009.
• The percentage of men aged 18–24 who smoke decreased dramatically between 1965 and 1990, but the trend has not been consistent since then.
• The association between the percentage of men aged 18–24 who smoke and year is very strong from 1965 to 1990, but is erratic after 1990.
• A linear model is not an appropriate model for the trend in the percentage of men aged 18–24 who smoke. The relationship is not straight.
• The regression equation is: male smoking % = 986.99552 - 0.47919438 Year
• R-sq = 0.7047499 (70.47%)
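To show how the fitted equation is read, here is a minimal Python sketch that plugs a few years into it. The coefficients are copied from the regression output above; the function name and the choice of years are assumptions made for illustration only.

```python
# Coefficients copied from the regression equation on this slide.
INTERCEPT = 986.99552
SLOPE = -0.47919438  # change in smoking percentage per additional year


def predicted_male_smoking_pct(year):
    """Fitted percentage of male smokers aged 18-24 for a given year."""
    return INTERCEPT + SLOPE * year


# Illustrative years inside the range of the data (1965 through 2009).
for year in (1965, 1990, 2009):
    print(year, round(predicted_male_smoking_pct(year), 2))
```

The slope says the fitted line drops by about 0.48 percentage points per year, from roughly 45.4% in 1965 to roughly 24.3% in 2009. Because the relationship is not straight, these fitted values should be read with caution.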
Checking the Assumptions of the Regression Model
The residuals appear to be approximately normally distributed.
Checking the Assumptions of the Regression Model
• Plot: residuals vs. predictor variable (Year).
• The nonlinearity is more prominent in this plot than in the original scatterplot.
• The residuals are not randomly scattered around the zero line, and they are not evenly spread out.
• The residuals follow a curved pattern.
• A linear regression model is not appropriate for these data.
Checking the Assumptions of the Regression Model
• No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.
• Residuals often reveal subtleties that were not clear from a plot of the original data (e.g. a scatterplot of y vs. x).
• Sometimes they reveal violations of the regression conditions that require our attention.
• It is good to look at both a histogram of the residuals (or a histogram of standardized residuals, or a normal QQ plot of the residuals) and a scatterplot of the residuals vs. the predictor variable.
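The residual check described above can be sketched in a few lines of Python. The data here are made up (they are not the CDC figures) and are chosen only to mimic a steep decline that levels off; `fit_line` is a hypothetical helper that computes the ordinary least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data (NOT the CDC figures): a decline that flattens out.
xs = list(range(10))
ys = [50 - 8 * x + 0.5 * x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Show the sign of each residual in order of x.
print("".join("+" if r > 0 else "-" for r in residuals))
```

In this sketch the residuals are positive at both ends and negative in the middle. That kind of run of signs, rather than a random scatter around zero, is the curved pattern that signals a straight line is the wrong model.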
Percentage of Both Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How have the percentages of men and women who smoke changed since the danger became clear during the last half of the 20th century?
Scatterplot for Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
• Smoking rates for both men and women in the US have decreased significantly over the period from 1965 to 2009.
• Smoking rates are generally lower for women than for men.
• The trend in the smoking rates for women seems a bit straighter than the trend for men.
• The apparent curvature in the scatterplot for the men could be due to just a few points, and may not indicate a serious violation of the linearity condition.
Scatterplot for Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
StatCrunch Command: Graph > Scatter Plot
• X-variable: Year
• Y-variable: Smoking %
• Group by: Sex
• Grouping Options: Color points by group
• Overlay polynomial order: 1
• Group properties: Color scheme: Alternate – 7 colors
• Click Compute
Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
Graph on the left: Not taking group into account
Graph on the right: Identified by group (male or female)
Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
Not taking group into account
Smoking % = 953.31052 - 0.46382114 Year
Sample size: 34
R (correlation coefficient) = -0.80476796
R-sq = 0.64765148
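In simple linear regression, R-sq is the square of the correlation coefficient, and the correlation has the same sign as the slope. A short Python check using the values reported on this slide (the variable names are assumptions for illustration):

```python
import math

# Values copied from the regression output on this slide.
r = -0.80476796
r_sq_reported = 0.64765148
slope = -0.46382114

# Squaring the correlation should reproduce the reported R-sq.
r_sq_computed = r ** 2
print(round(r_sq_computed, 6), r_sq_reported)

# Going the other way: take the square root of R-sq and
# give it the sign of the slope to recover r.
r_recovered = math.copysign(math.sqrt(r_sq_reported), slope)
print(round(r_recovered, 6))
```

Read as a percentage, R-sq says that about 64.8% of the variation in smoking percentage is accounted for by the pooled linear model on year.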
Analysis of Residual Points
The residuals appear to fall into two groups.
Analysis of Residual Points
• An examination of residuals often leads us to discover groups of observations that are different from the rest.
• A histogram of the residuals might show multiple modes.
• When we discover there is more than one group in a regression, we may decide to analyze the groups separately, using a different model for each group.
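The idea of residuals splitting into groups, and of fitting each group separately, can be illustrated with a small Python sketch. The numbers are made up (two parallel declining trends, not the CDC data), and `fit_line` is a hypothetical least-squares helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data: two groups with the same slope but different levels.
years = list(range(10))
men = [50.0 - 1.0 * t for t in years]
women = [40.0 - 1.0 * t for t in years]

# One pooled model that ignores the grouping.
b0, b1 = fit_line(years + years, men + women)
men_resid = [y - (b0 + b1 * t) for t, y in zip(years, men)]
women_resid = [y - (b0 + b1 * t) for t, y in zip(years, women)]
mean_men = sum(men_resid) / len(men_resid)
mean_women = sum(women_resid) / len(women_resid)
print("mean pooled residual, men:", round(mean_men, 2))
print("mean pooled residual, women:", round(mean_women, 2))

# A separate model for each group.
men_fit = fit_line(years, men)
women_fit = fit_line(years, women)
print("men (intercept, slope):", men_fit)
print("women (intercept, slope):", women_fit)
```

Under the pooled model every residual for one group sits above zero and every residual for the other sits below, which is what produces two modes in a histogram of residuals. Separate fits remove that split.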
Outliers
• Any point that stands away from the others can be called an outlier and deserves your special attention.
• Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.
High Leverage Points A data point that has an x-value far from the mean of the x-values is called a high leverage point. Examples:
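One common way to quantify "far from the mean of the x-values" in simple linear regression is the leverage h = 1/n + (x - mean of x)^2 / Sxx, where Sxx is the sum of squared deviations of the x-values. This formula is standard but is not taken from the slides, and the x-values below are made up for illustration.

```python
def leverages(xs):
    """Leverage of each point in a simple linear regression on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Made-up x-values: five clustered points and one far to the right.
xs = [1, 2, 3, 4, 5, 20]
h = leverages(xs)

for x, h_i in zip(xs, h):
    print(x, round(h_i, 3))
```

The point at x = 20 has a leverage close to 1, far above the others. In a simple regression the leverages always add up to 2, so one point holding nearly half of that total has a great deal of pull on the line.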
Influential Observations
A data point is influential if omitting it from the analysis gives a very different model.
Example: Relationship between murder rate and poverty level for the 50 states plus DC (51 observations)
Note: DC is far from the rest of the data and lies in a different direction from the overall pattern.
Dependent Variable: Murder Rate
Independent Variable: Poverty Rate
Murder Rate = -3.6792483 + 0.68731484 Poverty Rate
Sample size: 51
R (correlation coefficient) = 0.4735608
R-sq = 0.22425983
Estimate of error standard deviation: 3.9143851
Omitting the Observation for DC
Example: Relationship between murder rate and poverty level for the 50 states (excluding DC)
Dependent Variable: Murder Rate
Independent Variable: Poverty Rate
Murder Rate = -0.65671571 + 0.41331907 Poverty Rate
Sample size: 50
R (correlation coefficient) = 0.53936435
R-sq = 0.29091391
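To see how much one observation changes the model, the two fitted lines (with and without DC) can be compared directly. The coefficients in this Python sketch are copied from the two regression outputs; the poverty rates plugged in (10 and 20) are hypothetical values chosen for illustration.

```python
# (intercept, slope) copied from the two regression outputs.
WITH_DC = (-3.6792483, 0.68731484)
WITHOUT_DC = (-0.65671571, 0.41331907)


def predict(model, poverty_rate):
    """Predicted murder rate for a given poverty rate."""
    intercept, slope = model
    return intercept + slope * poverty_rate


# Hypothetical poverty rates, for illustration only.
for poverty_rate in (10, 20):
    with_dc = predict(WITH_DC, poverty_rate)
    without_dc = predict(WITHOUT_DC, poverty_rate)
    print(poverty_rate, round(with_dc, 2), round(without_dc, 2))

# Relative change in slope caused by dropping the single DC observation.
slope_change = (WITH_DC[1] - WITHOUT_DC[1]) / WITH_DC[1]
print(round(100 * slope_change, 1), "% smaller slope without DC")
```

Dropping one of 51 observations cuts the slope by roughly 40%, which is what makes DC an influential observation here.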
Restricted-range Problem
When one of the variables is restricted (you only look at some of its values), the correlation can be surprisingly low.
We will visit an example from the web, from David Lane: http://davidmlane.com/hyperstat/A68809.html
The demo video is found here: http://onlinestatbook.com/2/describing_bivariate_data/restriction_demo.html
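The restricted-range effect can also be seen in a small simulation. This Python sketch uses simulated data (not David Lane's example); the sample size, noise level, and the cutoffs 40 and 60 are all assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated data: y follows x linearly, plus random noise.
xs = [rng.uniform(0, 100) for _ in range(2000)]
ys = [x + rng.gauss(0, 20) for x in xs]
r_full = pearson_r(xs, ys)

# Keep only the observations whose x-value lies in a narrow band.
kept = [(x, y) for x, y in zip(xs, ys) if 40 <= x <= 60]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print("full range:", round(r_full, 2))
print("restricted range:", round(r_restricted, 2))
```

The underlying relationship is the same in both cases. Restricting x shrinks the spread explained by the line while the noise stays the same size, so the correlation drops.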
Working with Summary Statistics
The first graph shows that there appears to be a strong, positive, linear association between weight (in pounds) and height (in inches) for men.
The second graph shows that if, instead of data on individuals, we only had the mean weight for each height value, we would see an even stronger association.
• The points are less scattered.
• This can give a false impression of how well a line summarizes the data.
• We have a problem of overestimating or underestimating.
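A simulation makes the same point about summary statistics. In this Python sketch the heights, weights, and noise level are simulated values (assumptions for illustration, not the data behind the graphs): the correlation is computed once from individuals and once from the mean weight at each height.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated individuals: 50 men at each whole-inch height from 64 to 76.
heights, weights = [], []
for h in range(64, 77):
    for _ in range(50):
        heights.append(h)
        weights.append(-200 + 5 * h + rng.gauss(0, 25))
r_individuals = pearson_r(heights, weights)

# Summary data: one mean weight per height value.
unique_heights = sorted(set(heights))
mean_weights = []
for h in unique_heights:
    group = [w for hh, w in zip(heights, weights) if hh == h]
    mean_weights.append(sum(group) / len(group))
r_means = pearson_r(unique_heights, mean_weights)

print("individuals:", round(r_individuals, 2))
print("means by height:", round(r_means, 2))
```

Averaging removes most of the person-to-person variation, so the means hug the line much more closely than the individuals do. The correlation from the means overstates how well height predicts any one man's weight.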