Lecture 21: Review. Review a few points about regression that I went over quickly concerning coefficient of determination, regression diagnostics and transformation. Review ANOVA problem. Review regression problem. Administrative Info for Midterm II.
Lecture 21: Review • Review a few points about regression that I went over quickly concerning coefficient of determination, regression diagnostics and transformation. • Review ANOVA problem. • Review regression problem.
Administrative Info for Midterm II • Time and Location: Wednesday, April 2, 6-8 p.m., Steinberg Hall-Dietrich Hall 351. • Closed book; one double-sided 8.5 x 11 note sheet is allowed. • Bring a calculator. • All necessary tables will be provided, but nothing additional (e.g., Tukey’s bulging rule will not be provided). • Office hours: Today after class (12:10-2:30), Wednesday 9-11:30.
Material Covered • Focus is on Chapter 15 and Chapter 18 (we covered everything except 15.6 and 18.8). • Sections 13.5-13.6 are not covered. • Be prepared for questions that draw on material from the first midterm in the context of Chapter 15 and Chapter 18.
Coefficient of Determination (R²) • R² measures the strength of the linear relationship between Y and X. • Formulas for R²: • Square of the correlation between X and Y (thus if Cor(X,Y) = -0.5, then R² = 0.25). • R² = 1 - (SSE/SSTOT) = SSR/SSTOT. SSR is the sum of squares due to the model in JMP output. SSE, SSR and SSTOT can be read from the Analysis of Variance section of the JMP regression output.
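The two formulas for R² can be checked against each other numerically. The following is a minimal sketch in plain Python rather than JMP; the data values and the function name `r_squared_two_ways` are made up for illustration.

```python
import math

def r_squared_two_ways(x, y):
    """Return (squared correlation, 1 - SSE/SSTOT) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)  # SSTOT

    # Least-squares slope and intercept
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # SSE: sum of squared residuals around the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    r = sxy / math.sqrt(sxx * ss_tot)  # sample correlation
    return r ** 2, 1 - sse / ss_tot

# Hypothetical data, not from the lecture
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
by_corr, by_anova = r_squared_two_ways(x, y)
print(by_corr, by_anova)  # the two values agree up to rounding error
```

Because the correlation is squared, a negative slope gives the same R² as a positive slope of the same strength.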
Impact of Large Sample Sizes • R² will on average be the same, no matter what the sample size. • However, if there is a linear relationship between X and Y, the p-value for the test of whether the slope is zero will tend to become smaller as the sample size increases. Even if the linear relationship between Y and X is weak (but the slope is not zero), the test will have a small p-value for a large sample size.
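One way to see this is through the t statistic for the slope, which in simple linear regression can be written in terms of the sample correlation r as t = r·sqrt(n-2)/sqrt(1-r²). The sketch below holds r fixed at a hypothetical weak value and varies n; the function name and the chosen values are illustrative assumptions.

```python
import math

def slope_t_statistic(r, n):
    """t statistic for H0: slope = 0, expressed through the sample correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.1  # hypothetical weak relationship, R^2 = 0.01
for n in (30, 300, 3000, 30000):
    # R^2 is held fixed; only the sample size changes
    print(n, round(slope_t_statistic(r, n), 2))
```

With R² fixed, the t statistic grows roughly like the square root of n, so the p-value shrinks even though the relationship is no stronger.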
Prediction Intervals vs. Confidence Intervals • Prediction interval: Used when we want to predict one particular value of y given a specific value of x, e.g., a used car dealer wants to predict the price of a particular Ford Taurus given that it has 40,000 miles. • Confidence interval for the expected value of y: Used when we want to estimate the mean of y given x, e.g., a used car dealer wants to bid on a lot of 200 Ford Tauruses with 40,000 miles and wants to know the mean price of a Ford Taurus given that it has 40,000 miles.
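Prediction Intervals vs. Confidence Intervals Cont. • The prediction interval: [formula not recovered from the slide] • The confidence interval: [formula not recovered from the slide] • As the sample size becomes large, the width of the confidence interval tends to zero but the width of the prediction interval tends to [remainder missing in the source]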
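For reference, the usual textbook forms of the two intervals in simple linear regression are given below. The notation is an assumption and may differ from what the slide showed: x_g is the given value of x, s_ε is the standard error of estimate, and t is the critical value with n-2 degrees of freedom.

```latex
% Prediction interval for one new value of y at x = x_g
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s_\varepsilon
  \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

% Confidence interval for the mean of y at x = x_g
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s_\varepsilon
  \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The only difference is the leading 1 under the square root. As n grows, the other two terms shrink toward zero (provided the sum of squared deviations of x keeps growing), so the confidence interval collapses while the prediction interval keeps a half-width of about z_{α/2}·σ, reflecting the variability of a single observation around the line.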
Influential Points and Outliers • In addition to doing the previous diagnostics, you should check residual plots for influential points and outliers (in y, x and direction of scatterplot). • Influential point: Outlier in direction of x (has high leverage) and does not fall into exactly the same pattern of relationship between y and x as the other points. • Investigate whether outliers and influential points are properly recorded and are representative of the population we are interested in.
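Leverage can be quantified directly. In simple linear regression the leverage of point i is h_i = 1/n + (x_i - x̄)²/Σ(x_j - x̄)². The sketch below computes it for made-up x values; the 2p/n cutoff (with p = 2 coefficients) is a common rule of thumb and an assumption here, not something stated in the lecture.

```python
def leverages(x):
    """Leverage h_i of each point in a simple linear regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical x values; the last one is far from the rest
x = [1, 2, 3, 4, 5, 20]
h = leverages(x)

# Rule-of-thumb cutoff 2p/n with p = 2 (intercept and slope); an assumption
cutoff = 2 * 2 / len(x)
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]
print(flagged)
```

High leverage alone does not make a point influential; as the slide says, the point must also depart from the pattern followed by the other points.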
Diagnosing Nonlinearity • Check residual plot vs. x to see if there is a pattern.
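As an illustration of the kind of pattern to look for, the sketch below fits a straight line to deliberately curved, made-up data (y = x²) and lists the residuals in order of x. The helper name `residuals_from_line` is hypothetical.

```python
def residuals_from_line(x, y):
    """Residuals y_i - (b0 + b1 * x_i) from the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical curved data: y is exactly x squared
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]
res = residuals_from_line(x, y)

for xi, ri in zip(x, res):
    print(xi, round(ri, 2))
```

The residuals are positive at both ends of the x range and negative in the middle. A systematic U shape like this, rather than a random scatter around zero, is the signal of nonlinearity.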
Transformations • If there is nonlinearity, one possible way to correct for it is to apply a transformation to y or x. • Tukey’s bulging rule (see handout): Match the curvature in the data to the shape of one of the curves drawn in the four quadrants, then apply one of the transformations listed for that quadrant.
Tukey’s Bulging Rule • Curvature appears to match top left quadrant. Try transformation to log X.
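A quick way to judge whether the log X transformation helped is to compare R² before and after. The sketch below uses made-up data that rises quickly and then flattens, which is the top-left bulge shape; the data, noise values and function name are illustrative assumptions.

```python
import math

def r_squared(x, y):
    """Squared sample correlation between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: y is roughly linear in log(x), with small fixed "noise"
x = [1, 2, 4, 8, 16, 32, 64]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.1]
y = [2 + 3 * math.log(xi) + e for xi, e in zip(x, noise)]

r2_raw = r_squared(x, y)                          # straight line in x
r2_log = r_squared([math.log(xi) for xi in x], y)  # straight line in log x
print(round(r2_raw, 3), round(r2_log, 3))
```

After transforming, the residual plot against log X should be rechecked for any remaining pattern, since a higher R² alone does not confirm linearity.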