Regression Model Building LPGA Golf Performance - 2008
Data Description
• Response: log(Prize Winnings/Round) – Skewed data
• Potential Predictors:
  • Average Drive Distance
  • Percentage of Drives Reaching Fairway
  • Percentage of Greens Reached in Regulation
  • Average Putts per Hole
  • Average Number of Sand Traps Hit per Round (Sandshot)
  • Percentage of Sand Saves
• Samples:
  • Training Sample – 100 Randomly Sampled Golfers
  • Validation Sample – 57 Remaining Golfers used to assess fit
Modeling Strategies
• Select a training sample
• Select the “best” subset of predictors using Backward Elimination, Forward Selection, Stepwise Regression, and/or All Possible Regressions, each based on minimizing a model-selection criterion (the selection output for this example reports AIC)
• Identify any influential observations (based on outliers, leverage values, DFFITS, DFBETAS, Cook’s D)
• Test model assumptions: Normality (Shapiro-Wilk), Constant Variance (Brown-Forsythe and Breusch-Pagan)
• Determine the validity of the model by obtaining prediction errors for the validation sample
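The first and last steps (hold out a validation sample, then compare prediction error with training error) can be sketched in R as follows. This is a minimal illustration on simulated data, not the LPGA data: the data frame `sim`, its coefficients, the two-predictor model, and the seed are all assumptions made up for the example.

```r
set.seed(1)                              # arbitrary seed (assumption)
n <- 157                                 # same total size as the slides: 100 + 57

# Simulated stand-in data; values and coefficients are illustrative only
sim <- data.frame(green = rnorm(n, 65, 4), putts = rnorm(n, 30, 0.8))
sim$logprz <- 10 + 0.2 * sim$green - 0.6 * sim$putts + rnorm(n, sd = 0.35)

# Randomly select 100 rows for training; the remaining 57 form the validation sample
train_rows <- sample(n, 100)
train <- sim[train_rows, ]
valid <- sim[-train_rows, ]

# Fit on the training sample only
fit <- lm(logprz ~ green + putts, data = train)

# Training MSE (SSE / residual df) versus mean squared prediction error on validation
mse_train <- sum(residuals(fit)^2) / df.residual(fit)
pred_err  <- valid$logprz - predict(fit, newdata = valid)
mspe      <- mean(pred_err^2)

c(MSE = mse_train, MSPE = mspe)
```

An MSPE of roughly the same size as the training MSE is usually read as a sign that the selected model is not overfit to the training sample.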
Backward Elimination (RSS = SSE)

Step 1: Start: AIC=-200.22
logprz ~ drive + fairway + green + putts + sandshot + sandsave

           Df Sum of Sq    RSS      AIC
- fairway   1     0.010 11.750 -202.132
<none>                  11.740 -200.216
- drive     1     0.397 12.138 -198.887
- sandsave  1     0.405 12.145 -198.827
- sandshot  1     1.030 12.770 -193.806
- green     1    24.960 36.700  -88.238
- putts     1    35.360 47.100  -63.289

Step 2: AIC=-202.13
logprz ~ drive + green + putts + sandshot + sandsave

           Df Sum of Sq    RSS      AIC
<none>                  11.750 -202.132
- sandsave  1     0.400 12.150 -200.784
- drive     1     0.537 12.287 -199.665
- sandshot  1     1.034 12.784 -195.698
- green     1    32.091 43.841  -72.461
- putts     1    35.688 47.438  -64.575

• At Step 1, fairway is eliminated because dropping it gives the smallest AIC (-202.132 < -200.216)
• At Step 2, no further variables are removed (no candidate model has AIC < -202.132)
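The AIC column appears to follow the form that R's `step()` uses for linear models, n·log(RSS/n) + 2p, where p counts the estimated coefficients including the intercept. The check below uses only RSS values from the output above and n = 100 (the training-sample size). The helper name `aic_from_rss` is made up for this sketch.

```r
n <- 100   # training-sample size from the slides

# Hypothetical helper: AIC (up to an additive constant) from RSS and coefficient count
aic_from_rss <- function(rss, n_coef) n * log(rss / n) + 2 * n_coef

# Full model: intercept + 6 predictors, RSS = 11.740
aic_full <- aic_from_rss(11.740, 7)

# After dropping fairway: intercept + 5 predictors, RSS = 11.750
aic_drop_fairway <- aic_from_rss(11.750, 6)

round(c(full = aic_full, drop_fairway = aic_drop_fairway), 2)
```

The two values should agree with the reported -200.22 and -202.13 to about two decimals. The small differences come from the RSS values being rounded in the printed table. Dropping fairway barely changes RSS but saves one parameter, which is why AIC falls.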
Forward Selection (RSS = SSE)

Step 1: Start: AIC=-6.61
logprz ~ 1

           Df Sum of Sq    RSS     AIC
+ green     1    38.599 53.150 -59.206
+ putts     1    33.043 58.706 -49.263
+ drive     1    11.622 80.126 -18.156
+ sandshot  1     8.951 82.798 -14.876
+ sandsave  1     3.118 88.631  -8.069
<none>                  91.749  -6.611
+ fairway   1     0.409 91.340  -5.058

Step 2: AIC=-59.21
logprz ~ green

           Df Sum of Sq    RSS      AIC
+ putts     1    39.514 13.636 -193.246
+ sandsave  1     4.859 48.291  -66.793
<none>                  53.150  -59.206
+ fairway   1     0.635 52.514  -58.408
+ drive     1     0.361 52.788  -57.888
+ sandshot  1     0.004 53.146  -57.214

Step 3: AIC=-193.25
logprz ~ green + putts

           Df Sum of Sq    RSS     AIC
+ sandshot  1   0.73688 12.899 -196.80
+ sandsave  1   0.66486 12.971 -196.25
+ drive     1   0.31495 13.321 -193.58
<none>                  13.636 -193.25
+ fairway   1   0.09401 13.542 -191.94

Step 4: AIC=-196.8
logprz ~ green + putts + sandshot

           Df Sum of Sq    RSS     AIC
+ drive     1   0.74905 12.150 -200.78
+ sandsave  1   0.61234 12.287 -199.66
<none>                  12.899 -196.80
+ fairway   1   0.25056 12.649 -196.76

Step 5: AIC=-200.78
logprz ~ green + putts + sandshot + drive

           Df Sum of Sq    RSS     AIC
+ sandsave  1   0.40005 11.750 -202.13
<none>                  12.150 -200.78
+ fairway   1   0.00524 12.145 -198.83

Step 6: AIC=-202.13
logprz ~ green + putts + sandshot + drive + sandsave

           Df Sum of Sq   RSS     AIC
<none>                  11.75 -202.13
+ fairway   1 0.0099086 11.74 -200.22
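Output in this layout is what R's `step()` function prints, so both searches can be sketched with it. The example below runs on simulated data, not the LPGA data: the data frame `sim`, the coefficients, and the seed are assumptions, and only three candidate predictors are used to keep it short.

```r
set.seed(42)                             # arbitrary seed (assumption)
n <- 100

# Simulated stand-in data: green and putts matter, fairway is pure noise
sim <- data.frame(green   = rnorm(n, 65, 5),
                  putts   = rnorm(n, 30, 1),
                  fairway = rnorm(n, 70, 6))
sim$logprz <- 10 + 0.2 * sim$green - 0.6 * sim$putts + rnorm(n, sd = 0.35)

# Forward selection: start from the intercept-only model and add terms while AIC drops
null_fit <- lm(logprz ~ 1, data = sim)
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ green + putts + fairway),
            direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop terms while AIC drops
full_fit <- lm(logprz ~ green + putts + fairway, data = sim)
bwd <- step(full_fit, direction = "backward", trace = 0)

formula(fwd)
formula(bwd)
```

Setting `trace = 1` (the default) prints step-by-step tables like the ones shown above.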
Model – green, putts, sandshot, sandsave, drive

Call:
lm(formula = logprz ~ green + putts + sandshot + sandsave + drive, data = lpga.cv.in)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72852 -0.20634  0.01067  0.22439  0.72316

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.272879   1.580975   9.028 2.14e-14 ***
green        0.210379   0.013130  16.023  < 2e-16 ***
putts       -0.625367   0.037011 -16.897  < 2e-16 ***
sandshot     0.790771   0.274937   2.876  0.00498 **
sandsave     0.008334   0.004658   1.789  0.07684 .
drive       -0.009563   0.004615  -2.072  0.04098 *
---
Residual standard error: 0.3536 on 94 degrees of freedom
Multiple R-squared: 0.8719, Adjusted R-squared: 0.8651
F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16
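The summary statistics at the bottom of this output can be cross-checked against the selection tables. The sketch below assumes RSS = 11.750 for the five-predictor model and a total sum of squares of 91.749 (the RSS of the intercept-only model in the forward-selection output). The variable names are made up for the example.

```r
rss <- 11.750    # residual sum of squares, five-predictor model
tss <- 91.749    # total sum of squares (RSS of logprz ~ 1)
n   <- 100       # training-sample size
k   <- 5         # number of predictors

r2     <- 1 - rss / tss                                # multiple R-squared
adj_r2 <- 1 - (rss / (n - k - 1)) / (tss / (n - 1))    # adjusted R-squared
rse    <- sqrt(rss / (n - k - 1))                      # residual standard error
f_stat <- ((tss - rss) / k) / (rss / (n - k - 1))      # overall F statistic

round(c(R2 = r2, adjR2 = adj_r2, RSE = rse, F = f_stat), 4)
```

Up to rounding in the inputs, these should match the reported 0.8719, 0.8651, 0.3536 and 128.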
Summary of Influence Measures - I
• Studentized Residuals (flag if absolute value exceeds 3.607)
  • Most extreme values: -2.172 and +2.112, so none are flagged
• Leverage Values (flag if h exceeds 0.12)
  • Golfers 111 (h=0.1543), 127 (h=0.1263), 113 (h=0.1213) (no big problem)
• DFFITS (flag if absolute value exceeds 0.49)
  • Three golfers between -0.61 and -0.49 (Golfers 142, 91, and 117)
  • One golfer between 0.49 and 0.59 (Golfer 59)
• Cook’s D (flag if it exceeds 1; a cutoff of 0.5 is sometimes suggested)
  • Maximum value is 0.0626, nowhere near 1 or the sometimes suggested 0.5
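The cutoffs quoted above are consistent with common rules of thumb for n = 100 training cases and p = 6 coefficients (five predictors plus the intercept): 2p/n for leverage, 2·sqrt(p/n) for DFFITS, and a Bonferroni-adjusted t quantile for studentized residuals. That reading is inferred from the numbers and is not stated on the slides. The sketch computes the cutoffs and applies them to a model fitted to simulated data. The data frame `sim` and the seed are assumptions.

```r
n <- 100   # training-sample size
p <- 6     # coefficients: 5 predictors + intercept

# Rule-of-thumb cutoffs (assumed to be the ones behind the slide's values)
cut_resid  <- qt(1 - 0.05 / (2 * n), df = n - p - 1)   # Bonferroni, alpha = 0.05
cut_lev    <- 2 * p / n
cut_dffits <- 2 * sqrt(p / n)
round(c(resid = cut_resid, leverage = cut_lev, dffits = cut_dffits), 3)

# Simulated stand-in data and fit (not the LPGA data)
set.seed(7)
sim <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                  sandsave = rnorm(n), drive = rnorm(n))
sim$logprz <- 0.5 * sim$green - 0.5 * sim$putts + rnorm(n)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = sim)

# One logical column per measure: TRUE where an observation exceeds its cutoff
flags <- data.frame(resid    = abs(rstudent(fit)) > cut_resid,
                    leverage = hatvalues(fit) > cut_lev,
                    dffits   = abs(dffits(fit)) > cut_dffits,
                    cooks    = cooks.distance(fit) > 0.5)
colSums(flags)   # number of flagged observations per measure
```

The leverage and DFFITS cutoffs work out to 0.12 and about 0.49. The residual cutoff should come out close to the 3.607 quoted on the slide.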
Summary of Influence Measures
• DFBETAS (flag if absolute value exceeds 0.20)
  • Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
  • Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
  • Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
  • Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
  • Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
  • Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)
• While some of these exceed the “threshold”, none is excessively large. However, golfers 142 and 117 appear repeatedly and should be checked.
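The 0.20 cutoff matches the common 2/sqrt(n) rule for n = 100, though the slides do not say this is where it comes from. Counting how many coefficients each observation moves past the cutoff is one way to spot cases that appear repeatedly, as golfers 142 and 117 do above. The sketch uses simulated data. `sim`, the seed, and `n_flags` are names invented for the example.

```r
n <- 100
cut_dfbetas <- 2 / sqrt(n)   # 0.2 for n = 100

# Simulated stand-in data and fit (not the LPGA data)
set.seed(9)
sim <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                  sandsave = rnorm(n), drive = rnorm(n))
sim$logprz <- 0.5 * sim$green - 0.5 * sim$putts + rnorm(n)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = sim)

# dfbetas() returns one row per observation and one column per coefficient
db <- dfbetas(fit)

# For each observation, count the coefficients whose |DFBETAS| exceeds the cutoff
n_flags <- rowSums(abs(db) > cut_dfbetas)

# Observations flagged on the most coefficients come first
head(sort(n_flags, decreasing = TRUE))
```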
Residuals appear to be approximately normal. The Shapiro-Wilk test does not reject the hypothesis of normal errors (p = 0.2390).

> shapiro.test(residuals(lpga.mod1))

        Shapiro-Wilk normality test

data:  residuals(lpga.mod1)
W = 0.9833, p-value = 0.2390
No evidence of non-constant error variance (the response had been log-transformed prior to fitting the model)
Equal (Homogeneous) Variance - I
No evidence to reject the null hypothesis of equal variance among errors
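One common way to test equal error variance is the Brown-Forsythe (modified Levene) procedure: split the cases into two groups, take absolute deviations of the residuals from each group's median, and compare the group means with a two-sample t-test. The sketch below is a hedged illustration on simulated data. Splitting at the median fitted value is an assumption, since the slides do not say how the groups were formed.

```r
set.seed(3)                               # arbitrary seed (assumption)
n <- 100

# Simulated stand-in data with constant error variance (not the LPGA data)
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)
fit <- lm(y ~ x, data = sim)

e   <- residuals(fit)
low <- fitted(fit) <= median(fitted(fit))   # assumed grouping: low vs high fitted values

# Absolute deviations of residuals from their own group's median
d_low  <- abs(e[low]  - median(e[low]))
d_high <- abs(e[!low] - median(e[!low]))

# Pooled-variance two-sample t-test on the absolute deviations
bf <- t.test(d_low, d_high, var.equal = TRUE)
c(t = unname(bf$statistic), df = unname(bf$parameter), p.value = bf$p.value)
```

A large p-value gives no evidence that the spread of the residuals differs between the two groups.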
Equal (Homogeneous) Variance
There is no evidence of unequal variance, based on either the Brown-Forsythe or the Breusch-Pagan test.

        Breusch-Pagan test

data:  logprz ~ green + putts + sandshot + sandsave + drive
BP = 1.9306, df = 5, p-value = 0.8587
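The reported p-value is the upper tail of a chi-square distribution with 5 degrees of freedom at BP = 1.9306, which the first lines below check. The rest is a base-R sketch of one form of the test, the Koenker (studentized) variant: regress the squared residuals on the predictors and use n·R². This may not be the variant behind the output above, and the data are simulated. `sim`, `e2`, and the seed are assumptions.

```r
# p-value implied by the reported statistic and degrees of freedom
p_reported <- pchisq(1.9306, df = 5, lower.tail = FALSE)
round(p_reported, 4)

# Simulated stand-in data with constant error variance (not the LPGA data)
set.seed(11)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + sim$x1 - sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = sim)

# Auxiliary regression of squared residuals on the same predictors
sim$e2 <- residuals(fit)^2
aux <- lm(e2 ~ x1 + x2, data = sim)

# Koenker form: n * R^2 is approximately chi-square, df = number of predictors
bp   <- n * summary(aux)$r.squared
p_bp <- pchisq(bp, df = 2, lower.tail = FALSE)
c(BP = bp, p.value = p_bp)
```

With five predictors in the fitted model, the degrees of freedom are 5, which matches the reported output.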