2012
Introduction to Predictive Modeling with Examples
David A. Dickey
North Carolina State University
Cool <------------------> Nerdy
“Analytics”                “Statistics”
“Predictive Modeling”      “Regression”

Part 1: Simple Linear Regression
If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others http://www.ofesite.com/spirit/palm/lines/linelife.htm
Wilson & Mather, JAMA 229 (1974)
X = life line length, Y = age at death

    proc sgplot;
      scatter Y=age X=line;
      reg Y=age X=line;
    run;

Result: Predicted Age at Death = 79.23 - 1.367(lifeline)
(Is this “real”??? Is this repeatable???)
We use LEAST SQUARES.
Squared residuals sum to 9609.
The “best” line is the one that minimizes the sum of squared residuals.
Best for this sample, but is it the true relationship for everyone?
SAS PROC REG will compute it.
What other lines might be the true line for everyone? Probably not the purple one. The red one has slope 0 (no effect). Is the red line unreasonable? Can we reject H0: slope is 0?
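To make the least squares idea concrete, here is a minimal Python sketch (the deck itself uses SAS PROC REG). The data values are hypothetical, not the Wilson & Mather measurements, and the function names `least_squares` and `sse` are illustrative only.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared residuals around the line intercept + slope*x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data (assumption): life line lengths and ages at death.
line = [6.0, 7.5, 8.0, 9.0, 9.5, 10.5, 11.0, 12.0]
age = [71.0, 82.0, 65.0, 74.0, 60.0, 79.0, 68.0, 66.0]

a, b = least_squares(line, age)
best = sse(line, age, a, b)                      # SSE of the fitted line
flat = sse(line, age, sum(age) / len(age), 0.0)  # SSE of the slope-0 line
print(f"fitted: age = {a:.2f} + {b:.3f}*line, SSE = {best:.1f}")
print(f"slope-0 line through the mean: SSE = {flat:.1f}")
```

Any other line, including the slope-0 line through the mean, has an SSE at least as large as the fitted one. Whether that gap is big enough to reject H0 is a separate question.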
Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200.
Simulate 20 cases with n = 50 bodies each.

NOTE: Regression equations:
Age(rep:1) = 80.56253 - 1.345896*line
Age(rep:2) = 61.76292 + 0.745289*line
Age(rep:3) = 72.14366 - 0.546996*line
Age(rep:4) = 95.85143 - 3.087247*line
Age(rep:5) = 67.21784 - 0.144763*line
Age(rep:6) = 71.0178 - 0.332015*line
Age(rep:7) = 54.9211 + 1.541255*line
Age(rep:8) = 69.98573 - 0.472335*line
Age(rep:9) = 85.73131 - 1.240894*line
Age(rep:10) = 59.65101 + 0.548992*line
Age(rep:11) = 59.38712 + 0.995162*line
Age(rep:12) = 72.45697 - 0.649575*line
Age(rep:13) = 78.99126 - 0.866334*line
Age(rep:14) = 45.88373 + 2.283475*line
Age(rep:15) = 59.28049 + 0.790884*line
Age(rep:16) = 73.6395 - 0.814287*line
Age(rep:17) = 70.57868 - 0.799404*line
Age(rep:18) = 72.91134 - 0.821219*line
Age(rep:19) = 55.46755 + 1.238873*line
Age(rep:20) = 63.82712 + 0.776548*line

Predicted Age at Death = 79.23 - 1.367(lifeline) would NOT be unusual if there is no true relationship.
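A simulation of this kind can be sketched in a few lines of Python. The slides do not say how the life line lengths were generated, so drawing them uniformly between 6 and 13 is an assumption, as are the seed and the helper names; the fitted equations will therefore not match the listing above number for number.

```python
import random


def fit_line(xs, ys):
    """Least squares (intercept, slope) for a simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate(reps=20, n=50, intercept=67.0, slope=0.0, variance=200.0, seed=2012):
    """Fit a line to each of `reps` samples drawn from a known true line."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    fits = []
    for _ in range(reps):
        # ASSUMPTION: life line lengths uniform on [6, 13]; not given in the slides.
        line = [rng.uniform(6.0, 13.0) for _ in range(n)]
        age = [intercept + slope * x + rng.gauss(0.0, sd) for x in line]
        fits.append(fit_line(line, age))
    return fits


fits = simulate()
slopes = [b for _, b in fits]
for rep, (a, b) in enumerate(fits, start=1):
    sign = "+" if b >= 0 else "-"
    print(f"Age(rep:{rep}) = {a:.5f} {sign} {abs(b):.6f}*line")
```

Even though the true slope is exactly 0, the estimated slopes scatter around 0, which is the point of the slide.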
Distribution of t Under H0
Conclusion: Estimated slopes vary.
• Standard deviation of estimated slopes = “standard error” (estimated)
• Compute t = (estimate - hypothesized)/standard error
• p-value is the probability of a larger |t| when the hypothesis is correct (e.g. 0 slope)
• p-value is the sum of two tail areas
Traditionally, p<0.05 is taken to mean the hypothesized value is wrong; p>0.05 is inconclusive.
    proc reg data=life;
      model age=line;
    run;

Parameter Estimates
                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1       79.23341     14.83229       5.34      <.0001
Line          1       -1.36697      1.59782      -0.86      0.3965

Tail areas: 0.19825 below t = -0.86 and 0.19825 above t = 0.86, summing to 0.39650.
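The t value and two-sided p-value for the slope can be checked by hand. The sketch below is Python rather than SAS and uses only the standard library, so the tail area is obtained by numerically integrating the Student t density (Simpson's rule). The 48 degrees of freedom come from n = 50 with two estimated parameters; the function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| > |t|): one minus the central area, by Simpson's rule (steps even)."""
    lo, hi = -abs(t), abs(t)
    h = (hi - lo) / steps
    total = t_pdf(lo, df) + t_pdf(hi, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(lo + i * h, df)
    return 1 - total * h / 3


estimate, hypothesized, std_error = -1.36697, 0.0, 1.59782
t = (estimate - hypothesized) / std_error
p = two_sided_p(t, df=48)
print(f"t = {t:.2f}, two-sided p = {p:.4f}")
```

Each tail contributes half of the p-value, matching the two shaded areas on the slide.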
Figure: SSE plotted over an (a,b) grid, truncated at SSE = F*m, where m = minimum SSE. The truncation outlines the 95% confidence ellipsoid. Axes: a, b, SSE.
Conclusion: insufficient evidence against the hypothesis of no linear relationship.

Courtroom:   H0: Innocence                           H1: Guilt                  Beyond reasonable doubt
Regression:  H0: True slope is 0 (no association)    H1: True slope is not 0    P<0.05

Here P=0.3965.
Want an estimate of variability around the true line (σ²). Use sums of squared residuals (SS).
Sum of squared residuals from the mean is “SS(total)”: 9755
Sum of squared residuals around the line is “SS(error)”: 9609
(1) SS(total)-SS(error) is SS(model) = 146
(2) Variance estimate is SS(error)/(degrees of freedom) = 200
(3) SS(model)/SS(total) is R², i.e. the proportion of variability “explained” by the model.

Analysis of Variance
                              Sum of          Mean
Source             DF        Squares        Square    F Value    Pr > F
Model               1      146.51753     146.51753       0.73    0.3965
Error              48     9608.70247     200.18130
Corrected Total    49     9755.22000

Root MSE 14.14854    R-Square 0.0150
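Every entry of this ANOVA table follows from SS(total), SS(error), and the sample size. A short Python sketch of the arithmetic, using the values printed above (variable names are illustrative):

```python
ss_total = 9755.22000   # squared residuals around the mean
ss_error = 9608.70247   # squared residuals around the fitted line
n, k = 50, 1            # observations, predictors

ss_model = ss_total - ss_error       # step (1)
df_error = n - 1 - k
mse = ss_error / df_error            # step (2): variance estimate
root_mse = mse ** 0.5
r_square = ss_model / ss_total       # step (3)
f_value = (ss_model / k) / mse       # model mean square over error mean square

print(f"SS(model) = {ss_model:.5f}")
print(f"MSE = {mse:.5f}, Root MSE = {root_mse:.5f}")
print(f"R-Square = {r_square:.4f}, F = {f_value:.2f}")
```

With a single predictor, F is the square of the slope's t value, which is why both tests report the same p-value of 0.3965.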
Part 2: Multiple Regression
Issues:
(1) Testing joint importance versus individual significance
(2) Prediction versus modeling individual effects
(3) Collinearity (correlation among inputs)

Example: A hypothetical company’s sales Y depend on TV advertising X1 and radio advertising X2.
Y = b0 + b1X1 + b2X2 + e

A two-engine plane can still fly if engine #1 fails.
A two-engine plane can still fly if engine #2 fails.
Neither is critical individually.
Jointly critical (can’t omit both!!)
    Data Sales;
      length sval $8;
      length cval $8;
      input store TV radio sales;
      (more code)
    cards;
    1 869 868 9089
    2 836 820 8290
    (more data)
    40 969 961 10130

(3-D scatter plot with axes Sales, Radio, TV)

    proc g3d data=sales;
      scatter radio*TV=sales/shape=sval color=cval zmin=8000;
    run;
Conclusion: Can predict well with just TV, just radio, or both!

SAS code:
    proc reg data=next;
      model sales = TV radio;

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               2       32660996    16330498     358.84    <.0001   (Can’t omit both)
Error              37        1683844       45509
Corrected Total    39       34344840

Root MSE 213.32908    R-Square 0.9510    (Explaining 95% of variation in sales)

Parameter Estimates
                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      531.11390    359.90429       1.48      0.1485
TV            1        5.00435      5.01845       1.00      0.3251   (can omit TV)
radio         1        4.66752      4.94312       0.94      0.3512   (can omit radio)

Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213).
Estimated Sales = 531 + 5.0 TV + 4.7 radio
Setting TV = radio (an approximate relationship):
Estimated Sales = 531 + 9.7 TV. Is this the BEST TV line?
Estimated Sales = 531 + 9.7 radio. Is this the BEST radio line?

    Proc Reg Data=Stores;
      Model Sales = TV;
      Model Sales = radio;
    run;
Model Sales = TV:

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               1       32620420    32620420     718.84    <.0001
Error              38        1724420       45379
Corrected Total    39       34344840

Root MSE 213.02459    R-Square 0.9498

                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      478.50829    355.05866       1.35      0.1857
TV            1        9.73056      0.36293      26.81      <.0001

Model Sales = radio:

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               1       32615742    32615742     716.79    <.0001
Error              38        1729098       45503
Corrected Total    39       34344840

Root MSE 213.31333    R-Square 0.9497

                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      612.08604    350.59871       1.75      0.0889
radio         1        9.58381      0.35797      26.77      <.0001
Sums of squares capture variation explained by each variable.
Type I: How much when it is added to the model?
Type II: How much when all other variables are present (as if it had been added last)?

Parameter Estimates (TV entered first)
                    Parameter     Standard
Variable    DF       Estimate        Error   t Value   Pr > |t|     Type I SS   Type II SS
Intercept    1      531.11390    359.90429      1.48     0.1485    3964160640        99106
TV           1        5.00435      5.01845      1.00     0.3251      32620420        45254
radio        1        4.66752      4.94312      0.94     0.3512         40576        40576

Parameter Estimates (radio entered first)
                    Parameter     Standard
Variable    DF       Estimate        Error   t Value   Pr > |t|     Type I SS   Type II SS
Intercept    1      531.11390    359.90429      1.48     0.1485    3964160640        99106
radio        1        4.66752      4.94312      0.94     0.3512      32615742        40576
TV           1        5.00435      5.01845      1.00     0.3251         45254        45254
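The Type I and Type II sums of squares tie back to the three model fits already shown. A small Python sketch of that bookkeeping, using only numbers printed in the SAS output (variable names are illustrative):

```python
ss_model_both = 32660996    # model: sales = TV radio
ss_model_tv = 32620420      # model: sales = TV
ss_model_radio = 32615742   # model: sales = radio
mse_both = 45509            # error mean square with both inputs

# Extra variation explained by the variable entered second (added last).
radio_after_tv = ss_model_both - ss_model_tv      # 40576
tv_after_radio = ss_model_both - ss_model_radio   # 45254

# Each "added last" SS divided by the error mean square is a 1-DF F statistic,
# and its square root is the size of the t value in the parameter table.
t_tv = (tv_after_radio / mse_both) ** 0.5
t_radio = (radio_after_tv / mse_both) ** 0.5

print(radio_after_tv, tv_after_radio)
print(f"|t| for TV = {t_tv:.2f}, |t| for radio = {t_radio:.2f}")
```

Each variable explains over 32 million on its own but only about 40 to 45 thousand once the other is present, which is why neither t test is significant.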
Summary: Good predictions are given by
  Sales = 531 + 5.0 x TV + 4.7 x Radio, or
  Sales = 479 + 9.7 x TV, or
  Sales = 612 + 9.6 x Radio, or
  (lots of others)
Why the confusion? The evil multicollinearity!! (correlated X’s)
Those Mysterious “Degrees of Freedom” (DF)
The first Martian gives information about average height, but 0 information about variation.
The 2nd Martian gives the first piece of information (DF) about error variance around the mean.
n Martians give n-1 DF for error (variation).
Plot: Martian Height versus Martian Weight.
2 points give no information on variation of errors.
n points give n-2 error DF.
How Many Table Legs? (regress Y on X1, X2)

                              Sum of        Mean
Source             DF        Squares      Square
Model               2       32660996    16330498
Error              37        1683844       45509
Corrected Total    39       34344840

(Figure labels: error, X2, X1)
Three legs will all touch the floor. The fourth leg gives the first chance to measure error (first error DF).
Fit a plane: n-3 (37) error DF (2 “model” DF, n-1=39 “total” DF)
General idea: Regress Y on X1 X2 … Xk: n-1-k error DF (k “model” DF, n-1 “total” DF)
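The degrees-of-freedom rule is easy to express as a function. This Python sketch (the name `degrees_of_freedom` is illustrative) applies it to the sales example: 40 stores, 2 predictors.

```python
def degrees_of_freedom(n, k):
    """DF split for regressing Y on k predictors (plus intercept) with n cases."""
    return {"model": k, "error": n - 1 - k, "total": n - 1}


sales_df = degrees_of_freedom(n=40, k=2)
print(sales_df)
```

The model and error DF always add up to the total DF, n-1.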
Grades vs. IQ and Study Time

    Data tests;
      input IQ Study_Time Grade;
      IQ_S = IQ*Study_Time;
    cards;
    105 10 75
    110 12 79
    120  6 68
    116 13 85
    122 16 91
    130  8 79
    114 20 98
    102 15 76
    ;
    Proc reg data=tests;
      model Grade = IQ;
    Proc reg data=tests;
      model Grade = IQ Study_Time;

Model Grade = IQ:
                      Parameter     Standard
Variable      DF       Estimate        Error    t Value    Pr > |t|
Intercept      1       62.57113     48.24164       1.30      0.2423
IQ             1        0.16369      0.41877       0.39      0.7094

Model Grade = IQ Study_Time:
                      Parameter     Standard
Variable      DF       Estimate        Error    t Value    Pr > |t|
Intercept      1        0.73655     16.26280       0.05      0.9656
IQ             1        0.47308      0.12998       3.64      0.0149
Study_Time     1        2.10344      0.26418       7.96      0.0005
Contrast: TV advertising loses significance when radio is added. IQ gains significance when study time is added.
Model for Grades: Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time
Question: Does an extra hour of study really deliver 2.10 points for everyone, regardless of IQ? The current model only allows this.
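The two grade models can be refit from the eight data lines in the deck. The following Python sketch solves the normal equations by Gauss-Jordan elimination, a generic textbook approach rather than a description of how PROC REG computes its estimates; `fit_ols` is an illustrative name.

```python
def fit_ols(rows, y):
    """Least squares coefficients [intercept, b1, ...] via the normal equations."""
    X = [[1.0] + [float(v) for v in r] for r in rows]   # prepend intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        # Partial pivoting, then eliminate this column from every other row.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


iq = [105, 110, 120, 116, 122, 130, 114, 102]
study = [10, 12, 6, 13, 16, 8, 20, 15]
grade = [75, 79, 68, 85, 91, 79, 98, 76]

b_iq_only = fit_ols([(q,) for q in iq], grade)
b_both = fit_ols(list(zip(iq, study)), grade)
print("Grade = IQ:            ", [round(b, 5) for b in b_iq_only])
print("Grade = IQ Study_Time: ", [round(b, 5) for b in b_both])
```

The IQ coefficient roughly triples once study time is in the model, the opposite of what happened to TV when radio was added.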
    proc reg;
      model Grade = IQ Study_Time IQ_S;

                              Sum of         Mean
Source             DF        Squares       Square    F Value    Pr > F
Model               3      610.81033    203.60344      26.22    0.0043
Error               4       31.06467      7.76617
Corrected Total     7      641.87500

Root MSE 2.78678    R-Square 0.9516

                      Parameter     Standard
Variable      DF       Estimate        Error    t Value    Pr > |t|
Intercept      1       72.20608     54.07278       1.34      0.2527
IQ             1       -0.13117      0.45530      -0.29      0.7876
Study_Time     1       -4.11107      4.52430      -0.91      0.4149
IQ_S           1        0.05307      0.03858       1.38      0.2410

“Interaction” model:
Predicted Grade = 72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
                = (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time
IQ = 102 predicts Grade = (72.21 - 13.26) + (5.41 - 4.11) x Study Time = 58.95 + 1.30 x Study Time
IQ = 122 predicts Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time
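In an interaction model the study-time slope depends on IQ. A Python sketch of that calculation from the printed estimates (`study_line` is an illustrative name). It uses the full-precision coefficients, so the intercepts differ slightly from the slide's hand calculation, which rounds the IQ coefficient to -0.13.

```python
b0, b_iq, b_study, b_inter = 72.20608, -0.13117, -4.11107, 0.05307


def study_line(iq):
    """Intercept and study-time slope of the fitted model at a fixed IQ."""
    intercept = b0 + b_iq * iq
    slope = b_study + b_inter * iq
    return intercept, slope


for iq in (102, 122):
    intercept, slope = study_line(iq)
    print(f"IQ = {iq}: Grade = {intercept:.2f} + {slope:.2f} x Study Time")
```

Under this fit, an extra hour of study is worth about 1.3 points at IQ 102 and about 2.4 points at IQ 122.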
(Plot of the two fitted lines, labeled Slope = 2.36 and Slope = 1.30)
• Adding interaction makes everything insignificant (individually)!
• Do we need to omit insignificant terms until only significant ones remain?
• Has an acquitted defendant proved his innocence?
• Common sense trumps statistics!
Part 3: Diagnosing Problems in Regression
Main problems are:
• Multicollinearity (correlation among inputs)
• Outliers

    Proc Corr;
      Var TV radio sales;

Pearson Correlation Coefficients, N = 40
Prob > |r| under H0: Rho=0
               TV      radio      sales
TV        1.00000    0.99737    0.97457
                      <.0001     <.0001
radio     0.99737    1.00000    0.97450
           <.0001                <.0001
sales     0.97457    0.97450    1.00000
           <.0001     <.0001

(Plot of TV $ versus Radio $ showing Principal Component Axis 1 and Principal Component Axis 2)
Principal Components
• Center and scale variables to mean 0, variance 1.
• Call these X1 (TV) and X2 (radio).
• With n variables, total variation is n (n=2 here).
• Find the most variable linear combination P1=__X1+__X2.

Correlation of TV and radio: 0.99737 (p <.0001)

Variances are 1.9973 out of 2 (along the P1 axis) and 0.0027 out of 2 (along the P2 axis); the standard deviations are their square roots.
Ratio of standard deviations (27.6) is the “condition number”; a large value means an unstable regression.
Rule of thumb: Ratio 1 is perfect, >30 problematic. Spread on the long axis is 27.6 times that on the short axis.

Variance Inflation Factor
(1) Regress predictor i on all the others, getting r-square: Ri²
(2) VIF is 1/(1 - Ri²) for variable i (measures collinearity).
(3) VIF > 10 is a problem.
Example:
    Proc Reg Data=Sales;
      Model Sales = TV Radio/VIF collinoint;

Parameter Estimates
                     Parameter     Standard                            Variance
Variable     DF       Estimate        Error    t Value    Pr > |t|    Inflation
Intercept     1      531.11390    359.90429       1.48      0.1485            0
TV            1        5.00435      5.01845       1.00      0.3251    190.65722
radio         1        4.66752      4.94312       0.94      0.3512    190.65722

Collinearity Diagnostics (intercept adjusted)
                        Condition    --Proportion of Variation--
Number    Eigenvalue        Index          TV         radio
     1       1.99737      1.00000     0.00131       0.00131
     2       0.00263     27.57948     0.99869       0.99869

We have a MAJOR problem! (Note: other diagnostics besides VIF and condition number are available.)
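With only two predictors, both diagnostics follow directly from the TV-radio correlation: regressing one predictor on the other gives R² = r², and the 2×2 correlation matrix has eigenvalues 1 + r and 1 - r. The Python sketch below uses the printed, rounded r = 0.99737, so its results agree with the SAS values only to about three significant digits.

```python
r = 0.99737                       # correlation of TV and radio (rounded, from Proc Corr)

vif = 1 / (1 - r ** 2)            # same value for TV and for radio

eig_large, eig_small = 1 + r, 1 - r   # variances along P1 and P2; they sum to 2
condition_number = (eig_large / eig_small) ** 0.5   # ratio of standard deviations

print(f"VIF = {vif:.1f}")
print(f"eigenvalues = {eig_large:.5f}, {eig_small:.5f}")
print(f"condition number = {condition_number:.1f}")
```

A VIF near 190 is far beyond the rule-of-thumb cutoff of 10, while the condition number sits just under the cutoff of 30.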
(Plot of TV versus radio: both axes run from about 800 to 1200, and the points lie close to a straight line.)

Another problem: Outliers
Example: Add one point to the TV-Radio data: TV 1021, radio 954, Sales 9020

    Proc Reg:
      Model Sales = TV radio/ p r;

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               2       33190059    16595030     314.07    <.0001
Error              38        2007865       52839
Corrected Total    40       35197924

Root MSE 229.86639    R-Square 0.9430

Parameter Estimates
                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      689.01260    382.52628       1.80      0.0796
TV            1       -6.28994      2.90505      -2.17      0.0367   (???)
radio         1       15.78081      2.86870       5.50      <.0001

        Dependent   Predicted                  Std Error     Student                     Cook's
 Obs     Variable       Value     Residual      Residual    Residual     -2-1 0 1 2           D
  39         9277        9430    -153.4358         225.3      -0.681    |     *|      |     0.006
  40        10130        9759     370.5848         226.1       1.639    |      |***   |     0.030
  41         9020        9322    -301.8727         121.9      -2.476    |  ****|      |     5.224
Ordinary residual for store 41 is not too bad (-301.87).
• PRESS residuals
  • Remove store i, Sales Y(i)
  • Fit model to the other 40 stores
  • Get model prediction P(i) for store i
  • PRESS residual is Y(i)-P(i)

    proc reg data=raw;
      model sales = TV radio;
      output out=out1 r=r press=press;
    run;

(Plot: regular (O) and PRESS (dot) residuals, viewed along the P2 axis (2nd principal component); store number 41 is labeled.)
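The PRESS steps above can be sketched as a leave-one-out loop. This Python example uses a single predictor and hypothetical data (not the store data), with the last point placed far from the others. It also checks the standard shortcut PRESS = residual/(1 - leverage), where the simple-regression leverage is 1/n + (x - x̄)²/Sxx.

```python
def fit_line(xs, ys):
    """Least squares (intercept, slope) for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data (assumption); the last case is an unusual point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 15.0]

n = len(x)
a, b = fit_line(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

ordinary, press, shortcut = [], [], []
for i in range(n):
    e_i = y[i] - (a + b * x[i])                           # ordinary residual
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])  # refit without case i
    press.append(y[i] - (a_i + b_i * x[i]))               # Y(i) - P(i)
    h_i = 1 / n + (x[i] - x_bar) ** 2 / sxx               # leverage of case i
    ordinary.append(e_i)
    shortcut.append(e_i / (1 - h_i))

for i in range(n):
    print(f"case {i + 1}: ordinary = {ordinary[i]:8.3f}, PRESS = {press[i]:8.3f}")
```

A high-leverage point pulls the fitted line toward itself, so its ordinary residual can look modest while its PRESS residual is much larger.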
Part 4: Classification Variables (dummy variables, indicator variables)
Predicted Accidents = 1181 + 2579 X11
X11 is 1 in November, 0 elsewhere.
Interpretation: In November, predict 1181 + 2579(1) = 3760. In any other month, predict 1181 + 2579(0) = 1181.
1181 is the average of the other months. 2579 is the added November effect (vs. the average of the others).

Model for NC Crashes involving Deer:
    Proc reg data=deer;
      model deer = X11;

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               1       30473250    30473250      90.45    <.0001
Error              58       19539666      336891
Corrected Total    59       50012916

Root MSE 580.42294    R-Square 0.6093

                                   Parameter     Standard
Variable     Label        DF        Estimate        Error    t Value    Pr > |t|
Intercept    Intercept     1      1181.09091     78.26421      15.09      <.0001
X11                        1      2578.50909    271.11519       9.51      <.0001
Looks like December and October need dummies too!
    Proc reg data=deer;
      model deer = X10 X11 X12;

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model               3       46152434    15384145     223.16    <.0001
Error              56        3860482       68937
Corrected Total    59       50012916

Root MSE 262.55890    R-Square 0.9228

                                   Parameter     Standard
Variable     Label        DF        Estimate        Error    t Value    Pr > |t|
Intercept    Intercept     1       929.40000     39.13997      23.75      <.0001
X10                        1      1391.20000    123.77145      11.24      <.0001
X11                        1      2830.20000    123.77145      22.87      <.0001
X12                        1      1377.40000    123.77145      11.13      <.0001

Average of Jan through Sept. is 929 crashes per month. Add 1391 in October, 2830 in November, 1377 in December.
What the heck, let’s do all but one (need “average of rest”, so must leave out at least one).
    Proc reg data=deer;
      model deer = X1 X2 … X10 X11;

Analysis of Variance
                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F
Model              11       48421690     4401972     132.79    <.0001
Error              48        1591226       33151
Corrected Total    59       50012916

Root MSE 182.07290    R-Square 0.9682

Parameter Estimates
                                   Parameter     Standard
Variable     Label        DF        Estimate        Error    t Value    Pr > |t|
Intercept    Intercept     1      2306.80000     81.42548      28.33      <.0001
X1                         1      -885.80000    115.15301      -7.69      <.0001
X2                         1     -1181.40000    115.15301     -10.26      <.0001
X3                         1     -1220.20000    115.15301     -10.60      <.0001
X4                         1     -1486.80000    115.15301     -12.91      <.0001
X5                         1     -1526.80000    115.15301     -13.26      <.0001
X6                         1     -1433.00000    115.15301     -12.44      <.0001
X7                         1     -1559.20000    115.15301     -13.54      <.0001
X8                         1     -1646.20000    115.15301     -14.30      <.0001
X9                         1     -1457.20000    115.15301     -12.65      <.0001
X10                        1        13.80000    115.15301       0.12      0.9051
X11                        1      1452.80000    115.15301      12.62      <.0001

Average of rest is just the December mean, 2307. Subtract 886 in January, add 1453 in November. October (X10) is not significantly different from December.
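The three dummy-variable models describe the same monthly means, so their coefficients can be cross-checked against each other. The Python sketch below rebuilds the monthly means from the 11-dummy fit and recovers the intercepts and effects of the other two models. It assumes each month has the same number of observations (60 rows, so presumably 5 per month), which is what makes an unweighted average of monthly means appropriate.

```python
december_mean = 2306.8   # intercept of the 11-dummy model
offsets = {              # dummy coefficients X1..X11 (difference from December)
    "Jan": -885.8, "Feb": -1181.4, "Mar": -1220.2, "Apr": -1486.8,
    "May": -1526.8, "Jun": -1433.0, "Jul": -1559.2, "Aug": -1646.2,
    "Sep": -1457.2, "Oct": 13.8, "Nov": 1452.8,
}
means = {month: december_mean + d for month, d in offsets.items()}
means["Dec"] = december_mean

# Model deer = X10 X11 X12: intercept is the Jan-Sep average (balanced data assumed).
jan_sep = [means[m] for m in
           ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep")]
base_three_dummies = sum(jan_sep) / len(jan_sep)
oct_effect = means["Oct"] - base_three_dummies

# Model deer = X11: intercept is the average of the 11 non-November months.
non_nov = [v for m, v in means.items() if m != "Nov"]
base_one_dummy = sum(non_nov) / len(non_nov)
nov_effect = means["Nov"] - base_one_dummy

print(f"Jan-Sep average = {base_three_dummies:.1f}, October effect = {oct_effect:.1f}")
print(f"non-November average = {base_one_dummy:.2f}, November effect = {nov_effect:.2f}")
print(f"November mean = {means['Nov']:.1f}")
```

The choice of which months get dummies changes what the intercept means, not the fitted monthly means for the months that have their own dummy.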