6.1.4 AIC, Model Selection, and the Correct Model

• Any model is a simplification of reality
• If a model has relatively little bias, it tends to provide accurate estimates of the quantities of interest
• The best model is often the simplest one (fewer parameters): model parsimony

Akaike Information Criterion (AIC): an alternative to significance tests when the goal is to estimate quantities of interest
• Criterion for choosing between competing statistical models
• AIC judges a model by how close its fitted values tend to be to the true values
• AIC selects the model that minimizes: AIC = -2(maximized log likelihood - number of parameters in the model)
• This penalizes a model for having too many parameters
• Serves the purpose of model comparison only; it does not provide a diagnostic of how well the model fits the data

AIC = -2(maximized log likelihood - number of parameters in the model)
In SAS: AIC = -2LogL + 2p

Crab Example: Table 6.2 (p. 215)
• The best models have the smallest AIC values
• The best model has the main effects COLOR and WIDTH (AIC = 197.5)

PROC LOGISTIC (Backward Elimination):
    proc logistic descending;
      class color spine / param = ref;
      model y = width weight color spine / selection = backward lackfit;
    run;

Backward Elimination Procedure
Step 0. The following effects were entered: Intercept width weight color spine
Step 1. Effect spine is removed
Step 2. Effect weight is removed

In our case, the AIC reported for the intercept-only model is the same at every step: 227.759 = -2LogL + 2p = 225.759 + 2(1), where p = 1
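The AIC arithmetic above can be checked with a few lines of code. This is a minimal sketch in Python rather than SAS, so it can be run without a SAS installation; the function name `aic` is a hypothetical helper, and the inputs are the -2LogL and p values quoted on the slide.

```python
def aic(minus2_loglik: float, n_params: int) -> float:
    """AIC in the form SAS reports it: -2 Log L + 2p."""
    return minus2_loglik + 2 * n_params


# Values quoted above: -2 Log L = 225.759 with p = 1 (intercept only).
aic_intercept_only = aic(225.759, 1)
print(round(aic_intercept_only, 3))  # 227.759
```

When comparing candidate models, the same function is applied to each model's -2LogL and parameter count, and the model with the smallest result is preferred.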

6.1.5 Using Causal Hypotheses to Guide Model Building
• Rather than using selection techniques such as stepwise, which look at the significance level of each parameter, use theory and common sense to build a model (add and remove parameters in a way that makes sense)
• A time ordering among variables may suggest causal relationships

Example: (Table 6.3, p. 217)
In a British study, 1036 men and women (married and divorced) were asked whether they had had premarital and/or extramarital sex. We want to determine whether G = gender, P = premarital sex, and E = extramarital sex are factors in whether a person is married or divorced (M = marital status).

Simple Model: G → P → E → M
Any of these is an explanatory variable when a variable listed to its right is the response.

Complex Model (Triangular): (Fig. 3.1, p. 218)
1st stage: predicts G has a direct effect on P
2nd stage: predicts P and G have direct effects on E
3rd stage: predicts E has a direct effect on M; P has direct and indirect effects on M; G has indirect effects through P and E

Table 6.4: Goodness of Fit Tests for Model Selection

1st Stage: predicts Gender has a direct effect on Premarital Sex
The estimated odds of premarital sex for females are .27 times the odds for males.

    data causal2;
      input gender $ PMS TOTALPMS;
      datalines;
    F 100 676
    M 141 360
    ;

Model (Response P, no explanatory variable)
    PROC GENMOD DATA = CAUSAL2 DESCENDING;
      CLASS GENDER;
      MODEL PMS/TOTALPMS = / DIST = BIN LINK = LOGIT;

Model (Response P, explanatory variable G)
    PROC GENMOD DATA = CAUSAL2 DESCENDING;
      CLASS GENDER;
      MODEL PMS/TOTALPMS = GENDER / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;
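The odds ratio of .27 quoted for the first stage can be reproduced directly from the two rows of the causal2 data step. The sketch below is in Python rather than SAS; the helper name `odds` is hypothetical, and the counts are the ones in the data step.

```python
# Counts from the causal2 data step: (premarital sex "yes", group total).
female_yes, female_total = 100, 676
male_yes, male_total = 141, 360


def odds(successes: int, total: int) -> float:
    """Odds of the event: successes divided by failures."""
    return successes / (total - successes)


# Odds of premarital sex for females relative to males.
odds_ratio = odds(female_yes, female_total) / odds(male_yes, male_total)
print(round(odds_ratio, 2))  # 0.27
```

Because the model with GENDER as the only explanatory variable is saturated for these two groups, the exponentiated GENDER coefficient from that fit equals this sample odds ratio.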

Goodness of Fit as a Likelihood-Ratio Test
The likelihood-ratio statistic -2(L0 - L1) tests whether certain model parameters are zero by comparing the maximized log likelihood L1 for the fitted model M1 with L0 for the simpler model M0 (formula p. 187).
For the example, we use the fact that -2(L0 - L1) = G²(M0) - G²(M1), with the values taken from the SAS output.

1st Stage:
G² = G²(M0) - G²(M1) = 75.2594 - 0.0000 = 75.2594
-2(L0 - L1) = -2(-561.9568 - (-524.3271)) = 75.2594
df = 1 - 0 = 1, so the χ² p-value < .001. There is evidence of a gender effect on premarital sex, so the model with G as an explanatory variable fits better.

2nd Stage: predicts Gender and Premarital Sex have direct effects on Extramarital Sex

    data causal3;
      input gender $ PMS $ EMS TOTALEMS;
      datalines;
    F Y 21 100
    F N 40 576
    M Y 39 141
    M N 21 219
    ;
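The first-stage likelihood-ratio test above can be checked numerically. The following is a minimal sketch in Python (not SAS) using only the standard library; the function names are hypothetical. For 1 degree of freedom, the chi-squared upper-tail probability equals `erfc(sqrt(x / 2))`, which avoids needing a statistics package.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def lr_statistic(loglik_simple: float, loglik_full: float) -> float:
    """Likelihood-ratio statistic -2(L0 - L1)."""
    return -2.0 * (loglik_simple - loglik_full)


# Log likelihoods quoted above for the first stage (M0: no predictors, M1: G).
stat = lr_statistic(-561.9568, -524.3271)
p_value = chi2_sf_df1(stat)
print(round(stat, 4), p_value < 0.001)  # 75.2594 True
```

The same two functions apply to any pair of nested models that differ by one parameter; for larger differences in df a general chi-squared tail function would be needed.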

Model (Response E, no explanatory variable)
    PROC GENMOD DATA = CAUSAL3 DESCENDING;
      CLASS GENDER PMS;
      MODEL EMS/TOTALEMS = / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;

Model (Response E, explanatory variable P)
    PROC GENMOD DATA = CAUSAL3;
      CLASS GENDER PMS;
      MODEL EMS/TOTALEMS = PMS / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;

Model (Response E, explanatory variables G + P)
    PROC GENMOD DATA = CAUSAL3 DESCENDING;
      CLASS GENDER PMS;
      MODEL EMS/TOTALEMS = GENDER PMS / DIST = BIN LINK = LOGIT TYPE3 RESIDUALS OBSTATS;

Model E = 1 vs. E = P
G²(M0) - G²(M1) = 48.9244 - 2.9080 = 46.016
-2(L0 - L1) = -2(-373.4687 - (-350.4605)) = 46.016
df = 3 - 2 = 1, so the χ² p-value < .001; there is evidence of a P effect on E.

Model E = P vs. E = G + P
G² = G²(M0) - G²(M1) = 2.9080 - .0008 = 2.9
df = 2 - 1 = 1, so .05 < χ² p-value < .10; there is only weak evidence that G has a direct effect as well as an indirect effect on E. So E = P is a sufficient model.
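The deviances G² for the two simplest second-stage models can be reproduced from the causal3 counts without fitting software, because their maximum likelihood fitted probabilities are pooled sample proportions. The sketch below is in Python rather than SAS, and the helper names `deviance` and `pooled` are hypothetical. The model E = G + P needs an iterative fit and is not covered here.

```python
import math

# (extramarital "yes", total) from the causal3 data step, keyed by (gender, PMS).
counts = {
    ("F", "Y"): (21, 100),
    ("F", "N"): (40, 576),
    ("M", "Y"): (39, 141),
    ("M", "N"): (21, 219),
}


def pooled(keys):
    """Pooled proportion of 'yes' over the given groups."""
    keys = list(keys)
    yes = sum(counts[k][0] for k in keys)
    total = sum(counts[k][1] for k in keys)
    return yes / total


def deviance(fitted_prob):
    """Binomial deviance G^2 for grouped data, given fitted probabilities."""
    g2 = 0.0
    for key, (y, n) in counts.items():
        mu = n * fitted_prob(key)
        g2 += 2.0 * (y * math.log(y / mu)
                     + (n - y) * math.log((n - y) / (n - mu)))
    return g2


# E = 1: one common probability. E = P: one probability per PMS level.
p_all = pooled(counts)
p_by_pms = {pms: pooled(k for k in counts if k[1] == pms) for pms in ("Y", "N")}

g2_null = deviance(lambda key: p_all)
g2_pms = deviance(lambda key: p_by_pms[key[1]])
print(round(g2_null, 2), round(g2_pms, 2), round(g2_null - g2_pms, 2))
```

The two deviances should agree with the 48.9244 and 2.9080 quoted above up to rounding, and their difference is the test statistic for the P effect.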

3rd Stage: predicts Extramarital Sex has a direct effect on Marriage; Premarital Sex has direct and indirect effects on Marriage; Gender has indirect effects through PMS and EMS

    data causal;
      input gender $ PMS $ EMS $ DIVORCED TOTAL;
      datalines;
    F Y Y 17 21
    F Y N 54 79
    F N Y 36 40
    F N N 214 536
    M Y Y 28 39
    M Y N 60 102
    M N Y 17 21
    M N N 68 198
    ;

Model M = E + P vs. M = E*P
G² = G²(M0) - G²(M1) = 18.1596 - 5.2455 = 12.91, with df = 5 - 4 = 1, so the χ² p-value < .001; the model with the EMS*PMS interaction predicts Divorce better.

Model M = E*P vs. M = E*P + G
G² = G²(M0) - G²(M1) = 5.2455 - .6978 = 4.5477, with df = 4 - 3 = 1, so .025 < χ² p-value < .05; adding G to the EMS*PMS interaction model fits slightly better.

Conclusion for Causal Relationships
Using common sense to hypothesize relationships is a good alternative approach to model building.
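The E*P interaction in the third-stage models can be seen in the raw counts: the odds ratio relating extramarital sex to divorce differs by premarital-sex status. The sketch below is in Python rather than SAS. It pools the causal data step over gender, which matches a model without G but is otherwise an illustrative assumption; the helper name `pooled_odds` is hypothetical.

```python
# (divorced, total) from the causal data step, keyed by (gender, PMS, EMS).
counts = {
    ("F", "Y", "Y"): (17, 21), ("F", "Y", "N"): (54, 79),
    ("F", "N", "Y"): (36, 40), ("F", "N", "N"): (214, 536),
    ("M", "Y", "Y"): (28, 39), ("M", "Y", "N"): (60, 102),
    ("M", "N", "Y"): (17, 21), ("M", "N", "N"): (68, 198),
}


def pooled_odds(pms: str, ems: str) -> float:
    """Odds of divorce for a (PMS, EMS) cell, pooled over gender."""
    divorced = sum(d for (g, p, e), (d, t) in counts.items()
                   if p == pms and e == ems)
    total = sum(t for (g, p, e), (d, t) in counts.items()
                if p == pms and e == ems)
    return divorced / (total - divorced)


# Odds ratio for divorce (EMS yes vs. no) within each PMS level.
or_given_pms = {pms: pooled_odds(pms, "Y") / pooled_odds(pms, "N")
                for pms in ("Y", "N")}
print({pms: round(value, 2) for pms, value in or_given_pms.items()})
```

The two odds ratios are far apart (roughly 1.8 with premarital sex versus roughly 10.6 without), which is the pattern an E + P model without interaction cannot reproduce.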

6.1.6 New Model-Building Strategies for Data Mining
• Data mining is the analysis of huge data sets in order to find previously unsuspected relationships that are of interest or value
• Model building is challenging
• There are alternatives to traditional statistical methods, such as automated algorithms that ignore concepts such as sampling error and modeling
• Significance tests are usually irrelevant, since nearly any variable has a significant effect if n is sufficiently large
• For large n, inference is less relevant than summary measures of predictive power
