
Regression II Model Selection



  1. Regression II: Model Selection • Model selection based on t-distribution • Information criteria • Cross-validation

  2. Introduction When modelling the relationship between input and output variables we need to make decisions about the form of the model. If there are several input parameters, then one modelling decision is which input variables are important in predicting the output. Say we have several input variables and the proposed model is linear: y = β0 + β1·x1 + β2·x2 + ε. If some of the coefficients are 0 then the corresponding input variables have no effect on predicting the response y. Another situation arises when we want to explain the output using polynomials in an input variable: y = β0 + β1·x1 + β2·x1^2 + ε. In these cases we may want to know whether some of the coefficients are 0. E.g. if β2 is 0 then the relationship between input and output is linear; otherwise the relationship may be a higher-order polynomial. In general, if we have functions gk of increasing complexity, then it is likely that the functions of higher complexity will give better agreement between input and output. One way of measuring agreement is the residual sum of squares (RSS): RSS = Σi (yi − gk(xi))^2, where i is the observation number and k is the function (model) number.
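As a toy illustration of comparing candidate models by RSS, here is a minimal Python sketch. The data and the candidate functions g1 and g2 are hypothetical (they are not the weight–height data used later in the slides):

```python
# Hypothetical toy data: y is roughly 2*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def rss(g, xs, ys):
    # Residual sum of squares: RSS = sum_i (y_i - g(x_i))^2
    return sum((y - g(x)) ** 2 for x, y in zip(xs, ys))

g1 = lambda x: 2.0 * x                    # simple candidate model
g2 = lambda x: 2.0 * x + 0.01 * x ** 2    # slightly more complex candidate

print(rss(g1, xs, ys))  # 0.11 (up to floating-point rounding)
```

Note that a smaller RSS alone does not make a model better; the rest of the lecture is about penalising complexity.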

  3. Introduction One very simple way of testing whether a particular coefficient of the linear model is 0 is the hypothesis test: H0: βj = 0 versus H1: βj ≠ 0. As we already know, to carry out this test we need to know the distribution of the estimated parameters. We know that for large samples the distribution of the parameters is asymptotically normal. We also know that the covariance matrix of this distribution is V(β̂) = σ^2 (X^T X)^(−1), where X is the matrix of inputs with rows (xi0, xi1, …) and all xi0 are 1. The number of observations is n and the number of parameters is p. Under H0 the distribution of β̂j is normal with mean 0 and variance equal to the corresponding diagonal term of the covariance matrix. Therefore the distribution of t = β̂j / se(β̂j), where se(β̂j) is the estimated standard error of β̂j, is the t-distribution with n−p degrees of freedom.

  4. Example Let us take the example of weight vs height and look at the results of linear model fitting using polynomials in the input variable (height) up to fifth order. The plots show that a second-order polynomial fits sufficiently well; after the second order the improvements are marginal. A natural question is which model is better: can we justify a more complex model based on the observations we have?

  5. Model selection using t-distribution: summary of lm method The distributions of the estimated parameters are used in the summary of the lm command. Let us say we want to see whether some of the coefficients of the model y = β0 + β1·x + β2·x^2 + β3·x^3 + β4·x^4 + β5·x^5 + ε are 0. Fitting the model using

lm5 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5))
summary(lm5)

will produce a table with significance information for each estimated parameter:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.144e+04  8.300e+04   1.102    0.299
height      -6.947e+03  6.418e+03  -1.083    0.307
I(height^2)  2.107e+02  1.982e+02   1.063    0.315
I(height^3) -3.187e+00  3.058e+00  -1.042    0.325
I(height^4)  2.403e-02  2.355e-02   1.020    0.334
I(height^5) -7.224e-05  7.246e-05  -0.997    0.345

Residual standard error: 0.2245 on 9 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
F-statistic: 1.335e+04 on 5 and 9 DF, p-value: < 2.2e-16

  6. Model selection using t-distribution More details of this table will be described in the tutorial. Here we are interested in not rejecting the null hypothesis (that a particular parameter is 0). If we are using the t-distribution to make such a decision, then we take the largest p-value and remove the corresponding coefficient from the model; in this case it is the coefficient corresponding to height^5. After removing it we can fit the reduced model:

lm4 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4))
summary(lm4)

This produces significance levels for the coefficients:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.824e+03  4.550e+03   1.939   0.0812 .
height      -5.552e+02  2.814e+02  -1.973   0.0768 .
I(height^2)  1.319e+01  6.515e+00   2.024   0.0705 .
I(height^3) -1.389e-01  6.693e-02  -2.076   0.0646 .
I(height^4)  5.507e-04  2.574e-04   2.140   0.0581 .

Now we cannot remove any of the coefficients: the p-values are not large enough to say that a particular coefficient is insignificant.

  7. Model selection using t-distribution The summary command applied after the lm command allows a simple way of model selection. It works as follows: 1) Specify the full model (this may be a problem: a model can always be made more complex). 2) Fit the model using lm. 3) Use summary and remove the coefficient with the highest p-value (say, p-values above 0.1 or 0.05; the threshold is arbitrary and depends on how much one wants to reduce the model). 4) Repeat 2) and 3) until the model can no longer be reduced. Warning: multiple hypothesis testing may produce misleading results. For example, if we work at significance level α and test 10 hypotheses, then the probability that at least one hypothesis will appear significant when it is not is 1 − (1−α)^10. If α = 0.05 this probability is about 0.40, and if we test 100 hypotheses it is about 0.99. There are corrections for multiple hypothesis testing; we will learn some of them in later lectures. Note that when we used summary we deliberately did not rely on rejecting the null hypothesis; we used high p-values in order not to reject it.
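The multiple-testing probabilities above are easy to check numerically. A small Python sketch (the function name familywise_error is ours, for illustration only):

```python
# Probability that at least one of m independent tests at level alpha
# rejects a true null hypothesis: 1 - (1 - alpha)^m.
def familywise_error(alpha, m):
    return 1.0 - (1.0 - alpha) ** m

print(round(familywise_error(0.05, 10), 2))   # 0.4
print(round(familywise_error(0.05, 100), 2))  # 0.99
```

This is exactly why a long sequence of coefficient tests should be interpreted with caution.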

  8. Akaike’s information criterion Another model selection tool is Akaike’s information criterion (AIC). This technique is based on the value of the likelihood at its maximum and uses the number of parameters to penalise complex models. Let us recall the maximum likelihood technique. The likelihood is proportional to the conditional distribution of the observations (outputs, responses) given the parameters of the model: L(β) ∝ f(y1, …, yn | β). In maximum likelihood estimation we maximise the likelihood function with respect to the parameters of the model. By making the model more and more complex we can get larger and larger likelihood values, but is the more complex model justified? Once we have maximised the likelihood function we can calculate its value at the maximum, Lmax. Akaike’s information criterion as implemented in R uses the formula: IC = −2·log(Lmax) + c·p, where c is a constant (it may depend on the number of observations) that defines the various forms of the information criterion, and p is the effective number of parameters in the model. In a linear model p is equal to the number of parameters. When c = 2 we have the classic AIC, and when c = log(n) we have the Bayesian information criterion (BIC).

  9. Akaike’s information criterion To apply AIC to linear models we need to specify the likelihood function and find its value at the maximum. We can consider linear models with least-squares minimisation as a special case of maximum likelihood, with likelihood L(β, σ1, …, σn) = Πi (1/(√(2π)·σi)) exp(−(yi − g(xi; β))^2 / (2σi^2)), where σi is the standard deviation of the i-th observation. When all σi are equal to each other (denote them σ) and σ is unknown, we can write: −2·log L = n·log(2πσ^2) + RSS(β)/σ^2. Once we have found the maximum with respect to β (which minimises RSS) we can carry on and find the maximum with respect to σ: σ̂^2 = RSS/n. Note that the maximum likelihood estimator of the variance is not unbiased; the unbiased estimator is s^2 = RSS/(n−p).

  10. Akaike’s information criterion Let us find the value of the likelihood at the maximum. Substituting σ̂^2 = RSS/n we have: −2·log(Lmax) = n·log(2π·RSS/n) + n. If we put this into the expression for the information criterion we get: IC = n·log(2π·RSS/n) + n + c·p. In model selection the number of observations does not change, so dropping the constant terms AIC can be written: AIC = n·log(RSS/n) + c·p. If we use a more complex model then RSS will be smaller. On the other hand, a model with a smaller number of parameters may have better predictive power. AIC attempts to combine these two conflicting ideas. When comparing two models, the model with the smaller AIC is considered better.
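The last form of AIC is simple arithmetic on RSS. A Python sketch (the helper information_criterion is ours; the RSS values 0.504 and 0.454 and n = 15 are taken from the step output shown on a later slide):

```python
import math

# AIC = n*log(RSS/n) + c*p, up to an additive constant.
# c = 2 gives the classic AIC; c = log(n) gives BIC.
def information_criterion(rss, n, p, c=2.0):
    return n * math.log(rss / n) + c * p

# Fourth-order model: RSS = 0.504 with p = 5 coefficients;
# fifth-order model:  RSS = 0.454 with p = 6 coefficients.
print(information_criterion(0.504, 15, 5))  # about -40.9
print(information_criterion(0.454, 15, 6))  # about -40.5
```

The extra height^5 term lowers RSS but the penalty c·p outweighs the gain, so the fourth-order model has the smaller AIC.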

  11. Akaike’s information criterion in R There are a number of functions in R that use AIC. The simplest of them is extractAIC. This function gives the number of parameters as well as the AIC. Another function is step. In its simplest form this function removes one parameter at a time and calculates the AIC. If removal of a parameter decreases the AIC then this parameter is removed from further consideration. The procedure is applied iteratively. On the following slide the result of the step function is shown.

  12. Akaike’s information criterion Stage 1: the procedure removes one parameter at a time; removal of height^5 reduces the AIC.

Start: AIC=-40.48
weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5)
              Df Sum of Sq   RSS     AIC
- I(height^5)  1     0.050 0.504 -40.910
- I(height^4)  1     0.052 0.506 -40.840
- I(height^3)  1     0.055 0.508 -40.773
- I(height^2)  1     0.057 0.510 -40.708
- height       1     0.059 0.513 -40.646
<none>                     0.454 -40.482

Stage 2: the procedure tries to reduce the number of parameters further, but no removal improves the AIC, so the current model (<none>) is kept.

Step: AIC=-40.91
weight ~ height + I(height^2) + I(height^3) + I(height^4)
              Df Sum of Sq   RSS     AIC
<none>                     0.504 -40.910
- height       1     0.196 0.700 -37.979
- I(height^2)  1     0.206 0.710 -37.759
- I(height^3)  1     0.217 0.721 -37.536
- I(height^4)  1     0.231 0.734 -37.256

Final stage: the “best” model is given.

Call: lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))
Coefficients:
(Intercept)       height  I(height^2)  I(height^3)  I(height^4)
  8.824e+03   -5.552e+02    1.319e+01   -1.389e-01    5.507e-04

  13. Akaike’s information criterion: further improvements There are several additional forms of AIC that try to correct for small samples. For example, the corrected AIC has the expression: AICc = AIC + 2p(p+1)/(n−p−1). Base R does not produce AICc (as far as I am aware). However, it can be calculated using, for example, extractAIC(lm4) + 2*p*(p+1)/(n-p-1), where n is the number of observations and p is the number of parameters. For example, for the fourth-order polynomial fit of the weight <–> height relationship we can use: lm4 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4)); extractAIC(lm4)+2*5*6/(15-5-1), which produces the value −34.24, and for the reduced model lm3 = lm(weight ~ height + I(height^2) + I(height^3)); extractAIC(lm3)+2*4*5/(15-4-1), which produces −33.26. Since the more complex model has the lower AICc, it is preferred.
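The AICc arithmetic can be sketched in Python (the helper aicc is ours; the AIC inputs are the extractAIC values implied by the slide, about −40.91 for the fourth-order fit and about −37.26 for the third-order fit):

```python
# Small-sample correction: AICc = AIC + 2*p*(p+1)/(n - p - 1).
def aicc(aic, n, p):
    return aic + 2.0 * p * (p + 1) / (n - p - 1)

print(round(aicc(-40.91, 15, 5), 2))  # -34.24, fourth-order fit
print(round(aicc(-37.26, 15, 4), 2))  # -33.26, third-order fit
```

The correction grows quickly as p approaches n, which is why AICc matters for small samples like the n = 15 women data.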

  14. AIC, BIC Both AIC and BIC attempt to produce the minimal model that describes the observations as well as possible. BIC penalises complexity more strongly than AIC, so it prefers simpler models. In applications it is generally better to use common sense and look at the parameters very carefully: if you can find an interpretation for all terms, and reasons for them being there, then they may be valid. Even when the “best” model has been selected, it is better to come back to model selection again if new observations become available. Absolute values of AIC (or BIC) are not important; what matters are the differences between the AICs of different models. AIC (and BIC) can be used to compare models fitted to the same observations. If the observations change (for example, a few outliers are removed) then AIC should be applied again.

  15. Cross-validation If we have a sample of observations, can we use this sample to choose among given models? Cross-validation attempts to reduce overfitting and thus helps model selection. Description of cross-validation. We have a sample of size n – (yi, xi): • Divide the sample into K roughly equal-sized parts. • For the k-th part, estimate the parameters using the other K−1 parts (excluding the k-th part), and calculate the prediction error for the k-th part. • Repeat for all k = 1, 2, …, K and combine all the prediction errors to get the cross-validation prediction error. If K = n then we have the leave-one-out cross-validation technique. Let us denote the estimate at the k-th step by β̂(k) (it is a vector of parameters), let the k-th subset of the sample be Ak, and let the number of points in this subset be Nk. Then the prediction error per observation is: CV = (1/n) Σk Σ over i in Ak of (yi − g(xi; β̂(k)))^2. We would then choose the function that gives the smallest prediction error, and we can expect that in future, when we have new observations, this function will give the smallest prediction error. This technique is widely used in modern statistical analysis. It is not restricted to the least-squares technique; instead of least squares we could use other techniques such as maximum likelihood, Bayesian estimation or M-estimation.
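The leave-one-out special case (K = n) can be sketched in pure Python for a straight-line least-squares fit. The helpers fit_line and loo_cv and the data are illustrative inventions, not part of the lecture's R code:

```python
def fit_line(pts):
    # Ordinary least squares for y = a + b*x.
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return my - b * mx, b

def loo_cv(pts):
    # For each k: fit without point k, then record the squared error
    # of predicting the held-out point.  Return the mean.
    total = 0.0
    for k, (xk, yk) in enumerate(pts):
        a, b = fit_line(pts[:k] + pts[k + 1:])
        total += (yk - (a + b * xk)) ** 2
    return total / len(pts)

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
print(loo_cv(data))
```

Each point is predicted by a model that never saw it, which is what protects cross-validation against overfitting.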

  16. Cross-validation in R There is no cross-validation method for lm in R, but there is a cross-validation function for glm (generalised linear models) in the library boot. This function requires that the responses are stored in the output from the model fitting. We can modify the output from lm slightly and then use cross-validation for glm. The following sequence of commands illustrates this:

lm1 = lm(weight ~ height, data=women)
lm1$y = women$weight # Add responses to the output from lm
cv.glm(women, lm1)$delta # Do cross-validation

The cv.glm function by default uses leave-one-out cross-validation.


  18. Cross-validation in R Let us apply cross-validation to the weight ~ height relationship.

require(boot)
require(MASS)
lm1 = lm(weight ~ height, data=women)
lm1$y = women$weight
cv.glm(women, lm1)$delta

This sequence of commands produces the following, so the prediction error is 3.04:

1 1
3.040776 3.002450

lm1 = lm(weight ~ height + I(height^2), data=women)
lm1$y = women$weight
cv.glm(women, lm1)$delta

It produces

1 1
0.2198339 0.2156849

The prediction error is reduced to 0.22. Further applications produce prediction errors 0.15, 0.13 and 0.18. The minimum seems to be when a polynomial of fourth order is used. However, the difference between the third and fourth orders is not as substantial as that between the first and second orders. Depending on circumstances we would choose a polynomial of second, third or fourth order.

  19. R commands • lm – model fit • summary (or summary.lm) – gives a summary of the model, including p-values for each model parameter • extractAIC – extracts Akaike’s information criterion; with k=log(n) as an argument it produces the Bayesian information criterion • stepAIC (or step) – uses AIC for step-by-step model comparison • add1 – adds one parameter at a time and calculates the AIC • drop1 – drops one parameter at a time and calculates the AIC • cv.glm – from the library boot; it is suitable for objects produced by glm, but a slight modification allows it to be used for lm objects also

  20. References • Stuart, A., Ord, J.K. and Arnold, S. (1999) Kendall’s Advanced Theory of Statistics, Volume 2A • Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978) Statistics for Experimenters • Berthold, M. and Hand, D.J. Intelligent Data Analysis • Dalgaard, P. Introductory Statistics with R

  21. Exercise 3 Take the data set bigcity from the library boot and analyse it using a linear model. Write a report.
