1 / 43

Multiple comparisons Basics of linear regression Model selection

Multiple comparisons Basics of linear regression Model selection. Multiple comparison. One comparison - use t-test or equivalent Few comparisons - use Bonferroni Many comparisons - Tukey ’ s honest significant differences, Holm, Scheffe or others. Bonferroni correction.

margiel
Download Presentation

Multiple comparisons Basics of linear regression Model selection

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multiple comparisons • Basics of linear regression • Model selection

  2. Multiple comparison • One comparison - use t-test or equivalent • Few comparisons - use Bonferroni • Many comparisons - Tukey’s honest significant differences, Holm, Scheffe or others

  3. Bonferroni correction If there is only one comparison then we can use t-test or intervals based on t distribution. However if the number of tests increases then probability that significant effect will be observed when there is no significant effect becomes very large. It can be calculated using 1-(1-)n, where  is significance level and n is the number of comparisons. For example if the significance level is 0.05 and the number of comparisons (tests) is 10 then the probability that at least one significant effect will be detected by chance is 1-(1-0.05)10=0.40. Bonferroni suggested using /n instead of  for designing simultaneous confidence intervals. It means that the intervals will be calculated using i-j  t/(2n)(se of comparison)0.5 Clearly when n becomes very large these intervals will become very conservative. Bonferroni correction is recommended when only few effects are compared. pairwise.t.test in R can do Bonferroni correction. If Bonferoni correction is used then p values are multiplied by the number of comparisons (Note that of we are testing effects of I levels of factor then the number of comparisons is I(I-1)/2

  4. Bonferroni correction: Example Let us take the example dataset - poisons and try Bonferroni correction for each factor: pairwise.t.test(poison,treat,”none”,data=poisons) 1 2 2 0.32844 - 3 3.3e-05 0.00074 pairwise.t.test(poison,treat,”bonferroni”,data=poisons) 1 2 2 0.9853 - 3 1e-04 0.0022 As it is seen each p-value is multiplied by the number of comparisons 3*2/2 = 3. If the corresponding adjusted p-value becomes more than one then it is truncated to one. It says that there are significant differences between effects of poisons 1 and 3 and between 2 and 3. Difference between effects of poisons 1 and 2 is not significant.

  5. Holm correction Another correction for multiple tests – Holm’s correction is less conservative than Bonferroni correction. It is a modification of Bonferroni correction. It works in a sequential manner. Let us say we need to make n comparisons and significant level is . Then we calculate p values for all of them and sort them in ascending order and apply the following procedure: • set i = 1 • If (n-i+1) pi<  then it is significant, otherwise it is not. • If a comparison number i is significant then increment i by one and if i  n go to the step 2 The number of significant effects will be equal to i where the procedure stops. When reporting p-values Holm correction works similar to Bonferroni but in a sequential manner. If we have m comparisons then the smallest p value is multiplied by m, the second smallest is multiplied by m-1, j-th comparison is multiplied by m-j+1

  6. Holm correction: example Let us take the example - the data set poisons and try Holm correction for each factor: pairwise.t.test(poison,treat,”none”,data=poisons) 1 2 2 0.32844 - 3 3.3e-05 0.00074 pairwise.t.test(poison,treat,”holm”,data=poisons) # this correction is the default in R 1 2 2 0.3284 - 3 1e-04 0.0015 The smallest is multiplied by 3 the second by 2 and the largest by 1 It says that there is significant differences between effects of poisons 1 and 3 and between 2 and 3. Difference between effects of poisons 1 and 2 is not significant.

  7. Tukey’s honest significant difference This test is used to calculate simultaneous confidence intervals for differences of all effects. Tukey’s range distribution. If we have a random sample of size N from normal distribution then the distribution of stundentised range - (maxi(xi)-mini(xi))/sd is called Tukey’s distribution. Let us say we want to test if i- j is 0. For simultaneous 100% confidence intervals we need to calculate for all pairs lower and upper limits of the interval using: difference ± ql, sd (1/Ji+1/Jj)0.5 /2 Where q is the -quantile of Tukey’s distribution, Ji and Jjare the numbers of observations used to calculate i and j, sd is the standard deviation, l is the number of levels to be compared and  is the degree of freedom used to calculate sd.

  8. Tukey’s honest significant difference R command to perform this test is TukeyHSD. It takes an object derived using aov as an input and gives confidence intervals for all possible differences. For example for poison data (if you want to use this command you should use aov for analysis) lm1 = aov(time~poison+treat,data=poisons) TukeyHSD(lm1) $poison diff lwr upr p adj 2-1 -0.073125 -0.2089936 0.0627436 0.3989657 # insignifacnt 3-1 -0.341250 -0.4771186 -0.2053814 0.0000008 # significant 3-2 -0.268125 -0.4039936 -0.1322564 0.0000606 # significant $treat diff lwr upr p adj B-A 0.36250000 0.18976135 0.53523865 0.0000083 #sginificant C-A 0.07833333 -0.09440532 0.25107198 0.6221729 #insgnific D-A 0.22000000 0.04726135 0.39273865 0.0076661 #significant C-B -0.28416667 -0.45690532 -0.11142802 0.0004090 #significant D-B -0.14250000 -0.31523865 0.03023865 0.1380432 #insignific D-C 0.14166667 -0.03107198 0.31440532 0.1416151 #insignific

  9. More tests Mann-Whitney-Wilcoxon test – non-parametric test for median Kolmogorov-Smirnoff test – test if distribution of two random variables are same Grubbs test – tests for outliers (more in R package – outliers)

  10. Basics of linear regression analysis • Purpose of linear models • Least-squares solution for linear models • Analysis of diagnostics

  11. Reason for linear models Purpose of regression is to reveal statistical relations between input and output variables. Statistics cannot reveal functional relationship. It is purpose of other scientific studies. Statistics can help validation of various functional relationship (models). Let us assume that we suspect that functional relationship has a form: where  is a vector of unknown parameters, x=(x1,x2,,,xp) a vector of controllable parameters, and y is output,  is an error associated with the experiment. Then we can set up experiments for various values of x and get output (or response) for them. If the number of experiments is n then we will have n output values. Denote them as a vector y=(y1,y2,,,yn). Purpose of statistics is to evaluate parameter vector using input and output values. If function f is a linear function of the parameters and errors are additive then we are dealing with linear model. For this model we can write Linear model is linearly dependent on parameters but not on input variables. For example is a linear model. But is not.

  12. Assumptions Basic assumptions for analysis of linear model are: • the model is linear in parameters • the error structure is additive • Random errors have 0 mean, equal variances and they are uncorrelated. These assumptions are sufficient to deal with linear models. Uncorrelated with equal variance assumptions (number 3) can be removed. Then the treatments becomes a little bit more complicated. Note that for general solution, normal distribution of errors assumption is not used. This assumption is necessary to design test statistics. If this assumption breaks down then we can use bootstrap or other approximate techniques to design test statistic. These assumptions can be written in a vector form: where y, 0,  are vectors and I and X are matrices. The matrix X is called a design matrix, input matrix etc. I is nxn identity matrix.

  13. Solution for the general case Problem of linear model fitting into the data is reduced to mimimisation: Its solution is: This estimation is unbiased: Covariance matrix of the estimator is: And variance of the error is estimated as: Note that correlation does not depend on the data directly, it only depends on the amount of data and how it has been collected (design of experiment is important).

  14. Standard diagnostics Before starting to model • Visualisation of the data: • plotting predictor vs observations. These plots may give a clue about the relationship, outliers • Smootheners After modelling and fitting • Fitted values vs residuals. It may help to identify outliers, correctness of the model • Normal QQ plot of residuals. It may help to check distribution assumptions • Cook’s distance, reveals outliers, checks correctness of the model • Model assumptions - t tests given by the default print of lm Checking model and designing tests • Cross-validation. If you have a choice of models then cross-validation may help to choose the “best” model

  15. Visualisation prior to modelling Different type of datasets may require different visualisation tools. For simple visualisation either plot(data) or pairs(data,panel=panel.smooth) could be used. Visualisation prior to modelling may help to propose a model (form of the functional relationship between input and output, probability distribution of observation etc) For example for dataset women - where weights and heights for 15 cases have been measured. plot and pairs commands produce these plots:

  16. After modelling: linear models After modelling the results should be analysed. For example attach(women) lm1 = lm(weight~height) It means that we want a liner model (we believe that dependence of weight on height is linear): weight=0+1*height Results could be viewed using lm1 summary(lm1) The last command will produce significance of various coefficients also. Significance levels produced by summary should be considered carefully. If there are many coefficients then the chance that one “significant” effect is observed is very high.

  17. After modelling: linear models It is a good idea to plot data and fitted model, and differences between fitted and observed values on the same graph. For linear models with one predictor it can be done using: plot(weight,height) abline(lm1) segements(weight,fitted(lm1),weight,height) This plot already shows some systematic differences. It is an indication that model may need to be revised.

  18. Checking validity of the model: standard tools Plotting fitted values vs residual, QQ plot and Cook’s distance can give some insight into model and how to improve it. Some of these plots can be done using plot(lm1)

  19. Confidence and prediction bands Confidence bands are those where true line would belong with 1-α probability Prediction bands are bands where future observation would full with probability of 1-α: Prediction bands are wider than confidence bands.

  20. Prediction and confidence bands lm1 = lm(height~weight) pp = predict(lm1,interval='p') pc = predict(lm1,interval='c') plot(weight,height,ylim=range(height,pp)) n1=order(weight) matlines(weight[n1],pp[n1,],lty=c(1,2,2),col='red') matlines(weight[n1],pc[n1,],lty=c(1,3,3),col='red’) These commands produce two sets of bands: narrow and wide. Narrow band corresponds to confidence bands and wide band is prediction band

  21. Analysis of diagnostics Residuals and hat matrix: Residuals are differences between observation and fitted values: H is called a hat matrix. Diagonal terms hi are leverage of the observations. If these values are close to one then that fitted value is determined by this observation. Sometimes hi’=hi/(1-hi) is used to enhance high leverages. Q-Q plot can be used to check normality assumption. If assumption on distribution is correct then this plot should be nearly linear. If the distribution is normal then tests designed for normal distributions can be used. Otherwise bootstrap may be more useful to derive the desired distributions.

  22. Analysis of diagnostics: Cont. Other analysis tools include: Where hi is leverage, hi’ is enhanced leverage, s2 is unbiased estimator of 2, si2 is unbiased estimator of 2 after removal of i-th observation

  23. R-squared One of the statistics used for goodness of fit is R-squared. It is defined as: Adjusted R-squared: n - the number of observations, p - the number of parameters

  24. Several potential problems with linear model and simple regression analysis • Interpretation Regression analysis can tell us that there is a relationship between input and output. It does not say one causes another • Outliers Use robust M-estimators • Singular or near singular cases Ridge regression or SVD. In underdetermined systems Partial least squares may be helpful • Distribution assumptions Generalised linear models Random or mixed effect models Data transformation • Non-linear model Box-Cox transformation Non-linear model fitting for example using R command nlm

  25. Model Selection • Model selection based on t-distribution • Information criteria • Cross-validation

  26. Introduction When modelling relationship between input and out variables we need to make decision about the form of the model. If there are several input parameters then modelling decision could be: which input variable is important in predicting the output. Let us say that we have several input variables and the proposed model is linear: y=β0 + β1 x1+β2 x2 + ε If some of the coefficients are 0 then corresponding input variable has no effect on predicting response – y. Another situation may arise when we want to explain output using polynomials on input variables. y=β0 + β1 x1+β2 x12 + ε In these cases we may want to know if some of the coefficients are 0. E.g. if β2is 0 then relationship between input and output is linear. Otherwise relationship may be higher order polynomials. In general if we have functions of increasing complexity gkthen it is likely that functions of higher complexity will give us better agreement between input and output. One of the ways of calculating agreement is residual sum of square (RSS): RSS = Σ(yi-gk(xi))2 Where i is the observation number and k is the function (model) number.

  27. Introduction One very simple way of testing if a particular coefficient of the linear model is 0 is testing hypothesis: H0:βj=0 versus H1:βj≠0 As we already know, to do this test we need to know the distribution of the estimated parameters. We know that for large samples the distribution of the parameters asymptotically is normal. We also know that covariance matrix of this distribution is: Where all xi0 are 1. The number of observations is n and the number of parameters is p. Under H0 the distribution of βjis normal distribution with mean 0 and the variance equal to the diagonal term of the covariance matrix. Therefore the distribution of is t-distribution with n-p degrees of freedom.

  28. Example Let us take an example of weight vs height and look at the results of linear model fitting using polynomials on input variable (height) up to fifth order. Plots show that second order polynomial fits sufficiently well. After the second order improvements are marginal. Natural question would be which model is better, can we justify more complex model based on the observations we have.

  29. Model selection using t-distribution: summary of lm method The distribution of estimated parameters are used in the summary of lm command. Let us say we want to see if some of the coefficients of the model: y=β0 + β1 x1+β2 x12 + β1 x13+β2 x14 + β1 x15 + ε After fitting the model using lm5 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5)) summary(lm5) will produce a table that will have information about significance for each estimated parameter: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.144e+04 8.300e+04 1.102 0.299 height -6.947e+03 6.418e+03 -1.083 0.307 I(height^2) 2.107e+02 1.982e+02 1.063 0.315 I(height^3) -3.187e+00 3.058e+00 -1.042 0.325 I(height^4) 2.403e-02 2.355e-02 1.020 0.334 I(height^5) -7.224e-05 7.246e-05 -0.997 0.345 Residual standard error: 0.2245 on 9 degrees of freedom Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998 F-statistic: 1.335e+04 on 5 and 9 DF, p-value: < 2.2e-16

  30. Model selection using t-distribution More details of this table will be described in the tutorial. Here we are interested in not-rejecting null-hypothesis (that particular parameter is 0). If we are using t-distribution to to make such decision then we need to take the largest p-value and remove the corresponding coefficient from the model. In this case it is the coefficient corresponding to height5. After removing that we can do model fitting using the reduced model. We use: lm4 = lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4)) summary(lm4) It will produce significance levels for coefficients Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.824e+03 4.550e+03 1.939 0.0812 . height -5.552e+02 2.814e+02 -1.973 0.0768 . I(height^2) 1.319e+01 6.515e+00 2.024 0.0705 . I(height^3) -1.389e-01 6.693e-02 -2.076 0.0646 . I(height^4) 5.507e-04 2.574e-04 2.140 0.0581 . Now we cannot remove any of the coefficients, p-values are not large enough to say that particular coefficient is insignificant.

  31. Model selection using t-distribution summary command after lm command allows a simple way of model selection. It would work as follows: • Specify full model (it may be a problem. Model always can be made more complex) • Fit the model using lm • Use summary and remove the coefficient with the highest p-value (Say values that are more than 0.1. This value is arbitrary and depends how much would one want to reduce the model) • Repeat 2) and 3) until model can no longer be reduced Warning: Using multiple hypothesis testing may produce misleading results. For example if we are working on a significance level αand we want to test 10 hypothesis then the probability that at least one of the hypothesis will be significant when it is not is 1-(1-α)10. If α=0.05 then this probability will be 0.4 and if we are testing 100 hypothesis then this probability will be 0.99. When we used summary we deliberately did not use rejection of null-hypothesis. We used high p-values not to reject null-hypothesis.

  32. Akaike’s information criterion Another model selection tool is Akaike’s information criterion (AIC). This technique is based on the value of likelihood at the maximum and uses the number of parameters to penalise complex models. Let us remind us the maximum likelihood technique. Likelihood is proportional to the conditional distribution of observations (outputs, responses) given parameters of the model: In maximum likelihood estimation we maximise the likelihood function wrt the parameters of the model. By making model more and more complex we can get larger and larger likelihood values. But is more complex model justified? Once we have maximised the likelihood function we can calculate the value of the likelihood at the maximum. Akaike’s information criterion as implemented in R uses the following formula: Where c is a constant (may depend on the number of observations) that defines various forms of information criterion, p is the effective number of parameters in the model. In linear model p is equal to the number of parameters. When c=2 then we have the classic AIC and when c=log(n) then we have Bayesian information criterion (BIC).

  33. Akaike’s information criterion To apply AIC to the cases of linear models we need to specify likelihood function and find its value at the maximum. We can consider linear models with least square minimisation as a special case of maximum likelihood. Where σιis the standard deviation of the i-th observation. When all σιare equal to each other (denote them as σ) and they are unknown then we can write: Once we have found the maximum wrt β we can carry on and find the maximum wrt σ. Note that the maximum likelihood estimator for the variance is not unbiased. Unbiased estimator is

  34. Akaike’s information criterion Let us find the value of the likelihood at the maximum. We will have: If we put it in the expression of AIC we will have: In model selection the number of observations do not change, so AIC can be written: If we use more complex model then RSS will be smaller. On the other hand model with smaller number of parameters may have better predictive power. AIC attempts to combine these two conflicting ideas together. When comparing two models, model with smaller AIC is considered to be better.

  35. Akaike’s information criterion in R There are number of functions in R that uses AIC. The simplest of them is extractAIC. This function gives the number of parameters as well as AIC. Another function is step. In its simplest form this function removes one at a time a parameter and calculates the AIC. If removal of a parameter decreases AIC then this parameter is removed from further consideration. The procedure is applied iteratively. In the following slide a result of step function is shown.

  36. Akaike’s information criterion Stage 1: Procedure removes one parameter at a time. Removal of height5 reduces AIC Start: AIC=-40.48 weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5) Df Sum of Sq RSS AIC - I(height^5) 1 0.050 0.504 -40.910 - I(height^4) 1 0.052 0.506 -40.840 - I(height^3) 1 0.055 0.508 -40.773 - I(height^2) 1 0.057 0.510 -40.708 - height 1 0.059 0.513 -40.646 <none> 0.454 -40.482 Step: AIC=-40.91 weight ~ height + I(height^2) + I(height^3) + I(height^4) Df Sum of Sq RSS AIC <none> 0.504 -40.910 - height 1 0.196 0.700 -37.979 - I(height^2) 1 0.206 0.710 -37.759 - I(height^3) 1 0.217 0.721 -37.536 - I(height^4) 1 0.231 0.734 -37.256 Call: lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4)) Coefficients: (Intercept) height I(height^2) I(height^3) I(height^4) 8.824e+03 -5.552e+02 1.319e+01 -1.389e-01 5.507e-04 Stage 2: Procedure tries further reduce the number of parameters. But the “best” model is the full model. Final stage: The “best” model is given

  37. Akaike’s information criterion: further improvements There are several additional forms of AIC. They try to correct to small samples. For example corrected AIC has an expression: R does not produce AICc (I am not aware if it produces). However it can be calculated using (for example) extractAIC(lm4)+2*p*(p+1)/(n-p-1) Where n is the number of observations and p is the number of parameters. For example for fourth order polynomial fit for weight <–> height relationship we can use: lm4=lm(weight ~ height + I(height^2) + I(height^3) + I(height^4)) extractAIC(lm4)+2*5*6/(15-5-1) It will produce the value -34.24 and for the reduced model lm3=lm(weight ~ height + I(height^2) + I(height^3)) extractAIC(lm3)+2*4*5/(15-4-1) It will produce -33.26. Since more complex model has lower AICc, it is preferred.

  38. AIC, BIC Both AIC and BIC attempt to produce minimal best model that describes the observations as best as possible. BIC reduces (prefers simpler model) more than AIC. In general in applications it is better to use common sense and look at the parameters very carefully. If you can find interpretation of all terms and reasons for them being there then they may be valid. Even when the “best” model has been selected it would be better to come back to model selection again if new observations become available. Absolute values of AIC (or BIC) are not important. Differences between AIC-s for different models are important. AIC (and BIC) can be used to compare models with the same observations. If observations change (for example few outliers are removed or new observations become available) then AIC should be applied for all models again.

  39. Cross-validation If we have a sample of observations, can we use this sample and choose among given models. Cross validation attempts to reduce overfitting thus helps model selection. Description of cross-validation: We have a sample of the size n – (yi,xi) . • Divide the sample into K roughly equal size parts. • For the k-th part, estimate parameters using K-1 parts excluding k-th part. Calculate prediction error for k-th part. • Repeat it for all k=1,2,,,K and combine all prediction errors and get cross-validation prediction error. If K=n then we will have leave-one-out cross-validation technique. Let us denote an estimate at the k-th step by k (it is a vector of parameters). Let k-th subset of the sample be Akand number of points in this subset is Nk.. Then prediction error per observation is: Then we would choose the function that gives the smallest prediction error. We can expect that in future when we will have new observations this function will give smallest prediction error. This technique is widely used in modern statistical analysis. It is not restricted to least-squares technique. Instead of least-squares we could could use other techniques such as maximum-likelihood, Bayesian estimation, M-estimation.

  40. Cross-validation in R There is no cross-validation method for lm in R but there is a cross-validation function for glm (generalised linear method). This method requires that resoponses are stored in the output from model fitting. We can change a little bit the output from lm and then use cross-validation for glm. Following sequence of commands will allow us to do this (it is just an example): lm1 = lm(weight~height) lm1$y = weight # Add responses to the output from lm cv.glm(lm1)$delta # Do cross-validation cv.glm function by default uses leave one out cross-validation.

  41. Cross-validation in R Let us apply cross-valdiation for weight~height relationship. require(boot) require(MASS) lm1=lm(weight~height,data=women) lm1$y=women$weight cv.glm(women,lm1)$delta This sequence of command will produce the following. So prediction error is 3.04. 1 1 3.040776 3.002450 lm1=lm(weight~height+I(height^2),data=women) lm1$y=women$weight cv.glm(women,lm1)$delta It produces 1 1 0.2198339 0.2156849 Prediction error is reduced to 0.22. Further application will produce prediction errors 0.15, 0.13, 0.18. The minimum seems to be when polynomial of fourth order is used. However differences between third and fourth order is not as substantial as that between first and second order. Depending on circumstances we would choose models as polynomila of second, third or fourth order.

  42. R commands lm – model fit summary (or summary.lm) – will give summary about the model including p-values for each model parameter extractAIC – to extract Akaikes information criterion extractAIC – with k=log(n) as an argument it will produce Bayesian information criterion stepAIC (or step) – will use AIC to step by step model comparision add1 – will add one parameter at a time and calculate AIC drop1 – will drop one parameter at a time and calculate AIC cv.glm – from the library boot. It is suitable for the objects produced by glm, however slight modification allows it to be used for lm object also.

  43. References • Berthold, M. and Hand, DJ (2003) “Intelligent data analysis” • Stuart, A., Ord, JK, and Arnold, S. (1991) “Kendall’s advanced Theory of statistics. Volume 2A. Classical Inference and the Linear models” Arnold publisher, London, Sydney, Auckland • Dalgaard, P (2002) “Introductory statistics with R”, Springer Publishers

More Related