
Topic 16: Multicollinearity, Polynomial Regression, Interaction Models




    1. Topic 16: Multicollinearity, Polynomial Regression, Interaction Models

    2. Outline
        - Multicollinearity
        - Polynomial regression
        - Interaction models

    3. An example (NKNW p. 261)
        - The P-value for the ANOVA F-test is <.0001
        - The P-values for the individual regression coefficients are 0.1699, 0.2849, and 0.1896
        - None of these is near our standard significance level of 0.05
        - What is the explanation? Multicollinearity!

    4. Multicollinearity
        - The numerical analysis problem is that the matrix X′X is close to singular and is therefore difficult to invert accurately
        - The statistical problem is that there is too much correlation among the explanatory variables and it is therefore difficult to determine the regression coefficients

    5. Multicollinearity
        - Solve the statistical problem and the numerical problem will also be solved
        - We want to refine a model that has redundancy in the explanatory variables
        - Do this regardless of whether X′X can be inverted without difficulty

    6. Multicollinearity
        - Extreme cases can help us understand the problems caused by multicollinearity
        - If the columns of the X matrix were uncorrelated, the Type I and Type II SS would be the same
        - The contribution of each explanatory variable to the model would then be the same whether or not the other explanatory variables are in the model

    7. Multicollinearity
        - Suppose a linear combination of the explanatory variables is a constant
        - Example: X1 = X2 ⇒ X1 − X2 = 0
        - Example: 3X1 − X2 = 5
        - The Type II SS for the Xs involved will be zero

    8. An example
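
    (The slide's SAS code and output are not reproduced in this transcript. Below is a minimal sketch of this kind of example, with hypothetical data and variable names: one predictor is an exact linear combination of the others, so PROC REG flags the model as not full rank and the Type II SS for the variables involved are zero.)

        /* Hypothetical demo data: x3 = 3*x1 - x2 exactly (the extreme case) */
        data a1;
          do i = 1 to 20;
            x1 = ranuni(123);               /* random predictors */
            x2 = ranuni(123);
            x3 = 3*x1 - x2;                 /* exact linear dependence */
            y  = 1 + x1 + x2 + rannor(123);
            output;
          end;
          drop i;
        run;

        proc reg data=a1;
          model y = x1 x2 x3 / ss1 ss2;     /* request Type I and Type II SS */
        run;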

    9. Output

    10. Output

    11. Output

    12. Output

    13. Extent of multicollinearity
        - This example had one explanatory variable equal to a linear combination of the other explanatory variables
        - This is the most extreme case of multicollinearity and is detected by statistical software because X′X does not have an inverse
        - We are concerned with cases that are less extreme

    14. Effects of multicollinearity
        - Regression coefficients are not well estimated and may be meaningless, and likewise the standard errors of these estimates
        - Type I SS and Type II SS will differ
        - R2 and predicted values are usually OK

    15. Pairwise correlations
        - Pairwise correlations can be used to check for pairwise collinearity
        - Recall NKNW p. 261:
          Corr(skinfold, thigh) = 0.9218
          Corr(skinfold, midarm) = 0.4578
          Corr(thigh, midarm) = 0.0847
          Corr(midarm, skinfold + thigh) = 0.995
        - See p. 291 for the change in coefficient values depending on which variables are in the model
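
    (A minimal sketch of the PROC CORR run behind these numbers; the data set names fat and fat2 and the sum variable sumst are hypothetical:)

        data fat2;
          set fat;
          sumst = skinfold + thigh;          /* needed for Corr(midarm, skinfold+thigh) */
        run;

        proc corr data=fat2;
          var skinfold thigh midarm sumst;   /* prints the pairwise correlation matrix */
        run;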

    16. Polynomial regression
        - We can fit a quadratic, cubic, etc. relationship by defining squares, cubes, etc., in a data step and using them as additional explanatory variables
        - We can do this with more than one explanatory variable
        - When we do this we generally create a multicollinearity problem

    17. NKNW Example p. 302
        - Response variable is the life (in cycles) of a power cell
        - Explanatory variables are charge rate (3 levels) and temperature (3 levels)
        - This is a designed experiment

    18. Input and check the data
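
    (The input step is not shown in the transcript. A sketch under assumed names; the file name and the variable names cycles, chrate, and temp are hypothetical:)

        data cell;
          infile 'powercell.dat';     /* hypothetical raw data file */
          input cycles chrate temp;
        run;

        proc print data=cell;         /* check the data */
        run;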

    19. Output

    20. Create new variables and run the regression
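
    (A sketch of this step, continuing with the assumed data set name; chrate2 and temp2 follow the naming used on slide 23, and the cross-product name ct is hypothetical:)

        data cell2;
          set cell;
          chrate2 = chrate*chrate;    /* squared terms */
          temp2   = temp*temp;
          ct      = chrate*temp;      /* cross product */
        run;

        proc reg data=cell2;
          model cycles = chrate temp chrate2 temp2 ct;
        run;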

    21. Output

    22. Output

    23. Conclusion
        - Overall F significant, individual t-tests not significant ⇒ multicollinearity problem
        - Look at the correlations (PROC CORR); there are some very high ones:
          r(chrate, chrate2) = 0.99103
          r(temp, temp2) = 0.98609
        - These are correlations between powers of a variable
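
    (A sketch of the PROC CORR check, using the same assumed names:)

        proc corr data=cell2;
          var chrate temp chrate2 temp2;   /* shows r(chrate, chrate2) and r(temp, temp2) */
        run;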

    24. A remedy
        - We can remove the correlation between explanatory variables and their powers by centering
        - Centering means that you subtract off the mean before squaring
        - NKNW rescaled by standardizing (subtract the mean and divide by the standard deviation), but subtracting the mean is the key step here

    25. A remedy
        - Use PROC STANDARD to center the explanatory variables
        - Recompute the squares, cubes, etc., using the centered variables
        - Rerun the regression analysis

    26. Proc standard
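
    (The code image is not reproduced. A minimal sketch: the MEAN=0 option centers each listed variable; adding STD=1 would standardize, as NKNW did. The output data set name cellc is hypothetical:)

        proc standard data=cell out=cellc mean=0;
          var chrate temp;            /* replaces each value by value minus its mean */
        run;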

    27. Output

    28. Recompute squares and cross product

    29. Rerun regression
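
    (A sketch covering slides 28–29 under the same assumed names: recompute the powers and cross product from the centered variables, then refit:)

        data cellc2;
          set cellc;
          chrate2 = chrate*chrate;    /* squares of the centered variables */
          temp2   = temp*temp;
          ct      = chrate*temp;      /* cross product of the centered variables */
        run;

        proc reg data=cellc2;
          model cycles = chrate temp chrate2 temp2 ct;
        run;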

    30. Output

    31. Output

    32. Interaction models
        - With several explanatory variables, we need to consider the possibility that the effect of one variable depends on the value of another variable
        - Special cases: one binary variable (Y/N) and one continuous variable; two continuous variables

    33. One binary variable and one continuous variable
        - X1 takes values 0 and 1 corresponding to two different groups
        - X2 is a continuous variable
        - Model: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
        - When X1 = 0: Y = β0 + β2X2 + ε
        - When X1 = 1: Y = (β0 + β1) + (β2 + β3)X2 + ε

    34. One binary and one continuous
        - β0 is the intercept for Group 1
        - β0 + β1 is the intercept for Group 2
        - The slopes have a similar relationship (β2 for Group 1, β2 + β3 for Group 2)
        - H0: β1 = β3 = 0 tests the hypothesis that the regression lines are the same
        - H0: β1 = 0 tests equal intercepts
        - H0: β3 = 0 tests equal slopes

    35. NKNW Example p. 459
        - Y is the number of months it takes an insurance company to adopt an innovation
        - X1 is the size of the firm (a continuous variable)
        - X2 is the type of firm (a qualitative or categorical variable)

    36. The question
        - X2 takes the value 0 for a mutual firm and 1 for a stock firm
        - We ask whether stock firms adopt the innovation more slowly or more quickly than mutual firms
        - We ask the question across all firms, regardless of size

    37. Plot the data

    38. Two symbols
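
    (A sketch of the two-symbol plot, assuming the insurance data are in a data set a1 with variables months, size, and stock as on slide 40:)

        symbol1 value=circle;         /* mutual firms (stock = 0) */
        symbol2 value=plus;           /* stock firms  (stock = 1) */

        proc gplot data=a1;
          plot months*size=stock;     /* the third variable selects the symbol */
        run;
        quit;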

    39. Interaction effects
        - Interaction expresses the idea that the effect of one explanatory variable on the response depends on another explanatory variable
        - In the NKNW example, this would mean that the slope of the line depends on the type of firm

    40. Are both lines the same? Use the TEST statement

        data a1;
          set a1;
          sizestoc = size*stock;      /* interaction term */
        run;

        proc reg data=a1;
          model months = size stock sizestoc;
          test stock, sizestoc;       /* H0: both coefficients zero, i.e. one common line */
        run;

    41. Output

    42. Output

    43. Two parallel lines?
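
    (A sketch of the parallel-lines test: keep the full model but test only the interaction coefficient:)

        proc reg data=a1;
          model months = size stock sizestoc;
          test sizestoc;              /* H0: beta3 = 0, i.e. equal slopes */
        run;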

    44. Output

    45. Output

    46. Output

    47. Plot the two lines
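
    (A sketch of one way to draw the two fitted lines; this uses PROC SGPLOT as a convenient substitute for the original SAS/GRAPH program, which is not reproduced here:)

        proc sgplot data=a1;
          reg y=months x=size / group=stock;   /* one fitted line per firm type */
        run;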

    48. The plot

    49. Two continuous variables
        - Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
        - Can be rewritten as follows:
          Y = β0 + (β1 + β3X2)X1 + β2X2 + ε
          Y = β0 + β1X1 + (β2 + β3X1)X2 + ε
        - The coefficient of one explanatory variable depends on the value of the other explanatory variable

    50. Constrained regression
        - We may want to put a linear constraint on the regression coefficients, e.g. β1 = 1 or β1 = β2
        - We can do this by redefining our explanatory variables in a data step; see eq. 7.96 (p. 316)
        - Or we can use the RESTRICT statement in PROC REG (e.g. restrict size=0; or restrict size=5*stock;)
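
    (A sketch of the RESTRICT approach, using the two constraints quoted above:)

        proc reg data=a1;
          model months = size stock;
          restrict size = 0;          /* force the size coefficient to zero */
        run;

        proc reg data=a1;
          model months = size stock;
          restrict size = 5*stock;    /* linear constraint linking two coefficients */
        run;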

    51. Last slide
        - We went over NKNW 7.4–7.9 and 11.1–11.2
        - We used the programs NKNW260.sas, cs3.sas, NKNW302.sas, and NKNW459.sas to generate the output for today
