
Detecting and reducing multicollinearity



Presentation Transcript


  1. Detecting and reducing multicollinearity

  2. Detecting multicollinearity

  3. Common methods of detection
     • Realized effects of multicollinearity: changes in coefficients, changes in standard errors of coefficients, and changes in sequential sums of squares.
     • Non-significant t-tests for all of the slopes but a significant overall F-test.
     • Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).
     • Variance inflation factors (VIF).

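  4. The first variance at issue
     For the model:
     $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$
     the variance of the estimated coefficient $b_k$ is:
     $\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \frac{1}{1 - R_k^2}$
     where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.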

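  5. The second variance at issue
     For the model:
     $y_i = \beta_0 + \beta_k x_{ik} + \epsilon_i$
     the variance of the estimated coefficient $b_k$ is:
     $\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}$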

  6. The ratio of the two variances
     $\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2}$

  7. Variance inflation factors
     The variance inflation factor for the kth predictor is:
     $VIF_k = \frac{1}{1 - R_k^2}$
     where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.

  8. Variance inflation factors ($VIF_k$)
     • A measure of how much the variance of the estimated regression coefficient $b_k$ is “inflated” by the existence of correlation among the predictor variables in the model.
     • VIFs exceeding 4 warrant investigation.
     • VIFs exceeding 10 are signs of serious multicollinearity.
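The definition on the preceding slides can be turned into a short computation. The sketch below is not from the presentation: it assumes NumPy is available, the function name `variance_inflation_factors` is made up for illustration, and the data are synthetic (x3 is built to be nearly a copy of x1). It regresses each predictor on the others and applies VIF_k = 1 / (1 - R_k^2).

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_k = 1 / (1 - R_k^2) for each column k of X.

    R_k^2 comes from an ordinary least-squares regression (with intercept)
    of column k on all the other columns.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    vifs = []
    for k in range(m):
        target = X[:, k]
        others = np.delete(X, k, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        sse = float(resid @ resid)
        ssto = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - sse / ssto
        vifs.append(1.0 / (1.0 - r2))
    return vifs


# Synthetic illustration (not data from the slides).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=200)    # nearly collinear with x1

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
for name, v in zip(["x1", "x2", "x3"], vifs):
    flag = "serious" if v > 10 else ("investigate" if v > 4 else "ok")
    print(f"{name}: VIF = {v:.1f} ({flag})")
```

The 4 and 10 cutoffs in the print loop are the rules of thumb quoted on the slide.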

  9. Blood pressure example
     n = 20 hypertensive individuals
     p-1 = 6 predictor variables

  10. Blood pressure example
                   BP    Age  Weight    BSA  Duration  Pulse
      Age       0.659
      Weight    0.950  0.407
      BSA       0.866  0.378   0.875
      Duration  0.293  0.344   0.201  0.131
      Pulse     0.721  0.619   0.659  0.465     0.402
      Stress    0.164  0.368   0.034  0.018     0.312  0.506
      Blood pressure (BP) is the response.

  11. Regress y = BP on all 6 predictors
      Predictor      Coef   SE Coef      T      P  VIF
      Constant    -12.870     2.557  -5.03  0.000
      Age         0.70326   0.04961  14.18  0.000  1.8
      Weight      0.96992   0.06311  15.37  0.000  8.4
      BSA           3.776     1.580   2.39  0.033  5.3
      Dur         0.06838   0.04844   1.41  0.182  1.2
      Pulse      -0.08448   0.05161  -1.64  0.126  4.4
      Stress     0.005572  0.003412   1.63  0.126  1.8
      S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%
      Analysis of Variance
      Source          DF       SS      MS       F      P
      Regression       6  557.844  92.974  560.64  0.000
      Residual Error  13    2.156   0.166
      Total           19  560.000

  12. Regress x2 = weight on 5 predictors
      Predictor      Coef  SE Coef      T      P  VIF
      Constant     19.674    9.465   2.08  0.057
      Age         -0.1446   0.2065  -0.70  0.495  1.7
      BSA          21.422    3.465   6.18  0.000  1.4
      Dur          0.0087   0.2051   0.04  0.967  1.2
      Pulse        0.5577   0.1599   3.49  0.004  2.4
      Stress     -0.02300  0.01308  -1.76  0.101  1.5
      S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%
      Analysis of Variance
      Source          DF       SS      MS      F      P
      Regression       5  308.839  61.768  20.77  0.000
      Residual Error  14   41.639   2.974
      Total           19  350.478

  13. The variance inflation factor calculated by its definition
      $VIF_{Weight} = \frac{1}{1 - R_{Weight}^2} = \frac{1}{1 - 0.881} = 8.40$
      The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.
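As a quick arithmetic check, the sketch below recomputes the weight VIF two ways: from the rounded R-Sq of 88.1% and from the regression and total sums of squares in the ANOVA table for the regression of weight on the other five predictors. The helper name `vif_from_r2` is hypothetical; the numbers are copied from the Minitab output above.

```python
def vif_from_r2(r2):
    """Variance inflation factor from the R^2 of predictor k on the others."""
    return 1.0 / (1.0 - r2)


# Rounded R-Sq reported for the regression of weight on the other predictors.
vif_rounded = vif_from_r2(0.881)

# Same quantity from the ANOVA table: R^2 = SS(Regression) / SS(Total).
ss_regression, ss_total = 308.839, 350.478
r2_anova = ss_regression / ss_total
vif_anova = vif_from_r2(r2_anova)

print(f"VIF from rounded R-Sq: {vif_rounded:.2f}")
print(f"VIF from ANOVA sums of squares: {vif_anova:.2f}")
```

Both routes agree with the VIF of 8.4 reported for Weight in the full model, up to rounding.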

  14. The pairwise correlations
                   BP    Age  Weight    BSA  Duration  Pulse
      Age       0.659
      Weight    0.950  0.407
      BSA       0.866  0.378   0.875
      Duration  0.293  0.344   0.201  0.131
      Pulse     0.721  0.619   0.659  0.465     0.402
      Stress    0.164  0.368   0.034  0.018     0.312  0.506
      Blood pressure (BP) is the response.
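One way to read this matrix is to scan the predictor-by-predictor entries for large values. The sketch below hard-codes the predictor correlations from the slide (the BP column is left out because BP is the response) and lists the pairs above a cutoff. The cutoff of 0.8 is an assumption chosen for illustration, not a rule given in the presentation.

```python
# Pairwise correlations among the six predictors, copied from the slide.
predictor_corr = {
    ("Weight", "Age"): 0.407,
    ("BSA", "Age"): 0.378,
    ("BSA", "Weight"): 0.875,
    ("Duration", "Age"): 0.344,
    ("Duration", "Weight"): 0.201,
    ("Duration", "BSA"): 0.131,
    ("Pulse", "Age"): 0.619,
    ("Pulse", "Weight"): 0.659,
    ("Pulse", "BSA"): 0.465,
    ("Pulse", "Duration"): 0.402,
    ("Stress", "Age"): 0.368,
    ("Stress", "Weight"): 0.034,
    ("Stress", "BSA"): 0.018,
    ("Stress", "Duration"): 0.312,
    ("Stress", "Pulse"): 0.506,
}

THRESHOLD = 0.8  # hypothetical cutoff for "strongly correlated"

flagged = [
    (b, a, r) for (a, b), r in predictor_corr.items() if abs(r) >= THRESHOLD
]
for first, second, r in flagged:
    print(f"{first} and {second}: r = {r}")
```

With this cutoff only Weight and BSA are flagged, which fits the next slide, where the model is refit without BSA and Pulse.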

  15. Regress y = BP on age, weight, duration and stress
      Predictor      Coef   SE Coef      T      P  VIF
      Constant    -15.870     3.195  -4.97  0.000
      Age         0.68374   0.06120  11.17  0.000  1.5
      Weight      1.03413   0.03267  31.65  0.000  1.2
      Dur         0.03989   0.06449   0.62  0.545  1.2
      Stress     0.002184  0.003794   0.58  0.573  1.2
      S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%
      Analysis of Variance
      Source          DF      SS      MS       F      P
      Regression       4  555.45  138.86  458.28  0.000
      Residual Error  15    4.55    0.30
      Total           19  560.00

  16. Reducing data-based multicollinearity

  17. Data-based multicollinearity
      • Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.

  18. Some methods
      • Modify the regression model by eliminating one or more predictor variables.
      • Collect additional data under different experimental or observational conditions.

  19. (Modified!) Allen Cognitive Level (ACL) Study
      • Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:
        • Response y = ACL score
        • x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale
        • x2 = abstraction (Abstract) score on Shipley Institute of Living Scale
        • x3 = score on Symbol-Digit Modalities Test (SDMT)

  20. Allen Cognitive Level (ACL) Study on 23 patients

  21. Strong correlation between Vocab and Abstract
      Pearson correlation of Vocab and Abstract = 0.990

  22. Regress y = ACL on SDMT, Vocab, and Abstract
      Predictor     Coef  SE Coef      T      P   VIF
      Constant     3.747    1.342   2.79  0.012
      SDMT       0.02326  0.01273   1.83  0.083   1.7
      Vocab       0.0283   0.1524   0.19  0.855  49.3
      Abstract   -0.0138   0.1006  -0.14  0.892  50.6
      S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%
      Analysis of Variance
      Source          DF       SS      MS     F      P
      Regression       3   3.6854  1.2285  2.28  0.112
      Residual Error  19  10.2476  0.5393
      Total           22  13.9330

  23. Allen Cognitive Level (ACL) Study on 69 patients

  24. Plot after having collected more data
      Pearson correlation of Vocab and Abstract = 0.698

  25. Regress y = ACL on SDMT, Vocab, and Abstract
      Predictor      Coef   SE Coef      T      P  VIF
      Constant     3.9463    0.3381  11.67  0.000
      SDMT       0.027404  0.007168   3.82  0.000  1.6
      Vocab      -0.01740   0.01808  -0.96  0.339  2.1
      Abstract    0.01218   0.01159   1.05  0.297  2.2
      S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%
      Analysis of Variance
      Source          DF       SS      MS     F      P
      Regression       3  12.3009  4.1003  8.67  0.000
      Residual Error  65  30.7487  0.4731
      Total           68  43.0496
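A rough way to see why the VIFs fall from about 50 to about 2 is to look at the Vocab and Abstract correlation alone. If those were the only two predictors, R_k^2 would equal r^2 exactly, so VIF = 1 / (1 - r^2). The ACL models also include SDMT, so the sketch below is only an approximation under that two-predictor simplification, and the correlations it uses are the rounded values from the slides. The helper name `two_predictor_vif` is hypothetical.

```python
def two_predictor_vif(r):
    """VIF when a predictor's only companion has correlation r with it."""
    return 1.0 / (1.0 - r ** 2)


# Rounded Pearson correlations of Vocab and Abstract from the slides.
vif_23_patients = two_predictor_vif(0.990)
vif_69_patients = two_predictor_vif(0.698)

print(f"n = 23, r = 0.990: approximate VIF = {vif_23_patients:.1f}")
print(f"n = 69, r = 0.698: approximate VIF = {vif_69_patients:.1f}")
```

These approximations are in the same range as the reported VIFs (49.3 and 50.6 with 23 patients, 2.1 and 2.2 with 69 patients); exact agreement is not expected because of the third predictor and the rounding of r.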

  26. Reducing structural multicollinearity
      In context of polynomial regression models

  27. Structural multicollinearity
      • Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor $x^2$ from the predictor $x$.

  28. Example
      • (General research question) What is the impact of exercise on the human immune system?
      • (Specific research question) How is the amount of immunoglobulin in blood (y) related to maximal oxygen uptake (x)?

  29. Scatter plot

  30. A quadratic polynomial regression function
      $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$
      • where:
        • $y_i$ = amount of immunoglobulin in blood (mg)
        • $x_i$ = maximal oxygen uptake (ml/kg)
        • typical assumptions about error terms (“INE”)

  31. Estimated quadratic function

  32. Interpretation of the regression coefficients
      • If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.
      • b1 is the slope of the tangent line at x = 0.
      • b2 indicates the up/down direction of the curve:
        • b2 < 0 means the curve is concave down
        • b2 > 0 means the curve is concave up

  33. Regress y = igg on oxygen and oxygen²
      The regression equation is
      igg = - 1464 + 88.3 oxygen - 0.536 oxygensq
      Predictor     Coef  SE Coef      T      P   VIF
      Constant   -1464.4    411.4  -3.56  0.001
      oxygen       88.31    16.47   5.36  0.000  99.9
      oxygensq   -0.5362   0.1582  -3.39  0.002  99.9
      S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%
      Analysis of Variance
      Source          DF       SS       MS       F      P
      Regression       2  4602211  2301105  203.16  0.000
      Residual Error  27   305818    11327
      Total           29  4908029

  34. Structural multicollinearity
      Pearson correlation of oxygen and oxygensq = 0.995

  35. “Center” the predictors
      Mean of oxygen = 50.637
      oxygen   oxcent  oxcentsq
        34.6  -16.037   257.185
        45.0   -5.637    31.776
        62.3   11.663   136.026
        58.9    8.263    68.277
        42.5   -8.137    66.211
        44.3   -6.337    40.158
        67.9   17.263   298.011
        58.5    7.863    61.827
        35.6  -15.037   226.111
        49.6   -1.037     1.075
        33.0  -17.637   311.064

  36. Wow! It really works!
      Pearson correlation of oxcent and oxcentsq = 0.219
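The effect of centering can be reproduced on a small scale. The sketch below uses only the 11 oxygen values listed on the centering slide, not the full set of 30 observations, and centers them at their own mean rather than at 50.637, so its correlations will not match the 0.995 and 0.219 reported for the full data. The `pearson` helper is written out by hand to keep the example self-contained.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


# The 11 oxygen values shown on the slide (a subset of the 30 observations).
oxygen = [34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6, 33.0]

# Center at the mean of this subset, then square.
mean_ox = sum(oxygen) / len(oxygen)
oxcent = [x - mean_ox for x in oxygen]

r_raw = pearson(oxygen, [x ** 2 for x in oxygen])
r_centered = pearson(oxcent, [c ** 2 for c in oxcent])

print(f"corr(oxygen, oxygen^2)  = {r_raw:.3f}")
print(f"corr(oxcent, oxcent^2)  = {r_centered:.3f}")
```

Squaring a variable that takes only positive values gives something nearly linear in it; after centering, negative and positive deviations both map to positive squares, which breaks most of that linear association.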

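  37. A better quadratic polynomial regression function
      $y_i = \beta_0^* + \beta_1^* x_i^* + \beta_2^* (x_i^*)^2 + \epsilon_i$
      • where $x_i^* = x_i - \bar{x}$ denotes the centered predictor
      • and:
        • $y_i$ = amount of immunoglobulin in blood (mg)
        • typical assumptions about error terms (“INE”)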

  38. Regress y = igg on oxcent and oxcent²
      The regression equation is
      igg = 1632 + 34.0 oxcent - 0.536 oxcentsq
      Predictor     Coef  SE Coef      T      P  VIF
      Constant   1632.20    29.35  55.61  0.000
      oxcent      34.000    1.689  20.13  0.000  1.1
      oxcentsq   -0.5362   0.1582  -3.39  0.002  1.1
      S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%
      Analysis of Variance
      Source          DF       SS       MS       F      P
      Regression       2  4602211  2301105  203.16  0.000
      Residual Error  27   305818    11327
      Total           29  4908029

  39. Interpretation of the regression coefficients
      • b0 is the predicted response at the predictor mean.
      • b1 is the estimated slope of the tangent line at the predictor mean and is often similar to the estimated slope in the simple (first-order) model.
      • b2 indicates the up/down direction of the curve:
        • b2 < 0 means the curve is concave down
        • b2 > 0 means the curve is concave up

  40. Estimated regression function

  41. Similar estimates of coefficients from first-order linear model

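  42. The relationship between the two forms of the model
      Centered model:
      $\hat{y} = b_0^* + b_1^* (x - \bar{x}) + b_2^* (x - \bar{x})^2$
      Original model:
      $\hat{y} = b_0 + b_1 x + b_2 x^2$
      where:
      $b_0 = b_0^* - b_1^* \bar{x} + b_2^* \bar{x}^2$
      $b_1 = b_1^* - 2 b_2^* \bar{x}$
      $b_2 = b_2^*$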

  43. Mean of oxygen = 50.637
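Expanding the centered quadratic in powers of x gives the uncentered coefficients. The sketch below applies that expansion to the centered estimates from the Minitab output (1632.20, 34.000, -0.5362) with the reported mean of 50.637 and compares the result with the uncentered fit (-1464.4, 88.31, -0.5362). Variable names are illustrative; small differences are expected because the inputs are rounded.

```python
# Centered-model estimates and predictor mean, copied from the output above.
b0_star, b1_star, b2_star = 1632.20, 34.000, -0.5362
x_bar = 50.637

# Expand b0* + b1*(x - x_bar) + b2*(x - x_bar)^2 in powers of x.
b0 = b0_star - b1_star * x_bar + b2_star * x_bar ** 2
b1 = b1_star - 2 * b2_star * x_bar
b2 = b2_star

print(f"b0 = {b0:.1f}  (uncentered fit reports -1464.4)")
print(f"b1 = {b1:.2f}  (uncentered fit reports 88.31)")
print(f"b2 = {b2:.4f}  (uncentered fit reports -0.5362)")
```

The two fits describe the same curve, which is why S, R-Sq, and the ANOVA table are identical in both outputs; only the coefficient parameterization and the VIFs change.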

  44. Model evaluation

  45. Model evaluation

  46. Model use: What is predicted IgG if maximal oxygen uptake is 90?
      Predicted Values for New Observations
      New Obs     Fit  SE Fit         95.0% CI         95.0% PI
            1  2139.6   219.2  (1689.8,2589.5)  (1639.6,2639.7) XX
      X denotes a row with X values away from the center
      XX denotes a row with very extreme X values
      Values of Predictors for New Observations
      New Obs  oxcent  oxcentsq
            1    39.4      1549
      There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.
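The fitted value in this output, and the warning about changes in direction, can both be checked from the centered coefficients. The sketch below evaluates the centered model at oxygen = 90 and also locates the turning point of the fitted parabola. The turning point is not stated in the presentation; it is derived here from the rounded coefficients, and the function name `predict_igg` is hypothetical.

```python
# Centered-model estimates and predictor mean, copied from the output above.
b0_star, b1_star, b2_star = 1632.20, 34.000, -0.5362
x_bar = 50.637


def predict_igg(oxygen):
    """Fitted IgG from the centered quadratic model."""
    oxcent = oxygen - x_bar
    return b0_star + b1_star * oxcent + b2_star * oxcent ** 2


fit_at_90 = predict_igg(90.0)

# A quadratic changes direction where its derivative is zero:
# b1* + 2 * b2* * oxcent = 0.
turning_point = x_bar - b1_star / (2 * b2_star)

print(f"fit at oxygen = 90: {fit_at_90:.1f}")
print(f"fitted curve turns downward at oxygen = {turning_point:.1f}")
```

Under these coefficients the fitted curve peaks below 90, so the prediction at 90 sits on the downward branch of the parabola, a region the XX flag marks as very extreme relative to the observed data.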

  47. The hierarchical approach to model fitting
      A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.
      Is a first-order linear model (“line”) adequate?

  48. The hierarchical approach to model fitting
      But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained. That is, if a quadratic term was significant, you would use this regression function:
      $\mu_Y = \beta_0 + \beta_1 x + \beta_2 x^2$
      and not this one:
      $\mu_Y = \beta_0 + \beta_2 x^2$
