Detecting and reducing multicollinearity
Detecting multicollinearity
Common methods of detection
• Realized effects (changes in coefficients, changes in standard errors of coefficients, changes in sequential sums of squares) of multicollinearity.
• Non-significant t-tests for all of the slopes but a significant overall F-test.
• Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).
• Variance inflation factors (VIF).
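The first variance at issue
For the model:
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$$
the variance of the estimated coefficient $b_k$ is:
$$\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the R² value obtained by regressing the kth predictor on the remaining predictors.

The second variance at issue
For the model:
$$y_i = \beta_0 + \beta_k x_{ik} + \epsilon_i$$
the variance of the estimated coefficient $b_k$ is:
$$\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}$$

Variance inflation factors
The variance inflation factor for the kth predictor is:
$$VIF_k = \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the R² value obtained by regressing the kth predictor on the remaining predictors.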
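The definition can be sketched in code: regress each predictor on the remaining predictors, take the resulting R², and compute one over one minus R². The sketch below assumes NumPy is available; the function name `vif` and the simulated predictors `x1`, `x2`, `x3` are illustrative inventions, not data from these slides.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n rows, p predictor columns)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = np.empty(p)
    for k in range(p):
        target = X[:, k]
        others = np.delete(X, k, axis=1)
        # Regress the kth predictor on the remaining predictors (with intercept).
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors[k] = 1.0 / (1.0 - r_squared)
    return factors

# Simulated (made-up) predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # the first two factors should be large, the third close to 1
```

Because each factor comes from its own auxiliary regression, a large value points to the specific predictor whose information is largely duplicated by the others.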
Variance inflation factors (VIFk)
• A measure of how much the variance of the estimated regression coefficient bk is “inflated” by the existence of correlation among the predictor variables in the model.
• VIFs exceeding 4 warrant investigation.
• VIFs exceeding 10 are signs of serious multicollinearity.
Blood pressure example
• n = 20 hypertensive individuals
• p-1 = 6 predictor variables
Blood pressure example

             BP    Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378   0.875
Duration  0.293  0.344   0.201  0.131
Pulse     0.721  0.619   0.659  0.465     0.402
Stress    0.164  0.368   0.034  0.018     0.312  0.506

Blood pressure (BP) is the response.
Regress y = BP on all 6 predictors

Predictor      Coef   SE Coef      T      P  VIF
Constant    -12.870     2.557  -5.03  0.000
Age         0.70326   0.04961  14.18  0.000  1.8
Weight      0.96992   0.06311  15.37  0.000  8.4
BSA           3.776     1.580   2.39  0.033  5.3
Dur         0.06838   0.04844   1.41  0.182  1.2
Pulse      -0.08448   0.05161  -1.64  0.126  4.4
Stress     0.005572  0.003412   1.63  0.126  1.8

S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       6  557.844  92.974  560.64  0.000
Residual Error  13    2.156   0.166
Total           19  560.000
Regress x2 = weight on 5 predictors

Predictor      Coef  SE Coef      T      P  VIF
Constant     19.674    9.465   2.08  0.057
Age         -0.1446   0.2065  -0.70  0.495  1.7
BSA          21.422    3.465   6.18  0.000  1.4
Dur          0.0087   0.2051   0.04  0.967  1.2
Pulse        0.5577   0.1599   3.49  0.004  2.4
Stress     -0.02300  0.01308  -1.76  0.101  1.5

S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       5  308.839  61.768  20.77  0.000
Residual Error  14   41.639   2.974
Total           19  350.478
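The variance inflation factor calculated by its definition
Using R-Sq = 88.1% from the regression of weight on the other five predictors:
$$VIF_{Weight} = \frac{1}{1 - R_{Weight}^2} = \frac{1}{1 - 0.881} = 8.40$$
The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.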
The pairwise correlations

             BP    Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378   0.875
Duration  0.293  0.344   0.201  0.131
Pulse     0.721  0.619   0.659  0.465     0.402
Stress    0.164  0.368   0.034  0.018     0.312  0.506

Blood pressure (BP) is the response.
Regress y = BP on age, weight, duration and stress

Predictor      Coef   SE Coef      T      P  VIF
Constant    -15.870     3.195  -4.97  0.000
Age         0.68374   0.06120  11.17  0.000  1.5
Weight      1.03413   0.03267  31.65  0.000  1.2
Dur         0.03989   0.06449   0.62  0.545  1.2
Stress     0.002184  0.003794   0.58  0.573  1.2

S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       4  555.45  138.86  458.28  0.000
Residual Error  15    4.55    0.30
Total           19  560.00
Data-based multicollinearity
• Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.
Some methods
• Modify the regression model by eliminating one or more predictor variables.
• Collect additional data under different experimental or observational conditions.
(Modified!) Allen Cognitive Level (ACL) Study
• Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:
• Response y = ACL score
• x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale
• x2 = abstraction (Abstract) score on Shipley Institute of Living Scale
• x3 = score on Symbol-Digit Modalities Test (SDMT)
Strong correlation between Vocab and Abstract Pearson correlation of Vocab and Abstract = 0.990
Regress y = ACL on SDMT, Vocab, and Abstract

Predictor     Coef  SE Coef      T      P   VIF
Constant     3.747    1.342   2.79  0.012
SDMT       0.02326  0.01273   1.83  0.083   1.7
Vocab       0.0283   0.1524   0.19  0.855  49.3
Abstract   -0.0138   0.1006  -0.14  0.892  50.6

S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3   3.6854  1.2285  2.28  0.112
Residual Error  19  10.2476  0.5393
Total           22  13.9330
Plot after having collected more data Pearson correlation of Vocab and Abstract = 0.698
Regress y = ACL on SDMT, Vocab, and Abstract

Predictor      Coef   SE Coef      T      P  VIF
Constant     3.9463    0.3381  11.67  0.000
SDMT       0.027404  0.007168   3.82  0.000  1.6
Vocab      -0.01740   0.01808  -0.96  0.339  2.1
Abstract    0.01218   0.01159   1.05  0.297  2.2

S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3  12.3009  4.1003  8.67  0.000
Residual Error  65  30.7487  0.4731
Total           68  43.0496
Reducing structural multicollinearity In context of polynomial regression models
Structural multicollinearity
• Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Example
• (General research question) What is the impact of exercise on the human immune system?
• (Specific research question) How is the amount of immunoglobulin in blood (y) related to maximal oxygen uptake (x)?
A quadratic polynomial regression function
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
where:
• yi = amount of immunoglobulin in blood (mg)
• xi = maximal oxygen uptake (ml/kg)
• typical assumptions about error terms (“INE”: independent, normal, equal variances)
Interpretation of the regression coefficients
• If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, interpretation of b0 is meaningless.
• b1 is the slope of the tangent line at x = 0.
• b2 indicates the up/down direction of the curve
  • b2 < 0 means curve is concave down
  • b2 > 0 means curve is concave up
Regress y = igg on oxygen and oxygen²

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor     Coef  SE Coef      T      P   VIF
Constant   -1464.4    411.4  -3.56  0.001
oxygen       88.31    16.47   5.36  0.000  99.9
oxygensq   -0.5362   0.1582  -3.39  0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029
Structural multicollinearity Pearson correlation of oxygen and oxygensq = 0.995
“Center” the predictors
Mean of oxygen = 50.637

oxygen   oxcent  oxcentsq
  34.6  -16.037   257.185
  45.0   -5.637    31.776
  62.3   11.663   136.026
  58.9    8.263    68.277
  42.5   -8.137    66.211
  44.3   -6.337    40.158
  67.9   17.263   298.011
  58.5    7.863    61.827
  35.6  -15.037   226.111
  49.6   -1.037     1.075
  33.0  -17.637   311.064
Wow! It really works! Pearson correlation of oxcent and oxcentsq = 0.219
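The effect of centering can be checked with a few lines of code. This sketch assumes NumPy and uses only the 11 oxygen values listed on the centering slide together with the reported mean of 50.637, so its correlations will differ somewhat from the 0.995 and 0.219 reported for the full data set of 30 observations. The variable names are illustrative.

```python
import numpy as np

# The 11 oxygen values listed on the centering slide (not the full 30 observations).
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3,
                   67.9, 58.5, 35.6, 49.6, 33.0])
mean_oxygen = 50.637  # mean of oxygen as reported on the slide

# Center the predictor, then square the centered values.
oxcent = oxygen - mean_oxygen
oxcentsq = oxcent ** 2

# Correlation between the predictor and its square, before and after centering.
r_raw = np.corrcoef(oxygen, oxygen ** 2)[0, 1]
r_centered = np.corrcoef(oxcent, oxcentsq)[0, 1]

print("raw:", r_raw)
print("centered:", r_centered)
```

Centering works here because the raw oxygen values are all positive, so x and x² rise together; after centering, negative and positive deviations both map to positive squares, which breaks the near-linear relationship.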
A better quadratic polynomial regression function
$$y_i = \beta_0^* + \beta_1^* x_i^* + \beta_2^* (x_i^*)^2 + \epsilon_i$$
where $x_i^* = x_i - \bar{x}$ denotes the centered predictor, and:
• yi = amount of immunoglobulin in blood (mg)
• typical assumptions about error terms (“INE”)
Regress y = igg on oxcent and oxcent²

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor     Coef  SE Coef      T      P  VIF
Constant   1632.20    29.35  55.61  0.000
oxcent      34.000    1.689  20.13  0.000  1.1
oxcentsq   -0.5362   0.1582  -3.39  0.002  1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029
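With exactly two predictors, the R² from regressing one on the other equals the squared pairwise correlation, so the VIFs in the two quadratic fits can be roughly recovered from the reported correlations alone. The sketch below is plain Python; `vif_from_correlation` is an illustrative helper name, and small differences from the printed VIFs are expected because the correlations are rounded to three decimals.

```python
def vif_from_correlation(r):
    """VIF in a two-predictor model, where R_k^2 equals the squared correlation r^2."""
    return 1.0 / (1.0 - r ** 2)

# Reported correlations: oxygen vs oxygensq, and oxcent vs oxcentsq.
vif_raw = vif_from_correlation(0.995)
vif_centered = vif_from_correlation(0.219)

print(vif_raw)       # close to the 99.9 reported for the uncentered fit
print(vif_centered)  # close to the 1.1 reported for the centered fit
```

The rest of the output (S, R-Sq, the ANOVA table, and the oxcentsq coefficient) is unchanged by centering, which is consistent with the two fits describing the same curve.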
Interpretation of the regression coefficients
• b0 is the predicted response at the predictor mean.
• b1 is the estimated slope of the tangent line at the predictor mean; and, often, similar to the estimated slope in the simple model.
• b2 indicates the up/down direction of the curve
  • b2 < 0 means curve is concave down
  • b2 > 0 means curve is concave up
Similar estimates of coefficients from first-order linear model
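The relationship between the two forms of the model
Centered model:
$$\hat{y}_i = b_0^* + b_1^* (x_i - \bar{x}) + b_2^* (x_i - \bar{x})^2$$
Original model:
$$\hat{y}_i = b_0 + b_1 x_i + b_2 x_i^2$$
where:
$$b_0 = b_0^* - b_1^* \bar{x} + b_2^* \bar{x}^2, \qquad b_1 = b_1^* - 2 b_2^* \bar{x}, \qquad b_2 = b_2^*$$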
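Expanding the centered quadratic and collecting powers of x gives b0 = b0* - b1*·x̄ + b2*·x̄², b1 = b1* - 2·b2*·x̄, and b2 = b2*. As a quick numerical check, the following plain-Python sketch plugs in the centered estimates and the mean of oxygen reported in the output; the variable names are illustrative, and the results should match the uncentered fit up to rounding.

```python
# Estimates from the centered fit and the reported mean of oxygen.
x_bar = 50.637
b0_star, b1_star, b2_star = 1632.20, 34.000, -0.5362

# Expand b0* + b1*(x - x_bar) + b2*(x - x_bar)^2 and collect powers of x.
b2 = b2_star
b1 = b1_star - 2.0 * b2_star * x_bar
b0 = b0_star - b1_star * x_bar + b2_star * x_bar ** 2

print(b0, b1, b2)  # compare with -1464.4, 88.31, -0.5362 from the uncentered fit
```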
Model use: What is predicted IgG if maximal oxygen uptake is 90?

Predicted Values for New Observations
New Obs     Fit  SE Fit           95.0% CI           95.0% PI
1        2139.6   219.2  (1689.8, 2589.5)  (1639.6, 2639.7) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
1          39.4      1549

There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.
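The "changes in direction" warning can be made concrete with the centered estimates. The sketch below, in plain Python with illustrative names (`predict_igg`, `turning_point`), reproduces the fitted value at an oxygen uptake of 90 and locates the point where the fitted parabola peaks. It uses the rounded coefficients from the output, so the fit agrees with the reported 2139.6 only approximately.

```python
# Estimates from the centered quadratic fit and the reported mean of oxygen.
x_bar = 50.637
b0_star, b1_star, b2_star = 1632.20, 34.000, -0.5362

def predict_igg(oxygen):
    """Fitted IgG from the centered quadratic model."""
    xc = oxygen - x_bar
    return b0_star + b1_star * xc + b2_star * xc ** 2

# The fitted curve is concave down (b2* < 0), so it peaks where the slope is zero:
# b1* + 2 * b2* * (x - x_bar) = 0.
turning_point = x_bar - b1_star / (2.0 * b2_star)

fit_at_90 = predict_igg(90.0)
fit_at_peak = predict_igg(turning_point)

print(turning_point)           # oxygen uptake at which the fitted curve turns downward
print(fit_at_90, fit_at_peak)  # the fit at 90 lies past the peak
```

Since 90 lies beyond the turning point, the model predicts that IgG falls as oxygen uptake keeps rising there, a pattern driven by the shape of the quadratic rather than by any observed data in that region.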
The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate. Is a first-order linear model (“line”) adequate?
The hierarchical approach to model fitting
But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained. That is, if the quadratic term is significant, you would use this regression function:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
and not this one:
$$y_i = \beta_0 + \beta_2 x_i^2 + \epsilon_i$$