Lecture 13: Multiple linear regression
• When and why we use it
• The general multiple regression model
• Hypothesis testing in multiple regression
• The problem of multicollinearity
• Multiple regression procedures
• Polynomial regression
• Power analysis in multiple regression
Some GLM procedures
[Table of GLM procedures and the types of dependent and independent variables each uses; * either categorical or treated as a categorical variable]
When do we use multiple regression?
[Figure: log production vs. log [P], and log production vs. log [P] and log [Zoo]]
• To examine the relationship between a continuous dependent variable (Y) and several continuous independent variables (X1, X2, …).
• e.g. the relationship between lake primary production, phosphorus concentration and zooplankton abundance.
The multiple regression model: general form
[Figure: regression plane of Y on X1 and X2, with observed Y, predicted Y-hat, and residual e]
• The general model is Yi = a + b1Xi1 + b2Xi2 + … + bkXik + ei, which defines a k-dimensional plane, where a = intercept, bj = partial regression coefficient of Y on Xj, Xij = value of the ith observation of independent variable Xj, and ei = residual of the ith observation.
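The lecture's examples use SYSTAT output; purely as an illustration (not part of the original slides), here is a minimal Python sketch of fitting this general model by ordinary least squares with statsmodels, using simulated data in place of real observations.

```python
# Illustrative sketch: fit Y_i = a + b1*X_i1 + b2*X_i2 + e_i by OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.4, size=n)  # simulated data

X = sm.add_constant(np.column_stack([x1, x2]))  # column of 1s gives the intercept a
fit = sm.OLS(y, X).fit()

print(fit.params)     # a, b1, b2: intercept and partial regression coefficients
print(fit.resid[:5])  # e_i: residuals of the first five observations
```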
What is the partial regression coefficient anyway?
[Figure: regressions of Y on X1 at fixed values of X2 (X2 = -3, -1, 1, 3); the partial regressions have a different slope than the simple (pooled) regression]
• bj is the rate of change in Y per unit change in Xj with all other variables held constant; this is not the slope of the regression of Y on Xj pooled over all other variables!
The effect of scale
[Figure: two regressions with the same proportional change in Y but slopes bj = 2 and bj = 0.02, because the Xj scales differ by a factor of 100]
• Two independent variables on different scales will have different slopes, even if the proportional change in Y is the same.
• So, if we want to measure the relative strength of the influence of each variable on Y, we must eliminate the effect of different scales.
The multiple regression model: standardized form
• Since bj depends on the scale of Xj, to examine the relative effect of each independent variable we must standardize the regression coefficients: z-transform all variables (subtract the mean, divide by the standard deviation) and fit the regression model to the transformed variables.
• The standardized coefficients bj* estimate the relative strength of the influence of variable Xj on Y.
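As a sketch of what "standardize" means here (illustrative Python, not from the lecture): z-transform Y and each Xj and refit; the slopes of the refitted model are the standardized coefficients, equivalently bj* = bj·sXj/sY.

```python
# Illustrative sketch: standardized partial regression coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2)) * np.array([1.0, 100.0])   # predictors on very different scales
y = 2.0 + 0.8 * X[:, 0] + 0.008 * X[:, 1] + rng.normal(scale=0.5, size=n)

def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

fit_raw = sm.OLS(y, sm.add_constant(X)).fit()
fit_std = sm.OLS(zscore(y), sm.add_constant(zscore(X))).fit()

print(fit_raw.params[1:])   # b1, b2 depend on the scale of each Xj
print(fit_std.params[1:])   # b1*, b2* are scale-free and directly comparable
print(fit_raw.params[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1))  # check: bj* = bj*sXj/sY
```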
Regression coefficients: summary
• Partial regression coefficient: equals the slope of the regression of Y on Xj when all other independent variables are held constant.
• Standardized partial regression coefficient: the rate of change of Y in standard deviation units per one standard deviation of Xj, with all other independent variables held constant.
Assumptions
• independence of residuals
• homoscedasticity of residuals
• linearity (Y on all X)
• no error on independent variables
• normality of residuals
Hypothesis testing in simple linear regression: partitioning the total sums of squares
• Total SS = Model (Explained) SS + Unexplained (Error) SS
Hypothesis testing in multiple regression I: partitioning the total sums of squares
• Partition the total sums of squares into model and residual SS:
  SStotal = SSmodel + SSresidual, i.e. Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²
Hypothesis testing I: partitioning the total sums of squares
• If observed = expected for all i (a perfect fit), MSerror = 0 and the model accounts for essentially all of the variance in Y (s²Y).
• Calculate F = MSmodel / MSerror and compare with the F distribution with k and N - k - 1 df (1 and N - 2 df in simple regression).
• Under H0 (no effect of the independent variables), the expected value of F is approximately 1.
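A worked sketch of the partition and the F-ratio (illustrative Python; statsmodels exposes the explained SS as `ess` and the residual SS as `ssr`):

```python
# Illustrative sketch: SS_total = SS_model + SS_error, and F = MS_model / MS_error.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n, k = 28, 3
X = rng.normal(size=(n, k))
y = 0.3 + X @ np.array([0.5, 0.0, -0.4]) + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
ss_total = np.sum((y - y.mean()) ** 2)
ss_model, ss_error = fit.ess, fit.ssr
ms_model = ss_model / k                 # df_model = k
ms_error = ss_error / (n - k - 1)       # df_error = N - k - 1
F = ms_model / ms_error
p = stats.f.sf(F, k, n - k - 1)

print(np.isclose(ss_total, ss_model + ss_error))   # the partition holds
print(F, p, fit.fvalue, fit.f_pvalue)              # agrees with statsmodels' own F-test
```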
Hypothesis testing II: testing individual partial regression coefficients
[Figure: regression of Y on X1 with X2 fixed (H01: b1 = 0, rejected) and regression of Y on X2 with X1 fixed (H02: b2 = 0, accepted)]
• Test each partial regression coefficient with a t-test: t = bj / SE(bj), compared with the t distribution with N - k - 1 df.
• Note: these are 2-tailed hypotheses!
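A sketch of these coefficient-level t-tests (illustrative Python): each statistic is the coefficient divided by its standard error, referred to a t distribution with N - k - 1 df.

```python
# Illustrative sketch: t-test of each partial regression coefficient, t_j = b_j / SE(b_j).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n, k = 28, 2
X = rng.normal(size=(n, k))
y = 1.0 + 0.6 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
t = fit.params / fit.bse                       # b_j / SE(b_j)
p = 2 * stats.t.sf(np.abs(t), df=n - k - 1)    # two-tailed p-values
print(t, p)                                    # same as fit.tvalues, fit.pvalues
```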
Multicollinearity
[Figure: X1 and X2 collinear, X2 and X3 independent; variance-covariance matrix of the X's]
• Independent variables are correlated, and therefore not independent: evaluate by looking at the covariance or correlation matrix.
Multicollinearity: problems
• If two independent variables X1 and X2 are uncorrelated, then the model sums of squares for a linear model with both included equals the sum of the SSmodel for each considered separately.
• But if they are correlated, the former will be less than the latter.
• So, the real question is: given a model with X1 included, how much does SSmodel increase when X2 is also included (or vice versa)?
Multicollinearity: consequences
• inflated standard errors for regression coefficients
• sensitivity of parameter estimates to small changes in the data
• But estimates of partial regression coefficients remain unbiased.
• One or more independent variables may not appear in the final regression model, not because they do not covary with Y, but because they covary with another X.
Detecting multicollinearity
• high R² but few or no significant t-tests for individual independent variables
• high pairwise correlations between X's
• high partial correlations among regressors (independent variables are a linear combination of others)
• eigenvalues, condition index, tolerance and variance inflation factors (see the sketch below)
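An illustrative sketch of some of these diagnostics (Python, not lecture material): pairwise correlations, variance inflation factors, and tolerance (VIF = 1/tolerance) for a deliberately collinear set of predictors.

```python
# Illustrative sketch: correlation matrix, VIF, and tolerance.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)   # deliberately collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

print(np.corrcoef(X, rowvar=False))        # high pairwise correlation between X1 and X2

Xc = sm.add_constant(X)
vif = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]
tolerance = [1.0 / v for v in vif]         # tolerance = 1 - R^2 of Xj on the other X's
print(vif)        # large VIFs flag multicollinearity
print(tolerance)  # small tolerances flag multicollinearity
```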
Quantifying the effect of multicollinearity
[Figure: eigenvectors E1 and E2, with eigenvalues λ1 and λ2, in the X1-X2 plane]
• Eigenvectors: a set of "lines" E1, E2, …, Ek in a k-dimensional space which are orthogonal to each other.
• Eigenvalue: the magnitude (length) λ of the corresponding eigenvector.
Quantifying the effect of multicollinearity
[Figure: low correlation between X1 and X2 gives λ1 ≈ λ2; high correlation gives λ1 >> λ2]
• Eigenvalues: if all k eigenvalues are approximately equal, multicollinearity is low.
• Condition index: √(λlargest / λsmallest); values near 1 indicate low multicollinearity.
• Tolerance: 1 - the proportion of variance in each independent variable accounted for by all other independent variables; values near 1 indicate low multicollinearity.
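A sketch of the eigenvalue-based diagnostics (illustrative Python): the eigenvalues of the correlation matrix of the X's and the condition index √(λlargest/λsmallest).

```python
# Illustrative sketch: eigenvalues of the X correlation matrix and the condition index.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # highly correlated with x1
x3 = rng.normal(size=n)
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

eigvals = np.linalg.eigvalsh(R)                      # eigenvalues of the correlation matrix
cond_index = np.sqrt(eigvals.max() / eigvals.min())  # sqrt(lambda_largest / lambda_smallest)
print(eigvals)      # very unequal eigenvalues -> high multicollinearity
print(cond_index)   # values near 1 indicate low multicollinearity
```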
Remedial measures
• Get more data to reduce correlations.
• Drop some variables.
• Use principal component or ridge regression, which yield biased estimates but with smaller standard errors (a ridge sketch follows).
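As a sketch of the ridge option (illustrative; scikit-learn's Ridge is used here, and the penalty alpha = 1.0 is an arbitrary choice, not a recommendation from the lecture):

```python
# Illustrative sketch: ridge regression shrinks coefficients, trading a little bias
# for smaller variance under strong collinearity.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)      # strongly collinear predictors
X = np.column_stack([x1, x2])
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)             # alpha chosen arbitrarily for illustration
print(ols.coef_)     # unstable under collinearity
print(ridge.coef_)   # shrunken, more stable estimates
```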
Multiple regression: the general idea
[Diagram: Model A (X1 in) vs. Model B (X1 out); compute the change in model fit ΔMF (e.g. ΔR²); retain X1 if Δ is large, delete X1 if Δ is small]
• Evaluate the significance of a variable by fitting two models: one with the term in, the other with it removed.
• Test for the change in model fit (ΔMF) associated with removal of the term in question.
• Unfortunately, ΔMF may depend on what other variables are in the model if there is multicollinearity!
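A sketch of this model-comparison idea (illustrative Python): fit the model with and without the term and test the change in fit with a partial F-test; statsmodels' compare_f_test does exactly this for nested models.

```python
# Illustrative sketch: change in fit when one term is removed (model A vs. model B).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 0.7 * x1 + 0.4 * x2 + rng.normal(scale=0.5, size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()   # model A (X1 in)
reduced = sm.OLS(y, sm.add_constant(x2)).fit()                       # model B (X1 out)

f_stat, p_value, df_diff = full.compare_f_test(reduced)
delta_r2 = full.rsquared - reduced.rsquared                          # change in R^2
print(delta_r2, f_stat, p_value)   # retain X1 if the change in fit is significant
```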
Fitting multiple regression models
• Goal: find the "best" model, given the available data.
• Problem 1: what is "best"?
  • highest R²?
  • lowest RMS?
  • highest R² that contains only individually significant independent variables?
  • maximum R² with the minimum number of independent variables?
Selection of independent variables (cont'd)
• Problem 2: even if "best" is defined, by what method do we find it?
• Possibilities:
  • compute all possible models (2^k - 1 of them) and choose the best one
  • use some procedure for winnowing down the set of possible models
Strategy I: computing all possible models
[Diagram: all possible subsets of {X1, X2, X3}: {X1}, {X2}, {X3}, {X1,X2}, {X1,X3}, {X2,X3}, {X1,X2,X3}]
• Compute all possible models and choose the "best" one.
• Cons: time-consuming; leaves the definition of "best" to the researcher.
• Pros: if the "best" model is defined, you will find it!
Strategy II: forward selection
[Diagram: with r2 > r1 > r3, start with {X2}; add X1 if R²12 > R²2 (otherwise stop with R² = R²2); add X3 if R²123 > R²12 (otherwise the final model is {X1, X2} with R² = R²12)]
• Start with the variable that has the highest (significant) R², i.e. the highest partial correlation coefficient r.
• Add the others one at a time until there is no further significant increase in R², with the bj recomputed at each step.
• Problem: once Xj is included, it stays in, even if it contributes little to SSmodel once other variables are included.
Forward selection: order of entry
[Example: p to enter = .05, r2 > r1 > r3 > r4. Begin with {X2} (p[F(X2)] = .001). Candidate additions: p[F(X2, X1)] = .002, p[F(X2, X3)] = .04, p[F(X2, X4)] = .55, so X1 enters next and X4 is eliminated …]
• Begin with the variable with the highest partial correlation coefficient.
• The next entry is the variable that gives the largest increase in overall R², by an F-test of the significance of the increase, above some specified F-to-enter (below a specified p-to-enter) value. (An illustrative code sketch of this loop follows.)
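Purely as an illustration (this is not the SYSTAT procedure used in the example later in the lecture), a bare-bones forward-selection loop driven by the p-value of the partial F-test for each candidate entry might look like this:

```python
# Illustrative sketch: naive forward selection with a p-to-enter criterion.
import numpy as np
import statsmodels.api as sm

def forward_select(y, X, p_to_enter=0.05):
    """Greedy forward selection; X is an (n, k) array. Returns selected column indices."""
    selected, remaining = [], list(range(X.shape[1]))
    current = sm.OLS(y, np.ones((len(y), 1))).fit()          # intercept-only model
    while remaining:
        trials = []
        for j in remaining:
            cand = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            _, p, _ = cand.compare_f_test(current)           # F-test of the increase in R^2
            trials.append((p, j, cand))
        p, j, cand = min(trials, key=lambda t: t[0])         # best candidate = smallest p
        if p >= p_to_enter:
            break                                            # no further significant increase
        selected.append(j)
        remaining.remove(j)
        current = cand
    return selected, current

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))
y = 1.0 + 0.8 * X[:, 1] - 0.5 * X[:, 3] + rng.normal(scale=0.5, size=60)
idx, model = forward_select(y, X)
print(idx, model.rsquared)   # should pick up columns 1 and 3
```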
Strategy III: backward selection
[Diagram: with r2 < r1 < r3, start with {X1, X2, X3} (R² = R²123); drop X2 if R²13 = R²123 (no significant reduction); then drop X1 if R²3 = R²13; stop when any further removal significantly reduces R²]
• Start with all variables.
• Drop variables whose removal does not significantly reduce R², one at a time, starting with the one with the lowest partial correlation coefficient.
• But once Xj is dropped, it stays out, even if it explains a significant amount of the remaining variability once other variables are excluded.
Backward selection: order of removal
[Example: p to remove = .10, r2 > r1 > r3 > r4. Start with {X1, X2, X3, X4}: p[F(X2, X1, X3)] = .44, so X4 is removed and X2, X3, X1 stay in. Next, p[F(X2, X1)] = .25, p[F(X1, X3)] = .009, p[F(X2, X3)] = .001, so X3 is removed and X1, X2 stay in …]
• Begin with the variable with the smallest partial correlation coefficient.
• The next removal is the variable whose removal gives the smallest reduction in overall R², by an F-test of the significance of the reduction, below some specified F-to-remove (above a specified p-to-remove) value.
Strategy IV: stepwise selection
[Example: p to enter = .10, p to remove = .05, r2 > r1 > r4 > r3. p[F(X2)] = .001; candidate additions: p[F(X2, X1)] = .002, p[F(X2, X4)] = .03, p[F(X2, X3)] = .09; then p[F(X1, X2, X4)] = .02, p[F(X1, X2, X3)] = .19 …]
• Once a variable is included (removed), the set of remaining variables is scanned for other variables that should now be deleted (included), including those added (removed) at earlier stages.
• To avoid infinite loops, we usually set p to enter > p to remove.
Example
• Log of herptile species richness (logherp) as a function of log wetland area (logarea), the percentage of land within 1 km covered in forest (cpfor2), and the density of hard-surface roads within 1 km (thtden).
Example (all variables)

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: .490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.    SE      STD COEF.   TOL.     T        P
CONSTANT    0.285    0.191    0.000      .         1.488   0.150
LOGAREA     0.228    0.058    0.551      0.978     3.964   0.001
CPFOR2      0.001    0.001    0.123      0.744     0.774   0.447
THTDEN     -0.036    0.016   -0.365      0.732    -2.276   0.032
Example (cont'd)

ANALYSIS OF VARIANCE
SOURCE       SS      DF   MS      F-RATIO   P
REGRESSION   0.760    3   0.253    9.662    0.000
RESIDUAL     0.629   24   0.026
Example: forward stepwise

DEPENDENT VARIABLE: LOGHERP
MINIMUM TOLERANCE FOR ENTRY INTO MODEL = .010000
FORWARD STEPWISE WITH ALPHA-TO-ENTER = .10 AND ALPHA-TO-REMOVE = .05

STEP # 0   R = .000   RSQUARE = .000

VARIABLE       COEFF.   SE.   STD COEF.   TOL.     F        'P'
IN
 1 CONSTANT
OUT            PART. CORR
 2 LOGAREA      0.596   .     .           .1E+01   14.321   0.001
 3 CPFOR2       0.305   .     .           .1E+01    2.662   0.115
 4 THTDEN      -0.496   .     .           .1E+01    8.502   0.007
Forward stepwise (cont'd)

STEP # 1   R = .596   RSQUARE = .355
TERM ENTERED: LOGAREA

VARIABLE       COEFF.   SE.     STD COEF.   TOL.     F        'P'
IN
 1 CONSTANT
 2 LOGAREA      0.247   0.065   0.596       .1E+01   14.321   0.001
OUT            PART. CORR
 3 CPFOR2       0.382   .       .           0.99      4.273   0.049
 4 THTDEN      -0.529   .       .           0.98      9.725   0.005
Forward stepwise (cont'd)

STEP # 2   R = .732   RSQUARE = .536
TERM ENTERED: THTDEN

VARIABLE       COEFF.   SE.     STD COEF.   TOL.      F        'P'
IN
 1 CONSTANT
 2 LOGAREA      0.225   0.057    0.542      0.98      15.581   0.001
 4 THTDEN      -0.042   0.013   -0.428      0.98       9.725   0.005
OUT            PART. CORR
 3 CPFOR2       0.156   .        .          0.74380    0.599   0.447
Forward stepwise: final model

FORWARD STEPWISE: P TO INCLUDE = .15
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
ADJUSTED SQUARED MULTIPLE R: .490   STANDARD ERROR OF ESTIMATE: 0.161

VARIABLE   COEFF.    SE      STD COEF.   TOL.     T        P
CONSTANT    0.376    0.149    0.000      .         2.521   0.018
LOGAREA     0.225    0.057    0.542      0.984     3.947   0.001
THTDEN     -0.042    0.013   -0.428      0.984    -3.118   0.005
Example: backward stepwise (final model)

BACKWARD STEPWISE: P TO REMOVE = .15
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
ADJUSTED SQUARED MULTIPLE R: .499   STANDARD ERROR OF ESTIMATE: 0.161

VARIABLE   COEFF.    SE      STD COEF.   TOL.     T        P
CONSTANT    0.376    0.149    0.000      .         2.521   0.018
LOGAREA     0.225    0.057    0.542      0.984     3.947   0.001
THTDEN     -0.042    0.013   -0.428      0.984    -3.118   0.005
Example: subset model

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.670   SQUARED MULTIPLE R: 0.449
ADJUSTED SQUARED MULTIPLE R: .405   STANDARD ERROR OF ESTIMATE: 0.175

VARIABLE   COEFF.    SE      STD COEF.   TOL.     T        P
CONSTANT    0.027    0.167    0.000      .         0.162   0.872
LOGAREA     0.248    0.062    0.597      1.000     4.022   0.000
CPFOR2      0.003    0.001    0.307      1.000     2.067   0.049
What if the relationship between Y and one or more X's is nonlinear?
• Option 1: transform the data.
• Option 2: use non-linear regression.
• Option 3: use polynomial regression.
The polynomial regression model
[Figure: black fly biomass (mg DM/m²) vs. current velocity (cm/s), with a linear and a 2nd-order polynomial fit]
• In polynomial regression, the regression model includes terms of increasingly higher powers of the independent variable, e.g. Y = a + b1X + b2X² + … + e.
The polynomial regression model: procedure
• Fit the simple linear model.
• Fit the model with a quadratic term; test for the increase in SSmodel (see the sketch below).
• Continue with higher orders (cubic, quartic, etc.) until there is no further significant increase in SSmodel.
• Include terms of order up to the number of points of inflexion plus 1.
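A sketch of this linear-vs-quadratic comparison (illustrative Python, with simulated data loosely mimicking the biomass example):

```python
# Illustrative sketch: polynomial regression -- test the gain from adding X^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 60
x = rng.uniform(10, 110, size=n)                       # e.g. current velocity (cm/s)
y = 2.0 + 0.08 * x - 0.0006 * x**2 + rng.normal(scale=0.3, size=n)

linear = sm.OLS(y, sm.add_constant(x)).fit()
quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

f_stat, p_value, _ = quadratic.compare_f_test(linear)  # significance of the added x^2 term
print(linear.rsquared, quadratic.rsquared, p_value)
# Continue with cubic, quartic, ... only while the increase in SS_model remains significant.
```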
Polynomial regression: caveats
[Figure: Y = a + b1X1 - b2X1², showing the fitted curve over the range of X1]
• The biological significance of the higher-order terms in a polynomial regression (if any) is generally not known.
• By definition, polynomial terms are strongly correlated; hence standard errors will be large (precision is low), and increase with the order of the term.
• Extrapolation of polynomial models is always nonsense.
Power analysis in GLM (including MR)
• In any GLM, hypotheses are tested by means of an F-test.
• Remember: the appropriate SSerror and dferror depend on the type of analysis and the hypothesis under investigation.
• Knowing F, we can compute R², the proportion of the total variance in Y explained by the factor (source) under consideration.
Partial and total R²
[Venn diagram: proportion of variance accounted for by both A and B (R²Y•A,B); by B independent of A (R²Y•A,B - R²Y•A, the partial R²); and by A only (R²Y•A, the total R² for A)]
• The total R² (R²Y•B) is the proportion of variance in Y accounted for (explained) by a set of independent variables B.
• The partial R² (R²Y•A,B - R²Y•A) is the proportion of variance in Y accounted for by B when the variance accounted for by another set A is removed.
Partial and total R²
[Venn diagram: proportion of variance accounted for by B (R²Y•B, the total R²) vs. the proportion independent of A (R²Y•A,B - R²Y•A, the partial R²)]
• The total R² (R²Y•B) for set B equals the partial R² (R²Y•A,B - R²Y•A) with respect to set B if either (1) the total R² for A (R²Y•A) is zero, or (2) A and B are independent (in which case R²Y•A,B = R²Y•A + R²Y•B).
Partial and total R² in multiple regression
[Figure: log production vs. log [P] and log [Zoo]]
• Suppose we have three independent variables X1, X2 and X3.
• For example, R²Y•123 is then the total R² for {X1, X2, X3}, and R²Y•123 - R²Y•12 is the partial R² for X3 over and above X1 and X2.
Defining effect size in multiple regression
• The effect size, denoted f², is given by the ratio of the factor (source) R²factor and the appropriate error R²error: f² = R²factor / R²error.
• Note: both R²factor and R²error depend on the null hypothesis under investigation.
Defining effect size in multiple regression: case 1
• Case 1: a set B of variables {X1, X2, …} is related to Y, and the total R² (R²Y•B) is determined. The error variance proportion is then 1 - R²Y•B, so f² = R²Y•B / (1 - R²Y•B).
• H0: R²Y•B = 0
• Example: effect of wetland area, surrounding forest cover, and surrounding road density on herptile species richness in southeastern Ontario wetlands; B = {LOGAREA, CPFOR2, THTDEN}.
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: .490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.    SE      STD COEF.   TOL.     T        P
CONSTANT    0.285    0.191    0.000      .         1.488   0.150
LOGAREA     0.228    0.058    0.551      0.978     3.964   0.001
CPFOR2      0.001    0.001    0.123      0.744     0.774   0.447
THTDEN     -0.036    0.016   -0.365      0.732    -2.276   0.032
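As an illustrative calculation (Python; using R² = 0.547, N = 28 and k = 3 from this output): f² = R²/(1 - R²), and the power of the overall F-test can be obtained from the noncentral F distribution with noncentrality λ = f²(u + v + 1).

```python
# Illustrative sketch (case 1): effect size f^2 and power for H0: R2(Y|B) = 0,
# using R^2 = 0.547, N = 28, k = 3 predictors from the example output.
from scipy.stats import f, ncf

r2 = 0.547
n, k = 28, 3
f2 = r2 / (1.0 - r2)                 # effect size: R^2_factor / R^2_error

u, v = k, n - k - 1                  # numerator and denominator df
alpha = 0.05
lam = f2 * (u + v + 1)               # noncentrality parameter
f_crit = f.ppf(1 - alpha, u, v)      # critical F under H0
power = ncf.sf(f_crit, u, v, lam)    # P(F > F_crit) under the alternative
print(f2, power)
```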
Defining effect size in multiple regression: case 2
• Case 2: the proportion of variance of Y due to B over and above that due to A is determined (R²Y•A,B - R²Y•A). The error variance proportion is then 1 - R²Y•A,B, so f² = (R²Y•A,B - R²Y•A) / (1 - R²Y•A,B).
• H0: R²Y•A,B - R²Y•A = 0
• Example: herptile richness in southeastern Ontario wetlands; B = {THTDEN}, A = {LOGAREA, CPFOR2}, AB = {LOGAREA, CPFOR2, THTDEN}.
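The corresponding illustrative calculation for case 2, using R²Y•A,B = 0.547 (full model) and R²Y•A = 0.449 (the {LOGAREA, CPFOR2} subset model shown earlier):

```python
# Illustrative sketch (case 2): effect size and power for the added contribution of
# B = {THTDEN} over A = {LOGAREA, CPFOR2}: f^2 = (R2_AB - R2_A) / (1 - R2_AB).
from scipy.stats import f, ncf

r2_ab, r2_a = 0.547, 0.449           # full model vs. subset model from the example
n, k_a, k_b = 28, 2, 1
f2 = (r2_ab - r2_a) / (1.0 - r2_ab)

u, v = k_b, n - k_a - k_b - 1        # df for the partial F-test
alpha = 0.05
lam = f2 * (u + v + 1)               # noncentrality parameter
f_crit = f.ppf(1 - alpha, u, v)
power = ncf.sf(f_crit, u, v, lam)
print(f2, power)
```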