
Regression: (2) Multiple Linear Regression and Path Analysis

Presentation Transcript


  1. Regression: (2) Multiple Linear Regression and Path Analysis Hal Whitehead BIOL4062/5062

  2. Multiple Linear Regression and Path Analysis • Multiple linear regression • assumptions • parameter estimation • hypothesis tests • selecting independent variables • collinearity • polynomial regression • Path analysis

  3. Regression • One dependent variable: Y • Independent variables: X1, X2, X3, ...

  4. Purposes of Regression 1. Relationship between Y and X's 2. Quantitative prediction of Y 3. Relationship between Y and X controlling for C 4. Which of X's are most important? 5. Best mathematical model 6. Compare regression relationships: Y1 on X, Y2 on X 7. Assess interactive effects of X's

  5. Simple regression: one X • Multiple regression: two or more X's Y = ß0 + ß1X(1) + ß2X(2) + ß3X(3) + ... + ßkX(k) + E

  6. Multiple linear regression: assumptions (1) • For any specific combination of X's, Y is a (univariate) random variable with a certain probability distribution having finite mean and variance (Existence) • Y values are statistically independent of one another (Independence) • Mean value of Y given the X's is a linear (straight-line) function of the X's (Linearity)

  7. Multiple linear regression: assumptions (2) • The variance of Y is the same for any fixed combination of X's (Homoscedasticity) • For any fixed combination of X's, Y has a normal distribution (Normality) • There are no measurement errors in the X's (X's measured without error)
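
The normality and homoscedasticity assumptions can be checked informally from the residuals of a fitted model. Below is a minimal sketch (not from the original slides), assuming Python with numpy, scipy and statsmodels available; the data are simulated and all variable names are hypothetical.

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))                                # three X variables
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()                  # ordinary least squares
resid = fit.resid

# Normality of residuals (Shapiro-Wilk): a small p-value suggests non-normality
print("Shapiro-Wilk p =", stats.shapiro(resid).pvalue)

# Homoscedasticity (Breusch-Pagan): a small p-value suggests unequal variances
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(resid, sm.add_constant(X))
print("Breusch-Pagan p =", lm_p)
```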

  8. Multiple linear regression: parameter estimation Y = ß0 + ß1X(1) + ß2X(2) + ß3X(3) + ... + ßkX(k) + E • Estimate the ß's in multiple regression using least squares • Sizes of the coefficients not good indicators of importance of X variables • Number of data points in multiple regression • at least one more than number of X’s • preferably 5 times number of X’s
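
As an illustration of least-squares estimation of the ß's (not part of the original slides), here is a minimal numpy sketch on simulated data; the data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3                                    # n data points, k X variables (n well above 5*k)
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([0.8, 0.0, -1.2]) + rng.normal(scale=0.3, size=n)

design = np.column_stack([np.ones(n), X])       # column of 1's estimates the intercept ß0
beta_hat, resid_ss, rank, sv = np.linalg.lstsq(design, y, rcond=None)
print("least-squares estimates (ß0, ß1, ..., ßk):", np.round(beta_hat, 2))
# Note: coefficient sizes depend on the units of the X's, so they are not,
# by themselves, measures of variable importance.
```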

  9. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004) Multiple regression of Y [Log(CNS)] on:
     X's          ß      SE(ß)
     Log(Mass)   -0.49   (0.70)
     Log(Fat)    -0.07   (0.10)
     Log(Muscle)  1.03   (0.54)
     Log(Heart)   0.42   (0.22)
     Log(Bone)   -0.07   (0.30)
     N = 39

  10. Multiple linear regression: hypothesis tests Usually test: H0: Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ... + ßj⋅X(j) + E H1: Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ... + ßj⋅X(j) + ... + ßk⋅X(k) + E F-test with (k-j, n-k-1) degrees of freedom (“partial F-test”) H0: variables X(j+1),…,X(k) do not help explain variability in Y

  11. Multiple linear regression: hypothesis tests e.g. Test significance of overall multiple regression H0: Y = ß0 + E H1: Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ... + ßk⋅X(k) + E • Test significance of • adding independent variable • deleting independent variable
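
A minimal sketch (not from the slides) of the partial F-test on simulated data, comparing a reduced model with j X's against a full model with k X's using the (k-j, n-k-1) degrees of freedom given above; it assumes Python with numpy and scipy, and all names are hypothetical.

```python
import numpy as np
from scipy import stats

def sse(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

rng = np.random.default_rng(3)
n, k, j = 60, 4, 2                                        # k X's in the full model, j in the reduced
X = rng.normal(size=(n, k))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)    # X(3), X(4) truly irrelevant

full = np.column_stack([np.ones(n), X])                   # intercept plus all k X's
reduced = np.column_stack([np.ones(n), X[:, :j]])         # intercept plus the first j X's

df1, df2 = k - j, n - k - 1
F = ((sse(reduced, y) - sse(full, y)) / df1) / (sse(full, y) / df2)
p = stats.f.sf(F, df1, df2)
print(f"partial F = {F:.2f}, df = ({df1}, {df2}), p = {p:.3f}")
```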

  12. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004) Multiple regression of Y [Log(CNS)] on:
     X's          ß      SE(ß)   P
     Log(Mass)   -0.49   (0.70)  0.49
     Log(Fat)    -0.07   (0.10)  0.52
     Log(Muscle)  1.03   (0.54)  0.07
     Log(Heart)   0.42   (0.22)  0.06
     Log(Bone)   -0.07   (0.30)  0.83
     Each P tests whether removal of that variable reduces the fit.

  13. Multiple linear regression: selecting independent variables • Reasons for selecting a subset of independent variables (X’s): • cost (financial and other) • simplicity • improved prediction • improved explanation

  14. Multiple linear regression: selecting independent variables • Partial F-test • predetermined forward selection • forward selection based upon improvement in fit • backward selection based upon improvement in fit • stepwise (backward/forward) • Mallows’ C(p) • AIC

  15. Multiple linear regression: selecting independent variables • Partial F-test • predetermined forward selection • Mass, Bone, Heart, Muscle, Fat • forward selection based upon improvement in fit • backward selection based upon improvement in fit • stepwise (backward/forward)

  16. Multiple linear regression: selecting independent variables • Partial F-test • predetermined forward selection • forward selection based upon improvement in fit • backward selection based upon improvement in fit • stepwise (backward/forward)
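
The sketch below (not from the slides) illustrates forward selection driven by the partial F-test p-value, with α-to-enter = 0.15 as in the example on the next slides; it assumes Python with numpy and scipy, and the data, column indices and function names are hypothetical.

```python
import numpy as np
from scipy import stats

def sse(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ beta
    return float(r @ r)

def forward_select(X, y, alpha_enter=0.15):
    """Add, one at a time, the X with the smallest partial-F p-value below alpha_enter."""
    n, m = X.shape
    chosen = []
    while True:
        remaining = [i for i in range(m) if i not in chosen]
        if not remaining:
            break
        base = np.column_stack([np.ones(n)] + [X[:, i] for i in chosen])
        best = None
        for i in remaining:
            cand = np.column_stack([base, X[:, i]])
            df2 = n - cand.shape[1]                        # residual df of the candidate model
            F = (sse(base, y) - sse(cand, y)) / (sse(cand, y) / df2)
            p = stats.f.sf(F, 1, df2)
            if best is None or p < best[1]:
                best = (i, p)
        if best[1] < alpha_enter:
            chosen.append(best[0])                         # enter the best X and continue
        else:
            break
    return chosen

# Simulated example: only the first two columns of X truly matter
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5))
y = 2 + 1.5 * X[:, 0] - X[:, 1] + rng.normal(size=80)
print("selected columns:", forward_select(X, y))
```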

  17. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004) • Complete model (r2=0.97) • Forward stepwise (α-to-enter=0.15; α-to-remove=0.15): • 1. Constant (r2=0.00) • 2. Constant + Muscle (r2=0.97) • 3. Constant + Muscle + Heart (r2=0.97) • 4. Constant + Muscle + Heart + Mass (r2=0.97) • Selected model: Log(CNS) = -0.18 - 0.82·Log(Mass) + 1.24·Log(Muscle) + 0.39·Log(Heart)

  18. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004) • Complete model (r2=0.97) • Backward stepwise (α-to-enter=0.15; α-to-remove=0.15): • 1. All (r2=0.97) • 2. Remove Bone (r2=0.97) • 3. Remove Fat (r2=0.97) • Selected model: Log(CNS) = -0.18 - 0.82·Log(Mass) + 1.24·Log(Muscle) + 0.39·Log(Heart)

  19. Comparing models • Mallows’ C(p) • C(p) = (k-p)·F(p) + (2p-k+1) • k X’s in the full model; p X’s in the restricted model • F(p) is the F value comparing the fit of the restricted model with that of the full model • Lowest C(p) indicates the best model • Akaike Information Criterion (AIC) • AIC = n·log(σ̂²) + 2p, where σ̂² = SSE/n • Lowest AIC indicates the best model • Can compare models that are not nested within one another
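
A minimal sketch (not from the slides) of computing C(p) and AIC with the formulas above for a few nested candidate subsets, assuming Python with numpy; the data and subsets are hypothetical. Whether p also counts the intercept only shifts every AIC by the same constant, so the ranking of models is unchanged.

```python
import numpy as np

def sse(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ beta
    return float(r @ r)

rng = np.random.default_rng(5)
n, k = 60, 4
X = rng.normal(size=(n, k))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

full = np.column_stack([np.ones(n), X])
sigma2_full = sse(full, y) / (n - k - 1)                  # error variance from the full model

for cols in ([0], [0, 1], [0, 1, 2], [0, 1, 2, 3]):       # candidate subsets of X's
    p = len(cols)
    restricted = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    if p < k:
        Fp = ((sse(restricted, y) - sse(full, y)) / (k - p)) / sigma2_full
        Cp = (k - p) * Fp + (2 * p - k + 1)               # slide's formula
    else:
        Cp = p + 1                                        # the formula reduces to p + 1 for the full model
    aic = n * np.log(sse(restricted, y) / n) + 2 * p      # n·log(SSE/n) + 2p, per the slide
    print(f"X's {cols}: C(p) = {Cp:5.2f}, AIC = {aic:6.1f}")
```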

  20. Comparing models

  21. Collinearity • If two (or more) X’s are linearly related: • they are collinear • the regression problem is indeterminate • e.g. X(3) = 5·X(2) + 16, or X(2) = 4·X(1) + 16·X(4) • If they are nearly linearly related (near collinearity), coefficients and tests are very inaccurate

  22. What to do about collinearity? • Centering (mean = 0) • Scaling (SD =1) • Regression on first few Principal Components • Ridge Regression
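
As a hedged illustration (not from the slides) of detecting and handling near collinearity, the sketch below computes variance inflation factors (a detection tool, not listed on the slide) and then fits a ridge regression on centred and scaled X's; it assumes Python with numpy, statsmodels and scikit-learn, and the data and names are hypothetical.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = 4 * x1 + rng.normal(scale=0.05, size=n)        # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1 + x1 + 0.5 * x3 + rng.normal(size=n)

# Variance inflation factors: large values (say > 10) flag near collinearity
design = np.column_stack([np.ones(n), X])
vifs = [variance_inflation_factor(design, i) for i in range(1, design.shape[1])]
print("VIFs:", np.round(vifs, 1))

# Centre (mean = 0) and scale (SD = 1) the X's, then shrink coefficients with ridge regression
Xs = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=1.0).fit(Xs, y)
print("ridge coefficients:", np.round(ridge.coef_, 2))
```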

  23. Curvilinear (Polynomial) Regression • Y = ß0 + ß1⋅X + ß2⋅X² + ß3⋅X³ + ... + ßk⋅Xᵏ + E • Used to fit fairly complex curves to data • ß’s estimated using least squares • Use sequential partial F-tests, or AIC, to find how many terms to use • k>3 is rare in biology • Better to transform data and use simple linear regression, when possible
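
A minimal sketch (not from the slides) of fitting polynomials of increasing order by least squares and comparing them with AIC, assuming Python with numpy; the data are simulated and the names hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 40)
y = 0.2 + 0.5 * x - 0.03 * x**2 + rng.normal(scale=0.3, size=x.size)  # truly quadratic

for deg in range(1, 5):                              # polynomial order k = 1, 2, 3, 4
    coefs = np.polyfit(x, y, deg)                    # least-squares polynomial fit
    fitted = np.polyval(coefs, x)
    sse = float(np.sum((y - fitted) ** 2))
    n, p = x.size, deg + 1                           # p fitted coefficients (incl. ß0)
    aic = n * np.log(sse / n) + 2 * p
    print(f"degree {deg}: AIC = {aic:.1f}")          # lowest AIC suggests how many terms to keep
```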

  24. Curvilinear (Polynomial) Regression • Y = 0.066 + 0.00727·X • Y = 0.117 + 0.00085·X + 0.00009·X² • Y = 0.201 - 0.01371·X + 0.00061·X² - 0.000005·X³ (From Sokal and Rohlf)

  25. Path Analysis

  26. Path Analysis [path diagram with variables A, B, C, D, E] • Models with causal structure • Represented by path diagram • All variables quantitative • All path relationships assumed linear • (transformations may help)

  27. Path Analysis [path diagram with variables A, B, C, D, E and residual U] • All paths one way: A => C or C => A, not both • No loops • Some variables may not be directly observed: • residual variables (U) • Some variables not observed but known to exist: • latent variables (D)

  28. Path Analysis [path diagram with variables A, B, C, D, E and residual U] • Path coefficients and other statistics calculated using multiple regressions • Variables are: • centered (mean = 0) so no constants in regressions • often standardized (SD = 1) • So: path coefficients usually between -1 and +1 • Paths with coefficients not significantly different from zero may be eliminated
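
A minimal sketch (not the authors' analysis) of computing path coefficients as standardized partial regression coefficients for a small hypothetical diagram (A => C, B => C; C => E, B => E), assuming Python with numpy and simulated data; all names are hypothetical.

```python
import numpy as np

def std_coefs(y, *xs):
    """Standardize all variables (mean 0, SD 1) and return the least-squares coefficients."""
    z = lambda v: (v - v.mean()) / v.std()
    design = np.column_stack([z(x) for x in xs])     # no constant: variables are centred
    beta, *_ = np.linalg.lstsq(design, z(y), rcond=None)
    return beta

rng = np.random.default_rng(8)
n = 200
A = rng.normal(size=n)
B = rng.normal(size=n)
C = 0.6 * A + 0.3 * B + rng.normal(scale=0.7, size=n)    # C caused by A and B (plus residual)
E = 0.5 * C - 0.4 * B + rng.normal(scale=0.8, size=n)    # E caused by C and B (plus residual)

# One multiple regression per endogenous variable; coefficients are the path coefficients
print("paths into C (from A, B):", np.round(std_coefs(C, A, B), 2))
print("paths into E (from C, B):", np.round(std_coefs(E, C, B), 2))
```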

  29. Path Analysis: an example • Isaak and Hubert. 2001. “Production of stream habitat gradients by montane watersheds: hypothesis tests based on spatially explicit path analyses” Can. J. Fish. Aquat. Sci.

  [Path diagram for the example: - - - predicted negative interaction; ____ predicted positive interaction]
