
28. Multiple regression


Presentation Transcript


  1. 28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition

  2. Objectives (PSLS Chapter 28) Multiple regression • The multiple linear regression model • Indicator variables • Two parallel regression lines • Interaction • Inference for multiple linear regression

  3. The multiple linear regression model • In previous chapters we examined a simple linear regression model expressing a response variable y as a linear function of one explanatory variable x. In the population, this model has the form μy = α + βx. • We now examine multiple linear regression models in which the response variable y is a linear combination of k explanatory variables. In the population, this model takes the form μy = β0 + β1x1 + β2x2 + … + βkxk. • The parameters can be estimated from sample data, giving the fitted model ŷ = b0 + b1x1 + b2x2 + … + bkxk.
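
The fitted model above is obtained by least squares. A minimal sketch, assuming NumPy is available; the data, true coefficients, and sample size here are all hypothetical:

```python
import numpy as np

# Hypothetical data: response y and k = 2 explanatory variables x1, x2.
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
# True population model: mu_y = 2.0 + 1.5*x1 - 0.8*x2, with sigma = 1.
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

# Design matrix with a leading column of 1s for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates b = (b0, b1, b2) of (beta0, beta1, beta2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates close to the true coefficients
```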

  4. Assumptions • The mean response μy has a linear relationship with the k explanatory variables taken together. • The y responses are independent of each other. • For any set of fixed values of the k explanatory variables, the response y varies Normally. • The standard deviation σ of y is the same for all values of the explanatory variables. In inference, the value of σ is unknown.

  5. Indicator variables The multiple regression model can accommodate categorical explanatory variables by coding them in binary (0, 1) form. In particular, we can compare individuals from different groups (independent SRSs in an observational study or randomized groups in an experiment) by using an indicator variable. To compare 2 groups, we simply create an indicator variable Ind such that • Ind = 0 for individuals in one group and • Ind = 1 for individuals in the other group
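
The 0/1 coding described above can be sketched in plain Python; the group labels are hypothetical:

```python
# Code a two-group categorical variable as a 0/1 indicator.
groups = ["control", "treated", "treated", "control", "treated"]
ind = [1 if g == "treated" else 0 for g in groups]
print(ind)  # [0, 1, 1, 0, 1]
```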

  6. Two parallel regression lines When plotting the linear regression pattern of y as a function of x for two groups, we sometimes find that the two groups have roughly parallel simple regression lines. In such instances, we can model the data using a single multiple linear regression model with two parallel regression lines, using the quantitative variable x1 and an indicator variable Indx2 for the groups: y = b0 + b1x1 + b2Indx2 • b1 is the slope for both lines • b0 is the intercept for the Indx2 = 0 line • (b0 + b2) is the intercept for the Indx2 = 1 line [Figure: two parallel lines, the Indx2 = 1 line offset vertically by b2 from the Indx2 = 0 line]
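
A numerical sketch of the parallel-lines model, assuming NumPy; the data and true coefficients are invented for illustration. Both groups share the slope b1, and the indicator shifts only the intercept:

```python
import numpy as np

# Hypothetical data: shared slope 1.2, intercept 4.0 for ind = 0
# and 4.0 - 2.5 for ind = 1.
rng = np.random.default_rng(5)
n = 60
x1 = rng.uniform(0, 10, n)
ind = rng.integers(0, 2, n)  # indicator variable x2
y = 4.0 + 1.2 * x1 - 2.5 * ind + rng.normal(0, 1, n)

# Fit y = b0 + b1*x1 + b2*ind (no interaction term).
X = np.column_stack([np.ones(n), x1, ind])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

shared_slope = b1       # same slope for both lines
intercept_0 = b0        # intercept of the ind = 0 line
intercept_1 = b0 + b2   # intercept of the ind = 1 line
```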

  7. Male fruit flies were randomly assigned to either reproduce (IndReprod = 1) or not (IndReprod = 0). Their thorax length and longevity were recorded. Single multiple regression model with an indicator variable for two parallel lines: y = –44.29 + 133.39x1 – 23.55IndReprod. Two separate simple linear regression models give similar slopes, supporting the parallel-lines model.

  8. Interaction When plotting the linear regression pattern of y as a function of x for two groups, we may find two non-parallel simple regression lines. We can model such data with a single multiple linear regression model using a quantitative variable x1, an indicator variable Indx2 for the groups, and an interaction term x1Indx2: y = b0 + b1x1 + b2Indx2 + b3x1Indx2 Each line has its own slope and intercept. • b1 is the slope for the Indx2 = 0 line • (b1 + b3) is the slope for the Indx2 = 1 line • b0 is the intercept for the Indx2 = 0 line • (b0 + b2) is the intercept for the Indx2 = 1 line [Figure: two non-parallel lines for Indx2 = 1 and Indx2 = 0]
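
The interaction model can be sketched by adding a product column x1 × ind to the design matrix. Assuming NumPy, with hypothetical data and coefficients:

```python
import numpy as np

# Hypothetical data: slope 1.0 for group 0, slope 1.0 + 0.5 for group 1.
rng = np.random.default_rng(1)
n = 60
x1 = rng.uniform(0, 10, n)
ind = rng.integers(0, 2, n)  # indicator variable x2
y = 3.0 + 1.0 * x1 - 2.0 * ind + 0.5 * x1 * ind + rng.normal(0, 1, n)

# Design matrix: intercept, x1, indicator, and the interaction column x1*ind.
X = np.column_stack([np.ones(n), x1, ind, x1 * ind])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_group0 = b[1]         # b1: slope of the ind = 0 line
slope_group1 = b[1] + b[3]  # b1 + b3: slope of the ind = 1 line
```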

  9. Note that an interaction term can be computed between any two variables (not just between a quantitative variable and an indicator variable). An interaction effect between the variables x1 and x2 means that the relationship between the mean response μy and the explanatory variable x1 is different for varying values of the explanatory variable x2. When comparing two groups (x2 is an indicator variable), this means that the two regression lines will not be parallel.

  10. A random sample of children was taken and their lung capacity (forced expiratory volume, or FEV) was plotted as a function of their age and sex (IndSex = 0 for female and IndSex = 1 for male).

  11. Using an interaction term to take into account the non-parallel lines, software gives the following multiple regression model: y = 0.6739 + 0.18209x1 – 0.7314IndSex + 0.10613x1IndSex

  12. Inference for multiple regression • We first want to run an overall test. We use an ANOVA F test to test: H0: β1 = 0 and β2 = 0 … and βk = 0 Ha: H0 is not true (at least one coefficient is not equal to 0) • The squared multiple correlation coefficient R² = SSModel/SSTotal is given by the ANOVA output and indicates how much of the variability in the response variable y can be explained by the specific model tested. A higher R² indicates a better model.
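
The ANOVA quantities R² and F can be computed directly from the sums of squares. A sketch assuming NumPy, with hypothetical data and k = 2 explanatory variables:

```python
import numpy as np

# Hypothetical data for a model with k = 2 explanatory variables.
rng = np.random.default_rng(2)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
ss_total = np.sum((y - y.mean()) ** 2)  # total variability in y
ss_resid = np.sum((y - y_hat) ** 2)     # unexplained variability
ss_model = ss_total - ss_resid          # variability explained by the model

r_squared = ss_model / ss_total
# F statistic: MSModel / MSError, with k and n - k - 1 degrees of freedom.
f_stat = (ss_model / k) / (ss_resid / (n - k - 1))
print(r_squared, f_stat)  # R^2 is near 1 here because the noise is small
```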

  13. Estimating the regression coefficients • If the ANOVA is significant, we can run individual t tests on each regression coefficient: H0: βi = 0 in this specific model Ha: βi ≠ 0 in this specific model using t = bi/SEbi, which follows the t distribution with n – k – 1 degrees of freedom when H0 is true. • We can also compute individual level-C confidence intervals for each of the k regression coefficients in the specific model: bi ± t* SEbi, where t* is the critical value for a t distribution with n – k – 1 degrees of freedom.
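
The standard errors SEbi come from the diagonal of s²(XᵀX)⁻¹, which gives the t statistics and confidence intervals above. A sketch assuming NumPy; the data are hypothetical, and t* ≈ 2.026 is the approximate 95% critical value for a t distribution with 37 degrees of freedom:

```python
import numpy as np

# Hypothetical data, n = 40 observations and k = 2 explanatory variables.
rng = np.random.default_rng(3)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
df = n - k - 1
s2 = resid @ resid / df  # estimate of sigma^2

# Standard error of each coefficient: sqrt of diag of s^2 * (X'X)^-1.
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_stats = b / se  # t = b_i / SE_{b_i}, df = n - k - 1 under H0
t_star = 2.026    # approximate 95% critical value for t(37)
ci_low, ci_high = b - t_star * se, b + t_star * se
```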

  14. The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R² = 0.81, so this is a very good model that explains 81% of the variation in longevity of male fruit flies in the lab. • The individual t tests are all significant, indicating that in this model, the regression coefficients are significantly different from zero. The confidence intervals give a range of likely values for these parameters. • Because this is a model with 2 parallel lines, we can conclude that reproducing male fruit flies live between 19 and 28 days less on average than those that do not reproduce, when thorax length is taken into account. [SPSS output shown on slide]

  15. The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R² = 0.67, so this is a good model that explains 67% of the variation in FEV in children. • The individual t tests are all significant, indicating that in this model, the regression coefficients are significantly different from zero. • Because this is a model with a significant interaction effect, we conclude that both age and sex influence FEV in children, but that the effect of age on FEV is different for males and for females. The scatterplot indicates that the effect of age is more pronounced for males.

  16. Checking the conditions for inference • The best way to check the conditions for inference is by examining graphically the scatterplot(s) of y as a function of each xi, and the residuals (y – ŷ) from the multiple regression model. • Look for: • Linear trends in the scatterplot(s) • Normality of the residuals (histogram of residuals) • Constant σ for all combinations of the xi (residual plot with no particular pattern and approximately equal vertical spread) • Independence of observations (check the study design or a plot of the residuals sorted by order of data acquisition)
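
A crude numerical counterpart of these graphical checks, assuming NumPy (plots via matplotlib would be the usual approach); the data are hypothetical:

```python
import numpy as np

# Hypothetical data satisfying the model conditions.
rng = np.random.default_rng(4)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([2.0, 1.5, -0.8]) + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
resid = y - y_hat  # residuals y - y_hat

# Least-squares residuals average to (numerically) zero; as a crude
# check of constant sigma, compare the residual spread in the lower
# and upper halves of the fitted values.
order = np.argsort(y_hat)
lower, upper = resid[order[: n // 2]], resid[order[n // 2 :]]
print(resid.mean(), lower.std(), upper.std())
```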
