
MBP1010H – Lecture 4: March 26,2012



  1. MBP1010H – Lecture 4: March 26, 2012 • 1. Multiple regression • 2. Survival analysis • Multifactorial Analyses – chapter posted in Resources • Reading: Introduction to the Practice of Statistics: Chapters 2, 10 and 11

  2. Simple Linear Regression • to assess the linear relationship between 2 variables • to predict the response (y) based on a change in x

  3. Multiple Linear Regression • explore relationships among multiple variables to find out which x variables are associated with the response (y) • devise an equation to predict y from several x variables • adjust for potential confounding (lurking) variables – the effect of one particular x variable after adjusting for differences in the other x variables

  4. Confounding / Causation [Diagram: a lurking variable (z) can produce an association between x and y that is not causation.]

  5. Simple Linear Regression Model: yi = β0 + β1xi + εi (observed y = intercept + slope × x + residual) DATA = FIT + RESIDUALS, where the εi are independent and normally distributed N(0, σ).
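
A minimal sketch of simulating and fitting this model by least squares (the data and parameter values below are illustrative, not from the lecture):

```python
import numpy as np

# Simulate data from y_i = beta0 + beta1*x_i + eps_i, with eps_i ~ N(0, sigma)
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)  # true intercept 2.0, slope 0.5

# Least-squares fit; np.polyfit returns coefficients from highest degree down
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```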

  6. Multiple Regression Statistical model for n sample data points (i = 1, 2, …, n) and p explanatory variables: Data = fit + residual yi = (β0 + β1x1i + … + βpxpi) + εi, where the εi are independent and normally distributed N(0, σ).
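
As an illustration of fitting such a model, here is a sketch using statsmodels on simulated data (the variable names and values are invented for the example):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: n observations, p = 2 explanatory variables
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))                                    # columns x1, x2
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 1, n)

X_design = sm.add_constant(X)   # prepend a column of 1s for beta_0
fit = sm.OLS(y, X_design).fit()
print(fit.params)               # estimates b0, b1, b2
print(fit.summary())            # includes the F test and R-squared
```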

  7. Analysis of Variance (ANOVA) table for linear regression: SS Total = SS Model + SS Error (SS = sum of squares) Data = fit + residual: yi = (β0 + β1x1i + … + βpxpi) + εi

  8. ANOVA Table (p = number of explanatory variables)

  Source   Sum of squares   DF          Mean square (MS)   F         P-value
  Model    SSM              p           MSM = SSM/DFM      MSM/MSE   tail area above F
  Error    SSE              n − p − 1   MSE = SSE/DFE
  Total    SST              n − 1       SST/DFT
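
The table's quantities can be computed directly from any fit; a sketch (y, y_hat, and p are placeholders for a fitted model's observed values, predictions, and predictor count):

```python
import numpy as np

def anova_quantities(y, y_hat, p):
    """SS decomposition and F statistic for a regression with p predictors."""
    n = len(y)
    ss_total = np.sum((y - y.mean()) ** 2)  # SST, DF = n - 1
    ss_error = np.sum((y - y_hat) ** 2)     # SSE, DF = n - p - 1
    ss_model = ss_total - ss_error          # SSM, DF = p
    msm = ss_model / p
    mse = ss_error / (n - p - 1)
    return ss_model, ss_error, ss_total, msm / mse  # last value is F
```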

  9. In the sample: ŷi = b0 + b1x1i + … + bpxpi (note: b1 or β̂1 can be used for sample estimates in the notation) - the least-squares regression method minimizes the sum of squared deviations ei (= yi – ŷi) to express y as a linear function of the p explanatory variables - the regression coefficients (b1, …, bp) reflect the unique association of each independent variable with the y variable - analogous to the slope in simple regression.
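
A sketch of the least-squares computation itself, using np.linalg.lstsq on an invented design matrix (all numbers are illustrative):

```python
import numpy as np

# X is the n x (p+1) design matrix with a leading column of 1s for b0.
# lstsq minimizes the sum of squared deviations e_i = y_i - yhat_i.
rng = np.random.default_rng(2)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.5, -0.2, 0.8])
y = X @ beta_true + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)  # b = (b0, b1, ..., bp)
e = y - X @ b                              # residuals
print(b, (e ** 2).sum())
```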

  10. Case Study of Multiple Regression Goal: to predict success in early university years. Measure of Success: GPA after 3 semesters

  11. What factors are associated with GPA during first year of college? Data on 224 first-year computer science majors at a large university in a given year. The data for each student include: * Cumulative GPA (y, response variable) * Average high school grade in math (HSM, x1, explanatory variable) * Average high school grade in science (HSS, x2, explanatory variable) * Average high school grade in English (HSE, x3, explanatory variable) * SAT math score (SATM, x4, explanatory variable) * SAT verbal score (SATV, x5, explanatory variable)

  12. Summary statistics for the data (SAS output; table not reproduced in this transcript)

  13. Univariate Associations between Variables - plots of the associations should be examined to check linearity and identify outliers

  14. ANOVA table for model with HSM, HSS and HSE F test - highly significant → at least one of the regression coefficients is significantly different from zero. R²: HSM, HSS and HSE explain 20% of the variation in GPA.

  15. ANOVA F-test for multiple regression H0: β1 = β2 = … = βp = 0 versus Ha: at least one βj ≠ 0 F statistic: F = MSM / MSE A significant p-value means that at least one explanatory variable has a significant influence on y.
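
The tail area above F comes from the F distribution with (p, n − p − 1) degrees of freedom; a sketch with invented values (f_stat, p, and n are not from the lecture data):

```python
from scipy import stats

f_stat, p, n = 12.5, 3, 224
p_value = stats.f.sf(f_stat, dfn=p, dfd=n - p - 1)  # sf = upper-tail area
print(p_value)
```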

  16. ANOVA table for model with HSM, HSS and HSE F test - highly significant → at least one of the regression coefficients is significantly different from zero. R²: HSM, HSS and HSE explain about 20% of the variation in GPA.

  17. R-Square and Adjusted R-Square • adjusted R-square is equal to or smaller than the regular R-square • adjusts for a bias in R-square: the regular R-square tends to be an overestimate, especially with many predictors and a small sample size • statisticians and researchers differ on whether to use the adjusted R-square • adjusted R-square is not often used or reported
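
The adjustment is a simple function of R-square, the sample size n, and the number of predictors p; a sketch with illustrative numbers:

```python
def adjusted_r2(r2, n, p):
    # R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative: R2 = 0.20 with n = 224 students and p = 3 predictors
print(adjusted_r2(0.20, 224, 3))  # slightly below 0.20
```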

  18. Multiple linear regression using HS grade averages: When all 3 high school averages are used together in the multiple regression analysis, only HSM contributes significantly to our ability to predict GPA.

  19. Drop the least significant variable from the previous model: HSS. Conclusions are about the same, but the actual regression coefficients have changed.

  20. SATM and SATV

  21. Multiple linear regression with the two SAT scores only. ANOVA test very significant → at least one slope is not zero. R² is very small (0.06) → only 6% of the variation in GPA is explained by these tests.

  22. Multiple regression model with all the variables together: P-value very significant; R² fairly small (21%); HSM significant. The overall test is significant, but only the average high school math grade (HSM) makes a significant contribution in this model to predicting the cumulative GPA.

  23. Next Steps: - refine the model: drop non-significant variables - check residuals (see the sketch below): - histogram or Q-Q plot of residuals - plot residuals against predicted GPA - plot residuals against explanatory variables
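
A sketch of these residual checks with matplotlib and scipy, on invented residuals (substitute the residuals and predictions from an actual fit):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
y_hat = rng.uniform(1, 4, 100)   # predicted GPA (illustrative)
resid = rng.normal(0, 0.5, 100)  # residuals (illustrative)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(resid, bins=15)             # roughly bell-shaped?
axes[0].set_title("Histogram of residuals")
stats.probplot(resid, plot=axes[1])      # points near the line?
axes[1].set_title("Q-Q plot")
axes[2].scatter(y_hat, resid)            # random scatter around 0?
axes[2].axhline(0, color="grey")
axes[2].set_title("Residuals vs predicted")
plt.tight_layout()
plt.show()
```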

  24. Assumptions for Linear Regression • The relationship between x and y is linear. • Equal variance of y for all values of x. • Residuals are approximately normally distributed. • The observations are independent.

  25. [Residual plots] Residuals randomly scattered → good! Curved pattern → the relationship is not linear (transform). Change in variability across the plot → variance not equal for all values of x (transform y).

  26. Do x and y need to have normal distributions? • Regression: the distribution of y (probably) doesn't matter; the distribution of x doesn't matter • BUT: check for errors/outliers – they could be influential • In practice, most analysts prefer y to be reasonably normal • Residuals from the model should be normally distributed (see the sketch below)
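
One way to check that last point is a formal normality test on the residuals; a minimal sketch with an invented residual vector:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid = rng.normal(0, 0.5, 100)       # residuals from a fit (illustrative)
stat, p_value = stats.shapiro(resid)  # Shapiro-Wilk test of normality
print(p_value)  # a large p-value gives no evidence against normality
```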
