MBP1010H – Lecture 4: March 26, 2012
• 1. Multiple regression
• 2. Survival analysis
Multifactorial Analyses – chapter posted in Resources
Reading: Introduction to the Practice of Statistics: Chapters 2, 10 and 11
Simple Linear Regression
• to assess the linear relationship between 2 variables
• to predict the response (y) based on a change in x
Multiple Linear Regression
• explore relationships among multiple variables to find out which x variables are associated with the response (y)
• devise an equation to predict y from several x variables
• adjust for potential confounding (lurking) variables: the effect of one particular x variable after adjusting for differences in the other x variables
Confounding / Causation
[Diagram: a lurking variable (z) associated with both x and y can produce an association between x and y without causation]
Simple Linear Regression Model
yi = β0 + β1xi + εi
(observed y = intercept + slope × x + residual)
DATA = FIT + RESIDUALS, where the εi are independent and normally distributed N(0, σ).
Multiple Regression
Statistical model for n sample data points (i = 1, 2, …, n) and p explanatory variables:
Data = fit + residual
yi = (β0 + β1x1i + … + βpxpi) + εi
where the εi are independent and normally distributed N(0, σ).
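To make the model concrete, here is a minimal sketch in Python with statsmodels (the lecture's output comes from SAS, so this is illustrative only); the simulated data, coefficient values and variable names are assumptions, not the course data.

```python
# Minimal sketch: simulate data from yi = b0 + b1*x1i + b2*x2i + eps_i
# and recover the coefficients by ordinary least squares.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 224                               # sample size, matching the case study
x1 = rng.normal(8, 1.5, n)            # explanatory variable 1 (assumed scale)
x2 = rng.normal(8, 1.5, n)            # explanatory variable 2 (assumed scale)
y = 0.5 + 0.2 * x1 + 0.1 * x2 + rng.normal(0, 0.7, n)  # eps_i ~ N(0, 0.7)

X = sm.add_constant(np.column_stack([x1, x2]))  # add the intercept column
fit = sm.OLS(y, X).fit()
print(fit.params)     # b0, b1, b2: least-squares estimates of the betas
print(fit.summary())  # coefficients, t tests, overall F test, R-square
```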
Analysis of Variance (ANOVA) table for linear regression
SS Total = SS Model + SS Error (SS = sum of squares)
Data = fit + residual: yi = (β0 + β1x1i + … + βpxpi) + εi
ANOVA Table (p = number of explanatory variables)

Source   Sum of squares   DF          Mean square (MS)   F         P-value
Model    SSM              p           SSM/DFM            MSM/MSE   tail area above F
Error    SSE              n − p − 1   SSE/DFE
Total    SST              n − 1       SST/DFT
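The table's quantities can be recovered from the fit in the sketch above; statsmodels exposes the sums of squares directly, and the hand computations below mirror the formulas in the table (fit, n and the value of p are carried over from that sketch).

```python
# Recovering the ANOVA-table quantities from the earlier fit.
p = 2                      # number of explanatory variables in the sketch
ssm = fit.ess              # SS Model (explained sum of squares)
sse = fit.ssr              # SS Error (residual sum of squares)
sst = fit.centered_tss     # SS Total; note sst == ssm + sse
msm = ssm / p              # MS Model = SSM / DFM, with DFM = p
mse = sse / (n - p - 1)    # MS Error = SSE / DFE, with DFE = n - p - 1
F = msm / mse              # F statistic
print(F, fit.fvalue)       # the two values should agree
```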
Note: b1 or β̂1 can be used to denote the sample estimate in notation.
In the sample: ŷi = b0 + b1x1i + … + bpxpi
- the least-squares regression method minimizes the sum of squared deviations ei (= yi − ŷi) to express y as a linear function of the p explanatory variables
- the regression coefficients (b1, …, bp) reflect the unique association of each independent variable with the y variable
- analogous to the slope in simple regression
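The least-squares minimization has a closed-form solution; as a sketch, numpy can solve it directly from the design matrix X and response y of the simulation above, and the result matches the OLS fit.

```python
# Least squares by direct solve: b minimizes sum((yi - yhat_i)**2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)              # matches fit.params from the statsmodels fit above
```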
Case Study of Multiple Regression Goal: to predict success in early university years. Measure of Success: GPA after 3 semesters
What factors are associated with GPA during first year of college? Data on 224 first-year computer science majors at a large university in a given year. The data for each student include: * Cumulative GPA (y, response variable) * Average high school grade in math (HSM, x1, explanatory variable) * Average high school grade in science (HSS, x2, explanatory variable) * Average high school grade in English (HSE, x3, explanatory variable) * SAT math score (SATM, x4, explanatory variable) * SAT verbal score (SATV, x5, explanatory variable)
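If the student data were loaded into Python, the full model could be fit with a formula interface as sketched below; the file name and lower-case column names are assumptions (the lecture's actual output comes from SAS).

```python
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("gpa_data.csv")   # hypothetical file for the 224 students
result = smf.ols("gpa ~ hsm + hss + hse + satm + satv", data=students).fit()
print(result.summary())   # coefficients, t tests, overall F test, R-square
```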
Summary statistics for the data (from SAS software)
Univariate Associations between Variables
- plot the pairwise associations to check linearity and look for outliers
ANOVA F-test for multiple regression
H0: β1 = β2 = … = βp = 0 versus Ha: at least one βj ≠ 0
F statistic: F = MSM / MSE
A significant p-value means that at least one explanatory variable has a significant influence on y.
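As a sketch of the test by hand, scipy's F distribution gives the tail area above the observed F; msm, mse, p and n are carried over from the ANOVA sketch above.

```python
from scipy import stats

F = msm / mse                        # F = MSM / MSE
pval = stats.f.sf(F, p, n - p - 1)   # tail area above F with (p, n-p-1) df
print(F, pval)
```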
ANOVA table for model with HSM, HSS and HSE
F test highly significant: at least one of the regression coefficients is significantly different from zero.
R2: HSM, HSS and HSE explain about 20% of the variation in GPA.
R Square and Adjusted R Square
• adjusted R-square is equal to or smaller than the regular R-square
• it adjusts for a bias in R-square: the regular R-square tends to be an overestimate, especially with many predictors and a small sample size
• statisticians and researchers differ on whether to use the adjusted R-square
• in practice, adjusted R-square is not often used or reported
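The adjustment is a simple formula; a sketch using the earlier fit (statsmodels reports the same value as rsquared_adj):

```python
r2 = fit.rsquared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes for p predictors
print(adj_r2, fit.rsquared_adj)                  # the two should agree
```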
Multiple linear regression using HS grade averages: When all 3 high school averages are used together in the multiple regression analysis, only HSM contributes significantly to our ability to predict GPA.
Drop the least significant variable from the previous model: HSS. Conclusions are about the same - but actual regression coefficients have changed.
Multiple linear regression with the two SAT scores only.
ANOVA test very significant: at least one slope is not zero.
R2 is very small (0.06): only 6% of the variation in GPA is explained by these tests.
Multiple regression model with all the variables together
P-value very significant; R2 fairly small (21%); HSM significant.
The overall test is significant, but only the average high school math grade (HSM) makes a significant contribution in this model to predicting the cumulative GPA.
Next Steps:
- refine the model: drop non-significant variables
- check residuals (see the sketch below):
  - histogram or Q-Q plot of residuals
  - plot residuals against predicted GPA
  - plot residuals against explanatory variables
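A sketch of the residual checks just listed, using matplotlib and scipy; `result` is the fitted model from the earlier formula-interface sketch.

```python
import matplotlib.pyplot as plt
from scipy import stats

resid = result.resid
fitted = result.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(resid, bins=20)                       # histogram of residuals
axes[0].set_title("Residual histogram")
stats.probplot(resid, dist="norm", plot=axes[1])   # normal Q-Q plot
axes[1].set_title("Q-Q plot")
axes[2].scatter(fitted, resid)                     # residuals vs predicted GPA
axes[2].axhline(0, linestyle="--")
axes[2].set_title("Residuals vs fitted")
plt.tight_layout()
plt.show()
```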
Assumptions for Linear Regression
• The relationship between x and y is linear.
• Equal variance of y for all values of x.
• Residuals are approximately normally distributed.
• The observations are independent.
Interpreting residual plots:
• Residuals randomly scattered → good!
• Curved pattern → the relationship is not linear (transform).
• Change in variability across the plot → variance not equal for all values of x (transform y).
Do x and y need to have normal distributions?
• Regression:
  - y (probably) doesn't matter
  - x doesn't matter
• BUT: check for errors/outliers – they could be influential
• In practice, most analysts prefer y to be reasonably normal
• Residuals from the model should be normally distributed
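For a quick numeric check of residual normality to complement the Q-Q plot, a Shapiro–Wilk test is one option (a sketch only; with large samples formal tests flag trivial departures, so the plot is usually preferred).

```python
from scipy.stats import shapiro

stat, pval = shapiro(result.resid)   # small p-value suggests non-normal residuals
print(stat, pval)
```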