Lab 4 Multiple Linear Regression
Meaning • An extension of simple linear regression • It models the mean of a response variable as a linear function of several explanatory variables
Ways of analysis • Matrix of scatterplots • Matrix of correlations • Regression: fit the model (variable selection); interpret the model, t-test & F-test in regression; prediction; diagnostics (linearity, constant variance, normality, independence, outliers).
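A minimal Python sketch of the first two ways of analysis (the lab itself works through SPSS menus; pandas/matplotlib and the file name preemie_iq.csv are assumptions made only for illustration):

```python
# Sketch only: assumes the data sit in a hypothetical file preemie_iq.csv
# with the columns listed on the next slide.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("preemie_iq.csv")

cols = ["iq", "MILK", "FEM", "WEEKS", "SOCIAL", "RANK", "EDUC"]

# Matrix of scatterplots
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10))
plt.show()

# Matrix of correlations
print(df[cols].corr().round(2))
```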
The independent variables and the response • The response: iq • The independent variables: • MILK: 0=no breast milk, 1=yes • FEM: 0=male kid, 1=female • WEEKS: weeks in ventilation • SOCIAL: mum’s social class • 1,2,3,4 with 1 being the highest • RANK: birth order of the kid • EDUC: mum’s education level • 1,2,3,4,5 with 5 being the highest
Regression-fit the model • Procedure • Analyze → Regression → Linear • Methods of determining independent variables
Methods (details in instruction 4 P18) • Enter: The model is obtained with all specified variables. This is the default method. • Stepwise: Variables are entered one at a time if they meet the entry criterion, and variables already in the model are removed at each step if they no longer meet the criterion to stay. • Remove: All variables in a block are removed from the model in a single step. • Backward: The variables are removed from the model one by one if they meet the criterion for removal (a maximum significance level or a minimum F value). • Forward: The variables are entered into the model one by one if they meet the criterion for entry.
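A hedged Python sketch of the "Enter" method, i.e. fitting the model with all specified variables at once (statsmodels is an assumption; in the lab this is done through the SPSS dialog above, and Ln(WEEKS) is used as requested in Question 5):

```python
# Sketch only: the lab fits this model via the SPSS "Enter" method.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("preemie_iq.csv")        # hypothetical file name
df["LNWEEKS"] = np.log(df["WEEKS"])       # Ln(WEEKS), as required from Question 5

model = smf.ols("iq ~ MILK + FEM + LNWEEKS + SOCIAL + RANK + EDUC", data=df)
results = model.fit()
print(results.summary())                  # coefficients, R^2, ANOVA F-test, ...
```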
Regression-interpret model • Interpretation of the output 1. variables entered/removed 2. model summaries (R, R^2) 3. ANOVA test (F-test)
Note on F-test • Tests the overall significance of the model • Its null distribution: the F-distribution • Also used to construct the extra-sum-of-squares F-test (comparing a full and a reduced model)
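Continuing the sketch, the model summaries and the overall F-test can be read off the fitted results object (the attribute names below are statsmodels', used here only to mirror the SPSS output):

```python
# Continuing from the fitted `results` object above.
print("Multiple R:", round(results.rsquared ** 0.5, 3))
print("R-squared: ", round(results.rsquared, 3))

# Overall F-test: H0 says all slope coefficients are zero;
# under H0 the statistic follows an F-distribution.
print("F-statistic:", round(results.fvalue, 3), "  p-value:", round(results.f_pvalue, 4))
```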
4. Coefficients (estimation, t-test, CI of coefficients) • t-test in the i-th row (for the i-th coefficient) • 95% CI of each coefficient
Note on t-test and CI of coefficients • t-test • tests the significance of a single independent variable • can be one-sided • its null distribution: the t-distribution • 95% CI of coefficients • estimates the range of the coefficient with 95% confidence • i.e., with 95% confidence, the range of the change in the mean of Y for a one-unit increase in the corresponding X, holding the other variables fixed
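A sketch of the coefficient table with t statistics, p-values and 95% CIs, continuing from the fitted results above:

```python
# Continuing from the fitted `results` object above.
import pandas as pd

coef_table = pd.DataFrame({
    "estimate": results.params,     # estimated coefficients
    "t": results.tvalues,           # t statistic for each coefficient
    "p": results.pvalues,           # two-sided p-value (halve it for a one-sided test)
})
ci = results.conf_int(alpha=0.05)   # 95% CI of each coefficient
ci.columns = ["ci_low", "ci_high"]
print(pd.concat([coef_table, ci], axis=1).round(3))
```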
Regression-prediction • Point estimation • Confidence interval of the mean (CI) • Prediction interval of one observation (PI) • e.g. see the sketch below
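A sketch of the point estimate together with the 95% CI of the mean response and the 95% PI of a single new observation (the values for the new child are made up purely for illustration):

```python
# Continuing from the fitted `results` object above.
import numpy as np
import pandas as pd

# A hypothetical new child (all values made up for illustration only)
new_obs = pd.DataFrame({"MILK": [1], "FEM": [0], "LNWEEKS": [np.log(4)],
                        "SOCIAL": [2], "RANK": [1], "EDUC": [3]})

pred = results.get_prediction(new_obs)
print(pred.summary_frame(alpha=0.05))
# mean                 -> point estimate
# mean_ci_lower/upper  -> 95% CI of the mean response
# obs_ci_lower/upper   -> 95% PI of a single new observation
```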
Multiple Regression-Diagnostics • Obtain plots to check the validity of the assumptions • Linearity: residuals vs predicted values (Y) / explanatory variables (X) • Constant variance: residuals vs predicted values (Y) / explanatory variables (X) • Normality: QQ plot of residuals • Independence: residuals vs the time order of the observations • Outliers and influential observations: see the next slides
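A sketch of the diagnostic plots (residuals vs fitted values for linearity and constant variance, a QQ plot of the residuals for normality, and residuals in observation order for independence):

```python
# Continuing from the fitted `results` object above.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fitted = results.fittedvalues
resid = results.resid

# Linearity and constant variance: residuals vs fitted values
plt.scatter(fitted, resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normality: QQ plot of the residuals
sm.qqplot(resid, line="45", fit=True)
plt.show()

# Independence: residuals in the order the observations appear in the data
plt.plot(resid.values, marker="o")
plt.xlabel("Observation order")
plt.ylabel("Residuals")
plt.show()
```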
What is an influential observation? • An observation is influential if removing it markedly changes the estimated coefficients of the regression model. • An outlier may be an influential observation.
To identify outliers and/or influential observations • Studentized Residuals A case may be considered an outlier if the absolute value of its studentized residual exceeds 2. • Leverage Values If the leverage of an observation is larger than 2p/n (p = number of regression coefficients, n = number of observations), the observation has a high potential for influence. • Cook’s Distances If Cook’s distance is close to or larger than 1, the case may be considered influential.
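The three measures above can be computed in one pass; a sketch using statsmodels' OLSInfluence with the cut-offs from this slide (|studentized residual| > 2, leverage > 2p/n, Cook's distance ≥ 1):

```python
# Continuing from the fitted `results` object (and the DataFrame `df`) above.
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(results)
student_resid = infl.resid_studentized_external   # studentized (deleted) residuals
leverage = infl.hat_matrix_diag                   # leverage values h_ii
cooks_d = infl.cooks_distance[0]                  # Cook's distances

n = int(results.nobs)
p = int(results.df_model) + 1                     # number of coefficients, incl. intercept

flag = (abs(student_resid) > 2) | (leverage > 2 * p / n) | (cooks_d >= 1)
print("Cases worth a closer look:", list(df.index[flag]))
```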
Miscellanies • Multicollinearity • it is a concern if the correlation between two independent variables is close to or above 0.85 in absolute value • Remember to use Ln(WEEKS) from Question 5
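A sketch of a quick multicollinearity check that flags any pair of explanatory variables whose correlation is close to or above 0.85 in absolute value:

```python
# Continuing from the DataFrame `df` above (LNWEEKS = Ln(WEEKS)).
predictors = ["MILK", "FEM", "LNWEEKS", "SOCIAL", "RANK", "EDUC"]
corr = df[predictors].corr()

# Flag any pair of explanatory variables with |r| >= 0.85
for i, a in enumerate(predictors):
    for b in predictors[i + 1:]:
        if abs(corr.loc[a, b]) >= 0.85:
            print(f"Possible multicollinearity: {a} and {b} (r = {corr.loc[a, b]:.2f})")
```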
Miscellanies • Understand the meaning of the 95% CI of the coefficients • Identify the “full model” and the “reduced model” when carrying out the extra-sum-of-squares F-test
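A sketch of the extra-sum-of-squares F-test: fit the full and the reduced model and compare them (which variables are dropped in the reduced model depends on the hypothesis being tested; the choice below is only an example):

```python
# Continuing from the DataFrame `df` above.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Full model: all explanatory variables
full = smf.ols("iq ~ MILK + FEM + LNWEEKS + SOCIAL + RANK + EDUC", data=df).fit()
# Reduced model: drop the variables whose joint contribution is being tested
reduced = smf.ols("iq ~ MILK + FEM + LNWEEKS", data=df).fit()

# Extra-sum-of-squares F-test: does the full model fit significantly better?
print(anova_lm(reduced, full))
```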