Multiple Linear Regression: Regression Diagnostics
Find Scores That • Contribute to violation of assumptions. • Are suspect because they are far removed from the centroid (multidimensional mean). • Have undue influence on the solution.
Outliers Among the Predictors • Leverage, h_i, or Hat Diagonal • The larger this statistic, the greater the distance between the data point and the centroid in the space of the predictors. • Investigate cases with h_i greater than 2p/N. • p is the number of parameters in the model, including the intercept.
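A minimal sketch of the leverage check, assuming NumPy and the cutoff 2p/N with p counting the intercept (the form that matches the 2(3)/11 used in the worked example). The function name `leverage` and the predictor values are made up for illustration; they are not the data from these slides.

```python
import numpy as np

def leverage(X):
    """Hat diagonal h_i for a design matrix X that already holds the intercept column."""
    # H = X (X'X)^-1 X'; only its diagonal is needed, so compute x_i' (X'X)^-1 x_i per row.
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Hypothetical predictor values (11 cases, 2 predictors), for illustration only.
x1 = np.array([10., 20., 30., 40., 50., 60., 95., 35., 45., 55., 25.])
x2 = np.array([2., 3., 1., 4., 2., 5., 3., 2., 4., 1., 3.])
X = np.column_stack([np.ones_like(x1), x1, x2])

n, p = X.shape                 # p = 3 parameters, intercept included
h = leverage(X)
cutoff = 2 * p / n             # 2(3)/11, about .55
flagged = np.flatnonzero(h > cutoff) + 1   # 1-based case numbers above the cutoff
print("cutoff:", round(cutoff, 3))
print("leverage:", np.round(h, 3))
print("cases to investigate:", flagged)
```

The hat diagonals always sum to p, which is a quick sanity check on the computation.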
Distance from the Regression Surface • Standardized Residual (aka Studentized Residual) • Difference between actual Y and predicted Y, divided by an appropriate standard error • Rstudent (aka Studentized Deleted Residual) – the same, except that for each case the regression surface is the one obtained when that case is removed. • Investigate if the absolute value is greater than 2.
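The two residuals can be sketched as follows, assuming NumPy and the usual closed-form expressions (no refitting per case). The function name `studentized_residuals` and the data are hypothetical and chosen only so that the last case sits far from the others; they are not the data from these slides.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (standardized residual, Rstudent) for each case.

    X must already include the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                                   # raw residuals
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    sse = e @ e
    s2 = sse / (n - p)                                 # error variance, all cases
    r = e / np.sqrt(s2 * (1 - h))                      # standardized residual
    # Error variance with case i left out, obtained without refitting.
    s2_deleted = (sse - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1 - h))              # Rstudent
    return r, t

# Hypothetical data for illustration only.
x1 = np.array([10., 20., 30., 40., 50., 60., 95., 35., 45., 55., 25.])
x2 = np.array([2., 3., 1., 4., 2., 5., 3., 2., 4., 1., 3.])
y = np.array([80., 95., 70., 120., 100., 150., 160., 90., 130., 85., 300.])
X = np.column_stack([np.ones_like(x1), x1, x2])

r, t = studentized_residuals(X, y)
for case, (ri, ti) in enumerate(zip(r, t), start=1):
    note = "  <- investigate" if abs(ti) > 2 else ""
    print(f"case {case:2d}: standardized {ri:6.2f}  Rstudent {ti:6.2f}{note}")
```

Rstudent exceeds the standardized residual in absolute value whenever the standardized residual itself exceeds 1 in absolute value, so a large gap between the two points to a case that pulls the fitted surface toward itself.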
Influence on the Solution • Cook’s D – how much the regression surface would change if this case were removed • Investigate cases with D > 1. • Dfbetas – how much one parameter (slope or intercept) would change if this case were removed • Investigate cases with absolute values > 2.
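A sketch of both influence statistics, assuming NumPy and the standard closed forms: Cook's D from the residual and leverage, and Dfbetas as the change in each coefficient when a case is dropped, scaled by that coefficient's standard error computed without the case. The function name `influence` and the data are hypothetical, not taken from these slides.

```python
import numpy as np

def influence(X, y):
    """Return (Cook's D, Dfbetas) for a design matrix X that includes the intercept."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    sse = e @ e
    s2 = sse / (n - p)
    # Cook's D: shift in all fitted values when case i is removed.
    cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)
    # Row i holds b - b_(i), the change in every coefficient when case i is removed.
    delta = (X @ XtX_inv) * (e / (1 - h))[:, None]
    s_deleted = np.sqrt((sse - e**2 / (1 - h)) / (n - p - 1))
    dfbetas = delta / (s_deleted[:, None] * np.sqrt(np.diag(XtX_inv)))
    return cooks_d, dfbetas

# Hypothetical data for illustration only.
x1 = np.array([10., 20., 30., 40., 50., 60., 95., 35., 45., 55., 25.])
x2 = np.array([2., 3., 1., 4., 2., 5., 3., 2., 4., 1., 3.])
y = np.array([80., 95., 70., 120., 100., 150., 160., 90., 130., 85., 300.])
X = np.column_stack([np.ones_like(x1), x1, x2])

cooks_d, dfbetas = influence(X, y)
print("Cook's D:", np.round(cooks_d, 2))
print("Dfbetas (columns: intercept, X1, X2):")
print(np.round(dfbetas, 2))
```

Each row of `dfbetas` belongs to one case and each column to one parameter, so a single large entry identifies both the case and the coefficient it distorts.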
Simple Example • Y = sperm count • X1 = % time recently spent with mate • X2 = time since last ejaculation
Leverage • Investigate cases with values greater than 2(3)/11 = .55. • Case 7 is close to this cutoff. • It is a univariate outlier on the time together variable. • Further investigation indicates the case is valid, so we retain it.
Residuals • Case 11 has large residuals and should be investigated. • Notice that Rstudent is much larger than the standardized residual. • This indicates that removing this case has a large effect on the solution.
Influence • Case 11 has a high value of Cook’s D. • It has a high Dfbeta for the time since last ejaculation predictor, even after I transformed that variable to reduce skewness. • Upon investigation, it was found that this subject did not follow the instructions for gathering the data. His scores were deleted.
Plots of Residuals • These can also be useful, but • It takes some practice to get good at detecting problems from such plots • Plot the residual versus predicted Y