Regression Diagnostics
SRM 625 Applied Multiple Regression, Hutchinson
Prior to interpreting your regression results, you should examine your data for potential problems that could affect your findings, using various diagnostic techniques.
Types of possible problems • Assumption violations • Outliers and influential cases • Multicollinearity
Regression Assumptions • Error-free measurement • Correct model specification • Assumptions about residuals
Assumption that variables are measured without error • Presence of measurement error in Y leads to an increase in the standard error of estimate • If the standard error of estimate is inflated, what happens to the F test for R²? • (hint: think about the relationship between the standard error and mean square error)
In a bivariate regression, measurement error in X always leads to underestimation of the regression coefficient • What are the implications of this for interpreting results regarding X?
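A minimal simulation sketch of both effects (Python with numpy and statsmodels; the slides themselves reference SPSS, and all data here are fabricated): error added to Y inflates the standard error of estimate, while error added to X attenuates the slope toward zero.

```python
# Minimal simulation sketch (fabricated data): two effects of measurement error.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(625)
n = 1000
x_true = rng.normal(0, 1, n)
y = 2.0 + 0.5 * x_true + rng.normal(0, 1, n)    # true slope = 0.5

# (1) Error in Y inflates the standard error of estimate (root mean square error).
y_noisy = y + rng.normal(0, 1, n)
fit_clean = sm.OLS(y, sm.add_constant(x_true)).fit()
fit_noisy_y = sm.OLS(y_noisy, sm.add_constant(x_true)).fit()
print(np.sqrt(fit_clean.mse_resid), np.sqrt(fit_noisy_y.mse_resid))  # ~1.0 vs ~1.4

# (2) Error in X attenuates the slope toward zero (reliability here ~ .5).
x_noisy = x_true + rng.normal(0, 1, n)
fit_noisy_x = sm.OLS(y, sm.add_constant(x_noisy)).fit()
print(fit_clean.params[1], fit_noisy_x.params[1])  # ~0.5 vs ~0.25
```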
What are the possible consequences of measurement error when one or more IVs have poor reliability in a multiple regression model?
Evidence to assess violation of the assumption of error-free measurement • Reliability estimates for your independent and dependent variables • What would constitute "acceptable" reliability?
How might you attempt to minimize violation of the assumption during the design and planning phase of your study?
Assumption that the regression model has been correctly specified • Linearity • Inclusion of all relevant independent variables • Exclusion of irrelevant independent variables
Assumption of Linearity • Violation of this assumption can lead to downward bias of regression coefficients • If X and Y are curvilinearly related, there are methods for modeling the curvilinearity • These require multiple regression and transformation of variables • Note: we will discuss methods for addressing nonlinear relationships later in the course
Detecting nonlinearity • In bivariate regression, you can examine scatterplots of X and Y • This is not sufficient in multiple regression • However, you can examine partial regression plots between each IV and the DV, controlling for the other IVs • In multiple regression, residuals plots are primarily used
Residuals plots • Typically involve scatterplots with either standardized, studentized, or unstandardized residuals plotted against predicted Y (i.e., versus ŷ)
A residuals scatterplot should reflect a broad horizontal band of points (i.e., should look like a scatterplot for r = 0) • If the plot forms some type of pattern, it could indicate an assumption violation • Specifically, for nonlinearity the plot would reflect a curve
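A short sketch of producing such a plot, assuming statsmodels and matplotlib; the model and data are invented for illustration.

```python
# Minimal sketch (invented model and data): residuals vs. predicted Y.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1 + X @ [0.5, -0.3] + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(X)).fit()

# Want a broad horizontal band of points (i.e., a plot that looks like r = 0).
resid = fit.get_influence().resid_studentized_internal
plt.scatter(fit.fittedvalues, resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Y")
plt.ylabel("Studentized residual")
plt.show()
```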
Sample residuals plot: does this appear to reflect a correlation of 0?
Sample partial regression plot
Assumption that all important independent variables have been included • If omitted variables are correlated with variables in the equation, violation of this assumption can lead to biased parameter estimates (e.g., incorrect values of regression coefficients) • Fairly serious violation
Violation can also lead to non-random residuals (i.e., residuals that include systematic variance associated with the omitted variables) • If omitted variables are not correlated with variables in the model, parameter estimates are not biased, but standard errors associated with the independent variables are biased upward (i.e., inflated)
For example, suppose job satisfaction is regressed on salary alone. The error term then includes autonomy, task enjoyment, working conditions, etc. Therefore, if autonomy, task enjoyment, etc. are correlated with job satisfaction, the residuals (which reflect autonomy, task enjoyment, etc.) would be correlated with predicted job satisfaction
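A hypothetical simulation sketch of this bias, loosely mirroring the salary/autonomy example (all coefficients and data are invented):

```python
# Hypothetical sketch of omitted-variable bias (coefficients invented):
# "autonomy" is correlated with "salary"; omitting it biases salary's slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
salary = rng.normal(size=n)
autonomy = 0.6 * salary + rng.normal(size=n)        # correlated with salary
satisf = 0.4 * salary + 0.5 * autonomy + rng.normal(size=n)

full = sm.OLS(satisf, sm.add_constant(np.column_stack([salary, autonomy]))).fit()
reduced = sm.OLS(satisf, sm.add_constant(salary)).fit()
print(full.params[1])     # ~0.4 (unbiased)
print(reduced.params[1])  # ~0.4 + 0.5 * 0.6 = 0.7 (biased upward)
```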
How do we determine if this assumption is violated? • Can examine residuals plots • Again, plot residuals against predicted values of Y • Again, hope to see a broad horizontal band of points • If the plot reflects some type of discernible pattern, e.g., a linear pattern, it could suggest omitted variables
What can you do if it appears you have violated this assumption?
How might we attempt to prevent violation of this assumption?
Assumption that no irrelevant independent variables have been included • Violation will lead to inflated standard errors for the regression coefficients (not just those corresponding to the irrelevant variables) • What effect could this have on conclusions you draw about the contributions of your independent variables?
How can you determine if you have violated this assumption?
What might you do to avoid this potential assumption violation?
Assumptions about errors • Residuals have a mean of zero • Residuals are random • Residuals are normally distributed • Residuals have equal variance (i.e., homoscedasticity)
Residuals (or errors) are random • Residuals should be uncorrelated with both Y and predicted Y • Residuals should be uncorrelated with the independent variables • Residuals should be uncorrelated with one another • This is comparable to the independence-of-observations assumption • What this means is that the reason for prediction error for one person should be unrelated to the reason for prediction error for another person
If this assumption is violated, tests of significance cannot be trusted • F and t tests are not robust to violations of this assumption • This assumption is most likely to be violated: • in longitudinal studies, or • when important variables have been left out of the equation, or • if observations are clustered, e.g., when subjects are sampled from intact groups or in cluster sampling
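One quick numeric check on independence, when cases have a meaningful order (e.g., longitudinal data), is the Durbin-Watson statistic; this goes beyond the slides, so take it as a hedged sketch with fabricated data (values near 2 suggest uncorrelated adjacent residuals).

```python
# Hedged sketch (fabricated data): Durbin-Watson statistic on the residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 1 + 0.5 * x + rng.normal(size=300)
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))  # ~2 when adjacent residuals are uncorrelated
```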
Residuals are normally distributed • Residuals are assumed to be normally distributed around the regression line for all values of X • This is analogous to the normality assumption in a t-test or ANOVA
Illustration of data that violate the assumption of normality
Normal probability plot of residuals
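A plot like this can be produced with statsmodels; a minimal sketch with fabricated data (points falling near the reference line are consistent with normally distributed residuals):

```python
# Minimal sketch (fabricated data): normal Q-Q plot of OLS residuals.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 1 + 0.5 * x + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(x)).fit()

sm.qqplot(fit.resid, line="45", fit=True)  # points near the line -> normality
plt.show()
```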
Residuals have equal variance • Residuals should be evenly spread around the regression line • Known as the assumption of homoscedasticity • Analogous to the assumption of homogeneity of variance in ANOVA, but with equal variances on Y for each value of X
Illustration of homoscedastic data
Illustration of heteroscedasticity
Further evidence of heteroscedasticity and nonnormality
Why is violation of the homoscedasticity assumption a problem?
What can you do if your data are heteroscedastic? • Can use weighted least squares (WLS) instead of ordinary least squares (OLS) as your estimation procedure • WLS weights each case so that cases with larger error variances receive less weight (in OLS each case is weighted 1)
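A hedged sketch of WLS in statsmodels; the variance model assumed here (error SD proportional to X, hence weights of 1/X²) is invented for illustration.

```python
# Hedged sketch: WLS vs OLS on heteroscedastic data (variance model invented).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 400)
y = 2 + 0.5 * x + rng.normal(0, x)       # error SD grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weight = 1 / error variance
print(ols.bse)  # standard errors from OLS (not trustworthy here)
print(wls.bse)  # standard errors from WLS
```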
Outliers and Influential Cases • Outliers • Influential observations • Leverage • Extreme on both X and Y
What is an outlier? • A case with an extreme value of Y • Presence of outliers can be detected by examination of residuals
Types of residuals used in outlier detection • Standardized residuals • Studentized residuals • Studentized deleted residuals
Standardized Residuals • Unstandardized residuals that have been converted to z-scores • Not recommended by some because their calculation assumes all residuals have the same variance (as measured by the overall standard error of estimate, sY·X)
Studentized Residuals • Similar to standardized residuals but use a different standard deviation for each residual • Generally more sensitive than standardized residuals • Follow an approximate t distribution
Studentized Deleted Residuals • The same as studentized residuals except that the case being examined is removed from the calculation • Addresses a potential problem of studentized residuals, which include the outlier in their calculation (thus increasing the risk of an inflated standard error)
Comparing the three types of residuals
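A sketch computing all three residual types for one fitted model via statsmodels' OLSInfluence (data are hypothetical):

```python
# Sketch (hypothetical data): the three residual types from one fitted model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ [1.0, 0.5, -0.3] + rng.normal(size=100)
fit = sm.OLS(y, X).fit()
infl = fit.get_influence()

standardized = fit.resid / np.sqrt(fit.mse_resid)  # one common SD for all cases
studentized = infl.resid_studentized_internal      # per-case SD via leverage
deleted = infl.resid_studentized_external          # case excluded from its own SD
print(np.column_stack([standardized, studentized, deleted])[:5])
```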
Leverage • Reflects cases with extreme values on one or more of the independent variables • May or may not exert influence on the equation
How does one identify cases with high leverage? • SPSS produces values of leverage (h), which can range between 0 and 1 • One "rule of thumb" suggests h > 2(k + 1)/N as a high leverage value • Another rule of thumb is that h ≤ .2 indicates trivial leverage, whereas values > .2 suggest substantial leverage requiring further examination • Other researchers recommend looking at relative differences
Leverage Example (based on 3 IVs, N = 171)
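A minimal sketch of flagging high-leverage cases with the 2(k + 1)/N rule of thumb, using fabricated data with k = 3 IVs and N = 171 to match the example:

```python
# Minimal sketch (fabricated data, k = 3 IVs, N = 171 to match the example):
# flag cases whose hat value exceeds 2(k + 1)/N.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
k, n = 3, 171
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ [1.0, 0.4, 0.3, 0.2] + rng.normal(size=n)

h = sm.OLS(y, X).fit().get_influence().hat_matrix_diag
cutoff = 2 * (k + 1) / n                 # ~0.047 here
print(np.where(h > cutoff)[0])           # indices of high-leverage cases
```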
Mahalanobis distance (D²) • A method for detecting multivariate outliers, i.e., cases with unexpected combinations of independent variables • Represents the distance of a case from the centroid of the remaining cases, where the centroid represents the intersection of the means of all the variables • One rule of thumb suggests high values are those that exceed the χ² critical value with degrees of freedom equal to the number of IVs in the model
Mahalanobis D² example (note: model based on 6 IVs)
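A hedged sketch of computing D² and flagging cases against the χ² critical value with df = number of IVs; it simplifies by using the full-sample centroid rather than leaving each case out, and the α = .001 cutoff is one common convention rather than something specified in the slides.

```python
# Hedged sketch (fabricated data, 6 IVs as in the example): Mahalanobis D2
# from the full-sample centroid, flagged against the chi-square critical
# value with df = number of IVs (alpha = .001 is one common convention).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 6))
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

cutoff = stats.chi2.ppf(0.999, df=X.shape[1])
print(np.where(d2 > cutoff)[0])          # flagged multivariate outliers
```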
It should be noted that just because a case is an outlier and/or exhibits high leverage does not necessarily mean it is influential.