Learn about influential outliers and leverage points in simple linear regression, quantify their impact, and assess model assumptions with residual plots and diagnostics tools. Understand how to transform models when assumptions are violated.
Diagnostics and Transformation for SLR • In general, it makes sense to base inference and conclusions only on valid models. • So we need to make sure we are fitting an appropriate model. • For this we need to plot the data! • Example… STA302/1001 week 5
Influential Points, Outliers and Leverage Points • Observations whose inclusion or exclusion results in substantial changes in the fitted model are said to be influential. • A point can be outlying in any (or all) of the value of the explanatory variable, the dependent variable or its residual. • An outlier with respect to the residual represents model failure, i.e., the line doesn’t fit this point adequately. These are typically outliers with respect to the dependent variable. • Outliers with respect to the explanatory variable are called leverage points. They may be influential, may uniquely determine the regression coefficient, and may cause the S.E. of the regression coefficient to be smaller than it would be if the point were removed. • Textbooks distinguish between “good” leverage points that follow the pattern of the data and “bad” leverage points that are influential.
Quantifying Leverage • To determine if a point is a leverage point we calculate the following…
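The formula from this slide did not survive extraction. As a minimal sketch, the standard leverage for simple linear regression is h_ii = 1/n + (x_i - x̄)² / Σ(x_k - x̄)², which is what the code below computes. The function name `leverages`, the data, and the 4/n cutoff (the common 2p/n rule of thumb with p = 2) are assumptions for illustration and may differ from what the course uses.

```python
def leverages(x):
    """Leverage h_ii of each observation in a simple linear regression on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # h_ii = 1/n + (x_i - xbar)^2 / Sxx
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Illustrative data (made up): the last point lies far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverages(x)

# Assumed rule of thumb: flag points whose leverage exceeds 2p/n = 4/n.
cutoff = 4.0 / len(x)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print(flagged)
```

The leverages always sum to 2 in simple linear regression (the number of fitted coefficients), so a single value close to 1 means that point nearly determines its own fitted value.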
Measuring Influence of the ith Observation • There are three main measurements for assessing the influence of an observation. • Each of these measures uses a different aspect of the fitted model to assess the influence of an observation. • Notation: Subscript (i) indicates that the ith observation has been deleted from the data and the regression has been re-fit using the remaining n-1 data points.
Measurement I for Influence – DFBETAS • This measure examines how the estimates of β0 and β1 change with and without the ith observation. • It is the difference in betas defined by… • Interpretation….
Measurement II for Influence – DFFITS • This measure examines how the ith predicted value changes with and without the ith observation in the model. • It is the difference in predicted values defined by:… • Interpretation…
Measurement III for Influence – Cook’s Distance • Cook’s distance measures how much the fit at all points changes with and without the ith observation in the data. • It is defined by:… • Interpretation…
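The defining formulas and interpretations for the three measures were lost from the slides. The sketch below assumes the usual textbook definitions: each measure compares the full fit with the fit obtained after deleting observation i, scaled by the deleted-case MSE for DFBETAS and DFFITS, and by p·MSE with p = 2 for Cook's distance. The helper names `fit` and `influence` and the data are hypothetical.

```python
from math import sqrt

def fit(x, y):
    """Least-squares intercept, slope and MSE for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b0, b1, mse

def influence(x, y):
    """DFBETAS, DFFITS and Cook's distance, by refitting without each point."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1, mse = fit(x, y)
    c00 = 1.0 / n + xbar ** 2 / sxx   # diagonal of (X'X)^{-1} for the intercept
    c11 = 1.0 / sxx                   # diagonal of (X'X)^{-1} for the slope
    out = []
    for i in range(n):
        # Refit with observation i deleted (the "(i)" subscript in the notation).
        b0_i, b1_i, mse_i = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        h = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        # Change in every fitted value caused by deleting observation i.
        shift = [(b0 + b1 * xj) - (b0_i + b1_i * xj) for xj in x]
        out.append({
            "dfbetas0": (b0 - b0_i) / sqrt(mse_i * c00),
            "dfbetas1": (b1 - b1_i) / sqrt(mse_i * c11),
            "dffits": shift[i] / sqrt(mse_i * h),
            "cooks_d": sum(s * s for s in shift) / (2 * mse),
        })
    return out

# Illustrative data (made up): the last point has high leverage and is off-trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]
measures = influence(x, y)
most_influential = max(range(len(x)), key=lambda i: measures[i]["cooks_d"])
print(most_influential)
```

Commonly quoted rules of thumb (not necessarily the ones given in the lecture) flag |DFBETAS| above 2/√n, |DFFITS| above 2√(p/n), and Cook's distance that is large relative to the others or above roughly 1.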
Residuals • The residuals are estimates of the error term, εi, in the model. • What do we know about the ei? • Further, since the ei are linear functions of the Yi, they are random variables with mean and variance…. • Further, they have a Normal distribution but they are NOT uncorrelated. • However, as n → ∞, with the number of predictors held constant, the correlation among the ei’s goes to 0 and their variance becomes constant. • So we will ignore these problems with using ei as estimates of εi.
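The mean and variance expressions are missing from the extracted slide. The standard results for simple linear regression, stated here as a reference sketch rather than as the slide's exact notation, are:

```latex
\begin{align*}
E(e_i) &= 0, \\
\operatorname{Var}(e_i) &= \sigma^2 (1 - h_{ii}), \\
\operatorname{Cov}(e_i, e_j) &= -\sigma^2 h_{ij}, \quad i \neq j, \\
h_{ij} &= \frac{1}{n} + \frac{(X_i - \bar{X})(X_j - \bar{X})}{\sum_{k=1}^{n} (X_k - \bar{X})^2}.
\end{align*}
```

Under these formulas the residuals have unequal variances and are correlated whenever the h_ij are non-zero. As n grows with the design well spread out, the h_ij shrink toward 0, which is why both problems fade in large samples.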
Possible Departures from Model Assumptions • We will use the residuals to examine the following possible departures from the simple linear regression model with normal errors. • The regression function is not linear, i.e., the straight line model is not appropriate. • The error terms do not have constant variance. • The error terms are not normally distributed. • There are outliers and/or influential points.
Residual Plots • Residual plots are used to check the model assumptions. • We look for evidence of any of the possible departures described above. • The recommended plots are: residuals versus the predicted values, residuals versus the Xi, and a normal quantile plot of the residuals.
Other Diagnostic Tools • Univariate analysis of standardized residuals, such as a stem-and-leaf plot, box-plot and histogram, is useful for examining departures from the normal distribution. • A plot of the absolute value of the residuals versus the predicted (fitted) values is useful in examining whether the variance of the errors is constant. This plot shows non-constant variance more sharply. • A plot of standardized residuals versus time or other spatial sequence in the observations helps indicate correlation in the observations. • A plot of standardized residuals versus other potential predictors helps us determine whether we should include the other predictor in the model.
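As a rough sketch of the quantities these plots are built from, the code below computes fitted values, raw residuals, absolute residuals and standardized residuals for a simple linear regression. It assumes the common "internally studentized" definition e_i / sqrt(MSE·(1 - h_ii)); the slides do not say which standardization the course uses. The function name `residual_diagnostics` and the data are hypothetical, and the plotting itself is left to whatever tool is at hand.

```python
from math import sqrt

def residual_diagnostics(x, y):
    """Fitted values and several residual variants for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    mse = sum(e * e for e in resid) / (n - 2)
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Assumed definition: residual divided by its estimated standard deviation.
    std_resid = [e / sqrt(mse * (1.0 - h)) for e, h in zip(resid, lev)]
    return {
        "fitted": fitted,                     # horizontal axis of most plots
        "resid": resid,                       # residuals vs fitted, vs X, QQ plot
        "abs_resid": [abs(e) for e in resid], # constant-variance check
        "std_resid": std_resid,               # univariate and sequence plots
    }

# Illustrative data (made up) whose spread grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.7, 5.6, 5.1, 8.3, 6.4]
diag = residual_diagnostics(x, y)
for f, a in zip(diag["fitted"], diag["abs_resid"]):
    print(round(f, 2), round(a, 2))
```

Plotting `abs_resid` against `fitted` gives the constant-variance check described above; an upward trend suggests the error variance grows with the mean.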
What to do if Assumptions are Violated? • Abandon simple linear regression for something else (usually more complicated). • Some examples of alternative models: • weighted least squares – an appropriate method if the variance is non-constant. • methods that allow for non-normal errors (STA303). • methods that allow for correlated errors (e.g., time series, longitudinal models). • polynomial or other non-linear models.
Dealing with Outliers / Influential points • First, check for clerical / measurement error. • Consider a transformation if the points come from a skewed distribution or a distribution with a long tail. • Use robust regression, which is appropriate when the errors are from a distribution with heavy tails. • Consider reporting results with and without the outliers. • Think about whether an outlier is beyond the region where the linear model holds; if so, fit the model on a restricted range of the independent variable to exclude unusual points. • For gross outliers that are probably mistakes, consider deleting them, but be cautious if there is no evidence of a mistake.
Transformations • Transformations are used as a remedy for non-linearity, non-constant variance and non-normality. • If the relationship is non-linear but the variance of Y is approximately constant, try to find a transformation of X that results in a linear relationship. • Most common monotonic transformations are: • If not a straight line and non-constant variance, transform Y. • If straight line and non-constant variance, transform both X and Y or use weighted least squares. • Transforming changes the relative spacing of the observations.
Transformation to Stabilize the Variance • If Y has a distribution with mean μ and variance σ², then the mean and variance of Z = f(Y) are approximately E(Z) ≈ f(μ) and Var(Z) ≈ [f′(μ)]²σ². • Proof: • This result gets used to derive variance stabilizing transformations.
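The proof on the slide did not survive extraction. A sketch of the usual first-order Taylor (delta method) argument follows, assuming f is differentiable at μ and Y is concentrated near μ. The final illustration, with variance proportional to the mean, is an added example and is not taken from the slides.

```latex
\begin{align*}
Z = f(Y) &\approx f(\mu) + f'(\mu)\,(Y - \mu), \\
E(Z) &\approx f(\mu) + f'(\mu)\,E(Y - \mu) = f(\mu), \\
\operatorname{Var}(Z) &\approx [f'(\mu)]^2 \operatorname{Var}(Y) = [f'(\mu)]^2 \sigma^2 . \\[6pt]
\text{Illustration: if } \sigma^2 = c\,\mu \text{ and } f(Y) = \sqrt{Y},
  \text{ then } f'(\mu) &= \frac{1}{2\sqrt{\mu}}, \qquad
  \operatorname{Var}(Z) \approx \frac{1}{4\mu}\, c\,\mu = \frac{c}{4}.
\end{align*}
```

In the illustration the approximate variance of Z no longer depends on μ, which is the sense in which the square root stabilizes the variance.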
Examples
SAS Example • In an industrial laboratory, under uniform conditions, batches of electrical insulating fluid were subjected to constant voltages until the insulating property of the fluids broke down. Seven different voltage levels, spaced 2 kV apart from 26 to 38 kV, were studied. • The measured responses were the times, in minutes, until breakdown.
Interpreting log-transformed Data • If log Y = β0 + β1X + ε then Y = exp(β0)·exp(β1X)·exp(ε), taking log to be the natural logarithm. • The errors are multiplicative. • An increase in X of 1 unit is associated with a multiplicative change in Y by a factor of exp(β1). • Example: • If Y = β0 + β1 log X + ε, for each k-fold change in X, Y changes, on average, by β1 log k. • Example: if X is cut in half, Y changes, on average, by β1 log(½).
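A small numeric check of both interpretations, assuming natural logarithms. All coefficient values are hypothetical and chosen only for illustration; they are not estimates from the insulating-fluid data. For the log Y model, exp(β0 + β1X) is treated as the typical (median) value of Y, which holds when ε is symmetric about 0.

```python
from math import exp, log

# Model 1: log Y = b0 + b1 * X + error (hypothetical coefficients).
b0, b1 = 2.0, -0.5

def typical_y(x):
    """Typical (median) Y at a given X under the log-Y model."""
    return exp(b0 + b1 * x)

# A 1-unit increase in X multiplies the typical Y by exp(b1).
factor = typical_y(31.0) / typical_y(30.0)
print(factor, exp(b1))

# Model 2: Y = g0 + g1 * log X + error (hypothetical coefficients).
g0, g1 = 10.0, 3.0

def mean_y(x):
    """Mean of Y at a given X under the log-X model."""
    return g0 + g1 * log(x)

# Cutting X in half (8 -> 4) changes the mean of Y by g1 * log(1/2).
change = mean_y(4.0) - mean_y(8.0)
print(change, g1 * log(0.5))
```

In the first model the factor does not depend on the starting value of X, and in the second the change does not depend on the starting value of X, only on the ratio k.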
Violation of Normality of ε’s • By the Central Limit Theorem, linear combinations of many independent random variables, such as the least-squares estimates (which are linear in the Yi), are approximately normally distributed in large samples, largely regardless of the original distribution. • So CIs and tests for β0, β1, and E(Y | X) are robust against non-normality (i.e., have approximately the correct coverage or approximately the correct P-value). • Prediction intervals are not robust against departures from Normality because they are for one point.
Relative Importance of Assumptions • The most important assumption is that the form of the model is appropriate and E(ε) = 0. • The second most important assumption is independence of observations. • The third most important assumption is constant variance. • The least important assumption is Normality of the errors, because of the CLT. It is, however, a necessary assumption for PI’s.