Lecture 8: Regression Diagnostics

Lecture 8: Regression Diagnostics, February 5, 2014

Question Assume we’re trying to estimate SAT Math scores by HS GPA and all students have a GPA of at least 1.0 • Let SAT.M be the SAT Math score • Let GPA be the standardized z-score of GPA We’ve fitted the following model: Which of the following is correct? • The intercept is an extrapolation since all students have a GPA >= 1.0 • The intercept is the SAT.M score for someone with an average GPA • On average, a 1 point increase in GPA would be associated with a 137 point increase in SAT math • None of the above, more than one of the above, or not enough information to tell • I have no idea

Administrative • Quiz 2 results online • Overall a little better: average = 14.84 • Homework 4: • Due Monday (9am). I’ll post the solutions at 9, but the homework won’t be graded before the exam. • Exam 1: Next Wednesday (1 week) • 80min; exams turned in at 10:20/11:50. • Open book / hardcopy notes. No electronic devices or files. • All material through chapter 22 • Monday: • Finish chapter 22. • More examples and review • If there is a topic or question you want me to review in lecture, email me by Saturday morning. • Discussion board is always an option, but don’t expect feedback Tuesday night.

Prediction Intervals • Say we have a new X observation and we want to predict where the Y will occur. • The regression line will give the expected (average) value of Y given X • If we want to be more certain about where Y will occur, then we need to calculate the Prediction Interval:

Prediction Intervals Different from a confidence interval derived from se. Why? • Think about what we’ve done so far: • We have some data and we estimated b0 and b1 for some true β’s. • We can calculate a confidence interval for the β’s given our b’s. • The standard error of the regression, se, from before describes the spread of the errors around the true line (the true β’s); it does not account for the uncertainty in our b’s. • The data we have to estimate our regression line is a sample from some larger population. With a different draw we would get slightly different data, and so slightly different estimates. • Look at the formula again: what happens as n → ∞?

Prediction Intervals • So when we have infinite data (or “large enough”): • se(ynew) ≈ se • Also notice how the size of se(ynew) changes depending on where the xnew value is: • Predictions become less precise as xnew moves away from the average x
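The two points on this slide can be checked numerically. The sketch below assumes the usual textbook formula for the simple regression model, se(ynew) = se * sqrt(1 + 1/n + (xnew - x̄)² / Σ(xi - x̄)²), which is not reproduced in these notes. The data are made up for illustration (they are not the Diamonds data), and the function names are hypothetical.

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, s_e)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))  # standard error of the regression
    return b0, b1, s_e

def se_prediction(x, s_e, x_new):
    """Standard error for predicting a new y at x_new (assumed textbook formula)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return s_e * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
b0, b1, s_e = simple_ols(x, y)

# The interval widens as x_new moves away from the mean of x (3.5 here).
for x_new in (3.5, 6.0, 12.0):
    y_hat = b0 + b1 * x_new
    half_width = 2 * se_prediction(x, s_e, x_new)  # rough 95% interval, 2 SEs
    print(f"x_new={x_new}: {y_hat:.2f} +/- {half_width:.2f}")
```

With a very large n and an xnew near the mean, the square-root factor is close to 1, which is why se(ynew) is close to se in that case.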

Question • Using the Diamonds.csv data, estimate a simple regression model to predict price by weight. The 95%-confidence interval around the intercept is approximately: • [$-97.97, $184.95] • [$43.48, $71.90] • [$2330.41, $3009.09] • [$-296.94, $383.92]

Prediction Interval • Diamonds Data • Estimate Price by Weight • Predict the price of a diamond that weighs 0.49 carats • $1351.67 • $1244.47 • $1453.42 • $1137.26 • Use 2 se to give an approximate interval • 1351.67 ± 2 * se → [1011.24, 1692.09] • Use StatTools to predict the observation and interval • [1015.24, 1688.09] • On your own: repeat but choose a point farther away from the mean weight. • Truncate the dataset (20 obs) and repeat

Quantile Plots • Why look at quantile plots? • Normality assumption: residuals are normally distributed, and plotting a histogram can be misleading • We should generate a quantile plot of the model’s residuals and verify that they are in fact normally distributed • How? • Possible in Excel (might show you next time) but a slight pain • Easy in most statistics software packages • StatTools Q-Q plot.
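For anyone working outside Excel or StatTools, the points of a normal quantile plot can be computed with the Python standard library alone. This is a minimal sketch: the residuals are invented, the function name is hypothetical, and the plotting positions (i - 0.5)/n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(residuals):
    """Return (theoretical quantile, observed residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Normal distribution with the same mean and SD as the residuals.
    fitted = NormalDist(mean(residuals), stdev(residuals))
    # Plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals, for illustration only.
residuals = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.7, 1.1, 2.3]
for q_theory, q_obs in normal_qq_points(residuals):
    print(f"{q_theory:7.2f} {q_obs:7.2f}")
```

If the residuals are roughly normal, the pairs fall close to the line observed = theoretical; systematic bending at the ends suggests skewness or heavy tails.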

Common Problems: Leveraged Outliers • We’ve talked about the dangers of extrapolating from the range of our data, but how do we know the range of our data is OK for making inference? • We don’t always. Is the observation with a very large size an outlier?

Common Problems: Leveraged Outliers • Deciding if an observation is an outlier is often very difficult • What’s important – and easier – is seeing if your analysis is sensitive to that observation. I.e., is the observation changing the results? (aka leveraging) How do you check? • If you suspect an outlier: rerun the analysis without that observation

Common Problems: Leveraged Outliers • Does it matter? To see the consequences of an outlier, fit the model with and without that observation.

Common Problems: Leveraged Outliers Use the standard errors obtained to compare the estimates from both models: • How does including the outlier change the estimate of b0? • Increases the intercept by 1.5 standard errors • How does it change b1? • Including the outlier shifts estimated marginal cost (the slope) down by a little over 1.5 standard errors. • What about r2? • r2 is lower without the outlier. Isn’t this bad? No… not always. r2 is just one summary statistic of model fit. In this example se is smaller without the outlier.
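The with/without comparison can be sketched in a few lines. The data below are invented (they are not the contractor data), the function name is hypothetical, and scaling the shift by the standard errors from the fit without the suspect point is one reasonable choice, not necessarily the one used in lecture.

```python
import math

def fit(x, y):
    """Simple regression; returns estimates, their standard errors, s_e and r2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    return {
        "b0": b0, "b1": b1, "s_e": s_e, "r2": 1 - sse / syy,
        "se_b0": s_e * math.sqrt(1 / n + x_bar ** 2 / sxx),
        "se_b1": s_e / math.sqrt(sxx),
    }

# Hypothetical data; the last point has an unusually large x (high leverage).
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [12, 15, 15, 19, 20, 24, 25, 40]

with_pt = fit(x, y)
without_pt = fit(x[:-1], y[:-1])

# Express each shift in units of the standard error from the fit without the point.
for name in ("b0", "b1"):
    shift = (with_pt[name] - without_pt[name]) / without_pt["se_" + name]
    print(f"{name}: including the point shifts the estimate by {shift:+.2f} SEs")
print(f"r2  with: {with_pt['r2']:.3f}  without: {without_pt['r2']:.3f}")
print(f"s_e with: {with_pt['s_e']:.3f}  without: {without_pt['s_e']:.3f}")
```

A shift of more than a standard error or two from a single observation is a sign that the analysis is sensitive to that point.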

Common Problems: Leveraged Outliers Fixing it: • So what if you find your results are sensitive to the inclusion of an observation (or two)? • First: usually very good (and useful) to know. • Second: sometimes data that look like outliers and change the results of the analysis are not outliers. They are legitimate but different data, and knowing that matters. • In general, be cautious and try to get more info. • If the outlier describes what is expected the next time under the same conditions, then it should be included. • In the contractor example, more information is needed to decide whether to include or exclude the outlier.

Common Problems: Heteroscedasticity It sounds like a disease but it’s not. It’s actually quite common. • Problem: Changing Variation • It’s common; we’ve actually already seen it. • Heteroscedastic: errors have different amounts of variation • Homoscedastic: errors have equal amounts of variation • Ex: Predicting home prices by size: • Linear? • Yes • Omitted variable? • Sample of homes is from the same neighborhood. We’re probably OK • Variance is not equal. The variation in price is increasing as home size increases:

Common Problems: Heteroscedasticity Detecting differences in variation: • Look at the x-y scatterplot. Sometimes hard to tell. How else? • Look at the residuals • Fan-shaped scatterplot: • Side-by-side boxplots:
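A numerical counterpart to the side-by-side boxplots is to sort the residuals by x, cut them into groups, and compare the spread within each group. The sketch below uses invented home sizes and residuals and a hypothetical helper name; it is a rough check, not a formal test.

```python
from statistics import stdev

def spread_by_group(x, residuals, n_groups=3):
    """Sort by x, cut into roughly equal groups, return the residual SD of each."""
    pairs = sorted(zip(x, residuals))
    size = len(pairs) // n_groups
    groups = [pairs[i * size:(i + 1) * size] for i in range(n_groups - 1)]
    groups.append(pairs[(n_groups - 1) * size:])  # last group takes any remainder
    return [stdev([r for _, r in group]) for group in groups]

# Hypothetical home sizes (sq ft) and residuals with a fan shape.
sqft = [800, 900, 1000, 1100, 1500, 1600, 1700, 1800, 2500, 2700, 2900, 3100]
resid = [-5, 4, -3, 6, -15, 12, 18, -14, -40, 35, 50, -45]

for label, sd in zip(("small", "medium", "large"), spread_by_group(sqft, resid)):
    print(f"{label:>6} homes: residual SD = {sd:.1f}")
```

Residual SDs that climb steadily from the small-x group to the large-x group are the numerical signature of the fan shape.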

Common Problems: Heteroscedasticity It’s common, so what does that mean for my analysis? • Point estimates for β’s: • Still OK. No bias. • Prediction and Confidence intervals: • Not reliable; too narrow or too wide. • Hypothesis tests regarding β0 and β1 are not reliable.

Common Problems: Heteroscedasticity Fixing the problem: • Revise the model: how will depend on the substance. Originally: Price = β0 + β1 * SqFt + ε where β0 is the fixed cost and β1 is the marginal cost • Instead estimate Price/SqFt by dividing the original equation by SqFt: Price/SqFt = β1 + β0 * (1/SqFt) + ε/SqFt • Notice the change in the intercept and slope: β1 (marginal cost) is now the intercept and β0 (fixed cost) is now the slope.

Common Problems: Heteroscedasticity Fixing the problem: • Revise the model: Price/SqFt = M + F * (1/SqFt) + ε • The response variable becomes price per square foot and the explanatory variable becomes the reciprocal of the number of square feet. • The marginal cost M is the intercept and the slope F is the fixed cost. • Do the residuals have similar variances? • In this case they do:
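To see the swap of intercept and slope in practice, the sketch below builds made-up prices whose noise grows with home size, then fits the revised model. The values F = 50 and M = 0.15, the noise pattern, and the function name are all assumptions chosen for illustration; none of them come from the lecture data.

```python
def ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: fixed cost F = 50, marginal cost M = 0.15 per sq ft,
# and an error that is proportional to size (so larger homes vary more).
F_TRUE, M_TRUE = 50.0, 0.15
sqft = [800, 1000, 1200, 1500, 1800, 2000, 2400, 2800, 3200, 3600]
noise_per_sqft = [-0.01, 0.02, -0.02, 0.015, 0.01, -0.015, -0.01, 0.02, -0.015, 0.005]
price = [F_TRUE + M_TRUE * s + u * s for s, u in zip(sqft, noise_per_sqft)]

# Revised model: Price/SqFt = M + F * (1/SqFt) + error
y = [p / s for p, s in zip(price, sqft)]
x = [1 / s for s in sqft]
m_hat, f_hat = ols(x, y)  # intercept estimates M, slope estimates F

print(f"estimated marginal cost M (intercept): {m_hat:.4f}")
print(f"estimated fixed cost F (slope):        {f_hat:.2f}")
```

Because the error in this toy example is proportional to size, dividing by SqFt leaves an error term with roughly constant spread, which is the reason the transformation helps.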

Common Problems: Heteroscedasticity Comparing the revised and original model: • The revised model may have a different (and smaller) r2. • Again, so what? r2 is useful, but it’s only one notion of fit. And the response variable is different anyway, so the two r2 values are not directly comparable. • It provides a narrower confidence interval for fixed and variable costs:

Common Problems: Heteroscedasticity Comparing the revised and original model: • It also provides a more sensible prediction interval • Don’t blindly fit models. Think about the substance and data! • The data originally indicated that large homes varied in price more:

Question • If the outlier shown in the scatterplot is removed and the regression of revenue on TV advertising is done again, the results would indicate that which of the following has occurred? (a) An increase in the standard error of the residuals. (b) Very little change in the slope of the regression line. (c) A decrease in the standard error of the residuals. (d) A change in the slope of several standard errors when compared to the slope with the outlier in the data. (e) Both (b) and (c).

Common Problems: Correlated Errors Problem: Dependence between residuals (autocorrelation) • The amount of error (detected by the size of the residual) you make at time t + 1 is related to the amount of error you make at time t. • Why is this a problem? • The SRM (simple regression model) assumes that the errors ε are independent. • Common problem for time series data, but not just a time series problem. • Any ideas where else this might be common?

Common Problems: Correlated Errors Detecting the problem: • Much easier with time series data: plot the residuals versus time and look for a pattern. Not guaranteed to find it but often helpful. • Use the Durbin-Watson statistic to test for correlation between adjacent residuals (aka serial- or auto-correlation) • With time series data adjacency is temporal. • In non-time-series data, adjacency is less obvious, but we’re talking about errors “next to” one another being related. • For things like spatial autocorrelation, there are more advanced tools, such as mapping the residuals and formal tests

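Durbin-Watson Statistic • Tests to see if the correlation between adjacent residuals is 0 • Null hypothesis: H0: ρε = 0 • It’s calculated as: D = Σ (e_t - e_(t-1))² / Σ e_t², where the numerator sums over t = 2, …, n and the denominator over t = 1, …, n • From the Durbin-Watson statistic, D, and the sample size you can calculate the p-value for the hypothesis test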
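The statistic itself is easy to compute by hand once the residuals are in time order. The sketch below uses two invented residual series and a hypothetical function name; it computes D only and does not attempt the p-value, which needs tables or software.

```python
def durbin_watson(residuals):
    """D = sum over t=2..n of (e_t - e_{t-1})^2, divided by sum over t=1..n of e_t^2.

    The residuals must be in time (or adjacency) order.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, in time order.
drifting = [1.0, 1.2, 0.9, 0.7, 0.2, -0.3, -0.8, -1.1, -0.9, -0.9]  # neighbours alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]          # neighbours opposite

print(f"drifting:    D = {durbin_watson(drifting):.2f}")     # well below 2
print(f"alternating: D = {durbin_watson(alternating):.2f}")  # well above 2
```

D is approximately 2(1 - r), where r is the correlation between adjacent residuals, so values near 2 are consistent with no autocorrelation, values near 0 point to positive autocorrelation, and values near 4 point to negative autocorrelation.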

Common Problems: Correlated Errors Consequences of Dependence: • With autocorrelation in the errors the estimated standard errors are too small • Estimated slope and intercept are less precise than the output indicates How do you fix it? • Usually by modeling it (revising the model). • We’ll go through an example next time
