
Statistical Inference and Regression Analysis: GB.3302.30

Statistical Inference and Regression Analysis: GB.3302.30. Professor William Greene Stern School of Business IOMS Department Department of Economics. Statistics and Data Analysis. Part 7 – Regression Model-1 Regression Diagnostics. Using the Residuals.


Presentation Transcript


  1. Statistical Inference and Regression Analysis: GB.3302.30 Professor William Greene Stern School of Business IOMS Department Department of Economics

  2. Statistics and Data Analysis Part 7 – Regression Model-1 Regression Diagnostics

  3. Using the Residuals • How do you know the model is “good?” • Various diagnostics to be developed over the semester. • But, the first place to look is at the residuals.

  4. Residuals Can Signal a Flawed Model • Standard application: Cost function for output of a production process. • Compare linear equation to a quadratic model (in logs) • (123 American Electric Utilities)

  5. Electricity Cost Function

  6. Candidate Model for Cost • log c = a + b log q + e • Figure annotations: "Most of the points in this area are above the regression line" (marked on two regions of the plot); "Most of the points in this area are below the regression line" (marked on one region).

  7. A Better Model? log Cost = α + β1 logOutput + β2 [logOutput]² + ε

  8. Candidate Models for Cost • The quadratic equation is the appropriate model. • log c = a + b1 log q + b2 log² q + e

  9. Missing Variable Included • Figure panels: residuals from the quadratic cost model; residuals from the linear cost model.
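The residual pattern described on slides 6 to 9 can be reproduced with a small sketch. The data below are synthetic (an assumed quadratic curve in log output, not the 123-utility sample), and `fit_line` is a hypothetical helper written for this illustration. Fitting a straight line to a curve that is truly quadratic leaves residuals that are positive at both ends of the range and negative in the middle.

```python
# Illustrative only: synthetic data, not the electricity cost data from the slides.

def fit_line(x, y):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Assumed "true" cost curve, quadratic in log output (coefficients are made up).
log_q = [i / 10 for i in range(0, 61)]                 # 0.0, 0.1, ..., 6.0
log_c = [1.0 + 0.2 * q + 0.1 * q ** 2 for q in log_q]

# Fit the (misspecified) linear model and compute its residuals.
a, b = fit_line(log_q, log_c)
residuals = [c - (a + b * q) for q, c in zip(log_q, log_c)]

# Average residual in the low, middle, and high thirds of the output range.
third = len(residuals) // 3
low = sum(residuals[:third]) / third
mid = sum(residuals[third:2 * third]) / third
high = sum(residuals[2 * third:]) / (len(residuals) - 2 * third)
print(f"mean residual: low={low:.3f}, middle={mid:.3f}, high={high:.3f}")
```

A systematic sign pattern like this in the residuals is the signal that a term (here the squared log output) is missing from the model.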

  10. Unusual Data Points • Outliers have (what appear to be) very large disturbances, ε. • Figure panels: wolf weight vs. tail length; the 500 most successful movies.

  11. Outliers • About 99.7% of normally distributed observations lie within the mean ± 3 standard deviations. (We show (a + bx) ± 3se below.) • Titanic is 8.1 standard deviations from the regression! • Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.) • These observations might deserve a close look.
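The (a + bx) ± 3se screen on slide 11 can be sketched as follows. The data are synthetic with one planted outlier (not the movie data), and `fit_line` and `flag_outliers` are hypothetical helper names introduced for this illustration only.

```python
import math
import random

def fit_line(x, y):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def flag_outliers(x, y, k=3.0):
    """Indices whose residual lies outside (a + b*x) +/- k*s_e, plus s_e itself."""
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(e * e for e in resid) / (len(x) - 2))
    return [i for i, e in enumerate(resid) if abs(e) > k * s_e], s_e

# Synthetic sample: y = 2 + 3x + N(0, 1) noise, with one observation shifted up.
rng = random.Random(0)
x = [i / 10 for i in range(50)]
y = [2 + 3 * xi + rng.gauss(0, 1) for xi in x]
y[25] += 15                      # planted outlier (an assumption of this demo)

flagged, s_e = flag_outliers(x, y)
print("flagged observations:", flagged, "s_e =", round(s_e, 3))
```

Note that the outlier itself inflates s_e, so the ±3se bounds are wider than they would be without it. This is one reason the flagged observations deserve a close look rather than automatic deletion.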

  12. Prices paid at auction for Monet paintings vs. surface area (in logs) • logPrice = a + b logArea + e • Not an outlier: Monet chose to paint a small painting. • Possibly an outlier: Why was the price so low?

  13. What to Do About Outliers (1) Examine the data. (2) Are they due to measurement error or obvious “coding errors”? Delete the observations. (3) Are they just unusual observations? Do nothing. (4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large. 10 wolves is not.) (5) Question why you think it is an outlier. Is it really?

  14. Regression Options

  15. Diagnostics

  16. On Removing Outliers Be careful about singling out particular observations this way. The resulting model might be a product of your opinions, not the real relationship in the data. Removing outliers might create new outliers that were not outliers before. Statistical inferences from the model will be incorrect.

  17. Statistics and Data Analysis Part 7 – Regression Model-2 Statistical Inference

  18. b As a Statistical Estimator • What is the interest in b? • β = dE[y|x]/dx • Effect of a policy variable on the expectation of a variable of interest. • Effect of medication dosage on disease response • … many others

  19. Application: Health Care Data • German Health Care Usage Data: 27,326 observations in all on German households, 1984-1994. • DOCTOR = 1(number of doctor visits > 0) • HOSPITAL = 1(number of hospital visits > 0) • HSAT = health satisfaction, coded 0 (low) to 10 (high) • DOCVIS = number of doctor visits in last three months • HOSPVIS = number of hospital visits in last calendar year • PUBLIC = 1 if insured in public health insurance; otherwise 0 • ADDON = 1 if insured by add-on insurance; otherwise 0 • INCOME = household nominal monthly net income in German marks / 10000 • HHKIDS = 1 if children under age 16 in the household; otherwise 0 • EDUC = years of schooling • AGE = age in years • MARRIED = marital status

  20. Regression? • Population relationship: Income = α + β Health + ε • For this population, Income = .31237 + .00585 Health + ε, so E[Income | Health] = .31237 + .00585 Health

  21. Distribution of Health

  22. Distribution of Income

  23. Average Income | Health • Group sizes Nj for Health = 0, 1, …, 10: 447, 255, 642, 1173, 1390, 4233, 2570, 4191, 6172, 3061, 3192 (total 27,326)

  24. b is a statistic • Random because it is a sum of the ε’s. • It has a distribution, like any sample statistic.

  25. Sampling Experiment • 500 samples of N=52 drawn from the 27,326 (using a random number generator to simulate N observation numbers from 1 to 27,326) • Compute b with each sample • Histogram of 500 values

  26. Conclusions • Sampling variability • Seems to center on β • Appears to be normally distributed
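The sampling experiment on slides 25 and 26 can be imitated with a short simulation. The German health care data are not reproduced here, so the sketch builds a synthetic population of 27,326 records. Uniform health scores and normal noise with standard deviation 0.15 are assumptions; the population coefficients are taken from slide 20. The helper `slope` and the function `sampling_experiment` are names invented for this example.

```python
import random
import statistics

def slope(x, y):
    """Least squares slope b = Sxy / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

rng = random.Random(12345)
POP_SIZE = 27_326

# Synthetic stand-in for the population (assumed distributions, not the real data).
health = [rng.randint(0, 10) for _ in range(POP_SIZE)]
income = [0.31237 + 0.00585 * h + rng.gauss(0, 0.15) for h in health]
beta_pop = slope(health, income)          # slope in the full synthetic population

def sampling_experiment(n=52, reps=500):
    """Draw `reps` random samples of size n and return the slope from each."""
    estimates = []
    for _ in range(reps):
        idx = rng.sample(range(POP_SIZE), n)   # n distinct observation numbers
        estimates.append(slope([health[i] for i in idx],
                               [income[i] for i in idx]))
    return estimates

b_values = sampling_experiment()
print("population slope:", round(beta_pop, 5))
print("mean of 500 b's :", round(statistics.mean(b_values), 5))
print("std. dev. of b's:", round(statistics.stdev(b_values), 5))
```

A histogram of `b_values` would play the role of the figure on slide 25. The mean of the estimates should sit close to the population slope, while the individual estimates vary noticeably from sample to sample.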

  27. Distribution of slope estimator, b • Assumptions: • (Model) Regression: yi = α + βxi + εi • (Crucial) Exogenous data: data x and noise ε are independent; E[ε|x] = 0 or Cov(ε,x) = 0 • (Temporary) Var[ε|x] = σ², not a function of x (homoscedastic) • Results: What are the properties of b?

  28. (1) b is unbiased and linear in ε
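The formula on this slide did not survive in the transcript. As a reminder, the standard textbook argument (not necessarily the slide's exact notation) runs as follows, using the model and exogeneity assumptions from slide 27 and writing S_xx for the sum of squared deviations of x:

```latex
\begin{aligned}
b &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
   = \frac{\sum_i (x_i-\bar{x})\,(\alpha+\beta x_i+\varepsilon_i)}{S_{xx}}
   = \beta + \sum_i w_i\,\varepsilon_i,
   \qquad w_i = \frac{x_i-\bar{x}}{S_{xx}}, \\[4pt]
E[b \mid x] &= \beta + \sum_i w_i\,E[\varepsilon_i \mid x] = \beta, \\[4pt]
\operatorname{Var}[b \mid x] &= \sigma^2 \sum_i w_i^2 = \frac{\sigma^2}{S_{xx}}
   \quad\text{(using the homoscedasticity assumption and uncorrelated disturbances).}
\end{aligned}
```

The second equality uses the facts that the deviations (x_i − x̄) sum to zero and that their products with x_i sum to S_xx.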

  29. (2) b is efficient • Gauss – Markov Theorem: Like Rao Blackwell. (Proof in Greene) • Variance of b is smallest among linear unbiased estimators.

  30. (3) b is consistent

  31. Consistency: N=52 vs. N=520

  32. a is unbiased and consistent

  33. Covariance of a and b

  34. Inference about β • Have derived expected value and variance of b. • b is a ‘point’ estimator. • Looking for a way to form a confidence interval. • Need a distribution and a pivotal statistic to use.

  35. Normality

  36. Confidence Interval

  37. Estimating sigma squared

  38. Usable Confidence Interval • Use s instead of σ. • Use t distribution instead of normal. • Critical t depends on degrees of freedom. • b – t·sb < β < b + t·sb, where sb is the standard error of b computed with s.
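The formulas for slides 35 to 38 are missing from the transcript. The usual textbook versions, stated here as a reminder under the assumption of normally distributed disturbances (e_i denotes a residual and S_xx the sum of squared deviations of x), are:

```latex
\begin{aligned}
s^2 &= \frac{\sum_i e_i^2}{N-2}, \qquad
s_b = \frac{s}{\sqrt{S_{xx}}}, \qquad
S_{xx} = \sum_i (x_i-\bar{x})^2, \\[4pt]
\frac{b-\beta}{s_b} &\sim t[N-2], \\[4pt]
&b - t_{\alpha/2,\,N-2}\; s_b \;<\; \beta \;<\; b + t_{\alpha/2,\,N-2}\; s_b .
\end{aligned}
```

Replacing the unknown σ by s is what changes the reference distribution from the standard normal to t with N − 2 degrees of freedom.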

  39. Slope Estimator

  40. Regression Results
      -----------------------------------------------------------------------------
      Ordinary     least squares regression ............
      LHS=BOX      Mean                 =       20.72065
                   Standard deviation   =       17.49244
      ----------   No. of observations  =             62  DegFreedom   Mean square
      Regression   Sum of Squares       =        7913.58           1    7913.57745
      Residual     Sum of Squares       =        10751.5          60     179.19235
      Total        Sum of Squares       =        18665.1          61     305.98555
      ----------   Standard error of e  =       13.38627  Root MSE        13.16860
      Fit          R-squared            =         .42398  R-bar squared     .41438
      Model test   F[  1,    60]        =       44.16247  Prob F > F*       .00000
      --------+--------------------------------------------------------------------
              |                  Standard            Prob.      95% Confidence
           BOX|  Coefficient       Error       t    |t|>T*         Interval
      --------+--------------------------------------------------------------------
      Constant|   -14.3600**      5.54587    -2.59  .0121    -25.2297    -3.4903
      CNTWAIT3|    72.7181***    10.94249     6.65  .0000     51.2712    94.1650
      --------+--------------------------------------------------------------------
      Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
      -----------------------------------------------------------------------------

  41. Hypothesis Test about β • The region outside the confidence interval is the rejection region for hypothesis tests about β. • For the internet buzz regression, the confidence interval is 51.2712 to 94.1650. • The hypothesis that β equals zero is rejected.
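The test on this slide can be checked from the reported coefficient and standard error alone. In the sketch below, only `b` and `s_b` come from the regression output. Using a normal critical value is a choice made here because, on the arithmetic, it appears to match the printed interval. A t critical value with 60 degrees of freedom (about 2.0003) would give a slightly wider interval and the same conclusion.

```python
from statistics import NormalDist

# Values copied from the regression output for CNTWAIT3.
b = 72.7181        # estimated slope
s_b = 10.94249     # standard error of the slope

# t ratio for the hypothesis that beta equals zero.
t_ratio = b / s_b

# 95% interval using the standard normal critical value (about 1.96).
z = NormalDist().inv_cdf(0.975)
lower = b - z * s_b
upper = b + z * s_b

# The hypothesis beta = 0 is rejected when 0 lies outside the interval.
reject_zero = not (lower <= 0.0 <= upper)

print(f"t ratio = {t_ratio:.2f}")
print(f"95% interval = {lower:.4f} to {upper:.4f}")
print("reject beta = 0:", reject_zero)
```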

  42. Statistics and Data Analysis Part 7-3 – Prediction

  43. Predicting y Using the Regression • Actual y0 is α + βx0 + ε0 • Prediction is ŷ0 = a + bx0 + 0 • Error is y0 – ŷ0 = (α – a) + (β – b)x0 + ε0 • Variance of the error is Var[a] + x0² Var[b] + 2x0 Cov[a,b] + Var[ε0]

  44. Prediction Variance
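The formula for this slide is not in the transcript. Substituting the standard expressions for Var[a], Var[b], and Cov[a,b] into the variance of the prediction error gives the familiar textbook result below, offered as a reconstruction rather than the slide's exact notation (S_xx is the sum of squared deviations of x):

```latex
\begin{aligned}
\operatorname{Var}[a] &= \sigma^2\!\left(\frac{1}{N}+\frac{\bar{x}^2}{S_{xx}}\right), \qquad
\operatorname{Var}[b] = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Cov}[a,b] = -\frac{\bar{x}\,\sigma^2}{S_{xx}}, \\[4pt]
\operatorname{Var}[y^0-\hat{y}^0]
 &= \operatorname{Var}[a] + (x^0)^2\operatorname{Var}[b] + 2x^0\operatorname{Cov}[a,b] + \sigma^2 \\
 &= \sigma^2\!\left[1+\frac{1}{N}+\frac{(x^0-\bar{x})^2}{S_{xx}}\right].
\end{aligned}
```

In estimated form, with s² in place of σ² and s_b² = s²/S_xx, the forecast variance is s²(1 + 1/N) + (x⁰ − x̄)² s_b².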

  45. Quantum of Solace • Actual Box = $67.528882M • a = -14.36, b = 72.7181, N = 62, sb = 10.94249, s² = 13.3863² • buzz = 0.76, prediction = 40.906 • Mean buzz = 0.4824194 • Σ(buzz – mean)² = 1.49654 • s²forecast = 13.831425² • Confidence interval = 40.906 ± 2.0003(13.831425) = 13.239 to 68.573 (Note: The confidence interval contains the actual value.)
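The forecast standard error on this slide can be recomputed from the reported regression quantities. The sketch below uses the estimated-variance form s²(1 + 1/N) + (x0 − x̄)² sb². The critical value `t_crit` is an assumed t-table value for 60 degrees of freedom, and the variable names are chosen for this example.

```python
import math

# Quantities reported on the slides for the internet buzz regression.
a, b = -14.36, 72.7181     # intercept and slope
n = 62                     # number of observations
s = 13.38627               # standard error of e from the regression output
s_b = 10.94249             # standard error of the slope
x0 = 0.76                  # buzz value for the film being predicted
x_bar = 0.4824194          # mean buzz in the sample
actual = 67.528882         # actual box office, $ millions

# Assumed critical value: 97.5th percentile of t with 60 degrees of freedom.
t_crit = 2.0003

prediction = a + b * x0

# Forecast variance: s^2 (1 + 1/N) + (x0 - x_bar)^2 * s_b^2
forecast_var = s ** 2 * (1 + 1 / n) + (x0 - x_bar) ** 2 * s_b ** 2
s_forecast = math.sqrt(forecast_var)

lower = prediction - t_crit * s_forecast
upper = prediction + t_crit * s_forecast

print(f"prediction = {prediction:.3f}")
print(f"forecast standard error = {s_forecast:.6f}")
print(f"interval = {lower:.3f} to {upper:.3f}")
print("contains actual value:", lower <= actual <= upper)
```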

  46. Forecasting Out of Sample • Figure: Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004.
      Regression Analysis: G versus Income
      The regression equation is
      G = 1.93 + 0.000179 Income

      Predictor        Coef     SE Coef      T      P
      Constant       1.9280      0.1651  11.68  0.000
      Income     0.00017897  0.00000934  19.17  0.000

      S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%
      • How to predict G for 2012? You would need first to predict Income for 2012. How should we do that?

  47. The Extrapolation Penalty • The interval is narrowest at x* = x̄, the center of our experience. • The interval widens as we move away from the center of our experience to reflect the greater uncertainty: (1) uncertainty about the prediction of x; (2) uncertainty that the linear relationship will continue to exist as we move farther from the center.
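The widening of the interval can be illustrated numerically. The sketch below borrows s, N, mean buzz, and sb from the box office regression on slides 40 and 45 (not the gasoline data), and `forecast_se` is a hypothetical helper name. It captures only the first source of uncertainty that can be computed from the fitted model. Doubt about whether the linear relationship persists far from the data is not reflected in any formula.

```python
import math

def forecast_se(x0, s, n, x_bar, s_b):
    """Forecast standard error: sqrt(s^2 (1 + 1/N) + (x0 - x_bar)^2 * s_b^2)."""
    return math.sqrt(s ** 2 * (1 + 1 / n) + (x0 - x_bar) ** 2 * s_b ** 2)

# Values taken from the internet buzz regression reported earlier in the deck.
S, N, X_BAR, S_B = 13.38627, 62, 0.4824194, 10.94249

# Evaluate at the sample mean and at points progressively farther from it.
for x0 in (X_BAR, 0.76, 1.5, 3.0):
    se = forecast_se(x0, S, N, X_BAR, S_B)
    print(f"x0 = {x0:.4f}  distance from mean = {abs(x0 - X_BAR):.4f}  se = {se:.3f}")
```

The standard error is smallest at the sample mean and grows symmetrically with the distance from it, which is why out-of-sample forecasts carry wider intervals.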

  48. Normality • Necessary for t statistics and confidence intervals • Residuals reveal whether disturbances are normal? • Standard tests and devices
