Regression Hal Varian 10 April 2006
What is regression? • History • Curve fitting v statistics • Correlation and causation • Statistical models • Gauss-Markov theorem • Maximum likelihood • Conditional mean • What can go wrong… • Examples
Francis Galton, 1877 • Plotted first regression line: • Diameter of sweet peas v diameter of parents • Heights of fathers v heights of sons • Sons of unusually tall fathers tend to be tall, but shorter than their fathers. Galton called this “regression to mediocrity”. • But the same is true the other way around (fathers of unusually tall sons are, on average, shorter than their sons)! This is the regression-to-the-mean fallacy. • Pick the lowest-scoring 10% on the midterm and give them extra tutoring • If they do better on the final, what can you conclude? Did the tutoring help?
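A quick illustration of the fallacy (a hypothetical R simulation, not from the lecture): give the bottom 10% on the midterm no tutoring at all, and they still score higher on the final, simply because part of their low midterm score was bad luck.

# regression to the mean with no intervention whatsoever
set.seed(1)
ability <- rnorm(1000, mean = 70, sd = 10)     # each student's underlying ability
midterm <- ability + rnorm(1000, sd = 10)      # midterm = ability + luck
final   <- ability + rnorm(1000, sd = 10)      # final   = ability + independent luck
bottom  <- midterm <= quantile(midterm, 0.10)  # lowest-scoring 10% on the midterm
mean(midterm[bottom])                          # well below 70
mean(final[bottom])                            # noticeably higher, with no tutoring at all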
Regression analysis • Assume a linear relation between two variables and estimate unknown parameters • y_t = a + b x_t + e_t for t = 1, …, T • observed = fitted + error or residual • dependent variable ~ independent variables/predictors/correlates
Curve fitting v regression • Often choose (a,b) to minimize the sum of squared residuals (“least squares”) • Why not absolute value of residuals? • Why not fit xt = a + b yt? • How much can you trust the estimated values? • Need a statistical model to answer these questions! • Linear regression: linear in parameters • Nonlinear regression, local regression, general linear model, general additive model: same principles apply
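To see that regressing y on x and regressing x on y give genuinely different fitted lines, here is a minimal R sketch on simulated data (variable names are illustrative):

set.seed(2)
x <- 1:100
y <- x + 10 * rnorm(100)
coef(lm(y ~ x))["x"]        # slope from minimizing vertical residuals, near 1
1 / coef(lm(x ~ y))["y"]    # implied slope from minimizing horizontal residuals: not the same line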
Possible goals • Estimate parameters (a , b and error variance) • Test hypotheses (such as “x has no influence on y”) • Make predictions about y conditional on observing a new x-value • Summarize data (most common unstated goal!)
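For the prediction goal, a hedged sketch of how R produces a prediction (with an interval) at a new x-value, using the same kind of simulated data as above:

set.seed(2)
x   <- 1:100
y   <- x + 10 * rnorm(100)
reg <- lm(y ~ x)
predict(reg, newdata = data.frame(x = 150), interval = "prediction")  # fit, lwr, upr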
Summarizing relationships • Would like to be able to interpret regression as “causal” • “If x changes by Δx, then y will on average change by b·Δx.” • Correlation v causation • Compare the time on my wristwatch with the time on your wristwatch… • Even ideally, the best you can say is: • “When x changes by Δx in the sample, then on average y changes by b·Δx in the sample.”
Problem with causality • There may be a “third cause” • “my watch time” and “your watch time” both depend on NIST time • Economics example • income ~ b education + (unobserved IQ+other) • education ~ IQ • Higher income is associated with higher education in sample, but b is a biased estimate of partial effect of education on income • Need a controlled experiment or more elaborate estimation technique to resolve this “simultaneous equations bias”
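A stylized simulation of this example (all coefficients are made up): IQ raises both education and income, so the short regression of income on education alone overstates the effect of education.

set.seed(3)
iq        <- rnorm(1000)
education <- 0.8 * iq + rnorm(1000)                    # education depends on IQ
income    <- 1.0 * education + 2.0 * iq + rnorm(1000)  # true effect of education is 1.0
coef(lm(income ~ education))["education"]       # biased well above 1.0
coef(lm(income ~ education + iq))["education"]  # controlling for IQ recovers roughly 1.0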
Statistical regression model • y_t = a + b x_t + e_t for t = 1, …, T • Think of the random variable e_t as the sum of the other, omitted effects • What are attractive properties for the error term? • E[e_t] = 0 • Var(e_t) = constant (homoskedasticity) • E[e_t e_s] = 0 for t ≠ s (errors are uncorrelated across observations) • E[x_t e_t] = 0 (errors are uncorrelated with the explanatory variables – often problematic for the reasons on the previous slide! Exogenous v endogenous.) • Have to ask: how do the variables you don’t observe affect the variables you do observe?
Optimality properties • Gauss-Markov theorem: if the error term has these properties, then the least squares estimates of (a, b) are BLUE = “best linear unbiased estimates” = out of all unbiased estimates that are linear in y_t, the least squares estimates have minimum variance. • If the e_t are i.i.d. Normal, then the OLS estimates are also the maximum likelihood estimates
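The theorem itself needs a proof, but the unbiasedness part is easy to see in a Monte Carlo sketch (the true intercept of 2 and slope of 3 are invented for illustration):

set.seed(4)
b_hat <- replicate(2000, {
  x <- 1:50
  y <- 2 + 3 * x + rnorm(50, sd = 5)   # Normal iid errors
  coef(lm(y ~ x))["x"]
})
mean(b_hat)   # very close to the true slope of 3
sd(b_hat)     # sampling variability of the least squares slope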
Conditional means • In the regression model, note that the conditional expectation of y_t given x_t is a + b x_t. So the conditional mean is linear in x_t, which gives another interpretation of regression. • More generally, can think of the regression model as E[y_t | x_t] = f(x_t, b)
Regression output • Estimates of parameters • Standard errors of the estimates and of the error term • t-statistics = estimate/se, and p-values • R² = goodness-of-fit measure • Total SS = Fitted SS + Residual SS • R² = Fitted SS / Total SS
Example from R

> x <- 1:100
> y <- x + 10*rnorm(100)
> summary(lm(y~x))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.81944    1.86779   0.974    0.332
x            0.97354    0.03211  30.319   <2e-16 ***

Residual standard error: 9.269 on 98 degrees of freedom
Multiple R-Squared: 0.9037, Adjusted R-squared: 0.9027
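The R² reported above can be reproduced from the sum-of-squares decomposition on the previous slide; a sketch (the exact numbers depend on the random draw):

x   <- 1:100
y   <- x + 10 * rnorm(100)
reg <- lm(y ~ x)
tss <- sum((y - mean(y))^2)     # Total SS
rss <- sum(residuals(reg)^2)    # Residual SS
(tss - rss) / tss               # Fitted SS / Total SS
summary(reg)$r.squared          # same number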
What can go wrong? • Nonlinear relationship • Try a quadratic, an interaction term, logs, etc. • Var(e_t) is not constant • Heteroskedasticity – affects testing, not estimates • Take logs or use weighted least squares • Serial correlation – affects testing and prediction accuracy • Use time series methods • Multiple regression – collinearity • Socks ~ right shoes + left shoes + shoes + error
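The shoes example can be made concrete: when one regressor is an exact linear combination of the others, R cannot separate their effects and reports NA for the redundant one (a toy sketch; the counts are invented):

set.seed(5)
right <- rpois(100, 5)                   # right shoes owned
left  <- rpois(100, 5)                   # left shoes (drawn separately so only 'shoes' is redundant)
shoes <- right + left                    # exactly the sum of the other two regressors
socks <- right + left + rnorm(100)
coef(lm(socks ~ right + left + shoes))   # the 'shoes' coefficient comes back NA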
What can go wrong, cont. • Errors in variables • Underestimates the magnitude of the true effect (attenuation) • Omitted variable bias • Bias depends on the correlation of omitted with included variables • Simultaneous equations bias • The “third cause” alluded to earlier; need to estimate the full model or use a controlled experiment • Outliers • Non-normality of errors and influential observations – remove them or use robust estimation
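A small sketch of the errors-in-variables point (the true slope of 2 is made up): measurement error in x pulls the estimated slope toward zero.

set.seed(6)
x_true <- rnorm(500)
y      <- 2 * x_true + rnorm(500)   # true slope is 2
x_obs  <- x_true + rnorm(500)       # x observed with error
coef(lm(y ~ x_true))["x_true"]      # close to 2
coef(lm(y ~ x_obs))["x_obs"]        # attenuated toward 0 (about 1 with this noise level)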
Diagnostics • Look at residuals!! • R allows you to plot various regression diagnostics • reg <- lm(y~x) • plot(reg) • Examples to follow…
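A minimal sketch of what that produces: calling plot() on an lm object draws the standard diagnostic plots.

x   <- 1:100
y   <- x + 10 * rnorm(100)
reg <- lm(y ~ x)
par(mfrow = c(2, 2))   # arrange the four default plots in one window
plot(reg)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage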