Econometrics Session 3 – Linear Regression Amine Ouazad, Asst. Prof. of Economics
Outline of the course • Introduction: Identification • Introduction: Inference • Linear Regression • Identification Issues in Linear Regressions • Inference Issues in Linear Regressions
This session: Introduction to Linear Regression • What is the effect of X on Y? • Hands-on problems: • What is the effect of the death of the CEO (X) on firm performance (Y)? (Morten Bennedsen) • What is the effect of child safety seats (X) on the probability of death (Y)? (Steve Levitt)
This session: Linear Regression • Notations. • Assumptions. • The OLS estimator. • Implementation in Stata. • The OLS estimator is CAN (Consistent and Asymptotically Normal). • The OLS estimator is BLUE* (Best Linear Unbiased Estimator). • Essential statistics: t-stat, R squared, adjusted R squared, F stat, confidence intervals. • Tricky questions. • *Conditions apply
Session 3 – Linear Regression 1. Notations
Notations • The effect of X on Y. • What is X? • K covariates (including the constant) • N observations • X is an NxK matrix. • What is Y? • N observations. • Y is an N-vector.
Notations • Relationship between y and the x's: y = f(x1, x2, x3, x4, …, xK) + e • f: a function of K variables. • e: the unobservables (a scalar).
Session 3 – Linear Regression 2. Assumptions
Assumptions • A1: Linearity • A2: Full Rank • A3: Exogeneity of the covariates • A4: Homoskedasticity and non-autocorrelation • A5: Exogenously generated covariates • A6: Normality of the residuals
Assumption A1: Linearity • y = f(x1, x2, x3, …, xK) + e • y = x1b1 + x2b2 + … + xKbK + e • In 'plain English': • The effect of xk is constant. • The effect of xk does not depend on the value of xk'. • Not satisfied if: • squares/higher powers of x matter. • interaction terms matter.
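A minimal Stata sketch of this check, with hypothetical variable names (y, x1, x2): if squares or interactions enter significantly, A1 fails for the purely linear specification.

gen x1sq = x1^2        // square of x1 (hypothetical variable)
gen x1x2 = x1*x2       // interaction of x1 and x2
regress y x1 x2 x1sq x1x2
* Significant coefficients on x1sq or x1x2 mean the effect of x1 is not
* constant, so A1 is violated for the model that is linear in x1 alone.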
Notations • Data generating process: • Scalar notation: yi = xi1b1 + xi2b2 + … + xiKbK + ei • Matrix version #1 (one observation): yi = xi'b + ei • Matrix version #2 (all observations stacked): y = Xb + e
Assumption A2: Full Rank • We assume that X'X is invertible. • Notes: • A2 may be satisfied in the data generating process but not in the observed sample. • Examples: month-of-the-year dummies/year dummies, country dummies, gender dummies (including every category plus the constant makes X'X singular: the dummy variable trap).
Assumption A3: Exogeneity • i.e. mean independence of the residual and the covariates: • E(e|x1,…,xK) = 0. • This is a property of the data generating process. • Link with selection bias in Session 1?
Dealing with Endogeneity • You're assuming that there is no unobservable correlated with the Xs that has an effect on Y. • If it is only correlated with X but has no effect on Y, it's OK. • If it has an effect on Y but is not correlated with X, it's also OK: it simply ends up in the residual. • Example of a problem: • Health and hospital stays. • What covariate should you add? • Conclusion: Be creative! Think about unobservables!
Assumption A4: Homoskedasticity and Non-Autocorrelation • Var(e|x1,…,xK) = σ2. • Corr(ei, ej|X) = 0 for i ≠ j. • Visible on a scatter plot? • Link with the t-tests of Session 2? • Examples: correlated/random effects.
Assumption A5: Exogenously Generated Covariates • Instead of requiring the mean independence of the residual and the covariates, we might require their independence. • (Recall: X and e are independent if f(X,e) = f(X)f(e).) • Sometimes we will think of X as fixed rather than exogenously generated.
Assumption A6: Normality of the Residuals • The asymptotic properties of OLS (to be discussed below) do not depend on the normality of the residuals: a semi-parametric approach. • But for results with a fixed number of observations, we need the normality of the residuals for the OLS estimator to have nice properties (to be defined below).
Session 3 – Linear Regression 3. The Ordinary Least Squares estimator
The OLS Estimator • Formula: b = (X'X)-1X'y. • Two interpretations: • Minimization of the sum of squared residuals (Gauss's interpretation). • The coefficient b that makes the observed X and the residuals orthogonal in the sample (the sample analogue of the mean independence in A3).
OLS estimator • Exercise: Find the OLS estimator in the case where both y and x are scalars (i.e. not vectors). Learn the formula by heart (if correct!).
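A worked solution to the exercise, for the standard bivariate case with a constant (this is the textbook result, not taken from the slides):

\min_{a,b}\ \sum_{i=1}^{N}(y_i - a - b x_i)^2
\;\Longrightarrow\;
\hat{b} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}
        = \frac{\widehat{\mathrm{Cov}}(x,y)}{\widehat{\mathrm{Var}}(x)},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}.

The slope is the sample covariance of x and y divided by the sample variance of x: the scalar analogue of (X'X)-1X'y.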
Implementation in Stata • Stata's regress command: • regress y x1 x2 x3 x4 x5 … • What does Stata do? • It drops variables that are perfectly correlated (to make sure A2 is satisfied). Always check the number of observations! • Options will be seen in the following sessions. • Dummies (e.g. for years) can be included using "xi: … i.year". Again, A2 must be satisfied.
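A minimal sketch of the syntax described above; all variable names are hypothetical.

regress y x1 x2 x3
* Year dummies via the xi prefix; Stata drops one category (and any
* perfectly collinear variable) so that A2 holds.
xi: regress y x1 x2 x3 i.year
* Always check the number of observations reported in the header.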
First things first: Desc. Stats • Each variable used in the analysis: Mean, standard deviation for the sample and the subsamples. • Other possible outputs: min max, median (only if you care). • Source of the dataset. • Why?? • Show the reader the variables are “well behaved”: no outlier driving the regression, consistent with intuition. • Number of observations should be constant across regressions (next slide).
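A sketch of this step in Stata, with hypothetical variable names (treated stands for whatever dummy defines the subsamples):

summarize y x1 x2                                   // mean, sd, min, max, N
tabstat y x1 x2, statistics(mean sd min max p50 n)  // adds the median
bysort treated: summarize y x1 x2                   // same, by subsample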
Other important advice • As a best practice, always start by regressing y on x with no controls except the most essential ones. • No effect? Then maybe you should think twice about going further. • Then add controls one by one, or group by group. • Explain why the coefficient of interest changes from one column to the next. (See next session.)
Stata tricks • Output the estimation results using estout or outreg. • Displays stars for coefficients' significance. • Outputs the essential statistics (F, R2, t-test). • Stacks the columns of regression output for regressions with different sets of covariates. • Formats: LaTeX and text (Microsoft Word).
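A sketch combining the previous slide's advice (start bare, then add control groups) with estout; eststo and esttab ship with the user-written estout package, and all variable names are hypothetical.

* ssc install estout                   // one-off installation of the package
eststo clear
eststo: regress y x                    // column 1: no controls
eststo: regress y x z1 z2              // column 2: first group of controls
eststo: regress y x z1 z2 z3           // column 3: full set
esttab, star(* 0.10 ** 0.05 *** 0.01) r2 ar2   // stacked columns, stars, R2
esttab using results.tex, replace      // LaTeX output (or .txt for Word)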
Session 3 – Linear Regression 4. Large sample properties of the OLS estimator
The OLS estimator is CAN • CAN: • Consistent • Asymptotically Normal • Proof: • Use the 'true' relationship between y and X to show that b = β + (X'X/N)-1(X'e/N). • Use the Slutsky theorem and A3 to show consistency. • Use the CLT and A3 to show asymptotic normality. • Asymptotic variance: V = σ2 [plim (X'X/N)]-1.
OLS is CAN: numerical simulation • Typical design of a study: • Recruit X% of a population (for instance a random sample of students at INSEAD). • Collect the data. • Perform the regression and get the OLS estimator. • If you perform these steps independently a large number of times (a thought experiment), then you will get a normal distribution of parameter estimates.
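The thought experiment can be run as a small Monte Carlo. A minimal sketch; the design values (N = 200, true coefficient 2, 1,000 replications) are purely illustrative.

clear all
program define olssim, rclass
    drop _all
    set obs 200                     // illustrative sample size
    gen x = rnormal()
    gen y = 1 + 2*x + rnormal()     // true coefficient on x is 2
    regress y x
    return scalar b = _b[x]
end
simulate b = r(b), reps(1000) nodots: olssim
histogram b, normal                 // the estimates pile up normally around 2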
Important assumptions • A1, A2, A3 are needed to solve the identification problem: • with them, the estimator is consistent. • A4 is needed for inference: • A4 determines the variance-covariance matrix of the estimator. • Violations of A3? Next session (identification issues). • Violations of A4? Session on inference issues.
Session 3 – Linear Regression 5. Finite sample properties of the OLS estimator
The OLS Estimator is BLUE • BLUE: • Best … i.e. has minimum variance • Linear … i.e. is a linear function of y (for given X) • Unbiased … i.e. E(b|X) = β • Estimator … i.e. it is just a function of the observations • Proof (a.k.a. the Gauss-Markov theorem): next slide.
OLS is BLUE • Steps of the proof: • OLS is LUE because of A1 and A3. • OLS is Best… • Any other LUE can be written b0 = Cy, with CX = Id for unbiasedness. • Take the difference: Dy = Cy - b, i.e. D = C - (X'X)-1X' (b is the OLS estimator). • Show that Var(b0|X) = Var(b|X) + σ2DD'. • The result follows because σ2DD' is positive semidefinite.
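The variance step of the proof, written out (standard derivation, in the slide's notation):

D = C - (X'X)^{-1}X', \qquad CX = I \;\Rightarrow\; DX = 0,
\qquad
\mathrm{Var}(b_0 \mid X) = \sigma^2 CC'
  = \sigma^2\!\left[(X'X)^{-1} + DD'\right]
  = \mathrm{Var}(b \mid X) + \sigma^2 DD'.

Since σ2DD' is positive semidefinite, every other linear unbiased estimator has a weakly larger variance than OLS.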
Finite sample distribution • The OLS estimator is normally distributed for fixed N, as long as one assumes the normality of the residuals (A6). • What is a "large" N? • Small: e.g. Acemoglu, Johnson and Robinson. • Large: e.g. Bennedsen and Perez-Gonzalez. • Statistical question: the rate of convergence in the law of large numbers and the central limit theorem.
Other examples • Large N: • Compustat (thousands of observations) • Execucomp • Scanner data • Small N: • Cross-country regressions (< 100 points)
Session 3 – Linear Regression 6. Statistics for reading the output of OLS estimation
Statistics • R squared • What share of the variance of the outcome variable is explained by the covariates? • t-test • Is the coefficient on the variable of interest significant? • Confidence intervals • What interval includes the true coefficient with probability 95%? • F statistic • Is the model better than random noise?
R Squared • Measures the share of the variance of Y (the dependent variable) explained by the model Xb, hence R2 = var(Xb)/var(Y). • Note that if you regress Y on itself, the R2 is 100%. The R2 is not a good indicator of the quality of a model.
Tricky Question • Should I choose the model with the highest R squared? • Adding a variable mechanically raises the R squared. • A model with endogenous variables (thus neither interpretable nor causal) can have a high R squared.
Adjusted R-Square • Corrects for the number of variables in the regression. • Proposition: when adding a variable to a regression model, the adjusted R-square increases if and only if the square of the t-statistic on the new variable is greater than 1. • The threshold of 1 is arbitrary (why 1?), but the adjusted R-square is still informative.
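For reference, the standard formula, with K counting the covariates including the constant:

\bar{R}^2 = 1 - \frac{N-1}{N-K}\,(1 - R^2).

Adding a covariate mechanically raises R2 but also raises K; the proposition above describes exactly when the first effect wins.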
t-test and p-value • p-value: significance level for the coefficient. • Significance at the 5% level (95% confidence): p-value lower than 0.05. • The typical critical value for t is 1.96 (when N is large, t is approximately normal). • Significance at level α: p-value lower than α. • Important significance levels: 10%, 5%, 1%, depending on the size of the dataset. • The t-test is valid asymptotically under A1, A2, A3, A4. • The t-test is valid at finite distance with A6. • Small-sample t-tests: see Wooldridge's NBER lectures, "Recent Advances in Econometrics."
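For reference, the statistic behind these rules (standard formula, testing H0: βk = 0):

t_k = \frac{\hat{\beta}_k}{\widehat{\mathrm{se}}(\hat{\beta}_k)},
\qquad
\text{reject } H_0 \text{ at the 5\% level if } |t_k| > 1.96 \text{ (large } N\text{)}.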
F Statistic • Is the model as a whole significant? • Hypothesis H0: all coefficients are equal to zero, except the constant. • Alternative hypothesis: at least one coefficient is nonzero. • Under the null hypothesis, in distribution (exactly under A6):
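The textbook R-squared form of the statistic, for a regression with a constant (K counts the covariates including the constant):

F = \frac{R^2/(K-1)}{(1-R^2)/(N-K)} \;\sim\; F(K-1,\; N-K) \quad \text{under } H_0.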
Session 3 – Linear Regression 7. Tricky Questions
Tricky Questions • Can I drop a non-significant variable? • What if two variables are very strongly correlated (but not perfectly correlated)? • How do I deal (simply) with missing/miscoded data? • How do I identify influential observations?
Tricky Questions • Can I drop a non-significant variable? • A variable may be non-significant but still be significantly correlated with the other covariates… • Dropping the non-significant covariate may unduly increase the significance of the coefficient of interest (recently seen in an OECD working paper). • Conclusion: controls stay.
Tricky Questions • What if two variables are very strongly correlated (but not perfectly)? • One coefficient tends to be very significant and positive… • while the coefficient of the other variable is very significant and negative! • Beware of multicollinearity.
Tricky Questions • How do I deal (simply) with missing data? • Create dummies for missing covariates instead of dropping those observations from the regression. • If the dependent variable is missing, focus on the subset of non-missing dependents. • Argue in the paper that it is missing at random (if possible). • For more advanced material, see the session on the Heckman selection model.
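A minimal sketch of the missing-covariate-dummy trick (x1, x2 and the fill-in value are hypothetical):

gen x1_miss = missing(x1)                  // 1 if x1 is missing
gen x1_fill = cond(missing(x1), 0, x1)     // arbitrary fill-in value
regress y x1_fill x1_miss x2
* The coefficient on x1_fill is identified from non-missing observations;
* x1_miss absorbs the level difference of the missing group.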
How do I identify influential points? • Run the regression on the dataset excluding the point in question. • Identify influential observations by making a scatter plot of the dependent variable against the prediction Xb.
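Both diagnostics in Stata, with hypothetical names (id 17 stands for the suspect observation); Cook's distance is a standard regress post-estimation statistic, added here as a common complement to the leave-one-out check:

regress y x1 x2
predict yhat, xb            // the prediction Xb
scatter y yhat              // influential points sit far from the cloud
predict d, cooksd           // Cook's distance after regress
list if d > 4/_N            // a common rule-of-thumb cutoff
* Leave-one-out check: re-run excluding the suspect observation.
regress y x1 x2 if id != 17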