Econometrics Session 3 – Linear Regression Amine Ouazad, Asst. Prof. of Economics
Outline of the course • Introduction: Identification • Introduction: Inference • Linear Regression • Identification Issues in Linear Regressions • Inference Issues in Linear Regressions
This session: Introduction to Linear Regression • What is the effect of X on Y? • Hands-on problems: • What is the effect of the death of the CEO (X) on firm performance (Y)? (Morten Bennedsen) • What is the effect of child safety seats (X) on the probability of death (Y)? (Steve Levitt)
This session: Linear Regression • Notations. • Assumptions. • The OLS estimator. • Implementation in Stata. • The OLS estimator is CAN (Consistent and Asymptotically Normal). • The OLS estimator is BLUE* (Best Linear Unbiased Estimator). • Essential statistics: t-stat, R squared, adjusted R squared, F stat, confidence intervals. • Tricky questions. • *Conditions apply
Session 3 – Linear Regression 1. Notations
Notations • The effect of X on Y. • What is X? • K covariates (including the constant) • N observations • X is an NxK matrix. • What is Y? • N observations. • Y is an N-vector.
Notations • Relationship between y and the x's: y = f(x1, x2, x3, x4, …, xK) + e • f: a function of K variables. • e: the unobservables (a scalar).
Session 3 – Linear Regression 2. Assumptions
Assumptions • A1: Linearity • A2: Full Rank • A3: Exogeneity of the covariates • A4: Homoskedasticity and non-autocorrelation • A5: Exogenously generated covariates • A6: Normality of the residuals
Assumption A1: Linearity • y = f(x1, x2, x3, …, xK) + e • y = x1b1 + x2b2 + … + xKbK + e • In 'plain English': • The effect of xk is constant. • The effect of xk does not depend on the value of xk'. • Not satisfied if: • squares/higher powers of x matter. • interaction terms matter.
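A minimal Stata sketch of this check, with hypothetical variable names (y, x1, x2): if squares or interactions enter significantly, A1 fails for the purely linear specification.

gen x1sq = x1^2        // square of x1 (hypothetical variable)
gen x1x2 = x1*x2       // interaction of x1 and x2
regress y x1 x2 x1sq x1x2
* Significant coefficients on x1sq or x1x2 mean the effect of x1 is not
* constant, so A1 is violated for the model that is linear in x1 alone.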
Notations • Data generating process: • Scalar notation: yi = xi1b1 + xi2b2 + … + xiKbK + ei • Matrix version #1 (one observation): yi = xi'b + ei • Matrix version #2 (all observations stacked): y = Xb + e
Assumption A2: Full Rank • We assume that X'X is invertible. • Notes: • A2 may be satisfied in the data generating process but not in the observed sample. • Examples: month-of-the-year dummies/year dummies, country dummies, gender dummies (including every category plus the constant makes X'X singular: the dummy variable trap).
Assumption A3: Exogeneity • i.e. mean independence of the residual and the covariates: • E(e|x1,…,xK) = 0. • This is a property of the data generating process. • Link with selection bias in Session 1?
Dealing with Endogeneity • You're assuming that there is no unobservable correlated with the Xs that has an effect on Y. • If it is only correlated with X but has no effect on Y, it's OK. • If it has an effect on Y but is not correlated with X, it's also OK: it simply ends up in the residual. • Example of a problem: • Health and hospital stays. • What covariate should you add? • Conclusion: Be creative! Think about unobservables!
Assumption A4: Homoskedasticity and Non-Autocorrelation • Var(e|x1,…,xK) = σ2. • Corr(ei, ej|X) = 0 for i ≠ j. • Visible on a scatter plot? • Link with the t-tests of Session 2? • Examples: correlated/random effects.
Assumption A5: Exogenously Generated Covariates • Instead of requiring the mean independence of the residual and the covariates, we might require their independence. • (Recall: X and e are independent if f(X,e) = f(X)f(e).) • Sometimes we will think of X as fixed rather than exogenously generated.
Assumption A6: Normality of the Residuals • The asymptotic properties of OLS (to be discussed below) do not depend on the normality of the residuals: a semi-parametric approach. • But for results with a fixed number of observations, we need the normality of the residuals for the OLS estimator to have nice properties (to be defined below).
Session 3 – Linear Regression 3. The Ordinary Least Squares estimator
The OLS Estimator • Formula: b = (X'X)-1X'y. • Two interpretations: • Minimization of the sum of squared residuals (Gauss's interpretation). • The coefficient b that makes the observed X and the residuals orthogonal in the sample (the sample analogue of the mean independence in A3).
OLS estimator • Exercise: Find the OLS estimator in the case where both y and x are scalars (i.e. not vectors). Learn the formula by heart (if correct!).
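A worked solution to the exercise, for the standard bivariate case with a constant (this is the textbook result, not taken from the slides):

\min_{a,b}\ \sum_{i=1}^{N}(y_i - a - b x_i)^2
\;\Longrightarrow\;
\hat{b} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}
        = \frac{\widehat{\mathrm{Cov}}(x,y)}{\widehat{\mathrm{Var}}(x)},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}.

The slope is the sample covariance of x and y divided by the sample variance of x: the scalar analogue of (X'X)-1X'y.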
Implementation in Stata • Stata's regress command: • regress y x1 x2 x3 x4 x5 … • What does Stata do? • It drops variables that are perfectly correlated (to make sure A2 is satisfied). Always check the number of observations! • Options will be seen in the following sessions. • Dummies (e.g. for years) can be included using "xi: … i.year". Again, A2 must be satisfied.
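A minimal sketch of the syntax described above; all variable names are hypothetical.

regress y x1 x2 x3
* Year dummies via the xi prefix; Stata drops one category (and any
* perfectly collinear variable) so that A2 holds.
xi: regress y x1 x2 x3 i.year
* Always check the number of observations reported in the header.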
First things first: Desc. Stats • Each variable used in the analysis: Mean, standard deviation for the sample and the subsamples. • Other possible outputs: min max, median (only if you care). • Source of the dataset. • Why?? • Show the reader the variables are “well behaved”: no outlier driving the regression, consistent with intuition. • Number of observations should be constant across regressions (next slide).
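A sketch of this step in Stata, with hypothetical variable names (treated stands for whatever dummy defines the subsamples):

summarize y x1 x2                                   // mean, sd, min, max, N
tabstat y x1 x2, statistics(mean sd min max p50 n)  // adds the median
bysort treated: summarize y x1 x2                   // same, by subsample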
Other important advice • As a best practice, always start by regressing y on x with no controls except the most essential ones. • No effect? Then maybe you should think twice about going further. • Then add controls one by one, or group by group. • Explain why the coefficient of interest changes from one column to the next. (See next session.)
Stata tricks • Output the estimation results using estout or outreg. • Displays stars for coefficients' significance. • Outputs the essential statistics (F, R2, t-test). • Stacks the columns of regression output for regressions with different sets of covariates. • Formats: LaTeX and text (Microsoft Word).
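A sketch combining the previous slide's advice (start bare, then add control groups) with estout; eststo and esttab ship with the user-written estout package, and all variable names are hypothetical.

* ssc install estout                   // one-off installation of the package
eststo clear
eststo: regress y x                    // column 1: no controls
eststo: regress y x z1 z2              // column 2: first group of controls
eststo: regress y x z1 z2 z3           // column 3: full set
esttab, star(* 0.10 ** 0.05 *** 0.01) r2 ar2   // stacked columns, stars, R2
esttab using results.tex, replace      // LaTeX output (or .txt for Word)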
Session 3 – Linear Regression 4. Large sample properties of the OLS estimator
The OLS estimator is CAN • CAN: • Consistent • Asymptotically Normal • Proof: • Use the 'true' relationship between y and X to show that b = β + (X'X/N)-1(X'e/N). • Use the Slutsky theorem and A3 to show consistency. • Use the CLT and A3 to show asymptotic normality. • Asymptotic variance: V = σ2 [plim (X'X/N)]-1.
OLS is CAN: numerical simulation • Typical design of a study: • Recruit X% of a population (for instance a random sample of students at INSEAD). • Collect the data. • Perform the regression and get the OLS estimator. • If you perform these steps independently a large number of times (a thought experiment), then you will get a normal distribution of parameter estimates.
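The thought experiment can be run as a small Monte Carlo. A minimal sketch; the design values (N = 200, true coefficient 2, 1,000 replications) are purely illustrative.

clear all
program define olssim, rclass
    drop _all
    set obs 200                     // illustrative sample size
    gen x = rnormal()
    gen y = 1 + 2*x + rnormal()     // true coefficient on x is 2
    regress y x
    return scalar b = _b[x]
end
simulate b = r(b), reps(1000) nodots: olssim
histogram b, normal                 // the estimates pile up normally around 2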
Important assumptions • A1, A2, A3 are needed to solve the identification problem: • with them, the estimator is consistent. • A4 is needed for inference: • A4 determines the variance-covariance matrix of the estimator. • Violations of A3? Next session (identification issues). • Violations of A4? Session on inference issues.
Session 3 – Linear Regression 5. Finite sample properties of the OLS estimator
The OLS Estimator is BLUE • BLUE: • Best … i.e. has minimum variance • Linear … i.e. is a linear function of y (for given X) • Unbiased … i.e. E(b|X) = β • Estimator … i.e. it is just a function of the observations • Proof (a.k.a. the Gauss-Markov theorem): next slide.
OLS is BLUE • Steps of the proof: • OLS is LUE because of A1 and A3. • OLS is Best… • Any other LUE can be written b0 = Cy, with CX = Id for unbiasedness. • Take the difference: Dy = Cy - b, i.e. D = C - (X'X)-1X' (b is the OLS estimator). • Show that Var(b0|X) = Var(b|X) + σ2DD'. • The result follows because σ2DD' is positive semidefinite.
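The variance step of the proof, written out (standard derivation, in the slide's notation):

D = C - (X'X)^{-1}X', \qquad CX = I \;\Rightarrow\; DX = 0,
\qquad
\mathrm{Var}(b_0 \mid X) = \sigma^2 CC'
  = \sigma^2\!\left[(X'X)^{-1} + DD'\right]
  = \mathrm{Var}(b \mid X) + \sigma^2 DD'.

Since σ2DD' is positive semidefinite, every other linear unbiased estimator has a weakly larger variance than OLS.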
Finite sample distribution • The OLS estimator is normally distributed for fixed N, as long as one assumes the normality of the residuals (A6). • What is a "large" N? • Small: e.g. Acemoglu, Johnson and Robinson. • Large: e.g. Bennedsen and Perez-Gonzalez. • Statistical question: the rate of convergence in the law of large numbers and the central limit theorem.
Other examples • Large N: • Compustat (thousands of observations) • Execucomp • Scanner data • Small N: • Cross-country regressions (< 100 points)
Session 3 – Linear Regression 6. Statistics for reading the output of OLS estimation
Statistics • R squared • What share of the variance of the outcome variable is explained by the covariates? • t-test • Is the coefficient on the variable of interest significant? • Confidence intervals • What interval includes the true coefficient with probability 95%? • F statistic • Is the model better than random noise?
R Squared • Measures the share of the variance of Y (the dependent variable) explained by the model Xb, hence R2 = var(Xb)/var(Y). • Note that if you regress Y on itself, the R2 is 100%. The R2 is not a good indicator of the quality of a model.
Tricky Question • Should I choose the model with the highest R squared? • Adding a variable mechanically raises the R squared. • A model with endogenous variables (thus neither interpretable nor causal) can have a high R squared.
Adjusted R-Square • Corrects for the number of variables in the regression. • Proposition: when adding a variable to a regression model, the adjusted R-square increases if and only if the square of the t-statistic on the new variable is greater than 1. • The threshold of 1 is arbitrary (why 1?), but the adjusted R-square is still informative.
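For reference, the standard formula, with K counting the covariates including the constant:

\bar{R}^2 = 1 - \frac{N-1}{N-K}\,(1 - R^2).

Adding a covariate mechanically raises R2 but also raises K; the proposition above describes exactly when the first effect wins.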
t-test and p-value • p-value: significance level for the coefficient. • Significance at the 5% level (95% confidence): p-value lower than 0.05. • The typical critical value for t is 1.96 (when N is large, t is approximately normal). • Significance at level α: p-value lower than α. • Important significance levels: 10%, 5%, 1%, depending on the size of the dataset. • The t-test is valid asymptotically under A1, A2, A3, A4. • The t-test is valid at finite distance with A6. • Small-sample t-tests: see Wooldridge's NBER lectures, "Recent Advances in Econometrics."
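For reference, the statistic behind these rules (standard formula, testing H0: βk = 0):

t_k = \frac{\hat{\beta}_k}{\widehat{\mathrm{se}}(\hat{\beta}_k)},
\qquad
\text{reject } H_0 \text{ at the 5\% level if } |t_k| > 1.96 \text{ (large } N\text{)}.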
F Statistic • Is the model as a whole significant? • Hypothesis H0: all coefficients are equal to zero, except the constant. • Alternative hypothesis: at least one coefficient is nonzero. • Under the null hypothesis, in distribution (exactly under A6):
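The textbook R-squared form of the statistic, for a regression with a constant (K counts the covariates including the constant):

F = \frac{R^2/(K-1)}{(1-R^2)/(N-K)} \;\sim\; F(K-1,\; N-K) \quad \text{under } H_0.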
Session 3 – Linear Regression 7. Tricky Questions
Tricky Questions • Can I drop a non-significant variable? • What if two variables are very strongly correlated (but not perfectly correlated)? • How do I deal (simply) with missing/miscoded data? • How do I identify influential observations?
Tricky Questions • Can I drop a non-significant variable? • A variable may be non-significant but still be significantly correlated with the other covariates… • Dropping the non-significant covariate may unduly increase the significance of the coefficient of interest (recently seen in an OECD working paper). • Conclusion: controls stay.
Tricky Questions • What if two variables are very strongly correlated (but not perfectly)? • One coefficient tends to be very significant and positive… • while the coefficient of the other variable is very significant and negative! • Beware of multicollinearity.
Tricky Questions • How do I deal (simply) with missing data? • Create dummies for missing covariates instead of dropping those observations from the regression. • If the dependent variable is missing, focus on the subset of non-missing dependents. • Argue in the paper that it is missing at random (if possible). • For more advanced material, see the session on the Heckman selection model.
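A minimal sketch of the missing-covariate-dummy trick (x1, x2 and the fill-in value are hypothetical):

gen x1_miss = missing(x1)                  // 1 if x1 is missing
gen x1_fill = cond(missing(x1), 0, x1)     // arbitrary fill-in value
regress y x1_fill x1_miss x2
* The coefficient on x1_fill is identified from non-missing observations;
* x1_miss absorbs the level difference of the missing group.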
How do I identify influential points? • Run the regression on the dataset excluding the point in question. • Identify influential observations by making a scatter plot of the dependent variable against the prediction Xb.
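Both diagnostics in Stata, with hypothetical names (id 17 stands for the suspect observation); Cook's distance is a standard regress post-estimation statistic, added here as a common complement to the leave-one-out check:

regress y x1 x2
predict yhat, xb            // the prediction Xb
scatter y yhat              // influential points sit far from the cloud
predict d, cooksd           // Cook's distance after regress
list if d > 4/_N            // a common rule-of-thumb cutoff
* Leave-one-out check: re-run excluding the suspect observation.
regress y x1 x2 if id != 17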