Research Method. Lecture 11-2 (Ch15) Instrumental Variables Estimation and Two Stage Least Squares
What would happen if you use the IV method when the suspected endogenous variable is in fact exogenous?
• Consider the following model: Y=β0+β1x+u
• If x is exogenous, you do not need the IV method: the OLS estimators are consistent.
• Suppose that you have an instrument for x, called z, which satisfies the instrument conditions (instrument exogeneity and instrument relevance, described in handout 11-1). Then the IV estimators are also consistent.
• Which one is better, then, OLS or IV?
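The answer is OLS. If x is exogenous, the IV estimator has a larger variance than the OLS estimator, so IV is less precise (you tend to get smaller t-statistics in absolute value).
• To see this, compare the estimated asymptotic variances under homoskedasticity:
$$\widehat{\mathrm{Var}}(\hat\beta_{1,OLS}) = \frac{\hat\sigma^2}{SST_x}, \qquad \widehat{\mathrm{Var}}(\hat\beta_{1,IV}) = \frac{\hat\sigma^2}{SST_x \cdot R^2_{x,z}}$$
where $\hat\sigma^2$ estimates the variance of u, $SST_x$ is the total variation in x, and $R^2_{x,z}$ is the R-squared from regressing x on z.
• Since $R^2_{x,z}$ is always between 0 and 1 (and equals 1 only in special cases such as x=z), the asymptotic variance of the IV estimator is always larger than that of OLS unless $R^2_{x,z}=1$.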
Thus, controlling for endogeneity (i.e., using the IV method) when x is actually exogenous is costly in terms of precision.
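The precision cost can be illustrated with a small Monte Carlo experiment. The sketch below is not part of the lecture: the data-generating process, the sample size, and the function name `simulate_slopes` are all assumptions chosen for illustration. Here x is exogenous and z is a valid instrument, so both estimators should center on the true slope, while the IV estimates should be more spread out.

```python
import numpy as np

def simulate_slopes(n_reps=500, n=500, beta1=1.0, pi=0.5, seed=0):
    """Return OLS and IV slope estimates from repeated samples (x exogenous)."""
    rng = np.random.default_rng(seed)
    ols = np.empty(n_reps)
    iv = np.empty(n_reps)
    for r in range(n_reps):
        z = rng.normal(size=n)
        x = pi * z + rng.normal(size=n)   # x depends on z: instrument relevance
        u = rng.normal(size=n)            # u independent of x and z: x is exogenous
        y = 2.0 + beta1 * x + u
        # OLS slope: sample Cov(x, y) / sample Var(x)
        ols[r] = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        # IV slope: sample Cov(z, y) / sample Cov(z, x)
        iv[r] = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
    return ols, iv

ols, iv = simulate_slopes()
print("OLS: mean", ols.mean(), "sd", ols.std())
print("IV : mean", iv.mean(), "sd", iv.std())
```

In this assumed design the population R-squared from regressing x on z is 0.2, so the IV standard deviation should come out at roughly 1/sqrt(0.2), a bit more than twice the OLS one.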
Poor instruments: What would happen if the instrumental variable does not satisfy the instrument conditions?
• Consider the following model: Y=β0+β1x+u
• This time, suppose that x is endogenous. But further suppose that your instrumental variable z does not satisfy the instrument conditions (i.e., you have a poor instrument).
• Then what would happen?
The answer to this question is the following:
1. The IV estimators are inconsistent.
2. The biases in the IV and OLS estimators can go in opposite directions.
3. The bias in IV can be worse than the bias in OLS.
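• To understand 1, notice that (Proof: See the front board)
$$\operatorname{plim}\hat\beta_{1,IV} = \beta_1 + \frac{\mathrm{Corr}(z,u)}{\mathrm{Corr}(z,x)}\cdot\frac{\sigma_u}{\sigma_x}$$
If instrument exogeneity is not satisfied, the second term is not zero, so the IV estimator is inconsistent.
$$\operatorname{plim}\hat\beta_{1,OLS} = \beta_1 + \mathrm{Corr}(x,u)\cdot\frac{\sigma_u}{\sigma_x}$$
If x is endogenous, the second term is not zero, so the OLS estimator is inconsistent.
• So, both IV and OLS are inconsistent.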
To understand 2, first suppose that Corr(x,u) is positive. Then OLS has a positive bias.
• But it can happen that Corr(z,u)/Corr(z,x) is negative. In such a case, the IV estimator has a negative bias.
• This means that, when you have an invalid instrument, you may get very unexpected results.
To understand 3, consider the following scenario. (i) Instrument exogeneity is almost but not perfectly satisfied; that is, Corr(z,u) is close to 0 but not exactly 0. (ii) The instrument is not very relevant; i.e., Corr(z,x) is very close to 0.
• Then, even though instrument exogeneity is almost satisfied, the bias is magnified, because the small Corr(z,u) is divided by the even smaller Corr(z,x).
It is possible that the bias is magnified so much that the bias in the IV estimator is worse than the bias in the OLS estimator.
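A short numerical illustration may help here. It assumes the standard large-sample expressions for the simple regression model, in which the OLS bias is Corr(x,u)·σu/σx and the IV bias is [Corr(z,u)/Corr(z,x)]·σu/σx. The function names and the correlation values below are hypothetical, chosen only to show the magnification and the possible sign flip.

```python
def asymptotic_bias_ols(corr_xu, sd_u=1.0, sd_x=1.0):
    """Large-sample bias of the OLS slope: Corr(x,u) * sd_u / sd_x."""
    return corr_xu * sd_u / sd_x

def asymptotic_bias_iv(corr_zu, corr_zx, sd_u=1.0, sd_x=1.0):
    """Large-sample bias of the IV slope: [Corr(z,u) / Corr(z,x)] * sd_u / sd_x."""
    return (corr_zu / corr_zx) * sd_u / sd_x

# Moderately endogenous x, and an instrument that is "almost" exogenous but weak.
bias_ols = asymptotic_bias_ols(corr_xu=0.3)
bias_iv = asymptotic_bias_iv(corr_zu=0.02, corr_zx=0.05)

# Same weak instrument, but Corr(z,u) has the opposite sign to Corr(x,u).
bias_iv_flipped = asymptotic_bias_iv(corr_zu=-0.02, corr_zx=0.05)

print("OLS bias:", bias_ols)
print("IV bias:", bias_iv)
print("IV bias, negative Corr(z,u):", bias_iv_flipped)
```

With these made-up numbers, a Corr(z,u) of only 0.02 produces an IV bias of about 0.4, larger than the OLS bias of 0.3, and flipping its sign gives a bias of about −0.4 in the opposite direction to OLS.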
IV estimation of the multiple regression model
• I will extend the discussion to the multiple regression model.
• I will explain the following 3 cases, step by step.
Case 1: One endogenous variable, one instrument.
Case 2: One endogenous variable, more than one instrument. (Two stage least squares)
Case 3: More than one endogenous variable, more than one instrument. (Two stage least squares)
Case 1: One endogenous variable, one instrument.
• Consider the following regression: log(wage)=β0+β1educ+β2exp+u
• Suppose that educ is endogenous but exp is exogenous.
To explain IV regression for multiple regression, it is often useful to use different notation for endogenous and exogenous variables.
• Let us use y for endogenous variables (i.e., correlated with u) and z for exogenous variables (i.e., uncorrelated with u).
• Then, we can write the model as: y1=β0+β1y2+β2z1+u …………………(1)
Here y1 is log(wage), y2 is educ, and z1 is exp.
This model is called the structural equation to emphasize that this equation shows the causal relationship. Of course, OLS cannot be used to consistently estimate the parameters, since y2 is endogenous.
• If you have an instrument for y2, you can consistently estimate the model. Let us call this instrument z2.
As before, z2 should satisfy (i) instrument exogeneity, and (ii) instrument relevance.
• For a multiple regression model, these conditions are written as:
1. The instrument exogeneity: Cov(z2, u)=0 …………………….(2)
2. The instrument relevance: y2=π0+π1z1+π2z2+error …………….(3) and π2≠0
Equation (3) includes all the exogenous variables. This equation is often called the reduced form equation.
• In addition, z2 should not be a part of the structural equation (1). This is called the exclusion restriction.
Now, we have the following three conditions that can be used to obtain the IV estimators.
E(u)=0
Cov(z1,u)=0
Cov(z2,u)=0 (this is from the instrument exogeneity)
The sample counterparts of these conditions are given in the next slide.
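Writing the residual as $\hat u_i = y_{i1}-\hat\beta_0-\hat\beta_1 y_{i2}-\hat\beta_2 z_{i1}$, the sample counterparts are:
$$\sum_{i=1}^{n}\hat u_i = 0$$
If you divide it by n, this is the sample average of $\hat u$.
$$\sum_{i=1}^{n} z_{i1}\hat u_i = 0$$
If you divide it by n-1, this is the sample covariance between z1 and $\hat u$.
$$\sum_{i=1}^{n} z_{i2}\hat u_i = 0$$
If you divide it by n-1, this is the sample covariance between z2 and $\hat u$.
• This is a set of three equations with three unknowns: $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$.
• The solutions to these equations are the IV estimators.
• There is a simple matrix expression for IV estimators. However, we will not cover this during the class.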
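The above method can be easily extended to the case where there are more explanatory variables (but only one endogenous variable).
• Consider the following model: y1=β0+β1y2+β2z1+β3z2+β4z3+..+βkzk-1+u
• Suppose that zk is the instrument for y2. Then the IV estimators are the solution to the following k+1 equations, where $\hat u_i = y_{i1}-\hat\beta_0-\hat\beta_1 y_{i2}-\hat\beta_2 z_{i1}-\dots-\hat\beta_k z_{i,k-1}$:
$$\sum_{i=1}^{n}\hat u_i = 0, \qquad \sum_{i=1}^{n} z_{ij}\hat u_i = 0 \quad (j=1,\dots,k)$$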
The solutions to the above equations are the IV estimators when there are many explanatory variables, but only one endogenous variable and one instrument.
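One way to solve these sample moment conditions numerically is to stack them in matrix form. The sketch below is an illustration under assumptions, not the lecture's own derivation: it uses simulated data with made-up coefficients, stacks the regressors (1, y2, z1) in `X` and the instruments (1, z2, z1) in `Z`, and solves Z'(y1 − Xb) = 0 for b.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated data (hypothetical values, for illustration only).
z1 = rng.normal(size=n)              # exogenous regressor
z2 = rng.normal(size=n)              # instrument, excluded from the structural equation
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)     # u is correlated with v, so y2 is endogenous
y2 = 1.0 + 0.5 * z1 + 1.0 * z2 + v   # reduced form for y2
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u   # structural equation, true betas = (1, 2, -1)

const = np.ones(n)
X = np.column_stack([const, y2, z1])  # regressors of the structural equation
Z = np.column_stack([const, z2, z1])  # instruments: z2 takes the place of y2

# The three moment conditions are Z'(y1 - X b) = 0, i.e. (Z'X) b = Z'y1.
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y1)

# OLS for comparison; it is inconsistent here because y2 is endogenous.
beta_ols = np.linalg.lstsq(X, y1, rcond=None)[0]

# The moment conditions hold exactly (up to rounding) at the IV solution.
moments = Z.T @ (y1 - X @ beta_iv)

print("IV :", beta_iv)
print("OLS:", beta_ols)
print("moment conditions at the IV solution:", moments)
```

In this simulated design the IV estimate of the coefficient on y2 should land close to the true value of 2, while OLS should be pulled upward, toward roughly 2.4.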
Example
• Consider the following model: Log(wage)=β0+β1(educ)+β2Exper+β3Exper²+β4(SMSA)+β5(South)+u
• Using college proximity (nearc4) as an IV for education, estimate the model. Use CARD.dta. nearc4 is a dummy variable for someone who grew up near a four-year college.
(Stata output: OLS estimates; IV estimates)
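Check if nearc4 satisfies instrument relevance: regress educ on nearc4 and all the other exogenous variables, and test the coefficient on nearc4. Using the t-test, we can reject the null hypothesis that nearc4 is not correlated with educ after controlling for all the other exogenous variables.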
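The relevance check amounts to a t-test on the instrument's coefficient in the reduced form regression. The sketch below shows the mechanics on simulated data. It does not use CARD.dta, and the helper `ols_with_tstats` and all coefficient values are assumptions made for illustration.

```python
import numpy as np

def ols_with_tstats(X, y):
    """OLS coefficients and t-statistics under homoskedasticity."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)         # estimated error variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))  # standard errors
    return coef, coef / se

rng = np.random.default_rng(2)
n = 1000
z1 = rng.normal(size=n)   # exogenous variable appearing in the structural equation
z2 = rng.normal(size=n)   # candidate instrument
y2 = 0.5 + 0.3 * z1 + 0.4 * z2 + rng.normal(size=n)  # reduced form, pi2 = 0.4

# Reduced form: regress y2 on ALL exogenous variables, then test pi2 = 0.
X = np.column_stack([np.ones(n), z1, z2])
coef, tstats = ols_with_tstats(X, y2)
print("pi2 estimate:", coef[2], " t-statistic:", tstats[2])
```

A t-statistic far above the usual critical values, as this design should produce, lets you reject the null hypothesis that the coefficient on the instrument is zero.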
Case 2: One endogenous variable, more than one instrument. Two stage least squares
• Consider the following model with one endogenous variable: y1=β0+β1y2+β2z1+u
• Now, suppose that you have two instruments for y2 that satisfy the instrument conditions. Call them z2 and z3.
You can apply the IV method using either z2 or z3. But this produces two different estimators. Moreover, neither of them is efficient.
• Now, I will show you a more efficient estimator.
• First, it is important to lay out the instrument conditions.
For z2 and z3 to be valid instruments, they have to satisfy the following two conditions.
• Instrument exogeneity: Cov(z2, u)=0 and Cov(z3, u)=0
• Instrument relevance: y2=π0+π1z1+π2z2+π3z3+error, and π2≠0 or π3≠0 (this equation includes all the exogenous variables)
In addition, z2 and z3 should not be a part of the structural equation. These are called the exclusion restrictions.
Now, I will explain the estimation method. • Instead of using only one instrument, we use a linear combination of z2 and z3 as the instrument. • Since a linear combination of z2 and z3 also satisfies the instrument conditions, this is a valid method. • The question is how to find the best linear combination of z2 and z3.
It turns out that OLS regression of the following model provides the best linear combination: y2=π0+π1z1+π2z2+π3z3+error
• After you estimate this model, you get the predicted value of y2, $\hat{y}_2$.
• Since $\hat{y}_2$ is a combination of variables which are not correlated with u, $\hat{y}_2$ is not correlated with u either. At the same time, $\hat{y}_2$ is correlated with y2. Thus $\hat{y}_2$ is a valid instrument.
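Thus, we have the following three conditions that can be used to derive an IV estimator.
E(u)=0
Cov(z1,u)=0
Cov($\hat{y}_2$,u)=0
The sample counterparts of the above equations, with $\hat u_i = y_{i1}-\hat\beta_0-\hat\beta_1 y_{i2}-\hat\beta_2 z_{i1}$, are given by:
$$\sum_{i=1}^{n}\hat u_i = 0, \qquad \sum_{i=1}^{n} z_{i1}\hat u_i = 0, \qquad \sum_{i=1}^{n} \hat y_{i2}\hat u_i = 0$$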
This is a set of three equations with three unknowns: $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$.
• The solutions to these equations are a special type of IV estimator called the two stage least squares (2SLS) estimators.
You can estimate these parameters by following the above procedure. • There is an alternative and equivalent procedure to estimate these parameters. This procedure will give you an idea why it is called the two stage least squares.
The estimation procedure of two stage least squares (2SLS).
Stage 1. Estimate the following model using OLS and get the predicted value of y2, $\hat{y}_2$: y2=π0+π1z1+π2z2+π3z3+error. Make sure to include all the exogenous variables.
Stage 2. Replace y2 with $\hat{y}_2$, then estimate the following model using OLS: $y_1=\beta_0+\beta_1\hat{y}_2+\beta_2 z_1+\text{error}$. The OLS estimators of the coefficients are the two stage least squares (2SLS) estimators.
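The two stages can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, with made-up coefficients and a hypothetical helper `ols`, not the lecture's Stata workflow. It also checks the claimed equivalence: running OLS in the second stage gives the same coefficients as using the first-stage fitted value as the instrument for y2 in the IV moment conditions.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 2000

# Simulated data (hypothetical values): z1 exogenous, z2 and z3 instruments.
z1, z2, z3 = rng.normal(size=(3, n))
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)               # makes y2 endogenous
y2 = 1.0 + 0.5 * z1 + 0.7 * z2 + 0.7 * z3 + v
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u             # true betas = (1, 2, -1)

const = np.ones(n)

# Stage 1: regress y2 on ALL exogenous variables and keep the fitted values.
Z = np.column_stack([const, z1, z2, z3])
y2_hat = Z @ ols(Z, y2)

# Stage 2: replace y2 with its fitted value and run OLS.
X_hat = np.column_stack([const, y2_hat, z1])
beta_2sls = ols(X_hat, y1)

# Equivalent route: use y2_hat as the instrument for y2 in the moment conditions.
X = np.column_stack([const, y2, z1])
beta_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y1)

print("two-stage OLS  :", beta_2sls)
print("IV with y2_hat :", beta_iv)
```

The two coefficient vectors agree up to rounding. The standard errors reported by a plain second-stage OLS would not be valid, which is a separate issue from the coefficients.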
Estimating the standard errors for two stage least squares.
• When you exactly follow the two stage procedure explained in the previous slide, you get the correct 2SLS coefficients. But you do not get the correct standard errors.
• So, after applying the procedure, you have to do some extra work to estimate the standard errors.
• Under the homoskedasticity assumption, the valid standard errors are computed as follows.
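1. Estimate the 2SLS coefficients, then estimate the variance of u as
$$\hat\sigma^2 = \frac{1}{n-3}\sum_{i=1}^{n}\hat u_i^2, \qquad \text{where } \hat u_i = y_{i1}-\hat\beta_0-\hat\beta_1 y_{i2}-\hat\beta_2 z_{i1}$$
Note that you use y2, not $\hat{y}_2$, to compute the residuals. The coefficients are the 2SLS estimates. (Dividing by n−3 is the usual degrees-of-freedom correction for the three estimated parameters.)
2. Then the variance for $\hat\beta_1$, the coefficient on y2, is given by
$$\widehat{\mathrm{Var}}(\hat\beta_1) = \frac{\hat\sigma^2}{\widehat{SST}_2\,(1-\hat R_2^2)}$$
where $\widehat{SST}_2$ is the total variation of $\hat{y}_2$, and $\hat R_2^2$ is the R-squared from regressing $\hat{y}_2$ on all other exogenous variables appearing in the structural equation. The variance for any other βj has the same form, with $\hat{y}_2$ replaced by that coefficient's own variable and the R-squared taken from regressing it on the remaining second stage regressors.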
The square root of the variance in the previous slide is the standard error for βj.
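The difference between the valid standard errors and the ones a plain second-stage OLS would report can be seen on simulated data. The sketch below makes several assumptions: the data-generating process is made up, homoskedasticity holds by construction, and n − k is used as the degrees-of-freedom divisor. It computes the residuals with y2 for the valid version and with the fitted value for the naive version, and checks that the σ²/[SST(1−R²)] form matches the matrix form for the coefficient on y2.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
z1, z2, z3 = rng.normal(size=(3, n))
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)               # Var(u) = 1.64 in this design
y2 = 1.0 + 0.5 * z1 + 0.7 * z2 + 0.7 * z3 + v
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u

const = np.ones(n)
Z = np.column_stack([const, z1, z2, z3])
X = np.column_stack([const, y2, z1])
y2_hat = Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]
X_hat = np.column_stack([const, y2_hat, z1])
beta = np.linalg.lstsq(X_hat, y1, rcond=None)[0]   # 2SLS coefficients
k = X.shape[1]

# Valid version: residuals are computed with y2, NOT with y2_hat.
u_hat = y1 - X @ beta
sigma2 = u_hat @ u_hat / (n - k)
xhx_inv = np.linalg.inv(X_hat.T @ X_hat)
se_correct = np.sqrt(sigma2 * np.diag(xhx_inv))

# Same number for the coefficient on y2 via sigma2 / [SST * (1 - R^2)].
W = np.column_stack([const, z1])   # other exogenous variables in the structural equation
resid = y2_hat - W @ np.linalg.lstsq(W, y2_hat, rcond=None)[0]
sst = np.sum((y2_hat - y2_hat.mean()) ** 2)
r2 = 1.0 - resid @ resid / sst
se_beta1_formula = np.sqrt(sigma2 / (sst * (1.0 - r2)))

# Naive version: what a plain second-stage OLS reports (residuals use y2_hat).
e_naive = y1 - X_hat @ beta
sigma2_naive = e_naive @ e_naive / (n - k)
se_naive = np.sqrt(sigma2_naive * np.diag(xhx_inv))

print("valid SE for beta1 :", se_correct[1], "(formula:", se_beta1_formula, ")")
print("naive SE for beta1 :", se_naive[1])
```

In this particular design the naive standard error is too large, because the second-stage residual also contains the first-stage error scaled by the coefficient on y2. In other designs the naive value could err in the other direction.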
Note
• STATA automatically estimates the 2SLS model and calculates the correct standard errors.
• In most cases, you should avoid estimating 2SLS “manually” (although it is a good exercise), since this does not provide you with the correct standard errors.
Exercise
• Consider the following model: Log(wage)=β0+β1(educ)+β2Exper+β3Exper²+u
• Suppose educ is endogenous but exper and its square are exogenous. Using mother's and father's education as instruments, estimate the 2SLS model. Use Mroz.dta.
• Manually estimate the model to check if you get the same coefficients. (Note that you will not get the correct standard errors.)
The “first” option shows the first stage and the second stage. (Stata output: first stage regression; 2SLS results)
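Estimating 2SLS manually: When you run the first stage regression manually on this data, more observations are used than in the above 2SLS estimation, because the first stage does not involve log(wage) and so does not drop observations with a missing wage. To use exactly the same observations, first run the 2SLS and find the observations used in the regression. e(sample) enables you to create a dummy variable that indicates whether each observation was used.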
Then, estimate the first stage regression. Note that “if fullsample==1” tells STATA to use an observation only if fullsample is 1. After estimation, use the predict command to create the predicted value of educ.
Finally, estimate the second stage regression. You can see that the coefficients are the same as before, but the standard errors and t-statistics are different.
Case 3: More than one endogenous variable, more than one instrument
• Consider the following structural equation: y1=β0+β1y2+β2y3+β3z1+β4z2+β5z3+u1
• There are two endogenous variables, y2 and y3. Thus, OLS will be biased.
• In order to estimate this model with the IV method, you need at least 2 instruments. When you have multiple endogenous variables, you need at least as many instruments as endogenous variables.
Suppose you have 3 instruments: z4, z5, z6. As usual, these instruments should satisfy 2 conditions. The first is that they should not be correlated with u1 (instrument exogeneity). The second is that they should be correlated with the endogenous variables (instrument relevance). When you have multiple endogenous variables, the second condition has a more complex expression, and it is called the rank condition.
The estimation procedure
• The 2SLS procedure when there is more than one endogenous variable is shown here.
• The structural equation is y1=β0+β1y2+β2y3+β3z1+β4z2+β5z3+u1. Suppose you have three instruments: z4, z5, z6.
• First stage: Estimate the following two reduced form regressions.
y2=π10+π11z1+π12z2+π13z3+π14z4+π15z5+π16z6+error
y3=π20+π21z1+π22z2+π23z3+π24z4+π25z5+π26z6+error
Then obtain the predicted values $\hat{y}_2$ and $\hat{y}_3$.
• Second stage: Estimate the following ‘second stage regression’.
$y_1=\beta_0+\beta_1\hat{y}_2+\beta_2\hat{y}_3+\beta_3 z_1+\beta_4 z_2+\beta_5 z_3+\text{error}$
The estimated coefficients are the 2SLS coefficients.
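The same two stages with two endogenous variables can be sketched as follows. The data are simulated with hypothetical coefficients, chosen so that the three instruments move y2 and y3 in linearly independent ways (so the rank condition holds by construction). This is an illustration under those assumptions, not a replacement for Stata's ivregress.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

z = rng.normal(size=(n, 6))            # columns 0..5 hold z1..z6
e2, e3 = rng.normal(size=(2, n))
u1 = 0.5 * e2 + 0.5 * e3 + rng.normal(size=n)   # correlated with both y2 and y3

# Reduced forms: z4..z6 are the instruments (excluded from the structural equation).
y2 = z[:, 0] + z[:, 3] + 0.5 * z[:, 4] + e2
y3 = z[:, 1] + 0.5 * z[:, 4] + z[:, 5] + e3

# Structural equation, true (beta1, beta2) = (1.5, -0.5).
y1 = 1.0 + 1.5 * y2 - 0.5 * y3 + z[:, 0] + z[:, 1] + z[:, 2] + u1

const = np.ones((n, 1))

# First stage: regress y2 and y3 on ALL exogenous variables z1..z6.
Z = np.hstack([const, z])
Y_endog = np.column_stack([y2, y3])
Pi = np.linalg.lstsq(Z, Y_endog, rcond=None)[0]   # one column of pi's per equation
Y_hat = Z @ Pi                                    # fitted y2 and y3

# Second stage: replace y2, y3 with fitted values; keep z1..z3 as regressors.
X_hat = np.hstack([const, Y_hat, z[:, :3]])
beta_2sls = np.linalg.lstsq(X_hat, y1, rcond=None)[0]

# OLS on the original variables, for comparison (inconsistent here).
X = np.hstack([const, Y_endog, z[:, :3]])
beta_ols = np.linalg.lstsq(X, y1, rcond=None)[0]

print("2SLS (beta1, beta2):", beta_2sls[1], beta_2sls[2])
print("OLS  (beta1, beta2):", beta_ols[1], beta_ols[2])
```

As in the single-endogenous-variable case, only the coefficients from this manual second stage are usable. Its reported standard errors would not be valid.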
Note that the second stage regression does not produce the correct standard errors. The derivation of the exact formula for the standard errors is not the focus of this course. Stata's ivregress command automatically computes the correct standard errors.
Testing multiple hypotheses • In the 2SLS method, the F statistic formula we used for OLS is no longer valid. STATA automatically computes a valid F-type statistic for 2SLS.