Non-Experimental Data II: What Should Be Included in a Regression? Omitted Variables and Measurement Error
Causal Effects in Bog-standard Non-experimental Data • Often no clever instrument or natural experiment available • Just going to run a regression of y on X1 – and what else? • Which variables to include is a basic day-to-day decision for the practising applied economist • Apologies if too basic, but it is important • No specific recipe, but some general principles
Think of what question you want to answer • Want to estimate E(y|X1, ?) • Think of what '?' should be • Returns to education – should you include or exclude occupation? • If included, it will improve R², so occupation is 'relevant' • But then you are asking 'what is the effect of education on earnings holding occupation constant?' – perhaps not what we want
Will focus on econometric issues • What are the issues we need to worry about: • Omitted variables • Measurement error • Will discuss these issues in turn • Slight change in notation to more standard form • Run regression of y on X1, X2, etc – want causal effect of X1 on y
Omitted Variable Issues • Basic Model is: y=X1β1+X2β2+ε • Two issues: • What happens if we include X2 when it is irrelevant (β2=0)? • What happens if we exclude X2 when it is relevant (β2≠0)?
Proposition 4.1: If X2 is irrelevant • OLS estimate of β1 is consistent, so no problem here (not surprising – we are imposing a 'true' restriction on the data) • But there is a cost – lower precision in the estimate of β1
Proof of Proposition 4.1a • Many ways to prove this • Can just read it off from: β̂ = β + (X′X)⁻¹X′ε, so plim β̂1 = β1 (and plim β̂2 = 0) • Or use the result from the partitioned regression model: β̂1 = (X1′M2X1)⁻¹X1′M2y, where M2 = I − X2(X2′X2)⁻¹X2′
Proof of Proposition 4.1b – Method 1 • Using results from the partitioned regression model, can write the OLS estimate of β1 as: β̂1 = (X1′M2X1)⁻¹X1′M2y • This is linear in y but generally different from the OLS estimate when X2 is excluded • Can invoke the Gauss–Markov theorem: since β2 = 0, the short regression satisfies the classical assumptions, so its OLS estimator is BLUE and any other linear unbiased estimator (such as β̂1 above) has weakly larger variance – note the use of the irrelevance of X2 here
Proof of Proposition 4.1b – Method 2 (X1 and X2 one-dimensional) • This uses results from the notes on experiments • If we exclude X2 then the variance of the coefficient on X1 is: Var(β̂1) = σ0² / Σi(X1i − X̄1)² • If we include X2 then it is: Var(β̂1) = σ² / [Σi(X1i − X̄1)²(1 − ρ12²)], where ρ12 is the sample correlation between X1 and X2 • If X2 is irrelevant then σ0² = σ², so including X2 can only raise the variance
What determines the size of the loss of precision? • The bigger the correlation between X1 and X2, the greater the likely loss in precision • To see this: if X1 and X2 are uncorrelated, the two estimates are identical • Consider the extreme case of perfect correlation – then there is perfect multicollinearity if X2 is included • Also useful to think of Proposition 4.1b as a specific application of the general principle that imposing a 'true' restriction on the parameters (here β2 = 0) improves the precision of the estimates of the other parameters – a gain in efficiency
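The precision cost can be seen in a small Monte Carlo sketch (simulated data with numpy; the numbers – correlation 0.9, n = 200, 500 replications – are purely illustrative):

```python
import numpy as np

# Monte Carlo: X2 is irrelevant (beta2 = 0) but highly correlated with X1.
# Compare the sampling spread of the estimate of beta1 with and without X2.
rng = np.random.default_rng(0)
n, reps, beta1 = 200, 500, 1.0
draws_short, draws_long = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.9
    y = beta1 * x1 + rng.normal(size=n)                       # x2 plays no role in y
    Xs = np.column_stack([np.ones(n), x1])        # short regression: constant, x1
    Xl = np.column_stack([np.ones(n), x1, x2])    # long regression: adds irrelevant x2
    draws_short.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    draws_long.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])
print("sampling sd excluding X2:", np.std(draws_short))
print("sampling sd including X2:", np.std(draws_long))
```

Both estimators centre on the true β1 = 1, but the spread of the long-regression estimate is noticeably larger – the 1/(1 − ρ12²) variance inflation from the formula above.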
Contrast with the earlier result on other variables in experimental data • Here, inclusion of irrelevant variables correlated with X reduces precision • Earlier, inclusion of relevant variables uncorrelated with X increased precision • The results are consistent: • Including relevant variables increases precision • Including variables correlated with X reduces precision • The effect on precision of including a relevant variable correlated with X is ambiguous
Excluding Relevant Variables • Leads to omitted variable bias if X1 and X2 are correlated: in the one-regressor case, plim β̂1 = β1 + β2·Cov(X1, X2)/Var(X1)
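The bias formula can be checked in a short simulation (illustrative numbers, numpy): with β1 = β2 = 1 and Cov(X1, X2)/Var(X1) = 0.5, the short regression should converge to 1.5, not 1.

```python
import numpy as np

# Omitted variable bias: y depends on x1 and x2, but we regress y on x1 alone.
# plim of short-regression slope = beta1 + beta2 * Cov(x1, x2)/Var(x1) = 1 + 0.5 = 1.5
rng = np.random.default_rng(1)
n, beta1, beta2 = 100_000, 1.0, 1.0
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # x2 correlated with x1
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
print("short-regression slope:", b_short)  # close to 1.5, not the true 1.0
```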
Is it better to exclude the relevant or include the irrelevant? • Omitting relevant variables causes bias • Including irrelevant variables causes lower precision • Might conclude it is better to err on the side of caution and include lots of regressors – the 'kitchen sink' approach • But: • May be prepared to accept some bias for extra precision • Can worsen problems of measurement error
Measurement Error • True value is X* • True model is: y = X*β + ε • But X* is observed with error – the observed value is X • Measurement error has the classical form: X = X* + u, E(u|X*) = 0 • Can write the model in terms of observables as: y = Xβ + (ε − uβ) • X is correlated with the composite error (ε − uβ), so there is bias in the OLS estimate
Proposition 4.2: With one regressor (with classical measurement error) the plim of the slope coefficient is: plim β̂ = β·σ²X* / (σ²X* + σ²u) • OLS estimate is biased towards zero – this is attenuation bias • The extent of the bias depends on the importance of the measurement error – the signal-to-noise ratio, σ²X*/σ²u, or the reliability ratio, σ²X*/(σ²X* + σ²u)
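A quick simulation of attenuation bias (illustrative parameters, numpy): with Var(X*) = Var(u) = 1 the reliability ratio is 0.5, so a true slope of 2 should be estimated near 1.

```python
import numpy as np

# Attenuation bias: true slope is 2, Var(x*) = Var(u) = 1,
# so plim of the OLS slope = 2 * 1/(1 + 1) = 1.
rng = np.random.default_rng(2)
n, beta = 100_000, 2.0
x_star = rng.normal(size=n)
x = x_star + rng.normal(size=n)           # classical measurement error
y = beta * x_star + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0][1]
print("estimated slope:", b)  # close to 1, half the true value
```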
The General Case • Same model as before, but now X is more than one-dimensional • Some notation and assumptions: X = X* + u, E(u|X*) = 0, Var(X*) = ΣX*X* • Covariance matrix of u is Σ
Proposition 4.3: The plim of the OLS estimator with many error-ridden regressors is: plim β̂ = (ΣX*X* + Σ)⁻¹ΣX*X*β
Matrix equivalent of attenuation bias • But, in the general case, it is hard to say anything about the direction of the bias on any single coefficient • If ΣX*X* and Σ are both diagonal then all coefficients are biased towards zero
An Informative Special Case • Two variables: one measured with error, the other measured without error
Proposition 4.4: Attenuation Bias of the Error-Ridden Variable Worsens when Other Variables are Included: plim β̂1 = β1·σ²1*(1 − ρ12²) / (σ²1*(1 − ρ12²) + σ²u) • Where ρ12 is the correlation between X1* and X2 • If ρ12 ≠ 0 this attenuation bias is worse than when X2 is excluded • Intuition: X2 soaks up some of the signal in X1, leaving more noise in what remains
Proposition 4.5: The Presence of Error-Ridden Variables Causes Inconsistency in the Coefficients of Other Variables • The OLS estimate of β2 is inconsistent if X1 and X2 are correlated (σ12 ≠ 0) • Mirror image of the previous result – X2 soaks up some of the true variation in X1
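Propositions 4.4 and 4.5 can both be illustrated in one simulation (hypothetical parameters, numpy). Here x2 is clean, correlated 0.7 with x1*, and truly irrelevant (β2 = 0); x1 = x1* + u with Var(u) = Var(x1*) = 1:

```python
import numpy as np

# Short regression (y on x1 alone): slope -> beta1 * 1/(1+1) = 1 (plain attenuation).
# Long regression (y on x1, x2): attenuation worsens (slope -> ~0.68)
# and x2 picks up a spurious positive coefficient (-> ~0.93) despite beta2 = 0.
rng = np.random.default_rng(3)
n, beta1 = 200_000, 2.0
x1_star = rng.normal(size=n)
x2 = 0.7 * x1_star + np.sqrt(1 - 0.7**2) * rng.normal(size=n)  # clean, corr 0.7
x1 = x1_star + rng.normal(size=n)             # classical measurement error, var 1
y = beta1 * x1_star + rng.normal(size=n)      # x2 truly irrelevant
b_short = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
b_long = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]
print("slope on x1 alone:", b_short)
print("slope on x1 with x2 included:", b_long[1])
print("spurious coefficient on x2:", b_long[2])
```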
An Extreme Case • Observed X1 is all noise, σ²u → ∞ – its coefficient will be zero • Then we get: plim β̂2 = β2 + β1·σ12/σ²2 • Should recognise this as the formula for omitted variable bias when X1 is excluded
Measurement error in the Dependent Variable • Suppose classical measurement error in y: y = y* + u • Assume u uncorrelated with y*, X • Then: y = Xβ + ε + u • X is uncorrelated with u, so OLS is consistent • But there is a loss in precision, so there is still a cost to bad data
Example 2: Including Variables at a Higher Level of Aggregation • X* is an individual-level variable • Only observe its average value at some higher level of aggregation (e.g. village, industry, region) – call this X • Model for the relationship between X and X*: X* = X + u, E(u|X) = 0 • Note the change in format from the classical case
In the regression we have: y = X*β + ε = Xβ + uβ + ε • X and u are uncorrelated, so no inconsistency in the OLS estimate of the coefficient • But not ideal: • Loss in precision – less variation in X • Limits the way we can model other higher-level variables • May cause inconsistency in the coefficients on other variables, as E(u|X, Z) will depend on Z
Summary of results on omitted variables and measurement error • Including irrelevant variables leads to loss in precision • Excluding relevant variables leads to omitted variables bias • Measurement error in X variables typically causes attenuation bias in coefficients • Inclusion of other variables worsens attenuation bias (though may reduce omitted variables bias)
Strategies For Omitted Variables/Measurement Error • One strategy for dealing with omitted variables is to get data on the variable and include it • One strategy for dealing with measurement error is to get better-quality data • These are good strategies but may be easier said than done • IV offers another approach if an instrument can be argued to be correlated with the true value of the variable of interest and uncorrelated with the measurement error/omitted variable
Clustered Standard Errors • In many situations individuals are affected by variables that operate at a higher level, e.g. industry, region, economy • Call this higher level a group or cluster • Can include group-level variables in the regression • May be difficult to control for all relevant group-level variables, so common practice is to include a dummy variable for each group • These dummy variables will capture the impact of all group-level variables
Can write this model as: y = Xβ + Dθ + ε • Where D is the (N×G) matrix of group dummies and θ the vector of group-level effects (assume mean zero) • Will often see this, but: • Low precision if the number of groups is large (only exploits within-group variation in X) • Can't identify the effect of a group-level variable X
Let's think some more about this case • Might think about dropping the group-level dummies and simply estimating: y = Xβ + ε • But this assumes the covariance between the residuals of individuals in the same group is zero – a very strong assumption • A half-way house is to think of θ not as parameters to be estimated but as 'errors' that operate at the level of the group • Assume θ uncorrelated with X, ε
An Error Component Model • The error for individual i in group g can be written as: ui = θg + εi • The variance of this error is: Var(ui) = σ²θ + σ²ε • The correlation between the errors of individuals in the same group (zero for those not in the same group) is: ρ = σ²θ/(σ²θ + σ²ε)
Why is this? • For individuals i and j in the same group g: Cov(ui, uj) = Cov(θg + εi, θg + εj) = σ²θ • As they have the same group-level component • For individuals in different groups the covariance is zero, as they have different (and assumed independent) group-level components
Implications • The covariance matrix of the composite errors, ui, will no longer be diagonal – denote it by σ²Ω • The OLS estimate will still be consistent (though not efficient) • The computed standard errors will be inconsistent – they should be computed as: Var(β̂) = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹
With this particular error component model: Ω = (1 − ρ)I + ρDD′, so Var(β̂) = σ²(X′X)⁻¹ + ρσ²(X′X)⁻¹(X′DD′X − X′X)(X′X)⁻¹ • i.e. the usual formula plus something • The usual formula will be wrong if the second term is non-zero
Can say more…. • (X′D) is a (k×G) matrix whose (k, g) element is the sum of the values of Xk for those in group g • Suppose all groups are of equal size, Ng = (N/G) • Define a (G×k) matrix X̄ of the average values of X in each group, so that X′D = NgX̄′
Using this in the previous expression… • For the case of one regressor the variance of the slope coefficient will be: Var(β̂) = [σ²/(N·Var(Xi))]·[1 + ρ(Ng·Var(Xg)/Var(Xi) − 1)] • Where Var(Xi) is the variance of X across individuals and Var(Xg) is the variance of X across groups
Case I: X correlation within and between groups the same • Then Var(Xg) = Var(Xi)/Ng, so Var(β̂) = σ²/(N·Var(Xi)) • i.e. the usual formula is correct • Implies no (or only a small) problem with the standard errors of variables which do not have much group-level variation
Case II: Group-Level Regressor • Then Var(Xg) = Var(Xi), so Var(β̂) = [σ²/(N·Var(Xi))]·[1 + (Ng − 1)ρ] • The standard formula understates the true variance by a factor related to the importance of the group-level shock (ρ) and the size of the groups (Ng)
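The understatement for a group-level regressor can be checked by direct linear algebra rather than simulation (illustrative group structure – G = 10 groups of Ng = 20 with ρ = 0.5 – chosen for the example):

```python
import numpy as np

# Verify: for a purely group-level regressor, the true sampling variance is
# exactly (1 + (Ng - 1) * rho) times the usual OLS formula.
G, Ng, rho, sigma2 = 10, 20, 0.5, 1.0
N = G * Ng
groups = np.repeat(np.arange(G), Ng)
rng = np.random.default_rng(4)
x = rng.normal(size=G)[groups]               # same X for everyone in a group
X = np.column_stack([np.ones(N), x])
D = (groups[:, None] == np.arange(G)[None, :]).astype(float)  # group dummies
Omega = (1 - rho) * np.eye(N) + rho * D @ D.T  # error-component covariance (up to sigma2)
XtX_inv = np.linalg.inv(X.T @ X)
V_true = sigma2 * XtX_inv @ X.T @ Omega @ X @ XtX_inv  # sandwich formula
V_usual = sigma2 * XtX_inv                             # naive iid formula
print(V_true[1, 1] / V_usual[1, 1], 1 + (Ng - 1) * rho)  # both 10.5
```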
An even more special case…. • All individuals within groups are clones – ρ = 1 • Then the variance is Ng times the usual formula: really only have G observations • Simplest to estimate at the group level • But group-level estimation generally causes a loss in efficiency, so not the best solution
Dealing with this in practice…. • Stata has an option to compute standard errors allowing for clustering: . reg y x1 x2, cl(x3) • Such standard errors are said to be clustered, with the 'cluster' being x3 • So quite easy to do in practice
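For intuition about what the clustered option computes, here is a numpy sketch of the cluster-robust (sandwich) formula on simulated data – in practice Stata's cl() option (or equivalents in other packages) does this for you; the data-generating numbers are hypothetical:

```python
import numpy as np

# Simulate a group-level regressor (like a regional u-rate), an individual-level
# regressor (like sex), and a group-level error component, then compare
# conventional iid standard errors with cluster-robust ones.
rng = np.random.default_rng(5)
G, Ng = 50, 20
N = G * Ng
groups = np.repeat(np.arange(G), Ng)
x_grp = rng.normal(size=G)[groups]             # varies only across groups
x_ind = rng.normal(size=N)                     # varies across individuals
theta = rng.normal(size=G)[groups]             # group-level error component
y = 1.0 * x_ind + 1.0 * x_grp + theta + rng.normal(size=N)
X = np.column_stack([np.ones(N), x_ind, x_grp])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# Conventional (iid) standard errors
s2 = e @ e / (N - X.shape[1])
se_iid = np.sqrt(np.diag(s2 * XtX_inv))

# Cluster-robust sandwich: meat = sum over groups of (X_g' e_g)(X_g' e_g)'
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(G):
    m = groups == g
    v = X[m].T @ e[m]
    meat += np.outer(v, v)
se_cl = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print("iid SEs (const, x_ind, x_grp):", se_iid)
print("clustered SEs (const, x_ind, x_grp):", se_cl)
```

Clustering barely moves the standard error on the individual-level regressor but inflates that on the group-level regressor substantially – the pattern in the LFS example below.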
An example – the effect of gender and regional unemployment on wages • Data from the UK LFS • Would expect the gender mix not to vary much between regions, so most of its variation is within region • The unemployment rate only has variation at the regional level • Would expect clustering to increase the standard error on gender only a little, but that on the u-rate a lot
No clustering

     logwage |      Coef.   Std. Err.        t
-------------+---------------------------------
         sex |  -.2285092    .0091228   -25.05
       urate |   1.057465    .3928981     2.69
       _cons |   2.447221    .0228265   107.21
-----------------------------------------------
With clustered standard errors

             |                Robust
     logwage |      Coef.   Std. Err.        t
-------------+---------------------------------
         sex |  -.2285092    .0110932   -20.60
       urate |   1.057465    2.943567     0.36
       _cons |   2.447221    .1494707    16.37
-----------------------------------------------

As predicted by theory
Conclusions • Good practice to cluster the standard errors if not going to include group-level dummies • This is particularly important for group-level regressors – standard errors will otherwise often be much too low