Non-Experimental Data II
What Should Be Included in a Regression? Omitted Variables and Measurement Error
Causal Effects in Bog-standard Non-Experimental Data
• Often no clever instrument or natural experiment is available
• We are just going to run a regression of y on X1 - and what else?
• Deciding which variables to include is a basic day-to-day decision for the practising applied economist
• Apologies if this is too basic, but it is important
• There is no specific recipe, but there are general principles
Think of What Question You Want to Answer
• We want to estimate E(y|X,?) - so think about what `?' should be
• Returns to education: should you include or exclude occupation?
• Including it will improve R², so occupation is 'relevant'
• But then we are asking 'what is the effect of education on earnings holding occupation constant?' - perhaps not the question we want to answer
Will Focus on Econometric Issues
• The issues we need to worry about are:
• Omitted variables
• Measurement error
• We will discuss these issues in turn
• Slight change in notation to a more standard form: we run a regression of y on X1, X2, etc. and want the causal effect of X1 on y
Omitted Variable Issues
• Basic model: y = X1β1 + X2β2 + ε
• Two issues:
• What happens if we include X2 when it is irrelevant (β2 = 0)?
• What happens if we exclude X2 when it is relevant (β2 ≠ 0)?
Proposition 4.1: If X2 Is Irrelevant
• The OLS estimate of β1 is consistent, so no problem here (not surprising - we are imposing a 'true' restriction on the data)
• But there is a cost: lower precision in the estimate of β1
What Determines the Size of the Loss in Precision?
• The bigger the correlation between X1 and X2, the greater the likely loss in precision
• To see this, note that if X1 and X2 are uncorrelated the two estimates are identical
• Consider the extreme case of perfect correlation - then there is perfect multicollinearity if X2 is included
• It is also useful to think of Proposition 4.1 as a specific application of a general principle: if we impose a 'true' restriction on the parameters (here β2 = 0), the precision of the estimates of the other parameters improves - a gain in efficiency
Contrast with the Earlier Result on Other Variables in Experimental Data
• Here, inclusion of irrelevant variables correlated with X reduces precision
• Earlier, inclusion of relevant variables uncorrelated with X increased precision
• These results are consistent:
• Including relevant variables increases precision
• Including variables correlated with X reduces precision
• The effect on precision of including a relevant variable correlated with X is ambiguous
Excluding Relevant Variables
• Leads to omitted variable bias if X1 and X2 are correlated:
  plim β̂1 = β1 + δ21β2
• where δ21 is the coefficient from a regression of X2 on X1
Is It Better to Exclude the Relevant or Include the Irrelevant?
• Omitting relevant variables causes bias
• Including irrelevant variables lowers precision (both effects are illustrated in the sketch below)
• Might conclude it is better to err on the side of caution and include lots of regressors - the 'kitchen sink' approach
• But:
• We may be prepared to accept some bias in exchange for extra precision
• It can worsen problems of measurement error
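To make the trade-off concrete, here is a minimal Stata simulation sketch (all variable names, coefficients and the seed are made up for illustration): omitting a relevant correlated regressor biases the coefficient of interest, while including an irrelevant correlated regressor leaves it consistent but less precisely estimated.

  * Illustrative simulation: bias vs. precision (made-up values)
  clear
  set seed 12345
  set obs 1000
  gen x2 = rnormal()
  gen x1 = 0.5*x2 + rnormal()        // x1 and x2 correlated
  gen y  = x1 + x2 + rnormal()       // both relevant: true betas = 1
  reg y x1                           // omits relevant x2: plim of b1 is 1.4, not 1
  reg y x1 x2                        // consistent for both coefficients
  gen y2 = x1 + rnormal()            // here x2 is irrelevant
  reg y2 x1                          // consistent and relatively precise
  reg y2 x1 x2                       // still consistent, but larger s.e. on x1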
Measurement Error
• The true value is X*; the true model is: y = X*β + ε
• But X* is observed with error - the observed value is X
• The measurement error has the classical form: X = X* + u, E(u|X*) = 0
• We can write the model in terms of observables as: y = Xβ - uβ + ε
• X is correlated with the composite error (ε - uβ), so the OLS estimate is biased
Proposition 4.2
• With one regressor (with classical measurement error) the plim of the slope coefficient is:
  plim β̂ = β · σ²X* / (σ²X* + σ²u)
• The OLS estimate is biased towards zero - this is attenuation bias
• The extent of the bias depends on the importance of the measurement error: σ²X*/(σ²X* + σ²u) is the reliability ratio (σ²X*/σ²u is the signal-to-noise ratio)
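A quick way to see Proposition 4.2 at work is simulation; in this sketch (values are illustrative) the signal and noise variances are both 1, so the reliability ratio is 0.5 and the estimated slope should sit near half the true β:

  * Illustrative simulation: attenuation bias
  clear
  set seed 12345
  set obs 10000
  gen xstar = rnormal()              // true regressor, variance 1
  gen y     = xstar + rnormal()      // true slope = 1
  gen x     = xstar + rnormal()      // classical measurement error, variance 1
  reg y xstar                        // slope near 1
  reg y x                            // slope near 0.5 = 1/(1+1): attenuation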
The General Case
• Same model as before, but now X is more than one-dimensional
• In the general case it is hard to say anything about the direction of the bias on any single coefficient (Proposition 4.3)
• Attenuation bias is perhaps a 'good guess'
An Informative Special Case
• Two variables: one measured with error, the other measured without error
• Assume only X1 is mismeasured, with classical error u of variance σ²u; the true values X*1 and X*2 have variances σ²1 and σ²2 and covariance σ12, and u is uncorrelated with X*1, X*2 and ε
Proposition 4.4: Attenuation Bias of the Error-Ridden Variable Worsens when Other Variables Are Included
  plim β̂1 = β1 · σ²1(1 - ρ²12) / (σ²1(1 - ρ²12) + σ²u)
• where ρ12 is the correlation between X1 and X2
• If ρ12 ≠ 0 this attenuation bias is worse than when X2 is excluded
• Intuition: X2 soaks up some of the signal in X1, leaving more noise in what remains
Proposition 4.5: The Presence of Error-Ridden Variables Causes Inconsistency in the Coefficients of Other Variables
  plim β̂2 = β2 + β1 · (σ12/σ²2) · σ²u / (σ²1(1 - ρ²12) + σ²u)
• This is inconsistent if X1 and X2 are correlated (σ12 ≠ 0)
• Mirror image of the previous result - X2 soaks up some of the true variation in X1
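Propositions 4.4 and 4.5 can both be seen in a single simulation sketch (parameter values are illustrative; with these numbers ρ²12 ≈ 0.33, so the formulas above give plims of roughly 0.5 for β̂1 and 1.35 for β̂2 when X2 is included):

  * Illustrative simulation: Propositions 4.4 and 4.5
  clear
  set seed 12345
  set obs 20000
  gen x2     = rnormal()
  gen x1star = 0.7*x2 + rnormal()    // true x1, correlated with x2
  gen x1     = x1star + rnormal()    // only x1 observed with error
  gen y      = x1star + x2 + rnormal()
  reg y x1                           // b1 near 0.88 (attenuation partly offset
                                     //   by omitted variable bias from x2)
  reg y x1 x2                        // b1 near 0.5 (worse attenuation),
                                     //   b2 near 1.35 (biased away from 1)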
An Extreme Case
• Suppose the observed X1 is all noise (σ²u = ∞) - its coefficient will be zero
• Then we get:
  plim β̂2 = β2 + β1 · σ12/σ²2
• You should recognise this as the formula for omitted variable bias when X1 is excluded
Measurement Error in the Dependent Variable
• Suppose classical measurement error in y: y = y* + u
• Assume u is uncorrelated with y* and X
• Then: y = Xβ + ε + u
• X is uncorrelated with u, so OLS is consistent
• But there is a loss in precision, so there is still a cost to bad data
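A sketch of the contrast (values illustrative): adding noise to y leaves the slope consistent but inflates the standard error.

  * Illustrative simulation: noise in the dependent variable
  clear
  set seed 12345
  set obs 1000
  gen x     = rnormal()
  gen ystar = x + rnormal()
  gen y     = ystar + 2*rnormal()    // noisy measure of ystar
  reg ystar x                        // consistent, smaller s.e.
  reg y x                            // still consistent, larger s.e.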
Example 2: Including Variables at a Higher Level of Aggregation
• X* is an individual-level variable
• We only observe its average value at some higher level of aggregation (e.g. village, industry, region) - call this X
• Model for the relationship between X and X*: X* = X + u, E(u|X) = 0
• Note the change in format: the error is now uncorrelated with the observed value, not the true value
In the Regression We Have:
  y = X*β + ε = Xβ + uβ + ε
• X and u are uncorrelated, so there is no inconsistency in the OLS estimate of the coefficient
• But this is not ideal:
• Loss in precision - less variation in the regressor
• Limits the ways we can model other higher-level variables
• May cause inconsistency in the coefficients on other variables, as E(u|X,Z) will depend on Z
Summary of Results on Omitted Variables and Measurement Error
• Including irrelevant variables leads to a loss in precision
• Excluding relevant variables leads to omitted variable bias
• Measurement error in the X variables typically causes attenuation bias in the coefficients
• Inclusion of other variables worsens attenuation bias (though it may reduce omitted variable bias)
Strategies for Omitted Variables / Measurement Error
• One strategy for dealing with omitted variables is to get data on the variable and include it
• One strategy for dealing with measurement error is to get better-quality data
• These are good strategies, but may be easier said than done
• IV offers another approach, if an instrument can be argued to be correlated with the true value of the variable of interest and not with the measurement error / omitted variable (a sketch follows)
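One classic version of the IV approach to measurement error uses a second, independently mismeasured report of the same variable as the instrument; a simulation sketch (all names and values illustrative):

  * Illustrative simulation: IV with two noisy measures of the same x
  clear
  set seed 12345
  set obs 10000
  gen xstar = rnormal()
  gen y     = xstar + rnormal()
  gen x1a   = xstar + rnormal()      // first noisy measure
  gen x1b   = xstar + rnormal()      // second measure, independent error
  reg y x1a                          // attenuated: slope near 0.5 here
  ivregress 2sls y (x1a = x1b)       // consistent: x1b is correlated with
                                     //   xstar but not with x1a's error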
Clustered Standard Errors
• In many situations individuals are affected by variables that operate at a higher level, e.g. industry, region, economy
• Call this higher level a group or cluster
• We can include group-level variables in the regression
• It may be difficult to control for all relevant group-level variables, so it is common practice to include a dummy variable for each group
• These dummy variables will capture the impact of all group-level variables
Can Write This Model As:
  y = Xβ + Dθ + ε
• where D is the (N×G) matrix of group dummies and θ the vector of group-level effects (assumed mean zero)
• You will often see this, but:
• Low precision if the number of groups is large (it only exploits within-group variation in X)
• We can't identify the effect of any group-level variable (it is collinear with the dummies)
Let's Think Some More About This Case
• We might think about dropping the group-level dummies and simply estimating: y = Xβ + ε
• But this assumes the covariance between the residuals of individuals in the same group is zero - a very strong assumption
• A half-way house is to think of θ not as parameters to be estimated but as 'errors' that operate at the level of the group
• Assume θ is uncorrelated with X and ε
An Error Component Model
• The error for individual i in group g can be written as: ui = θg + εi
• The variance of this error is: Var(ui) = σ²θ + σ²ε
• The correlation between the errors of individuals in the same group (zero for those not in the same group) is:
  ρ = σ²θ / (σ²θ + σ²ε)
Why Is This?
• For individuals i and j in the same group: Cov(ui, uj) = E[(θg + εi)(θg + εj)] = σ²θ, as they share the same group-level component
• For individuals in different groups the covariance is zero, as they have different (and assumed independent) group-level components
Implications
• The covariance matrix of the composite errors ui will no longer be diagonal - denote it by σ²Ω
• The OLS estimate will still be consistent (though not efficient)
• The conventionally computed standard errors will be inconsistent - they should be computed from:
  Var(β̂) = σ²(X′X)⁻¹(X′ΩX)(X′X)⁻¹
Dealing with This in Practice
• Stata has an option to compute standard errors with clustering:
  . reg y x1 x2, cl(x3)
• Such standard errors are said to be clustered, with the 'cluster' being x3
• So it is quite easy to do in practice
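A self-contained sketch of why this matters (group structure, effect sizes and seed all illustrative): with a group-level regressor and a group-level error component, the conventional standard errors are far too small and clustering corrects them.

  * Illustrative simulation: group-level error component
  clear
  set seed 12345
  set obs 50                         // 50 groups
  gen g     = _n
  gen xg    = rnormal()              // group-level regressor
  gen theta = rnormal()              // group-level error component
  expand 100                         // 100 individuals per group
  gen y = xg + theta + rnormal()
  reg y xg                           // conventional s.e. much too small
  reg y xg, cl(g)                    // clustered s.e. much larger - correct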
An Example: The Effect of Gender and Regional Unemployment on Wages
• Data from the UK LFS
• We would expect the gender mix not to vary much between regions, so most of its variation is within region
• The unemployment rate only has variation at the regional level
• We would therefore expect clustering to raise the standard error on gender only a little, but that on the unemployment rate a lot
No Clustering

     logwage |      Coef.   Std. Err.       t
-------------+---------------------------------
         sex |  -.2285092   .0091228    -25.05
       urate |   1.057465   .3928981      2.69
       _cons |   2.447221   .0228265    107.21
-----------------------------------------------
With Clustered Standard Errors

             |              Robust
     logwage |      Coef.   Std. Err.       t
-------------+---------------------------------
         sex |  -.2285092   .0110932    -20.60
       urate |   1.057465   2.943567      0.36
       _cons |   2.447221   .1494707     16.37
-----------------------------------------------

As predicted by theory: the standard error on sex rises only slightly, while that on urate rises roughly sevenfold, and its t-statistic collapses.
Conclusions
• It is good practice to cluster the standard errors if you are not going to include group-level dummies
• This is particularly important for group-level regressors - conventional standard errors will otherwise often be much too low