Non-Experimental Data II: What Should Be Included in a Regression? Omitted Variables and Measurement Error
Causal Effects in Bog-standard Non-experimental Data • Often no clever instrument or natural experiment available • Just going to run a regression of y on X1 – and what else? • Which variables to include is a basic day-to-day decision for the practising applied economist • Apologies if too basic, but it is important • No specific recipe, but some general principles
Think of what question you want to answer • Want to estimate E(y|X1, ?) • Think of what '?' should be • Returns to education – should you include or exclude occupation? • If included, it will improve R², so occupation is 'relevant' • But then you are asking 'what is the effect of education on earnings holding occupation constant?' – perhaps not what we want
Will focus on econometric issues • What are the issues we need to worry about: • Omitted variables • Measurement error • Will discuss these issues in turn • Slight change in notation to more standard form • Run regression of y on X1, X2, etc – want causal effect of X1 on y
Omitted Variable Issues • Basic Model is: y=X1β1+X2β2+ε • Two issues: • What happens if we include X2 when it is irrelevant (β2=0)? • What happens if we exclude X2 when it is relevant (β2≠0)?
Proposition 4.1: If X2 is irrelevant • OLS estimate of β1 is consistent, so no problem here (not surprising – we are imposing a 'true' restriction on the data) • But there is a cost – lower precision in the estimate of β1
Proof of Proposition 4.1a • Many ways to prove this • Can just read it off from: β̂ = β + (X′X)⁻¹X′ε, so plim β̂1 = β1 (and plim β̂2 = 0) • Or use the result from the partitioned regression model: β̂1 = (X1′M2X1)⁻¹X1′M2y, where M2 = I − X2(X2′X2)⁻¹X2′
Proof of Proposition 4.1b – Method 1 • Using results from the partitioned regression model, can write the OLS estimate of β1 as: β̂1 = (X1′M2X1)⁻¹X1′M2y • This is linear in y but generally different from the OLS estimate when X2 is excluded • Can invoke the Gauss–Markov theorem: since β2 = 0, the short regression satisfies the classical assumptions, so its OLS estimator is BLUE and any other linear unbiased estimator (such as β̂1 above) has weakly larger variance – note the use of the irrelevance of X2 here
Proof of Proposition 4.1b – Method 2 (X1 and X2 one-dimensional) • This uses results from the notes on experiments • If we exclude X2 then the variance of the coefficient on X1 is: Var(β̂1) = σ0² / Σi(X1i − X̄1)² • If we include X2 then it is: Var(β̂1) = σ² / [Σi(X1i − X̄1)²(1 − ρ12²)], where ρ12 is the sample correlation between X1 and X2 • If X2 is irrelevant then σ0² = σ², so including X2 can only raise the variance
What determines the size of the loss of precision? • The bigger the correlation between X1 and X2, the greater the likely loss in precision • To see this: if X1 and X2 are uncorrelated, the two estimates are identical • Consider the extreme case of perfect correlation – then there is perfect multicollinearity if X2 is included • Also useful to think of Proposition 4.1b as a specific application of the general principle that imposing a 'true' restriction on the parameters (here β2 = 0) improves the precision of the estimates of the other parameters – a gain in efficiency
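The precision cost can be seen in a small Monte Carlo sketch (simulated data with numpy; the numbers – correlation 0.9, n = 200, 500 replications – are purely illustrative):

```python
import numpy as np

# Monte Carlo: X2 is irrelevant (beta2 = 0) but highly correlated with X1.
# Compare the sampling spread of the estimate of beta1 with and without X2.
rng = np.random.default_rng(0)
n, reps, beta1 = 200, 500, 1.0
draws_short, draws_long = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.9
    y = beta1 * x1 + rng.normal(size=n)                       # x2 plays no role in y
    Xs = np.column_stack([np.ones(n), x1])        # short regression: constant, x1
    Xl = np.column_stack([np.ones(n), x1, x2])    # long regression: adds irrelevant x2
    draws_short.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    draws_long.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])
print("sampling sd excluding X2:", np.std(draws_short))
print("sampling sd including X2:", np.std(draws_long))
```

Both estimators centre on the true β1 = 1, but the spread of the long-regression estimate is noticeably larger – the 1/(1 − ρ12²) variance inflation from the formula above.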
Contrast with the earlier result on other variables in experimental data • Here, inclusion of irrelevant variables correlated with X reduces precision • Earlier, inclusion of relevant variables uncorrelated with X increased precision • The results are consistent: • Including relevant variables increases precision • Including variables correlated with X reduces precision • The effect on precision of including a relevant variable correlated with X is ambiguous
Excluding Relevant Variables • Leads to omitted variable bias if X1 and X2 are correlated: in the one-regressor case, plim β̂1 = β1 + β2·Cov(X1, X2)/Var(X1)
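The bias formula can be checked in a short simulation (illustrative numbers, numpy): with β1 = β2 = 1 and Cov(X1, X2)/Var(X1) = 0.5, the short regression should converge to 1.5, not 1.

```python
import numpy as np

# Omitted variable bias: y depends on x1 and x2, but we regress y on x1 alone.
# plim of short-regression slope = beta1 + beta2 * Cov(x1, x2)/Var(x1) = 1 + 0.5 = 1.5
rng = np.random.default_rng(1)
n, beta1, beta2 = 100_000, 1.0, 1.0
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # x2 correlated with x1
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
print("short-regression slope:", b_short)  # close to 1.5, not the true 1.0
```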
Is it better to exclude the relevant or include the irrelevant? • Omitting relevant variables causes bias • Including irrelevant variables causes lower precision • Might conclude it is better to err on the side of caution and include lots of regressors – the 'kitchen sink' approach • But: • May be prepared to accept some bias for extra precision • Can worsen problems of measurement error
Measurement Error • True value is X* • True model is: y = X*β + ε • But X* is observed with error – the observed value is X • Measurement error has the classical form: X = X* + u, E(u|X*) = 0 • Can write the model in terms of observables as: y = Xβ + (ε − uβ) • X is correlated with the composite error (ε − uβ), so there is bias in the OLS estimate
Proposition 4.2: With one regressor (with classical measurement error) the plim of the slope coefficient is: plim β̂ = β·σ²X* / (σ²X* + σ²u) • OLS estimate is biased towards zero – this is attenuation bias • The extent of the bias depends on the importance of the measurement error – the signal-to-noise ratio, σ²X*/σ²u, or the reliability ratio, σ²X*/(σ²X* + σ²u)
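A quick simulation of attenuation bias (illustrative parameters, numpy): with Var(X*) = Var(u) = 1 the reliability ratio is 0.5, so a true slope of 2 should be estimated near 1.

```python
import numpy as np

# Attenuation bias: true slope is 2, Var(x*) = Var(u) = 1,
# so plim of the OLS slope = 2 * 1/(1 + 1) = 1.
rng = np.random.default_rng(2)
n, beta = 100_000, 2.0
x_star = rng.normal(size=n)
x = x_star + rng.normal(size=n)           # classical measurement error
y = beta * x_star + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0][1]
print("estimated slope:", b)  # close to 1, half the true value
```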
The General Case • Same model as before, but now X is more than one-dimensional • Some notation and assumptions: X = X* + u, E(u|X*) = 0, Var(X*) = ΣX*X* • Covariance matrix of u is Σ
Proposition 4.3: The plim of the OLS estimator with many error-ridden regressors is: plim β̂ = (ΣX*X* + Σ)⁻¹ΣX*X*β
Matrix equivalent of attenuation bias • But, in the general case, it is hard to say anything about the direction of the bias on any single coefficient • If ΣX*X* and Σ are both diagonal then all coefficients are biased towards zero
An Informative Special Case • Two variables: one measured with error, the other measured without error
Proposition 4.4: Attenuation Bias of the Error-Ridden Variable Worsens when Other Variables are Included: plim β̂1 = β1·σ²1*(1 − ρ12²) / (σ²1*(1 − ρ12²) + σ²u) • Where ρ12 is the correlation between X1* and X2 • If ρ12 ≠ 0 this attenuation bias is worse than when X2 is excluded • Intuition: X2 soaks up some of the signal in X1, leaving more noise in what remains
Proposition 4.5: The Presence of Error-Ridden Variables Causes Inconsistency in the Coefficients of Other Variables • The OLS estimate of β2 is inconsistent if X1 and X2 are correlated (σ12 ≠ 0) • Mirror image of the previous result – X2 soaks up some of the true variation in X1
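Propositions 4.4 and 4.5 can both be illustrated in one simulation (hypothetical parameters, numpy). Here x2 is clean, correlated 0.7 with x1*, and truly irrelevant (β2 = 0); x1 = x1* + u with Var(u) = Var(x1*) = 1:

```python
import numpy as np

# Short regression (y on x1 alone): slope -> beta1 * 1/(1+1) = 1 (plain attenuation).
# Long regression (y on x1, x2): attenuation worsens (slope -> ~0.68)
# and x2 picks up a spurious positive coefficient (-> ~0.93) despite beta2 = 0.
rng = np.random.default_rng(3)
n, beta1 = 200_000, 2.0
x1_star = rng.normal(size=n)
x2 = 0.7 * x1_star + np.sqrt(1 - 0.7**2) * rng.normal(size=n)  # clean, corr 0.7
x1 = x1_star + rng.normal(size=n)             # classical measurement error, var 1
y = beta1 * x1_star + rng.normal(size=n)      # x2 truly irrelevant
b_short = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
b_long = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]
print("slope on x1 alone:", b_short)
print("slope on x1 with x2 included:", b_long[1])
print("spurious coefficient on x2:", b_long[2])
```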
An Extreme Case • Observed X1 is all noise, σ²u → ∞ – its coefficient will be zero • Then we get: plim β̂2 = β2 + β1·σ12/σ²2 • Should recognise this as the formula for omitted variable bias when X1 is excluded
Measurement error in the Dependent Variable • Suppose classical measurement error in y: y = y* + u • Assume u uncorrelated with y*, X • Then: y = Xβ + ε + u • X is uncorrelated with u, so OLS is consistent • But there is a loss in precision, so there is still a cost to bad data
Example 2: Including Variables at a Higher Level of Aggregation • X* is an individual-level variable • Only observe its average value at some higher level of aggregation (e.g. village, industry, region) – call this X • Model for the relationship between X and X*: X* = X + u, E(u|X) = 0 • Note the change in format from the classical case
In the regression we have: y = X*β + ε = Xβ + uβ + ε • X and u are uncorrelated, so no inconsistency in the OLS estimate of the coefficient • But not ideal: • Loss in precision – less variation in X • Limits the way we can model other higher-level variables • May cause inconsistency in the coefficients on other variables, as E(u|X, Z) will depend on Z
Summary of results on omitted variables and measurement error • Including irrelevant variables leads to loss in precision • Excluding relevant variables leads to omitted variables bias • Measurement error in X variables typically causes attenuation bias in coefficients • Inclusion of other variables worsens attenuation bias (though may reduce omitted variables bias)
Strategies For Omitted Variables/Measurement Error • One strategy for dealing with omitted variables is to get data on the variable and include it • One strategy for dealing with measurement error is to get better-quality data • These are good strategies but may be easier said than done • IV offers another approach if an instrument can be argued to be correlated with the true value of the variable of interest and uncorrelated with the measurement error/omitted variable
Clustered Standard Errors • In many situations individuals are affected by variables that operate at a higher level, e.g. industry, region, economy • Call this higher level a group or cluster • Can include group-level variables in the regression • May be difficult to control for all relevant group-level variables, so common practice is to include a dummy variable for each group • These dummy variables will capture the impact of all group-level variables
Can write this model as: y = Xβ + Dθ + ε • Where D is the (N×G) matrix of group dummies and θ the vector of group-level effects (assume mean zero) • Will often see this, but: • Low precision if the number of groups is large (only exploits within-group variation in X) • Can't identify the effect of a group-level variable X
Let's think some more about this case • Might think about dropping the group-level dummies and simply estimating: y = Xβ + ε • But this assumes the covariance between the residuals of individuals in the same group is zero – a very strong assumption • A half-way house is to think of θ not as parameters to be estimated but as 'errors' that operate at the level of the group • Assume θ uncorrelated with X, ε
An Error Component Model • The error for individual i in group g can be written as: ui = θg + εi • The variance of this error is: Var(ui) = σ²θ + σ²ε • The correlation between the errors of individuals in the same group (zero for those not in the same group) is: ρ = σ²θ/(σ²θ + σ²ε)
Why is this? • For individuals i and j in the same group g: Cov(ui, uj) = Cov(θg + εi, θg + εj) = σ²θ • As they have the same group-level component • For individuals in different groups the covariance is zero, as they have different (and assumed independent) group-level components
Implications • The covariance matrix of the composite errors, ui, will no longer be diagonal – denote it by σ²Ω • The OLS estimate will still be consistent (though not efficient) • The computed standard errors will be inconsistent – they should be computed as: Var(β̂) = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹
With this particular error component model: Ω = (1 − ρ)I + ρDD′, so Var(β̂) = σ²(X′X)⁻¹ + ρσ²(X′X)⁻¹(X′DD′X − X′X)(X′X)⁻¹ • i.e. the usual formula plus something • The usual formula will be wrong if the second term is non-zero
Can say more…. • (X′D) is a (k×G) matrix whose (k, g) element is the sum of the values of Xk for those in group g • Suppose all groups are of equal size, Ng = (N/G) • Define a (G×k) matrix X̄ of the average values of X in each group, so that X′D = NgX̄′
Using this in the previous expression… • For the case of one regressor the variance of the slope coefficient will be: Var(β̂) = [σ²/(N·Var(Xi))]·[1 + ρ(Ng·Var(Xg)/Var(Xi) − 1)] • Where Var(Xi) is the variance of X across individuals and Var(Xg) is the variance of X across groups
Case I: X correlation within and between groups the same • Then Var(Xg) = Var(Xi)/Ng, so Var(β̂) = σ²/(N·Var(Xi)) • i.e. the usual formula is correct • Implies no (or only a small) problem with the standard errors of variables which do not have much group-level variation
Case II: Group-Level Regressor • Then Var(Xg) = Var(Xi), so Var(β̂) = [σ²/(N·Var(Xi))]·[1 + (Ng − 1)ρ] • The standard formula understates the true variance by a factor related to the importance of the group-level shock (ρ) and the size of the groups (Ng)
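The understatement for a group-level regressor can be checked by direct linear algebra rather than simulation (illustrative group structure – G = 10 groups of Ng = 20 with ρ = 0.5 – chosen for the example):

```python
import numpy as np

# Verify: for a purely group-level regressor, the true sampling variance is
# exactly (1 + (Ng - 1) * rho) times the usual OLS formula.
G, Ng, rho, sigma2 = 10, 20, 0.5, 1.0
N = G * Ng
groups = np.repeat(np.arange(G), Ng)
rng = np.random.default_rng(4)
x = rng.normal(size=G)[groups]               # same X for everyone in a group
X = np.column_stack([np.ones(N), x])
D = (groups[:, None] == np.arange(G)[None, :]).astype(float)  # group dummies
Omega = (1 - rho) * np.eye(N) + rho * D @ D.T  # error-component covariance (up to sigma2)
XtX_inv = np.linalg.inv(X.T @ X)
V_true = sigma2 * XtX_inv @ X.T @ Omega @ X @ XtX_inv  # sandwich formula
V_usual = sigma2 * XtX_inv                             # naive iid formula
print(V_true[1, 1] / V_usual[1, 1], 1 + (Ng - 1) * rho)  # both 10.5
```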
An even more special case…. • All individuals within groups are clones – ρ = 1 • Then the variance is Ng times the usual formula: really only have G observations • Simplest to estimate at the group level • But group-level estimation generally causes a loss in efficiency, so not the best solution
Dealing with this in practice…. • Stata has an option to compute standard errors allowing for clustering: . reg y x1 x2, cl(x3) • Such standard errors are said to be clustered, with the 'cluster' being x3 • So quite easy to do in practice
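For intuition about what the clustered option computes, here is a numpy sketch of the cluster-robust (sandwich) formula on simulated data – in practice Stata's cl() option (or equivalents in other packages) does this for you; the data-generating numbers are hypothetical:

```python
import numpy as np

# Simulate a group-level regressor (like a regional u-rate), an individual-level
# regressor (like sex), and a group-level error component, then compare
# conventional iid standard errors with cluster-robust ones.
rng = np.random.default_rng(5)
G, Ng = 50, 20
N = G * Ng
groups = np.repeat(np.arange(G), Ng)
x_grp = rng.normal(size=G)[groups]             # varies only across groups
x_ind = rng.normal(size=N)                     # varies across individuals
theta = rng.normal(size=G)[groups]             # group-level error component
y = 1.0 * x_ind + 1.0 * x_grp + theta + rng.normal(size=N)
X = np.column_stack([np.ones(N), x_ind, x_grp])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# Conventional (iid) standard errors
s2 = e @ e / (N - X.shape[1])
se_iid = np.sqrt(np.diag(s2 * XtX_inv))

# Cluster-robust sandwich: meat = sum over groups of (X_g' e_g)(X_g' e_g)'
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(G):
    m = groups == g
    v = X[m].T @ e[m]
    meat += np.outer(v, v)
se_cl = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print("iid SEs (const, x_ind, x_grp):", se_iid)
print("clustered SEs (const, x_ind, x_grp):", se_cl)
```

Clustering barely moves the standard error on the individual-level regressor but inflates that on the group-level regressor substantially – the pattern in the LFS example below.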
An example – the effect of gender and regional unemployment on wages • Data from the UK LFS • Would expect the gender mix not to vary much between regions, so most of its variation is within region • The unemployment rate only has variation at the regional level • Would expect clustering to increase the standard error on gender only a little, but that on the u-rate a lot
No clustering

     logwage |      Coef.   Std. Err.        t
-------------+---------------------------------
         sex |  -.2285092    .0091228   -25.05
       urate |   1.057465    .3928981     2.69
       _cons |   2.447221    .0228265   107.21
-----------------------------------------------
With clustered standard errors

             |                Robust
     logwage |      Coef.   Std. Err.        t
-------------+---------------------------------
         sex |  -.2285092    .0110932   -20.60
       urate |   1.057465    2.943567     0.36
       _cons |   2.447221    .1494707    16.37
-----------------------------------------------

As predicted by theory
Conclusions • Good practice to cluster the standard errors if not going to include group-level dummies • This is particularly important for group-level regressors – standard errors will otherwise often be much too low