Learn how to choose the right variables for your model to maximize accuracy and avoid common pitfalls. Understand the impact of including or excluding variables and how to make informed decisions. Discover the significance of variables in regression analysis.
Regression: Choosing Variables LIR 832 November 14, 2006
Topics of the Day… • Choosing Independent Variables • What variables should be in a model? • What is the effect of leaving out important variables? • What is the effect of adding in irrelevant variables? • How do we decide about this? Why not just toss everything in and let our t-stats or r-square solve this for us?
Example: Effect of Unions (x) on Weekly Earnings (y)

reg lnwage cbc2

      Source |       SS         df       MS          Number of obs =  156130
-------------+--------------------------------       F(  1,156128) = 3897.11
       Model |  1234.14281       1  1234.14281       Prob > F      =  0.0000
    Residual |  49442.8436  156128  .316681464       R-squared     =  0.0244
-------------+--------------------------------       Adj R-squared =  0.0243
       Total |  50676.9864  156129  .324584071       Root MSE      =  .56274

------------------------------------------------------------------------------
     lnwage3 |      Coef.  Std. Err.        t   P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cbc2 |   .2488057   .0039856    62.43   0.000     .2409941    .2566173
       _cons |   2.469369    .001545  1598.30   0.000     2.466341    2.472397
------------------------------------------------------------------------------
Example: Effect of Unions (x) on Weekly Earnings (y)

reg lnwage cbc2 age

      Source |       SS         df       MS          Number of obs =  156130
-------------+--------------------------------       F(  2,156127) = 7530.01
       Model |  4458.26229       2  2229.13115       Prob > F      =  0.0000
    Residual |  46218.7241  156127  .296032871       R-squared     =  0.0880
-------------+--------------------------------       Adj R-squared =  0.0880
       Total |  50676.9864  156129  .324584071       Root MSE      =  .54409

------------------------------------------------------------------------------
     lnwage3 |      Coef.  Std. Err.        t   P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cbc2 |   .2014921     .00388    51.93   0.000     .1938874    .2090969
         age |   .0111539   .0001069   104.36   0.000     .0109444    .0113634
       _cons |   2.043437   .0043461   470.17   0.000     2.034918    2.051955
------------------------------------------------------------------------------
reg lnwage cbc2 age female married black other NE Midwest South city1mil ed3 ed4 aa ed6 ed7

      Source |       SS         df       MS          Number of obs =  156130
-------------+--------------------------------       F( 15,156114) = 5888.11
       Model |  18311.0587      15  1220.73725       Prob > F      =  0.0000
    Residual |  32365.9277  156114   .20732239       R-squared     =  0.3613
-------------+--------------------------------       Adj R-squared =  0.3613
       Total |  50676.9864  156129  .324584071       Root MSE      =  .45533

------------------------------------------------------------------------------
     lnwage3 |      Coef.  Std. Err.        t   P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cbc2 |   .1360972   .0032913    41.35   0.000     .1296462    .1425481
         age |   .0067085    .000096    69.85   0.000     .0065203    .0068968
      female |  -.2151269    .002322   -92.65   0.000    -.2196779   -.2105759
     married |    .127496   .0025106    50.78   0.000     .1225752    .1324168
       black |  -.0645881   .0039931   -16.17   0.000    -.0724145   -.0567617
       other |  -.0454844   .0052715    -8.63   0.000    -.0558164   -.0351524
          NE |   .0089504   .0034877     2.57   0.010     .0021146    .0157862
     Midwest |  -.0148798   .0033238    -4.48   0.000    -.0213944   -.0083653
       South |  -.0260961   .0032539    -8.02   0.000    -.0324736   -.0197186
    city1mil |   .1118365   .0023835    46.92   0.000     .1071648    .1165081
         ed3 |   .2875855   .0038465    74.77   0.000     .2800464    .2951246
         ed4 |   .3676268   .0041132    89.38   0.000      .359565    .3756885
          aa |   .4949227   .0050869    97.29   0.000     .4849525    .5048929
         ed6 |   .7416187   .0042642   173.92   0.000     .7332609    .7499764
         ed7 |    .896922    .005259   170.55   0.000     .8866146    .9072295
       _cons |   1.813933   .0050728   357.58   0.000     1.803991    1.823876
------------------------------------------------------------------------------
reg lnwage cbc2 age female married black other NE Midwest South city1mil ed3 ed4 aa ed6 ed7 manager prof tech sales privhh protect servocc farmer craft oper transop laborer

      Source |       SS         df       MS          Number of obs =  156130
-------------+--------------------------------       F( 27,156102) = 4558.99
       Model |  22342.7173      27  827.508049       Prob > F      =  0.0000
    Residual |  28334.2691  156102  .181511249       R-squared     =  0.4409
-------------+--------------------------------       Adj R-squared =  0.4408
       Total |  50676.9864  156129  .324584071       Root MSE      =  .42604

------------------------------------------------------------------------------
     lnwage3 |      Coef.  Std. Err.        t   P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cbc2 |   .1348609   .0031501    42.81   0.000     .1286866    .1410351
         age |   .0056959   .0000906    62.84   0.000     .0055183    .0058736
      female |  -.1960792   .0023927   -81.95   0.000    -.2007688   -.1913895
     married |   .0945142   .0023617    40.02   0.000     .0898854    .0991431
       black |  -.0497951   .0037475   -13.29   0.000      -.05714   -.0424501
       other |  -.0287192   .0049378    -5.82   0.000    -.0383971   -.0190413
          NE |   .0106994   .0032661     3.28   0.001     .0042979    .0171009
     Midwest |  -.0160232   .0031147    -5.14   0.000    -.0221278   -.0099185
       South |     -.0345    .003048   -11.32   0.000     -.040474    -.028526
    city1mil |   .1006931   .0022359    45.04   0.000     .0963108    .1050754
         ed3 |   .2163545   .0036596    59.12   0.000     .2091817    .2235273
         ed4 |   .2570192   .0039814    64.55   0.000     .2492157    .2648228
          aa |   .3307331   .0049498    66.82   0.000     .3210316    .3404345
         ed6 |   .5085537    .004477   113.59   0.000     .4997789    .5173285
         ed7 |   .6125842   .0056601   108.23   0.000     .6014905    .6236779
     manager |   .3553568   .0039626    89.68   0.000     .3475901    .3631235
        prof |   .2786787   .0041472    67.20   0.000     .2705503    .2868071
        tech |   .2750721   .0062083    44.31   0.000      .262904    .2872401
       sales |   .0288982   .0040054     7.21   0.000     .0210478    .0367487
      privhh |  -.3069562   .0139645   -21.98   0.000    -.3343264   -.2795861
     protect |   .0610202   .0081706     7.47   0.000      .045006    .0770344
     servocc |  -.3478074   .0052614   -66.11   0.000    -.3581196   -.3374952
      farmer |  -.1941755   .0089707   -21.65   0.000    -.2117578   -.1765931
       craft |   .1923506   .0043155    44.57   0.000     .1838922    .2008089
        oper |   .0161818   .0051605     3.14   0.002     .0060673    .0262963
     transop |  -.0171413   .0066874    -2.56   0.010    -.0302485    -.004034
     laborer |  -.1110402   .0058008   -19.14   0.000    -.1224096   -.0996708
       _cons |   1.896043   .0055862   339.42   0.000     1.885094    1.906992
------------------------------------------------------------------------------
Example: Effect of Unions (x) on Weekly Earnings (y) • Some observations…: • The returns to union membership are sensitive to age and educational attainment. Union members tend to be older and have higher educational attainment than other members of the labor force. Once we control for those factors, estimated returns to union membership are lower. • Similarly, union members tend to be male. Absent a control for gender, part of the male wage advantage is attributed to union membership. • In contrast with the first two points, after all the other controls, further control for occupation doesn’t really do very much.
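To make the drop in the union effect concrete, the four cbc2 coefficients reported in the regressions above can be converted into approximate percentage wage gaps. This is only a sketch: it assumes lnwage is the natural log of the wage and uses the standard exp(b) - 1 conversion for a dummy variable, neither of which is stated on the slides. The dictionary labels are informal shorthand introduced here.

```python
import math

# cbc2 coefficients from the four regressions above; keys are informal labels.
union_coef = {
    "no controls": 0.2488057,
    "+ age": 0.2014921,
    "+ demographics, region, education": 0.1360972,
    "+ occupation": 0.1348609,
}

# Assumption: lnwage is the natural log of the wage, so exp(b) - 1 is the
# proportional wage gap implied by a dummy-variable coefficient b.
pct_gap = {label: 100 * (math.exp(b) - 1) for label, b in union_coef.items()}

for label, pct in pct_gap.items():
    print(f"{label:35s} {pct:5.1f}%")
```

Read this way, most of the decline comes from adding age, demographics, region, and education; adding occupation moves the estimate very little.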
Example: Effect of Unions (x) on Weekly Earnings (y) • Conclusions: • What you have in the model may affect your estimates. • This is not always the case. • Linguistics: • We call the variables we place in models to remove the effects of correlates of the variables we are interested in “CONTROLS”. They are there to control for other factors that influence our dependent variable.
Choosing Model Specification (“What variables do I use?”) • Q: How do we decide what should be in the model? • A: It depends on the question we are trying to answer. • Example: If we just want to know how much more a union member earns than a non-member overall, then our first estimate is fine. • Example: If we want to measure how much union membership increases earnings, all else equal (ceteris paribus), then we need to build a regression model that controls for the other influences on earnings… • Education • Occupation • Experience • Gender • And on and on…
What is Misspecification? • “Misspecification” is: • 1. Omitting variables that should be included. • 2. Adding variables that should not be included.
Omitted Variables • Let’s define the “true” model as the correct model for explaining the issue. We are going to work with population models so we don’t have the added problem of sampling variability. Let’s write this out in our typical form:
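The equation for this slide did not survive extraction. A reconstruction consistent with the bias formula α1 = β1 + β2γ1 given later in the deck, assuming two explanatory variables and the label (1), would be:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \tag{1}
```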
Omitted Variables • Now, suppose we estimate a model leaving out X2:
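The estimated equation is also missing from this slide. Using the α notation that appears later in the deck, a plausible form is the following; the error symbol ε* and the label (2) are assumptions:

```latex
Y = \alpha_0 + \alpha_1 X_1 + \varepsilon^{*} \tag{2}
```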
Omitted Variables • Let’s rewrite the first equation so that it looks like the second equation: • 1. Our error term, in { }, now contains both ε and β2X2 (since both are omitted and therefore unobserved). • 2. The problem: If X2 is correlated with X1, then the coefficient on X1 will pick up both the effect of X1 and the effect of X2.
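The rewritten equation is missing from this slide. Assuming the true model is Y = β0 + β1X1 + β2X2 + ε, grouping the omitted term with the error in braces would give the following; the label (3) is inferred from the later reference to "equation 3":

```latex
Y = \beta_0 + \beta_1 X_1 + \{\beta_2 X_2 + \varepsilon\} \tag{3}
```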
Omitted Variables • Let’s think about the effects of the correlation of X1 and X2 using regression:
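The auxiliary regression is missing from this slide. Given that γ1 is later described as the "correlation" of the omitted variable with the explanatory variable, a consistent reconstruction is a regression of X2 on X1; the intercept γ0 and the error symbol υ are assumptions:

```latex
X_2 = \gamma_0 + \gamma_1 X_1 + \upsilon
```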
Omitted Variables • Now let’s substitute this expression for X2 into equation 3.
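The substituted equation is missing from this slide. Assuming the true model Y = β0 + β1X1 + β2X2 + ε and an auxiliary regression X2 = γ0 + γ1X1 + υ (γ0 and υ are assumed symbols), the substitution would read:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2\left(\gamma_0 + \gamma_1 X_1 + \upsilon\right) + \varepsilon
```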
Omitted Variables • Indulge in a little artful re-arranging of terms: • In the final model, our α’s combine the effect of X1 and of X2, so we are not getting the pure effect of X1.
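The rearranged equation is missing from this slide. Under the same assumptions (true model Y = β0 + β1X1 + β2X2 + ε, auxiliary regression X2 = γ0 + γ1X1 + υ, with γ0, υ and ε* as assumed symbols), collecting terms gives a form that matches the bias formula α1 = β1 + β2γ1 stated later in the deck:

```latex
\begin{aligned}
Y &= (\beta_0 + \beta_2\gamma_0) + (\beta_1 + \beta_2\gamma_1)\,X_1 + (\beta_2\upsilon + \varepsilon) \\
  &= \alpha_0 + \alpha_1 X_1 + \varepsilon^{*},
\qquad \alpha_0 = \beta_0 + \beta_2\gamma_0,\quad \alpha_1 = \beta_1 + \beta_2\gamma_1 .
\end{aligned}
```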
Omitted Variables: What We Have Learned • As our union example indicated, omission of important influences can bias measured effects:
Omitted Variables: What We Have Learned • 1. As the last estimate indicates, some types of variables do not make a substantial difference. • 2. The bias imparted by omitted variables will be driven by: • A. The magnitude of the effect of the omitted variable. • B. The strength of its correlation with other variables in the model.
Omitted Variables: What We Have Learned • Omitted variable bias: • α1 = β1 + β2γ1 • The bias in α1 is β2γ1 • So the magnitude of the bias is related to: • β2, the effect of the omitted variable on the dependent variable • If the effect is small (β2 is close to zero), then there isn’t much bias • γ1, the “correlation” of the omitted variable with the explanatory variable • If the “correlation” is low (γ1 is close to zero), then there isn’t much bias.
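The formula α1 = β1 + β2γ1 can be checked with a small simulation. The sketch below is illustrative only: the parameter values (β1 = 0.5, β2 = 1.0, γ1 = 0.8), the sample size, the seed, and the helper `slope` are all assumptions introduced here, not part of the lecture data. It uses only the Python standard library.

```python
import random

def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(832)
n = 50_000
beta1, beta2, gamma1 = 0.5, 1.0, 0.8  # hypothetical "true" values

# x2 is correlated with x1 through gamma1; y depends on both.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [gamma1 * a + random.gauss(0, 1) for a in x1]
y = [2.0 + beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

alpha1_hat = slope(x1, y)    # short regression: x2 omitted
gamma1_hat = slope(x1, x2)   # auxiliary regression of x2 on x1
predicted = beta1 + beta2 * gamma1  # what the bias formula says alpha1 should be

print(f"alpha1 estimated: {alpha1_hat:.3f}")
print(f"beta1 + beta2*gamma1: {predicted:.3f}")
print(f"gamma1 estimated: {gamma1_hat:.3f}")
```

Setting either `beta2` or `gamma1` to zero in this sketch should bring the short-regression slope back close to `beta1`, which is the point of the two bullets above.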
Omitted Variables: Example • Q: Why is omitted variable bias a problem? • A: An Example from Safety and Health Research: • The theory of compensating differentials suggests that increased risk of death by industry and occupation will result in higher earnings as a “compensating wage differential.” • Typical micro-data model for estimating this has been something of the form: • Where we have a plain vanilla wage equation and add a measure of risk of death by industry or occupation.
Omitted Variables: Example • A typical wage regression of this type indicates that wages are raised by around the apparently minuscule 0.05% for each increase in fatalities of 1 in 100,000 employees. With median U.S. annual earnings of $35,000, this modest increment works out to: • 0.0005*35,000 = $17.50 annually per worker • 100,000 * $17.50 = $1,750,000 per fatality. • The implicit value of life is then $1,750,000 purely through the wage mechanism, not life insurance. • This has been used to argue that the market adjusts for risk. The policy implication is that there isn’t a great need for government intervention in safety and health.
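The implicit value-of-life arithmetic above can be written out step by step. The variable names are introduced here for illustration; the numbers are the ones on the slide.

```python
# Figures from the slide: a 0.05% wage premium per additional fatality
# per 100,000 workers, and median annual earnings of $35,000.
wage_effect = 0.0005        # 0.05% expressed as a proportion
median_earnings = 35_000    # dollars per year
workers = 100_000           # workers sharing one additional expected fatality

premium_per_worker = wage_effect * median_earnings   # dollars per worker per year
implied_value_of_life = premium_per_worker * workers # dollars per fatality

print(f"Premium per worker: ${premium_per_worker:,.2f}")
print(f"Implied value per fatality: ${implied_value_of_life:,.0f}")
```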
Omitted Variables: Example • However, there is a separate literature which suggests that industry factors other than risk of death affect wages. These include: • Capital-labor ratios • Size of establishment • Value added per worker • Industry unemployment rates • Female Density • Union Density
Omitted Variables: Example • Issue: Are the measured returns to risk accurately measured, or is there a problem with omitted variable bias because other industry factors have not been included in the equation? If so, what is the compensating differential once we control for other industry factors?
Omitted Variables: Example • Question examined in: “Wage Compensation for Dangerous Work Revisited” Dorman and Hagstrom (ILRR, 1998, Vol 52, Number 1). • Strategy for estimation: • 1. Estimate a prototypical wage model with control for risk. • 2. Add controls for industry in two forms. • First, add dummy variables for industries (mining, construction, durable mfg, non-durable mfg) to examine the effect. • Second, replace the dummies with industry characteristics including Value Added, establishment size, assets per employee, percent female.
Omitted Variables: Example • Data used: Panel Study of Income Dynamics (PSID). • Measures of occupational risk include: • NTOF: National Traumatic Occupational Fatality: frequency of fatalities per 100,000 workers by state and industry • Lost work day cases due to occupational injuries in 1981 per 100 workers by industry. • Used male samples for construction, mining and manufacturing
Omitted Variables: Example • Estimation Strategy: • Estimate the plain vanilla return to risk equation • Divide between union and nonunion to determine union effect • Add industry controls as dummies or as measures
Omitted Variables: Example • Examining the output: • Note difference in effects by union and non-union • Union effect is larger and remains fairly similar across estimates. • Non-union effects: • Smaller in magnitude • Much more sensitive to change in specification • NTOF falls toward non-significance • Injury days becomes negative and highly significant. • Conclusion: • Not much evidence of compensating differentials for non-union workers. • Specification matters a lot.
Omitted Variables: Summary • Problem of important omitted variables: • If explanatory variables are omitted from your equation and they are correlated with variables that are included in the model, your estimated coefficients will not reflect just the effect of the included variable; they will also pick up the effect of the omitted variable. • Your coefficients are, in a sense, wrong or biased: they systematically overshoot or undershoot.
Correcting Omitted Variable Bias • Possible approaches to omitted variable bias: • The problem: My illustrations are misleading, as they generally presume that you have the data and left it out by mistake. If you don’t have the data, you cannot go through this exercise; you are stuck with omitted variable bias. What should you do? • If you are reasonably concerned about omitted variable bias in a study you can: • Get the damn data. This is one reason you plan in advance. It is costly to try to go back, possibly impossible. • Use a proxy for the data which you would prefer to have. • You may not have exactly the variable which you would like to use, but you may be able to find an alternative which is close and largely eliminates the problem of omitted variable bias. • The better is the enemy of the good • Example: you would like to control for years of education, but only have a measure of no high school, high school degree and college degree. These three indicator variables are proxies for the preferred measure of education.
Omitted Variable Bias: Example

The regression equation is
weekearn = - 402 + 6.29 age - 319 female + 76.4 years ed

47576 cases used, 7582 cases contain missing values

Predictor       Coef   SE Coef        T       P
Constant     -401.76     18.87   -21.29   0.000
age           6.2874    0.2021    31.11   0.000
female      -318.522     4.625   -68.87   0.000
years ed      76.432     1.089    70.16   0.000

S = 500.391   R-Sq = 20.8%   R-Sq(adj) = 20.8%
Omitted Variable Bias: Example

The regression equation is
weekearn = 339 + 6.64 age - 324 female + 224 HS + 273 SC + 319 AA + 505 BA + 650 Grad

47576 cases used, 7582 cases contain missing values

Predictor       Coef   SE Coef        T       P
Constant      338.58     20.36    16.63   0.000
age           6.6430    0.2039    32.58   0.000
female      -324.168     4.626   -70.07   0.000
HS            224.05     19.80    11.32   0.000
SC            272.97     19.60    13.93   0.000
AA            319.43     20.12    15.88   0.000
BA            504.83     18.98    26.60   0.000
Grad          649.96     19.19    33.86   0.000

S = 500.268   R-Sq = 20.8%   R-Sq(adj) = 20.8%

Q: Which direction is the bias?
Irrelevant Variables • Q: What happens if you add variables to a model that do not belong there? • A: If it is really irrelevant…: • The coefficient on that variable will be close to, or equal to, zero. • Other coefficients are unchanged or don’t change much. • The standard errors of the coefficients will be larger than they would be if that variable were not included. • t-tests will be less likely to reject the null hypothesis than with the correct specification. • This won’t matter as much when working with moderately large data sets.
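A small simulation can illustrate the first two bullets (a coefficient near zero on the irrelevant variable, and little change in the other coefficient). It does not attempt to show the effect on standard errors. Everything in this sketch is an assumption made for illustration: the true slope of 0.5, the sample size, the seed, and the helpers `slope` and `ols2`. It uses only the Python standard library.

```python
import random

def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def ols2(x1, x2, y):
    """OLS slopes of y on x1 and x2 (with an intercept), solving the
    two-variable normal equations in deviations from means."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

random.seed(2006)
n = 20_000
true_slope = 0.5  # hypothetical effect of x on y

x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]   # irrelevant: plays no role in y
y = [1.0 + true_slope * a + random.gauss(0, 1) for a in x]

b_x_short = slope(x, y)            # correct specification
b_x_long, b_z = ols2(x, z, y)      # specification with the irrelevant variable

print(f"slope on x without z: {b_x_short:.4f}")
print(f"slope on x with z:    {b_x_long:.4f}")
print(f"slope on z:           {b_z:.4f}")
```

Here z is generated independently of x, so the slope on x barely moves. If z were instead correlated with x, the comparison would look different.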
Irrelevant Variables: Example from Managers and Professionals Data
reg lnwage3 female black other married age age2 NE Midwest South metro ed2 ed3 ed4 aa ed6 ed7 manager prof tech sales privhh protect servocc farmer craft oper transop laborer cbc2 parttime

      Source |       SS         df       MS          Number of obs =  149649
-------------+--------------------------------       F( 30,149618) = 4338.08
       Model |  22409.2525      30  746.975084       Prob > F      =  0.0000
    Residual |  25762.7886  149618  .172190435       R-squared     =  0.4652
-------------+--------------------------------       Adj R-squared =  0.4651
       Total |  48172.0411  149648  .321902338       Root MSE      =  .41496

------------------------------------------------------------------------------
     lnwage3 |      Coef.  Std. Err.        t   P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cbc2 |   .1121468   .0031292    35.84   0.000     .1030136    .1152801
      female |  -.1788285   .0024188   -73.93   0.000    -.1835694   -.1740877
       black |  -.0623781   .0037208   -16.76   0.000    -.0696707   -.0550855
       other |  -.0357962   .0048993    -7.31   0.000    -.0453987   -.0261937
     married |   .0540003   .0024207    22.31   0.000     .0492558    .0587447
         age |   .0361599     .00052    69.54   0.000     .0351408     .037179
        age2 |  -.0003663   6.14e-06   -59.71   0.000    -.0003784   -.0003543
          NE |   .0213962   .0032505     6.58   0.000     .0150254     .027767
     Midwest |   -.009636   .0030984    -3.11   0.002    -.0157088   -.0035631
       South |  -.0476498   .0030283   -15.73   0.000    -.0535853   -.0417144
       metro |   .1089392   .0026696    40.81   0.000     .1037069    .1141716
         ed2 |   .0937357   .0062394    15.02   0.000     .0815066    .1059649
         ed3 |   .2061799   .0052296    39.43   0.000     .1959299    .2164298
         ed4 |   .2588149   .0054812    47.22   0.000     .2480718     .269558
          aa |   .3067146    .006221    49.30   0.000     .2945216    .3189076
         ed6 |   .4814624   .0058623    82.13   0.000     .4699724    .4929524
         ed7 |   .5912883   .0067514    87.58   0.000     .5780556    .6045209
     manager |   .3273871   .0039228    83.46   0.000     .3196984    .3350758
        prof |   .2712431   .0041042    66.09   0.000     .2631989    .2792873
        tech |   .2513825   .0061741    40.72   0.000     .2392814    .2634836
       sales |   .0534852   .0040032    13.36   0.000      .045639    .0613314
      privhh |  -.2463923   .0144294   -17.08   0.000    -.2746735   -.2181111
     protect |   .0620207   .0081107     7.65   0.000     .0461238    .0779175
     servocc |  -.2830721   .0054013   -52.41   0.000    -.2936586   -.2724857
      farmer |   -.182219   .0092575   -19.68   0.000    -.2003635   -.1640744
       craft |   .1584377   .0043139    36.73   0.000     .1499826    .1668929
        oper |  -.0234436   .0051645    -4.54   0.000    -.0335659   -.0133212
     transop |  -.0209505   .0067341    -3.11   0.002    -.0341491   -.0077519
     laborer |   -.096057   .0058562   -16.40   0.000     -.107535   -.0845789
    parttime |  -.1509533   .0030135   -50.09   0.000    -.1568598   -.1450469
       _cons |   1.348726   .0114219   118.08   0.000     1.326339    1.371113
------------------------------------------------------------------------------
reg lnwage3 female black other married age age2 NE Midwest South metro ed2 ed3 ed4 aa ed6 ed7 manager prof tech sales privhh protect servocc farmer craft oper transop laborer union2 parttime msafips

      Source |       SS         df       MS          Number of obs =  149649
-------------+--------------------------------       F( 31,149617) = 4209.62
       Model |  22442.0795      31  723.938047       Prob > F      =  0.0000
    Residual |  25729.9616  149617   .17197218       R-squared     =  0.4659
-------------+--------------------------------       Adj R-squared =  0.4658
       Total |  48172.0411  149648  .321902338       Root MSE      =   .4147

------------------------------------------------
     lnwage3 |      Coef.  Std. Err.        t   P>|t|
-------------+----------------------------------
        cbc2 |   .1184786   .0032596    36.35   0.000
      female |  -.1783494   .0024174   -73.78   0.000
       black |  -.0634502   .0037195   -17.06   0.000
       other |  -.0366422   .0048967    -7.48   0.000
     married |   .0542679   .0024195    22.43   0.000
         age |   .0361392   .0005196    69.55   0.000
        age2 |  -.0003661   6.13e-06   -59.72   0.000
          NE |   .0207891   .0032498     6.40   0.000
     Midwest |  -.0046121   .0031571    -1.46   0.144
       South |  -.0453435   .0030338   -14.95   0.000
       metro |   .0911277   .0032975    27.64   0.000
         ed2 |    .093557   .0062356    15.00   0.000
         ed3 |   .2061243   .0052264    39.44   0.000
         ed4 |   .2586379   .0054778    47.22   0.000
          aa |   .3066968    .006217    49.33   0.000
         ed6 |   .4815177   .0058583    82.19   0.000
         ed7 |   .5915427    .006746    87.69   0.000
     manager |   .3274591     .00392    83.54   0.000
        prof |   .2719271   .0041006    66.31   0.000
        tech |   .2515698   .0061703    40.77   0.000
       sales |   .0533251   .0039992    13.33   0.000
      privhh |  -.2474892   .0144197   -17.16   0.000
     protect |   .0611862   .0081046     7.55   0.000
     servocc |  -.2832982   .0053977   -52.49   0.000
      farmer |  -.1827259   .0092516   -19.75   0.000
       craft |   .1574825   .0043124    36.52   0.000
        oper |  -.0239928   .0051619    -4.65   0.000
     transop |  -.0216172   .0067303    -3.21   0.001
     laborer |  -.0967726   .0058531   -16.53   0.000
    parttime |  -.1508969   .0030115   -50.11   0.000
     msafips |   4.17e-08   2.50e-08     1.66   0.099
       _cons |   1.347705   .0114205   118.01   0.000
------------------------------------------------
Irrelevant Variables: Example from Managers and Professionals Data • By adding a city number (coding for city) to the wage equation: • The effect is very small in scale. The largest value is 9360 so call it 10,000. • 10,000*.00000004 = .0004 or 4/100ths of a percent. • The city # variable is barely significant in a two tailed 10% test. Pretty weak test given the size of the sample and the t-statistics we are getting for other variables. • Has little or no effect on other variables. CBC and Female barely change, change in Black is small in size (less than one pp). • This would not be the case if our irrelevant variable was correlated with some of our other variables.
Specification Criteria • Prior information: What can we learn before we start estimating? • Theory • What are you trying to measure? • Example of union effect on wages: • Do we want to know how much more union members make on average? • Or, do we want to know how much an otherwise similar person would earn if they moved from an open shop to an organized job? • Theory and careful thinking about our issue are central to developing a good specification. • Prior research also provides essential guidance • Typically reflects considerable experience with multiple data sets
Specification Criteria • How do our estimates behave as we alter our specification (confirmatory, not a means of determining the equation)? • 1. We should pay attention to the behavior of…: • coefficient sign and magnitude • t-test • adjusted R-squared • bias
Specification Criteria • 2. Omitted variables. When added…: • The coefficient will be large in magnitude and correctly signed • It will be strongly statistically significant • Adjusted R-squared will increase, as the variable has explanatory power • The other coefficients, particularly those of interest, will change as bias is removed
Specification Criteria • 3. Irrelevant variables. When added…: • The coefficient will be close to 0 • The coefficient will not be statistically significant • Adjusted R-squared will not increase and will likely fall (depends on sample size) • Other coefficients, particularly those of interest, will not change, as we are not eliminating bias
Specification Criteria • Q: Why don’t we simply use our samples to specify our models (using our four criteria)? • A: This approach is used in theory building in natural and social sciences. • Approach is to use an initial data set to look for correlations among the variables to explain some outcome. • People then build hypotheses based on correlations. Often develop corollaries of initial ideas as theory has developed. • Find or collect new data sets to test those theories • Trying to use sample data to specify a model can lead to some very silly places.
Deductive vs. Inductive • Several approaches to understanding the world: • 1. Deductive: begin with a theory, seek confirmation using statistical methods. • 2. Inductive: search the data to find regularities, construct theory, use new data to test the theory (exploratory vs confirmatory research).
Deductive vs. Inductive • Deductive: • Note: Tufte strongly supports a theory-driven approach: we start with a causal model and use our data to explore that causal relation. • Why, in general, don’t we simply let the sample data guide our specification?
Deductive vs. Inductive • Example: We are trying to predict the amount of Brazilian coffee consumed annually. Economic theory strongly suggests that price plays an important role in the demand for consumer goods:

Coffee = 9.1 + 7.8*P(bc) + 2.4*P(tea) + .0035*Y(disposable Inc)
t             (0.5)        (2.0)        (3.5)
R-squared = .60   n = 25

• Idea: t on P(bc) is non-significant, why not drop?
Deductive vs. Inductive

Coffee = 9.1 + 2.6*P(tea) + .0036*Y(disposable Inc)
t             (2.6)         (4.0)
R-squared = .61   n = 25

• Small rise in coefficient of determination, little change in other coefficients. • But, in fact, we have an issue with an omitted variable rather than an irrelevant variable. We failed to include the price of a close substitute, Colombian coffee: