Overview of Regression Analysis
Conditional Mean • We all know what a mean or average is. • E.g., the mean annual earnings for 25-44 year old working males is $54,648 (March 2010). • We are also often interested in how this mean differs by other individual characteristics. • E.g., how do mean earnings differ between black and non-black workers? • Mean earnings for working non-black males ages 25-44 = $56,614 • Mean earnings for working black males ages 25-44 = $39,380 • These are known as conditional means (the mean conditioned on some other characteristic, in this case race). • So without controlling for anything else, 25-44 year old black working males earn on average $17,234 less annually, or 30% less, than similarly aged non-black working males.
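The raw gap quoted above is just the difference between two conditional means. A minimal sketch of that arithmetic, using only the two figures from the slide (the variable names are illustrative):

```python
# Conditional means quoted on the slide (working males ages 25-44, March 2010).
mean_nonblack = 56_614
mean_black = 39_380

# Raw (unconditional-on-anything-else) gap in dollars.
gap = mean_nonblack - mean_black

# Gap expressed as a share of the non-black mean.
gap_share = gap / mean_nonblack

print(f"gap = ${gap:,} ({gap_share:.0%} of the non-black mean)")
```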
Conditional Means • When testing a theory, though, we often want to know how much of a given mean difference can be attributed to a particular observable variable after controlling for other observable differences. • For example, we also know that earnings are closely tied to schooling, and there is a significant racial gap in schooling, so we might want to know how large the racial earnings gap is net of racial differences in years of schooling (i.e., controlling for schooling).
Conditional Means • One way to do this is to calculate even more complicated conditional means. • E.g., • Non-Black males between 25-44 w/out hs degree = $25,278 • Black males between 25-44 w/out hs degree = $22,275 • Non-Black males between 25-44 w/ hs degree = $41,922 • Black males between 25-44 w/ hs degree = $33,670 • Non-Black males between 25-44 w/ college degree = $80,295 • Black males between 25-44 w/ college degree = $61,136
Conditional Means • Then, we can find how much less black workers earn than white workers, after controlling for education, via the following weighted mean formula: \(\sum_{i} \frac{n_{b,i}}{n_b} \left(\text{earnings}_{w,i} - \text{earnings}_{b,i}\right)\) • where \(i\) corresponds to the three education categories, • \(n_{b,i}/n_b\) corresponds to the fraction of black male workers in education category \(i\), • \(\text{earnings}_{b,i}\) corresponds to the mean earnings for black workers in education category \(i\), • \(\text{earnings}_{w,i}\) corresponds to the mean earnings for white workers in education category \(i\). • Doing so, we find that according to the above conditional mean calculations, black male workers earn about $11,064, or 11,064/54,648 = 20 percent, less than white male workers with similar education characteristics. • So conditioning on years of education explains about 33% of the racial earnings gap ([0.30 - 0.20]/0.30 = 0.33).
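The weighted-mean calculation can be sketched as follows, assuming it weights each within-category earnings gap by the share of black workers in that category. The earnings figures are the conditional means from the earlier slide; the shares are HYPOTHETICAL placeholders, since the slides do not report them, so the result will not reproduce the $11,064 figure.

```python
def education_adjusted_gap(black_shares, earnings_w, earnings_b):
    """Weighted mean of within-category gaps, weighted by black workers' shares."""
    if abs(sum(black_shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(
        share * (w - b)
        for share, w, b in zip(black_shares, earnings_w, earnings_b)
    )

# Conditional means from the slides, in order: no HS degree, HS degree, college degree.
earnings_w = [25_278, 41_922, 80_295]
earnings_b = [22_275, 33_670, 61_136]

# HYPOTHETICAL shares of black workers in each education category (not from the slides).
hypothetical_shares = [0.10, 0.70, 0.20]

adjusted_gap = education_adjusted_gap(hypothetical_shares, earnings_w, earnings_b)
print(f"education-adjusted gap under the hypothetical shares: ${adjusted_gap:,.0f}")
```

Swapping in the actual shares from the underlying data would be the only change needed to apply the same function to the real calculation.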
Conditional Means • It can be quite cumbersome to compute all these conditional means, though, especially if we start adding more categories for education • e.g., only up to 10th grade, only up to 11th grade, only up to 12th grade, 1 year of college, 2 years of college, 3 years of college, etc. • Moreover, what if we are also interested in the impact of another year of schooling on earnings, after controlling for race? That would require a whole new set of calculations.
Regression • This is why a regression model is often a simpler way to describe conditional means: \(\text{earnings}_i = \alpha + \beta_1 \cdot \text{black}_i + \beta_2 \cdot \text{yrs of schooling}_i + e_i\) • \(\alpha\) is known as the intercept, the \(\beta\)'s are (slope) coefficients, and \(e_i\) is the “residual” • Estimating a regression amounts to finding the intercept and slope coefficients that minimize the sum of the squared \(e_i\) terms across the sample (i.e., find the best “fit”) • So the intercept and coefficients essentially account for the variation in the dependent variable (earnings) that is common across all people with respect to the control variables, while the residual is the individual-specific variation, or how each individual differs from the average. • Graphically?
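As a rough illustration of "finding the intercept and slopes that minimize the sum of squared residuals", the sketch below fits the model by least squares with NumPy. The data are SYNTHETIC and the "true" parameters are arbitrary choices for the illustration, not the estimates reported in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# SYNTHETIC regressors: a 0/1 indicator and years of schooling between 8 and 18.
black = rng.integers(0, 2, size=n)
schooling = rng.integers(8, 19, size=n)

# SYNTHETIC outcome generated from arbitrary parameters plus noise.
true_alpha, true_beta1, true_beta2 = -70_000.0, -10_000.0, 9_000.0
earnings = (true_alpha + true_beta1 * black + true_beta2 * schooling
            + rng.normal(0.0, 5_000.0, size=n))

# Design matrix: a column of ones for the intercept, then the two regressors.
X = np.column_stack([np.ones(n), black, schooling])

# Least squares picks the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
residuals = earnings - X @ coef
ssr = float(residuals @ residuals)

print("alpha, beta1, beta2 =", np.round(coef, 1))
```

Any other choice of coefficients gives a larger sum of squared residuals on the same data, which is what "best fit" means here.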
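Regression • [Figure: Earnings (vertical axis) against Yrs of Schooling (horizontal axis); two regression lines with common slope β2 and intercepts α (black = 0) and α+β1 (black = 1).]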
Regression • When I estimate this model I get: \(\text{earnings}_i = -70,003 - 10,381 \cdot \text{black}_i + 8,888 \cdot \text{yrs of schooling}_i + e_i\) (standard errors, in coefficient order: 1,968; 1,126; 138) • Computing the equation for particular characteristics without the \(e_i\) term gives “expected,” or average, earnings for a person with those characteristics. So for a non-black worker with 12 years of schooling, expected earnings are: • \(-70,003 - 10,381 \cdot 0 + 8,888 \cdot 12 = 36,653\), i.e., $36,653 • How do we interpret specific coefficients?
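The expected-earnings calculation can be written as a small function. This is only a sketch that plugs the coefficients reported on the slide into the fitted equation; the function name is illustrative.

```python
def expected_earnings(black, yrs_schooling):
    """Fitted equation from the slide, without the residual term."""
    return -70_003 - 10_381 * black + 8_888 * yrs_schooling

# Non-black worker with 12 years of schooling (the example on the slide).
print(expected_earnings(black=0, yrs_schooling=12))  # 36653
```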
Regression • The way to interpret coefficients (i.e., “Betas”) • “The marginal change in the conditional mean of the dependent variable due to a one unit increase in that characteristic, holding all other characteristics constant.” • So, one way to determine the marginal impact of a given characteristic on the dependent variable (e.g., the impact of another year of schooling on earnings) is simply to see how the “expected” outcome of the dependent variable would differ if two individuals differed by one unit in that characteristic, but were otherwise the same.
Regression • For example, consider our estimated earnings regression \(\text{earnings}_i = -70,003 - 10,381 \cdot \text{black}_i + 8,888 \cdot \text{yrs of schooling}_i + e_i\) • Finding this “difference” between two individuals who are the same on all other characteristics (i.e., same race), but one has \(s\) years of education while the other has \(s+1\), we get \([-70,003 - 10,381 \cdot \text{black} + 8,888 \cdot (s+1)] - [-70,003 - 10,381 \cdot \text{black} + 8,888 \cdot s] = [(s+1) - s] \cdot 8,888 = 8,888\) • So, under this specification, the “marginal” impact of another year of schooling on earnings is \(\beta_2\), or simply the coefficient on the years of schooling variable.
Regression • Consider again our estimated earnings regression \(\text{earnings}_i = -70,003 - 10,381 \cdot \text{black}_i + 8,888 \cdot \text{yrs of schooling}_i + e_i\) • Doing a similar exercise with the “black” indicator variable (i.e., holding years of schooling constant and comparing an individual with black = 1 to an individual with black = 0) we get -10,381. • This means that, holding everything else equal (i.e., years of education), on average black workers earn $10,381 less than white workers. • This is close to the $11,064 conditional pay differential we computed before, but is still a little different. Why?
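Both differencing exercises can be checked numerically. The sketch below re-uses the fitted equation from the slides and compares two otherwise identical individuals; the particular schooling levels tried are arbitrary, since the differences do not depend on them in this specification.

```python
def expected_earnings(black, yrs_schooling):
    """Fitted equation from the slides, without the residual term."""
    return -70_003 - 10_381 * black + 8_888 * yrs_schooling

# One more year of schooling, holding race fixed: recovers the schooling coefficient.
schooling_effects = {
    (black, s): expected_earnings(black, s + 1) - expected_earnings(black, s)
    for black in (0, 1)
    for s in (10, 12, 16)
}

# black = 1 versus black = 0, holding schooling fixed: recovers the black coefficient.
race_gaps = {
    s: expected_earnings(1, s) - expected_earnings(0, s)
    for s in (10, 12, 16)
}

print(set(schooling_effects.values()), set(race_gaps.values()))
```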
Regression • Often, when we run regressions, we aren’t really interested in “point estimates” (i.e. specific coefficient estimates), but rather in using these estimates to test hypotheses. • For example, what if what we are really interested in is whether black workers have a lower return to an additional year of schooling than white workers. • How could we test this?
Regression • What if I added in an “interaction” term between schooling and race? \(\text{earnings}_i = \alpha + \beta_1 \cdot \text{black}_i + \beta_2 \cdot \text{yrs of schooling}_i + \beta_3 \cdot \text{black}_i \cdot \text{yrs of schooling}_i + e_i\) • Doing this estimation I get: \(\text{earnings}_i = -47,011 + 1,381 \cdot \text{black}_i + 7,321 \cdot \text{yrs of schooling}_i - 982 \cdot \text{black}_i \cdot \text{yrs of schooling}_i\) • How do we interpret these coefficients? • What is the average impact of another year of schooling on a black worker’s earnings? • What is the average impact of another year of schooling on a white worker’s earnings? • So the marginal impact of another year of schooling on earnings for black workers is given by \(\beta_2 + \beta_3\), and the hypothesis test amounts to determining whether \(\beta_3\) is “statistically” different from zero.
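With the interaction term, the return to a year of schooling differs by group, and the race gap depends on the schooling level. A minimal sketch using the coefficients reported on the slide (the function and variable names are illustrative):

```python
ALPHA, B_BLACK, B_SCHOOL, B_INTERACT = -47_011, 1_381, 7_321, -982

def expected_earnings(black, yrs_schooling):
    """Fitted interaction model from the slide, without the residual term."""
    return (ALPHA + B_BLACK * black + B_SCHOOL * yrs_schooling
            + B_INTERACT * black * yrs_schooling)

# Return to one more year of schooling, by group.
return_nonblack = expected_earnings(0, 13) - expected_earnings(0, 12)  # beta2
return_black = expected_earnings(1, 13) - expected_earnings(1, 12)     # beta2 + beta3

# The race gap is no longer a single number: it is beta1 + beta3 * yrs_schooling.
gap_at_12 = expected_earnings(1, 12) - expected_earnings(0, 12)
gap_at_16 = expected_earnings(1, 16) - expected_earnings(0, 16)

print(return_nonblack, return_black, gap_at_12, gap_at_16)
```

Note that the positive coefficient on the black indicator alone does not mean black workers earn more; it is the implied gap at zero years of schooling, which is outside the relevant range of the data.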
Regression • Precision/Significance of estimates: • Consider again the previous estimates • What we are testing is whether the coefficient of interest is “significantly” different from zero (i.e., how likely is it that we would have gotten an estimate this large by chance even if the true value were zero) • To test a hypothesis, we must compare the size of the coefficient to its standard error. • A good rule of thumb is that an estimate is statistically significant (at roughly the 5% level) if the absolute magnitude of the coefficient is more than twice its standard error. • What will generally impact whether an estimate is significant?
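The rule of thumb can be sketched as a ratio check. The example below assumes the parenthesized numbers reported with the earlier earnings regression (1,968; 1,126; 138) are the standard errors of the intercept, the black indicator, and years of schooling, in that order; the helper names are illustrative.

```python
# (coefficient, standard error) pairs, ASSUMING the parenthesized numbers on the
# earlier slide are standard errors listed in coefficient order.
estimates = {
    "intercept": (-70_003, 1_968),
    "black": (-10_381, 1_126),
    "yrs of schooling": (8_888, 138),
}

def t_ratio(coef, se):
    """Coefficient divided by its standard error."""
    return coef / se

def passes_rule_of_thumb(coef, se):
    """True when the coefficient is more than twice its standard error in magnitude."""
    return abs(t_ratio(coef, se)) > 2

for name, (coef, se) in estimates.items():
    print(f"{name}: ratio = {t_ratio(coef, se):.1f}, "
          f"significant = {passes_rule_of_thumb(coef, se)}")
```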
Specification form • Often when doing regressions researchers will use the natural log of earnings rather than simply earnings as the dependent variable: \(\ln(\text{earnings}_i) = \alpha + \beta_1 \cdot \text{black}_i + \beta_2 \cdot \text{schooling}_i + e_i\) • This is done for two reasons: • This specification often “fits” the data better, as the log transformation makes a variable with a highly skewed distribution closer to a normal distribution, which generally helps the regression fit. • The coefficients can be roughly interpreted as percentage changes in the dependent variable associated with a unit change in the corresponding control variable (i.e., a semi-elasticity), rather than how the level of the dependent variable changes given a unit change in the corresponding control variable.
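The "roughly a percentage change" reading is an approximation: in a log specification, a one-unit change in a regressor with coefficient β multiplies the level of the dependent variable by exp(β), so the exact proportional change is exp(β) - 1. A small sketch comparing the two, using HYPOTHETICAL coefficient values (no log-earnings estimates are reported in these slides):

```python
import math

def exact_pct_change(beta):
    """Exact proportional change in the level implied by a log-model coefficient."""
    return math.exp(beta) - 1

# HYPOTHETICAL coefficients, chosen only to show when the approximation is good.
for beta in (0.05, 0.10, -0.25):
    print(f"beta = {beta:+.2f}: approx {beta:+.1%}, exact {exact_pct_change(beta):+.1%}")
```

The approximation is close for small coefficients and drifts as they grow in magnitude.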
Omitted variables • If we are really interested in the wage gap between black workers and white workers after conditioning on years of education, what are we missing from the basic specification that might obscure the answer we are really looking for? \(\ln(\text{earnings}_i) = \alpha + \beta_1 \cdot \text{black}_i + \beta_2 \cdot \text{schooling}_i + e_i\)
Omitted variables • \(\ln(\text{earnings}_i) = \alpha + \beta_1 \cdot \text{black}_i + \beta_2 \cdot \text{Hispanic}_i + \beta_3 \cdot \text{schooling}_i + e_i\) • What will this likely do to the coefficient on the black indicator?
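One way to see why adding a Hispanic indicator can move the coefficient on the black indicator is that it changes the comparison group. The sketch below uses tiny SYNTHETIC, noise-free data (all group sizes and log-earnings values are made up) and leaves schooling out to keep the example small, so it illustrates only the reference-group mechanism.

```python
import numpy as np

# SYNTHETIC data: 6 workers in the omitted group, 2 Hispanic, 2 black.
black = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
hispanic = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])
ln_earnings = np.array([10.9] * 6 + [10.5] * 2 + [10.6] * 2)

def ols(y, *regressors):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Black indicator only: black workers are compared with everyone else pooled.
coef_black_only = ols(ln_earnings, black)

# Adding the Hispanic indicator: black workers are now compared with the
# group that is neither black nor Hispanic.
coef_with_hispanic = ols(ln_earnings, black, hispanic)

print("black coefficient, black only:   ", round(coef_black_only[1], 3))
print("black coefficient, with Hispanic:", round(coef_with_hispanic[1], 3))
```

In this made-up example the estimated gap widens once Hispanic workers are no longer part of the comparison group, because their lower synthetic earnings had been pulling the pooled comparison mean down.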
Omitted variables • What about other things like age and region? These things are surely associated with earnings, therefore don’t they need to be included?
Omitted variables • In the end, it is not necessary to control for every possible thing that can affect the dependent (y, or left-hand side) variable. • What to control for depends on your question of interest. • Robustness – • A finding is said to be relatively robust if the basic qualitative finding is unchanged by the inclusion of further variables, the addition of more interaction terms (i.e., the combination of two existing variables, such as the term black*years of school), or changes in specification form (e.g., log transformation of the dependent variable).
Selection • Be very wary of making causal inferences from significant correlations • In particular, there are often issues of sample selection/endogeneity/omitted variables • Many characteristics are often the products of choice (often called endogenous characteristics). • In such cases it is hard to identify how the outcome of interest depends on that endogenous characteristic, versus other unobserved/omitted characteristics that determined that choice. • Consider again the expected “compensating wage differential” for higher risk jobs. Does this truly capture the average tolerance for risk? • Consider the Brooklyn Bridge “effect” on wages.
Selection • More realistically, consider trying to estimate the causal “effect” of being in a gang on individual crime. • What might be the concern of regressing number of crimes committed on an indicator for whether someone is in a gang or not, even after controlling for household income, race, age, and neighborhood characteristics? • How about estimating the causal “effect” of being married on individual crime by regressing number of crimes committed on an indicator for whether someone is married or not, even after controlling for income, race, age, and neighborhood characteristics?
Summary • In summary, • The coefficient on a given variable tells you the expected change in the outcome of interest due to a one unit change in that variable, after controlling for all of the other included characteristics. • Little credence should be given to imprecisely estimated coefficients (i.e., those with standard errors large enough that they are not statistically different from zero), especially when hypothesis testing. • One of the key details of a paper is the “empirical strategy” it uses to deal with selection effects. • Much of this class will be spent discussing the various empirical strategies authors use in the papers we read. • In the end, use your empirical intuition: can this data really answer the question of interest?