Lecture 26 • Omitted Variable Bias formula revisited • Specially constructed variables • Interaction variables • Polynomial terms for curvature • Dummy variables for categorical variables
Omitted Variable Bias Formula Revisited • From paper “Re-examining Criminal Behavior: The Importance of Omitted Variable Bias,” by David Mustard, Review of Economics and Statistics, 2003. • To what extent do changes in the arrest rate alter the willingness of individuals to engage in criminal activity? • Becker’s economic theory of crime: Those involved in illegal activities respond to incentives in much the same way as those who engage in legal activities respond.
Regressions for economic theory of crime • Goal: Figure out the causal effect of an increase in the arrest rate on the number of crimes committed. • When Y = log number of crimes committed in a city is regressed on X = log arrest rate in the city, the coefficient on X is -.0119 for murder, -.0020 for assault and -.0117 for burglary (coefficient of -.0117 in log-log regression implies that a 1% increase in the arrest rate for burglaries is associated with a 1.17% decrease in the number of burglaries). • This simple regression omits a confounding variable, the conviction rate. What is the direction of the omitted variable bias?
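Omitted Variables Bias Formula • $x_1$ = the explanatory variable whose causal effect on $y$ we want to find. $x_2, \ldots, x_{K-1}$ = confounding variables we control for by including them in the regression. $x_K$ = omitted confounding variable. • Write the full regression as $E(y \mid x_1, \ldots, x_K) = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K$, the regression that omits $x_K$ as $E(y \mid x_1, \ldots, x_{K-1}) = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{K-1}^* x_{K-1}$, and the regression of the omitted variable on the included ones as $E(x_K \mid x_1, \ldots, x_{K-1}) = \delta_0 + \delta_1 x_1 + \cdots + \delta_{K-1} x_{K-1}$. • Then $\beta_1^* = \beta_1 + \beta_K \delta_1$, or equivalently $\beta_1^* - \beta_1 = \beta_K \delta_1$. • The formula tells us the direction and magnitude of the bias from omitting a variable when estimating a causal effect. • The formula also applies to least squares estimates, i.e., $b_1^* = b_1 + b_K d_1$.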
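Application of OVB formula • $y$ = crime rate, $x_1$ = arrest rate, $x_K$ = conviction rate (omitted). • Here $\beta_K$, the coefficient on the conviction rate in the regression that includes it, is probably negative: an increase in the conviction rate should reduce crimes, holding other variables fixed. • Mustard presents evidence that $\delta_1$, the coefficient on the arrest rate when the conviction rate is regressed on the included variables, is negative: as more people are arrested for a given offense level, the amount of evidence against each arrestee decreases. • If $\beta_K$ and $\delta_1$ are both negative, then $\beta_1^* - \beta_1 = \beta_K \delta_1 > 0$, where $\beta_1^*$ is the arrest-rate coefficient when the conviction rate is omitted and $\beta_1$ is the coefficient when it is included. The estimate that a 1% increase in the arrest rate would reduce the burglary rate by 1.17% therefore underestimates the impact of an increase in the arrest rate on reducing the burglary rate (i.e., the coefficient $\beta_1$ in the log-log regression is less than -.0117, so the reduction is greater than 1.17%).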
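The omitted variable bias identity for least squares estimates can be checked numerically. The sketch below assumes NumPy and uses synthetic data (not Mustard's data): `x1` stands in for the arrest rate and `xK` for the omitted conviction rate, and the helper `ols` is a name introduced here for illustration. It compares the coefficient on `x1` from the short regression with `b1 + bK * d1` built from the long regression and the regression of `xK` on `x1`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data for illustration only.
# xK is negatively related to x1, and y is negatively related to xK.
x1 = rng.normal(size=n)
xK = -0.5 * x1 + rng.normal(size=n)
y = 1.0 - 0.3 * x1 - 0.8 * xK + rng.normal(size=n)


def ols(columns, response):
    """Least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(response))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef


b_full = ols([x1, xK], y)   # long regression: intercept, b1, bK
b_omit = ols([x1], y)       # short regression omitting xK: intercept, b1*
d = ols([x1], xK)           # omitted variable on included variable: d0, d1

b1, bK = b_full[1], b_full[2]
b1_star = b_omit[1]
d1 = d[1]

print("short-regression coefficient b1*:", b1_star)
print("b1 + bK * d1:                    ", b1 + bK * d1)
print("bias b1* - b1:                   ", b1_star - b1)
```

With both `bK` and `d1` negative, the bias `bK * d1` is positive, so the short-regression coefficient sits above the long-regression one, which is the direction argued for the burglary example.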
Specially Constructed Explanatory Variables • Interaction variables • Squared and higher polynomial terms for curvature • Dummy variables for categorical variables.
Interaction • Interaction is a three-variable concept. One of these is the response variable ($Y$) and the other two are explanatory variables ($X_1$ and $X_2$). • There is an interaction between $X_1$ and $X_2$ if the impact of an increase in $X_2$ on $Y$ depends on the level of $X_1$. • To incorporate interaction in a multiple regression model, we add the product $X_1 \times X_2$ as an explanatory variable. There is evidence of an interaction if the coefficient on $X_1 \times X_2$ is significant (t-test has p-value < .05).
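A minimal sketch of fitting an interaction term, assuming NumPy and synthetic data with a true interaction coefficient of 0.5. The product column is added to the design matrix, and the t-statistic is computed from the usual least squares standard errors; as a rough guide, |t| above about 2 corresponds to a p-value below .05 at this sample size. The function `slope_of_x2` is a name introduced here to show that the estimated effect of $X_2$ changes with the level of $X_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic data for illustration only.
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
y = 2.0 + 1.0 * x1 + 0.5 * x2 + 0.5 * x1 * x2 + rng.normal(scale=2.0, size=n)

# Design matrix: intercept, X1, X2 and the interaction X1 * X2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and (X'X)^{-1}.
resid = y - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = coef / se


def slope_of_x2(x1_level):
    """Estimated change in mean Y per unit increase in X2 at a given X1."""
    return coef[2] + coef[3] * x1_level


print("interaction coefficient:", coef[3], "t-statistic:", t_stats[3])
print("slope of X2 at X1 = 2:", slope_of_x2(2.0))
print("slope of X2 at X1 = 8:", slope_of_x2(8.0))
```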
Polynomials and Interactions Example • An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. The analyst has for a sample of 25 locations the annual gross revenue of the restaurant (y), the mean annual household income and the mean age of children in the area. Data in fastfoodchain.jmp
Polynomial Terms for Curvature • To model a curved relationship between $y$ and $x$, we can add squared (and cubic or higher order) terms as explanatory variables. • Fit as a multiple regression with two explanatory variables, $x$ and $x^2$: $E(y \mid x) = \beta_0 + \beta_1 x + \beta_2 x^2$. • To draw a plot of the estimated mean of Y|X in JMP: after Fit Model, click the red triangle next to Response, then Save Columns, Predicted Values. Then click Graph, Overlay Plot, and put Predicted Revenue and Revenue into Y, Columns and Income into X. Left-click on the box next to Predicted Revenue in the legend and select Connect Points.
Interpreting Coefficients and Tests for Polynomial Model • Coefficients are not directly interpretable. The change in the mean of $Y$ that is associated with a one unit increase in $X$ depends on $X$. • To test whether the multiple regression model with $X$ and $X^2$ as predictors provides better predictions than the multiple regression model with just $X$, use the p-value of the t-test on the $X^2$ coefficient (null hypothesis is that $X^2$ has a zero coefficient). • Plot residuals vs. $X$ to determine whether the quadratic model is appropriate. If there is still a pattern in the mean, try a cubic model with $X$, $X^2$ and $X^3$.
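The points above can be sketched in code, assuming NumPy and synthetic data with a concave relationship. The block fits the quadratic model, computes the t-statistic on the squared term, and shows that the estimated change in the mean of $Y$ for a one unit increase in $X$ depends on where $X$ starts. The helper `change_per_unit` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Synthetic data for illustration only: mean of y is 10 + 4x - 0.3x^2.
x = rng.uniform(0, 10, size=n)
y = 10.0 + 4.0 * x - 0.3 * x**2 + rng.normal(scale=1.0, size=n)

# Multiple regression with two explanatory variables, x and x^2.
X = np.column_stack([np.ones(n), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# t-statistic for the null hypothesis that the x^2 coefficient is zero.
resid = y - X @ b
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_x2 = b[2] / se[2]


def change_per_unit(x0):
    """Estimated change in mean Y when X goes from x0 to x0 + 1."""
    return b[1] + b[2] * (2 * x0 + 1)


print("coefficients (intercept, x, x^2):", b)
print("t-statistic on x^2:", t_x2)
print("change in mean from x = 1 to 2:", change_per_unit(1.0))
print("change in mean from x = 8 to 9:", change_per_unit(8.0))
```

The residuals `resid` can then be plotted against `x` to look for remaining curvature before trying a cubic term.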
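Regression Model for Fast Food Chain Data • Interactions and polynomial terms can be combined in a multiple regression model. • For the fast food chain data, we consider the model $E(\text{revenue} \mid \text{income}, \text{age}) = \beta_0 + \beta_1\,\text{income} + \beta_2\,\text{age} + \beta_3\,\text{income}^2 + \beta_4\,\text{age}^2 + \beta_5\,\text{income} \times \text{age}$. • This is called a second-order model because it includes all squares and interactions of the original explanatory variables.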
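A second-order model is an ordinary multiple regression on an expanded set of columns. The sketch below assumes NumPy; the function `second_order_design` and all numbers are illustrative, and the 25 synthetic locations stand in for fastfoodchain.jmp, which is not reproduced here.

```python
import numpy as np


def second_order_design(income, age):
    """Columns: intercept, income, age, income^2, age^2, income * age."""
    return np.column_stack([
        np.ones_like(income),
        income,
        age,
        income**2,
        age**2,
        income * age,
    ])


rng = np.random.default_rng(4)
n = 25

# Synthetic stand-in data (hypothetical units: income in $1000s, age in years).
income = rng.uniform(15, 35, size=n)
age = rng.uniform(3, 15, size=n)
revenue = (
    -1000 + 170 * income + 20 * age
    - 3.5 * income**2 - 4.0 * age**2 + 1.5 * income * age
    + rng.normal(scale=40.0, size=n)
)

X = second_order_design(income, age)
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)

names = ["intercept", "income", "age", "income^2", "age^2", "income*age"]
for name, value in zip(names, coef):
    print(f"{name:>11}: {value: .3f}")
```

With only 25 observations and six coefficients, individual estimates can be noisy, which is one reason the t-tests on the squared and interaction terms matter.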
fastfoodchain.jmp results • Strong evidence of a quadratic relationship between revenue and age and between revenue and income. Moderate evidence of an interaction between age and income.
Categorical variables • Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County). • Categorical variables can be incorporated into regression through dummy variables. • They can also be directly incorporated, as is done in JMP.
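Dummy coding can be sketched as follows, assuming NumPy and a small made-up response for the four counties named above. One level is treated as the reference and gets no column; each remaining level gets a 0/1 indicator. With an intercept and only these dummies in the model, the intercept estimates the reference group's mean and each dummy coefficient estimates that group's difference from the reference.

```python
import numpy as np

# Made-up data for illustration only.
county = ["Bucks", "Chester", "Delaware", "Philadelphia",
          "Bucks", "Chester", "Delaware", "Philadelphia"]
y = np.array([10.0, 14.0, 9.0, 20.0, 12.0, 16.0, 11.0, 24.0])

levels = sorted(set(county))
reference = levels[0]  # "Bucks" is the reference level and gets no dummy

# One 0/1 column per non-reference level.
dummies = np.column_stack([
    [1.0 if c == level else 0.0 for c in county]
    for level in levels[1:]
])

X = np.column_stack([np.ones(len(y)), dummies])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("reference level:", reference, "mean:", coef[0])
for level, diff in zip(levels[1:], coef[1:]):
    print(f"{level}: difference from {reference} = {diff:.2f}")
```

Using k - 1 dummies for k levels avoids perfect collinearity with the intercept; software that accepts categorical variables directly makes an equivalent coding choice internally.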
FedEx data set • Before it was a well-known company, FedEx undertook a campaign to promote use of its Courier packages (now called FedEx Paks). Sales representatives visited customers and worked to increase their use of the packages. Some of the customers were already aware of the Courier packaging before the promotion began, but it was unknown to others. • Response variable: Number of Courier package shipments per month. Explanatory variables: (1) Number of contact hours the customer had with a sales representative (hours of effort), (2) Categorical variable indicating whether or not the customer was already aware of the product before the promotion (aware). • Question: Was this promotion more effective for customers who were already aware of the product, or was it more effective for those who had been unaware?
Two sample analysis • Problem with two-sample analysis: hours of effort may be a confounding variable.
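The confounding concern can be illustrated with a small simulation, assuming NumPy. The data are synthetic, not the FedEx data: aware customers are given more contact hours on average, and awareness has no direct effect on shipments. The raw two-sample difference in mean shipments is then large, while the coefficient on the awareness dummy in a regression that also includes hours is close to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Synthetic data for illustration only.
aware = rng.integers(0, 2, size=n).astype(float)      # 1 = already aware
hours = rng.uniform(0, 5, size=n) + 3.0 * aware        # aware customers get more hours
shipments = 20.0 + 4.0 * hours + rng.normal(scale=3.0, size=n)  # no direct aware effect

# Two-sample comparison ignores hours of effort.
two_sample_diff = shipments[aware == 1].mean() - shipments[aware == 0].mean()

# Regression on hours and the aware dummy holds hours fixed.
X = np.column_stack([np.ones(n), hours, aware])
coef, *_ = np.linalg.lstsq(X, shipments, rcond=None)
adjusted_diff = coef[2]

print("two-sample difference in means:", two_sample_diff)
print("difference adjusted for hours: ", adjusted_diff)
```

Adding an `hours * aware` product column to `X` would additionally let the effect of an hour of effort differ between the two groups, which is the comparison the question about promotion effectiveness asks for.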