Class 17: Tuesday, Nov. 9. Another example of interpreting multiple regression coefficients Steps in multiple regression analysis and example analysis Omitted Variables Bias Discuss final project. Interpreting Multiple Regression Coefficients: Another Example.
Class 17: Tuesday, Nov. 9 • Another example of interpreting multiple regression coefficients • Steps in multiple regression analysis and example analysis • Omitted Variables Bias • Discuss final project
Interpreting Multiple Regression Coefficients: Another Example • A marketing firm studied the demand for a new type of personal digital assistant (PDA). The firm surveyed a sample of 75 consumers. Each respondent was initially shown the new device and then asked to rate the likelihood of purchase on a scale of 1 to 10, with 1 implying little chance of purchase and 10 indicating almost certain purchase. The age (in years) and income (in thousands of dollars) were recorded for each respondent. The data are in pda.JMP.
Simple Regressions to Predict Rating (Likelihood of Purchase) • As income rises, the likelihood of purchase also increases; specifically a $10,000 increase in income is associated with a 0.7 increase in rating. • As age increases, the likelihood of purchase also increases; specifically a 10-year increase in age is associated with a 0.9 increase in rating.
Multiple Regression • For any fixed level of income, the average rating decreases by 0.7 if Age increases by 10 years. • At every fixed income level, older consumers have lower ratings on average than younger consumers, and at every fixed age level, average ratings increase as income rises. • The positive association between age and rating in the simple regression is a result of the positive association between age and income.
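The sign change between the simple and the multiple regression can be reproduced on synthetic data. The sketch below does not use pda.JMP: the sample size, coefficients, and noise levels are assumptions chosen so that income rises with age, rating rises with income, and rating falls with age at fixed income. The simulated ratings are not restricted to the 1 to 10 scale, and the helper names (`sxy`, `simple_slope`, `two_predictor_slopes`) are illustrative only.

```python
import random

def sxy(u, v):
    """Sum of cross products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def simple_slope(x, y):
    """Least squares slope from regressing y on x (with intercept)."""
    return sxy(x, y) / sxy(x, x)

def two_predictor_slopes(x1, x2, y):
    """Least squares slopes from regressing y on x1 and x2 (with intercept)."""
    s11, s22, s12 = sxy(x1, x1), sxy(x2, x2), sxy(x1, x2)
    s1y, s2y = sxy(x1, y), sxy(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)
n = 500  # assumed; larger than the 75 respondents so the estimates are stable

# Assumed data-generating process (income in thousands of dollars).
age = [rng.uniform(25, 65) for _ in range(n)]
income = [20 + 1.5 * a + rng.gauss(0, 10) for a in age]
rating = [1 + 0.10 * inc - 0.07 * a + rng.gauss(0, 1)
          for a, inc in zip(age, income)]

age_alone = simple_slope(age, rating)
age_given_income, income_given_age = two_predictor_slopes(age, income, rating)

print(f"simple regression slope on age:      {age_alone:+.3f}")
print(f"multiple regression slope on age:    {age_given_income:+.3f}")
print(f"multiple regression slope on income: {income_given_age:+.3f}")
```

In this simulation the slope on age is positive when age is the only explanatory variable and negative once income is held fixed, which is the pattern the slide describes.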
Air Pollution and Mortality • Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities from 1959 to 1961. • The variables are • y (MORT) = total age-adjusted mortality in deaths per 100,000 population; • PRECIP = mean annual precipitation (in inches); EDUC = median number of school years completed for persons 25 and older; NONWHITE = percentage of 1960 population that is nonwhite; NOX = relative pollution potential of NOx (related to the tons of NOx emitted per day per square kilometer); SO2 = relative pollution potential of SO2.
Multiple Regression: Steps in Analysis • Preliminaries: Define the question of interest. Review the design of the study. Correct errors in the data. • Explore the data. Use graphical tools, e.g., scatterplot matrix; consider transformations of explanatory variables; fit a tentative model; check for outliers and influential points. • Formulate an inferential model. Word the questions of interest in terms of model parameters.
Multiple Regression: Steps in Analysis Continued • Check the Model. (a) Check the model assumptions of linearity, constant variance, normality. (b) If needed, return to step 2 and make changes to the model (such as transformations or adding terms for interaction and curvature); (c) Drop variables from the model that are not of central interest and are not significant. • Infer the answers to the questions of interest using appropriate inferential tools (e.g., confidence intervals, hypothesis tests, prediction intervals). • Presentation: Communicate the results to the intended audience.
Air Pollution and Mortality • Question of interest: What is the association between the air pollution variables (NOX and SO2) and mortality once environmental variables (precipitation) and demographic variables have been taken into account?
There is curvature in the relationship between Mortality and SO2. Tukey’s Bulging Rule suggests transforming SO2 to log SO2 as a possible remedy. The scatterplot of Mortality vs. NOX is “crunched”: most of the points are bunched together at low values of NOX. When a scatterplot of a response against an explanatory variable is “crunched,” transforming the explanatory variable to log(explanatory variable) is a good idea.
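A small illustration of why the log transformation helps with a "crunched" explanatory variable. The data below are made up (they are not from pollution.JMP): x doubles from point to point, so most values are bunched near zero, and y is constructed to be exactly linear in log(x). The helper `corr` is an illustrative name.

```python
import math

def corr(u, v):
    """Sample correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Assumed toy data: x = 1, 2, 4, ..., 512, so most points sit near zero.
x = [2.0 ** k for k in range(10)]
y = [3 + 2 * math.log(v) for v in x]   # response is linear in log(x) by construction
log_x = [math.log(v) for v in x]

corr_raw = corr(x, y)       # linear association on the original scale
corr_log = corr(log_x, y)   # linear association after transforming x

print(f"correlation of y with x:      {corr_raw:.3f}")
print(f"correlation of y with log(x): {corr_log:.3f}")
```

On the original scale the relationship is curved and the points at small x are crowded together, so a straight line fits poorly; on the log scale the points are evenly spaced and the relationship is linear.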
Initial Model • Checking for influential points: New Orleans has Cook’s distance of 1.75 and leverage 0.45 > (3*6/60) = 0.30. We should remove New Orleans, noting that it has unusual explanatory variables and that our conclusions do not apply to explanatory variables in the range of New Orleans.
Because New Orleans is an influential point and has leverage 0.45 > (3*6/60) = 0.30, we remove it and note that our model does not apply to observations in the range of explanatory variables of New Orleans.
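The two diagnostics used here can be sketched for the simplest case of a single explanatory variable, where leverage has a closed form. This is a simplification under stated assumptions: the pollution model has 6 coefficients (intercept plus five explanatory variables) and 60 cities, giving the cutoff 3*6/60 = 0.30, whereas the toy data below have 2 coefficients and 20 points. The data and the function name `leverage_and_cooks` are made up for illustration.

```python
def leverage_and_cooks(x, y):
    """Leverage and Cook's distance for a one-predictor least squares fit."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance estimate
    # Leverage: how far x_i is from the mean of x, relative to the spread of x.
    lev = [1 / n + (a - xbar) ** 2 / sxx for a in x]
    # Cook's distance combines the residual size with the leverage.
    cooks = [e ** 2 * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, lev)]
    return lev, cooks

# Assumed toy data: 19 points on a line plus one point with an unusual x
# value that does not follow the line.
x = [float(v) for v in range(1, 20)] + [60.0]
y = [2 + 0.5 * v for v in x[:-1]] + [10.0]

lev, cooks = leverage_and_cooks(x, y)
cutoff = 3 * 2 / len(x)  # 3 * (number of coefficients) / (sample size)
flagged = [i for i, h in enumerate(lev) if h > cutoff]

print("leverage cutoff:", cutoff)
print("high-leverage observations:", flagged)
print("largest Cook's distance at index:", cooks.index(max(cooks)))
```

Only the last observation exceeds the leverage cutoff, and it also has by far the largest Cook's distance, which is the same pattern that singles out New Orleans.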
Model Building • Model Parsimony: If a variable is not of central interest and is not significant, we remove it from the model. • We can remove Education. We don’t remove log NOX since it is of central interest.
Inference About Questions of Interest • Strong evidence that mortality is positively associated with SO2 for fixed levels of precipitation, education, nonwhite, NOX. • No strong evidence that mortality is associated with NOX for fixed levels of precipitation, education, nonwhite, SO2.
Multiple Regression and Causal Inference • Goal: Figure out what the causal effect on mortality would be of decreasing air pollution (and keeping everything else in the world fixed) • Lurking variable: A variable that is associated with both air pollution in a city and mortality in a city. • In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the confounding variables. • If we include all of the lurking variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one unit increase in air pollution.
Omitted Variables • What happens if we omit a lurking variable from the regression, e.g., percentage of smokers? • Suppose we are interested in the causal effect of $x_1$ on $y$ and believe that there are lurking variables $x_2, \ldots, x_p$ and that $E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, where $\beta_1$ is the causal effect of $x_1$ on $y$. If we omit the confounding variable, $x_p$, then the multiple regression will be estimating the coefficient $\beta_1^*$ as the coefficient on $x_1$. How different are $\beta_1^*$ and $\beta_1$?
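Omitted Variables Bias Formula • Suppose that $E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, $E(y \mid x_1, \ldots, x_{p-1}) = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{p-1}^* x_{p-1}$, and $E(x_p \mid x_1, \ldots, x_{p-1}) = \delta_0 + \delta_1 x_1 + \cdots + \delta_{p-1} x_{p-1}$ • Then $\beta_1^* = \beta_1 + \beta_p \delta_1$ • Formula tells us about direction and magnitude of bias from omitting a variable in estimating a causal effect. • Omitted variable bias: $\beta_1^* - \beta_1 = \beta_p \delta_1$ • Formula also applies to least squares estimates, i.e., $\hat\beta_1^* = \hat\beta_1 + \hat\beta_p \hat\delta_1$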
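The least squares version of the formula can be checked numerically in the two-variable case: the slope on x1 when x2 is omitted equals the slope on x1 when x2 is included, plus the slope on x2 times the slope from regressing x2 on x1. The sketch below uses simulated data; the coefficients, sample size, and variable names (`b1`, `b2`, `b1_star`, `d1`) are assumptions for illustration.

```python
import random

def sxy(u, v):
    """Sum of cross products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def simple_slope(x, y):
    """Least squares slope from regressing y on x (with intercept)."""
    return sxy(x, y) / sxy(x, x)

def two_predictor_slopes(x1, x2, y):
    """Least squares slopes from regressing y on x1 and x2 (with intercept)."""
    s11, s22, s12 = sxy(x1, x1), sxy(x2, x2), sxy(x1, x2)
    s1y, s2y = sxy(x1, y), sxy(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(1)
n = 200  # assumed sample size

# Assumed setup: x2 is a lurking variable, associated with both x1 and y.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 1) for a in x1]
y = [1 + 2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

b1, b2 = two_predictor_slopes(x1, x2, y)  # regression that includes x2
b1_star = simple_slope(x1, y)             # regression that omits x2
d1 = simple_slope(x1, x2)                 # regression of x2 on x1

print(f"slope on x1, x2 included: {b1:.4f}")
print(f"slope on x1, x2 omitted:  {b1_star:.4f}")
print(f"b1 + b2 * d1:             {b1 + b2 * d1:.4f}")
```

The last two printed values agree up to rounding, since the identity holds exactly for least squares estimates. Because the assumed effect of x2 on y and the association between x1 and x2 are both positive here, omitting x2 biases the slope on x1 upward.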