Lecture 24: Thurs., April 8th. Inference for Multiple Regression. Types of inferences: Confidence intervals/hypothesis tests for regression coefficients Confidence intervals for mean response, prediction intervals Overall usefulness of predictors (F-test, R-squared)
Inference for Multiple Regression Types of inferences: • Confidence intervals/hypothesis tests for regression coefficients • Confidence intervals for mean response, prediction intervals • Overall usefulness of predictors (F-test, R-squared) • Effect tests (we will cover these later when we cover categorical explanatory variables)
Overall usefulness of predictors • Are any of the predictors useful? Does the mean of y change as any of the explanatory variables changes? • $H_0: \beta_1 = \cdots = \beta_p = 0$ vs. $H_a$: at least one of the $\beta_j$'s does not equal zero. • The test (called the overall F test) is carried out in the Analysis of Variance table. We reject $H_0$ for large values of the F statistic. Prob>F is the p-value for this test. • For the fish mercury data, Prob>F is less than 0.0001: strong evidence that at least one of length/weight is a useful predictor of mercury concentration.
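As a rough illustration of what the Analysis of Variance table computes, the overall F statistic can be sketched from the total and residual sums of squares. The function name and the numbers below are hypothetical (they are not the fish mercury results), and the p-value itself would still come from the F distribution with p and n − p − 1 degrees of freedom, which this sketch does not compute.

```python
def overall_f_statistic(total_ss, residual_ss, n, p):
    """Overall F statistic for a regression with p predictors and n observations."""
    model_ss = total_ss - residual_ss   # variation explained by the predictors
    model_df = p                        # one degree of freedom per predictor
    residual_df = n - p - 1             # n minus the number of coefficients
    return (model_ss / model_df) / (residual_ss / residual_df)


# Hypothetical sums of squares, for illustration only.
f_stat = overall_f_statistic(total_ss=100.0, residual_ss=40.0, n=33, p=2)
print(f_stat)  # approximately 22.5
```

A large value of F means the predictors explain much more variation per degree of freedom than is left in the residuals.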
The R-Squared Statistic • The p-value from the overall F test tells us whether any of the predictors are useful, but it does not measure how useful they are. • R-squared measures how good the predictions from the multiple regression model are compared to using the mean of y, i.e., none of the predictors, to predict y. • The interpretation is similar to that in simple linear regression: the R-squared statistic is the proportion of the variation in y explained by the multiple regression model, $R^2 = \dfrac{\text{Total SS} - \text{Residual SS}}{\text{Total SS}}$. • Total Sum of Squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$ • Residual Sum of Squares: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
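The two sums of squares can be computed directly from the observed and fitted values. The sketch below is only an illustration: the function name is hypothetical and the data are made up (the fitted values are the group means from regressing y on a 0/1 indicator, so they are genuine least squares fits).

```python
def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    y_bar = sum(y) / len(y)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)                   # predict with the mean of y
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # predict with the model
    return 1.0 - residual_ss / total_ss


# Made-up data: y regressed on an indicator x = [0, 0, 1, 1].
y = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(y, fitted))  # approximately 0.8
```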
Air Pollution and Mortality • Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities from 1959 to 1961. • The variables are • y (MORT) = total age-adjusted mortality in deaths per 100,000 population; • PRECIP = mean annual precipitation (in inches); • EDUC = median number of school years completed for persons 25 and older; • NONWHITE = percentage of the 1960 population that is nonwhite; • NOX = relative pollution potential of NOx (related to the tons of NOx emitted per day per square kilometer); • SO2 = relative pollution potential of SO2.
Multiple Regression and Causal Inference • Goal: Figure out what the causal effect on mortality would be of decreasing air pollution (and keeping everything else in the world fixed) • Confounding variable: A variable that is related to both air pollution in a city and mortality in a city. • In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the confounding variables. • If we include all of the confounding variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one unit increase in air pollution.
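Omitted Variables • What happens if we omit a confounding variable from the regression, e.g., percentage of smokers? • Suppose we are interested in the causal effect of $x_1$ on y and believe that there are confounding variables $x_2, \ldots, x_p$ and that $\beta_1$ in $E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$ is the causal effect of $x_1$ on y. • If we omit the confounding variable $x_p$, then the multiple regression will be estimating a coefficient $\beta_1^*$ as the coefficient on $x_1$. How different are $\beta_1^*$ and $\beta_1$?
Omitted Variables Bias Formula • Suppose that $E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$ and $E(x_p \mid x_1, \ldots, x_{p-1}) = \delta_0 + \delta_1 x_1 + \cdots + \delta_{p-1} x_{p-1}$. • Then $\beta_1^* = \beta_1 + \beta_p \delta_1$. • The formula tells us about the direction and magnitude of the bias from omitting a variable when estimating a causal effect. • The formula also applies to least squares estimates, i.e., $\hat{\beta}_1^* = \hat{\beta}_1 + \hat{\beta}_p \hat{\delta}_1$.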
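The claim that the bias formula holds exactly for least squares estimates can be checked numerically. The sketch below uses the simplest case, one predictor of interest x1 and one omitted confounder x2: the coefficient on x1 in the short regression (y on x1 only) should equal the coefficient on x1 in the full regression plus the coefficient on x2 times the slope from regressing x2 on x1. The data are made up and the helper names `solve` and `least_squares` are hypothetical; a statistics package would normally do the fitting.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(columns, y):
    """Least squares coefficients (intercept first) of y on the given columns."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: x2 plays the role of the omitted confounder.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.2]

full = least_squares([x1, x2], y)   # intercept, coefficient on x1, coefficient on x2
short = least_squares([x1], y)      # x2 omitted
aux = least_squares([x1], x2)       # regression of the omitted variable on x1

bias = full[2] * aux[1]             # (coefficient on x2) * (slope of x2 on x1)
print(short[1], full[1] + bias)     # the two numbers agree up to rounding
```

The sign of the bias is the sign of the product: the omitted variable's effect on y times its association with x1.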
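Assumptions of Multiple Linear Regression Model • Assumptions of multiple linear regression: • For each subpopulation $x_1, \ldots, x_p$, • (A-1A) $E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$ • (A-1B) $\mathrm{Var}(y \mid x_1, \ldots, x_p) = \sigma^2$ • (A-1C) The distribution of $y \mid x_1, \ldots, x_p$ is normal [Distribution of residuals should not depend on $x_1, \ldots, x_p$] • (A-2) The observations are independent of one another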
Checking/Refining Model • Tools for checking (A-1A) and (A-1B) • Residual plots versus predicted (fitted) values • Residual plots versus explanatory variables • If model is correct, there should be no pattern in the residual plots • Tool for checking (A-1C) • Histogram of residuals • Tool for checking (A-2) • Residual plot versus time or spatial order of observations
Model Building • 1. Make a scatterplot matrix of the variables (using analyze, multivariate). Decide whether to transform any of the explanatory variables. • 2. Fit the model. • 3. Check the residual plots for whether the assumptions of the multiple regression model are satisfied. Also look for outliers and influential points. • 4. Make changes to the model and repeat steps 2-3 until an adequate model is found.
2. a) From the scatterplot of MORT vs. NOX we see that the NOX values are bunched tightly together, so a log transformation of NOX is needed. b) The curvature in MORT vs. SO2 indicates that a log transformation of SO2 may be suitable. After the two transformations we have the following correlations:
Dealing with Influential Observations • By influential observations, we mean one or several observations whose removal causes a different conclusion or course of action. • Display 11.8 provides a strategy for dealing with suspected influential cases.
Cook’s Distance • Cook’s distance is a statistic that can be used to flag influential observations. • After fitting the model, click on the red triangle next to Response, then Save Columns, Cook’s D Influence. • A Cook’s distance close to or larger than 1 indicates a large influence.
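For readers who want to see what the saved column contains, here is a minimal sketch of Cook's distance using the standard formula $D_i = \dfrac{e_i^2\, h_i}{k\, s^2 (1 - h_i)^2}$, where $e_i$ is the residual, $h_i$ the leverage, $k$ the number of coefficients (including the intercept), and $s^2$ the residual mean square. This is not JMP's code; the helper names and the data are made up for illustration, with the last point deliberately placed far from the others.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def cooks_distances(columns, y):
    """Cook's distance for each observation in a least squares fit with intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - k)          # residual mean square
    distances = []
    for r, e in zip(rows, resid):
        z = solve(xtx, r)                              # (X'X)^{-1} x_i
        h = sum(v * w for v, w in zip(r, z))           # leverage h_i = x_i'(X'X)^{-1} x_i
        distances.append(e * e * h / (k * s2 * (1.0 - h) ** 2))
    return distances


# Made-up data: the last observation has an extreme x value and falls off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
d = cooks_distances([x], y)
for i, di in enumerate(d):
    flag = "  <- close to or larger than 1" if di >= 1.0 else ""
    print(i, round(di, 3), flag)
```

In this made-up example only the last observation is flagged, because it combines high leverage with a sizeable residual.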
Leverage Plots • The leverage plots produced by JMP provide a “simple regression view” of a multiple regression coefficient. (The leverage plot for a variable $x_j$ plots the response against $x_j$ after both have been adjusted for the other explanatory variables.) • The slope of the line shown in the leverage plot is equal to the coefficient for that variable in the multiple regression. • The distances from the points to the line in the leverage plot are the multiple regression residuals. The distance from a point to the horizontal line is the residual if the explanatory variable is not included in the model. • These plots are used to identify outliers, leverage, and influential points for the particular regression coefficient in the multiple regression. (Use them the same way as in a simple regression.)
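The statement that the leverage plot's slope equals the multiple regression coefficient can be checked with a small sketch. It assumes the plot is built as an added-variable plot: the residuals of y regressed on the other predictors are plotted against the residuals of the variable of interest regressed on the other predictors (JMP's display may shift the axes, which does not change the slope). Data and helper names are made up for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(columns, y):
    """Least squares coefficients (intercept first) of y on the given columns."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def residuals(columns, y):
    """Residuals from the least squares regression of y on the given columns."""
    beta = least_squares(columns, y)
    return [yi - (beta[0] + sum(b * v for b, v in zip(beta[1:], vals)))
            for vals, yi in zip(zip(*columns), y)]


# Made-up data with two explanatory variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.2]

ry = residuals([x2], y)    # y adjusted for the other predictor
rx = residuals([x2], x1)   # x1 adjusted for the other predictor
slope = sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)

full = least_squares([x1, x2], y)
print(slope, full[1])      # the two slopes agree up to rounding
```

Because both sets of residuals have mean zero, the fitted line in this plot passes through the origin, so the slope alone describes it.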
The influential points can have an extreme impact on the analysis • An alternative model: because of the importance of NOX and SO2, one could choose the final model to be MORT vs. PRECIP, NONWHITE, EDUC, log NOX, and log SO2. • Notice that even though log NOX is not significant, one could still leave it in the model.
The enlarged observation, New Orleans, is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on log NOX and log SO2. Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.