Stat 112: Lecture 13 Notes. Finish Chapter 5: Review Predictions in Log-Log Transformation. Polynomials and Transformations in Multiple Regression Start Chapter 6: Checking Assumptions of Multiple Regression and Remedies for the Assumptions.
Stat 112: Lecture 13 Notes • Finish Chapter 5: • Review Predictions in Log-Log Transformation. • Polynomials and Transformations in Multiple Regression • Start Chapter 6: Checking Assumptions of Multiple Regression and Remedies for the Assumptions. • Schedule: Homework 4 will be assigned next week and due Thursday, Nov. 2nd.
Another Example of Transformations: Y = count of tree seeds, X = seed weight
By looking at the root mean square error on the original y-scale, we see that both of the transformations improve upon the untransformed model and that the transformation to log y and log x is by far the best.
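Prediction using the log y/log x transformation • What is the predicted seed count of a tree whose seeds weigh 50 mg? • Math trick: exp{log(y)} = y (remember that by log we always mean the natural log, ln), i.e., predicted y = exp{predicted log(y)} = exp{b0 + b1 log(50)}, where b0 and b1 are the intercept and slope from the regression of log y on log x.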
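The back-transformation step can be sketched in a few lines of Python. This is a minimal illustration, not the JMP output from the lecture: the (seed weight, seed count) pairs are made-up values, and the helper names fit_line and predict_loglog are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def predict_loglog(b0, b1, x_new):
    """Predict log(y) from log(x), then undo the log with exp."""
    return math.exp(b0 + b1 * math.log(x_new))

# Hypothetical (seed weight in mg, seed count) pairs, NOT the lecture's data.
weights = [1.0, 5.0, 20.0, 100.0, 500.0]
counts = [30000.0, 8000.0, 2500.0, 600.0, 150.0]

# Fit the regression on the log-log scale.
b0, b1 = fit_line([math.log(w) for w in weights], [math.log(c) for c in counts])

# Prediction on the original y-scale for a seed weight of 50 mg.
pred_50 = predict_loglog(b0, b1, 50.0)
print(f"b0={b0:.3f}, b1={b1:.3f}, predicted count at 50 mg: {pred_50:.0f}")
```

The regression is fit entirely on the log scale; exp is applied only at the end to return the prediction to the original count scale.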
Polynomials and Transformations in Multiple Regression • Example: Fast Food Locations. An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income, and the mean age of children in the area. Data in fastfoodchain.jmp
Scatterplot Matrix • There seems to be a nonlinear relationship between revenue and income and between revenue and age.
Polynomials and Transformations for Multiple Regression in JMP • For multiple regression, transformations can be done by creating a new column, right-clicking it, and clicking Formula to create a new formula. • Polynomials can be added by using Fit Model and then highlighting the X variable in both the Select Columns box and the Construct Model Effects box and then clicking Cross. • For choosing the order of the polynomials, we use the same procedure as in simple regression, making the polynomials higher order until the coefficient on the highest order term is not significant.
Chapter 6: Checking the Assumptions of the Regressions Model and Remedies for When the Assumptions are Not Met
Assumptions of Multiple Linear Regression Model • Linearity: E(Y | X1 = x1, ..., XK = xK) = β0 + β1 x1 + ... + βK xK. • Constant variance: The standard deviation of Y for the subpopulation of units with X1 = x1, ..., XK = xK is the same for all subpopulations. • Normality: The distribution of Y for the subpopulation of units with X1 = x1, ..., XK = xK is normally distributed for all subpopulations. • The observations are independent.
Assumptions for linear regression and their importance to inferences
Checking Linearity • Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals. • Residual plot: Use Fit Y by X with Y being Residuals. Fit Line will draw a horizontal line. • If residual plots show a problem, then we could try to transform the x-variable and/or the y-variable.
Residual Plots in JMP • After Fit Model, click red triangle next to Response, click Save Columns and click Residuals. • Use Fit Y by X with Y=Residuals and X the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
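The zero-slope, zero-intercept property of the residuals can be checked numerically. The sketch below is an illustration in plain Python rather than JMP: the data are made-up values (not fastfoodchain.jmp), and solve and least_squares are hypothetical helpers that fit the model through the normal equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares(rows, ys):
    """Least squares coefficients (intercept first); each row holds one unit's x-values."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: two explanatory variables and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
y = [2.1, 3.9, 7.2, 8.1, 12.5, 17.0, 13.8, 19.4]

# Multiple regression of y on x1 and x2, then the residuals.
coef = least_squares(list(zip(x1, x2)), y)
residuals = [yi - (coef[0] + coef[1] * a + coef[2] * b) for yi, a, b in zip(y, x1, x2)]

# Regress the residuals on x1 alone, as Fit Y by X would.
intercept, slope = least_squares([(a,) for a in x1], residuals)
print(f"intercept={intercept:.2e}, slope={slope:.2e}")
```

Both printed values should be zero up to floating-point rounding, which is why the fitted line in the residual plot is always flat. A problem with linearity shows up as curvature around that flat line, not as a tilt.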
Residual by Predicted Plot • Fit Model displays the Residual by Predicted Plot automatically in its output. • The plot is a plot of the residuals versus the predicted Y’s. We can think of the predicted Y’s as summarizing all the information in the X’s. As usual we would like this plot to show random scatter. • Pattern in the mean of the residuals as the predicted Y’s increase: Indicates problem with linearity. Look at residual plots versus each explanatory variable to isolate problem and consider transformations. • Pattern in the spread of the residuals: Indicates problem with constant variance.
Corrections for Violations of the Linearity Assumption • When the residual plot shows a pattern in the mean of the residuals for one of the explanatory variables Xj, we should consider: • Transforming the Xj. • Adding polynomial variables in Xj. • Transforming Y. • After making the transformation/adding polynomials, we need to refit the model and look at the new residual plot vs. Xj to see if linearity has been achieved.
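The refit-and-recheck loop can be illustrated numerically by averaging the residuals within the low, middle, and high thirds of X, a rough stand-in for looking at the mean of the residual plot. The sketch below assumes made-up data in which Y is linear in log(X); the helper names are hypothetical, and a single explanatory variable is used to keep the example short.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mean_residual_by_thirds(xs, ys):
    """Fit ys on xs, then average the residuals in the low, middle, and high thirds of xs."""
    b0, b1 = fit_line(xs, ys)
    res = [y - (b0 + b1 * x) for x, y in sorted(zip(xs, ys))]
    k = len(res) // 3
    return [sum(part) / len(part) for part in (res[:k], res[k:2 * k], res[2 * k:])]

# Hypothetical data: y is linear in log(x) plus a small alternating wiggle.
x = [float(i) for i in range(1, 13)]
y = [3.0 + 2.0 * math.log(v) + (0.05 if i % 2 else -0.05) for i, v in enumerate(x)]

# Before: regress y on x. The residual means curve (negative, positive, negative).
before = mean_residual_by_thirds(x, y)

# After: transform x to log(x), refit, and check the residual means again.
after = mean_residual_by_thirds([math.log(v) for v in x], y)

print("y on x:     ", [round(m, 3) for m in before])
print("y on log(x):", [round(m, 3) for m in after])
```

A negative-positive-negative pattern in the first line is the numeric counterpart of a curved residual plot. After the transformation the three means sit close to zero, which is what random scatter looks like in this summary.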
Checking Constant Variance Assumption • Residual plot versus explanatory variables should exhibit constant variance. • Residual plot versus predicted values should exhibit constant variance (this plot is often most useful for detecting nonconstant variance)
Heteroscedasticity • When the requirement of a constant variance is violated we have a condition of heteroscedasticity. • Diagnose heteroscedasticity by plotting the residual against the predicted y. [Figure: residuals plotted against the predicted y (ŷ); the spread increases with ŷ.]
How much traffic would a building generate? • The goal is to predict how much traffic will be generated by a proposed new building of 150,000 occupied sq ft. (Data are from the MidAtlantic States City Planning Manual.) • The data tell how many automobile trips per day were made in the AM to office buildings of different sizes. • The variables are X = “Occupied Sq Ft of floor space in the building (in 1000 sq ft)” and Y = “number of automobile trips arriving at the building per day in the morning”.
Reducing Nonconstant Variance/Nonnormality by Transformations • A brief list of transformations: • y’ = y^(1/2) (for y > 0): Use when the error variance (s_ε²) increases with the predicted y (ŷ). • y’ = log y (for y > 0): Use when the error standard deviation (s_ε) increases with ŷ, or when the error distribution is skewed to the right. • y’ = y²: Use when the error variance (s_ε²) is decreasing with ŷ, or when the error distribution is left skewed.
To try to fix heteroscedasticity we transform Y to Log(Y). This fixes hetero… BUT it creates a nonlinear pattern.
To fix nonlinearity we now transform x to Log(x), without changing the Y axis anymore. The resulting pattern is both satisfactorily homoscedastic AND linear.
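A rough numeric version of this diagnosis compares the spread of the residuals in the upper half of the data with the spread in the lower half, before and after taking logs. The sketch below uses made-up data with a multiplicative error, not the traffic data from the planning manual; spread_ratio is a hypothetical helper, and the split into halves is an assumption made for illustration.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rms(values):
    """Root mean square of a list of residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def spread_ratio(xs, ys):
    """RMS residual in the upper half of xs divided by RMS residual in the lower half."""
    b0, b1 = fit_line(xs, ys)
    res = [y - (b0 + b1 * x) for x, y in sorted(zip(xs, ys))]
    half = len(res) // 2
    return rms(res[half:]) / rms(res[:half])

# Hypothetical data: trips = 2 * sqft * exp(error), so the error is multiplicative.
errors = [0.2, -0.1, 0.0, 0.1, -0.2]
sqft = [float(i) for i in range(1, 21)]
trips = [2.0 * s * math.exp(errors[i % 5]) for i, s in enumerate(sqft)]

# Original scale: the spread of the residuals grows with building size.
ratio_raw = spread_ratio(sqft, trips)

# Log y on log x: the spread is roughly the same in both halves.
ratio_log = spread_ratio([math.log(s) for s in sqft], [math.log(t) for t in trips])

print(f"spread ratio, y on x: {ratio_raw:.2f}")
print(f"spread ratio, log y on log x: {ratio_log:.2f}")
```

A ratio well above 1 on the original scale corresponds to the fan shape in the residual plot. A ratio near 1 after the transformation corresponds to a band of roughly constant width.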
Often we will plot residuals versus predicted. For simple regression the two residual plots (residuals versus x and residuals versus predicted y) are equivalent.