Recall what we did on Exercise #1 on Class Handout #7:
1. The data stored in the SPSS data file realestate is to be used in a study concerning the prediction of sale price of a residential property (dollars). Appraised land value (dollars), appraised value of improvements (dollars), and area of property living space (square feet) are to be considered as possible predictors, and the 20 properties selected for the data set are a random sample.
Does the data appear to be observational or experimental?
(a) Since the land value, improvement value, and area are all random, the data is observational.
In the document titled Using SPSS Version 19.0, use SPSS with the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to do each of the following:
(b) Follow the instructions in the first six steps to graph the least squares line on a scatter plot for the dependent variable with each quantitative independent variable; then decide whether or not the linearity assumption appears to be satisfied.
For each of the quantitative predictors, the relationship looks reasonably linear, since the data points appear randomly distributed around the least squares line.
1(b)-continued Continue to follow the instructions beginning with the 8th step (notice that step 7 is not necessary here) down to the 15th step to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied. There appears to be much variation, but it looks reasonably uniform.
The histogram of standardized residuals looks somewhat non-normal, and the points on the normal probability plot seem to depart somewhat from the diagonal line.
1.-continued Based on the histogram and normal probability plot for the standardized residuals in part (b), explain why we might want to look at the skewness coefficient, the kurtosis coefficient, and the results of the Shapiro-Wilk test. Then use SPSS with the section titled Data Diagnostics to make a statement about whether or not non-normality needs to be a concern.
(c) Since there appears to be some possible evidence of non-normality in part (b), we want to know if non-normality needs to be a concern. Since the skewness and kurtosis coefficients are each well within two standard errors of zero, and p = 0.166 in the Shapiro-Wilk test is not less than 0.001, non-normality need not be a concern in the regression.
From the Correlations table of the SPSS output, comment on the possibility of multicollinearity in the multiple regression.
(d) Since the correlation matrix does not contain any correlation greater than 0.8 for any pair of independent variables, there is no indication that multicollinearity will be a problem.
(e) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table.
Since $f_{3,16} = 46.662$ and $f_{3,16;\,0.05} = 3.24$, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that at least one coefficient in the linear regression to predict sale price from land value, improvements value, and area is different from zero; that is, the linear regression is significant (p < 0.001).
(f) Use the SPSS output to find the least squares regression equation.
$\widehat{\text{sale\_prc}} = 1470.276 + 0.814(\text{land\_val}) + 0.820(\text{impr\_val}) + 13.529(\text{area})$
1.-continued In the document titled Using SPSS Version 19.0, use SPSS with the five instructions at the end of the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to obtain the output for a stepwise regression. (g)
1.-continued From the Collinearity Statistics section of the Coefficients table of the SPSS output, add to the comment on the possibility of multicollinearity in the multiple regression.
(h) We see that tolerance > 0.10 (i.e., VIF < 10) for each independent variable, which is a further indication that multicollinearity will not be a problem.
From the Variables Entered/Removed table of the SPSS output, find the default values of the significance level to enter an independent variable into the model and the significance level to remove an independent variable from the model.
(i) Respectively, these are 0.05 and 0.10.
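The tolerance rule quoted in part (h) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the tolerance values in the loop are made up for the example (the actual tolerances from the SPSS output are not reproduced in these notes). The only relationship used is that VIF is the reciprocal of tolerance, so tolerance > 0.10 is the same condition as VIF < 10.

```python
def vif_from_tolerance(tolerance: float) -> float:
    """Variance inflation factor is the reciprocal of tolerance."""
    if not 0.0 < tolerance <= 1.0:
        raise ValueError("tolerance must lie in (0, 1]")
    return 1.0 / tolerance


def flags_multicollinearity(tolerance: float, cutoff: float = 0.10) -> bool:
    """True when tolerance is at or below the cutoff (equivalently VIF >= 1/cutoff)."""
    return tolerance <= cutoff


# Hypothetical tolerance values, for illustration only (not from the SPSS output).
for tol in (0.45, 0.10, 0.05):
    print(tol, round(vif_from_tolerance(tol), 2), flags_multicollinearity(tol))
```

The 0.10 cutoff is the rule of thumb stated in part (h); other references use different cutoffs, so it is left as a parameter here.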
From the Variables Entered/Removed table of the SPSS output, find the number of steps in the stepwise multiple regression, and list the independent variables selected and removed at each step. (j) There were two steps in the stepwise multiple regression; the variable “appraised value of improvements” was entered in the first step, and the variable “area of property living space” was entered in the second step. No variables were removed at either step.
From the Correlations table of the SPSS output, find the ordinary correlation between the dependent variable sale price and the first independent variable entered into the model.
(k) The correlation between sale price and appraised value of improvements is 0.916.
The ordinary (Pearson) correlation between two variables has been defined previously as a measure of the strength of the linear relationship. A partial correlation is a measure of the strength of the linear relationship between two variables given one or more other variables. The partial correlation between Y and a specific Xi, given all of the Xs already in the model, is one basis for deciding whether that Xi should be added to the model after them. As an example, consider how, among school children, the correlation between grip strength and height would differ from the partial correlation between grip strength and height given age.
Note: The text states that a standardized regression coefficient βi is a partial correlation, but this is not correct.
1.-continued From the Excluded Variables table of the SPSS output, find the partial correlation between the dependent variable sale price and the second independent variable entered into the model given the first independent variable entered into the model; compare this to the ordinary correlation between the dependent variable sale price and the second independent variable entered into the model, which can be found from the Correlations table of the SPSS output.
(l) The partial correlation between sale price and area of property living space given appraised value of improvements is 0.515. The ordinary correlation between sale price and area of property living space is 0.849.
A more detailed analysis of multiple regression data asks whether adding one or more independent variables (factors) to a multiple regression model that already contains one or more independent variables (factors) is statistically significant. When one or more independent variables (factors) are added to a multiple regression model, then (1) the total sum of squares remains the same, (2) the regression sum of squares increases (or, at worst, stays the same), (3) the error sum of squares decreases (or, at worst, stays the same).
The multiple R square is
$R^2 = \dfrac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}}$,
which is the proportion (often converted to a percentage) of variation in the dependent variable Y accounted for by (or explained by) all of the independent variables X1, X2, …, Xk. The (positive) square root of R2 is sometimes called the multiple correlation coefficient. However, only the strength of the relationship between Y and more than one predictor can be considered; the direction of a relationship (positive or negative) can only be considered between Y and one predictor.
When one or more independent variables (factors) are added to a multiple regression model, the total sum of squares remains the same and the regression sum of squares does not decrease, so the value of R2 cannot decrease and in practice almost always increases.
From the Model Summary table of the SPSS output, find and interpret the change(s) in R2 from the model at one step to the next step. (m) From the model at Step 1, we see that “appraised value of improvements” accounts for 83.8% of the variance in “sale price”. From the model at Step 2, we see that “appraised value of improvements” and “area of property living space” together account for 88.1% of the variance in “sale price”. With “appraised value of improvements” already in the model, “area of property living space” accounts for an additional 4.3% of the variance in “sale price”.
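The change in R2 in part (m) can be tied back to the partial correlation in part (l). A minimal sketch, assuming the standard identity for a single added predictor (squared partial correlation = increase in R2 divided by the proportion of variance left unexplained by the smaller model). The two R2 values are the rounded figures from part (m); the function names are hypothetical.

```python
import math

# Rounded R^2 values quoted in part (m).
r2_step1 = 0.838  # impr_val only
r2_step2 = 0.881  # impr_val and area


def r2_change(r2_reduced: float, r2_full: float) -> float:
    """Additional proportion of variance explained by the larger model."""
    return r2_full - r2_reduced


def partial_r_from_r2(r2_reduced: float, r2_full: float) -> float:
    """Magnitude of the partial correlation between Y and ONE added predictor,
    given the predictors already in the reduced model."""
    return math.sqrt((r2_full - r2_reduced) / (1.0 - r2_reduced))


change = r2_change(r2_step1, r2_step2)
partial = partial_r_from_r2(r2_step1, r2_step2)
print(f"R^2 change: {change:.3f}")
print(f"|partial r|: {partial:.3f}")
```

With these rounded inputs the second value comes out at about 0.515, which agrees with the partial correlation read from the Excluded Variables table in part (l). The formula gives only the magnitude; the sign is taken from the sign of the added predictor's coefficient.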
1.-continued From the Coefficients table of the SPSS output, write the estimated regression equation for each step.
(n) Step 1: $\widehat{\text{sale\_prc}} = 8945.575 + 1.351(\text{impr\_val})$
Step 2: $\widehat{\text{sale\_prc}} = 97.521 + 0.960(\text{impr\_val}) + 16.373(\text{area})$
1.-continued Use the estimated regression equation from the final step of the stepwise multiple regression to predict the sale price of a residential property where the appraised land value is $8000, the appraised value of improvements is $20,000, and area of property living space is 1200 square feet.
(o) 97.521 + 0.960(20000) + 16.373(1200) = $38,945.12 (appraised land value is not in the final model, so it does not enter the prediction).
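The arithmetic in part (o) can be checked with a short Python sketch. The coefficients are those of the Step 2 equation from part (n); the function name is hypothetical and is not part of the SPSS workflow.

```python
def predict_sale_price(impr_val: float, area: float) -> float:
    """Step 2 stepwise equation: sale_prc_hat = 97.521 + 0.960*impr_val + 16.373*area.

    Appraised land value was not selected by the stepwise procedure,
    so it is not an argument here.
    """
    return 97.521 + 0.960 * impr_val + 16.373 * area


price = predict_sale_price(impr_val=20000, area=1200)
print(f"${price:,.2f}")
```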
For each of the estimated regression coefficients in the estimated regression equation from the final step of the stepwise multiple regression, write a one sentence interpretation describing what the coefficient estimates. (p) For each increase of one dollar in appraised value of improvements, the sale price increases on average by about $0.96. For each increase of one square foot in area of property living space, the sale price increases on average by about $16.37.
2. A company conducts a study to see how diastolic blood pressure is influenced by an employee’s age, weight, and job stress level classified as high stress, some stress, and low stress. Data recorded on 24 employees treated as a random sample is displayed below. The data has been stored in the SPSS data file jobstress.
Job Stress   Age (years)   Weight (lbs.)   Diastolic Blood Pressure
High         23            208             102
High         43            215             126
High         34            175             110
High         65            162             124
High         39            197             120
High         35            160             113
High         29            100             81
High         25            188             100
Some         38            164             97
Some         19            173             93
Some         24            209             92
Some         32            150             93
Some         47            209             120
Some         54            212             115
Some         57            112             93
Some         43            215             116
Low          61            162             103
Low          27            116             81
List the independent variables, and indicate whether each is quantitative or qualitative.
(a) age: quantitative; weight: quantitative; job stress level: qualitative.
Define dummy variables to represent each qualitative independent variable. (b)
A dummy (indicator) variable is one defined to be 1 if a given condition is satisfied and 0 otherwise. Suppose a qualitative-dichotomous variable is to be used in a regression model to predict a dependent variable Y. If we label the categories (levels) of the qualitative-dichotomous variable as #1 and #2, then this variable can be represented by defining an appropriate dummy variable, such as
$X = \begin{cases} 1 & \text{for category \#1} \\ 0 & \text{for category \#2} \end{cases}$
A regression equation to predict Y from X can be written as $\hat{Y} = a + bX$.
When X = 0, the predicted value for Y is $\hat{Y} = a + b(0) = a$.
When X = 1, the predicted value for Y is $\hat{Y} = a + b(1) = a + b$.
b = amount by which predicted Y for category #1 exceeds predicted Y for category #2.
Suppose a qualitative variable with 3 categories (levels) is to be used in a regression model to predict a dependent variable Y. If we label the categories as #1, #2, and #3, then this qualitative variable can be represented by defining two appropriate dummy variables, such as
$X_1 = \begin{cases} 1 & \text{for category \#1} \\ 0 & \text{otherwise} \end{cases}$ and $X_2 = \begin{cases} 1 & \text{for category \#2} \\ 0 & \text{otherwise} \end{cases}$
A regression equation to predict Y from X1 and X2 can be written as $\hat{Y} = a + b_1X_1 + b_2X_2$.
When X1 = 0 and X2 = 0, the predicted value for Y is $\hat{Y} = a + b_1(0) + b_2(0) = a$.
When X1 = 1 and X2 = 0, the predicted value for Y is $\hat{Y} = a + b_1(1) + b_2(0) = a + b_1$.
When X1 = 0 and X2 = 1, the predicted value for Y is $\hat{Y} = a + b_1(0) + b_2(1) = a + b_2$.
b1 = amount by which predicted Y for category #1 exceeds predicted Y for category #3.
b2 = amount by which predicted Y for category #2 exceeds predicted Y for category #3.
In practice, which categories are associated with which dummy variables does not matter. The category which is not associated with any dummy variable is sometimes called the reference group (since each coefficient in the regression model represents a difference in mean when this group is compared to one other group).
Suppose a qualitative variable with k categories (levels) is to be used in a regression model to predict a dependent variable Y. If we label the categories as #1, #2, …, #k, then this qualitative variable can be represented by defining k − 1 appropriate dummy variables, such as
$X_1 = \begin{cases} 1 & \text{for category \#1} \\ 0 & \text{otherwise} \end{cases}, \quad \ldots, \quad X_{k-1} = \begin{cases} 1 & \text{for category \#}(k-1) \\ 0 & \text{otherwise} \end{cases}$
A regression equation to predict Y from $X_1, X_2, \ldots, X_{k-1}$ can be written as $\hat{Y} = a + b_1X_1 + b_2X_2 + \ldots + b_{k-1}X_{k-1}$.
When $X_1 = X_2 = \ldots = X_{k-1} = 0$, the predicted value for Y is $\hat{Y} = a$.
When $X_1 = 1$ and $X_2 = \ldots = X_{k-1} = 0$, the predicted value for Y is $\hat{Y} = a + b_1$.
When $X_2 = 1$ and $X_1 = X_3 = \ldots = X_{k-1} = 0$, the predicted value for Y is $\hat{Y} = a + b_2$, etc.
$b_i$ = amount by which predicted Y for category #i exceeds predicted Y for category #k.
When a qualitative variable has k categories (levels) with k > 2, the k − 1 dummy variables $X_1, X_2, \ldots, X_{k-1}$ are treated as a group so that either all of them are included in the model or none of them are included in the model. An alternative approach (used in the textbook) is to define one more dummy variable $X_k$ corresponding to category #k, and to treat the k dummy variables as separate, individual variables.
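The k − 1 dummy coding described above can be sketched in Python as follows. This is an illustration only, not the SPSS recoding procedure: the function name and argument names are hypothetical, and the three job stress levels from Exercise #2 are used as the example categories, with low stress chosen arbitrarily as the reference group.

```python
def dummy_code(value, categories, reference):
    """Return k-1 indicator values (0/1), one per non-reference category,
    in the order the categories are listed."""
    if reference not in categories:
        raise ValueError("reference must be one of the categories")
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories if c != reference]


# Example categories: the three job stress levels, reference group = "low".
levels = ["high", "some", "low"]
for v in levels:
    print(v, dummy_code(v, levels, reference="low"))
```

An observation in the reference group gets 0 on every dummy variable, which is why its predicted value is just the constant a.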
A multiple regression model where quantitative dependent variable Y is predicted from only dummy variables which represent one qualitative variable is a model for a one-way analysis of variance (ANOVA). Now return to Exercise #2: Define dummy variables to represent each qualitative independent variable. (b)
Data, continued:
Job Stress   Age (years)   Weight (lbs.)   Diastolic Blood Pressure
Low          40            142             83
Low          26            116             81
Low          36            160             93
Low          50            212             109
Low          59            201             116
Low          49            217             110
$X_1 = \begin{cases} 1 & \text{for high stress job} \\ 0 & \text{otherwise} \end{cases}$
$X_2 = \begin{cases} 1 & \text{for some stress job} \\ 0 & \text{otherwise} \end{cases}$
$X_3 = \begin{cases} 1 & \text{for low stress job} \\ 0 & \text{otherwise} \end{cases}$
Any two of these dummy variables are sufficient to represent the qualitative independent variable job stress level.
In the document titled Using SPSS Version 19.0, use SPSS with the section titled Creating new variables by recoding existing variables to recode the variable jobtype into the first dummy variable in part (b); then repeat this for the other dummy variable(s) in part (b). (c)
2.-continued In the document titled Using SPSS Version 19.0, use SPSS with the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to do each of the following: (d) Follow the instructions in the first six steps to graph the least squares line on a scatter plot for the dependent variable with each quantitative independent variable; then decide whether or not the linearity assumption appears to be satisfied. For each of the quantitative predictors, the relationship looks reasonably linear, since the data points appear randomly distributed around the least squares line.
Continue to follow the instructions beginning with the 8th step (notice that step 7 was already done in part (c)) down to the 15th step to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied. The variation looks reasonably uniform.
2(d)-continued The histogram of standardized residuals looks somewhat non-normal, and the points on the normal probability plot seem to depart somewhat from the diagonal line.
Based on the histogram and normal probability plot for the standardized residuals in part (d), explain why we might want to look at the skewness coefficient, the kurtosis coefficient, and the results of the Shapiro-Wilk test. Then use SPSS with the section titled Data Diagnostics to make a statement about whether or not non-normality needs to be a concern.
(e) Since there appears to be some possible evidence of non-normality in part (d), we want to know if non-normality needs to be a concern. Since the skewness and kurtosis coefficients are each well within two standard errors of zero, and p = 0.259 in the Shapiro-Wilk test is not less than 0.001, non-normality need not be a concern in the regression.
2.-continued From the Correlations table of the SPSS output, comment on the possibility of multicollinearity in the multiple regression.
(f) Since the correlation matrix does not contain any correlation greater than 0.8 for any pair of independent variables, there is no indication that multicollinearity will be a problem.
2.-continued (g) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table.
Since $f_{4,19} = 37.953$ and $f_{4,19;\,0.05} = 2.90$, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that at least one coefficient in the linear regression to predict diastolic blood pressure from age, weight, and job stress level is different from zero; that is, the linear regression is significant (p < 0.001).
(h) Use the SPSS output to find the least squares regression equation; then explain why not all of the dummy variables were included in the model.
$\widehat{\text{dbp}} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(X_2) - 14.260(X_3)$
The qualitative independent variable job stress level has 3 categories, and only 2 dummy variables are needed to represent a qualitative variable with 3 categories.
(i) Indicate how the least squares regression equation in part (h) describes a separate regression equation for each category of the qualitative independent variable.
The 2 dummy variables from part (b) that were included in the model are
$X_2 = \begin{cases} 1 & \text{for some stress job} \\ 0 & \text{otherwise} \end{cases}$ and $X_3 = \begin{cases} 1 & \text{for low stress job} \\ 0 & \text{otherwise} \end{cases}$
For the high stress job group, X2 = X3 = 0 so that the least squares regression equation is
$\widehat{\text{dbp}} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(0) - 14.260(0) = 47.999 + 0.584(\text{age}) + 0.228(\text{weight})$
For the some stress job group, X2 = 1 and X3 = 0 so that the least squares regression equation is
$\widehat{\text{dbp}} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(1) - 14.260(0) = 38.228 + 0.584(\text{age}) + 0.228(\text{weight})$
For the low stress job group, X2 = 0 and X3 = 1 so that the least squares regression equation is
$\widehat{\text{dbp}} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(0) - 14.260(1) = 33.739 + 0.584(\text{age}) + 0.228(\text{weight})$
We shall complete this exercise next class.
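The three group-specific equations in part (i) differ only in their constants, which a short Python sketch makes explicit. The coefficients are those of the fitted equation in part (h), with high stress as the reference group; the names below (`group_intercept`, `predict_dbp`) are hypothetical helpers for illustration, and the employee in the last line is made up.

```python
# Coefficients from the fitted equation in part (h). High stress is the
# reference group, so its dummy contribution is zero.
INTERCEPT = 47.999
B_AGE = 0.584
B_WEIGHT = 0.228
DUMMY_COEF = {"high": 0.0, "some": -9.771, "low": -14.260}


def group_intercept(stress: str) -> float:
    """Constant term of the regression equation for one job stress group."""
    return INTERCEPT + DUMMY_COEF[stress]


def predict_dbp(age: float, weight: float, stress: str) -> float:
    """Predicted diastolic blood pressure for a given age, weight, and stress group."""
    return group_intercept(stress) + B_AGE * age + B_WEIGHT * weight


for level in ("high", "some", "low"):
    print(f"{level}: dbp_hat = {group_intercept(level):.3f} "
          f"+ {B_AGE}(age) + {B_WEIGHT}(weight)")

# Hypothetical employee, for illustration only.
print(predict_dbp(age=40, weight=180, stress="some"))
```

Because age and weight have the same coefficients in every group, the three equations describe parallel planes that are shifted only by the dummy variable coefficients.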