Recall what we did on Exercise #2 on Class Handout #7:


2. A company conducts a study to see how diastolic blood pressure is influenced by an employee's age, weight, and job stress level classified as high stress, some stress, and low stress. Data recorded on 24 employees treated as a random sample is displayed below. The data has been stored in the SPSS data file jobstress.

Job Stress   Age (years)   Weight (lbs.)   Diastolic Blood Pressure
High         23            208             102
High         43            215             126
High         34            175             110
High         65            162             124
High         39            197             120
High         35            160             113
High         29            100              81
High         25            188             100
Some         38            164              97
Some         19            173              93
Some         24            209              92
Some         32            150              93
Some         47            209             120
Some         54            212             115
Some         57            112              93
Some         43            215             116
Low          61            162             103
Low          27            116              81
Low          40            142              83
Low          26            116              81
Low          36            160              93
Low          50            212             109
Low          59            201             116
Low          49            217             110

(a) List the independent variables, and indicate whether each is quantitative or qualitative.
age: quantitative
weight: quantitative
job stress level: qualitative

(b) Define dummy variables to represent each qualitative independent variable.
X1 = 1 for high stress job, 0 otherwise
X2 = 1 for some stress job, 0 otherwise
X3 = 1 for low stress job, 0 otherwise
Any two of these dummy variables are sufficient to represent the qualitative independent variable job stress level.

(c) In the document titled Using SPSS Version 19.0, use SPSS with the section titled Creating new variables by recoding existing variables to recode the variable jobtype into the first dummy variable in part (b); then repeat this for the other dummy variable(s) in part (b).
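The dummy coding in part (b) can be sketched outside SPSS as well. The following is a minimal illustration in Python, not the SPSS recode procedure itself; the helper name dummy_code and the three sample rows are chosen only for illustration (the rows are copied from the data table).

```python
# Hypothetical helper: code the three-level job stress factor as 0/1 dummies.
LEVELS = ("High", "Some", "Low")  # order matches X1, X2, X3 in part (b)


def dummy_code(stress):
    """Return (X1, X2, X3) for one employee's job stress level."""
    if stress not in LEVELS:
        raise ValueError(f"unknown stress level: {stress!r}")
    return tuple(1 if stress == level else 0 for level in LEVELS)


# Three rows from the data: (job stress, age, weight, diastolic blood pressure)
rows = [("High", 23, 208, 102), ("Some", 38, 164, 97), ("Low", 61, 162, 103)]

for stress, age, weight, dbp in rows:
    x1, x2, x3 = dummy_code(stress)
    # Exactly one dummy equals 1, so X1 + X2 + X3 = 1 for every employee.
    # That is why any two dummies determine the third.
    assert x1 + x2 + x3 == 1
    print(stress, x1, x2, x3)
```

Because the three dummies always sum to 1, including all three alongside an intercept would make the predictors perfectly collinear, which is why only two are needed.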

2.-continued In the document titled Using SPSS Version 19.0, use SPSS with the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to do each of the following: (d) Follow the instructions in the first six steps to graph the least squares line on a scatter plot for the dependent variable with each quantitative independent variable; then decide whether or not the linearity assumption appears to be satisfied. For each of the quantitative predictors, the relationship looks reasonably linear, since the data points appear randomly distributed around the least squares line.

Continue to follow the instructions beginning with the 8th step (notice that step 7 was already done in part (c)) down to the 15th step to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied. The variation looks reasonably uniform.

2(d)-continued The histogram of standardized residuals looks somewhat non-normal, and the points on the normal probability plot seem to depart somewhat from the diagonal line.

(e) Based on the histogram and normal probability plot for the standardized residuals in part (d), explain why we might want to look at the skewness coefficient, the kurtosis coefficient, and the results of the Shapiro-Wilk test. Then use SPSS with the section titled Data Diagnostics to make a statement about whether or not non-normality needs to be a concern.
Since there appears to be some possible evidence of non-normality in part (d), we want to know if non-normality needs to be a concern. Since the skewness and kurtosis coefficients are each well within two standard errors of zero, and p = 0.259 in the Shapiro-Wilk test is not less than 0.001, non-normality need not be a concern in the regression.

2.-continued
(f) From the Correlations table of the SPSS output, comment on the possibility of multicollinearity in the multiple regression.
Since the correlation matrix does not contain any correlation greater than 0.8 for any pair of independent variables, there is no indication that multicollinearity will be a problem.

2.-continued
(g) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table.
Since $f_{4,19} = 37.953$ and $f_{4,19;0.05} = 2.90$, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that the linear regression to predict diastolic blood pressure from age, weight, and job stress level is significant (p < 0.001); that is, at least one coefficient in the linear regression to predict diastolic blood pressure from age, weight, and job stress level is different from zero.

(h) Use the SPSS output to find the least squares regression equation; then explain why all the dummy variables were not included in the model.
$\widehat{dbp} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(X_2) - 14.260(X_3)$
The qualitative independent variable job stress level has 3 categories, and only 2 dummy variables are needed to represent a qualitative variable with 3 categories.

(i) Indicate how the least squares regression equation in part (h) describes a separate regression equation for each category of the qualitative independent variable.
The 2 dummy variables from part (b) that were included in the model are
X2 = 1 for some stress job, 0 otherwise
X3 = 1 for low stress job, 0 otherwise

For the high stress job group, X2 = X3 = 0 so that the least squares regression equation is
$\widehat{dbp} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(0) - 14.260(0) = 47.999 + 0.584(\text{age}) + 0.228(\text{weight})$

For the some stress job group, X2 = 1 and X3 = 0 so that the least squares regression equation is
$\widehat{dbp} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(1) - 14.260(0) = 38.228 + 0.584(\text{age}) + 0.228(\text{weight})$

For the low stress job group, X2 = 0 and X3 = 1 so that the least squares regression equation is
$\widehat{dbp} = 47.999 + 0.584(\text{age}) + 0.228(\text{weight}) - 9.771(0) - 14.260(1) = 33.739 + 0.584(\text{age}) + 0.228(\text{weight})$
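The arithmetic behind the three group equations can be checked with a short sketch. The coefficients are copied from part (h); the names group_intercept and intercepts are illustrative only and do not come from SPSS.

```python
# Coefficients from the least squares equation in part (h).
B0, B_AGE, B_WEIGHT = 47.999, 0.584, 0.228
B_X2, B_X3 = -9.771, -14.260  # some stress and low stress dummies


def group_intercept(x2, x3):
    """Intercept of the group-specific equation for given dummy values."""
    return B0 + B_X2 * x2 + B_X3 * x3


# High stress is the reference group: both dummies are 0.
intercepts = {
    "high": group_intercept(0, 0),
    "some": group_intercept(1, 0),
    "low": group_intercept(0, 1),
}

for group, b in intercepts.items():
    # The age and weight slopes are shared; only the intercept shifts.
    print(f"{group}: dbp_hat = {b:.3f} + {B_AGE}(age) + {B_WEIGHT}(weight)")
```

The three equations are parallel: the dummy coefficients move the intercept relative to the high stress group while leaving the slopes for age and weight unchanged.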

2.-continued
(j) In the document titled Using SPSS Version 19.0, use SPSS with the five instructions at the end of the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to obtain the output for a stepwise regression.
(k) From the Collinearity Statistics section of the Coefficients table of the SPSS output, add to the comment on the possibility of multicollinearity in the multiple regression.
We see that tolerance > 0.10 (i.e., VIF < 10) for each independent variable, which is a further indication that multicollinearity will not be a problem.
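The equivalence between the two cutoffs quoted in part (k) follows from the standard definition VIF = 1/tolerance, where the tolerance of a predictor is 1 minus the R-squared from regressing it on the other predictors. A minimal sketch follows; the function names are illustrative, and the value 0.85 is a made-up tolerance used only as an example, not a value from the SPSS output.

```python
def tolerance_from_r2(r2_j):
    """Tolerance of predictor j, given R^2 from regressing it on the others."""
    return 1.0 - r2_j


def vif_from_tolerance(tolerance):
    """Variance inflation factor is the reciprocal of the tolerance."""
    if not 0 < tolerance <= 1:
        raise ValueError("tolerance must be in (0, 1]")
    return 1.0 / tolerance


# The cutoff tolerance = 0.10 corresponds exactly to VIF = 10.
cutoff_vif = vif_from_tolerance(0.10)

# Hypothetical tolerance (NOT from the handout's output): 0.85.
example_vif = vif_from_tolerance(0.85)
print(cutoff_vif, round(example_vif, 3))
```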

From the Variables Entered/Removed table of the SPSS output, find the default values of the significance level to enter an independent variable into the model and the significance level to remove an independent variable from the model. (l) Respectively these are 0.05 and 0.10.

From the Variables Entered/Removed table of the SPSS output, find the number of steps in the stepwise multiple regression, and list the independent variables selected and removed at each step. (m) There were three steps in the stepwise multiple regression; the variable weight was entered in the first step, the variable age was entered in the second step, and the dummy variable X1 was entered in the third step. No variables were removed at any step.

2.-continued From the Correlations table of the SPSS output, find the ordinary correlation between the dependent variable diastolic blood pressure and the first independent variable entered into the model. (n) The correlation between diastolic blood pressure and weight is 0.727.

(o) From the Excluded Variables table of the SPSS output, find the partial correlation between the dependent variable diastolic blood pressure and the second independent variable entered into the model given the first independent variable entered into the model; compare this to the ordinary correlation between the dependent variable diastolic blood pressure and the second independent variable entered into the model, which can be found from the Correlations table of the SPSS output.
The partial correlation between diastolic blood pressure and age given weight is 0.633. The ordinary correlation between diastolic blood pressure and age is 0.561.

(p) From the Model Summary table of the SPSS output, find and interpret the change(s) in $R^2$ from the model at one step to the next step.
From the model at Step 1, we see that weight accounts for 52.8% of the variance in diastolic blood pressure.
From the model at Step 2, we see that weight and age together account for 71.8% of the variance in diastolic blood pressure. With weight already in the model, age accounts for an additional 19.0% of the variance in diastolic blood pressure.
From the model at Step 3, we see that weight, age, and the indicator variable for a high stress job together account for 87.3% of the variance in diastolic blood pressure. With weight and age already in the model, the indicator variable for a high stress job accounts for an additional 15.5% of the variance in diastolic blood pressure.
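The changes in R-squared, and their link to the correlations reported above, can be checked with a short sketch. It uses only the rounded values quoted in the answers, so agreement is to about two decimal places; the variable names are illustrative. The second check relies on the standard identity that the squared partial correlation equals the added R-squared divided by the variance left unexplained by the earlier model.

```python
import math

# R^2 at Steps 1, 2, 3 of the stepwise regression (rounded, from the answers).
r2_by_step = [0.528, 0.718, 0.873]

# Change in R^2 at each step (the first change is from an empty model).
changes = [round(b - a, 3) for a, b in zip([0.0] + r2_by_step, r2_by_step)]
print(changes)

# Step 1 has one predictor, so its R^2 is the squared ordinary correlation.
r_dbp_weight = 0.727
print(round(r_dbp_weight ** 2, 3))

# Squared partial correlation of dbp and age given weight:
# (R^2 at Step 2 - R^2 at Step 1) / (1 - R^2 at Step 1).
partial_r = math.sqrt((r2_by_step[1] - r2_by_step[0]) / (1 - r2_by_step[0]))
print(round(partial_r, 3))  # close to the reported 0.633, up to rounding
```

This also explains why the partial correlation (0.633) can exceed the ordinary correlation (0.561): it measures the share of the variance not already explained by weight, which is a smaller base.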

2.-continued
(q) From the Coefficients table of the SPSS output, write the estimated regression equation for each step.
Step 1: $\widehat{dbp} = 54.266 + 0.280(\text{weight})$
Step 2: $\widehat{dbp} = 40.653 + 0.249(\text{weight}) + 0.478(\text{age})$
Step 3: $\widehat{dbp} = 35.279 + 0.238(\text{weight}) + 0.559(\text{age}) + 11.871(X_1)$

(r) For each of the estimated regression coefficients in the estimated regression equation from the final step of the stepwise multiple regression, write a one sentence interpretation describing what the coefficient estimates.
For each increase of one pound in weight, diastolic blood pressure increases on average by about 0.238.
For each increase of one year in age, diastolic blood pressure increases on average by about 0.559.
On average, diastolic blood pressure is about 11.871 greater for employees with a high stress job than for employees with other jobs.

(s) Indicate how the estimated regression equation from the final step of the stepwise multiple regression to predict diastolic blood pressure describes separate regression equations for different groups.
In the final step of the stepwise multiple regression, the only dummy variable in the regression equation is
X1 = 1 for high stress job, 0 otherwise
This suggests a statistically significant difference between the high stress job group and the other two groups combined but no statistically significant difference between the some stress job group and the low stress job group.

For the high stress job group, X1 = 1 so that the least squares regression equation is
$\widehat{dbp} = 35.279 + 0.238(\text{weight}) + 0.559(\text{age}) + 11.871(1) = 47.150 + 0.238(\text{weight}) + 0.559(\text{age})$

For the low stress or some stress job group, X1 = 0 so that the least squares regression equation is
$\widehat{dbp} = 35.279 + 0.238(\text{weight}) + 0.559(\text{age}) + 11.871(0) = 35.279 + 0.238(\text{weight}) + 0.559(\text{age})$

2.-continued
(t) Use the estimated regression equation from the final step of the stepwise multiple regression to predict diastolic blood pressure for each of the two following employees:
a 35-year-old employee weighing 180 pounds and having a high stress job
$\widehat{dbp} = 35.279 + 0.238(180) + 0.559(35) + 11.871(1) = 109.555$
a 35-year-old employee weighing 180 pounds and having a low stress or some stress job
$\widehat{dbp} = 35.279 + 0.238(180) + 0.559(35) + 11.871(0) = 97.684$

3. Read the “INTRODUCTION” and “MULTIPLE LINEAR REGRESSION ANALYSIS” sections of Chapter 4. Open the version of the SPSS data file Job Satisfaction that was saved after Exercise #9 on Class Handout #5.
(a) In the “PRACTICAL EXAMPLE” section, read the discussion for assumptions number 1 to 6 in the subsection “Hypothesis Testing”; then, use the Analyze> Descriptive Statistics> Explore options in SPSS to obtain Figure 4.1 and Table 4.1, and use the Graphs> Legacy Dialogs> Scatter/Dot options in SPSS to obtain Figure 4.2. (The other tables and figures displayed in this subsection can be obtained from work to be done in the subsection which follows.)

(b) In the “PRACTICAL EXAMPLE” section, read the discussion for assumptions number 7 and 8 in the subsection “Hypothesis Testing” and the remaining portion of the subsection; then, use the Transform> Recode into Different Variables options in SPSS to create the dummy variables discussed with regard to assumption number 7. Compare the syntax file commands generated by the output with those shown on page 110 of the textbook. (c) In the “PRACTICAL EXAMPLE” section, read the subsection “How to Use SPSS to Compute Multiple Regression Coefficients”, and follow the instructions with SPSS, which should produce much of the output displayed in Table 4.2 to Table 4.12 and in Figures 4.3 and 4.4. Compare the syntax file commands generated by the output with those shown on page 116 of the textbook. Read the remaining portion of Chapter 4.

4. Exercise #6 on Class Handout #6 concerned a study of the impact of temperature during the summer months on the maximum amount of power that must be generated to meet demand each day. Data for the prediction of daily peak power load (megawatts) from daily high temperature (degrees Fahrenheit) was recorded for 25 randomly selected summer days and stored in the SPSS data file powerloads. From the graph of the least squares line on a scatter plot, displayed here on the right, the data points did not appear to be randomly distributed around the least squares line; as temperature increases, the power loads appear to increase at a faster rate. Since the linearity assumption does not appear to be satisfied, it is decided to consider the prediction of daily peak power load from both daily high temperature and the square of daily high temperature.

At the beginning of the handout, the three different types of independent variables that can be included in a multiple regression model are summarized:
A multiple regression model refers to a general equation which describes all of the independent variables from which the dependent variable Y is to be predicted. There are three types of independent variables that can be included in a multiple regression model:
(1) a quantitative independent variable which is not a function of any other independent variable(s), or
(2) a higher-order term which refers to a function of one or more other independent variable(s), or
(3) a dummy (indicator) variable with possible values 0 or 1 each representing one of the categories of a dichotomy.

(a) In the document titled Using SPSS Version 19.0, use SPSS with the section titled Creating new variables with transformation of existing variables to create a new variable named temp2 which is equal to the square of the variable temp.
(b) Both temp and temp2 will be included in the regression model. In the document titled Using SPSS Version 19.0, use SPSS with steps 8 to 15 in the section titled Performing a multiple linear regression with checks of linearity, homoscedasticity, and normality assumptions to create graphs for assessing whether or not the uniform variance (homoscedasticity) assumption and the normality assumption appear to be satisfied, and to generate the output for the linear regression. Then, decide whether or not each of these assumptions appears to be satisfied.
The variation looks reasonably uniform.

4(b)-continued The histogram of standardized residuals does not appear to depart too far from a normal curve, and the points on the normal probability plot do not seem to depart drastically from the diagonal line.

(c) With a 0.05 significance level, summarize the results of the f-test in the ANOVA table.
Since $f_{2,22} = 259.687$ and $f_{2,22;0.05} = 3.44$, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that the regression to predict daily peak power load from both daily high temperature and the square of daily high temperature is significant (p < 0.001).
(d) In order to test whether or not the addition of the temperature squared after temperature is significant, perform the following steps:

Use SPSS to obtain the ANOVA table when only temperature is in the model.

4(d)-continued
We shall let SSR(temp, temp2) = the regression sum of squares from the ANOVA table with both temperature and temperature squared in the model, let SSR(temp) = the regression sum of squares from the ANOVA table with only temperature in the model, and let MSE(temp, temp2) = the error mean square from the ANOVA table with both temperature and temperature squared in the model. Calculate the f statistic to decide if the addition of temperature squared to the model after temperature is statistically significant as follows:

$f = \dfrac{SSR(\text{temp}, \text{temp2}) - SSR(\text{temp})}{MSE(\text{temp}, \text{temp2})} = \dfrac{15011.772 - 13196.400}{28.904} = 62.807$

The numerator degrees of freedom for this f statistic is 1 (the number of variables being added to the model), and the denominator degrees of freedom is the same as the degrees of freedom associated with MSE(temp, temp2).

With a 0.05 significance level, summarize the results of this f-test.
Since $f_{1,22} = 62.807$ and $f_{1,22;0.05} = 4.30$, we have sufficient evidence to reject H0 at the 0.05 level. We conclude that the addition of squared daily high temperature after temperature in the model to predict daily peak power load is significant (p < 0.001).
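The f statistic for adding temperature squared can be reproduced with a few lines of arithmetic. This is a minimal sketch using the sums of squares quoted above; the function name partial_f and its num_added parameter are illustrative, and the critical value 4.30 is the tabled value given in the answer rather than something computed here.

```python
def partial_f(ssr_full, ssr_reduced, mse_full, num_added=1):
    """F statistic for adding num_added terms to a reduced regression model."""
    return ((ssr_full - ssr_reduced) / num_added) / mse_full


ssr_temp_temp2 = 15011.772  # SSR with temp and temp2 in the model
ssr_temp = 13196.400        # SSR with only temp in the model
mse_temp_temp2 = 28.904     # MSE with temp and temp2 in the model

f_stat = partial_f(ssr_temp_temp2, ssr_temp, mse_temp_temp2)
f_critical = 4.30  # f(1, 22; 0.05) from the f table, as quoted above

print(round(f_stat, 3), f_stat > f_critical)
```

With a single added term the numerator degrees of freedom is 1, so dividing by num_added leaves the numerator unchanged; the parameter only matters when several terms are added at once.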

(e) From the Coefficients table of the SPSS output, write the estimated regression equation for predicting daily peak power load from both daily high temperature and the square of daily high temperature.
$\widehat{powerload} = 385.048 - 8.293(\text{temperature}) + 0.060(\text{temperature})^2$

(f) Use the estimated regression equation from part (e) to predict the daily peak power load on a day when the high temperature is 75 degrees Fahrenheit, and also on a day when the high temperature is 85 degrees Fahrenheit.
$385.048 - 8.293(75) + 0.060(75)^2 = 100.573$ megawatts
$385.048 - 8.293(85) + 0.060(85)^2 = 113.643$ megawatts
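The two predictions in part (f) can be checked by evaluating the quadratic equation from part (e) directly. A minimal sketch follows; the function name predict_power_load is illustrative.

```python
def predict_power_load(temp_f):
    """Predicted daily peak power load (megawatts) at a high temperature in F."""
    return 385.048 - 8.293 * temp_f + 0.060 * temp_f ** 2


load_75 = predict_power_load(75)
load_85 = predict_power_load(85)
print(round(load_75, 3), round(load_85, 3))
```

Because the rounded coefficients are used, these values match the handout's answers, which are based on the same rounded equation.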

USE THE NEXT PROBLEM FOR HOMEWORK

2. This exercise makes use of the data stored in the SPSS data file firedam. The prediction of fire damage ($1000s) from distance (miles) from fire station is of interest, and the 15 fires selected for the data set are treated as a random sample for simple linear regression.
(a) Identify the dependent (response) variable and the independent (explanatory) variable for a regression analysis.
The dependent (response) variable is Y = “fire damage ($1000s)”, and the independent (explanatory) variable is X = “distance from station (miles)”.
(b) Does the data appear to be observational or experimental?
Since the distances are random, the data is observational (and the only way to collect experimental data would be to deliberately set fires!).
(c) Use SPSS to do calculations necessary for simple linear regression. (Part (c) of Class Exercise #1 on this handout can be used as a guide.)

2.-continued
(d) Use the SPSS output to find each of the following:
$n = 15$
$\bar{x} = 3.280$
$\bar{y} = 26.413$
$r = +0.961$
$SS_{xx} = (n - 1)s_x^2 = (14)(1.57625)^2 = 34.7839$
$SS_{yy} = (n - 1)s_y^2 = (14)(8.06898)^2 = 911.5181$
$SS_{xy} = r\sqrt{SS_{xx}\,SS_{yy}} = (0.961)\sqrt{(34.7839)(911.5181)} = 171.1178$

(e) Use the SPSS output to find the equation of the least squares line.
$\hat{\beta}_1 = 4.919$ and $\hat{\beta}_0 = 10.278$
The least squares line can be written $\hat{y} = 10.278 + 4.919x$.

(f) Write a one-sentence interpretation of the slope in the least squares line.
Fire damage appears to increase on average by about 4.919 thousand dollars ($4919) with each increase of one mile in distance from the fire station.

(g) Find the coefficient of determination, and write a one-sentence interpretation.
From the SPSS output, we find $r^2 = 0.923$, which is the square of r from part (d).
About 92.3% of the variation in fire damage is explained by distance.

(h) Find the estimated standard error of the regression.
From the SPSS output, we find $s = 2.31635$, which can also be calculated from parts (d)&(e) using
$s = \sqrt{\dfrac{SS_{yy} - \hat{\beta}_1 SS_{xy}}{n - 2}}$
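The quantities in parts (d) through (h) can be rebuilt from the summary statistics alone. The sketch below starts from the sample size, means, standard deviations, and correlation quoted in part (d); the variable names are illustrative. Since r is rounded to three decimals, the results agree with the SPSS values only to about two or three decimal places.

```python
import math

# Summary statistics from part (d).
n, x_bar, y_bar, r = 15, 3.280, 26.413, 0.961
s_x, s_y = 1.57625, 8.06898  # sample standard deviations of x and y

# Sums of squares, part (d).
ss_xx = (n - 1) * s_x ** 2
ss_yy = (n - 1) * s_y ** 2
ss_xy = r * math.sqrt(ss_xx * ss_yy)

# Least squares slope and intercept, part (e).
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# Coefficient of determination, part (g).
r_squared = r ** 2

# Estimated standard error of the regression, part (h).
s = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))

print(round(ss_xx, 4), round(ss_yy, 4), round(ss_xy, 4))
print(round(slope, 3), round(intercept, 3), round(r_squared, 3), round(s, 3))
```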

2.-continued
(i) A 0.05 significance level is chosen for a hypothesis test to see if there is any evidence that the linear relationship between distance from the fire station and fire damage is significant, that is, that the slope in the regression is significantly different from zero (0). Perform the test by completing the following table:

Step 1: $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, with $\alpha = 0.05$ (two sided).

Step 2: $t = \dfrac{\hat{\beta}_1 - 0}{s/\sqrt{SS_{xx}}} = \dfrac{4.919 - 0}{2.31635/\sqrt{34.7839}} = 12.525$
Find this test statistic value on the SPSS output and on the SAS output of Figure 3.26 in the textbook. The denominator $s/\sqrt{SS_{xx}}$ is the standard error of the estimated slope; find this value on the SPSS output and on the SAS output of Figure 3.26 in the textbook.

Step 3: The test statistic has a t distribution with df = 13. With $t_{0.025} = 2.160$, reject H0 if $t \le -2.160$ or $t \ge 2.160$. The p-value is p < 0.001 from the Student's t distribution table and p < 0.001 from the SPSS output.

Step 4: Since $t_{13} = 12.525$ and $t_{13;0.025} = 2.160$, we have sufficient evidence to reject H0. We conclude that the linear relationship between distance from the fire station and fire damage is significant (p < 0.001). The data suggest that the linear relationship is positive.

Considering the results of the hypothesis test, explain why a 95% confidence interval for the slope in the regression would be of interest. Then find and interpret the confidence interval.
Since rejecting H0 suggests that the hypothesized zero slope is not correct, a 95% confidence interval will provide us with some information about the slope, which estimates the average change in fire damage with an increase of one mile in distance from the fire station.

The limits of the confidence interval are
$\hat{\beta}_1 - t_{\alpha/2}\dfrac{s}{\sqrt{SS_{xx}}}$ and $\hat{\beta}_1 + t_{\alpha/2}\dfrac{s}{\sqrt{SS_{xx}}}$
With $1 - \alpha = 0.95$, $\alpha/2 = 0.025$, df = 13, and $t_{0.025} = 2.160$, these are
$4.919 - (2.160)\dfrac{2.31635}{\sqrt{34.7839}}$ and $4.919 + (2.160)\dfrac{2.31635}{\sqrt{34.7839}}$
that is, 4.071 and 5.767.
We are 95% confident that the slope in the regression to predict fire damage from distance from fire station is between 4.071 and 5.767 thousand dollars.
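The test statistic and the confidence interval for the slope both rest on the standard error of the estimated slope, so they can be computed together. A minimal sketch follows, using the values quoted in the exercise; the critical value 2.160 is taken from the t table as in the handout rather than computed, and the variable names are illustrative.

```python
import math

slope = 4.919      # estimated slope from the least squares line
s = 2.31635        # estimated standard error of the regression
ss_xx = 34.7839    # sum of squares for x
t_crit = 2.160     # t(0.025) with df = 13, from the t table

# Standard error of the estimated slope.
se_slope = s / math.sqrt(ss_xx)

# Test statistic for H0: slope = 0.
t_stat = (slope - 0) / se_slope

# 95% confidence interval for the slope.
ci_low = slope - t_crit * se_slope
ci_high = slope + t_crit * se_slope

print(round(t_stat, 3), round(ci_low, 3), round(ci_high, 3))
```

The interval excludes zero, which is consistent with rejecting H0 in the two-sided test at the 0.05 level.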

2.-continued
(j) An insurance company wants to obtain (i) a 95% confidence interval for the mean amount of fire damage among buildings located 3.5 miles from the fire station, and (ii) a 95% prediction interval for the amount of fire damage for a particular building located 3.5 miles from the fire station. Find and interpret each of these intervals.

For both intervals, $\hat{y} = 10.278 + 4.919(3.5) = 27.4945$ thousand dollars, $1 - \alpha = 0.95$, $\alpha/2 = 0.025$, df = 13, and $t_{0.025} = 2.160$.

(i) The limits of the confidence interval for the mean are
$\hat{y} - t_{\alpha/2}\, s\sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$ and $\hat{y} + t_{\alpha/2}\, s\sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$
$27.4945 - (2.160)(2.31635)\sqrt{\dfrac{1}{15} + \dfrac{(3.5 - 3.280)^2}{34.7839}}$ and $27.4945 + (2.160)(2.31635)\sqrt{\dfrac{1}{15} + \dfrac{(3.5 - 3.280)^2}{34.7839}}$
that is, 26.19 and 28.80.
We are 95% confident that the mean amount of fire damage among buildings located 3.5 miles from the fire station is between 26.19 and 28.80 thousand dollars.

(ii) The limits of the prediction interval are
$\hat{y} - t_{\alpha/2}\, s\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$ and $\hat{y} + t_{\alpha/2}\, s\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$
$27.4945 - (2.160)(2.31635)\sqrt{1 + \dfrac{1}{15} + \dfrac{(3.5 - 3.280)^2}{34.7839}}$ and $27.4945 + (2.160)(2.31635)\sqrt{1 + \dfrac{1}{15} + \dfrac{(3.5 - 3.280)^2}{34.7839}}$
that is, 22.32 and 32.67.
We are 95% confident that the amount of fire damage for a randomly selected building located 3.5 miles from the fire station will be between 22.32 and 32.67 thousand dollars.
OR
At least 95% of buildings located 3.5 miles from the fire station would have fire damage between 22.32 and 32.67 thousand dollars.
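The confidence interval for the mean and the prediction interval differ only by the extra 1 under the square root, which a short sketch makes explicit. The values are those quoted in the exercise; the function name intervals_at and the tuple layout of its return value are illustrative choices.

```python
import math

n, x_bar, ss_xx = 15, 3.280, 34.7839
s, t_crit = 2.31635, 2.160          # t(0.025) with df = 13, from the t table
b0, b1 = 10.278, 4.919              # least squares intercept and slope


def intervals_at(x_p):
    """Return (y_hat, mean CI, prediction interval) at distance x_p miles."""
    y_hat = b0 + b1 * x_p
    # Shared term: 1/n plus the squared distance of x_p from the mean of x.
    base = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    half_mean = t_crit * s * math.sqrt(base)        # for the mean response
    half_pred = t_crit * s * math.sqrt(1 + base)    # for one new building
    return (
        y_hat,
        (y_hat - half_mean, y_hat + half_mean),
        (y_hat - half_pred, y_hat + half_pred),
    )


y_hat, mean_ci, pred_int = intervals_at(3.5)
print(round(y_hat, 4))
print([round(v, 2) for v in mean_ci], [round(v, 2) for v in pred_int])
```

The prediction interval is always wider than the confidence interval at the same distance, because it accounts for the variation of an individual building around the mean as well as the uncertainty in the estimated line.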

2.-continued (k) Use SPSS to graph the least squares line on a scatter plot. (Part (l) of Class Exercise #1 on this handout can be used as a guide.)
