ALR 8.3 & 9.7 Alicia Hymel
Problem 8.3 This example compares in-field ultrasonic measurements of the depths of defects in the Alaska oil pipeline to measurements of the same defects in a laboratory. The lab measurements were done in six different batches. The goal is to decide whether the field measurement can be used to predict the more accurate lab measurement. In this analysis, the lab measurement is the response variable and the field measurement is the predictor variable. The three variables are called:
• Field, the in-field measurement
• Lab, the more accurate in-lab measurement
• Batch, the batch number
Problem 8.3.1 Draw the scatterplot of Lab versus Field and comment on the applicability of the simple linear regression model.
> fit831 <- lm(Lab ~ Field)
> plot(Field, Lab)
> abline(fit831, lty=2, col="red")
Problem 8.3.1 Draw the scatterplot of Lab versus Field and comment on the applicability of the simple linear regression model.
In general, a linear mean function appears to fit well. However, the variance is not constant: variability increases as Field and Lab increase. There also appears to be some grouping of the points in the plot, the cause of which is not obvious from this graph alone.
Problem 8.3.1 Draw the scatterplot of Lab versus Field and comment on the applicability of the simple linear regression model.
> abline(a=0, b=1, lty=1, col="blue")
Can the field measurements be used to predict the more accurate lab measurement? The Field measurements underestimate the Lab defect depths, especially beyond a depth of about 20. Still, as noted above, there is a clear linear relationship between the two measurement types.
Problem 8.3.1 Draw the scatterplot of Lab versus Field and comment on the applicability of the simple linear regression model.
> library(car)
> scatterplot(Lab ~ Field | Batch, smooth=TRUE, reg.line=FALSE)
Problem 8.3.1 Draw the scatterplot of Lab versus Field and comment on the applicability of the simple linear regression model.
The grouping does not appear to be driven by anything unusual in particular batches.
Problem 8.3.2 Fit the simple regression model, and give the residual plot. Compute the score test for nonconstant variance, and summarize your results.
> plot(predict(fit831), residuals(fit831, type="pearson"))
> abline(h=0)
The "right-opening megaphone" pattern is clearly visible here, which indicates nonconstant variance as a function of the fitted values (horizontal axis).
Problem 8.3.2 Fit the simple regression model, and give the residual plot. Compute the score test for nonconstant variance, and summarize your results.
> library(car)
> ncv.test(fit831)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 29.58568    Df = 1    p = 5.349868e-08
The p-value for the score test is extremely small, indicating nonconstant variance.
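The score test that ncv.test reports can be sketched by hand: regress the scaled squared residuals on the fitted values and take half the regression sum of squares, which is approximately chi-squared with 1 df. A minimal sketch on simulated heteroscedastic data (not the pipeline data; all names and values here are made up for illustration):

```r
# Hand-rolled version of the nonconstant-variance score test, on simulated
# data whose spread grows with the predictor (illustrative only).
set.seed(1)
x <- runif(100, 1, 50)
y <- 2 + 4 * x + rnorm(100, sd = 0.5 * x)       # variance increases with x
fit <- lm(y ~ x)
u <- residuals(fit)^2 / mean(residuals(fit)^2)  # scaled squared residuals
aux <- lm(u ~ fitted(fit))                      # auxiliary regression
score <- sum((fitted(aux) - mean(u))^2) / 2     # SSreg/2, approx chi-square(1)
pval <- pchisq(score, df = 1, lower.tail = FALSE)
```

A small p-value, as in the pipeline fit above, rejects the hypothesis of constant variance.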
Problem 8.3.3 Fit the simple regression mean function again, but this time assume that Var(Lab|Field) = σ² × Field. Get the score test for the fit of this variance function. Also test for nonconstant variance as a function of batch.
Following the section on weighted least squares, we can use 1/Field as the weights, since the assumed variance is proportional to a predictor and each row is a single measurement rather than several collapsed measurements.
> fit833 <- lm(Lab ~ Field, weights=1/Field)
> ncv.test(fit833)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 9.031528    Df = 1    p = 0.002653626
The p-value for the score test is, again, small. Using 1/Field as the weights does not sufficiently account for the nonconstant variance.
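Why 1/Field is the right weight here: with Var(Lab|Field) = σ² × Field, dividing the mean function through by √Field gives errors with constant variance, so WLS with weights 1/Field is the same as OLS on the rescaled variables. A quick check on made-up data (the Field/Lab names just mimic the pipeline variables):

```r
# WLS with weights 1/Field equals OLS after dividing through by sqrt(Field).
set.seed(2)
Field <- runif(50, 5, 80)
Lab <- 1 + 1.2 * Field + rnorm(50, sd = sqrt(Field))  # Var proportional to Field
wls <- lm(Lab ~ Field, weights = 1 / Field)
# Equivalent OLS fit: Lab/sqrt(Field) on 1/sqrt(Field) and sqrt(Field), no intercept
ols <- lm(I(Lab / sqrt(Field)) ~ 0 + I(1 / sqrt(Field)) + I(sqrt(Field)))
same <- isTRUE(all.equal(unname(coef(wls)), unname(coef(ols))))
```

The two coefficient vectors agree up to numerical tolerance, which is the sense in which the weights "undo" the assumed variance function.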
Problem 8.3.3 Fit the simple regression mean function again, but this time assume that Var(Lab|Field) = σ² × Field. Get the score test for the fit of this variance function. Also test for nonconstant variance as a function of batch.
> ncv.test(fit833, ~ as.factor(Batch))
Non-constant Variance Score Test
Variance formula: ~ as.factor(Batch)
Chisquare = 6.954998    Df = 5    p = 0.2240088
The score test is not significant; variability appears to be the same across batches.
Problem 8.3.4 Repeat Problem 8.3.3, but with Var(Lab|Field) = σ² × Field².
> fit834 <- lm(Lab ~ Field, weights=1/Field^2)
> ncv.test(fit834)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.02698878    Df = 1    p = 0.8695087
The p-value is large, so there is no evidence against this variance function; it appears to be appropriate.
Problem 9.7 Refer to the lathe data in Problem 6.2.
Review: The data are the results of an experiment on characterizing the life of a drill bit in cutting steel on a lathe. Two factors were varied in the experiment, Speed and Feed rate. The response is Life, the total time until the drill bit fails, in minutes. The other values in the data have been coded by computing:
Speed = (Actual speed in feet per minute − 900)/300
Feed = (Actual feed rate in thousandths of an inch per revolution − 13)/6
The coded variables are centered at zero.
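The coding above can be written as two small helper functions (the names code_speed and code_feed are mine, not from the text):

```r
# Coding from the problem statement: center at 900 ft/min and 13 thousandths
# of an inch per revolution, with scales 300 and 6 respectively.
code_speed <- function(actual) (actual - 900) / 300
code_feed  <- function(actual) (actual - 13) / 6
code_speed(900)   # the center point maps to 0
code_speed(1200)  # one coded unit above center: 1
code_feed(7)      # one coded unit below center: -1
```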
Problem 9.7.1 Starting with the full second-order model, use the Box-Cox method to show that an appropriate scale for the response is the logarithmic scale.
The full second-order model from Problem 6.2 is:
> fit971 <- lm(Life ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed:Feed)
> library(MASS)
> boxcox(fit971, xlab=expression(lambda[y]))
Problem 9.7.1 Starting with the full second-order model, use the Box-Cox method to show that an appropriate scale for the response is the logarithmic scale.
Judging from the confidence interval on the plot, λ = 0 is a plausible value, indicating that a log transformation of the response is suitable.
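What boxcox() profiles can be sketched directly: for each λ, transform the response, fit the model, and record the profile log-likelihood (including the Jacobian term). A toy version with a log-normal response, so the true λ is 0 (simulated data, not the lathe data):

```r
# Manual Box-Cox profile log-likelihood on simulated data where log(y)
# is linear in x, so the maximizing lambda should be near 0.
set.seed(3)
x <- runif(60, -1, 1)
y <- exp(1 + 2 * x + rnorm(60, sd = 0.3))
profll <- function(lam) {
  yt <- if (abs(lam) < 1e-8) log(y) else (y^lam - 1) / lam
  rss <- sum(residuals(lm(yt ~ x))^2)
  # Profile log-likelihood up to a constant, with the Jacobian term:
  -length(y) / 2 * log(rss / length(y)) + (lam - 1) * sum(log(y))
}
lams <- seq(-1, 1, by = 0.1)
best <- lams[which.max(sapply(lams, profll))]
```

In this simulation, best should land near 0, mirroring the conclusion the confidence interval gives for the lathe data.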
Problem 9.7.2 Find the two cases that are most influential in the fit of the quadratic mean function, and explain why they are influential. Delete these points from the data, refit the quadratic mean function, and compare to the fit with all the data.
> fit972 <- lm(logb(Life,2) ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed:Feed)
> library(alr3)
> inf.index(fit972)
Problem 9.7.2 Find the two cases that are most influential in the fit of the quadratic mean function, and explain why they are influential. Delete these points from the data, refit the quadratic mean function, and compare to the fit with all the data.
Points 9 and 10 have a large Cook's distance, which measures the influence of a case on the fit (roughly, the effect of deleting that observation); they are the two most influential points. Points 8 through 10 also have high leverage, so it is worth looking at their predictor values.
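Cook's distance combines the residual and the leverage, D_i = e_i²/(p·σ̂²) × h_i/(1−h_i)², so a case is influential when it has a large residual, high leverage, or both. A check of this formula against R's cooks.distance() on simulated data (not lathe1):

```r
# Cook's distance computed by hand from residuals and hat values, then
# compared to the built-in cooks.distance().
set.seed(4)
x <- rnorm(20)
y <- 1 + x + rnorm(20)
fit <- lm(y ~ x)
h  <- hatvalues(fit)                  # leverages
e  <- residuals(fit)
p  <- length(coef(fit))               # number of coefficients
s2 <- sum(e^2) / df.residual(fit)     # sigma-hat squared
D  <- e^2 / (p * s2) * h / (1 - h)^2  # Cook's distance by hand
same <- isTRUE(all.equal(unname(D), unname(cooks.distance(fit))))
```

This is why cases 9 and 10, with extreme Life values at unreplicated design points, dominate the influence measures here.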
Problem 9.7.2 Find the two cases that are most influential in the fit of the quadratic mean function, and explain why they are influential. Delete these points from the data, refit the quadratic mean function, and compare to the fit with all the data.
     Feed   Speed  Life
1  -1.000  -1.000  54.5
2  -1.000  -1.000  66.0
3   1.000  -1.000  11.8
4   1.000  -1.000  14.0
5  -1.000   1.000   5.2
6  -1.000   1.000   3.0
7   1.000   1.000   0.8
8   1.000   1.000   0.5
9   0.000  -1.414  86.5
10  0.000   1.414   0.4
11 -1.414   0.000  20.1
12  1.414   0.000   2.9
13  0.000   0.000   3.8
14  0.000   0.000   2.2
15  0.000   0.000   3.2
16  0.000   0.000   4.0
17  0.000   0.000   2.8
18  0.000   0.000   3.2
19  0.000   0.000   4.0
20  0.000   0.000   3.5
Points with higher leverage have predictor values that were not replicated.
Problem 9.7.2 Find the two cases that are most influential in the fit of the quadratic mean function, and explain why they are influential. Delete these points from the data, refit the quadratic mean function, and compare to the fit with all the data.
> noinf <- lathe1[c(1:8, 11:20),]
> fit972 <- lm(logb(Life,2) ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed:Feed, data=lathe1)
> fit972noinf <- lm(logb(Life,2) ~ Speed + Feed + I(Speed^2) + I(Feed^2) + Speed:Feed, data=noinf)
Problem 9.7.2 Find the two cases that are most influential in the fit of the quadratic mean function, and explain why they are influential. Delete these points from the data, refit the quadratic mean function, and compare to the fit with all the data.
> summary(fit972)

Residuals:
     Min       1Q   Median       3Q      Max
-0.62539 -0.21028 -0.03598  0.24162  0.69237

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.7141     0.1516  11.307 2.00e-08 ***
Speed        -2.2925     0.1238 -18.520 3.04e-11 ***
Feed         -1.1401     0.1238  -9.210 2.56e-07 ***
I(Speed^2)    0.4156     0.1452   2.863 0.012529 *
I(Feed^2)     0.6038     0.1452   4.159 0.000964 ***
Speed:Feed   -0.1051     0.1516  -0.693 0.499426
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4288 on 14 degrees of freedom
Multiple R-squared: 0.9702, Adjusted R-squared: 0.9596
F-statistic: 91.24 on 5 and 14 DF, p-value: 3.551e-10

> summary(fit972noinf)

Residuals:
      Min        1Q    Median        3Q       Max
-0.576550 -0.211505  0.005578  0.215201  0.472956

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.71405    0.11889  14.417 6.11e-09 ***
Speed        -2.06739    0.11889 -17.388 7.10e-10 ***
Feed         -1.14006    0.09708 -11.743 6.15e-08 ***
I(Speed^2)    0.40427    0.17836   2.267 0.042700 *
I(Feed^2)     0.60945    0.13297   4.583 0.000629 ***
Speed:Feed   -0.10511    0.11889  -0.884 0.394025
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3363 on 12 degrees of freedom
Multiple R-squared: 0.9759, Adjusted R-squared: 0.9658
F-statistic: 97.07 on 5 and 12 DF, p-value: 2.804e-09

The coefficient estimates are largely unchanged between the two models; only the Speed coefficient moves noticeably (from -2.29 to -2.07). A standard anova() comparison is not possible because the models are fit to different numbers of cases. Added-variable plots likewise do not indicate any large changes.
[Added-variable plots: with the influential points (shown in blue)]
[Added-variable plots: without the influential points]