Explore the relationship between college and high school grade point averages using a simple linear regression model. Understand the assumptions and interpretation of the regression line.
Chapter 13 Simple Linear Regression and Correlation: Inferential Methods
Suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is the first-year college grade point average determined solely by the high school grade point average? A relationship in which the value of y is completely determined by the value of an independent variable x is called a deterministic relationship. First-year college grade point average and high school grade point average do NOT have a deterministic relationship. A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model. The equation for an additive probabilistic model is: $y = f(x) + e$, that is, y = deterministic function of x + random deviation, where e is an “error” variable.
The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, $y = \alpha + \beta x + e$. Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line. [Figure: population regression line (y-intercept α, slope β) with observations at x1 and x2 that deviate from the line by e1 and e2.]
Basic Assumptions of the Simple Linear Regression Model
• The distribution of e at any particular x value has mean value 0; that is, μ_e = 0.
• The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
• The distribution of e at any particular value of x is normal.
• The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
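The model and its assumptions can be illustrated by simulation. The sketch below is only an illustration: the parameter values ALPHA, BETA and SIGMA are hypothetical and do not come from the slides. At each fixed x it draws many y values and checks that their mean sits near the population regression line and their standard deviation near the common value.

```python
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0

def simulate_y(x, rng):
    """Return one observation y = ALPHA + BETA*x + e, with e ~ Normal(0, SIGMA)."""
    e = rng.gauss(0.0, SIGMA)  # mean 0 and the same SIGMA at every x
    return ALPHA + BETA * x + e

rng = random.Random(1)  # fixed seed so the run is repeatable
summary = {}
for x in (1.0, 5.0, 10.0):
    ys = [simulate_y(x, rng) for _ in range(20000)]
    summary[x] = (statistics.mean(ys), statistics.stdev(ys))
    # At each x, the mean of y should be close to the height of the line
    # and the standard deviation of y close to SIGMA.
    print(x, ALPHA + BETA * x, round(summary[x][0], 2), round(summary[x][1], 2))
```

Because each e is drawn separately, the simulated deviations are also independent of one another, which matches the fourth assumption.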
Let’s look at the heights and weights of a population of adult women. [Figure: weight plotted against height.] How much would an adult female weigh if she were 5 feet tall? Weights of women who are 5 feet tall will vary; in other words, there is a distribution of weights for adult females who are 5 feet tall. Are some of these weights more likely than others? What would this distribution look like? This distribution is normal. What would you expect for other heights? We want the standard deviations of all these normal distributions to be the same. Where would you expect the population regression line to be?
Basic Assumptions of the Simple Linear Regression Model Revisited
• The distribution of e at any particular x value has mean value 0; that is, μ_e = 0.
• The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
• The distribution of e at any particular value of x is normal.
• The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
Remember that the variable e is a measure of the extent to which individual y values deviate from the population regression line. The distribution of y at any particular value of x is therefore normal, and for any particular x value, the standard deviation of y equals the standard deviation of e.
We use the least-squares line $\hat{y} = a + bx$ to estimate the true population regression line.
b = point estimate of β = $\frac{S_{xy}}{S_{xx}}$
a = point estimate of α = $\bar{y} - b\bar{x}$
where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$.
Let x* denote a specific value of the predictor variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.
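The point estimates can be computed directly from the sums S_xy = Σ(x - x̄)(y - ȳ) and S_xx = Σ(x - x̄)². A minimal sketch, assuming a small made-up data set (the five points below are hypothetical and are not data from the slides):

```python
def least_squares(xs, ys):
    """Return (a, b): the intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # point estimate of the population slope
    a = y_bar - b * x_bar  # point estimate of the population intercept
    return a, b

# Hypothetical illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))
```

For these points, S_xy = 6 and S_xx = 10, so b = 0.6 and a = 4 - 0.6(3) = 2.2.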
Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The data are on x = maternal age (in years) and y = birth weight of baby (in grams). Sketch a scatterplot of these data. [Figure: scatterplot of baby’s weight (g) against mother’s age (yrs).] The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.
Birth Weight Continued . . . The data are on x = maternal age (in years) and y = birth weight of baby (in grams). Using summary statistics computed from the sample data, the estimated regression line is: $\hat{y} = -1163.45 + 245.15x$
Birth Weight Continued . . . The data are on x = maternal age (in years) and y = birth weight of baby (in grams). The mean weight of babies increases by approximately 245.15 grams for each 1-year increase in the mother’s age. What is the point estimate for the mean weight of babies born to 18-year-old mothers? $\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams. This is the point estimate for the mean weight of all babies born to 18-year-old mothers. It is also the prediction of the weight of a single baby born to a mother 18 years of age.
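The arithmetic for x* = 18 can be checked with a few lines of code. This sketch uses only the intercept and slope of the estimated regression line reported in the slides; the function name predict is just an illustrative choice.

```python
# Intercept and slope of the estimated regression line from the slides.
a = -1163.45
b = 245.15

def predict(x_star):
    """Point estimate of the mean y (and point prediction of one y) at x = x_star."""
    return a + b * x_star

y_hat_18 = predict(18)
print(round(y_hat_18, 2))  # 3249.25
```

The same number serves as both the estimate of the mean birth weight and the prediction for a single baby; the two uses differ only in how much uncertainty is attached to them.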
The statistic for estimating the variance σ² is $s_e^2 = \frac{SSResid}{n - 2}$ where $SSResid = \sum (y - \hat{y})^2$. The estimate for the standard deviation σ is $s_e = \sqrt{s_e^2}$. The subscript e reminds us that we are estimating the variance of the “errors” or residuals. Why n - 2? The degrees of freedom associated with estimating σ² or σ in simple linear regression is df = n - 2. Since we must estimate both α and β for the regression line, we reduce the sample size n by 2. Recall that the coefficient of determination, r², is the proportion of observed y variation that is attributed to the model relationship: $r^2 = 1 - \frac{SSResid}{SSTo}$, where $SSTo = \sum (y - \bar{y})^2$.
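The steps for s_e and r² can be sketched as follows: fit the line, sum the squared residuals to get SSResid, sum the squared deviations from ȳ to get SSTo, then divide SSResid by n - 2. The data are hypothetical (the same kind of small made-up set one might use for practice), not the birth weight data.

```python
import math

def fit_summary(xs, ys):
    """Return (ss_resid, ss_to, s_e, r_squared) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_to = sum((y - y_bar) ** 2 for y in ys)
    s_e = math.sqrt(ss_resid / (n - 2))  # df = n - 2 because a and b are both estimated
    r_squared = 1 - ss_resid / ss_to
    return ss_resid, ss_to, s_e, r_squared

# Hypothetical illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

ss_resid, ss_to, s_e, r_squared = fit_summary(xs, ys)
print(round(ss_resid, 3), round(ss_to, 3), round(s_e, 3), round(r_squared, 3))
```

Here SSResid = 2.4 and SSTo = 6, so s_e = √(2.4/3) ≈ 0.894 and r² = 0.6.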
Birth Weight Revisited . . . The data are on x = maternal age (in years) and y = birth weight of baby (in grams). Find SSResid and SSTo. Use these to compute s_e and r². For a particular mother’s age, the weights of babies typically deviate from the regression line by approximately 231 grams. Approximately 76% of the observed variability in the weight of babies can be explained by this model.
Properties of the Sampling Distribution of b
Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β. When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
• The mean value of b is β. That is, μ_b = β.
• The standard deviation of the statistic b is $\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$
• The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since σ is usually unknown, the estimated standard deviation of the statistic b is $s_b = \frac{s_e}{\sqrt{S_{xx}}}$
Confidence Interval for β
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form $b \pm (t \text{ critical value}) \cdot s_b$ where the t critical value is based on df = n - 2.
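The interval b ± (t critical value) · s_b, with s_b = s_e / √S_xx, can be sketched in code. The summary values below are hypothetical, and the t critical value is typed in from a t table rather than computed, since the Python standard library has no t distribution.

```python
import math

def slope_confidence_interval(b, s_e, s_xx, t_crit):
    """Return (lower, upper) for b ± t_crit * s_b, where s_b = s_e / sqrt(S_xx)."""
    s_b = s_e / math.sqrt(s_xx)  # estimated standard deviation of b
    half_width = t_crit * s_b
    return b - half_width, b + half_width

# Hypothetical summary values for a sample of n = 5 points, so df = 3.
b, s_e, s_xx = 0.6, math.sqrt(0.8), 10.0
T_CRIT_95_DF3 = 3.182  # 95% confidence, df = 3, read from a t table

lower, upper = slope_confidence_interval(b, s_e, s_xx, T_CRIT_95_DF3)
print(round(lower, 3), round(upper, 3))
```

With so few points the interval is wide and includes 0, so these made-up data would not rule out a slope of zero.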
Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete’s performance in a 20-km ski race? Data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise, 1995). Sketch a scatterplot for the data. [Figure: scatterplot of ski time (min) against treadmill time (min).] The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.
Biathletes Continued . . . x = treadmill exhaustion time, y = ski time. Find a 95% confidence interval for the slope of the true regression line. We are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.
Biathletes Continued . . . [Partial Minitab output, annotated to identify: the equation of the estimated regression line; the estimated y-intercept a; the estimated slope b; s_b, the estimated standard deviation of b; s_e; 100 × r²; SSResid; SSTo; and the degrees of freedom n - 2. Note that r² (adjusted) is not used in simple linear regression.]
Summary of Hypothesis Tests Concerning β
Null hypothesis: H0: β = hypothesized value
Test statistic: $t = \frac{b - \text{hypothesized value}}{s_b}$
The test is based on df = n - 2.
Alternative hypothesis and P-value:
• Ha: β > hypothesized value: area to the right of t under the appropriate t curve
• Ha: β < hypothesized value: area to the left of t under the appropriate t curve
• Ha: β ≠ hypothesized value: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
Often the hypothesized value is zero; this is called the model utility test for simple linear regression.
Summary of Hypothesis Tests Concerning β Continued . . .
Assumptions: For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:
• The distribution of e at any particular x value has a mean of 0 (μ_e = 0).
• The standard deviation of e is σ, which does not depend on x.
• The distribution of e at any particular x value is normal.
• The random deviations e1, e2, …, en associated with different observations are independent of one another.
[Figure: weight plotted against height.] What is the slope of a horizontal line? Suppose the least-squares line is horizontal; would height be useful in predicting weight? A slope of zero means that there is NO linear relationship between x and y!
The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of H0: β = 0 versus Ha: β ≠ 0. The null hypothesis specifies that there is no useful linear relationship between x and y. Test statistic: $t = \frac{b}{s_b}$
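The test statistic is simple to compute once b and s_b are known. The sketch below uses hypothetical values of b and s_b and compares |t| with a two-sided critical value read from a t table; for a two-sided test this gives the same decision as comparing the P-value with the significance level, which is the approach the slides use.

```python
def slope_t_statistic(b, s_b, hypothesized=0.0):
    """Return t = (b - hypothesized value) / s_b; hypothesized = 0 gives the model utility test."""
    return (b - hypothesized) / s_b

# Hypothetical values: slope and its estimated standard deviation from n = 5 points.
t = slope_t_statistic(b=0.6, s_b=0.2828)

T_CRIT_05_DF3 = 3.182  # two-sided critical value for significance level .05, df = 3, from a t table
reject_h0 = abs(t) > T_CRIT_05_DF3
print(round(t, 3), reject_h0)
```

Here |t| is smaller than the critical value, so H0: β = 0 would not be rejected for these made-up numbers.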
Biathletes Revisited . . . x = treadmill exhaustion time, y = ski time. Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let’s perform the model utility test. H0: β = 0, Ha: β ≠ 0, where β is the slope of the population regression line relating ski time to treadmill time. Significance level α = .05, df = 9, P-value = .003. Since the P-value < α, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.
Biathletes Revisited . . . [Partial Minitab output, annotated to identify the t test statistic (the estimated slope ÷ its estimated standard deviation = t) and the P-value.] Statistical software usually performs the model utility test with H0: β = 0 versus Ha: β ≠ 0.
Checking Model Adequacy
The simple linear regression model is y = α + βx + e, where e represents the random deviation of an observed y value from the population regression line α + βx. The assumptions for simple linear regression are based on this random deviation e. If we knew the deviations e1, e2, …, en, we could examine them for any inconsistencies with the model assumptions. However, we do not know these deviations because the population regression line is unknown. Therefore, we must estimate them using the residuals from the estimated line, and we use the residuals to check our assumptions.
Residual Analysis
• Standardize the residuals to look at their magnitudes. Most statistical software will perform this calculation; it is tedious to do by hand.
• Create a residual plot (from Chapter 5) or a standardized residual plot (a plot of the (x, standardized residual) pairs).
A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than in another) and that has no point that is far removed from all the others. Any observation with a large positive or negative residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.
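A sketch of the standardizing calculation follows. It assumes the usual definition for simple linear regression, standardized residual = residual ÷ (estimated standard deviation of that residual), with the estimated standard deviation taken as s_e·√(1 - 1/n - (x - x̄)²/S_xx). The slides do not state this formula, so treat it as an assumption; the data are hypothetical.

```python
import math

def standardized_residuals(xs, ys):
    """Return the standardized residuals from the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    result = []
    for x, r in zip(xs, residuals):
        # Estimated standard deviation of this residual (assumed formula, see lead-in).
        sd_r = s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx)
        result.append(r / sd_r)
    return result

# Hypothetical illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

std_res = standardized_residuals(xs, ys)
print([round(v, 3) for v in std_res])
```

Plotting the (x, standardized residual) pairs from this list gives the standardized residual plot described above.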
A Look at Standardized Residual Plots
• This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
• Both of these plots contain points far away from the others. These points can have substantial effects on the estimates of α and β as well as on other quantities.
• This plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.
• In this plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!
Biathletes Revisited . . . [Table: treadmill time (min), ski time (min), residuals r, and standardized residuals sr (from Minitab).] Let’s look at a normal probability plot of the standardized residuals. [Figure: normal probability plot of standardized residual against normal score.] The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt that the random deviations e are normally distributed.
Biathletes Continued . . . r = residuals, sr = standardized residuals (from Minitab). Sketch a residual plot. Sketch a standardized residual plot. [Figures: residuals against treadmill time; standardized residuals against treadmill time.] Notice that these two plots have similar appearances. The standardized residual plot does not show evidence of any pattern or of increasing spread. Remember that residuals can also be plotted against the predicted values ŷ.
Optional Topics: Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient
Properties of the Sampling Distribution of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
• The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
• The standard deviation of the statistic a + bx*, denoted by σ_{a+bx*}, is given by $\sigma_{a+bx^*} = \sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$
• The distribution of a + bx* is normal.
The farther x* is from the center x̄, the larger σ_{a+bx*} is. Since σ is unknown, σ_{a+bx*} can be estimated by s_{a+bx*}, which substitutes s_e in place of σ.
Confidence Interval for a Mean y Value
When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the mean y value when x has value x*, is $a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$ where the t critical value is based on df = n - 2. Because s_{a+bx*} is larger the farther x* is from x̄, the confidence interval becomes wider as x* moves away from the center of the data.
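The widening of the interval away from x̄ can be seen numerically. This sketch takes the estimated standard deviation of a + bx* to be s_e·√(1/n + (x* - x̄)²/S_xx); all input values are hypothetical and the t critical value is read from a t table.

```python
import math

def mean_response_interval(a, b, x_star, s_e, n, x_bar, s_xx, t_crit):
    """Return (lower, upper) of the confidence interval for the mean y value at x = x_star."""
    point = a + b * x_star
    s_mean = s_e * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / s_xx)
    half_width = t_crit * s_mean
    return point - half_width, point + half_width

# Hypothetical fit from n = 5 points; t critical value for 95% confidence, df = 3.
fit = dict(a=2.2, b=0.6, s_e=math.sqrt(0.8), n=5, x_bar=3.0, s_xx=10.0, t_crit=3.182)

at_center = mean_response_interval(x_star=3.0, **fit)  # x* equal to x_bar
at_edge = mean_response_interval(x_star=5.0, **fit)    # x* far from x_bar
print([round(v, 3) for v in at_center], [round(v, 3) for v in at_edge])
```

The interval at x* = 5 comes out wider than the one at x* = x̄ = 3, as the slide describes.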
Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.) Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured. A scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.
Jaws Continued . . . The simple linear regression model explains 76.6% of the variability in jaw width, and the model utility test confirms the usefulness of this model. Let’s use the data to compute a 90% confidence interval for the mean jaw width of 15-foot-long sharks. The point estimate is a + b(15) = 15.140 inches. The estimated standard deviation of a + b(15) is $s_{a+b(15)} = s_e\sqrt{\frac{1}{n} + \frac{(15 - \bar{x})^2}{S_{xx}}}$ with n = 44.
Jaws Continued . . . The 90% confidence interval is (14.782, 15.498). Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet is between 14.782 and 15.498 inches.
Prediction Interval for a Single y Value
When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form $a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$ where the t critical value is based on df = n - 2. The prediction interval and the confidence interval are centered at exactly the same place, a + bx*. The prediction interval is wider than the confidence interval due to the addition of s_e² under the square-root symbol.
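A short sketch makes the comparison concrete. It computes both half-widths from the same ingredients, using hypothetical values for the point estimate, s_e and s_{a+bx*}, and a t critical value read from a t table.

```python
import math

def confidence_interval(point, s_mean, t_crit):
    """Interval for the mean y value: point ± t_crit * s_mean."""
    half_width = t_crit * s_mean
    return point - half_width, point + half_width

def prediction_interval(point, s_e, s_mean, t_crit):
    """Interval for a single y value: point ± t_crit * sqrt(s_e^2 + s_mean^2)."""
    half_width = t_crit * math.sqrt(s_e ** 2 + s_mean ** 2)
    return point - half_width, point + half_width

# Hypothetical inputs: point estimate a + b x*, s_e, and s_mean (the estimated
# standard deviation of a + b x*); t critical value for 95% confidence, df = 3.
point, s_e, s_mean, t_crit = 4.0, math.sqrt(0.8), 0.4, 3.182

ci = confidence_interval(point, s_mean, t_crit)
pi = prediction_interval(point, s_e, s_mean, t_crit)
print([round(v, 3) for v in ci], [round(v, 3) for v in pi])
```

Both intervals share the same midpoint, and the prediction interval is the wider of the two because s_e² is added under the square root.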
Jaws Revisited . . . Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet. The 90% prediction interval is (12.801, 17.479). We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches. Notice that this interval is much wider than the confidence interval for the mean jaw width.
Below is a regression plot from Minitab showing the confidence interval and the prediction interval for the shark data. [Figure: regression plot with confidence and prediction bands.] Notice that the prediction interval is substantially wider than the confidence interval. Also notice that the confidence interval is very narrow close to x̄ but widens the farther x* is from the mean.
A Test for Independence in a Bivariate Normal Population
ρ (the Greek letter “rho”) is the population correlation coefficient. It assesses the extent of any linear relationship in the population, and it must be between -1 and 1. Many investigators are interested in whether ANY relationship exists between x and y; that is, are x and y independent of each other? However, ρ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population. A bivariate normal population is one where, for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.
Null hypothesis: H0: ρ = 0
Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
The test is based on df = n - 2.
Alternative hypothesis and P-value:
• Ha: ρ > 0 (positive dependence): area to the right of t
• Ha: ρ < 0 (negative dependence): area to the left of t
• Ha: ρ ≠ 0 (dependence): 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
A Test for Independence in a Bivariate Normal Population
Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population. One way to check whether a bivariate normal population is plausible is to construct separate normal probability plots of the x values and of the y values.
The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = 0.11. Does this support the claim that short sleep duration is associated with reduced leptin? Use a significance level of .01.
State the hypotheses. H0: ρ = 0, Ha: ρ > 0, where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans.
To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume that a bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.
Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.11}{\sqrt{\frac{1 - (0.11)^2}{714}}} \approx 2.96$
df = 714, P-value = .0015, significance level = .01
Sleepless Nights Continued . . . H0: ρ = 0, Ha: ρ > 0, where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans. Test statistic: t ≈ 2.96, df = 714, P-value = .0015, significance level = .01. Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one, since r = .11) between sleep duration and blood leptin level. Note: the test of no linear relationship (H0: β = 0) can also be used to test for independence in a bivariate normal population.
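The test statistic for the sleep example can be recomputed from the two numbers given in the slides, r = 0.11 and n = 716, using t = r / √((1 - r²)/(n - 2)). The function name is an illustrative choice; the P-value itself is not computed here because the Python standard library has no t distribution.

```python
import math

def correlation_t_statistic(r, n):
    """Return t = r / sqrt((1 - r^2) / (n - 2)) for testing H0: rho = 0."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Values reported in the slides for the sleep and leptin study.
t = correlation_t_statistic(r=0.11, n=716)
df = 716 - 2
print(round(t, 2), df)  # 2.96 714
```

Even a small correlation such as r = 0.11 produces a fairly large t here because the sample is large, which is why the association can be statistically significant and still weak.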