Chapter 15: Describing Relationships: Regression, Prediction, and Causation • If we have a strong linear correlation between two variables, then we can use a linear regression model to predict the value of a response variable, y, based on an explanatory variable, x. • A regression line is a straight line that describes how a response variable, y, changes as an explanatory variable, x, changes. We often use a regression line to predict the value of y for a given value of x. (p. 284)
To use simple linear regression: • Create a scatterplot to see if a linear relationship is reasonable. • Fit the least-squares line (the straight line with the smallest total squared deviation) using a computer program (DoStat). • Predict the value of y for a given value of x by substituting the value of x into the equation and solving for y.
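A minimal sketch of the first step in Python rather than DoStat; the heights and shoe sizes below are made-up illustrative values, not data from the text.

import matplotlib.pyplot as plt

# Hypothetical (made-up) data: heights in inches and shoe sizes for six males
height = [66, 68, 69, 71, 73, 75]
shoe = [8.5, 9.0, 9.5, 10.5, 11.0, 12.0]

# Step 1: scatterplot to judge whether a straight-line model is reasonable
plt.scatter(height, shoe)
plt.xlabel("Height (inches)")   # explanatory variable, x
plt.ylabel("Shoe size")         # response variable, y
plt.title("Shoe size vs. height")
plt.show()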
Least Squares Regression The least-squares regression line of y on x is the line that makes the sum of the squared vertical distances of the data points from the fitted line as small as possible. (See Figure 15.3 on p. 287) The equation of a line: y = a + b*x • a is the y-intercept: when x = 0, y = a • b is the slope, or average rate of change: the expected change in y when x increases by one unit. • Variables in the formula: x = explanatory variable, y = response variable
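A minimal sketch of fitting the least-squares line y = a + b*x in Python instead of DoStat; the data are the same made-up values used above, so the fitted numbers are illustrative only.

import numpy as np

height = np.array([66, 68, 69, 71, 73, 75], dtype=float)        # x, explanatory
shoe = np.array([8.5, 9.0, 9.5, 10.5, 11.0, 12.0])              # y, response

# np.polyfit with degree 1 returns coefficients highest power first: [slope b, intercept a]
b, a = np.polyfit(height, shoe, 1)
print(f"least-squares line: y = {a:.2f} + {b:.2f}*x")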
Example: Shoe Size and Height There is a strong positive association between height and shoe size for males. • Let’s develop a prediction equation to predict shoe size based on height. • Let y = shoe size (response variable) and x = height (explanatory variable). • Predict the shoe size of a male who is 70” tall. • Predict the shoe size of a male who is 6 ft. 2 in. tall. • For each additional inch of height, what is the expected change in shoe size?
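A sketch of answering the three prediction questions in Python; the intercept and slope below are illustrative placeholders, not the values actually fitted in class.

# Hypothetical fitted line (placeholder values, not the class results)
a, b = -25.0, 0.5                      # intercept and slope of y = a + b*x

def predict_shoe_size(height_in):
    # Substitute x (height in inches) into the equation and solve for y
    return a + b * height_in

print(predict_shoe_size(70))           # male who is 70 inches tall
print(predict_shoe_size(6 * 12 + 2))   # 6 ft 2 in = 74 inches
print(b)                               # expected change in shoe size per additional inch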
Additional Notes on Regression The sign of r and b will always be the same. • If the slope is positive, then there is a positive association between the two variables. • If the slope is negative, then there is a negative association between the two variables. • Outliers can affect the value of r and the regression equation.
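A quick illustration, using the same made-up numbers as above, of how a single outlier can pull down r (and with it the fitted line).

import numpy as np

x = np.array([66, 68, 69, 71, 73, 75], dtype=float)
y = np.array([8.5, 9.0, 9.5, 10.5, 11.0, 12.0])

r_before = np.corrcoef(x, y)[0, 1]

# Add one outlier: a very tall male with an unusually small shoe size
x_out = np.append(x, 78)
y_out = np.append(y, 7.0)
r_after = np.corrcoef(x_out, y_out)[0, 1]

print(r_before, r_after)   # r drops noticeably once the outlier is included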
Understanding prediction (p. 289) • Prediction is based on fitting some “model” to the data. • Prediction works best when the model fits the data closely. Predictions are better when the data have a tight linear relationship; compare Figure 15.1 on p. 285 with Figure 15.2 on p. 286. • Outliers affect the regression line; see Figure 15.4 on p. 291.
Extrapolation • Prediction outside the range of the data is called extrapolation. It is risky and not appropriate; such predictions can be grossly inaccurate. • For our height and shoe size example, the prediction formula was developed on adult males with heights between 65” and 75”. • Is it appropriate to use the formula to predict the shoe size of a child who is, say, 40” tall?
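Using the same placeholder line as in the sketch above (illustrative coefficients, not the class results), a prediction at x = 40 inches shows why extrapolation is risky: 40 is far below the 65–75 inch range the line was fit on.

a, b = -25.0, 0.5              # placeholder intercept and slope (illustrative only)

child_height = 40              # well outside the 65-75 inch range of the data
print(a + b * child_height)    # -5.0: a negative "shoe size," clearly meaningless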
Correlation and regression • The square of the correlation, r², is the proportion of variation in the values of y that is explained by the regression model with x. (p. 290) • Given on a plot with the regression line in DoStat. • 0 ≤ r² ≤ 1 always. The closer r² is to 1, the more confident we are in our prediction. If r = 0.7, then r² = 0.49, about half of the variation. • For the Height and Shoe Size example, r² = 0.7003. About 0.70, or 70%, of the variation in shoe size is explained by a linear relationship with height.
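A sketch of computing r² in Python from the same made-up data; note that the 0.7003 quoted above comes from the class data, not from these illustrative numbers.

import numpy as np

height = np.array([66, 68, 69, 71, 73, 75], dtype=float)
shoe = np.array([8.5, 9.0, 9.5, 10.5, 11.0, 12.0])

r = np.corrcoef(height, shoe)[0, 1]
r_squared = r ** 2
print(r_squared)   # proportion of variation in shoe size explained by height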
The question of causation • Example of causation: Increased drinking of alcohol causes a decrease in coordination. • Example of association: High SAT scores are associated with a high Freshman year GPA. • How do we determine causation? This is not a simple task since it is rarely the case that A “is the cause of” B; rather, A is a contributory cause of B.
Statistics and causation (pp. 292 and 296) 1. A strong relationship between two variables does not always mean that changes in one variable cause changes in the other. 2. The relationship between two variables is often influenced by other variables lurking in the background or by a common response. 3. The best evidence for causation comes from properly designed randomized comparative experiments.
More Causation 4. The observed relationship between two variables may be due to causation, common response, or confounding. All or some may be present together. 5. An observed relationship can be used for prediction without worrying about causation, as long as past patterns continue. • Figure 15.5 on p. 294: Causation, common response, and confounding. • Examples 6, 7, and 8 on pp. 293 and 296.
The case for the claim that variable A causes changes in variable B is strengthened if: • The association between A and B is strong. • The association is consistent: it recurs in different circumstances, which reduces the chance that it is due to confounding. • Higher doses are associated with stronger responses. • The alleged cause precedes the effect in time. • The alleged cause is plausible.
Does smoking cause lung cancer? It is unethical to investigate this with a randomized comparative experiment. • Observational studies show a strong association between smoking and lung cancer. • The evidence from several studies shows a consistent association between smoking and lung cancer. • The more cigarettes smoked, and the longer a person smokes, the more often lung cancer occurs. • Smokers with lung cancer usually began smoking before they developed lung cancer. • It is plausible that smoking causes lung cancer. • Together this serves as evidence that smoking causes lung cancer, though it is not as strong as evidence from an experiment.