Chapter 5. Relationships: Regression
Objectives (BPS chapter 5)
• Regression
• Regression lines
• The least-squares regression line
• Facts about least-squares regression
• Residuals
• Influential observations
• Cautions about correlation and regression
• Association does not imply causation
Correlation tells us about strength (scatter) and direction of the linear relationship between two quantitative variables. We wish to find the straight line that best fits our data. But which line best describes our data? We would like to have a numerical description of how the variables vary together. We would also like to make predictions based on the observed association between those two variables.
A regression line
A regression line is a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes. We often use a regression line to predict the value of y for a given value of x.
Linear Regression
We wish to quantify the linear relationship between an explanatory variable and a response variable. We can then predict the average response for all subjects with a given value of the explanatory variable.
Regression equation: y = a + bx
• x is the value of the explanatory variable
• y is the average value of the response variable
• note that a and b are just the y-intercept and slope of a straight line
Thought Question 1
How would you draw a line through the points? How do you determine which line ‘fits best’?
The Linear Model
Remember from Algebra that a straight line can be written as y = mx + b. In Statistics we use a slightly different notation: ŷ = a + bx. We write ŷ ("y-hat") to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.
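To make the notation concrete, here is a minimal sketch of using a linear model to produce predicted values. The intercept and slope below are made-up values for illustration; Python is used for all code sketches here.

```python
# A minimal sketch of the linear model y-hat = a + b*x.
# The intercept a and slope b are made-up values for illustration.
a, b = 2.0, 0.5

def predict(x):
    """Return the predicted value y-hat for a given x."""
    return a + b * x

print(predict(10))  # y-hat = 2.0 + 0.5 * 10 = 7.0
```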
Example: Fat Versus Protein
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu. We wish to fit a straight line through the data.
Residuals
The model won’t be perfect, regardless of the line we draw. Some points will be above the line and some will be below. The estimate made from a model is the predicted value (denoted as ŷ).
Residuals (cont.)
The difference between the observed value and its associated predicted value is called the residual. To find the residuals, we always subtract the predicted value from the observed one: residual = observed − predicted = y − ŷ.
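As a quick illustration, the sketch below computes residuals as observed minus predicted. The data and the fitted line are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data and a fitted line (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
a, b = 1.0, 1.0                # intercept and slope of some fitted line

y_hat = a + b * x              # predicted values
residuals = y - y_hat          # observed minus predicted
print(residuals)               # positive: underestimate; negative: overestimate
```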
Residuals (cont.)
A negative residual means the predicted value is too big (an overestimate). A positive residual means the predicted value is too small (an underestimate).
“Best Fit” Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out. So, we can’t assess how well the line fits by adding up all the residuals. Similar to what we did with the standard deviation, we square the residuals and add the squares. The smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest.
Least Squares
Used to determine the “best” line. We want the line to be as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict). Least Squares: use the line that minimizes the sum of the squares of the vertical distances of the data points from the line.
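The following sketch (hypothetical data, NumPy assumed) illustrates the least-squares idea: the fitted line has the smallest possible sum of squared vertical distances, and nudging either coefficient makes that sum larger.

```python
import numpy as np

# Hypothetical data (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def sse(a, b):
    """Sum of squared vertical distances of the points from y = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# np.polyfit with deg=1 returns the least-squares slope, then the intercept.
b_ls, a_ls = np.polyfit(x, y, deg=1)

print(sse(a_ls, b_ls))         # the smallest achievable sum
print(sse(a_ls + 0.5, b_ls))   # any other intercept does worse
print(sse(a_ls, b_ls + 0.2))   # ... and so does any other slope
```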
The Linear Model (cont.)
We write b and a for the slope and intercept of the line. The b and a are called the coefficients of the linear model. The coefficient b is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient a is the intercept, which tells us where the line hits (intercepts) the y-axis.
How to:
First we calculate the slope of the line, b: b = r × (sy / sx). We already know how to calculate r, sx, and sy:
• r is the correlation (the slope has the same sign as r)
• sy is the standard deviation of the response variable y
• sx is the standard deviation of the explanatory variable x
Once we know b, the slope, we can calculate a, the y-intercept: a = ȳ − b·x̄, where x̄ and ȳ are the sample means of the x and y variables. This means that we don’t have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on these equations. Some calculators can calculate r, a, and b.
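Here is a sketch of these formulas on hypothetical data (NumPy assumed). It also checks a fact noted later in this chapter: the least-squares line always passes through (x̄, ȳ).

```python
import numpy as np

# Hypothetical data (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

r = np.corrcoef(x, y)[0, 1]                      # correlation
b = r * np.std(y, ddof=1) / np.std(x, ddof=1)    # slope: b = r * sy / sx
a = np.mean(y) - b * np.mean(x)                  # intercept: a = y-bar - b * x-bar

# The least-squares line always passes through (x-bar, y-bar).
assert np.isclose(a + b * np.mean(x), np.mean(y))
print(a, b)
```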
Example
Fill in the missing information in the table below:
Facts about least-squares regression
• The distinction between explanatory and response variables is essential in regression.
• The correlation coefficient (r) and the slope (b) of the least-squares line have the same sign. The direction of the association determines the sign of the slope of the regression line.
• The least-squares regression line always passes through the point (x̄, ȳ).
• The correlation r describes the strength of a straight-line relationship. The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. The square of the correlation is called the coefficient of determination.
Data from the Hubble telescope about galaxies moving away from Earth: These two lines are the two regression lines calculated either correctly (x = distance, y = velocity, solid line) or incorrectly (x = velocity, y = distance, dotted line). The distinction between explanatory and response variables is crucial in regression. If you exchange y for x in calculating the regression line, you will get the wrong line. Regression examines the distance of all points from the line in the y direction only.
Interpretation of the Slope and Intercept
• The slope b indicates the amount by which ŷ changes when x changes by one unit.
• The intercept a is the value of ŷ when x = 0. It is not always meaningful.
Example
The regression line for the Burger King data is ŷ = 6.8 + 0.97x, where x is protein (g) and ŷ is predicted fat (g). Interpret the slope and the intercept.
• Slope: For every one-gram increase in protein, the predicted fat content increases by 0.97 g.
• Intercept: A BK meal that has 0 g of protein is predicted to contain 6.8 g of fat.
Predictions
In predicting a value of y based on some given value of x:
1. If there is no linear correlation, the best predicted y-value is ȳ.
2. If there is a linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.
Example: Fat Versus Protein
The regression line for the Burger King data fits the data well. The equation is ŷ = 6.8 + 0.97x. The predicted fat content for a BK Broiler chicken sandwich that contains 30 g of protein is 6.8 + 0.97(30) = 35.9 grams of fat.
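The same prediction written as plain code, using the equation given above; the function name is just for illustration.

```python
# Burger King regression line from the text: fat-hat = 6.8 + 0.97 * protein.
def predicted_fat(protein_g):
    """Predicted fat (g) for a menu item with the given protein (g)."""
    return 6.8 + 0.97 * protein_g

print(predicted_fat(30))  # 35.9 grams of fat
```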
Prediction via Regression Line: Husband and Wife Ages (Hand et al., A Handbook of Small Data Sets, London: Chapman and Hall)
• The regression equation is ŷ = 3.6 + 0.97x, where ŷ is the average age of all husbands who have wives of age x.
• Suppose we know that an individual wife’s age is 30. What would we predict her husband’s age to be?
• For all women aged 30, we predict the average husband age to be 3.6 + (0.97)(30) = 32.7 years.
Caution! Beware of Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be misleading, as the following example shows.
Caution! Beware of Extrapolation
Sarah’s height was plotted against her age. Can you predict her height at age 42 months? Can you predict her height at age 30 years (360 months)?
A Caution: Beware of Extrapolation
Regression line: ŷ = 71.95 + 0.383x (height in cm, age x in months).
• Height at age 42 months? ŷ = 71.95 + 0.383(42) ≈ 88 cm.
• Height at age 30 years? ŷ = 71.95 + 0.383(360) ≈ 209.8 cm. She is predicted to be 6' 10.5" at age 30.
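The arithmetic behind both predictions as a short sketch: the same equation that works inside the data range fails badly when extrapolated far beyond it.

```python
# Sarah's regression line from the text: height-hat = 71.95 + 0.383 * age,
# with age in months and height in cm, fit on early-childhood data only.
def predicted_height(age_months):
    return 71.95 + 0.383 * age_months

print(predicted_height(42))   # ~88.0 cm: inside the data range, sensible
print(predicted_height(360))  # ~209.8 cm: extrapolation, absurd for an adult
```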
Residuals Revisited
Residuals help us to see whether the model makes sense. When a regression model is appropriate, nothing interesting should be left behind. After we fit a regression model, we usually plot the residuals in the hope of finding no apparent pattern.
Residual Plot Analysis
A residual plot is a scatterplot of the regression residuals against the explanatory variable. If a residual plot does not reveal any pattern, the regression equation is a good representation of the association between the two variables. If a residual plot reveals some systematic pattern, the regression equation is not a good representation of the association between the two variables.
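A minimal sketch of a residual plot (hypothetical data; NumPy and matplotlib assumed): fit the line, compute the residuals, and plot them against x to look for patterns.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical, roughly linear data (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9, 9.2])

b, a = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
residuals = y - (a + b * x)      # observed minus predicted

plt.scatter(x, residuals)
plt.axhline(0, color="gray")     # reference line at residual = 0
plt.xlabel("x (explanatory variable)")
plt.ylabel("residual")
plt.show()                       # no visible pattern -> linear model is reasonable
```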
Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately boring: no pattern is apparent.
• Residuals are randomly scattered: good!
• A curved pattern means the relationship you are looking at is not linear.
• A change in variability across the plot is a warning sign. You need to find out what it is, and remember that predictions made in areas of larger variability will not be as good.
Coefficient of Determination (R²)
Measures the usefulness of the regression prediction. R² (or r², the square of the correlation) measures the percentage of the variation in the values of the response variable (y) that is explained by the regression line.
• r = 1: R² = 1: the regression line explains all (100%) of the variation in y.
• r = 0.7: R² = 0.49: the regression line explains almost half (49%) of the variation in y.
R² (cont.)
Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data. Statistics is about variation, and R² measures the success of the regression model in terms of the fraction of the variation of y accounted for by the regression.
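A sketch (hypothetical data, NumPy assumed) computing R² two equivalent ways for a least-squares line: as the square of the correlation, and as the fraction of the variation in y accounted for by the regression.

```python
import numpy as np

# Hypothetical data (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

r = np.corrcoef(x, y)[0, 1]
r2_from_corr = r ** 2                         # square of the correlation
r2_from_sst = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

print(r2_from_corr, r2_from_sst)              # the two agree for least squares
```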
• r = 0: r² = 0: Changes in x explain 0% of the variation in y. The values y takes are entirely independent of what value x takes.
• r = 0.87: r² = 0.76: Here the change in x explains 76% of the change in y. The rest of the change in y (the vertical scatter) must be explained by something other than x.
• r = −1: r² = 1: Changes in x explain 100% of the variation in y; y can be entirely predicted for any given value of x.
Caution with regression
Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
• Quantitative Variables Condition
• Straight Enough Condition
• Outlier Condition
Caution with regression
Do not use regression on inappropriate data:
• Pattern in the residuals
• Presence of large outliers (use residual plots for help)
• Clumped data falsely appearing linear
Recognize when the correlation/regression is performed on averages. A relationship, however strong, does not imply causation. Beware of lurking variables. Avoid extrapolating (predicting outside the observed x data range).
Guidelines for Using the Regression Equation
1. If there is no linear correlation, don’t use the regression equation to make predictions.
2. When using the regression equation for predictions, stay within the scope of the available sample data.
3. A regression equation based on old data is not necessarily valid now.
4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.
Vocabulary
• Marginal Change: refers to the slope; the amount the response variable changes when the explanatory variable changes by one unit.
• Outlier: a point lying far away from the other data points.
• Influential Point: an outlier that has the potential to change the regression line. Points that are outliers in either the x or y direction of a scatterplot are often influential for the correlation. Points that are outliers in the x direction are often influential for the least-squares regression line.
(Figure: regression lines fit to all the data, refit without child 18, and refit without child 19, contrasting an outlier in the y direction with an influential observation.) Are these points influential?
Vocabulary: Lurking vs. Confounding LURKING VARIABLE A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables. CONFOUNDING Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.
Lurking variables
Lurking variables can falsely suggest a relationship. What is the lurking variable in these examples? How could you answer if you didn’t know anything about the topic?
• Strong positive association between the number of firefighters at a fire site and the amount of damage a fire does
• Negative association between moderate amounts of wine drinking and death rates from heart disease in developed nations
Correlation Does Not Imply Causation
Even very strong correlations may not correspond to a real causal relationship.
Evidence of Causation
A properly conducted experiment establishes the connection. Other considerations:
• A reasonable explanation for a cause and effect exists
• The connection happens in repeated trials
• The connection happens under varying conditions
• Potential confounding factors are ruled out
• The alleged cause precedes the effect in time
Evidence of Causation
An observed relationship can be used for prediction without worrying about causation as long as the patterns found in past data continue to hold true. We must make sure that the prediction makes sense. We must be very careful of extreme extrapolation.
Reasons Two Variables May Be Related (Correlated)
• The explanatory variable causes a change in the response variable
• The response variable causes a change in the explanatory variable
• The explanatory variable may be a contributing cause, but not the sole cause, of changes in the response variable
• Confounding variables may exist
• Both variables may result from a common cause, such as both variables changing over time
• The correlation may be merely a coincidence
Common Response (both variables change due to a common cause)
• Explanatory: divorce among men
• Response: percent abusing alcohol
Both may result from an unhappy marriage.
Both Variables Are Changing Over Time
Both divorces and suicides have increased dramatically since 1900. Are divorces causing suicides? Are suicides causing divorces? The population has increased dramatically since 1900 (causing both to increase). Better to investigate: has the rate of divorce or the rate of suicide changed over time?
The Relationship May Be Just a Coincidence
Sometimes we see some strong correlations (or apparent associations) just by chance, even when the variables are not related in the population.
Coincidence (?): Vaccines and Brain Damage
A required whooping cough vaccine was blamed for seizures that caused brain damage, which led to reduced production of the vaccine (due to lawsuits). A study of 38,000 children found no evidence for the accusations (reported in the New York Times): “people confused association with cause-and-effect”; “virtually every kid received the vaccine…it was inevitable that, by chance, brain damage caused by other factors would occasionally occur in a recently vaccinated child.”
Quiz • The least-squares regression line is A) the line that best splits the data in half, with half of the points above the line and half below the line. B) the line that makes the square of the correlation in the data as large as possible. C) the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. D) all of the above.