Learn how to determine the shape, strength, direction, and lurking variables in associations between numerical variables. Discover how linear and nonlinear associations, positive and negative directions, and lurking variables impact data analysis. Understand the properties of the linear correlation coefficient and the process of identifying outliers.
Shape of an Association If the points of a scatterplot lie close to (or on) a line, we say the variables are linearly associated and that there is a linear association. If the points of a scatterplot lie close to (or on) a curve that is not a line, we say there is a nonlinear association.
Strength of an Association If a curve passes through all the points of a scatterplot, we say there is an exact association with respect to the curve. If a curve comes quite close to all the points, we say there is a strong association with respect to the curve. If a curve comes somewhat close to all the points, we say there is a weak association with respect to the curve. The definitions for strong and weak associations are vague and require us to make a judgment call.
Example: Interpreting a Scatterplot The figure shows a scatterplot comparing carbohydrates and calories for 29 pizzas made by six of the leading pizza companies.
Example: Continued A scatterplot comparing carbohydrates and fat for the same pizzas is shown.
Example: Continued 1. For each of the scatterplots, describe the shape of the association. 2. Compare the strengths of the two associations. 3. For each of the scatterplots, identify whether the association is positive, negative, or neither. 4. Estimate the caloric content of the pizza with 53 g of carbohydrates. 5. Estimate the fat content of the pizza with 53 g of carbohydrates.
Solution 1. For both scatterplots, the points lie near a line. So, the associations are both linear. 2. The points in the first scatterplot lie closer to a line than the points in the second scatterplot. So, the association between carbohydrates and calories is stronger than the association between carbohydrates and fat. 3. For both scatterplots, the response variable tends to increase as the number of grams of carbohydrates increases. So, the associations are both positive.
Solution 4. We start at 53 on the horizontal axis and look up to the (red) dot. Then we look to the left of the dot and determine that the caloric content is approximately 500 calories.
Solution 5. We start at 53 on the horizontal axis and look up to the (green) dot. Then we look to the left of the dot and determine that the fat content is approximately 21 g.
Properties of the Linear Correlation Coefficient Assume r is the linear correlation coefficient for the association between two numerical variables. Then • The values of r are between –1 and 1, inclusive. • If r is positive, then the variables are positively associated. • If r is negative, then the variables are negatively associated. • If r = 0, there is no linear association.
Properties of the Linear Correlation Coefficient Assume r is the linear correlation coefficient for the association between two numerical variables. Then • The larger the value of |r| , the stronger the linear association will be. • If r = 1, then the points lie exactly on a line and the association is positive. • If r = –1, then the points lie exactly on a line and the association is negative.
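The properties above can be checked numerically. The following is a minimal sketch that assumes r is the usual Pearson correlation coefficient; the helper name pearson_r and all data values are made up for illustration and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired numerical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points exactly on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points exactly on a falling line

# Points exactly on a parabola: an exact nonlinear association.
us = [-2, -1, 0, 1, 2]
r_curve = pearson_r(us, [u ** 2 for u in us])

print(r_up, r_down, r_curve)
```

The parabola case illustrates that r = 0 rules out only a linear association; the variables can still have an exact nonlinear association.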
Order of Determining the Four Characteristics of an Association We determine the four characteristics of an association in the following order: 1. Identify all outliers. a. For outliers that stem from errors in measurement or recording, correct the errors if possible. If the errors cannot be corrected, remove the outliers. b. For other outliers, determine whether they should be analyzed in a separate study. 2. Determine the shape of the association.
Order of Determining the Four Characteristics of an Association 3. If the shape is linear, then on the basis of r and the scatterplot, determine the strength. If the shape is nonlinear, then on the basis of the scatterplot, determine the strength. 4. Determine the direction. In other words, determine whether the association is positive, negative, or neither.
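One reason to handle outliers first is that a single recording error can distort r. The sketch below uses made-up data in which one value of 6 was supposedly mis-recorded as 60; the helper pearson_r and the numbers are assumptions for illustration, not part of the original example.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired numerical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the true values lie on the line y = 2x,
# but the third value was recorded as 60 instead of 6.
xs = [1, 2, 3, 4, 5, 6]
ys_recorded = [2, 4, 60, 8, 10, 12]

# Step 1a: correct the recording error before describing the association.
ys_corrected = list(ys_recorded)
ys_corrected[2] = 6

r_recorded = pearson_r(xs, ys_recorded)
r_corrected = pearson_r(xs, ys_corrected)
print(round(r_recorded, 3), round(r_corrected, 3))
```

With the error left in, r is close to 0 even though every other point lies on a line; after the correction, r is 1.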
Lurking Variable Definition A lurking variable is a variable that causes both the explanatory and response variables to change during the study.
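A small simulation can show how a lurking variable produces an association between two variables that do not affect each other. This is only a sketch under stated assumptions: the hidden variable z, the noise levels, and the sample size are all invented for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired numerical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical lurking variable z (for example, hours of study).
zs = [rng.uniform(0, 10) for _ in range(500)]

# x and y each depend on z plus independent noise.
# Neither x nor y appears in the formula for the other.
xs = [z + rng.gauss(0, 1) for z in zs]
ys = [z + rng.gauss(0, 1) for z in zs]

r_xy = pearson_r(xs, ys)
print(round(r_xy, 2))
```

Here x and y are strongly, positively associated only because both follow z, which is the situation the definition describes.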
Example: Determining Possible Lurking Variables The scatterplot compares the scores of Test 4 and Test 5 for a calculus course taught by the author.
Example: Continued 1. Describe the four characteristics of the association. 2. Does a higher score on Test 4 cause a higher score on Test 5? If no, describe at least one possible lurking variable. 3. On the basis of the fairly strong, positive association between Test 4 and Test 5 scores, a student concludes that there must have been a fairly strong, positive association between any pair of tests for the course. What would you tell the student?
Solution 1. • Outliers: There are no outliers. • Shape: The points appear to come close to a line, so the association is linear.
Solution • Strength: We compute that r is 0.81. Because r is fairly close to 1 and the points in the scatterplot appear to lie fairly close to a line, we conclude that the association is fairly strong.
Solution • Direction: As Test 4 points increase, Test 5 points tend to increase as well, so the association is positive. This checks with r = 0.81 being positive. 2. A student’s score on Test 4 does not directly affect the student’s score on Test 5. There are many possible lurking variables, including a student’s study habits, the consistency in the level of difficulty of the tests, Test 4’s impact on a student’s motivation or confidence, the extent to which Test 5’s concepts build on Test 4’s, and a student’s strength in algebra, which is a useful tool in calculus.
Solution 3. The fairly strong, positive association between Test 4 and Test 5 scores tells us nothing about the other test scores. In fact, it turns out that although there is a positive association between Test 1 and Test 5 scores, the association is much weaker (r = 0.58) than the association for Test 4 and Test 5 scores (r = 0.81). The impacts of the various lurking variables described in our solution to Problem 2 might explain why.