AP Statistics Chapter 3: Scatterplots, Association, and Correlation
Relationships • "You can observe a lot by watching." – Yogi Berra • Although said in jest, this statement carries much truth. • Many statistical studies look at multiple variables to try to show a relationship between one variable and another. • Most of the time, the question is whether there is an association between two variables. • Before we continue to explore this concept, it is important to define a few terms.
When one variable affects another, one variable is referred to as the explanatory variable and the other as the response variable. • Explanatory Variable • "A variable that attempts to explain the observed outcomes." • Response Variable • A variable that measures an outcome of a study. • Association • Any relationship between two measured quantities that renders them statistically dependent • Simply put, an association exists if there is a (direct or indirect) link between two variables.
Example • Suppose that I randomly select 10 students from a Stats class and record their weights in pounds, getting the following results: 103, 201, 125, 179, 150, 138, 181, 220, 113, 126 • Now, if I were going to pick another random student, could I come up with a prediction of how much that student weighs? (The mean is 153.6 pounds and the standard deviation is 39.58 pounds; see the sketch below.) • How accurate will my prediction be? • Is there a way to improve this prediction?
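A minimal sketch, assuming Python with NumPy is available (the slides themselves use a TI calculator), that checks the quoted mean and standard deviation:

```python
# Minimal check of the summary statistics quoted above (Python/NumPy assumed).
import numpy as np

weights = np.array([103, 201, 125, 179, 150, 138, 181, 220, 113, 126])

mean_weight = weights.mean()        # 153.6 pounds
sd_weight = weights.std(ddof=1)     # about 39.58 pounds (sample sd)

print(f"mean = {mean_weight:.1f} lb, sd = {sd_weight:.2f} lb")
# With only one variable, the best single-number prediction for a new
# student is the mean, and a typical prediction error is roughly one sd.
```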
Example • Now let's say that I have more information: Weight: 103, 201, 125, 179, 150, 138, 181, 220, 113, 126 Height: 61, 68, 65, 69, 65, 61, 64, 72, 63, 62 • These are the weights (in pounds) and heights (in inches) of the same 10 students. • Now, if I were going to pick another random student, knowing their height is 65 inches, could I come up with a prediction of how much that student weighs? • How accurate will my prediction be? • Is this a better way to make a prediction?
Roles for Variables • When we have two variables (bivariate data), it is always a good idea to make a picture! • As we graph the two variables, it is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. • This determination is made based on the roles played by the variables. • When the roles are clear, the explanatory (independent) variable goes on the x-axis, and the response (dependent) variable goes on the y-axis.
Roles for Variables (cont.) • The roles that we choose for variables are more about how we think about them than about the variables themselves. • Just placing a variable on the x-axis doesn't necessarily mean that it explains or predicts anything, and the variable on the y-axis may not respond to it in any way. • In a cause-and-effect relationship, the explanatory variable is the cause, and the response variable is the effect. • Regression is a method for predicting the value of a dependent variable y based on the value of an independent variable x.
Examples • A study looks at smoking and lung cancer. • Which (if any) is the explanatory variable? • Which (if any) is the response variable? • Is smoking a quantitative or categorical variable? • Is lung cancer a quantitative or categorical variable?
Examples • A study looks at cavities and milk drinking. • Which (if any) is the explanatory variable? • Which (if any) is the response variable? • Is the number of cavities a quantitative or categorical variable? • Is milk drinking a quantitative or categorical variable?
Examples • A study looks at rainfall and SAT scores. • Which (if any) is the explanatory variable? • Which (if any) is the response variable? • Is rainfall a quantitative or categorical variable? • Are SAT scores quantitative or categorical variables?
Looking at Scatterplots • Scatterplots may be the most common and most effective display for two-variable data. • In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others. • Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.
Example • Now let's revisit the problem we started with: Weight: 103, 201, 125, 179, 150, 138, 181, 220, 113, 126 Height: 61, 68, 65, 69, 65, 61, 64, 72, 63, 62 • These are the weights (in pounds) and heights (in inches) of the 10 students. • Graph the data and describe the distribution; a possible software version is sketched below.
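One possible way to make the picture, assuming Python with matplotlib (the slides themselves call for graph paper or a calculator):

```python
# Scatterplot of the height/weight example (Python/matplotlib assumed).
import matplotlib.pyplot as plt

heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]            # explanatory (x-axis)
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]  # response (y-axis)

plt.scatter(heights, weights)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title("Weight vs. Height for 10 Statistics Students")
plt.show()
# The plot shows a fairly strong, positive, roughly linear association:
# taller students tend to weigh more.
```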
Regression • In regression, we use the explanatory variable to estimate (or predict) the value of the response variable. • As you can imagine, sometimes there is a clear relationship between the two variables and sometimes there is no relationship at all. • Also, even when there is a relationship, some relationships are strong while others are weak. • To best describe the relationship, we should always describe its form, direction, and strength.
Interpreting Association • Form: Is there a pattern? Is the data linear or curved? Are there clusters of data? • Strength: Is it weak or strong? Does the data tightly conform or loosely conform? • Direction: If linear, does the data go up (positively associated) or go down (negatively associated), or is it a horizontal line (no association)? • Deviations from pattern: Are there areas where the data conform less to the pattern? Are there any outliers?
Attributes of a good scatterplot (example: percent of graduates taking the SAT vs. average SAT Math score) • Consistent and uniform scale • Labels on both axes • Accurate placement of data • Data spread throughout the axes • Axis break lines if not starting at zero • To achieve these goals, you are required to do your scatterplots on graph paper.
Examples • Try to make a graph of each of the following situations, then describe the association: • Points allowed vs. winning percentage • World population vs. year • Amount of rain vs. crop yield • Height vs. weight • Height vs. GPA • Shoe size vs. probability of winning the National Spelling Bee
Bivariate Data - Review • The very first step to analyzing bivariate data is to graph it. When we graph this data, we use a scatterplot. • After we graph it, we examine three things to describe the association: • Form – linear or curved (we will discuss curved data later in this unit) • Direction – positive, negative, or no association • Strength – strong or weak
Looking at Scatterplots • Form: If there is a straight line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. • How should we describe the form for these two graphs?
Looking at Scatterplots (cont.) • Direction: A positive association generally tells us that as one variable increases, the other variable also increases. • A negative association generally tells us that as one variable increases, the other variable decreases. • In this example, there is a negative association between central pressure and maximum wind speed. • As the central pressure increases, the maximum wind speed decreases.
Looking at Scatterplots (cont.) • Direction: • When the points are scattered about randomly with no discernible pattern, we say that there is no association. • No association generally tells us that as one variable increases, we know nothing about the other variable. • Such a scatterplot shows no association between the explanatory and response variables.
Looking at Scatterplots (cont.) • Strength: • This describes how "tightly" the points follow the form (or pattern). • A strong association has points that follow the pattern very tightly, whereas a weak association has points that follow the pattern in a much "looser" manner. • Note: we will quantify the amount of scatter soon. • The first graph shows a strong, positive linear association; the second shows a weak, negative linear association.
Looking at two-variable data • Let's look at a real-life example to make this idea a little more concrete… • Do taller people tend to have heavier weights? • This question is an example of how two variables play different roles in data. • Height is the explanatory (predictor) variable and weight is the response variable. • Let's take a look at the Detroit Pistons.
2003-04 Detroit Pistons Roster
#  | PLAYER              | POS | HT   | WT  | DOB      | FROM               | YRS
7  | Chucky Atkins       | G   | 5-11 | 160 | 8/14/74  | South Florida '96  | 4
1  | Chauncey Billups    | G   | 6-3  | 202 | 9/25/76  | Colorado '97       | 6
41 | Elden Campbell      | C-F | 7-0  | 279 | 7/23/68  | Clemson '90        | 13
44 | Hubert Davis        | G   | 6-5  | 183 | 5/17/70  | North Carolina '92 | 11
   | Carlos Delfino**    | G   | 6-6  | 230 | 8/29/82  | Argentina          | R
   | Andreas Glyniadakis | C   | 7-1  | 280 | 8/21/81  | Greece             | R
   | Darvin Ham          | F-G | 6-7  | 240 | 7/23/73  | Texas Tech '96     | 6
32 | Richard Hamilton    | G-F | 6-7  | 193 | 2/14/78  | Connecticut '00    | 4
   | Lindsey Hunter      | G   | 6-2  | 195 | 12/03/70 | Jackson State '93  | 10
31 | Darko Milicic       | F-C | 7-0  | 245 | 6/20/85  | Serbia-Montenegro  | R
13 | Mehmet Okur         | F   | 6-11 | 249 | 5/26/79  | Turkey             | 1
22 | Tayshaun Prince     | F   | 6-9  | 215 | 2/28/80  | Kentucky '02       | 1
39 | Zeljko Rebraca      | C   | 7-0  | 257 | 4/09/72  | Serbia-Montenegro  | 2
52 | Don Reid            | F   | 6-8  | 250 | 12/30/73 | Georgetown '95     | 8
   | Bob Sura            | G   | 6-5  | 200 | 3/25/73  | Florida State '95  | 8
3  | Ben Wallace         | F-C | 6-9  | 240 | 9/10/74  | Virginia Union '96 | 7
34 | Corliss Williamson  | F   | 6-7  | 245 | 12/04/73 | Arkansas '96       | 8
[Scatterplot: Pistons players' heights in inches (68–86) on the x-axis vs. weights in pounds (125–300) on the y-axis]
Correlation Coefficient • The correlation coefficient numerically measures the strength of the linear association between two quantitative variables. Correlation = Linear Association = Relationship • There are three conditions that must be met before we can look at the correlation coefficient: • Quantitative Condition: Correlation only applies to quantitative variables. Don’t apply correlation to categorical data! • Straight Enough Condition: Correlation measures the strength of linear association, which is useless if the data is not linear!
Correlation Coefficient • There is one more condition: • Outlier Condition: • Outliers can distort the correlation dramatically. • An outlier can make an otherwise small correlation look big or hide a large correlation. • It can even give an otherwise positive association a negative correlation coefficient (and vice versa). • When you see an outlier, it’s often a good idea to report the correlations with and without the point. • Note: when asked about correlation, you should memorize this phrase: • With a correlation of (r), there is a (strong/weak), (positive/negative) linear association between the (explanatory variable) and the (response variable)
Correlation Coefficient • Is there a "correlation" between a basketball team's heights and weights? • Is the association positive or negative? • Is the association strong or weak? (A possible software check follows below.)
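A sketch of one way to answer these questions with software, assuming Python with NumPy; the heights and weights are read off the roster slide above:

```python
# Correlation of height and weight for the 2003-04 Pistons (Python/NumPy assumed).
import numpy as np

# (height string "feet-inches", weight in pounds) from the roster
raw = [("5-11", 160), ("6-3", 202), ("7-0", 279), ("6-5", 183), ("6-6", 230),
       ("7-1", 280), ("6-7", 240), ("6-7", 193), ("6-2", 195), ("7-0", 245),
       ("6-11", 249), ("6-9", 215), ("7-0", 257), ("6-8", 250), ("6-5", 200),
       ("6-9", 240), ("6-7", 245)]

def to_inches(ht: str) -> int:
    feet, inches = ht.split("-")
    return 12 * int(feet) + int(inches)

heights = np.array([to_inches(ht) for ht, _ in raw])
weights = np.array([wt for _, wt in raw])

r = np.corrcoef(heights, weights)[0, 1]
print(f"r = {r:.2f}")   # roughly 0.87: a strong, positive linear association
```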
What do we do with correlation? Examine pg. 151
Calculating Correlation Coefficient • The calculation of correlation is based on the mean and standard deviation. • Remember that neither the mean nor the standard deviation is a resistant measure.
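For reference, the standard formula for the correlation coefficient, written in terms of the z-scores discussed on the next slide, is:

$$
r \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) \;=\; \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}
$$

Because the mean and standard deviation appear in every term, any point that distorts them also distorts r.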
Calculating Correlation Coefficient • What do the contents of the parentheses look like? They are the z-scores of each x- and y-value. • What happens when both values come from the lower half of the population? Both z-values are negative, so their product is positive. • What happens when both come from the upper half? Both z-values are positive, so their product is positive.
Calculating Correlation Coefficient • What happens when one value is from the lower half of the population but the other value is from the upper half? One z-value is positive and the other is negative, so their product is negative.
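A minimal sketch, assuming Python with NumPy, that builds r from the z-score products for the ten-student height/weight example:

```python
# Compute r directly from z-scores (Python/NumPy assumed).
import numpy as np

heights = np.array([61, 68, 65, 69, 65, 61, 64, 72, 63, 62])
weights = np.array([103, 201, 125, 179, 150, 138, 181, 220, 113, 126])

z_h = (heights - heights.mean()) / heights.std(ddof=1)
z_w = (weights - weights.mean()) / weights.std(ddof=1)

# products are positive when the two z-scores share a sign, negative otherwise
products = z_h * z_w

r = products.sum() / (len(heights) - 1)
print(f"r = {r:.3f}")                                       # about 0.85
print(np.isclose(r, np.corrcoef(heights, weights)[0, 1]))   # True
```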
Using the TI-84 to calculate r • If you have a TI-84, you must turn your Diagnostic on; you need to enter “DiagnosticOn” from the “Catalog” • TI-89 users don’t need to worry about this operation since the Diagnostic is automatically “on” in your calculator
Using the TI to calculate r • Run LinReg(a+bx) with the explanatory variable as the first list, and the response variable as the second list. [TI-84 and TI-89 screenshots]
Using the TI to calculate r • The results are the slope and vertical intercept of the regression equation (more on that later) and values of r and r2 (more on r2 later as well).
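If a calculator isn't handy, statistical software reports the same quantities; here is a sketch assuming Python with SciPy (not part of the original slides):

```python
# Regression output for the height/weight example (Python/SciPy assumed).
from scipy.stats import linregress

heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]            # explanatory
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]  # response

result = linregress(heights, weights)
print(f"intercept a = {result.intercept:.1f}")
print(f"slope     b = {result.slope:.2f}")
print(f"r   = {result.rvalue:.3f}")
print(f"r^2 = {result.rvalue**2:.3f}")
```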
Facts about correlation • Both variables need to be quantitative. • Because the data values are standardized, it does not matter what units we use for each of the variables. • Also, since r uses standardized values of the observations, r does not change if we change the units of x, y, or both (in other words, we can add or subtract a constant, or multiply or divide by a positive constant, on x, y, or both and r will stay the same; see the check below). • The value of r is unitless.
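A quick check of the unit-invariance claim, assuming Python with NumPy: converting heights to centimeters and weights to kilograms leaves r unchanged.

```python
# r does not change when the units of x and y change (Python/NumPy assumed).
import numpy as np

heights_in = np.array([61, 68, 65, 69, 65, 61, 64, 72, 63, 62])
weights_lb = np.array([103, 201, 125, 179, 150, 138, 181, 220, 113, 126])

r_original = np.corrcoef(heights_in, weights_lb)[0, 1]
r_metric = np.corrcoef(heights_in * 2.54, weights_lb * 0.4536)[0, 1]

print(np.isclose(r_original, r_metric))   # True: r is unit-less
```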
Facts about correlation • The value of r will always be between -1 and 1. • Values closer to -1 reflect strong negative linear association. • Values closer to +1 reflect strong positive linear association. • Values close to 0 reflect no linear association. • Correlation does not measure the strength of non-linear relationships
Facts about correlation • Correlation is blind to any cause-and-effect relationship between the explanatory and response variables. • Even if you get an r value close to -1 or 1, it does not mean that you can say the explanatory variable causes the response variable. • Scatterplots and correlation coefficients never prove causation. • A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.
Facts about correlation • The value of r is a measure of the strength of a linear relationship. It measures how closely the data fall to a straight line. An r value near 0, however, does not imply that there is no relationship, only no linear relationship. For example, quadratic or sinusoidal data can have an r close to 0 even though a strong relationship is present (see the sketch below). • r measures the correlation between two variables in a sample of observations from the population of interest. Thus, r is the sample correlation coefficient, which is used to estimate ρ (rho), the population correlation coefficient.
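An illustration of this caution (an assumed Python/NumPy sketch, not from the slides): perfectly quadratic data whose correlation is essentially zero.

```python
# A perfect nonlinear relationship with r near 0 (Python/NumPy assumed).
import numpy as np

x = np.linspace(-5, 5, 101)   # values symmetric around zero
y = x ** 2                    # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")         # approximately 0
```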
What Can Go Wrong? • Don’t say “correlation” when you mean “association.” • More often than not, people say correlation when they mean association. • The word “correlation” should be reserved for measuring the strength and direction of the linear relationship between two quantitative variables.
What Can Go Wrong? • Don’t correlate categorical variables. • Be sure to check the Quantitative Variables Condition. • Don’t confuse “correlation” with “causation.” • Scatterplots and correlations never demonstrate causation. • These statistical tools can only demonstrate an association between variables.
What Can Go Wrong? (cont.) • Be sure the association is linear. • There may be a strong association between two variables even when that association is nonlinear, yet the correlation can be near 0 – why do you think that is?
What Can Go Wrong? (cont.) • Don’t assume the relationship is linear just because the correlation coefficient is high. • Here the correlation is 0.979, but the relationship is actually bent.
What Can Go Wrong? (cont.) • Beware of outliers. • Even a single outlier can dominate the correlation value. • Make sure to check the Outlier Condition. • Without the outlier, the correlation would be near 0; but with the outlier, the correlation, deceptively, is much closer to 1 (see the sketch below).
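A sketch of the outlier effect described above, assuming Python with NumPy: thirty unrelated points plus one extreme point.

```python
# One outlier can dominate the correlation (Python/NumPy assumed).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 30)
y = rng.normal(0, 1, 30)          # unrelated to x, so r is typically near 0

r_without = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(np.append(x, 20), np.append(y, 20))[0, 1]

print(f"without outlier: r = {r_without:.2f}")   # typically near 0
print(f"with outlier:    r = {r_with:.2f}")      # much closer to 1
```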
What have we learned? • We examine scatterplots for form, direction, strength, and unusual features. • Although not every relationship is linear, when the scatterplot is straight enough, the correlation coefficient is a useful numerical summary. • The sign of the correlation tells us the direction of the association. • The magnitude of the correlation tells us the strength of a linear association. • Correlation has no units, so shifting or scaling the data, standardizing, or swapping the variables has no effect on the numerical value.
What have we learned? (cont.) • Doing Statistics right means that we have to Think about whether our choice of methods is appropriate. • Before finding or talking about a correlation, check the Straight Enough Condition. • Watch out for outliers! • Don’t assume that a high correlation or strong association is evidence of a cause-and-effect relationship—beware of lurking variables!
What have we learned? (cont.) • Unusual features: • Look for the unexpected. • Often the most interesting thing to see in a scatterplot is the thing you never thought to look for. • One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot. • Clusters or subgroups should also raise questions.