Correlation • It should come as no great surprise that there is an association between height and weight • Yes, as you would expect, taller students tend to weigh more (or, conversely, heavier students tend to be taller)
Standardizing • We want to put a number on the strength of the association between the two variables of a scatterplot. • We want it to be unaffected by our choice of units (e.g., kg vs. lbs), because changing units doesn't change the direction, form, or strength of the relationship. • Let's eliminate the units by standardizing each variable. • Now, instead of each point $(x, y)$, we'll have the standardized coordinates $(z_x, z_y) = \left(\frac{x - \bar{x}}{s_x}, \frac{y - \bar{y}}{s_y}\right)$.
Standardizing • Standardizing makes the means of both variables 0, so the new center of the scatterplot is at the origin. The scales on both axes are now standard deviation units.
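As a minimal sketch of the standardizing step, the Python below converts each variable to z-scores. The height and weight values are made up for illustration (they are not from the slides), and `standardize` is a hypothetical helper name.

```python
from statistics import mean, stdev

def standardize(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - center) / spread for v in values]

# Hypothetical heights (cm) and weights (kg) for six students.
heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0, 188.0]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0, 90.0]

z_height = standardize(heights_cm)
z_weight = standardize(weights_kg)

# Switching units (kg to lbs) should leave the z-scores unchanged.
weights_lb = [w * 2.20462 for w in weights_kg]
z_weight_lb = standardize(weights_lb)

# After standardizing, each variable has mean 0 and standard deviation 1.
print(round(mean(z_height), 10), round(stdev(z_height), 10))
```

Because the z-scores are the same in kg and in lbs, anything built from them does not depend on the unit choice.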
Effects of Standardizing • Underlying linear patterns often appear steeper in the standardized plot. • Why? • Hint: Look at the units used on each axis.
Reading Standardized Plots • Which points in the scatterplot of z-scores give the impression of a positive association? • For a positive association, y tends to increase as x increases. • Points in the upper right and lower left quadrants strengthen the impression of a positive association. • For these points, $z_x$ and $z_y$ have the same sign, so the product $z_x z_y$ is positive. • Points far from the origin have bigger products.
Reading Standardized Plots • Which points in the scatterplot of z-scores give the impression of a negative association? • The red points in the upper left and lower right quadrants tend to weaken the positive association or support a negative association. • For these points, $z_x$ and $z_y$ have opposite signs, so the product $z_x z_y$ is negative. • Points far from the origin have a product that is large in magnitude.
Reading Standardized Plots • Points with a z-score of zero on either variable don't vote either way, because their product $z_x z_y$ is 0. • These points are colored blue.
A Measure of Correlation • To turn these products into a measure of the strength of the association, just add up the products $z_x z_y$ for every point in the scatterplot: $\sum z_x z_y$. • This sum summarizes the direction and strength of the association for all the points. • If most of the points are in the green quadrants (upper right and lower left), the sum will tend to be positive. • If most are in the red quadrants (upper left and lower right), the sum will tend to be negative.
A Measure of Correlation • Right now, the sum gets bigger the more data we have. To adjust for this, the natural thing for a statistician to do is divide the sum by $n - 1$. • We call this ratio the correlation coefficient: $r = \frac{\sum z_x z_y}{n - 1}$.
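The calculation can be sketched in a few lines of Python: standardize both variables, multiply the z-scores point by point, add the products, and divide by n - 1. The function names and the data are assumptions for illustration, not part of the slides.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of numbers using the sample standard deviation."""
    center, spread = mean(values), stdev(values)
    return [(v - center) / spread for v in values]

def correlation(xs, ys):
    """Correlation coefficient: sum of z_x * z_y over all points, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (n - 1)

# Hypothetical heights (cm) and weights (kg) for six students.
heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0, 188.0]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0, 90.0]

r = correlation(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

Points in the upper right and lower left contribute positive products, so a sum dominated by them gives a positive r.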
Quick Check • What does a correlation coefficient of r = 0.8 look like? • What does a correlation coefficient of r = 0.3 look like? • Note: r will always be used for correlation
Correlation Conditions • Correlation measures the strength of the linear association between two quantitative variables. To use correlation, you must check several conditions: • Quantitative Variables Condition: Correlation applies only to two quantitative variables; it cannot be used for categorical variables. Check that you know the variables' units and what they measure.
Correlation Conditions • Correlation measures the strength of the linear association between two quantitative variables. To use correlation, you must check several conditions: • Straight Enough Condition: Is the form of the scatterplot straight enough that a linear relationship makes sense?
Correlation Conditions • Correlation measures the strength of the linear association between two quantitative variables. To use correlation, you must check several conditions: • Outlier Condition: Outliers can distort the correlation dramatically. An outlier can make an otherwise weak correlation look big or hide a strong correlation. It can even give an otherwise positive association a negative correlation coefficient, and vice versa. When you see an outlier, it is often a good idea to report the correlation with and without that point.
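To see how much a single point can matter, here is a small sketch with made-up data: five points with no linear association, then the same points plus one outlier. The numbers and the `correlation` helper are illustrative assumptions.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up data: five points with essentially no linear association.
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 1, 2]

# The same data with one far-away point added at (15, 15).
x_with_outlier = x + [15]
y_with_outlier = y + [15]

r_without = correlation(x, y)
r_with = correlation(x_with_outlier, y_with_outlier)

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

Reporting both values, as the slide suggests, makes it clear how much of the correlation is due to that one point.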
Just Checking • Let's say I gave two exams, both worth 50 points, and reported that the correlation between the two scores was 0.75. • 1) Before answering any questions about the correlation, what would you like to see, and why? • Answer: We know the scores are quantitative, so we should check whether the Straight Enough Condition and the Outlier Condition are satisfied by looking at the scatterplot of the two scores.
Just Checking • Let's say I gave two exams, both worth 50 points, and reported that the correlation between the two scores was 0.75. • 2) If I add 10 points to each Exam 1 score, how will this change the correlation? • Answer: It will not change.
Just Checking • Let's say I gave two exams, both worth 50 points, and reported that the correlation between the two scores was 0.75. • 3) If I standardize the scores on each exam, how will this affect the correlation? • Answer: It will not change.
Just Checking • Let's say I gave two exams, both worth 50 points, and reported that the correlation between the two scores was 0.75. • 4) In general, if someone did poorly on Exam 1, are they likely to have done poorly on Exam 2? Explain. • Answer: They are likely to have done poorly. The positive correlation means low scores on Exam 1 are associated with low scores on Exam 2.
Just Checking • Let's say I gave two exams, both worth 50 points, and reported that the correlation between the two scores was 0.75. • 5) If someone did poorly on Exam 1, can you be sure they did poorly on Exam 2? Explain. • Answer: No. The general association is positive, but individual performances may vary.
Correlation Properties • The sign of a correlation coefficient gives the direction of the association. • Correlation is always between -1 and +1. The extreme values -1 and +1 are possible but unusual in real data, because they mean that all the data points fall exactly on a single straight line. • Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x.
Correlation Properties • Correlation has no units. Correlation is sometimes given as a percentage, but you probably shouldn’t do that because it suggests a percentage of something – and correlation, lacking units, has no “something” of which to be a percent. • Correlation is not affected by changes in the center or scale of either variable. Changing the units or baseline of either variable has no effect on the correlation coefficient. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale
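These properties are easy to check numerically. The sketch below uses hypothetical scores on two 50-point exams (not data from the slides) and compares the correlation after adding 10 points, after rescaling, after standardizing, and after swapping x and y.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of numbers using the sample standard deviation."""
    center, spread = mean(values), stdev(values)
    return [(v - center) / spread for v in values]

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)

# Hypothetical scores on two 50-point exams.
exam1 = [31, 38, 42, 45, 27, 49, 35, 40]
exam2 = [29, 41, 40, 47, 30, 44, 33, 43]

r = correlation(exam1, exam2)
r_shifted = correlation([s + 10 for s in exam1], exam2)   # change of center
r_rescaled = correlation([s * 2 for s in exam1], exam2)   # change of scale (points to percent)
r_standardized = correlation(z_scores(exam1), z_scores(exam2))
r_swapped = correlation(exam2, exam1)                     # x and y exchanged

print(r, r_shifted, r_rescaled, r_standardized, r_swapped)
```

All five values agree up to floating-point rounding, since each version produces the same z-scores.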
Correlation Properties • Correlation measures the strength of the linear association between the two variables. Variables can be strongly associated but still have a small correlation if the association is not linear. • Correlation is sensitive to outliers. A single outlying value can make a small correlation large or a large correlation small.
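As an illustration of a strong but nonlinear association, consider made-up data where y is exactly the square of x. The relationship is perfect, yet the correlation comes out at zero because the pattern is a curve, not a line.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Illustrative data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(f"r = {r:.3f}")
```

The positive products on the right side of the plot cancel the negative products on the left, which is why checking the scatterplot first matters.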
The more firemen fighting a fire, the bigger the fire is observed to be. • Therefore firemen cause an increase in the size of a fire.
As ice cream sales increase, the rate of drowning deaths increases sharply. • Therefore, ice cream consumption causes drowning.
Since the 1950s, both atmospheric CO2 levels and obesity levels have risen sharply. • Therefore, global warming is causing obesity.
A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable. • Ice cream sales and drownings are both driven by the increased number of beach-goers during the summer. • Obesity and global warming are both driven by increased wealth and energy consumption.
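A lurking variable can be mimicked with a small, entirely made-up example. Below, daily temperature drives both ice cream sales and drownings, and neither of those two variables appears in the formula for the other. All numbers and coefficients are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Lurking variable: hypothetical daily temperatures (degrees C).
temperature = [15, 18, 21, 24, 27, 30, 33]

# Fixed "noise" terms so the example is reproducible without random numbers.
sales_noise = [2, -3, 1, -1, 3, -2, 0]
drowning_noise = [-0.5, 0.4, -0.3, 0.6, -0.4, 0.2, 0.0]

# Each variable depends on temperature only, never on the other variable.
ice_cream_sales = [20 + 3 * t + e for t, e in zip(temperature, sales_noise)]
drownings = [1 + 0.2 * t + e for t, e in zip(temperature, drowning_noise)]

r = correlation(ice_cream_sales, drownings)
print(f"r(ice cream sales, drownings) = {r:.3f}")
```

The correlation is high even though, by construction, ice cream sales have no effect on drownings.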
Correlation Tables • It is common in some fields to compute the correlations between every pair of variables in a collection and arrange these correlations in a table. • Why is this dangerous?
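A correlation table can be sketched by looping over every pair of variables. The variable names and values here are hypothetical. Note that the table reports only the numbers: it gives no sign of whether each pair passed the Straight Enough and Outlier Conditions.

```python
from itertools import combinations
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical measurements on six students.
data = {
    "height_cm": [150.0, 160.0, 165.0, 172.0, 180.0, 188.0],
    "weight_kg": [52.0, 58.0, 63.0, 70.0, 77.0, 90.0],
    "shoe_size": [36.0, 38.0, 39.0, 41.0, 43.0, 45.0],
}

# One correlation per unordered pair of variables.
table = {}
for name_a, name_b in combinations(data, 2):
    table[(name_a, name_b)] = correlation(data[name_a], data[name_b])

for (name_a, name_b), r in table.items():
    print(f"{name_a:>10} vs {name_b:<10} r = {r:.3f}")
```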
Straightening Scatterplots • An Example With Cameras • Some camera lenses have an adjustable aperture, the hole that lets the light in. • The size of this aperture is expressed as a mysterious number called the f/stop. • Each increase of one f/stop number corresponds to halving the light that is allowed to come through. • When we halve the shutter speed, we cut the light that gets in, so we have to open the aperture one notch to compensate.
Straightening Scatterplots • We can experiment to find the best f/stop values for each shutter speed.
Straightening Scatterplots • The correlation of these shutter speeds and f/stops is 0.979. That sounds pretty high, and you might assume a strong linear relationship. But when we check the scatterplot, it shows that something is not quite right.
Straightening Scatterplots • We can see that f/stop is not linearly related to shutter speed. Can we find a transformation of f/stop that straightens out the line? • What if we look at the square of the f/stop against the shutter speed?
Straightening Scatterplots • The correlation is now 0.998, but the increase in correlation is not what matters. What matters is that the form of the plot is now straight, so the correlation is now an appropriate measure of association.
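The re-expression can be tried out in Python. The slide's table of measurements is not reproduced in this transcript, so the values below are an assumption: the standard nominal shutter speeds and f/stops found on many cameras. The computed correlations therefore need not match the slide's figures exactly.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Assumed data: nominal shutter speeds (seconds) paired with nominal f/stops.
shutter_speed_s = [1 / 1000, 1 / 500, 1 / 250, 1 / 125, 1 / 60, 1 / 30, 1 / 15, 1 / 8]
f_stop = [2.8, 4.0, 5.6, 8.0, 11.0, 16.0, 22.0, 32.0]

# Re-express f/stop by squaring it, then compare the two correlations.
f_stop_squared = [f ** 2 for f in f_stop]

r_raw = correlation(shutter_speed_s, f_stop)
r_squared_fstop = correlation(shutter_speed_s, f_stop_squared)

print(f"f/stop vs shutter speed:     r = {r_raw:.3f}")
print(f"(f/stop)^2 vs shutter speed: r = {r_squared_fstop:.3f}")
```

The numbers alone do not show the bend in the original plot, so it is still worth drawing both scatterplots before trusting either correlation.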
Homework Pg 165, # 12, 17, 23, 27, 33