Chapter 3: Scatterplots and Correlation

AP STATISTICS Chapter 3: Scatterplots and Correlation

Size and abundance of carnivores

Basically, r tells us the strength and direction of the linear relationship between two quantitative variables.

Properties of the Linear Correlation Coefficient r
• 1. –1 ≤ r ≤ 1
• 2. The value of r does not change if all values of either variable are converted to a different scale
• 3. Interchanging all x and y values will not change r
• 4. r measures the strength of a linear relationship
• 5. No linear relationship does not imply no relationship at all. There is a possibility of a non-linear relationship.
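Properties 1 to 3 can be checked numerically. The sketch below uses the salary and experience data from the example at the end of this chapter; the helper name `correlation` is an illustrative choice, not something from the slides.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Salary example from the end of this chapter (n = 6 employees).
experience = [15, 10, 20, 5, 15, 5]   # years
salary = [30, 35, 55, 22, 40, 27]     # $thousand

r = correlation(experience, salary)   # about 0.8667, as quoted in the example

# Property 2: changing scale (years -> months, $thousand -> dollars) leaves r unchanged.
r_rescaled = correlation([12 * x for x in experience], [1000 * y for y in salary])

# Property 3: interchanging x and y leaves r unchanged.
r_swapped = correlation(salary, experience)

print(round(r, 4), round(r_rescaled, 4), round(r_swapped, 4))
```

All three printed values agree, and the common value lies between –1 and 1 as property 1 requires.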

[Figure: a non-linear relationship. Distance (feet) plotted against Time (seconds).]

Abundance of carnivores
Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables. r does not change when we change units of measurement. r is not resistant: it is strongly affected by outliers.
In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description.

Interpreting Correlation
• r = 1: a perfect straight line tilting up to the right
• r = 0: no overall tilt. No relationship? Not necessarily; there is only no linear one.
• r = –1: a perfect straight line tilting down to the right

[Figure: example scatterplots with r = –0.471, r = 0.089, and r = 0.395, plus panels labeled "No Correlation" and "Nonlinear Correlation".]

Examples
Internet Site Ratings (Minutes per person vs. Pages per person; sites shown include eBay, Yahoo!, and MSN)
• Correlation is r = 0.964
• Very strong positive association (since r is close to 1)
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
Mortgage Rates & Fees (Interest rate vs. Loan fee)
• Correlation is r = –0.890
• Strong negative association
• Linear relationship: straight line with scatter
• Decreasing relationship: tilts down and to the right

Example: The Stock Market
• Is there momentum? If the market was up yesterday, is it more likely to be up today? Or is each day’s performance independent?
• Correlation is r = 0.11
• A weak relationship? No relationship?
• Tilt is neither up nor down

Example: Maximizing Yield (Yield of process vs. Temperature)
• A nonlinear relationship: not a straight line, but a curved relationship
• Correlation r = –0.0155
• r suggests no relationship, but in reality there is a relationship; it is just NOT linear
• The relationship is strong, yet it tilts neither up nor down
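To see how a strong curved relationship can still give r near 0, here is a minimal sketch with made-up numbers. The temperatures and the yield formula below are hypothetical, chosen only to peak in the middle of the range; they are not the data behind the slide.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical process: yield peaks at 700 degrees and falls off on both sides.
temperature = list(range(500, 901, 50))
process_yield = [160 - (t - 700) ** 2 / 1000 for t in temperature]

r = correlation(temperature, process_yield)
print(f"r = {r:.4f}")  # essentially zero
```

Here yield is completely determined by temperature, yet r is essentially 0 because the rising half of the curve cancels the falling half. Only a plot reveals the pattern.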

Example: unequal variability
• Circuit miles (millions) vs. Investment ($millions) shows unequal variability
• Variability is stabilized by taking logarithms (lower right): Log of miles vs. Log of investment
• Correlation r = 0.820

All four data sets have the same r value. However, making the scatterplots shows us that correlation/regression analysis is not appropriate for all data sets:
• Just one very influential point and a series of other points all with the same x value; a redesign is due here…
• Obvious nonlinear relationship; regression inappropriate.
• One point deviates from the (highly linear) pattern of the other points; it requires examination before a regression can be done.
• Moderate linear association; regression OK.

Always plot your data! The correlations all give r ≈ 0.816, and the regression lines are all approximately ŷ = 3 + 0.5x. For all four sets, we would predict ŷ = 8 when x = 10. (Data for the previous slide.)
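The figures quoted here (r ≈ 0.816 and a line close to 3 + 0.5x) match Anscombe's quartet, a classic group of four small data sets. Assuming that is what the slides show, the sketch below recomputes the summary statistics for each set. The function name `fit_and_correlate` is illustrative.

```python
import math

def fit_and_correlate(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b, sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (assumed to be the data behind the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_and_correlate(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: y-hat = {a:.2f} + {b:.3f}x, r = {r:.3f}, prediction at x=10: {a + 10 * b:.2f}")
```

The four rows of output are nearly identical, which is exactly why the numbers alone cannot tell you whether a linear model is appropriate.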

We want the line as close as possible to the points.

So, we need the line that will minimize the sum of the squares of the vertical distances.

The regression line
The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible.
Distances between the points and the line are squared so that all are positive values. This is done so that the distances can be properly added.
[Figure: BAC levels]
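A minimal sketch of the "smallest possible" claim, using the salary and experience data from the end of this chapter: it fits the least-squares line, then nudges the intercept and slope and confirms that every nudged line has a larger sum of squared vertical distances. The nudge sizes are arbitrary choices for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

experience = [15, 10, 20, 5, 15, 5]   # years
salary = [30, 35, 55, 22, 40, 27]     # $thousand

a, b = least_squares_line(experience, salary)
best = sum_squared_errors(experience, salary, a, b)

# Try eight nearby lines: shift the intercept by +/-1 and the slope by +/-0.1.
nearby = [
    sum_squared_errors(experience, salary, a + da, b + db)
    for da in (-1, 0, 1)
    for db in (-0.1, 0, 0.1)
    if (da, db) != (0, 0)
]

print(f"least-squares SSE: {best:.2f}")
print(f"smallest SSE among nudged lines: {min(nearby):.2f}")
```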

Properties of the LSRL
The least-squares regression line can be shown to have this equation: ŷ = a + bx
• ŷ is the predicted y value (y hat)
• b is the slope
• a is the y-intercept
• "a" is in units of y; "b" is in units of y per unit of x

How to:
First we calculate the slope of the line, b, from statistics we already know: b = r (sy / sx)
• r is the correlation
• sy is the standard deviation of the response variable y
• sx is the standard deviation of the explanatory variable x
Once we know b, the slope, we can calculate a, the y-intercept: a = ȳ – b x̄, where x̄ and ȳ are the sample means of the x and y variables.
This means that we don’t have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on the equation.
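As a worked illustration of this recipe, the sketch below computes the slope as r times sy/sx and then the intercept from the sample means, using the salary and experience data from the end of this chapter. The variable names are illustrative.

```python
import statistics

experience = [15, 10, 20, 5, 15, 5]   # explanatory variable x (years)
salary = [30, 35, 55, 22, 40, 27]     # response variable y ($thousand)

n = len(experience)
mean_x = statistics.mean(experience)
mean_y = statistics.mean(salary)
s_x = statistics.stdev(experience)    # sample standard deviation of x
s_y = statistics.stdev(salary)        # sample standard deviation of y

# Correlation as the sum of products of standardized values, divided by n - 1.
r = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
        for x, y in zip(experience, salary)) / (n - 1)

b = r * s_y / s_x          # slope
a = mean_y - b * mean_x    # y-intercept

print(f"y-hat = {a:.2f} + {b:.3f} x")  # y-hat = 15.32 + 1.673 x
```

The result agrees with the fitted line quoted in the salary example, and because a is defined as ȳ – b x̄, the line necessarily passes through the point (x̄, ȳ).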

BEWARE!!! Not all calculators and software use the same convention.
• We use this one: ŷ = a + bx
• Some use instead: ŷ = ax + b
• The TI-89 and TI-83 give both.
Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.

The y-intercept
Sometimes the y-intercept is not biologically possible. Here we have a negative blood alcohol content (BAC), which makes no sense. But the negative value is appropriate for the equation of the regression line: there is a lot of scatter in the data, and the line is just an estimate.

The equation completely describes the regression line. To plot the regression line, you only need to plug two x values into the equation, get ŷ for each, and draw the line that goes through those two points.
Hint: The regression line always passes through the point (x̄, ȳ), the means of x and y.
The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence).
Regression examines the distance of all points from the line in the y direction only.

Correlation and regression
• The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship.
• In regression we examine the variation in the response variable (y) given change in the explanatory variable (x).

Equations
The general form for a regression line is ŷ = a + bx.

Coefficient of determination, r²
• It measures the fraction (or percent) of the variation in y that is explained by the least-squares regression line of y on x. (Memorize this sentence!)
• Basically, this helps us interpret r in a more understandable way, since r² is a percent.

Coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation coefficient r. r² represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.
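The "fraction of variation explained" reading can be verified directly: r² equals 1 minus the ratio of the leftover variation around the line (SSE) to the total variation in y (SST). The sketch below checks this on the salary and experience data from the end of this chapter; the names `sse` and `sst` are illustrative.

```python
import math

experience = [15, 10, 20, 5, 15, 5]   # years
salary = [30, 35, 55, 22, 40, 27]     # $thousand

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
sxx = sum((x - mean_x) ** 2 for x in experience)
syy = sum((y - mean_y) ** 2 for y in salary)

b = sxy / sxx                  # slope of the least-squares line
a = mean_y - b * mean_x        # y-intercept
r = sxy / math.sqrt(sxx * syy)

sst = syy                                                           # total variation in y
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(experience, salary))  # unexplained variation
explained_fraction = 1 - sse / sst

print(round(r ** 2, 4), round(explained_fraction, 4))
```

Both printed values agree (about 0.75), so experience explains roughly 75% of the variation in salary for these six employees.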

• r = 0, r² = 0: Changes in x explain 0% of the variation in y. The value(s) y takes is (are) entirely independent of what value x takes.
• r = 0.87, r² = 0.76: Here the change in x explains only 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
• r² = 1 (r = 1 or r = –1): Changes in x explain 100% of the variation in y. y can be entirely predicted for any given value of x.

There is quite some variation in BAC for the same number of beers drunk. A person’s blood volume is a factor in the equation that was overlooked here, so we changed the number of beers to the number of beers/weight of a person in pounds.
• In the first plot, number of beers explains only 49% of the variation in blood alcohol content.
• But number of beers/weight explains 81% of the variation in blood alcohol content (r = 0.9, r² = 0.81).
• Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).

Grade performance
If class attendance explains 16% of the variation in grades, what is the correlation between percent of classes attended and grade?
1. We need to make an assumption: attendance and grades are positively correlated, so r will be positive too.
2. r² = 0.16, so r = +√0.16 = +0.4. A weak correlation.

Residuals
The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter. These distances are called “residuals”: residual = observed y – predicted ŷ.
• Points above the line have a positive residual.
• Points below the line have a negative residual.
• The sum of these residuals is always 0.
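A short sketch of these facts, again using the salary and experience data from the end of this chapter: it fits the least-squares line, lists each residual (observed minus predicted), and confirms that they sum to zero up to rounding error.

```python
experience = [15, 10, 20, 5, 15, 5]   # years
salary = [30, 35, 55, 22, 40, 27]     # $thousand

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
     / sum((x - mean_x) ** 2 for x in experience))
a = mean_y - b * mean_x

# residual = observed y - predicted y-hat
residuals = [y - (a + b * x) for x, y in zip(experience, salary)]

for x, y, res in zip(experience, salary, residuals):
    position = "above" if res > 0 else "below"
    print(f"x = {x:2d}, y = {y}: residual = {res:+.3f} ({position} the line)")

print(f"sum of residuals = {sum(residuals):.10f}")
```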

Residual plots
Residuals are the distances between y-observed and y-predicted. We plot them in a residual plot. If residuals are scattered randomly around 0, chances are your data fit a linear model, were normally distributed, and you didn’t have outliers.

• The x-axis in a residual plot is the same as on the scatterplot.
• The horizontal line at 0 in the residual plot corresponds to the regression line in the scatterplot.
• Only the y-axis is different.

• Residuals are randomly scattered: good!
• A curved pattern means the relationship you are looking at is not linear, but it does not mean there is no relationship.
• A change in variability across the plot is a warning sign. You need to find out why it occurs, and remember that predictions made in areas of larger variability will not be as good.

Outliers and influential points
• Outlier: An observation that lies outside the overall pattern of observations.
• “Influential individual”: An observation that markedly changes the regression if removed. This is often an outlier on the x-axis.
• Child 19 is an outlier in the y direction: it is an outlier of the relationship.
• Child 18 is only an outlier in the x direction and thus might be an influential point.

Example: Cost and Quantity
• An outlier is visible
• A disaster (a fire at the factory)
• High cost, but few produced
[Figures: Cost vs. Number Produced, shown with the outlier and with the outlier removed]

Are these points influential?
[Figure: one point labeled "Outlier in y-direction" and one point labeled "Influential"]

OUTLIERS IN TERMS OF REGRESSION:
• Observations with large (in absolute value) residuals.
• Observations falling far from the regression line while not following the pattern of the relationship apparent in the others.
Residual = actual – fitted, that is, residual = y – ŷ.

INFLUENTIAL POINTS ARE:
• Points whose removal would greatly affect the association of two variables
• Points whose removal would significantly change the slope of an LSR line
• Points with a large moment (i.e., they are far away from the rest of the data)
• Usually outliers in the x direction

The two graphs below show the same data; the one on the right has the green data point removed. As you can see, the removal of this point significantly affects the slope of the regression line. This is an influential point!

!!!REMEMBER!!! An observation does NOT have to be an Outlier to be an Influential Point!! Nor does an observation need to be an Influential Point in order to be an Outlier!!
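The distinction can be illustrated with a small made-up data set. All numbers below are hypothetical: six points that follow a nearly straight line, one extra point far out in the x direction, and one extra point far out in the y direction but sitting at the mean of x.

```python
def slope(points):
    """Slope of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical base data: roughly y = 1 + x.
base = [(1, 2.0), (2, 2.9), (3, 4.1), (4, 5.0), (5, 5.9), (6, 7.1)]

far_in_x = (15, 4.0)    # far from the other x values
far_in_y = (3.5, 9.0)   # unusual y value, but x equals the mean of the base x values

slope_base = slope(base)
slope_with_x_outlier = slope(base + [far_in_x])
slope_with_y_outlier = slope(base + [far_in_y])

print(f"base slope:             {slope_base:.3f}")
print(f"with point far in x:    {slope_with_x_outlier:.3f}")
print(f"with point far in y:    {slope_with_y_outlier:.3f}")
```

In this sketch the point far out in x drags the slope from about 1 down to about 0.1, so it is influential, while the point far out in y leaves the slope unchanged even though its residual is large.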

Given the five-number summary {8 21 35 43 77}, which of the following is correct? A. There are no outliers B. There are at least two outliers C. There is not enough data to make any conclusion D. There is exactly one outlier E. There is at least one outlier

The correct answer is E.
The five-number summary gives you {Min Q1 Median Q3 Max}.
The IQR is calculated as Q3 – Q1, so the IQR for the given data is 43 – 21 = 22.
An outlier for this data would be any value > Q3 + 1.5 × IQR or < Q1 – 1.5 × IQR, that is, > 43 + (22 × 1.5) = 76 or < 21 – (22 × 1.5) = –12.
Since the max is 77, there must be at least one outlier in this data set, but we cannot conclude how many outliers without more data.
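The same arithmetic as a small sketch. The function name `outlier_fences` is an illustrative choice; the 1.5 × IQR rule and the five-number summary come from the question above.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Five-number summary from the question: {Min Q1 Median Q3 Max}
minimum, q1, median, q3, maximum = 8, 21, 35, 43, 77

lower, upper = outlier_fences(q1, q3)
has_low_outlier = minimum < lower
has_high_outlier = maximum > upper

print(f"fences: {lower} to {upper}")                 # fences: -12.0 to 76.0
print(f"low outlier: {has_low_outlier}, high outlier: {has_high_outlier}")
```

Only the minimum and maximum can be tested from a five-number summary, which is why the answer is "at least one" rather than an exact count.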

Given the following scatterplot and residual plot, which of the following is true about the yellow data point? I. It is an influential point II. It is an outlier with respect to the regression model III. It appears to be an outlier in the x direction A. I only B. I and II C. I and III D. None of the above E. All of the above

The correct answer is C.
I. Because this point has a large moment and is far from the rest of the data, it is an influential point. If this point were removed, the slope of the line would markedly change.
II. This point is not an outlier with respect to the model because, as you can see in the residual plot, it does not have a large residual (it follows the regression pattern of the data).
III. By looking at both the scatterplot and the residual plot, you can see that the yellow point is an outlier in the x direction (far right of the rest of the data).

Example: Salary and Experience
• Salary vs. years of experience, for n = 6 employees
• Linear (straight line) relationship
• Increasing relationship: higher salary generally goes with higher experience
• Correlation r = 0.8667
• Mary earns $55,000 per year and has 20 years of experience

Experience (years)   Salary ($thousand)
15                   30
10                   35
20                   55
5                    22
15                   40
5                    27

[Figure: scatterplot of Salary ($thousand) vs. Experience]

The Least-Squares Line Y = a + bX
• Summarizes bivariate data: predicts Y from X with the smallest errors (in the vertical direction, for the Y axis)
• Fitted line: Salary = 15.32 + 1.673 Experience
• Intercept is 15.32: the predicted salary ($thousand) at 0 years of experience
• Slope is 1.673: the change in salary ($thousand) for each additional year of experience, on average
[Figure: Salary (Y) vs. Experience (X) with the least-squares line]
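As a usage example of the fitted line, the sketch below predicts Mary's salary from her 20 years of experience and compares it with what she actually earns. The coefficients are the rounded values quoted above; the function name is illustrative.

```python
INTERCEPT = 15.32   # $thousand at 0 years of experience
SLOPE = 1.673       # $thousand per additional year of experience

def predicted_salary(years_experience):
    """Predicted salary in $thousand from the fitted line Salary = 15.32 + 1.673 Experience."""
    return INTERCEPT + SLOPE * years_experience

mary_actual = 55                        # $thousand, from the example
mary_predicted = predicted_salary(20)   # Mary has 20 years of experience
mary_residual = mary_actual - mary_predicted

print(f"predicted: {mary_predicted:.2f}, residual: {mary_residual:.2f}")
# predicted: 48.78, residual: 6.22
```

Mary's point lies above the line: she earns about $6,200 more than the line predicts for someone with her experience.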
