
Understanding Correlation and Linear Regression

Learn about correlation coefficient, linear relationships, coefficient of determination, and interpreting relationships between variables.

wgraber


Presentation Transcript


  1. Chapter 3 Linear Regression http://www.youtube.com/watch?v=aqDGkBS86O4

  2. Warmup • No calculator: Find the equation of the line passing through the points (3, 10) and (5, 12). • Find the equation of the line passing through the points (-1, 7) and (2, -3). (Use only +, -, ×, ÷ on the calculator.)

  3. 3-1 (Describing two-variable relationships) • Create a scatterplot • Answer four questions • Linear or curved? • Positive or negative direction? • Strong or weak relationship? • Any apparent outliers? • The closer the data lie to a line of best fit, the stronger the relationship, regardless of direction.

  4. Describing Relationships • Use variable names, not x and y • There’s more to the story on shape (curved, linear), but we’re holding off on that until late in the course when we talk about transformations.

  5. Correlation—Part A • Correlation coefficient r • Describes the strength and direction of the linear relationship between two variables. • Even if the data follow a curved pattern, the calculator will happily spit out an r value. Beware!

  6. Calculating r • Consider the points • (2, 3) • (5, 5) • (8, 4)

  7. Correlation coefficient r • …is the average of the products of the standardized x and standardized y for each point. • Remember z-scores? For each point: • Find the z-score for the x value. • Find the z-score for the y value. • Multiply the two. • Add all those products together. • Divide by n-1, where n is the number of points.

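  8. …In symbols: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$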

  9. Or perhaps simpler: $r = \frac{1}{n-1} \sum z_x z_y$

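  10. Example: Calculate r • For the points (2, 3), (5, 5), (8, 4), the z-scores are -1, 0, 1 for x and -1, 1, 0 for y, so the products are 1, 0, 0. • Take their sum, 1, and divide by n-1 (here, 2) to get r = 0.5.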
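The z-score recipe from the preceding slides can be checked in a few lines of Python. This is a minimal sketch, not part of the original deck: the function name `correlation_r` is an illustrative choice, and the data are the three points from slide 6.

```python
from statistics import mean, stdev

def correlation_r(xs, ys):
    """Return r: the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)  # z_x * z_y for each point
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)

xs = [2, 5, 8]
ys = [3, 5, 4]
r = correlation_r(xs, ys)
print(round(r, 3))  # 0.5
```

Here the x values have mean 5 and standard deviation 3, the y values have mean 4 and standard deviation 1, so only the first point contributes a nonzero product.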

  11. Fun facts about correlation • r does not change if you reverse the axes! (In other words, you need not identify which variable is explanatory and which is response.) • Both variables must be numerical • r does not change if you perform a linear transformation with a positive multiplier (such as a change of units) on one or both variables. A negative multiplier on one variable flips the sign of r. • Positive r, positive relationship (neg r, neg relationship)
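Two of these facts are easy to verify numerically. The sketch below uses made-up height and weight values (not from the deck) and a hypothetical helper `correlation_r`; it swaps the axes, changes units, and also negates one variable to show that only the sign of r responds.

```python
from statistics import mean, stdev

def correlation_r(xs, ys):
    """Return r: the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up illustrative data (assumption, not from the slides).
heights_in = [60, 62, 65, 68, 71]
weights_lb = [110, 125, 130, 155, 170]

r_original = correlation_r(heights_in, weights_lb)

# Fact 1: reversing the axes leaves r unchanged.
r_swapped = correlation_r(weights_lb, heights_in)

# Fact 2: a change of units (positive multiplier) leaves r unchanged.
heights_cm = [2.54 * h for h in heights_in]
weights_kg = [w / 2.2046 for w in weights_lb]
r_rescaled = correlation_r(heights_cm, weights_kg)

# A negative multiplier on one variable flips the sign of r.
r_negated = correlation_r([-h for h in heights_in], weights_lb)

print(r_original, r_swapped, r_rescaled, r_negated)
```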

  12. Fun facts about correlation • r is always between -1 and 1. If not, you’ve made a mistake! • Values near 0 indicate a weak relationship. Values near -1 or +1 indicate a strong relationship. However…much room for interpretation.

  13. What happens to r if… • You add a point close to the line? • You add a point far from the line? • You remove a point close to the line? • You remove a point far from the line? • Sir Francis Galton (1822-1911) • Invented correlation and r !! • Turned regression into a general method for studying relationships between variables • Cousin of Charles Darwin • Click on Sir Francis to go to the scatterplot applet • YMS, p.120

  14. Why do pirates love correlation?

  15. Why do pirates love correlation? r !! Not an actual pirate

  16. Please turn off your Stat Wizard • Mode > Stat Wizard (toggle it to Off) • Return the calculator to Day 1 settings: 2nd > Memory > Reset > All RAM > Reset > Enter

  17. Performing a linear regression using your calculator • X’s into L1, Y’s into L2 • Stat > Calc > LinReg(a+bx) L1,L2. (Not “LinReg(ax+b).” It’s a stats thing.) • If you want to see r and r²: 2nd Catalog…Diagnostic On. Important: If you ever change batteries, you must perform this step again.

  18. Adding the regression line to a scatterplot • Data into L1, L2 • Stat > Calc > LinReg(a+bx) L1,L2, Vars > Y-Vars > Function > Y1 > Enter. • Two great things just happened: • When you create a scatterplot, the regression line will appear on screen • If you hit the “y=” button, your regression line is sitting there for you to see (but with many digits).

  19. The Coefficient of Determination r² • Recall: r is the Correlation Coefficient • r² is not simply “r, only more so”! • The coefficient of determination r² is the proportion of variability in the response variable “explained” by the regression. • It’s another way of asking, “By introducing this other variable, how much better is my estimate than it would be if I simply used the average to make my estimate?”

  20. The Coefficient of Determination r² • Example: In a study of bone density as a function of body weight, an r of 0.6 is noted, so r² = 0.36. • Interpretation 1 (full credit): “About 36% of the variability in bone density is explained by the linear regression of bone density on body weight.” STUDENTS: KNOW THESE WORDS • Interpretation 2 (less credit): “About 36% of variability in bone density is accounted for by body weight.”

  21. The Coefficient of Determination r² • Interpretation 3 (less credit still, but often heard among cognoscenti): “About 36% of bone density is explained by body weight.” • Interpretation 4 (not bad, really, but nobody says this): “As an estimate for bone density, the linear regression of bone density onto body weight is 36% better than the estimate based on the mean alone.”
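The “proportion of variability explained” reading of r² can be checked numerically. The sketch below is an illustration under stated assumptions: it reuses the three points from slide 6, fits the least-squares line with the standard formulas, and compares 1 - SSE/SST (how much the line improves on guessing the mean) with the square of r. The variable names are illustrative choices.

```python
from math import sqrt
from statistics import mean

xs = [2, 5, 8]
ys = [3, 5, 4]
x_bar, y_bar = mean(xs), mean(ys)

# Sums of squares and cross-products about the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # SST: error when predicting with the mean

# Least-squares line y-hat = a + b*x and the correlation coefficient.
b = s_xy / s_xx
a = y_bar - b * x_bar
r = s_xy / sqrt(s_xx * s_yy)

# SSE: squared error left over after using the regression line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / s_yy

print(round(r, 3), round(r_squared, 3))  # 0.5 0.25
```

For these points SST is 2 and SSE is 1.5, so the line removes a quarter of the squared error, which matches r² = 0.5² = 0.25.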

  22. Residuals and Residual Plots • A point’s residual is its observed y minus its predicted y (y - ŷ). • The residual represents how close the regression line (“the model”) came to the actual y-value for a particular x. • Every time you run a LinReg(a+bx) in the calculator, a list called RESID is populated with fresh numbers.

  23. Residuals and Residual Plots • The sum of all residuals = 0 • The average of all residuals = 0 • If you plot RESID versus X in a scatterplot, you get a so-called Residual Plot. If you spot no pattern, that’s good! It suggests a linear model is appropriate for the data. • On the other hand, if you do detect a pattern (such as high, low, high), that tells you that another type of regression would probably fit the data better. Consider quadratics, quartics, etc. (NOT PART OF THIS COURSE)
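Both claims on this slide can be seen in a small example. The sketch below uses made-up, deliberately curved data (y = x², an assumption for illustration, not from the deck), fits a least-squares line, and lists the residuals, playing the role of the calculator’s RESID list.

```python
from statistics import mean

# Deliberately curved, made-up data: y = x squared.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept for y-hat = a + b*x.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(residuals)       # [2.0, -1.0, -2.0, -1.0, 2.0]
print(sum(residuals))  # 0.0
```

The residuals sum to zero, as they do for any least-squares line, and they run high, low, high across x. That pattern is the signal that a straight line is the wrong shape for these data.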

  24. Residuals and Residual Plots • Say you see a lot of divergence as x gets bigger. That means, “a line is the best model, but as x gets bigger, the model doesn’t predict y as well.” You must read pgs. 170-171 in your textbook.

  25. Residuals and Residual Plots • Vital skills • Be able to create a residual plot using your calculator. • Be able to interpret a residual plot. • Remember: every time you add or remove data, run a fresh LinReg to repopulate the RESID list.

  26. Influential Points • The question of what constitutes an outlier in 2-variable data isn’t as clear as with one-variable data, where we can use the 1.5 × IQR rule. • Instead, the more interesting question is, “Is this point influential?”

  27. Influential Points • A point is considered influential if it has a meaningful effect on the regression analysis. • In practice this means: • It causes a significant change in slope, or • It causes a significant change in intercept. (Note—this is usually far less interesting than a change in slope), or • It causes r to jump up or down significantly

  28. Influential Points • Needless to say, whether a point is influential is often debatable! Here’s Sir Francis again. Click to play with the scatterplot/regression applet some more. Notice that points near the x-bar, y-bar point are not influential, and that points far from this intersection—typically well to the left or right—are influential.
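The applet’s lesson can be reproduced with a few numbers. The sketch below uses made-up data lying roughly along y = 2x (an assumption for illustration) and a hypothetical helper `lsrl_slope`; it compares the slope after adding one point near (x-bar, y-bar) with the slope after adding one point far to the right.

```python
from statistics import mean

def lsrl_slope(xs, ys):
    """Return the least-squares slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Made-up data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope_base = lsrl_slope(xs, ys)

# Extra point near (x-bar, y-bar) = (3, 6): the slope barely moves.
slope_near = lsrl_slope(xs + [3], ys + [6.5])

# Extra point far to the right and well off the line: the slope drops sharply.
slope_far = lsrl_slope(xs + [15], ys + [10.0])

print(round(slope_base, 3), round(slope_near, 3), round(slope_far, 3))
```

With these numbers the slope stays near 1.97 in the first case and falls below 0.5 in the second, which is what “influential” means in practice.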

  29. More on the regression line (3.3) • Learn these, and you can solve for any missing part! • Example: We know that x-bar = 17.222, y-bar = 161.111, sx = 19.696, sy = 33.479, and r = 0.997. Construct the LSRL.
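One way to work this example is with the standard least-squares relationships: slope b = r · sy / sx, and intercept a = y-bar - b · x-bar (the line passes through the point (x-bar, y-bar)). These are assumed to be the facts the slide refers to; the sketch below applies them to the summary statistics given, with illustrative variable names.

```python
# Summary statistics from the slide.
x_bar = 17.222
y_bar = 161.111
s_x = 19.696
s_y = 33.479
r = 0.997

# Slope: r times the ratio of the standard deviations.
b = r * s_y / s_x

# Intercept: forces the line through (x_bar, y_bar).
a = y_bar - b * x_bar

print(f"y-hat = {a:.3f} + {b:.3f} x")
```

Any one of the five summary values can be recovered from the other four together with a and b by rearranging the same two equations.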

  30. From Fall 2009

  31. Fall 2009

  32. Self-check Quiz (Very similar to a test question) • Describe the relationship between the variables. • Interpret r in the context of the situation. • Interpret r² in the context of the situation. • Restate the regression equation in context. • Interpret the regression equation. (In particular, interpret the coefficients in context.*) • Calculate the expected final exam grade for someone who misses 7 days of class during the semester. * And if you even think of mentioning x or y I’ll beat you with a stick. A linear regression stick.

  33. Section 4.2—Cautions and facts about regression • Read the section in its entirety. This is highly testable but I will not spend much class time on it. • Focus on • Everything that’s bold • Extrapolation • Lurking variables • Confounding and common response mechanisms by which lurking variables operate • Causality and how to establish it. Know the list on page 236.

  34. Causality—Does x cause y, or do they simply correlate strongly??

  35. Problem 56 b) The relationship between US and overseas stocks is moderately strong (r = .44), linear, and positive. There are no apparent outliers. The linear regression of overseas stocks onto US stocks explains about 20% of the variability in overseas stocks (r² = .194).
