
Statistics 200

Explore the concepts of residual, correlation coefficient, and pitfalls in regression analysis. Learn how to interpret correlation values and the importance of r-squared. Demystify line fitting interpretations and challenges in data extrapolation.

beckyh


Presentation Transcript


  1. Statistics 200 Objectives (for two quantitative variables and their relationship): • Define and interpret residual • Define and interpret correlation coefficient • Interpret the square of the correlation coefficient (r-squared) • Recognize various pitfalls in using regression – Extrapolation is dangerous. – Outliers can have a huge effect. – Interpreting a linear relationship as causation is dangerous. Lecture #6 Thursday, September 8, 2016 Textbook: Sections 3.3 through 3.5

  2. For which fitted line plot(s) does the y-intercept have a logical interpretation? • line plot 1 • line plot 2 • line plot 3 • line plots 1 & 3 • line plots 1, 2, & 3

  3. Residual: Deviation of a point from the regression line. Residual = observed y – predicted y
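A minimal sketch of the residual calculation, using the height and weight line that appears later in this lecture (Weight = –195.9 + 5.175 Height). The function names and the example adult (70 inches, 180 lbs) are hypothetical, chosen only for illustration.

```python
def predict_weight(height_in):
    """Predicted weight (lbs) from the lecture's fitted line."""
    return -195.9 + 5.175 * height_in

def residual(observed, predicted):
    """Residual = observed - predicted."""
    return observed - predicted

# Hypothetical adult (not from the lecture data): 70 inches tall, 180 lbs.
predicted = predict_weight(70)      # -195.9 + 362.25 = 166.35
res = residual(180, predicted)      # 180 - 166.35 = 13.65
print(round(predicted, 2), round(res, 2))
```

A positive residual means the point lies above the line (the person weighs more than predicted); a negative residual means it lies below.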

  4. Measuring strength and direction • We see a linear pattern in relationships so often that we use a statistic to characterize the strength and direction of the relationship.

  5. Measuring strength and direction • The strength of correlation is determined by the closeness of the points to a straight line. • The direction of correlation is determined by whether one variable generally increases or decreases when the other variable increases.

  6. Measuring strength and direction • As a note, correlation can only be used when talking about linear (straight line) relationships. • Sometimes there definitely is a relationship, but the correlation may be zero because it isn’t a linear relationship.
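To illustrate a relationship that is perfect but not linear, the sketch below computes r for points on the parabola y = x². The data and the helper `pearson_r` are made up for illustration; the helper uses the standard sum-of-products formula for the Pearson correlation.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # 0.0: no linear relationship, even though y depends entirely on x
```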

  7. Measuring strength and direction • Correlation is represented by the letter r • Correlation is sometimes called the Pearson product moment correlation, or the correlation coefficient.

  8. Measuring strength and direction • For correlation: It doesn’t matter which variable you treat as the response and which variable you treat as the explanatory variable • For the regression equation, it DOES matter which variable you treat as the response and which you treat as the explanatory.
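A small sketch of this asymmetry, on made-up data: swapping x and y leaves r unchanged, but the fitted slope changes (and is not simply the reciprocal). The helper names are hypothetical; the slope uses the usual least-squares formula.

```python
from math import sqrt
from statistics import mean

def sums(xs, ys):
    """Return (Sxy, Sxx, Syy): cross-product and sums of squared deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    sxy, sxx, syy = sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def slope(explanatory, response):
    """Least-squares slope for predicting response from explanatory."""
    sxy, sxx, _ = sums(explanatory, response)
    return sxy / sxx

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(pearson_r(xs, ys), pearson_r(ys, xs))  # identical either way
print(slope(xs, ys))   # about 0.6 when y is the response
print(slope(ys, xs))   # about 1.0 when x is the response (not 1/0.6)
```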

  9. How is correlation (r) calculated? • The formula for calculating correlation looks quite complicated, but it is more easily explained in terms of standardized scores (z-scores) • Approximately, the correlation is the average product of standardized scores (z-scores) for variables x and y.
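The "average product of z-scores" description can be sketched as follows, assuming the common sample convention of dividing by n – 1 (which is why the slide says "approximately" an average). The data and function names are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation_from_z(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation_from_z(xs, ys)
print(round(r, 4))  # 0.7746
```

Points where x and y are both above (or both below) their means contribute positive products, which pushes r toward +1; points on opposite sides of the means contribute negative products.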

  10. Interpreting correlation • Correlation values are always between –1 and 1 • The further the correlation is from zero, the stronger the relationship • Whether the correlation is positive or negative indicates the direction of the relationship.

  11. Interpreting correlation • If correlation is equal to 0, there is no linear relationship between the variables. • This also means that the best line to fit the relationship is exactly horizontal, such that y does not change with x. • If the correlation is –1 or 1, then all of the data points fall exactly on a line.
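The link between r = 0 and a horizontal line follows from the standard least-squares formulas. The symbols below (b_1 for the slope, b_0 for the intercept, s_x and s_y for the sample standard deviations) are notation assumed here, not taken from the slides.

```latex
b_1 = r \, \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}
\quad\Longrightarrow\quad
r = 0 \;\Rightarrow\; b_1 = 0 \;\Rightarrow\; \hat{y} = \bar{y} \ \text{for every } x .
```

The slope always carries the same sign as r, and when r = 0 the fitted line is flat at the mean of y.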

  12. Clicker Question • For the top scatterplot… A. r = 0.721 B. r = –0.193 C. r = –0.927 • For the bottom scatterplot… A. r = 0.656 B. r = –0.012 C. r = 1.00

  13. Related quantity: Squared correlation • The squared value for the correlation (r²) is often used to describe the strength of the linear relationship. • Since the r² value is simply r squared, the possible values for r² range from 0 to 1.

  14. Squared Correlation (r²) • friendly quantity • Interpretation: quantifies the amount of variation in the response variable that can be explained by the explanatory variable • possible values 0 to 1 (0% to 100%) • as it increases in value, the closer the points are to the regression line.
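One way to see the "explained variation" interpretation is to check, on made-up data, that r² equals 1 – SSE/SST, where SSE is the sum of squared residuals and SST is the total variation of y about its mean. The names `fit_line`, `sse` and `sst` are hypothetical; this is a sketch, not the course's own computation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]

sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # unexplained variation
sst = sum((y - mean(ys)) ** 2 for y in ys)              # total variation in y
r_squared = 1 - sse / sst

print(round(r_squared, 3))  # 0.6, i.e. 60% of the variation in y is explained by x
```

For this data r is about 0.7746, and 0.7746² is about 0.6, matching the value above.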

  15. Example: Squared Correlation • Software output: Pearson correlation of x and y = 0.844. The regression equation is y = 3.900 + 1.6 x. S = 3.6. • r² = (.844) × (.844) ≈ .712 (reported as 71.3%, likely computed from the unrounded correlation). • Interpretation: 71.3% of the variation in y can be explained by x.

  16. Issues with regression • Several problems can arise when you are analyzing the relationship between two quantitative variables: • Extrapolation • Influential Outliers • Curvilinear Data • Combining Groups Inappropriately

  17. Extrapolation • Extrapolation is when you use the regression equation to predict values outside the range of observed data. • For example, let’s look at height and weight data.

  18. Extrapolation • Here, we use height to predict weight, using a sample of adults. Weight = –195.9 + 5.175 Height. Sample intercept is –195.9: no logical interpretation.

  19. Extrapolation • Here, we use height to predict weight, using a sample of adults. Weight = –195.9 + 5.175 Height. Sample intercept is –195.9: no logical interpretation. Sample slope is 5.175: for every increase of 1 inch in height, predicted weight increases by 5.175 lbs.

  20. Extrapolation • Here, we use height to predict weight, using a sample of adults. Weight = –195.9 + 5.175 Height. Sample intercept is –195.9: no logical interpretation. Sample slope is 5.175: for every increase of 1 inch in height, predicted weight increases by 5.175 lbs. R-sq is 0.43 = 43%: height explains 43% of the variation observed in weight.

  21. Extrapolation • This regression equation works fairly well for adults, but what happens for a child’s height? • If a child is 40 inches tall, use the equation to predict their weight: Weight = –195.9 + 5.175 Height = –195.9 + 5.175 × 40 = –195.9 + 207 = 11.1 pounds. Yikes! We can’t trust this value because we extrapolated outside the range of observed values to get it.
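A simple guard against this mistake is to flag predictions made outside the observed range. The sketch below uses the lecture's equation, but the range of 60 to 76 inches is an assumption for illustration: the slides do not state the actual range of adult heights in the sample.

```python
def predict_weight(height_in, observed_range=(60.0, 76.0)):
    """Return (predicted weight in lbs, extrapolated?) for a height in inches.

    observed_range is a hypothetical stand-in for the range of heights in the
    adult sample; the slides do not give the real range.
    """
    low, high = observed_range
    extrapolated = not (low <= height_in <= high)
    return -195.9 + 5.175 * height_in, extrapolated

weight, extrapolated = predict_weight(40)
print(round(weight, 1), extrapolated)   # 11.1 True: do not trust this value

weight, extrapolated = predict_weight(70)
print(round(weight, 2), extrapolated)   # 166.35 False
```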

  22. Influential Outliers - example • Consider this scatterplot and regression equation. R-sq is 0: no linear relationship between the variables.

  23. Influential Outliers - example • Consider this scatterplot and regression equation. R-sq is 0: no linear relationship between the variables. Slope of line is basically 0.

  24. Influential Outliers - example • Now we add a single influential outlier. R-sq is 8.5%: possible linear relationship between the variables.

  25. Influential Outliers - example • Now we add a single influential outlier. R-sq is 8.5%: possible linear relationship between the variables. Slope of line is –0.2745: not zero!

  26. Moral of the example • Influential outliers can have a huge effect on the relationship. • In some cases, there is no relationship at all unless you include one data point. • In cases like these, it may be best to remove the outlier before fitting a line to the data and making assumptions.
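The effect can be reproduced on a small made-up dataset (not the data behind the slides): five points with a slope of exactly zero, then the same points plus one extreme observation. The helper `slope_and_r2` is hypothetical.

```python
from statistics import mean

def slope_and_r2(xs, ys):
    """Least-squares slope and r-squared for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Hypothetical data with no linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 1, 3]
print(slope_and_r2(xs, ys))   # slope 0, r-squared 0

# Add a single point far from the rest in both x and y.
xs_out = xs + [15]
ys_out = ys + [15]
slope, r2 = slope_and_r2(xs_out, ys_out)
print(round(slope, 3), round(r2, 3))  # slope near 1 and a large r-squared
```

One added point is enough to move the fit from "no relationship" to an apparently strong one, which is why it is worth fitting the line with and without a suspected outlier and comparing.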

  27. Curvilinear data • Be careful using linear regression on a curvilinear dataset. • Problem: If you use the equation, you will end up making incorrect estimates for the data. • Example: United States population is plotted by year. Population = –2485 + 1.363 Year. If we try to calculate the population for 2009, we get: –2485 + 1.363 × 2009 = 253.267 million. The actual U.S. population was at this level in the early 1990s, so the line badly underestimates 2009.
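A quick way to spot curvilinear data is to look at the signs of the residuals from a straight-line fit. The sketch below uses made-up, rapidly growing values (not the population data): the line underpredicts at both ends and overpredicts in the middle, a systematic pattern that a good linear fit would not show.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

# Hypothetical curved data: y doubles at each step.
xs = [0, 1, 2, 3, 4]
ys = [1, 2, 4, 8, 16]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = ["+" if res > 0 else "-" for res in residuals]
print(signs)  # ['+', '-', '-', '-', '+']
```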

  28. Combining groups inappropriately • Plot of “fastest speed ever reached in a car and height”. • What? Taller people speed more?! • Wait, maybe we combined some groups that we should have kept separate. Terminology: Here, sex is a confounding variable!

  29. Combining groups inappropriately • Separated by sex (M, F), we see that there is actually no apparent relationship between height and ‘fastest speed ever driven’.
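The same pattern can be reproduced with invented numbers (not the class data): within each group the correlation between height and fastest speed is zero, but pooling the groups produces a strong positive correlation because one group is both taller and faster on average.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and fastest speeds (mph) for two groups.
height_f, speed_f = [62, 64, 66], [80, 90, 80]
height_m, speed_m = [68, 70, 72], [100, 110, 100]

r_f = pearson_r(height_f, speed_f)
r_m = pearson_r(height_m, speed_m)
r_all = pearson_r(height_f + height_m, speed_f + speed_m)

print(r_f, r_m)          # both 0: no relationship within either group
print(round(r_all, 2))   # clearly positive once the groups are combined
```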

  30. Interpretations of observed association • There are four main ways for you to interpret an observed association between two quantitative variables: • There is causation • There may be causation • There is no causation • The response variable is causing a change in the explanatory variable (reverse causation)

  31. Important note... • A strong correlation does not necessarily mean there is a causal relationship between two variables. • Most correlations come from observational studies and we can’t claim causation from observational studies! • All a strong correlation means is that there is an association between the variables.

  32. Order these scatterplots (labeled k, j, and m) in increasing order of r-squared. A. j < k < m B. m < j < k C. j < m < k D. k < m < j

  33. Review: If you understood today’s lecture, you should be able to solve • 3.27, 3.29, 3.33, 3.37, 3.39, 3.41, 3.45, 3.63, 3.75, 3.77. Recall objectives (for two quantitative variables): • Define and interpret residual • Define and interpret correlation coefficient • Interpret the square of the correlation coefficient (r-squared) • Recognize various pitfalls in using regression – Extrapolation is dangerous. – Outliers can have a huge effect. – Interpreting a linear relationship as causation is dangerous.
