Explore the concepts of residual, correlation coefficient, and pitfalls in regression analysis. Learn how to interpret correlation values and the importance of r-squared. Demystify line fitting interpretations and challenges in data extrapolation.
Statistics 200 • Lecture #6, Thursday, September 8, 2016 • Textbook: Sections 3.3 through 3.5 Objectives (for two quantitative variables and their relationship): • Define and interpret residual • Define and interpret correlation coefficient • Interpret the square of the correlation coefficient (r-squared) • Recognize various pitfalls in using regression: – Extrapolation is dangerous. – Outliers can have a huge effect. – Interpreting a linear relationship as causation is dangerous.
For which fitted line plot(s) does the y-intercept have a logical interpretation? • line plot 1 • line plot 2 • line plot 3 • line plots 1 & 3 • line plots 1, 2, & 3
Residual: the deviation of a point from the regression line. Residual = observed y – predicted y
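As a minimal sketch of the residual definition, the Python below uses a made-up fitted line (predicted y = 2 + 3x) and a made-up observed point. Neither comes from the lecture data, and the function names are only illustrative.

```python
# Hypothetical fitted line: predicted y = 2 + 3x (illustration only).
INTERCEPT = 2.0
SLOPE = 3.0

def predict(x):
    """Predicted y on the regression line for a given x."""
    return INTERCEPT + SLOPE * x

def residual(x, observed_y):
    """Residual = observed y minus predicted y."""
    return observed_y - predict(x)

# Made-up observation: x = 4, observed y = 15.
example_residual = residual(4, 15)
print(example_residual)  # positive, so the point lies above the line
```

A positive residual means the point sits above the line and a negative residual means it sits below.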
Measuring strength and direction • We see a linear pattern in relationships so often that we use a statistic to characterize the strength and direction of the relationship.
Measuring strength and direction • The strength of correlation is determined by the closeness of the points to a straight line. • The direction of correlation is determined by whether one variable generally increases or decreases when the other variable increases.
Measuring strength and direction • As a note, correlation can only be used when talking about linear (straight line) relationships. • Sometimes there definitely is a relationship, but the correlation may be zero because it isn’t a linear relationship.
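To illustrate the point that a definite relationship can still have zero correlation, here is a small sketch with made-up data in which y is exactly x squared. The `correlation` helper is written out by hand for this example and is not taken from the lecture.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r computed from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: y is completely determined by x, but the pattern is a curve.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r_curved = correlation(xs, ys)
print(r_curved)  # zero: no linear relationship, despite a perfect curved one
```

The symmetric U shape makes the positive and negative products cancel, which is why r alone cannot detect this kind of pattern.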
Measuring strength and direction • Correlation is represented by the letter r • Correlation is sometimes called the Pearson product moment correlation, or the correlation coefficient.
Measuring strength and direction • For correlation: It doesn’t matter which variable you treat as the response and which variable you treat as the explanatory variable • For the regression equation, it DOES matter which variable you treat as the response and which you treat as the explanatory.
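The contrast between correlation and regression can be checked numerically. The sketch below uses made-up data and hand-written helpers (assumptions for illustration). Swapping x and y leaves r unchanged but gives a different fitted slope.

```python
from math import sqrt

def deviation_sums(xs, ys):
    """Return (Sxx, Syy, Sxy), the sums of squared and cross deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def correlation(xs, ys):
    sxx, syy, sxy = deviation_sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def slope(explanatory, response):
    """Least-squares slope for predicting response from explanatory."""
    sxx, _, sxy = deviation_sums(explanatory, response)
    return sxy / sxx

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)          # same value as r_xy
slope_y_on_x = slope(xs, ys)        # predicting y from x
slope_x_on_y = slope(ys, xs)        # predicting x from y
print(r_xy, r_yx, slope_y_on_x, slope_x_on_y)
```

Note that the second slope is not simply the reciprocal of the first, so the two regression lines are genuinely different lines.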
How is correlation (r) calculated? • The formula for calculating correlation looks quite complicated, but it is more easily explained in terms of standardized scores (z-scores) • Approximately, the correlation is the average product of standardized scores (z-scores) for variables x and y.
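The z-score description can be sketched as follows, using made-up data. The word "approximately" applies because, with sample standard deviations, the sum of products is divided by n – 1 rather than n. The helper names are illustrative only.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def z_scores(values):
    """Standardize each value: (value - mean) / standard deviation."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))
```

Points where both z-scores have the same sign push r toward +1, and points where the signs differ push it toward –1.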
Interpreting correlation • Correlation values are always between –1 and 1. • The further the correlation is from zero, the stronger the relationship. • Whether the correlation is positive or negative indicates the direction of the relationship.
Interpreting correlation • If correlation is equal to 0, there is no linear relationship between the variables. • This also means that the best line to fit the relationship is exactly horizontal, such that y does not change with x. • If the correlation is –1 or 1, then all of the data points fall exactly on a line.
Clicker Question • For the top scatterplot: A. r = 0.721 B. r = –0.193 C. r = –0.927 • For the bottom scatterplot: A. r = 0.656 B. r = –0.012 C. r = 1.00
Related quantity: Squared correlation • The squared value for the correlation (r²) is often used to describe the strength of the linear relationship. • Since the r² value is simply r squared, the possible values for r² range from 0 to 1.
Squared Correlation (r²) • friendly quantity • Interpretation: quantifies the amount of variation in the response variable that can be explained by the explanatory variable • possible values 0 to 1 (0% to 100%) • as it increases in value, the closer the points are to the regression line.
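The "variation explained" reading can be made concrete. For a least-squares line, r-squared equals 1 minus (sum of squared residuals) / (total sum of squares of the response). The sketch below checks this on made-up data; the helper names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys):
    """Share of the variation in y that the fitted line explains."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left over after using the line (squared residuals).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation in the response around its mean.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
explained = r_squared(xs, ys)
print(round(explained, 3))  # share of variation in y explained by x
```

For these data the correlation is 6 / sqrt(60), and squaring it gives the same 0.6 that the residual-based calculation produces.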
Example: Squared Correlation • Software output: Pearson correlation of x and y = 0.844. The regression equation is y = 3.900 + 1.6 x, with S = 3.6. • r² = (0.844) × (0.844) ≈ 0.712 • Interpretation: about 71% of the variation in y can be explained by x.
Issues with regression • Several problems can arise when you are analyzing the relationship between two quantitative variables: • Extrapolation • Influential Outliers • Curvilinear Data • Combining Groups Inappropriately
Extrapolation • Extrapolation is when you use the regression equation to predict values outside the range of observed data. • For example, let’s look at height and weight data.
Extrapolation • Here, we use height (in inches) to predict weight (in pounds), using a sample of adults: Weight = –195.9 + 5.175 Height • Sample intercept is –195.9. It has no logical interpretation. • Sample slope is 5.175. For every 1-inch increase in height, predicted weight increases by 5.175 lbs. • R-sq is 0.43 = 43%. Height explains 43% of the variation observed in weight.
Extrapolation • This regression equation works fairly well for adults, but what happens for a child’s height? • If a child is 40 inches tall, use the equation to predict their weight: Weight = –195.9 + 5.175 × 40 = –195.9 + 207 = 11.1 pounds • Yikes! We can’t trust this value because we extrapolated outside the range of observed values to get it.
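One way to guard against extrapolation in practice is to refuse predictions outside the observed range. The sketch below uses the slide's equation, but the observed height range of 60 to 76 inches is a hypothetical assumption (the slides do not give the actual range), and the function names are illustrative.

```python
def predict_weight(height_in):
    """Predicted weight (lbs) from the slide's equation."""
    return -195.9 + 5.175 * height_in

def predict_with_range_check(height_in, min_height, max_height):
    """Predict only when the height lies inside the observed range."""
    if not (min_height <= height_in <= max_height):
        raise ValueError(
            f"height {height_in} is outside the observed range "
            f"[{min_height}, {max_height}]; prediction would be extrapolation"
        )
    return predict_weight(height_in)

# Hypothetical observed range for the adult sample (an assumption).
OBSERVED_MIN, OBSERVED_MAX = 60, 76

# The unguarded equation happily returns an implausible value for a child.
print(round(predict_weight(40), 1))

try:
    predict_with_range_check(40, OBSERVED_MIN, OBSERVED_MAX)
except ValueError as error:
    print("refused:", error)
```

The equation itself never signals that an input is unreasonable, so the check has to come from knowing the range of the data used to fit it.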
Influential Outliers - example • Consider this scatterplot and regression equation. • R-sq is 0: there is no linear relationship between the variables. • The slope of the line is basically 0.
Influential Outliers - example • Now we add a single influential outlier. • R-sq is 8.5%: a possible linear relationship between the variables. • The slope of the line is –0.2745: not zero!
Moral of the example • Influential outliers can have a huge effect on the fitted relationship. • In some cases, there is no relationship at all unless you include one data point. • In cases like these, it may be best to remove the outlier before fitting a line to the data and drawing conclusions.
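The effect of a single influential point can be reproduced with a tiny made-up dataset (not the data behind the slides). The sketch fits a least-squares line with and without one far-away point; `fit_line` is a hand-written helper for illustration.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data with no linear trend at all.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 1, 3]
_, slope_without = fit_line(xs, ys)

# Add one point far from the rest in both x and y.
_, slope_with = fit_line(xs + [15], ys + [-4])

print(slope_without)          # flat line
print(round(slope_with, 3))   # clearly negative, driven by one point
```

Fitting the line both ways, as here, is a simple check of how much a suspicious point is steering the result.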
Curvilinear data • Be careful using linear regression on a curvilinear dataset. • Problem: If you use the linear equation, your estimates will be systematically off. • Example: United States population (in millions) is plotted by year: Population = –2485 + 1.363 Year • If we try to calculate the population for 2009, we get: –2485 + 1.363 × 2009 = 253.267 million. • That was roughly the U.S. population in the early 1990s, so the line badly underestimates 2009.
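A quick way to spot this problem is to look at the residuals from the straight-line fit. The sketch below uses made-up curved data (y = x squared), not the population data, and a hand-written `fit_line` helper. The residuals come out positive at both ends and negative in the middle, which is the signature of a curve.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up curved data for illustration.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
# Residual = observed - predicted, in order of increasing x.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)
```

When residuals follow a systematic pattern like this, the line underestimates at the extremes, which matches what happens with the 2009 population estimate.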
Combining groups inappropriately • Plot of “fastest speed ever reached in a car and height”. • What? Taller people speed more?! • Wait, maybe we combined some groups that we should have kept separate. Terminology: Here, sex is a confounding variable!
Combining groups inappropriately • Separated by sex (M, F), we see that there is actually no apparent relationship between height and ‘fastest speed ever driven’.
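The same effect can be reproduced with made-up numbers (not the class survey data). In the sketch below, height and speed are uncorrelated within each group, but pooling the groups produces a strong positive correlation because one group is both taller and faster on average. The helper and variable names are illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r computed from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up heights (inches) and fastest speeds (mph) for two groups.
female_heights, female_speeds = [62, 64, 66], [80, 90, 80]
male_heights, male_speeds = [68, 70, 72], [100, 110, 100]

r_female = correlation(female_heights, female_speeds)
r_male = correlation(male_heights, male_speeds)
r_combined = correlation(female_heights + male_heights,
                         female_speeds + male_speeds)

print(r_female, r_male)        # no linear relationship within either group
print(round(r_combined, 3))    # strong positive correlation when pooled
```

Plotting or computing r separately by group, as done here, is the check that reveals the pooled correlation is produced by the grouping variable.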
Interpretations of observed association • There are four main ways for you to interpret an observed association between two quantitative variables: • There is causation • There may be causation • There is no causation • The response variable is causing a change in the explanatory variable (reverse causation)
Important note... • A strong correlation does not necessarily mean there is a causal relationship between two variables. • Most correlations come from observational studies and we can’t claim causation from observational studies! • All a strong correlation means is that there is an association between the variables.
Order these scatterplots (labeled k, j, m) in increasing order of r-squared: A. j < k < m B. m < j < k C. j < m < k D. k < m < j
Review: If you understood today’s lecture, you should be able to solve 3.27, 3.29, 3.33, 3.37, 3.39, 3.41, 3.45, 3.63, 3.75, 3.77. Recall objectives (for two quantitative variables): • Define and interpret residual • Define and interpret correlation coefficient • Interpret the square of the correlation coefficient (r-squared) • Recognize various pitfalls in using regression: – Extrapolation is dangerous. – Outliers can have a huge effect. – Interpreting a linear relationship as causation is dangerous.