
Descriptive measures of the strength of a linear association

Descriptive measures of the strength of a linear association. r-squared and the (Pearson) correlation coefficient r. Translating a research question into a statistical procedure. How strong is the linear relationship between skin cancer mortality and latitude?


Presentation Transcript


  1. Descriptive measures of the strength of a linear association: r-squared and the (Pearson) correlation coefficient r

  2. Translating a research question into a statistical procedure • How strong is the linear relationship between skin cancer mortality and latitude? • (Pearson) correlation coefficient r • Coefficient of determination r2

  3. Where does this topic fit in? • Model formulation • Model estimation • Model evaluation • Model use

  4. Situation #1: A very weak linear relationship

  5. Situation #2: A fairly strong linear relationship

  6. Coefficient of determination r2 • r2 is a number (a proportion!) between 0 and 1. • If r2 = 1: • all data points fall perfectly on the regression line • the predictor x accounts for all of the variation in y • If r2 = 0: • the fitted regression line is perfectly horizontal • the predictor x accounts for none of the variation in y
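For reference, the quantity behind these statements is defined from the regression sums of squares (the same quantities that appear in the Minitab ANOVA output later in the deck):

$$ r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}, $$

where SSR is the regression sum of squares, SSE the error sum of squares, and SSTO = SSR + SSE the total sum of squares.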

  7. Interpretation of r2 • r2 ×100 percent of the variation in y is reduced by taking into account predictor x. • r2 ×100 percent of the variation in y is “explained by” the variation in predictor x.

  8. R-sq in Minitab fitted line plot

  9. R-sq in Minitab regression output
The regression equation is
Mort = 389.189 - 5.97764 Lat
S = 19.1150   R-Sq = 68.0%   R-Sq(adj) = 67.3%
Analysis of Variance
Source       DF        SS        MS        F      P
Regression    1   36464.2   36464.2  99.7968  0.000
Error        47   17173.1     365.4
Total        48   53637.3
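The reported R-Sq can be checked directly from the ANOVA table:

$$ r^2 = \frac{SSR}{SSTO} = \frac{36464.2}{53637.3} \approx 0.680, $$

i.e. the 68.0% that Minitab prints.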

  10. Pearson correlation coefficient r If r2 is represented in decimal form, e.g. 0.39 or 0.87, then: • r is a (unitless) number between -1 and 1, inclusive. • Sign of the correlation coefficient: • positive if the slope of the fitted regression line is positive • negative if the slope of the fitted regression line is negative
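The formula this slide leads into is not reproduced in the transcript; the standard relationship between the two measures is

$$ r = \pm\sqrt{r^2}, $$

with the sign taken from the estimated slope b1. For the mortality example, r = -√0.680 ≈ -0.825, matching the Minitab output on a later slide.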

  11. Formulas for the Pearson correlation coefficient r
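The formulas themselves are not reproduced in this transcript; the usual ones are

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} = b_1\,\frac{s_x}{s_y}, $$

where b1 is the estimated slope and sx, sy are the sample standard deviations of x and y.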

  12. What do we learn from the formulas for r? • The correlation coefficient r gets its sign from the slope b1. • The correlation coefficient r is a unitless measure. • The correlation coefficient r = 0 when the estimated slope b1 = 0 and vice versa.

  13. Interpretation of Pearson correlation coefficient r • There is no nice practical interpretation for r as there is for r2. • r = -1 is a perfect negative linear relationship. • r = 1 is a perfect positive linear relationship. • r = 0 is no linear relationship. • For other values of r, how strong a relationship is deemed to be depends on the research area.

  14. Pearson correlation coefficient r in Minitab
Correlations: Lat, Mort
Pearson correlation of Lat and Mort = -0.825
Correlations: Mort, Lat
Pearson correlation of Mort and Lat = -0.825
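For readers working outside Minitab, the same coefficient can be computed in a few lines of Python with scipy's pearsonr; the arrays below are placeholder numbers, not the actual skin cancer data set:

```python
import numpy as np
from scipy import stats

# Placeholder values -- substitute the actual latitude and mortality columns.
lat = np.array([33.0, 34.5, 35.6, 37.5, 39.0, 41.2, 42.7, 44.5, 46.1, 47.8])
mort = np.array([219.0, 160.0, 170.0, 182.0, 149.0, 134.0, 128.0, 115.0, 110.0, 105.0])

r, p = stats.pearsonr(lat, mort)   # Pearson r and its two-sided p-value
print(f"Pearson correlation of Lat and Mort = {r:.3f}")

# The coefficient is symmetric in its arguments, so the order does not matter.
r_reversed, _ = stats.pearsonr(mort, lat)
assert np.isclose(r, r_reversed)
```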

  15. How strong is the linear relationship between Celsius and Fahrenheit? Pearson correlation of Celsius and Fahrenheit = 1.000

  16. How strong is the linear relationship between # of stories and height? Pearson correlation of HEIGHT and STORIES = 0.951

  17. How strong is the linear relationship between driver age and seeing distance? Pearson correlation of Distance and DrivAge = -0.801

  18. How strong is the linear relationship between height and g.p.a.? Pearson correlation of height and gpa = -0.053

  19. Caution #1 • The correlation coefficient r quantifies the strength of a linear relationship. • It is possible to get r = 0 with a perfect curvilinear relationship.
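A minimal sketch of this caution with made-up data: y is a perfect quadratic function of x, yet r comes out zero because the relationship is not linear.

```python
import numpy as np

x = np.arange(-5, 6)              # symmetric about zero
y = x ** 2                        # a perfect curvilinear relationship
r = np.corrcoef(x, y)[0, 1]
print(r)                          # 0.0 up to floating-point noise:
                                  # r measures only *linear* association
```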

  20. Example of Caution #1 Pearson correlation of x and y = 0.000

  21. Clarification of Caution #1 Pearson correlation of x and y = 0.000

  22. Caution #2 • A large r2 value should not be interpreted as meaning that the estimated regression line fits the data well. • Another function might better describe the trend in the data.
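A sketch of this caution with made-up exponential-growth data: the straight-line fit has a large r-squared, yet the residuals curve systematically, so another function would describe the trend better.

```python
import numpy as np

x = np.arange(0, 16)
y = 10 * np.exp(0.15 * x)               # made-up, exponentially growing data

r = np.corrcoef(x, y)[0, 1]
b1, b0 = np.polyfit(x, y, 1)            # straight-line fit
residuals = y - (b0 + b1 * x)

print(f"r-squared of the linear fit: {r**2:.2f}")   # large
print(np.sign(residuals).astype(int))   # positive at both ends, negative in the
                                        # middle: the line misses the curvature
```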

  23. Example of Caution #2 Pearson correlation of Year and USPopn = 0.959

  24. Caution #3 • The coefficient of determination r2 and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).
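A sketch with made-up numbers: a single extreme observation can move r from near zero to large and positive.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 1.8, 2.5, 1.9, 2.4, 2.2, 1.7, 2.3])   # essentially no trend

print(np.corrcoef(x, y)[0, 1])          # close to 0

# Add one extreme observation.
x2 = np.append(x, 20.0)
y2 = np.append(y, 15.0)
print(np.corrcoef(x2, y2)[0, 1])        # now large and positive,
                                        # driven almost entirely by that one point
```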

  25. Example of Caution #3 Pearson correlation of Deaths and Magnitude = 0.732

  26. Example of Caution #3 Pearson correlation of Deaths and Magnitude = -0.960

  27. Caution #4 • Correlation (association) does not imply causation.

  28. Example of Caution #4 Pearson correlation of Wine and Heart = -0.843

  29. Caution #5 • Ecological correlations are correlations that are based on rates or averages. • Ecological correlations tend to overstate the strength of an association.

  30. Example of Caution #5 • Data from 1988 Current Population Survey • Treating individuals as the units • Correlation between income and education for men age 25-64 in U.S. is r ≈ 0.4. • Treating nine regions as the units • Compute average income and average education for men age 25-64 in each of the nine regions. • Correlation between the average incomes and the average education in U.S. is r ≈ 0.7.
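A toy simulation of the same phenomenon (made-up numbers, not the Current Population Survey data): correlating regional averages gives a noticeably larger r than correlating individuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n_regions, n_per_region = 9, 1000

# Region-level averages that are strongly related to each other.
region_edu = rng.normal(13.0, 1.5, n_regions)
region_inc = 10.0 + 2.5 * region_edu + rng.normal(0.0, 1.0, n_regions)

# Individuals: their region's average plus substantial person-to-person variation.
edu = np.repeat(region_edu, n_per_region) + rng.normal(0.0, 1.5, n_regions * n_per_region)
inc = np.repeat(region_inc, n_per_region) + rng.normal(0.0, 6.0, n_regions * n_per_region)

r_individual = np.corrcoef(edu, inc)[0, 1]
r_regional = np.corrcoef(
    edu.reshape(n_regions, n_per_region).mean(axis=1),
    inc.reshape(n_regions, n_per_region).mean(axis=1),
)[0, 1]
# Averaging removes the person-to-person variation, so the region-level r
# comes out noticeably larger than the individual-level r.
print(f"individual-level r = {r_individual:.2f}, region-average r = {r_regional:.2f}")
```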

  31. Example of Caution #5

  32. Example of Caution #5

  33. Caution #6 • A “statistically significant” r2 does not imply that the slope β1 is meaningfully different from 0.

  34. Caution #7 • A large r2 does not necessarily mean that a useful prediction of the response ynew (or estimation of the mean response μY) can be made. • It is still possible to get prediction (or confidence) intervals that are too wide to be useful.

  35. Using the sample correlation r to learn about the population correlation ρ

  36. Translating a research question into a statistical procedure • Is there a linear relationship between skin cancer mortality and latitude? • t-test for testing H0: β1 = 0 • ANOVA F-test for testing H0: β1 = 0 • Is there a linear correlation between husband’s age and wife’s age? • t-test for testing population correlation coefficient H0: ρ = 0

  37. Where does this topic fit in? • Model formulation • Model estimation • Model evaluation • Model use

  38. Is there a linear correlation between husband’s age and wife’s age? Pearson correlation of HAge and WAge = 0.939

  39. Is there a linear correlation between husband’s age and wife’s age? Pearson correlation of WAge and HAge = 0.939

  40. The formal t-test for the correlation coefficient ρ
Null hypothesis H0: ρ = 0
Alternative hypothesis HA: ρ ≠ 0 or ρ < 0 or ρ > 0
Test statistic: t* (see the formula below)
P-value = What is the probability that we’d get a t* statistic as extreme as we did, if the null hypothesis is true? The P-value is determined by comparing t* to a t distribution with n-2 degrees of freedom.
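The test statistic is not reproduced in the transcript; it is the standard one for this test:

$$ t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}. $$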

  41. Is there a linear correlation between husband’s age and wife’s age?
Test statistic: t* (see the calculation below)
Help in determining the P-value: Student's t distribution with 168 DF
      x   P( X <= x )
35.3900        1.0000
Just let Minitab do the work:
Pearson correlation of WAge and HAge = 0.939   P-Value = 0.000
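Plugging in the sample values (r = 0.939, n = 170 complete pairs, so n - 2 = 168 degrees of freedom):

$$ t^* = \frac{0.939\sqrt{168}}{\sqrt{1-0.939^2}} \approx \frac{12.17}{0.344} \approx 35.39, $$

the value 35.3900 in the t-table lookup above; since P(T ≤ 35.39) ≈ 1.0000, the two-sided P-value is essentially 0.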

  42. When is it okay to use the t-test for testing H0: ρ = 0? • When it is not obvious which variable is the response. • When the (x, y) pairs are a random sample from a bivariate normal population. • For each x, the y’s are normal with equal variances. • For each y, the x’s are normal with equal variances. • Either, y can be considered a linear function of x. • Or, x can be considered a linear function of y. • The (x, y) pairs are independent.

  43. The three tests will always yield similar results.
The regression equation is
HAge = 3.59 + 0.967 WAge
170 cases used, 48 cases contain missing values
Predictor   Coef      SE Coef   T      P
Constant    3.590     1.159     3.10   0.002
WAge        0.96670   0.02742   35.25  0.000
S = 4.069   R-Sq = 88.1%   R-Sq(adj) = 88.0%
Analysis of Variance
Source       DF      SS      MS        F      P
Regression    1   20577   20577   1242.51  0.000
Error       168    2782      17
Total       169   23359
Pearson correlation of WAge and HAge = 0.939   P-Value = 0.000

  44. The three tests will always yield similar results.
The regression equation is
WAge = 1.57 + 0.911 HAge
170 cases used, 48 cases contain missing values
Predictor   Coef      SE Coef   T      P
Constant    1.574     1.150     1.37   0.173
HAge        0.91124   0.02585   35.25  0.000
S = 3.951   R-Sq = 88.1%   R-Sq(adj) = 88.0%
Analysis of Variance
Source       DF      SS      MS        F      P
Regression    1   19396   19396   1242.51  0.000
Error       168    2623      16
Total       169   22019
Pearson correlation of WAge and HAge = 0.939   P-Value = 0.000
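A quick numerical check of the "three tests agree" claim: in each ANOVA table the F-statistic is the square of the slope t-statistic, 35.25² ≈ 1242.6 ≈ 1242.51 (the small gap is rounding of the displayed t), and the correlation test statistic t* ≈ 35.39 computed from the rounded r = 0.939 matches the slope t up to that same rounding.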

  45. Which results should I report? • If one of the variables can be clearly identified as the response, report the t-test or F-test results for testing H0: β1 = 0. • Does it make sense to use x to predict y? • If it is not obvious which variable is the response, report the t-test results for testing H0: ρ = 0. • Does it only make sense to look for an association between x and y?
