Descriptive measures of the strength of a linear association: r-squared and the (Pearson) correlation coefficient r
Translating a research question into a statistical procedure • How strong is the linear relationship between skin cancer mortality and latitude? • (Pearson) correlation coefficient r • Coefficient of determination r²
Where does this topic fit in? • Model formulation • Model estimation • Model evaluation • Model use
Coefficient of determination r² • r² is a number (a proportion!) between 0 and 1. • If r² = 1: • all data points fall perfectly on the regression line • the predictor x accounts for all of the variation in y • If r² = 0: • the fitted regression line is perfectly horizontal • the predictor x accounts for none of the variation in y
Interpretation of r² • r² × 100 percent of the variation in y is reduced by taking into account predictor x. • r² × 100 percent of the variation in y is “explained by” the variation in predictor x.
R-sq in Minitab regression output

The regression equation is
Mort = 389.189 - 5.97764 Lat

S = 19.1150   R-Sq = 68.0%   R-Sq(adj) = 67.3%

Analysis of Variance
Source      DF       SS       MS        F      P
Regression   1  36464.2  36464.2  99.7968  0.000
Error       47  17173.1    365.4
Total       48  53637.3
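The R-Sq value in this output can be recovered from the sums of squares in the Analysis of Variance table. The sketch below assumes the usual definition, r² = SS(Regression) / SS(Total) = 1 - SS(Error) / SS(Total); the variable names are illustrative only.

```python
# Sums of squares copied from the Minitab Analysis of Variance table.
ss_regression = 36464.2
ss_error = 17173.1
ss_total = 53637.3

# Two equivalent ways of writing the coefficient of determination.
r_sq_from_regression = ss_regression / ss_total
r_sq_from_error = 1 - ss_error / ss_total

print(f"SSR / SSTO     = {r_sq_from_regression:.3f}")
print(f"1 - SSE / SSTO = {r_sq_from_error:.3f}")
```

Both expressions round to 0.680, matching the reported R-Sq = 68.0%.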
Pearson correlation coefficient r If r² is represented in decimal form, e.g. 0.39 or 0.87, then r = ±√r²: • r is a (unitless) number between -1 and 1, inclusive. • Sign of coefficient of correlation • plus sign if slope of fitted regression line is positive • minus sign if slope of fitted regression line is negative
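As a small illustration of the sign rule, the sketch below takes the square root of r² and attaches the sign of the fitted slope. The helper name `r_from_r_squared` is hypothetical; the inputs are the R-Sq and slope from the skin cancer mortality output.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Return the square root of r_squared, carrying the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)


# R-Sq = 68.0% and slope = -5.97764 from the Mort versus Lat regression.
r = r_from_r_squared(0.680, -5.97764)
print(f"r = {r:.3f}")
```

The result agrees, up to rounding of R-Sq, with the Pearson correlation of -0.825 that Minitab reports for Lat and Mort.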
What do we learn from the formulas for r? • The correlation coefficient r gets its sign from the slope b1. • The correlation coefficient r is a unitless measure. • The correlation coefficient r = 0 when the estimated slope b1 = 0 and vice versa.
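The formulas themselves did not survive on this slide. The standard textbook forms, which are presumably what the three bullets refer to, are given here as a reference sketch, with b_1 denoting the estimated slope:

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}
  = b_1 \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}},
\qquad
b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
```

The square-root factor is always positive, so r takes the sign of b_1 and is zero exactly when b_1 is zero. The units of x and y cancel between numerator and denominator, which is why r is unitless.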
Interpretation of Pearson correlation coefficient r • There is no nice practical interpretation for r as there is for r². • r = -1 indicates a perfect negative linear relationship. • r = 1 indicates a perfect positive linear relationship. • r = 0 indicates no linear relationship. • For other values of r, how strong the relationship between x and y is deemed to be depends on the research area.
Pearson correlation coefficient r in Minitab
Correlations: Lat, Mort
Pearson correlation of Lat and Mort = -0.825
Correlations: Mort, Lat
Pearson correlation of Mort and Lat = -0.825
How strong is the linear relationship between Celsius and Fahrenheit? Pearson correlation of Celsius and Fahrenheit = 1.000
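A correlation of exactly 1 is expected here because Fahrenheit is an exact linear function of Celsius (F = 1.8 C + 32). The sketch below checks this with a hand-written `pearson_r` helper; the temperature values are made up for illustration and are not the data behind the slide.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical Celsius readings, converted exactly to Fahrenheit.
celsius = [-10, 0, 10, 20, 30, 40]
fahrenheit = [1.8 * c + 32 for c in celsius]

r = pearson_r(celsius, fahrenheit)
print(f"r = {r:.3f}")
```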
How strong is the linear relationship between # of stories and height? Pearson correlation of HEIGHT and STORIES = 0.951
How strong is the linear relationship between driver age and see distance? Pearson correlation of Distance and DrivAge = -0.801
How strong is the linear relationship between height and g.p.a.? Pearson correlation of height and gpa = -0.053
Caution #1 • The correlation coefficient r quantifies the strength of a linear relationship. • It is possible to get r = 0 with a perfect curvilinear relationship.
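A minimal sketch of this caution, using made-up data: y is an exact function of x (y = x²), yet the x values are symmetric about zero, so the linear correlation is zero.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data: a perfect parabola centred on x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

Knowing x determines y completely, but r still reports no linear relationship.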
Example of Caution #1 Pearson correlation of x and y = 0.000
Clarification of Caution #1 Pearson correlation of x and y = 0.000
Caution #2 • A large r² value should not be interpreted as meaning that the estimated regression line fits the data well. • Another function might better describe the trend in the data.
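To illustrate with made-up data rather than the population data on the next slide: for x = 0, 1, ..., 10 and y = x², the correlation is high even though the relationship is exactly quadratic. The residuals from the fitted line show the systematic pattern that r alone hides.

```python
import math

# Illustrative data: an exactly quadratic trend.
x = list(range(11))
y = [xi ** 2 for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)

# Least-squares line and its residuals.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
print("residuals:", [round(e, 1) for e in residuals])
```

The residuals are positive at both ends and negative in the middle, a sign that a curve describes the trend better than a line despite the large r².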
Example of Caution #2 Pearson correlation of Year and USPopn = 0.959
Caution #3 • The coefficient of determination r² and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).
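A small sketch of this effect, using invented numbers rather than the earthquake data on the following slides: six points with almost no linear trend, then the same six points plus a single extreme observation.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with little linear association.
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 2, 3, 1, 2]
r_without = pearson_r(x, y)

# Add one point far from the rest of the data.
r_with = pearson_r(x + [20], y + [20])

print(f"without the extreme point: r = {r_without:.3f}")
print(f"with the extreme point:    r = {r_with:.3f}")
```

One added point moves r from weakly negative to strongly positive, which is why it is worth plotting the data before interpreting r or r².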
Example of Caution #3 Pearson correlation of Deaths and Magnitude = 0.732
Example of Caution #3 Pearson correlation of Deaths and Magnitude = -0.960
Caution #4 • Correlation (association) does not imply causation.
Example of Caution #4 Pearson correlation of Wine and Heart = -0.843
Caution #5 • Ecological correlations are correlations that are based on rates or averages. • Ecological correlations tend to overstate the strength of an association.
Example of Caution #5 • Data from 1988 Current Population Survey • Treating individuals as the units • Correlation between income and education for men age 25-64 in the U.S. is r ≈ 0.4. • Treating nine regions as the units • Compute average income and average education for men age 25-64 in each of the nine regions. • Correlation between the average incomes and the average education levels in the U.S. is r ≈ 0.7.
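The mechanism can be sketched with a simulation. Everything below is invented for illustration (nine hypothetical regions, 500 simulated individuals each, arbitrary means and noise levels) and is not the Current Population Survey data. Averaging within regions removes most of the individual-level noise, so the correlation between regional averages comes out much higher than the correlation between individuals.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the sketch is reproducible

education, income = [], []            # individual-level values
avg_education, avg_income = [], []    # one average per region

for region in range(9):
    # Assumed: regional mean education rises slowly from region to region.
    region_mean = 10 + 0.5 * region
    edu = [random.gauss(region_mean, 3) for _ in range(500)]
    # Assumed: income depends on education plus large individual noise.
    inc = [2 * e + random.gauss(0, 15) for e in edu]
    education.extend(edu)
    income.extend(inc)
    avg_education.append(sum(edu) / len(edu))
    avg_income.append(sum(inc) / len(inc))

r_individual = pearson_r(education, income)
r_region = pearson_r(avg_education, avg_income)

print(f"individuals as units: r = {r_individual:.2f}")
print(f"regions as units:     r = {r_region:.2f}")
```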
Caution #6 • A “statistically significant” r² does not imply that the slope β1 is meaningfully different from 0.
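A quick numeric sketch with hypothetical values: with a very large sample, even r = 0.05 (so r² = 0.0025) produces a large test statistic. The statistic used is t* = r√(n-2) / √(1-r²); the P-value below uses a normal approximation to the t distribution, which is an assumption made here to keep the example within the Python standard library and is reasonable only because n is large.

```python
import math
from statistics import NormalDist

# Hypothetical values: a tiny correlation observed in a very large sample.
r = 0.05
n = 10_000

t_star = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-sided P-value, approximating t with n-2 df by a standard normal.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_star)))

r_squared = r ** 2

print(f"t* = {t_star:.2f}, approximate P-value = {p_approx:.1e}")
print(f"r^2 = {r_squared:.4f}")
```

The test rejects H0 decisively, yet x accounts for only a quarter of one percent of the variation in y.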
Caution #7 • A large r² does not necessarily mean that a useful prediction of the response ynew (or estimation of the mean response μY) can be made. • It is still possible to get prediction (or confidence) intervals that are too wide to be useful.
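The reason can be seen in the standard simple linear regression prediction interval, which is not shown on the slide and is quoted here as a reference sketch. For a new observation at x_h:

```latex
\hat{y}_h \;\pm\; t_{(1-\alpha/2,\; n-2)}
\sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right)},
\qquad
MSE = \frac{SSE}{n-2} = \frac{(1-r^2)\,SSTO}{n-2}
```

The width depends on MSE, which is measured in the units of y and scales with the total variation SSTO. A large r² shrinks MSE only relative to SSTO, so when y varies a great deal the interval can still be too wide for practical use.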
Using the sample correlation r to learn about the population correlation ρ
Translating a research question into a statistical procedure • Is there a linear relationship between skin cancer mortality and latitude? • t-test for testing H0: β1 = 0 • ANOVA F-test for testing H0: β1 = 0 • Is there a linear correlation between husband’s age and wife’s age? • t-test for testing population correlation coefficient H0: ρ = 0
Where does this topic fit in? • Model formulation • Model estimation • Model evaluation • Model use
Is there a linear correlation between husband’s age and wife’s age? Pearson correlation of HAge and WAge = 0.939
Is there a linear correlation between husband’s age and wife’s age? Pearson correlation of WAge and HAge = 0.939
The formal t-test for correlation coefficient ρ
Null hypothesis H0: ρ = 0
Alternative hypothesis HA: ρ ≠ 0 or ρ < 0 or ρ > 0
Test statistic: t* = r√(n-2) / √(1-r²)
P-value = What is the probability that we’d get a t* statistic as extreme as we did, if the null hypothesis is true? The P-value is determined by comparing t* to a t distribution with n-2 degrees of freedom.
Is there a linear correlation between husband’s age and wife’s age?
Test statistic: t* = 0.939 √(170-2) / √(1-0.939²) = 35.39
Help in determining the P-value:
Student's t distribution with 168 DF
x P( X <= x )
35.3900 1.0000
Just let Minitab do the work:
Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
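The value 35.39 used in the t-distribution lookup can be reproduced from the reported correlation. The sketch assumes the usual statistic t* = r√(n-2) / √(1-r²), with r = 0.939 and n = 170 (the number of cases behind the 168 degrees of freedom).

```python
import math

r = 0.939   # Pearson correlation of WAge and HAge
n = 170     # cases used, giving n - 2 = 168 degrees of freedom

t_star = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(f"t* = {t_star:.2f} on {n - 2} degrees of freedom")
```

A statistic this far into the tail of a t distribution with 168 degrees of freedom is what produces the reported P-Value of 0.000.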
When is it okay to use the t-test for testing H0: ρ = 0? • When it is not obvious which variable is the response. • When the (x, y) pairs are a random sample from a bivariate normal population. • For each x, the y’s are normal with equal variances. • For each y, the x’s are normal with equal variances. • Either, y can be considered a linear function of x. • Or, x can be considered a linear function of y. • The (x, y) pairs are independent.
The three tests will always yield similar results.

The regression equation is
HAge = 3.59 + 0.967 WAge

170 cases used 48 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     3.590    1.159   3.10  0.002
WAge       0.96670  0.02742  35.25  0.000

S = 4.069   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  20577  20577  1242.51  0.000
Error       168   2782     17
Total       169  23359

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
The three tests will always yield similar results.

The regression equation is
WAge = 1.57 + 0.911 HAge

170 cases used 48 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     1.574    1.150   1.37  0.173
HAge       0.91124  0.02585  35.25  0.000

S = 3.951   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  19396  19396  1242.51  0.000
Error       168   2623     16
Total       169  22019

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
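The agreement between the two regressions and the correlation test can be checked from the printed numbers. The sketch relies on two standard facts about simple linear regression that the slides do not state: the ANOVA F statistic equals the square of the slope's t statistic, and the product of the two fitted slopes (y on x and x on y) equals r².

```python
import math

# Values copied from the two Minitab outputs.
t_slope = 35.25          # T for the slope, identical in both regressions
f_anova = 1242.51        # F in both ANOVA tables
slope_hage_on_wage = 0.96670
slope_wage_on_hage = 0.91124
ss_regression, ss_total = 20577, 23359   # HAge regressed on WAge

t_squared = t_slope ** 2
r_sq_from_slopes = slope_hage_on_wage * slope_wage_on_hage
r_sq_from_anova = ss_regression / ss_total
r_from_slopes = math.sqrt(r_sq_from_slopes)

print(f"t^2 = {t_squared:.2f} versus F = {f_anova}")
print(f"r^2 from slopes = {r_sq_from_slopes:.3f}, from ANOVA = {r_sq_from_anova:.3f}")
print(f"r = {r_from_slopes:.3f}")
```

The small gap between t² and F, and between the slope t of 35.25 and the correlation t* of 35.39, comes from rounding in the printed output.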
Which results should I report? • If one of the variables can be clearly identified as the response, report the t-test or F-test results for testing H0: β1 = 0. • Does it make sense to use x to predict y? • If it is not obvious which variable is the response, report the t-test results for testing H0: ρ = 0. • Does it only make sense to look for an association between x and y?