CHAPTER 7 Linear Correlation & Regression Methods

Presentation Transcript


  1. CHAPTER 7: Linear Correlation & Regression Methods. 7.1 - Motivation; 7.2 - Correlation / Simple Linear Regression; 7.3 - Extensions of Simple Linear Regression

  2. Parameter Estimation via SAMPLE DATA … Testing for association between two POPULATION variables X and Y…
     • Categorical variables: Chi-squared Test. Examples: X = Disease status (D+, D–), Y = Exposure status (E+, E–); X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High).
     • Numerical variables: ??????? PARAMETERS: Means $\mu_X, \mu_Y$; Variances $\sigma_X^2, \sigma_Y^2$; Covariance $\sigma_{XY}$.
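
For the categorical case on slide 2, a minimal sketch of the chi-squared test in R might look as follows. The 2×2 counts are hypothetical, invented only for illustration; they do not come from the presentation.

```r
# Hypothetical 2x2 table of counts: Disease status (rows) by Exposure status (columns).
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Disease = c("D+", "D-"),
                              Exposure = c("E+", "E-")))

# chisq.test() applies Yates' continuity correction to 2x2 tables by default.
res <- chisq.test(tab)
print(res)

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom.
df_chisq <- unname(res$parameter)
```

A small p-value here would suggest an association between the two categorical variables; the numerical case is what the remaining slides develop.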

  3. Parameter Estimation via SAMPLE DATA … • Numerical variables: ??????? PARAMETERS: Means $\mu_X, \mu_Y$; Variances $\sigma_X^2, \sigma_Y^2$; Covariance $\sigma_{XY}$. STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$; Covariance $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ (can be +, –, or 0).

  4. Parameter Estimation via SAMPLE DATA … • Numerical variables: ??????? PARAMETERS: Means $\mu_X, \mu_Y$; Variances $\sigma_X^2, \sigma_Y^2$; Covariance $\sigma_{XY}$. STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$; Covariance $s_{xy}$ (can be +, –, or 0). [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493]

  5. Parameter Estimation via SAMPLE DATA … • Numerical variables: ??????? PARAMETERS: Means $\mu_X, \mu_Y$; Variances $\sigma_X^2, \sigma_Y^2$; Covariance $\sigma_{XY}$. STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$; Covariance $s_{xy}$ (can be +, –, or 0). [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] Does this suggest a linear trend between X and Y? If so, how do we measure it?

  6. Testing for LINEAR association between two population variables X and Y… • Numerical variables: ??????? PARAMETERS: Means $\mu_X, \mu_Y$; Variances $\sigma_X^2, \sigma_Y^2$; Covariance $\sigma_{XY}$; Linear Correlation Coefficient $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$, always between –1 and +1.

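  7. Parameter Estimation via SAMPLE DATA … • Numerical variables: ??????? PARAMETERS: Means $\mu_X, \mu_Y$; Variances $\sigma_X^2, \sigma_Y^2$; Covariance $\sigma_{XY}$; Linear Correlation Coefficient $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$. STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$; Covariance $s_{xy}$ (can be +, –, or 0); Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$, always between –1 and +1. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493]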

  8. Parameter Estimation via SAMPLE DATA … • Numerical variables. Example in R (reformatted for brevity):
         > pop = seq(0, 20, 0.1)
         > x = sort(sample(pop, 10))
         1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
         > y = sample(pop, 10)
         13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0
         > c(mean(x), mean(y))     # Means
         7.05 12.08
         > var(x)                  # Variances
         29.48944
         > var(y)
         43.76178
         > plot(x, y, pch = 19)    # Scatterplot, n = 10
         > cov(x, y)               # Covariance (can be +, –, or 0)
         -25.86667
         > cor(x, y)               # Linear Correlation Coefficient: always between –1 and +1
         -0.7200451
     [Figure: scatterplot of the n = 10 data points, Y against X]
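
The covariance and correlation printed on slide 8 can be reproduced directly from their definitions. The sketch below re-enters the ten (x, y) values shown on the slide (the original `sample()` calls are random, so they cannot be re-run to obtain the same data) and checks the hand computation against `cov()` and `cor()`. The names `s_xy` and `r` are chosen here for illustration.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample correlation: covariance scaled by both standard deviations.
r <- s_xy / (sd(x) * sd(y))

# Both should agree with R's built-in functions.
stopifnot(isTRUE(all.equal(s_xy, cov(x, y))),
          isTRUE(all.equal(r, cor(x, y))))
print(c(covariance = s_xy, correlation = r))
```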

  9. Parameter Estimation via SAMPLE DATA … • Numerical variables • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$: always between –1 and +1. r measures the strength of linear association. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493]

  10. Parameter Estimation via SAMPLE DATA … • Numerical variables • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$: always between –1 and +1. r measures the strength of linear association. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] [Scale for r: –1 (negative linear correlation) … 0 … +1 (positive linear correlation)]

  11. Parameter Estimation via SAMPLE DATA … • Numerical variables • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$: always between –1 and +1. r measures the strength of linear association. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] [Scale for r: –1 (negative linear correlation) … 0 … +1 (positive linear correlation)]

  12. Parameter Estimation via SAMPLE DATA … • Numerical variables • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$: always between –1 and +1. r measures the strength of linear association. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] [Scale for r: –1 (negative linear correlation) … 0 … +1 (positive linear correlation)]

  13. Parameter Estimation via SAMPLE DATA … • Numerical variables • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$: always between –1 and +1. r measures the strength of linear association. > cor(x, y) -0.7200451 [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] [Scale for r: –1 (negative linear correlation) … 0 … +1 (positive linear correlation)]

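  14. Testing for linear association between two numerical population variables X and Y… Now that we have r, we can conduct HYPOTHESIS TESTING on $\rho$. • Linear Correlation Coefficient: $H_0: \rho = 0$ vs. $H_A: \rho \neq 0$. Test Statistic for p-value: $T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$; here $T = \frac{-0.72\sqrt{8}}{\sqrt{1-(-0.72)^2}} = -2.935$ on 8 degrees of freedom. p-value = 2 * pt(-2.935, 8) = .0189 < .05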
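
The t-score and p-value on slide 14 can be checked numerically. This is a sketch using the slide 8 data; it computes the statistic by hand and compares it with the built-in `cor.test()` (Pearson, two-sided, by default). The variable names are illustrative.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)
r <- cor(x, y)

# Test statistic for H0: rho = 0, referred to a t distribution on n - 2 df.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value: twice the tail area beyond |t|.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# The built-in test should give the same numbers.
ct <- cor.test(x, y)
stopifnot(isTRUE(all.equal(unname(ct$statistic), t_stat)),
          isTRUE(all.equal(ct$p.value, p_value)))
print(c(t = t_stat, p = p_value))
```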

  15. Parameter Estimation via SAMPLE DATA … If such an association between X and Y exists, then it follows that for some intercept $\beta_0$ and slope $\beta_1$, we have $Y = \beta_0 + \beta_1 X + \varepsilon$ (“Response = Model + Error”). • Linear Correlation Coefficient: r measures the strength of linear association. > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ (in what sense???). Residuals: $e_i = y_i - \hat{y}_i$.

  16. Parameter Estimation via SAMPLE DATA … SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. If such an association between X and Y exists, then it follows that for some intercept $\beta_0$ and slope $\beta_1$, we have $Y = \beta_0 + \beta_1 X + \varepsilon$ (“Response = Model + Error”). • Linear Correlation Coefficient: r measures the strength of linear association. > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ (in what sense???): the “Least Squares Regression Line”, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$, where the residuals are $e_i = y_i - \hat{y}_i$.

  17. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. If such an association between X and Y exists, then it follows that for some intercept $\beta_0$ and slope $\beta_1$, we have $Y = \beta_0 + \beta_1 X + \varepsilon$ (“Response = Model + Error”). • Linear Correlation Coefficient: r measures the strength of linear association. > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$, where the residuals are $e_i = y_i - \hat{y}_i$. Check.
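
For simple linear regression the least squares estimates have standard closed forms, slope = $s_{xy}/s_x^2$ and intercept = $\bar{y} - \text{slope}\cdot\bar{x}$. The transcript does not preserve which formulas the slide displayed, so the sketch below states them as an assumption and checks them against `lm()` on the slide 8 data. The helper `sse()` is a hypothetical name introduced here.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

# Closed-form least squares estimates.
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# They should match the coefficients returned by lm().
fit <- lm(y ~ x)
stopifnot(isTRUE(all.equal(unname(coef(fit)), c(b0, b1))))

# Sum of squared residuals for an arbitrary candidate line.
sse <- function(intercept, slope) sum((y - intercept - slope * x)^2)

# Nudging either coefficient away from the estimates increases the SSE.
stopifnot(sse(b0, b1) < sse(b0 + 0.5, b1),
          sse(b0, b1) < sse(b0, b1 + 0.1))
print(c(intercept = b0, slope = b1))
```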

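  18. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$] > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$ (residuals $e_i = y_i - \hat{y}_i$).

  19. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$; fitted response $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$] > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$ (residuals $e_i = y_i - \hat{y}_i$).

  20. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$; fitted response $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$] > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$ (residuals $e_i = y_i - \hat{y}_i$).

  21. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$; fitted response $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$; residuals $e_i = y_i - \hat{y}_i$] > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$.

  22. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$; fitted response $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$; residuals $e_i = y_i - \hat{y}_i$] > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$.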
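
The table of predictor, observed response, fitted response, and residuals did not survive in the transcript, but it can be regenerated from the slide 8 data. This sketch uses `fitted()` and `resid()` on the `lm` fit; the column names are chosen here for illustration.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

fit   <- lm(y ~ x)
y_hat <- unname(fitted(fit))   # fitted responses on the regression line
e     <- unname(resid(fit))    # residuals = observed - fitted

print(data.frame(predictor = x, observed = y,
                 fitted = round(y_hat, 3), residual = round(e, 3)))

# Residuals are observed minus fitted, sum to zero, and are orthogonal to x.
stopifnot(isTRUE(all.equal(e, y - y_hat)),
          abs(sum(e)) < 1e-8,
          abs(sum(e * x)) < 1e-8)

ss_err <- sum(e^2)   # the quantity that least squares minimizes
```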

  23. Testing for linear association between two numerical population variables X and Y… Now that we have these, we can conduct HYPOTHESIS TESTING on $\beta_0$ and $\beta_1$. • Linear Regression Coefficients: $Y = \beta_0 + \beta_1 X + \varepsilon$ (“Response = Model + Error”); $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$. Test Statistic for p-value?

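  24. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$; fitted response $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$; residuals $e_i = y_i - \hat{y}_i$] > cor(x, y) -0.7200451 Find estimates $\hat\beta_0$ and $\hat\beta_1$ for the “best” line, i.e., the line that minimizes $SS_{Err} = \sum_{i=1}^{n} e_i^2$.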

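  25. Testing for linear association between two numerical population variables X and Y… Now that we have these, we can conduct HYPOTHESIS TESTING on $\beta_0$ and $\beta_1$. • Linear Regression Coefficients: $Y = \beta_0 + \beta_1 X + \varepsilon$ (“Response = Model + Error”); $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$. Test Statistic for p-value: $T = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} \sim t_{n-2}$, where $SE(\hat\beta_1) = \sqrt{\frac{SS_{Err}/(n-2)}{(n-1)\,s_x^2}}$. p-value = .0189. Same t-score as $H_0: \rho = 0$!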
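
The claim that the slope test reproduces the t-score of the correlation test can be verified on the slide 8 data. The following is a sketch; `se_b1`, `t_b1`, and `t_r` are illustrative names.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)

fit <- lm(y ~ x)
b1  <- unname(coef(fit)["x"])

# Standard error of the slope: sqrt(MSErr / Sxx), with MSErr on n - 2 df.
ms_err <- sum(resid(fit)^2) / (n - 2)
s_xx   <- (n - 1) * var(x)
se_b1  <- sqrt(ms_err / s_xx)

# t-score for H0: beta1 = 0 ...
t_b1 <- b1 / se_b1

# ... equals the t-score for H0: rho = 0.
r   <- cor(x, y)
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)
stopifnot(isTRUE(all.equal(t_b1, t_r)))

p_b1 <- 2 * pt(-abs(t_b1), df = n - 2)
print(c(se = se_b1, t = t_b1, p = p_b1))
```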

  26. R session:
          > plot(x, y, pch = 19)
          > lsreg = lm(y ~ x)   # or lsfit(x,y)
          > abline(lsreg)
          > summary(lsreg)
          Call:
          lm(formula = y ~ x)
          Residuals:
              Min      1Q  Median      3Q     Max
          -8.6607 -3.2154  0.8954  3.4649  5.7742
          Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
          (Intercept)  18.2639     2.6097   6.999 0.000113 ***
          x            -0.8772     0.2989  -2.935 0.018857 *
          ---
          Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
          Residual standard error: 4.869 on 8 degrees of freedom
          Multiple R-squared:  0.5185, Adjusted R-squared:  0.4583
          F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886
      BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes…

  27. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc. Multilinear Regression: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \varepsilon$ (“Response = Model + Error”), where the terms $\beta_j X_j$ are the “main effects”. For now, assume the “additive model,” i.e., main effects only.

  28. Multilinear Regression. [Figure: 3-D plot of the response Y over the predictors X1 and X2, showing at predictors $(x_{1i}, x_{2i})$ the true response $y_i$, the fitted response $\hat{y}_i$, and the residual between them] Least Squares calculation of regression coefficients is computer-intensive. Formulas require Linear Algebra (matrices)! Once calculated, how do we then test the null hypothesis? ANOVA
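
As a sketch of the matrix calculation mentioned on slide 28, the least squares coefficients solve the normal equations $(X^\top X)\,\hat\beta = X^\top y$, where $X$ is the design matrix with a leading column of ones. The data below are simulated with arbitrary, hypothetical coefficients purely for illustration; they are not from the presentation.

```r
# Simulated (hypothetical) data with two predictors.
set.seed(1)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Design matrix: intercept column plus one column per predictor.
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) beta = X'y.
beta_hat <- as.vector(solve(crossprod(X), crossprod(X, y)))

# lm() uses a QR decomposition internally but should agree numerically.
fit <- lm(y ~ x1 + x2)
stopifnot(isTRUE(all.equal(beta_hat, unname(coef(fit)))))
print(beta_hat)
```

Solving the normal equations directly is fine for a small illustration; for ill-conditioned problems a QR-based routine such as `lm()` is usually the safer choice.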

  29. ANOVA Table

  30. ANOVA Table

  31. ANOVA Table In our example, k = 2 regression coefficients and n = 10 data points. ? ? ?

  32. Parameter Estimation via SAMPLE DATA … STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] $SS_{Tot} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = (n-1)\,s_y^2$ is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

  33. Parameter Estimation via SAMPLE DATA … STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] $SS_{Reg} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

  34. Parameter Estimation via SAMPLE DATA … STATISTICS: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$. [Figure: scatterplot of Y against X (n data points). JAMA. 2003;290:1486-1493] $SS_{Err} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

  35. ANOVA Table In our example, k = 2 regression coefficients and n = 10 data points. ? ? ?

  36. ANOVA Table In our example, k = 2 regression coefficients and n = 10 data points.

  37. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$; fitted response $\hat{y}_i$; residuals $e_i$] > cor(x, y) -0.7200451 $SS_{Reg} = 204.2$; $SS_{Err} = 189.656$; $SS_{Tot} = 9\,(43.76178) = 393.856$.

  38. SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. [Table rows: predictor $x_i$; observed response $y_i$ (Tot); fitted response $\hat{y}_i$ (Reg); residuals $e_i$ (Err)] > cor(x, y) -0.7200451 $SS_{Reg} = 204.2$; $SS_{Err} = 189.656$ (minimum); $SS_{Tot} = 393.856$. $SS_{Tot} = SS_{Reg} + SS_{Err}$
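
The decomposition $SS_{Tot} = SS_{Reg} + SS_{Err}$ and the three values on slides 37 and 38 can be recomputed from the slide 8 data. This is a sketch; the variable names `ss_tot`, `ss_reg`, and `ss_err` are illustrative.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(y)

fit <- lm(y ~ x)

ss_tot <- sum((y - mean(y))^2)            # variability in observed responses
ss_reg <- sum((fitted(fit) - mean(y))^2)  # variability in fitted responses
ss_err <- sum(resid(fit)^2)               # variability left in the residuals

# The total splits exactly into the two pieces, and SSTot = (n - 1) * var(y).
stopifnot(isTRUE(all.equal(ss_tot, ss_reg + ss_err)),
          isTRUE(all.equal(ss_tot, (n - 1) * var(y))))
print(c(SSReg = ss_reg, SSErr = ss_err, SSTot = ss_tot))
```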

  39. ANOVA Table In our example, k = 2 regression coefficients and n = 10 data points.

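  40. ANOVA Table. In our example, k = 2 regression coefficients and n = 10 data points.
          Source       df           SS        MS = SS/df   F        p-value
          Regression   k - 1 = 1    204.20    204.201      8.6135   0.01886
          Error        n - k = 8    189.66    23.707
          Total        n - 1 = 9    393.856
      Same as before!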

  41. R session:
          > summary(aov(lsreg))
                      Df Sum Sq Mean Sq F value  Pr(>F)
          x            1 204.20 204.201  8.6135 0.01886 *
          Residuals    8 189.66  23.707
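
The F value and p-value in the `aov` output follow from the sums of squares and their degrees of freedom. The sketch below recomputes them for the slide 8 data with k = 2 regression coefficients, and also checks that F equals the square of the slope's t-score, which is why the p-value is the "same as before". Names such as `f_stat` are illustrative.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(y)
k <- 2   # regression coefficients: intercept and slope

fit    <- lm(y ~ x)
ss_reg <- sum((fitted(fit) - mean(y))^2)
ss_err <- sum(resid(fit)^2)

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_reg <- ss_reg / (k - 1)
ms_err <- ss_err / (n - k)

# F statistic and its upper-tail p-value on (k - 1, n - k) df.
f_stat <- ms_reg / ms_err
p_f    <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Agreement with anova(), and F = t^2 for the slope.
tab     <- anova(fit)
t_slope <- coef(summary(fit))["x", "t value"]
stopifnot(isTRUE(all.equal(tab[["F value"]][1], f_stat)),
          isTRUE(all.equal(tab[["Pr(>F)"]][1], p_f)),
          isTRUE(all.equal(f_stat, t_slope^2)))
print(c(F = f_stat, p = p_f))
```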

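  42. Coefficient of Determination: $r^2 = \frac{SS_{Reg}}{SS_{Tot}} = \frac{204.2}{393.856} = 0.5185$. The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. Moreover, $r^2$ is the square of the linear correlation coefficient r.

  43. > cor(x, y) -0.7200451 Coefficient of Determination: $r^2 = \frac{SS_{Reg}}{SS_{Tot}} = \frac{204.2}{393.856} = 0.5185$. The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. Moreover, $r^2 = (-0.7200451)^2 = 0.5185$, the square of the linear correlation coefficient.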

  44. R session:
          > plot(x, y, pch = 19)
          > lsreg = lm(y ~ x)
          > abline(lsreg)
          > summary(lsreg)
          Call:
          lm(formula = y ~ x)
          Residuals:
              Min      1Q  Median      3Q     Max
          -8.6607 -3.2154  0.8954  3.4649  5.7742
          Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
          (Intercept)  18.2639     2.6097   6.999 0.000113 ***
          x            -0.8772     0.2989  -2.935 0.018857 *
          ---
          Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
          Residual standard error: 4.869 on 8 degrees of freedom
          Multiple R-squared:  0.5185, Adjusted R-squared:  0.4583
          F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886
      Coefficient of Determination (Multiple R-squared: 0.5185): The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.
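
The "Multiple R-squared" and "Adjusted R-squared" lines of the summary can be reproduced from the sums of squares. This sketch uses the slide 8 data; the adjusted value is computed with the usual formula $1 - (1 - r^2)\frac{n-1}{n-k}$, which is not stated on the slides and is included here as an assumption about what R reports.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(y)
k <- 2   # regression coefficients: intercept and slope

fit    <- lm(y ~ x)
ss_tot <- sum((y - mean(y))^2)
ss_reg <- sum((fitted(fit) - mean(y))^2)

# Coefficient of determination: share of total variability captured by the line.
r2 <- ss_reg / ss_tot

# Adjusted version penalizes for the number of fitted coefficients.
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k)

# r2 equals the squared correlation; both values match summary(lm).
s <- summary(fit)
stopifnot(isTRUE(all.equal(r2, cor(x, y)^2)),
          isTRUE(all.equal(r2, s$r.squared)),
          isTRUE(all.equal(r2_adj, s$adj.r.squared)))
print(c(r2 = r2, r2_adj = r2_adj))
```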

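  45. Summary of Linear Correlation and Simple Linear Regression. Given: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$; Covariance $s_{xy}$. • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$, with $-1 \le r \le +1$, measures the strength of linear association. • Least Squares Regression Line $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ minimizes $SS_{Err} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = SS_{Tot} - SS_{Reg}$ (ANOVA). [Figure: scatterplot of Y against X. JAMA. 2003;290:1486-1493]

  46. Summary of Linear Correlation and Simple Linear Regression. Given: Means $\bar{x}, \bar{y}$; Variances $s_x^2, s_y^2$; Covariance $s_{xy}$. • Linear Correlation Coefficient $r = \frac{s_{xy}}{s_x s_y}$, with $-1 \le r \le +1$, measures the strength of linear association. • Least Squares Regression Line $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ minimizes $SS_{Err} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = SS_{Tot} - SS_{Reg}$ (ANOVA). • Coefficient of Determination $r^2 = \frac{SS_{Reg}}{SS_{Tot}}$: proportion of total variability modeled by the regression line’s variability. All point estimates can be upgraded to CIs for hypothesis testing, etc. [Figure: scatterplot of Y against X. JAMA. 2003;290:1486-1493]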
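
As one illustration of upgrading a point estimate to a confidence interval, the sketch below builds a 95% interval for the slope from the slide 8 data, both by hand (estimate ± t-quantile × standard error) and with `confint()`. The 95% level is an assumption chosen to match the .05 threshold used on slide 14.

```r
# Data as printed on slide 8.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

fit  <- lm(y ~ x)
coef_tab <- coef(summary(fit))
est  <- coef_tab["x", "Estimate"]
se   <- coef_tab["x", "Std. Error"]

# 95% CI: estimate +/- t quantile (residual df) times the standard error.
t_crit    <- qt(0.975, df = df.residual(fit))
ci_manual <- est + c(-1, 1) * t_crit * se

# confint() should return the same interval.
ci_builtin <- as.vector(confint(fit, "x", level = 0.95))
stopifnot(isTRUE(all.equal(ci_manual, ci_builtin)))
print(ci_manual)
```

The interval lies entirely below zero, which is consistent with rejecting a zero slope at the .05 level.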

  47. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc. Multilinear Regression (“Response = Model + Error”): “main effects”. R code example: lsreg = lm(y ~ x1 + x2 + x3)

  48. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc. Multilinear Regression (“Response = Model + Error”): “main effects”; quadratic terms, etc. (“polynomial regression”). R code example (main effects): lsreg = lm(y ~ x1 + x2 + x3). R code example (polynomial): lsreg = lm(y ~ x + I(x^2) + I(x^3)). Note that powers must be wrapped in I(): inside an R formula, ^ is a formula operator rather than arithmetic, so y ~ x + x^2 + x^3 fits only the linear term.

  49. Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc. Multilinear Regression (“Response = Model + Error”): “main effects”; quadratic terms, etc. (“polynomial regression”); “interactions”. R code example (polynomial): lsreg = lm(y ~ x + I(x^2) + I(x^3)). R code example (interaction): lsreg = lm(y ~ x1 + x2 + x1:x2), or equivalently lsreg = lm(y ~ x1*x2)
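
The formula syntax on slides 47 to 49 can be checked with a small experiment. The data below are simulated with arbitrary, hypothetical coefficients; the point is only to show which terms each formula actually fits.

```r
# Simulated (hypothetical) data with two predictors and an interaction.
set.seed(2)
n  <- 40
x1 <- runif(n, 0, 5)
x2 <- runif(n, 0, 5)
y  <- 1 + x1 - 2 * x2 + 0.5 * x1 * x2 + rnorm(n)

# Interactions: x1*x2 is shorthand for x1 + x2 + x1:x2.
fit_long  <- lm(y ~ x1 + x2 + x1:x2)
fit_short <- lm(y ~ x1 * x2)
stopifnot(isTRUE(all.equal(coef(fit_long), coef(fit_short))))

# Polynomial terms: a bare ^ is a formula operator, so x1^2 collapses to x1.
fit_bare <- lm(y ~ x1 + x1^2 + x1^3)        # fits intercept and x1 only
fit_poly <- lm(y ~ x1 + I(x1^2) + I(x1^3))  # fits a genuine cubic in x1

print(names(coef(fit_bare)))
print(names(coef(fit_poly)))
```

`poly(x1, 3, raw = TRUE)` is another way to request the same cubic terms.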
