Learn the difference between correlation and linear regression, including how to compute correlation coefficients, predict values using linear regression, and understand assumptions and sources of variation in linear regression.
Statistics for Clinicians
Biostatistics course by Kevin E. Kip, Ph.D., FAHA
Professor and Executive Director, Research Center, University of South Florida, College of Nursing
Professor, College of Public Health, Department of Epidemiology and Biostatistics
Associate Member, Byrd Alzheimer's Institute, Morsani College of Medicine
Tampa, FL, USA
SECTION 6.1 Correlation versus linear regression
Learning Outcome: Distinguish the relationship between correlation and linear regression
Correlation and regression are both measures of association. Some terms for the "association" variables:
Variable 1: "x" variable; independent variable; predictor variable; exposure variable
Variable 2: "y" variable; dependent variable; outcome variable
Correlation Coefficient Computation form: Pearson correlation ("r"):
r = Σ(xi − x̄)(yi − ȳ) / [(n − 1) sx sy]
where x̄ and ȳ are the sample means of X and Y, and sx and sy are the sample standard deviations of X and Y. The numerator measures the co-variation of X and Y.
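As a check on the formula above, here is a minimal Python sketch (the data pairs are hypothetical, not from the slides) that computes Pearson's r from the sample means and sample standard deviations:

```python
import math

# Hypothetical (x, y) pairs for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample standard deviations (n - 1 in the denominator)
sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson r: co-variation divided by (n - 1) times the product of the SDs
r = sum((xi - mean_x) * (yi - mean_y)
        for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
print(round(r, 4))
```

Because these hypothetical points lie nearly on a straight line, r comes out close to +1; r is always bounded between −1 and +1.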
Introduction to Linear Regression Like correlation, the data are pairs of independent (e.g. "X") and dependent (e.g. "Y") variables {(xi, yi): i = 1, …, n}. However, here we seek to predict values of Y from X. The fitted equation is written: ŷ = b0 + b1x, where ŷ is the predicted value of the response (e.g. blood pressure) obtained by using the equation. This equation of the line best represents the association between the independent variable and the dependent variable. The residuals are the differences between the observed and the predicted values: {(yi − ŷi): i = 1, …, n}
Introduction to Linear Regression [Scatterplot: the best-fitting line minimizes the distance between predicted and actual values; r = 0.76]
Introduction to Linear Regression ŷ = b0 + b1x
ŷ = predicted value of the response (outcome) variable
b0 = constant: the intercept (the value of ŷ when x = 0)
b1 = constant: coefficient for the slope of the regression line, the expected change in y for a one-unit change in x. Note: unlike the correlation coefficient, b1 is unbounded.
xi = value of the independent (predictor) variable for subject i
SECTION 6.2 Least squares regression and predicted values
Learning Outcomes: Describe the theoretical basis of least squares regression Calculate and interpret predicted values from a linear regression model
Introduction to Linear Regression ŷ = b0 + b1x In the above equation, the values of the slope (b1) and intercept (b0) represent the line that best predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line, i.e. minimize ∑(yi − ŷi)². This is frequently done by the method of "least squares" regression.
Least squares estimates:
b1 = r (sy / sx)
b0 = ȳ − b1x̄
Example: We wish to estimate total cholesterol level (y) from BMI (x). Assume rxy = 0.78; ȳ = 205.9, sy = 30.8; x̄ = 27.4, sx = 3.7.
b1 = 0.78 × (30.8 / 3.7) = 6.49
b0 = ȳ − b1x̄ = 205.9 − 6.49(27.4) = 28.07
The equation of the regression line is: ŷ = 28.07 + 6.49(BMI)
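The worked example above can be verified in a few lines of Python (rounding the slope to two decimals before computing the intercept, as the slide does):

```python
# Slide's worked example: estimating total cholesterol (y) from BMI (x)
r, y_bar, s_y, x_bar, s_x = 0.78, 205.9, 30.8, 27.4, 3.7

b1 = round(r * s_y / s_x, 2)       # slope, rounded as on the slide
b0 = round(y_bar - b1 * x_bar, 2)  # intercept, using the rounded slope
print(b1, b0)  # 6.49 28.07
```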
Least squares estimates: (Practice)
b1 = r (sy / sx)
b0 = ȳ − b1x̄
Example: We wish to estimate systolic blood pressure (y) from BMI (x). Assume rxy = 0.46; ȳ = 133.8, sy = 18.4; x̄ = 26.6, sx = 3.5.
b1 = r (sy / sx) =
b0 = ȳ − b1x̄ =
The equation of the regression line is: ŷ =
Least squares estimates: (Practice)
Example: We wish to estimate systolic blood pressure (y) from BMI (x). Assume rxy = 0.46; ȳ = 133.8, sy = 18.4; x̄ = 26.6, sx = 3.5.
b1 = 0.46 × (18.4 / 3.5) = 2.42
b0 = ȳ − b1x̄ = 133.8 − 2.42(26.6) = 69.43
The equation of the regression line is: ŷ = 69.43 + 2.42(BMI)
Least squares estimates: (Practice) The equation of the regression line is: y = 69.43 + 2.42(BMI) Predict systolic blood pressure for the following 3 individuals: Person 1 has BMI of 26.4 Person 2 has BMI of 28.9 Person 3 has BMI of 34.8 y1 = y2 = y3 =
Least squares estimates: (Practice) The equation of the regression line is: y = 69.43 + 2.42(BMI) Predict systolic blood pressure for the following 3 individuals: Person 1 has BMI of 26.4 Person 2 has BMI of 28.9 Person 3 has BMI of 34.8 y1 = 69.43 + 2.42(26.4) = 133.3 y2 = 69.43 + 2.42(28.9) = 139.4 y3 = 69.43 + 2.42(34.8) = 153.6
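The three predictions above can be reproduced directly from the fitted equation:

```python
# Fitted model from the practice example: SBP predicted from BMI
b0, b1 = 69.43, 2.42

def predict_sbp(bmi):
    """Predicted systolic blood pressure for a given BMI."""
    return round(b0 + b1 * bmi, 1)

for bmi in (26.4, 28.9, 34.8):
    print(bmi, predict_sbp(bmi))
```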
SECTION 6.3 Assumptions and sources of variation in linear regression
Learning Outcomes: Describe the assumptions required for valid use of the linear regression model Describe the partitioning of sum of squares in the linear regression model
Introduction to Linear Regression Some assumptions for linear regression:
• The dependent variable Y has a linear relationship to the independent variable X. This includes checking whether the dependent variable is approximately normally distributed.
• Independence of the errors (no serial correlation)
[Scatterplot with fitted line: Y = 90.681 + 0.945(age); r = 0.597]
Fundamental Equations for Regression Coefficient of determination (r²): the proportion of variation in Y "explained" by the regression on X:
R² = explained variation / total variation = SSR / SST = 1 − SSE / SST
Example: Fundamental Equations for Regression [Scatterplot: r = 0.42; fitted line ŷ = b0 + b1x = 9.545 + 0.477(x)]
Example: Fundamental Equations for Regression ŷ = 9.545 + 0.477(x)
SST = 132, dfT = 11
SSR = 23, dfR = 1
SSE = 109, dfE = 10
R² = SSR / SST = 23 / 132 ≈ 0.17
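The partition SST = SSR + SSE can be verified numerically. A minimal sketch with hypothetical data (not the slide's 12 observations), fitting the least squares line and then splitting the total sum of squares:

```python
# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.5, 4.0, 6.5, 7.0, 8.5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least squares slope and intercept
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - my) ** 2 for yi in y)                  # total variation
SSR = sum((yh - my) ** 2 for yh in yhat)               # explained by regression
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual variation
R2 = SSR / SST
print(round(R2, 3))
```

For any least squares fit, SST equals SSR + SSE (up to floating-point error), and R² lies between 0 and 1.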
Practice: Fundamental Equations for Regression ŷ = 17.17247 − 0.53707(x) Complete the entries in the table below to determine SST, SSR, SSE, and R²:
SST = _____, dfT = ____
SSR = _____, dfR = ____
SSE = _____, dfE = ____
R² = SSR / SST = _____
Practice: Fundamental Equations for Regression ŷ = 17.17247 − 0.53707(x)
SST = 80.5, dfT = 9
SSR = 19.1, dfR = 1
SSE = 61.4, dfE = 8
R² = SSR / SST = 19.1 / 80.5 ≈ 0.24
SECTION 6.4 Multiple linear regression model
Learning Outcome: Calculate and interpret predicted values from the multiple regression model
Multiple Linear Regression
• Extension of simple linear regression to assess the association between 2 or more independent variables and a single continuous dependent variable.
• The multiple linear regression equation is: ŷ = b0 + b1x1 + b2x2 + … + bpxp
• Each regression coefficient represents the change in y relative to a one-unit change in the respective independent variable, holding the remaining independent variables constant.
• The R² from the multiple linear regression model represents the percentage of variation in the dependent variable "explained" by the set of predictors.
Multiple Linear Regression Example: Predictors of systolic blood pressure: ŷ = 68.15 + 0.58(BMI) + 0.65(age) + 0.94(male) + 6.44(tx-hypertension)
Practice: Estimate systolic blood pressure for the following persons: Person 1: BMI=27.9; age=54; female; on treatment for hypertension Person 2: BMI=34.9; age=66; male; on treatment for hypertension Person 3: BMI=24.8; age=47; female; not on treatment for hypertension y1 = y2 = y3 =
Practice: Estimate systolic blood pressure for the following persons: Person 1: BMI=27.9; age=54; female; on treatment for hypertension Person 2: BMI=34.9; age=66; male; on treatment for hypertension Person 3: BMI=24.8; age=47; female; not on treatment for hypertension y1 = 68.15 + 0.58(27.9) + 0.65(54) + 0.94(0) + 6.44(1) = 125.9 y2 = 68.15 + 0.58(34.9) + 0.65(66) + 0.94(1) + 6.44(1) = 138.7 y3 = 68.15 + 0.58(24.8) + 0.65(47) + 0.94(0) + 6.44(0) = 113.1
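The three predictions can be reproduced from the slide's fitted multiple regression model:

```python
# Multiple regression model for systolic BP from the slide
def predict_sbp(bmi, age, male, tx_htn):
    """male and tx_htn are 0/1 indicator variables."""
    return round(68.15 + 0.58 * bmi + 0.65 * age
                 + 0.94 * male + 6.44 * tx_htn, 1)

print(predict_sbp(27.9, 54, 0, 1))  # Person 1: female, on treatment
print(predict_sbp(34.9, 66, 1, 1))  # Person 2: male, on treatment
print(predict_sbp(24.8, 47, 0, 0))  # Person 3: female, not on treatment
```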
Framingham Risk Calculation (10-Year Risk):
Dependent variable: 10-year risk of CVD
Independent variables: age, gender, total cholesterol, HDL cholesterol, smoker, systolic BP, on medication for BP
http://hp2010.nhlbihin.net/atpiii/calculator.asp
SECTION 6.5 SPSS for linear regression analysis
Learning Outcome: Analyze and interpret linear regression models using SPSS
SPSS: Analyze → Regression → Linear
Specify the Dependent Variable and Independent Variable(s).
Statistics: Estimates, Confidence intervals, Model fit, Partial correlations, Descriptives
Example: Dependent variable: HDL Cholesterol; Independent variable: BMI
SPSS: Analyze → Regression → Linear
Specify the Dependent Variable and Independent Variable(s).
Statistics: Estimates, Confidence intervals, Model fit, Partial correlations, Descriptives
Example: Dependent variable: HDL Cholesterol; Independent variable(s): BMI, gender (1=male, 2=female)
SPSS: Analyze → Regression → Linear
Specify the Dependent Variable and Independent Variable(s).
Statistics: Estimates, Confidence intervals, Model fit, Partial correlations, Descriptives
Example: Dependent variable: HDL Cholesterol; Independent variable(s): BMI, gender, age
Practice: Estimate HDL cholesterol levels for the following persons: Person 1: BMI=25.7; female; age=60 Person 2: BMI=36.9; male; age=66 Person 3: BMI=31.8; female; age=51 y1 = y2 = y3 =
Practice: Estimate HDL cholesterol levels for the following persons (gender entered as an indicator with female = 1, male = 0): Person 1: BMI=25.7; female; age=60 Person 2: BMI=36.9; male; age=66 Person 3: BMI=31.8; female; age=51 y1 = 43.026 – 0.464(25.7) + 10.735(1) + 0.166(60) = 51.8 y2 = 43.026 – 0.464(36.9) + 10.735(0) + 0.166(66) = 36.9 y3 = 43.026 – 0.464(31.8) + 10.735(1) + 0.166(51) = 47.5
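These calculations can be checked in Python. The coefficients come from the SPSS output above; the computations on the slide imply gender was entered as a female indicator (female = 1, male = 0):

```python
# Fitted HDL cholesterol model from the SPSS example
def predict_hdl(bmi, female, age):
    """female is a 0/1 indicator (female = 1, male = 0)."""
    return round(43.026 - 0.464 * bmi + 10.735 * female + 0.166 * age, 1)

print(predict_hdl(25.7, 1, 60))  # Person 1
print(predict_hdl(36.9, 0, 66))  # Person 2
print(predict_hdl(31.8, 1, 51))  # Person 3
```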