Simple Linear Regression: An Introduction
Dr. Tuan V. Nguyen
Garvan Institute of Medical Research, Sydney
Give a man three weapons – correlation, regression and a pen – and he will use all three (Anon, 1978)
An example
Age and cholesterol levels in 18 individuals:
ID   Age   Chol (mg/100ml)
 1    46   3.5
 2    20   1.9
 3    52   4.0
 4    30   2.6
 5    57   4.5
 6    25   3.0
 7    28   2.9
 8    36   3.8
 9    22   2.1
10    43   3.8
11    57   4.1
12    33   3.0
13    22   2.5
14    63   4.6
15    40   3.2
16    48   4.2
17    28   2.3
18    49   4.0
Read data into R
# Enter the data as vectors, then plot cholesterol against age
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch=16)
Questions of interest
• Is there an association between age and cholesterol levels?
• How strong is the association?
• Can cholesterol be predicted for a given age?
These questions are addressed by correlation and regression analysis.
Variance and covariance: algebra
• Let x and y be two random variables from a sample of n observations.
• Measure of variability of x and y: the variance
  var(x) = Σ(xi – x̄)² / (n – 1)
• Measure of covariation between x and y: the covariance
  cov(x, y) = Σ(xi – x̄)(yi – ȳ) / (n – 1)
• Algebraically:
  • If x and y are independent: var(x + y) = var(x) + var(y)
  • In general: var(x + y) = var(x) + var(y) + 2cov(x, y)
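As a quick numerical check of the identity above, reusing the age and chol vectors read in earlier (a sketch; any two numeric vectors would do):

var(age + chol)                            # left-hand side
var(age) + var(chol) + 2*cov(age, chol)    # right-hand side: identical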
Variance and covariance: geometry
• The independence or dependence between x and y can be represented geometrically as two sides of a triangle, with angle H between them and hypotenuse h:
  • Dependent: h² = x² + y² – 2xy·cos(H)
  • Independent (H = 90°): h² = x² + y²
Meaning of variance and covariance
• Variance is always positive.
• If covariance = 0, x and y are uncorrelated (no linear association).
• Covariance is a sum of cross-products: it can be positive or negative.
• Negative covariance = deviations in the two distributions are in opposite directions, e.g. genetic covariation.
• Positive covariance = deviations in the two distributions are in the same direction.
• Covariance is a measure of the strength of association.
Covariance and correlation
• Covariance is unit-dependent.
• The coefficient of correlation (r) between x and y is a standardized covariance.
• r is defined by:
  r = cov(x, y) / [SD(x) × SD(y)] = Σ(xi – x̄)(yi – ȳ) / √[Σ(xi – x̄)² × Σ(yi – ȳ)²]
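The defining relation can be verified directly on the example data with the base R functions cov, cor and sd (a minimal sketch):

cov(age, chol)                         # unit-dependent covariance
cor(age, chol)                         # coefficient of correlation r
cov(age, chol) / (sd(age)*sd(chol))    # r as a standardized covariance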
Positive and negative correlation
[Two scatter plots illustrating r = 0.9 (positive) and r = –0.9 (negative)]
Test of hypothesis of correlation
• Hypothesis: Ho: ρ = 0 versus H1: ρ ≠ 0.
• Standard error of r: SE(r) = √[(1 – r²) / (n – 2)]
• The t-statistic: t = r / SE(r)
• This statistic has a t distribution with n – 2 degrees of freedom.
• Fisher's z-transformation: z = 0.5 × ln[(1 + r) / (1 – r)]
• Standard error of z: SE(z) = 1 / √(n – 3)
• Then a 95% CI of z can be constructed as: z ± 1.96 / √(n – 3)
An illustration of correlation analysis
Using the age and cholesterol data from the earlier table (n = 18):
  Mean age 38.83 (SD 13.60); mean cholesterol 3.33 (SD 0.84); cov(x, y) = 10.68
  r = cov(x, y) / [SD(x) × SD(y)] = 10.68 / (13.60 × 0.84) = 0.94
  t-statistic = 0.94 × √16 / √(1 – 0.94²) = 10.7
  Critical t-value with 16 df and alpha = 5% is 2.12
Conclusion: there is a significant association between age and cholesterol.
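The same test can be run in base R with cor.test, which also reports the Fisher z-based 95% confidence interval (a sketch; the values match the regression output shown later):

cor.test(age, chol)    # r = 0.94, t = 10.7 on 16 df, p < 0.001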
Simple linear regression analysis
• Only two variables are of interest: one response variable and one predictor variable.
• No adjustment is needed for confounding or covariates.
• Uses:
  • Assessment: quantify the relationship between the two variables
  • Prediction: make predictions and validate a test
  • Control: adjust for confounding effects (in the case of multiple variables)
Linear regression: model
• Y: random variable representing a response
• X: random variable representing a predictor variable (predictor, risk factor)
• Both Y and X can be categorical (e.g. yes/no) or continuous (e.g. age).
• If Y is categorical, the model is a logistic regression model; if Y is continuous, it is a simple linear regression model.
• Model: Y = α + βX + ε
  α: intercept
  β: slope / gradient
  ε: random error (variation between subjects in Y even if X is constant, e.g. variation in cholesterol for patients of the same age)
Linear regression: assumptions
• The relationship is linear in terms of the parameters;
• X is measured without error;
• The values of Y are independent of each other (e.g. Y1 is not correlated with Y2);
• The random error term (ε) is normally distributed with mean 0 and constant variance.
Expected value and variance
• If the assumptions are tenable, then:
• The expected value of Y is: E(Y | x) = α + βx
• The variance of Y is: var(Y) = var(ε) = σ²
Estimation of model parameters
Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points:
  Gradient: m = (y2 – y1) / (x2 – x1) = dy / dx
  Equation: y = mx + a, where a is the intercept on the y-axis
What happens if we have more than 2 points?
Estimation of α and β
• For a series of pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
• Let a and b be sample estimates of the parameters α and β,
• We have a sample equation: Y* = a + bx
• Aim: find the values of a and b so that the difference between Y and Y* is minimal.
• Let SSE = Σ(Yi – a – bxi)².
• The values of a and b that minimise SSE are called least squares estimates.
Criteria of estimation
[Scatter plot of Chol (y-axis) against Age (x-axis), showing the vertical distance d between each observed yi and the fitted line]
The goal of the least squares estimator (LSE) is to find a and b such that the sum of d² is minimal.
Estimation of a and b
• After some calculus, the results can be shown to be:
  b = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)² = cov(x, y) / var(x)
  a = ȳ – b·x̄
• When the regression assumptions are valid, the estimators a and b have the following properties:
  • Unbiased
  • Uniformly minimal variance (i.e. efficient)
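These formulas can be applied directly in R and checked against lm (a minimal sketch, reusing the age and chol data):

b <- cov(age, chol) / var(age)    # slope: cov(x, y) / var(x)
a <- mean(chol) - b*mean(age)     # intercept: y-bar - b * x-bar
c(a = a, b = b)                   # should match coef(lm(chol ~ age))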
Goodness-of-fit
• Now we have the equation Y = a + bX + e.
• Question: how well does the regression equation describe the actual data?
• Answer: the coefficient of determination (R²): the proportion of the variation in Y that is explained by the variation in X.
Partitioning of variations: concept
• SST = sum of squared differences between yi and the mean of y.
• SSR = sum of squared differences between the predicted values of y and the mean of y.
• SSE = sum of squared differences between the observed and predicted values of y.
SST = SSR + SSE
The coefficient of determination is: R² = SSR / SST
Partitioning of variations: geometry
[Scatter plot of Chol (Y) against Age (X) with the fitted line and the mean of Y, illustrating SST, SSR and SSE as vertical distances]
Partitioning of variations: algebra
• Some statistics:
  • Total variation: SST = Σ(yi – ȳ)²
  • Attributed to the model: SSR = Σ(ŷi – ȳ)²
  • Residual sum of squares: SSE = Σ(yi – ŷi)²
• SST = SSR + SSE
• SSR = SST – SSE
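The partition is easy to verify in R once the model has been fitted (a sketch; the model is fitted here for self-containment, as on the R slide below):

reg <- lm(chol ~ age)
SST <- sum((chol - mean(chol))^2)          # total variation
SSR <- sum((fitted(reg) - mean(chol))^2)   # attributed to the model
SSE <- sum(residuals(reg)^2)               # residual sum of squares
c(SST = SST, SSR = SSR, SSE = SSE)         # SST = SSR + SSE
SSR / SST                                  # R-squared, about 0.88 here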
Analysis of variance
• SS increases in proportion to sample size (n).
• Mean squares (MS) normalise for degrees of freedom (df):
  • MSR = SSR / p (where p = number of predictors)
  • MSE = SSE / (n – p – 1)
  • MST = SST / (n – 1)
• Analysis of variance (ANOVA) table:
  Source     df           SS    MS    F
  Model      p            SSR   MSR   MSR / MSE
  Residual   n – p – 1    SSE   MSE
  Total      n – 1        SST
Hypothesis tests in regression analysis
• Now we have:
  Sample data: Y = a + bX + e
  Population: Y = α + βX + ε
• Ho: β = 0. There is no linear association between the outcome and the predictor variable.
• In lay language: "if there were truly no association, what would be the chance of observing sample data at least as inconsistent with the null hypothesis as the data we observed?"
Inference about slope (parameter β)
• Recall that ε is assumed to be normally distributed with mean 0 and variance σ².
• The estimate of σ² is MSE (i.e. s²).
• It can be shown that:
  • The expected value of b is β, i.e. E(b) = β,
  • The standard error of b is: SE(b) = s / √Σ(xi – x̄)²
• Then the test of whether β = 0 is: t = b / SE(b), which follows a t-distribution with n – 2 degrees of freedom.
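The standard error and t-statistic of the slope can be reproduced by hand (a sketch; compare with the summary(reg) output shown later):

reg  <- lm(chol ~ age)                          # as fitted earlier
n    <- length(age)
MSE  <- sum(residuals(reg)^2) / (n - 2)         # estimate of sigma^2
se.b <- sqrt(MSE / sum((age - mean(age))^2))    # SE(b)
coef(reg)["age"] / se.b                         # t = b / SE(b), 16 df here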
Confidence interval around a predicted value
• Observed value is yi; predicted value is ŷi = a + bxi.
• The standard error of the predicted mean value at x0 is:
  SE(ŷ) = s × √[1/n + (x0 – x̄)² / Σ(xi – x̄)²]
• Interval estimate for the predicted value: ŷ ± t(n – 2) × SE(ŷ)
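In R, predict.lm produces these intervals directly (a minimal sketch; the ages 30 and 50 are hypothetical illustration values; interval = "confidence" is for the mean response, interval = "prediction" for a new individual):

new <- data.frame(age = c(30, 50))                      # hypothetical ages
predict(reg, newdata = new, interval = "confidence")    # mean response
predict(reg, newdata = new, interval = "prediction")    # individual response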
Checking assumptions
• Assumption of constant variance
• Assumption of normality
• Correctness of functional form
• Model stability
All of these checks can be conducted graphically. The residuals from the model, or a function of the residuals, play an important role in all of the model diagnostic procedures.
Checking assumptions
• Assumption of constant variance
  • Plot the studentized residuals versus the fitted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
• Assumption of normality
  • Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, they should fall along a 45° line.
• Correct functional form?
  • Plot the residuals versus the fitted values. Examine the plot for evidence of a non-linear trend in the residuals across the range of fitted values.
• Model stability
  • Check whether one or more observations are influential. Use Cook's distance.
Checking assumptions (Cont)
• Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.
• Leverage is a measure of how extreme the value of xi is relative to the remaining values of x.
• The studentized residual provides a measure of how extreme the value of yi is relative to the remaining values of y.
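All three measures are available in base R from the fitted model (a sketch):

cooks.distance(reg)   # influence of each observation on the fitted values
hatvalues(reg)        # leverage: how extreme each x value is
rstudent(reg)         # studentized residuals: how extreme each y value is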
Remedial measures
• Non-constant variance
  • Transforming the response variable (y) to a new scale (e.g. logarithm) is often helpful.
  • If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.
• Non-normality
  • Non-normality and non-constant variance often go hand in hand.
• Outliers
  • Check for accuracy.
  • Use a robust estimator.
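A minimal sketch of the transformation remedy (illustrative only; whether it helps depends on the data):

reg.log <- lm(log(chol) ~ age)               # refit on the log scale
plot(fitted(reg.log), rstudent(reg.log))     # re-check the residual spread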
Regression analysis using R
id   <- 1:18
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
# Fit the linear regression model
reg <- lm(chol ~ age)
summary(reg)
ANOVA result > anova(reg) Analysis of Variance Table Response: chol Df Sum Sq Mean Sq F value Pr(>F) age 1 10.4944 10.4944 114.57 1.058e-08 *** Residuals 16 1.4656 0.0916 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results of R analysis > summary(reg) Call: lm(formula = chol ~ age) Residuals: Min 1Q Median 3Q Max -0.40729 -0.24133 -0.04522 0.17939 0.63040 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.089218 0.221466 4.918 0.000154 *** age 0.057788 0.005399 10.704 1.06e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.3027 on 16 degrees of freedom Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698 F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
Diagnostics: influential data
# Show the four standard diagnostic plots in a 2 x 2 grid
par(mfrow=c(2,2))
plot(reg)
A non-linear illustration: BMI and sexual attractiveness
• Study of 44 university students
• Measured: body mass index (BMI) and a sexual attractiveness (SA) score
id  <- 1:44
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
Linear regression analysis of BMI and SA
reg <- lm(sa ~ bmi)
summary(reg)
Residuals: Min 1Q Median 3Q Max -2.54204 -0.97584 0.05082 1.16160 2.70856
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.92512 0.64489 7.637 1.81e-09 *** bmi -0.05967 0.02862 -2.084 0.0432 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.354 on 42 degrees of freedom Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218 F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323
BMI and SA: analysis of residuals
plot(reg)
BMI and SA: a simple plot
par(mfrow=c(1,1))      # back to a single plotting panel
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch=16)
abline(reg)
Re-analysis of sexual attractiveness data
# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad   <- lm(sa ~ poly(bmi, 2))
cubic  <- lm(sa ~ poly(bmi, 3))
# Make a new BMI axis
bmi.new <- 10:40
# Get predicted values
quad.pred  <- predict(quad, data.frame(bmi=bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi=bmi.new))
# Plot predicted values on the scatter plot from the previous slide
abline(linear)
lines(bmi.new, quad.pred, col="blue", lwd=3)
lines(bmi.new, cubic.pred, col="red", lwd=3)
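Since the three models are nested, they can be compared formally with an F-test (a sketch; anova tests whether each added polynomial term significantly improves the fit):

anova(linear, quad, cubic)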
Some comments: Interpretation of correlation
• Correlation lies between –1 and +1. A very small correlation does not mean that there is no association between the two variables: the relationship may be non-linear.
• For curvilinear relationships, a rank correlation is better than Pearson's correlation.
• A small correlation (e.g. 0.1) may be statistically significant but clinically unimportant.
• R² is another measure of the strength of association. An r = 0.7 may sound impressive, but R² is only 0.49!
• Correlation does not mean causation.
Some comments: Interpretation of correlation (Cont)
• Be careful with multiple correlations. For p variables, there are p(p – 1)/2 possible pairs of correlations, and false positives are a problem.
• The correlation between two variables cannot be inferred from their correlations with a third variable:
  • r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not mean that r(age, fat) is near zero.
  • In fact, r(age, fat) = 0.79.
Some comments: Interpretation of regression
• The fitted regression line is only an estimate of the relation between these variables in the population.
• There is uncertainty associated with the estimated parameters.
• The regression line should not be used to predict the response for x values outside the range of the observed data.
• A statistical model is an approximation; the "true" relation may be non-linear, but a linear model is often a reasonable approximation.
Some comments: Reporting results
• Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformation, checking of assumptions, etc.
• The regression coefficients (a, b), their associated standard errors, and R² are useful summaries.
Some final comments • Equations are the cornerstone on which the edifice of science rests. • Equations are like poems, or even an onion. • So, be careful with your building of equations!