STA 106: Correlation and Linear Regression
Lecturer: Dr. Daisy Dai, Department of Medical Research
Contents
• Correlation
• Regression
  • Simple Regression
  • Multiple Regression
What is correlation?
• Correlation and linear regression are techniques for assessing the relationship between two or more continuous variables.
• In correlation we look for a linear association between two variables; the strength of the association is summarized by the correlation coefficient (r) or the coefficient of determination (r²).
Case Study: Anemia in Women
• A survey was conducted on a sample of 20 women with anemia, randomly selected from a pre-defined geographical area. Each participant had a blood sample taken, and her hemoglobin (Hb) level and packed cell volume (PCV) were measured. Participants were also asked their age and whether or not they had experienced the menopause.
• The goals of the study were to determine whether Hb predicts PCV (or the other way around) and whether Hb is associated with age.
Correlation Coefficient
• The Pearson product-moment correlation coefficient, also written r, R, or Pearson's r, measures the strength of the linear relationship between two variables. It is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations, as written out below.
Karl Pearson (1857–1936)
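In symbols, the standard definition is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{s_{xy}}{s_x \, s_y}
```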
Some properties of the Correlation Coefficient
• The Pearson correlation coefficient r (sample or population) ranges between −1 and 1.
• The absolute value of r measures the strength of the correlation.
• The sign of r gives the direction of the relationship: for r > 0 the two variables change in the same direction; for r < 0 the two variables are inversely related.
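A two-line numpy illustration of the bounds and the sign (np.corrcoef computes Pearson's r; the data are made-up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.corrcoef(x,  2 * x + 1)[0, 1])  # prints  1.0: perfect positive linear relation
print(np.corrcoef(x, -2 * x + 1)[0, 1])  # prints -1.0: perfect inverse linear relation
```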
Coefficient of Determination
• The coefficient of determination, r², is the proportion of variation in the observed values of the response variable that is explained by the regression. Coefficient of determination (r²) = square of the correlation coefficient (r).
• The coefficient of determination always lies between 0 and 1 and is a descriptive measure of the utility of the regression equation for making predictions. A value of r² near 0 indicates that the regression equation is not very useful for making predictions, whereas a value near 1 indicates that it is extremely useful.
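For example, r = 0.9 gives r² = 0.81: the regression then accounts for 81% of the variation in the observed values of the response variable, leaving 19% unexplained.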
What is Regression?
• Regression methods identify associations between an outcome variable and explanatory variables, so that the value of the outcome variable can be predicted from the values of the explanatory variables.
• The outcome variable, also called the dependent variable, appears on the left side of a regression model; the explanatory variable(s), also called independent variables, appear on the right side. Outcome variable ~ explanatory variable(s), e.g. birth weight = 0.2 + 0.4 * gestational age.
• The relationship is summarized by a regression equation consisting of an intercept and a slope. The intercept is the constant term; the slope is the change in the outcome variable per unit change in the explanatory variable. A minimal fitting sketch follows below.
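As a minimal sketch of fitting such an equation in Python with numpy (the gestational-age data below are made-up placeholders, not data from this lecture):

```python
import numpy as np

# Hypothetical illustration: gestational age (weeks) and birth weight (kg).
# These numbers are placeholders, not data from the lecture.
age = np.array([36.0, 37.0, 38.0, 39.0, 40.0, 41.0])
weight = np.array([2.6, 2.9, 3.1, 3.2, 3.5, 3.6])

# np.polyfit with degree 1 returns the least-squares slope and intercept.
slope, intercept = np.polyfit(age, weight, 1)
print(f"weight = {intercept:.2f} + {slope:.2f} * age")

# Predict the outcome for a new value of the explanatory variable.
print("predicted weight at 38.5 weeks:", intercept + slope * 38.5)
```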
The following terms are used interchangeably:
• Outcome variable / dependent variable. These are the phenomena whose variation we want to explain and predict, for instance response to treatment.
• Explanatory variable / independent variable / risk factor. These are the variables that can be used to explain the variation in the outcome variables, for instance demographics, environmental factors, genetic factors, or a medical or educational intervention.
Case Study: Orion Cars
• To find the association between the age and price of Orion cars, and to predict price from age, the ages and prices of 11 randomly sampled Orions were recorded, as listed in the following table.
Results
• Describe the apparent relationship between age and price of Orions: because the slope of the regression line is negative, price tends to decrease as age increases.
• Interpret the slope of the regression line in terms of prices for Orions: Orions depreciate an estimated $2026 per year, at least in the 2- to 7-year-old range.
• Use the regression equation to predict the price of a 3-year-old Orion and a 4-year-old Orion (see the sketch below).
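A hedged sketch of the prediction step: the slides quote only the slope (about −$2026 per year), so the intercept below is a made-up placeholder and the printed prices are illustrative only.

```python
# Hypothetical sketch of the prediction step. The slope matches the
# depreciation quoted above (-$2026/year); the intercept is a made-up
# placeholder, so the printed prices are illustrative only.
intercept = 20_000.0   # assumed price of a new (age 0) Orion, in dollars
slope = -2026.0        # estimated change in price per year of age

def predict_price(age_years: float) -> float:
    """Plug an age into the fitted regression equation."""
    return intercept + slope * age_years

for age in (3, 4):
    print(f"predicted price of a {age}-year-old Orion: ${predict_price(age):,.0f}")
```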
Simple vs. Multiple Regression
• A regression involving one independent variable is called simple linear regression: outcome variable ~ one explanatory variable, y = a + b * x + error, where a is the intercept and b is the slope. When b = 0, y does not depend on x (i.e., x and y are not correlated); when b > 0, x and y have a positive relationship; when b < 0, x and y have a negative (inverse) relationship. Example: height = 0.2 + 0.4 * weight.
• A regression involving a set of independent variables is called multiple regression: outcome variable ~ a set of explanatory variables, y = a + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + … + error. Example: weight = 0.2 + 0.4 * height + 0.3 * age. A minimal multiple-regression sketch appears below.
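A minimal multiple-regression sketch using numpy's least-squares solver (all numbers below are made-up placeholders):

```python
import numpy as np

# Hypothetical data: weight (outcome) with height and age as predictors.
height = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 175.0])
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 50.0])
weight = np.array([55.0, 60.0, 64.0, 68.0, 75.0, 72.0])

# Design matrix: a column of ones for the intercept, then one column
# per explanatory variable.
X = np.column_stack([np.ones_like(height), height, age])

# Least-squares solution to X @ coef ~= weight.
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
a, b1, b2 = coef
print(f"weight = {a:.2f} + {b1:.2f} * height + {b2:.2f} * age")
```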
The Regression Equation
• Least-squares criterion: the straight line that best fits a set of data points is the one having the smallest possible sum of squared errors.
• Regression line: the straight line that best fits a set of data points according to the least-squares criterion.
• Regression equation: the equation of the regression line.
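In symbols, the least-squares criterion chooses the intercept a and slope b that minimize the sum of squared errors, which has the standard closed-form solution:

```latex
\min_{a,b} \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\quad\Rightarrow\quad
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad a = \bar{y} - b \, \bar{x}
```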
Sum of Squares in Regression
• Total sum of squares, SST: the variation in the observed values of the response variable, SST = Σ(yᵢ − ȳ)².
• Regression sum of squares, SSR: the variation in the observed values of the response variable explained by the regression, SSR = Σ(ŷᵢ − ȳ)².
• Error sum of squares, SSE: the variation in the observed values of the response variable not explained by the regression, SSE = Σ(yᵢ − ŷᵢ)².
The three sums of squares, SST, SSR, and SSE, can be obtained from the following computing formulas, where Sxx = Σxᵢ² − (Σxᵢ)²/n, Syy = Σyᵢ² − (Σyᵢ)²/n, and Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n:
• Total sum of squares: SST = Syy
• Regression sum of squares: SSR = Sxy² / Sxx
• Error sum of squares: SSE = Syy − Sxy² / Sxx
• Regression identity: SST = SSR + SSE
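A quick numerical check of the regression identity, using the defining formulas from the previous slide on placeholder data:

```python
import numpy as np

# Placeholder data for checking the regression identity numerically.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x          # fitted (predicted) values

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the line
sse = np.sum((y - y_hat) ** 2)         # unexplained (residual) variation

print(sst, ssr + sse)                  # equal up to rounding: SST = SSR + SSE
print("r^2 =", ssr / sst)              # coefficient of determination
```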
Case Study: Anemia in Women (revisited)
• A random sample of 20 women with anemia from a pre-defined geographical area was investigated by a survey. Each woman had a blood sample taken and her hemoglobin (Hb) level and packed cell volume (PCV) measured; participants were also asked their age and whether or not they had experienced the menopause.
• The goal of the study is to determine whether Hb predicts PCV or the other way around.
Outliers
• An outlier is a point that lies far from the regression line. Such points may represent measurement error, or may indicate heterogeneity in sampling.
• An outlier may skew the direction of the regression line and inflate the variation in the data.
• Outliers may need to be removed from the analysis.
Influential Observations
• Influential observations are points that lie far from the rest of the data in the horizontal direction.
• Influential observations may have a significant impact on the slope of the regression line.
• One needs to compare the model fitted with the influential observations against the model fitted without them, and identify the reasons for the influential observations.
• Then decide whether the influential points need to be removed from the analysis.
Residuals
• A residual is the discrepancy between an observed value and the corresponding predicted value.
• A residual plot is a useful diagnostic tool for checking model assumptions and detecting outliers (see the sketch below).
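A minimal residual-plot sketch with numpy and matplotlib (placeholder data; under the model assumptions the residuals should scatter without pattern around zero):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; a residual plot should show no pattern if the
# linear-model assumptions hold.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9, 12.2])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)   # observed minus predicted

plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")          # reference line at zero
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residual plot: look for curvature, fanning, or outliers")
plt.show()
```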
Extrapolation
• Whenever a linear regression model is fit to a group of data, the range of the data should be noted carefully. Attempting to use the regression equation to predict values outside of this range is often inappropriate and may yield implausible answers.
• Consider, for example, a linear model that relates weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd, since the relationship between age and weight gain is not consistent across all age groups.
Correlation is not causation
• One of the most common errors in the medical literature is to assume that simply because two variables are correlated, one must cause the other. Amusing examples include the positive correlation between the mortality rate in Victorian England and the number of Church of England marriages, and the negative correlation between monthly deaths from ischemic heart disease and monthly ice-cream sales. In each case the fallacy is obvious because all the variables are time-related: in the former example, both the mortality rate and the number of Church of England marriages went down during the 19th century; in the latter, deaths from ischemic heart disease are higher in winter, when ice-cream sales are at their lowest. It is always worth trying to think of other variables, confounding factors, that may be related to both of the variables under study.
Points to consider when performing correlation or regression
• Plot the data to see whether the relationship is likely to be linear.
• Are the variables normally distributed? If not, consider a transformation of the variables or switching to other models.
• Correlation does not necessarily imply causation.
• Think about confounding factors: if a significant correlation is obtained and causation inferred, could there be a third, unmeasured factor that is jointly correlated with the other two and so accounts for their association?
• If a scatter plot is given to support a linear regression, is the variability of the points about the line roughly the same over the range of the independent variable? If not, some transformation of the variables may be necessary before computing the regression line.
• If predictions are given, are any made outside the range of the observed values of the independent variable?
• Check for outliers and decide whether they should be removed from the analysis.
Software
• An open-source correlation-coefficient calculator: http://www.easycalculation.com/statistics/correlation.php
• We will offer an SPSS workshop on correlation, linear, and logistic regression analysis in April.
References
• Designing Clinical Research, 3rd edition, by Hulley et al.
• Medical Statistics by Campbell et al.