
Presentation Transcript


  1. BMS 617 Lecture 12: Multiple and Logistic Regression Marshall University Genomics Core Facility

  2. Multiple Regression • In linear regression, we had one independent variable, and one dependent (outcome) variable • In lab experiments, this is fairly common • The investigator manipulates the value of one variable and keeps everything else the same • In some lab experiments, and in most observational studies, there is more than one independent variable • Multiple Regression is used for these scenarios • "Multiple Regression" really refers to a collection of different techniques

  3. Aims of Multiple Regression • Quantifying the effect of one variable of interest while adjusting for the effects of other variables • Very common in observational studies • The other variables change outside of the control of the investigator • These other variables are often called covariates • Creating an equation which is useful for predicting the value of the outcome variable given the values of the various independent variables • For example, predict the probability of cancer recurrence after surgery alone given characteristics of the tumor (grade, stage, etc.) and of the patient (age, height, weight, etc.) • Might be used to decide whether or not to use chemotherapy in addition to surgery • Developing a scientific understanding of the impact of several variables on the outcome

  4. Types of Multiple Regression • We will look at the following types of multiple regression (there are many others): • Multiple Linear Regression • The dependent variable is a linear function of the independent variables • Logistic Regression • The outcome variable is binary (dichotomous, or categorical with two possible outcomes) • The log odds ratio of the outcome is modeled as a function of the independent variables • Later in the course we will study survival curves, and look at multiple regression in this context

  5. Multiple Linear Regression • Multiple Linear Regression finds the linear equation which best predicts an outcome variable, Y, from multiple independent variables X1, X2, …, Xk • Example (from Motulsky): Lead Exposure and Kidney Function • Staessen et al. (1992) investigated the relationship between lead concentration in the blood and kidney function • Kidney function measured by creatinine clearance • Observational study of 965 men • Naive approach would be to measure lead concentration and creatinine clearance and analyze just the two variables • However, kidney function is known to decrease with age, and lead accumulates in the blood over time • Age is a confounding variable • Must account for this

  6. Multiple Regression Model The model Staessen et al. used was Yi = β0 + β1Xi,1 + β2Xi,2 + β3Xi,3 + β4Xi,4 + β5Xi,5 + εi where Yi is the creatinine clearance for subject i, Xi,1 is the log of the blood lead concentration, Xi,2 through Xi,4 are further covariates (including age), Xi,5 indicates previous diuretic use (0 = no, 1 = yes), and εi is the random error term
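As a rough sketch of how a model of this form can be fit in Python (statsmodels is one common choice; the file name and the column names log_lead, age, body_mass, log_ggt, and diuretics are hypothetical stand-ins for the study's variables):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data file: one row per subject, one column per variable.
df = pd.read_csv("lead_study.csv")

# Creatinine clearance modeled as a linear function of five predictors;
# the intercept (beta_0) is included automatically by the formula API.
model = smf.ols(
    "creatinine_clearance ~ log_lead + age + body_mass + log_ggt + diuretics",
    data=df,
).fit()
print(model.summary())  # estimates, 95% CIs, p-values, and R2
```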

  7. Multiple Regression Parameters • The β terms in the model equation are the parameters of the model • Do not vary from data point to data point • Are values associated with the population • Will be estimated from the data • Note that one of the variables (Xi,5) is categorical, and we use a “dummy variable” in its place

  8. What multiple regression does • Multiple linear regression finds values for the parameters that make the model predict the actual data as well as possible • Estimates for β0, …, β5 are usually denoted b0, …, b5 • Software performing the regression will report the best estimate for each parameter, a confidence interval and p-value for each estimate, and an R2 value for the model • The null hypothesis for each p-value is that the variable provides no information to the model, i.e. that the corresponding parameter is zero
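Continuing the hypothetical sketch above, each of these reported quantities can be read off the fitted model object:

```python
# Quantities slide 8 describes, from the hypothetical fit above.
b = model.params        # best estimates b0, ..., b5
ci = model.conf_int()   # 95% confidence interval for each estimate
p = model.pvalues       # p-values for H0: the true parameter is zero
r2 = model.rsquared     # R2: fraction of variance explained by the model
print(b, ci, p, r2, sep="\n")
```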

  9. Interpreting the Coefficients • The coefficients can be interpreted in a similar way to the slope estimate in simple linear regression • Each represents the change in the dependent variable for a one unit increase in the corresponding independent variable, keeping all the other independent variables fixed • In the example, b1 (the estimate for log(lead concentration)) was -9.5 ml/min, with a 95% CI of [-18.1, -0.9] • This means that for every one unit increase in log(lead concentration), creatinine clearance decreased by 9.5 ml/min on average, if all other variables were kept fixed
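To make the "one unit increase, all else fixed" reading concrete, a small continuation of the hypothetical sketch above: predictions for two invented subjects who differ only in log_lead differ by exactly the fitted coefficient b1.

```python
import pandas as pd

# Two hypothetical subjects, identical except for one unit of log_lead
# (i.e. a ten-fold difference in lead concentration).
baseline = dict(log_lead=1.0, age=45, body_mass=75, log_ggt=1.2, diuretics=0)
bumped = dict(baseline, log_lead=2.0)

pred = model.predict(pd.DataFrame([baseline, bumped]))
print(pred[1] - pred[0])  # equals the fitted coefficient for log_lead (b1)
```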

  10. Statistical Significance of the Coefficients • A one unit increase in log(lead concentration) means a 10-fold increase in lead concentration • So the average decrease in creatinine clearance corresponding to a 10-fold increase in lead concentration was 9.5 ml/min, and the 95% confidence interval for the decrease was 0.9 ml/min to 18.1 ml/min • Since the 95% CI does not contain 0, the p-value for this coefficient must be less than 0.05 • This is the p-value for the null hypothesis that the coefficient is zero • Alternatively, think of this as a comparison of models: • Compare the full model (including this variable) to the model not including this variable

  11. Interpreting coefficients for “dummy variables” • One of the variables in the model was really a binary variable • Has the subject previously taken diuretics? • Coded as 0 for no and 1 for yes • Estimate for the coefficient for this variable was -8.8 ml/min • An increase of one unit in this variable results in a decrease in creatinine clearance of 8.8 ml/min, on average • Since the only values are 0 and 1, this means that participants who had previously taken diuretics had an average creatinine clearance 8.8 ml/min lower than those who had not, if all other variables were held equal

  12. Interpreting the R2 value for the model • Multiple linear regression reports an R2 value • For our example, R2 is 0.27 • This means that 27% of the variation in creatinine clearance is accounted for by the model • The remaining 73% is due to random scatter, or is associated with variables not included in the model • Unlike simple linear regression, we cannot plot a graph of the model • One approach to visualizing the model is to plot the predicted outcome variable from the model against the actual measured value

  13. Multiple Linear Regression Plot [figure: model-predicted values plotted against measured values; image not preserved in this transcript]
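A sketch of how the visualization slide 12 describes could be produced from the hypothetical fit above (matplotlib assumed):

```python
import matplotlib.pyplot as plt

# Predicted creatinine clearance from the model vs. the measured values;
# a perfect model would put every point on the diagonal.
plt.scatter(model.fittedvalues, df["creatinine_clearance"])
plt.xlabel("Predicted creatinine clearance (ml/min)")
plt.ylabel("Measured creatinine clearance (ml/min)")
plt.title("Predicted vs. measured, R2 = %.2f" % model.rsquared)
plt.show()
```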

  14. Variable Selection • The authors of the article collected much more data • Stated that other variables did not improve the fit of the model • Adding additional parameters will almost always increase the R2 value • Should use the sum-of-squares F test explained earlier to test if there really is an improvement in the model • Beware of overfitting (explained later)
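A sketch of that nested-model F test, reusing the hypothetical fit from before ('alcohol' here is an invented candidate variable):

```python
from statsmodels.stats.anova import anova_lm

# Full model adds one candidate variable to the reduced model fit earlier.
full = smf.ols(
    "creatinine_clearance ~ log_lead + age + body_mass + log_ggt"
    " + diuretics + alcohol",
    data=df,
).fit()

# F statistic and p-value for whether 'alcohol' genuinely improves the fit.
print(anova_lm(model, full))
```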

  15. Logistic Regression • Logistic Regression is used when the outcome variable is binary • i.e. categorical with two possible outcomes • The general idea is to build a multiple linear model with the outcome variable being the log of the odds ratio • i.e. we build a model predicting the log of the odds of one of the two outcomes from the independent variables • the parameters describe the change in the odds when an independent variable increases by one unit
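Since everything in the logistic model lives on the odds or log-odds scale, a small self-contained reminder of the conversions may help:

```python
import math

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the 'logit' of p)."""
    return math.log(odds(p))

def prob(log_o):
    """Invert the logit: probability from a log-odds value."""
    return 1 / (1 + math.exp(-log_o))

print(odds(0.75))     # 3.0 (3-to-1 odds)
print(log_odds(0.5))  # 0.0
print(prob(0.0))      # 0.5
```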

  16. Logistic Regression Example • We performed chart reviews on 99 post-menopausal women • Ran a logistic regression for an outcome of diabetes with age at menopause, smoking status, and BMI as independent variables

  17. The logistic regression model In logistic regression, we model the odds of the outcome as the product of the odds due to each independent variable. In our example: (odds of diabetes) = (baseline odds) × (odds due to age at menopause) × (odds due to smoking) × (odds due to BMI)

  18. Odds in the logistic model • The baseline odds will be our prediction for the odds of getting diabetes when all the variables are zero • Note this might not be particularly meaningful for some variables • The other odds specify how much the odds of diabetes change if we know the value of a particular variable • How much do the odds of diabetes increase/decrease for each additional year of age at menopause? • How much do the odds of diabetes increase/decrease for smokers over non-smokers? • How much do the odds of diabetes increase/decrease for each additional unit (i.e. each additional kg/m2) of BMI? • These are the odds ratios for the independent variables

  19. Making the logistic regression model look similar to multiple regression • We can model the logarithm of the odds of the outcome as Yi = β0 + β1X1,i + β2X2,i + β3X3,i • Here: • Yi is the log of the odds of diabetes for patient i • X1,i is the age at menopause of patient i • X2,i is the smoking status of patient i (1 for smoker, 0 for non-smoker) • X3,i is the BMI of patient i • The parameters are • β0 the log of the baseline odds • β1 the log of the odds ratio for age at menopause • β2 the log of the odds ratio for smoking • β3 the log of the odds ratio for BMI
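A minimal sketch of fitting this model in Python; the data file and column names (diabetes coded 0/1, menopause_age, smoker, bmi) are hypothetical stand-ins:

```python
import pandas as pd
import statsmodels.formula.api as smf

charts = pd.read_csv("chart_reviews.csv")  # hypothetical: one row per woman

# The formula interface models the log odds of diabetes as a linear
# function of the three independent variables.
logit_model = smf.logit("diabetes ~ menopause_age + smoker + bmi",
                        data=charts).fit()
print(logit_model.summary())  # coefficients are log odds ratios
```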

  20. Fitting the logistic regression model • Fitting the logistic regression model works differently from fitting a linear model • It does not use least squares • We do not have an “odds of the outcome” to plug in for each patient • Just a “diabetes” or “no diabetes” • Fitting the model instead uses a “maximum likelihood” strategy • Given any possible parameter values, we can compute how likely it would be to see the data we got • Choose the parameters which make this likelihood the highest • This is mathematically very complex; typically, numerical estimation methods are used
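What the quantity being maximized actually looks like, as a sketch (numpy assumed; X is the design matrix with a leading column of ones, y the 0/1 outcomes):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood of 0/1 outcomes y under logistic parameters beta."""
    p = 1 / (1 + np.exp(-X @ beta))  # predicted P(outcome = 1) per subject
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# The fitting routine numerically searches for the beta that makes this
# function as large as possible, i.e. that makes the observed data most likely.
```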

  21. Logistic Regression Results [software output tables: “Model Summary”, “Classification Table”, and “Variables in the Equation”; image not preserved in this transcript]

  22. Interpreting Logistic Regression Results • The "Model Summary" box describes how well the model fits the data. • -2 Log likelihood is computed from the likelihood of our observed data given the model. Since a likelihood must be between 0 and 1, this is always positive, and a small value means a better fit. (Our data do not fit the model well.) • R2 cannot be calculated for logistic regression in the same way as for linear regression. The remaining two values give two alternate approaches, and the interpretation for these is similar to a regular R2. Again, our data do not fit the model well. • The "Classification Table" describes the accuracy of using the model as a predictor. • Use the independent variables to compute the predicted odds, and predict whichever class is more likely • Note that adding more variables will almost always improve the apparent accuracy on the data used to fit the model; this should really be tested on an independent data set
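A sketch of how such a classification table can be reconstructed from the hypothetical fit above:

```python
import pandas as pd

# Predict the more likely class for each patient (probability above 0.5),
# then cross-tabulate predictions against the observed outcomes.
predicted = (logit_model.predict(charts) > 0.5).astype(int)
print(pd.crosstab(charts["diabetes"], predicted,
                  rownames=["observed"], colnames=["predicted"]))
```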

  23. Interpreting the Logistic Regression Parameters • The "Variables in the Equation" box gives the parameter estimates, 95% CIs, and p-values • The parameter for smoking is 1.204. This means that a one-unit increase in the smoking variable results in an increase of 1.204 in the log odds ratio • Logs here are natural logs, so the increase in the odds ratio is e^1.204 ≈ 3.33-fold • This is a dummy variable, so a smoker has about 3.3 times the odds of becoming diabetic compared to a non-smoker • The parameter for BMI is 0.072; e^0.072 ≈ 1.075, so an increase of one unit in BMI results in a 1.075-fold increase in the odds of being diabetic • The p-values and 95% CIs show that the parameter for smoking is significant at a significance level of 0.05; BMI has a p-value of 0.055
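With the hypothetical statsmodels fit from earlier, the same exponentiation can be done in one step:

```python
import numpy as np

print(np.exp(logit_model.params))      # odds ratios; the slides report about 3.3 for smoking
print(np.exp(logit_model.conf_int()))  # 95% CIs on the odds-ratio scale
```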

  24. Equation for Logistic Regression Results Our model here is Yi = β0 + β1X1,i + β2X2,i + β3X3,i. For this model, the estimates give log(odds) = -3.307 + 0.009 M + 1.208 S + 0.071 B, so odds = e^(-3.307 + 0.009 M + 1.208 S + 0.071 B) = e^-3.307 × e^0.009M × e^1.208S × e^0.071B = 0.037 × 1.009^M × 3.347^S × 1.073^B
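A worked example with the slide's estimates: plugging in a hypothetical patient (menopause at age M = 50, smoker S = 1, BMI B = 30) gives her predicted odds, and from those the predicted probability, of diabetes.

```python
import math

# Coefficients taken directly from the slide's fitted equation.
log_odds = -3.307 + 0.009 * 50 + 1.208 * 1 + 0.071 * 30
odds = math.exp(log_odds)  # about 1.62
prob = odds / (1 + odds)   # about 0.62
print(odds, prob)
```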

  25. Summary • Multiple Linear Regression fits a dependent variable as a linear model of multiple independent variables • Provides parameter estimates for each independent variable, along with confidence intervals and p-values • The null hypothesis for the p-value is that the variable doesn't contribute to the model • Used for finding the effect of a variable while correcting for confounding variables • Logistic regression is used when the dependent variable is binary • Models the log odds ratio as a linear function of the independent variables • Parameters are the increase in log odds ratio per unit increase in the independent variable
