BMS 617 Lecture 12: Multiple and Logistic Regression Marshall University Genomics Core Facility
Multiple Regression
• In linear regression, we had one independent variable and one dependent (outcome) variable
• In lab experiments, this is fairly common
• The investigator manipulates the value of one variable and keeps everything else the same
• In some lab experiments, and in most observational studies, there is more than one independent variable
• Multiple Regression is used for these scenarios
• "Multiple Regression" really refers to a collection of different techniques
Marshall University School of Medicine
Aims of Multiple Regression
• Quantifying the effect of one variable of interest while adjusting for the effects of other variables
• Very common in observational studies
• The other variables change outside of the control of the investigator
• These other variables are often called covariates
• Creating an equation which is useful for predicting the value of the outcome variable given the values of the various independent variables
• For example, predict the probability of cancer recurrence after surgery alone given characteristics of the tumor (grade, stage, etc.) and of the patient (age, height, weight, etc.)
• Might be used to decide whether or not to use chemotherapy in addition to surgery
• Developing a scientific understanding of the impact of several variables on the outcome
Types of Multiple Regression
• We will look at the following types of multiple regression (there are many others):
• Multiple Linear Regression
• The dependent variable is a linear function of the independent variables
• Logistic Regression
• The outcome variable is binary (dichotomous, or categorical with two possible outcomes)
• The log odds of the outcome is modeled as a linear function of the independent variables
• Proportional Hazards Regression
• Used when the outcome is the elapsed time to a non-recurring event
• Effectively computes the effect of independent variables on a survival curve
Multiple Linear Regression
• Multiple Linear Regression finds the linear equation which best predicts an outcome variable, Y, from multiple independent variables X1, X2, …, Xk
• Example (from Motulsky): Lead Exposure and Kidney Function
• Staessen et al. (1992) investigated the relationship between lead concentration in the blood and kidney function
• Kidney function was measured by creatinine clearance
• Observational study of 965 men
• A naive approach would be to measure lead concentration and creatinine clearance and analyze just those two variables
• However, kidney function is known to decrease with age, and lead accumulates in the blood over time
• Age is therefore a confounding variable
• The analysis must account for this
Multiple Regression Model
The model Staessen et al. used was
$Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \beta_4 X_{i,4} + \beta_5 X_{i,5} + \varepsilon_i$
where the variables are defined in a table on the original slide (not reproduced here).
Multiple Regression Parameters
• The β values in the equation for the model are the parameters of the model
• They do not vary from data point to data point
• They are values associated with the population
• They will be estimated from the data
• Note that one of the variables ($X_{i,5}$) is categorical, and we use a “dummy variable” in its place
What multiple regression does
• Multiple linear regression finds values for the parameters that make the model predict the actual data as well as possible
• Estimates for β0, …, β5 are usually denoted b0, …, b5
• Software performing the regression will report the best estimate for each parameter, a confidence interval and p-value for each estimate, and an R² value for the model
• The null hypothesis for each p-value is that the variable provides no information to the model, i.e. that the parameter is zero
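As a rough illustration of what "finding the best parameter values" means, the sketch below fits a two-variable linear model by least squares. The data are synthetic and the true parameter values are invented for the example; nothing here comes from the Staessen et al. study, and NumPy is assumed to be available.

```python
import numpy as np

# Synthetic data only: these numbers are made up for illustration and are
# not the Staessen et al. measurements.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_beta = np.array([2.0, -1.5, 0.5])  # hypothetical beta0, beta1, beta2
noise = rng.normal(scale=0.1, size=n)   # the epsilon term of the model
y = true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2 + noise

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: choose b to minimise the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates b0, b1, b2, which should be close to true_beta
```

Statistical packages report confidence intervals and p-values on top of these point estimates; this sketch only shows the estimation step.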
Interpreting the Coefficients
• The coefficients can be interpreted in a similar way to the slope estimate in simple linear regression
• Each represents the change in the dependent variable for a one-unit increase in the corresponding independent variable, keeping all the other independent variables fixed
• In the example, b1 (the estimate for log(lead concentration)) was -9.5 ml/min, with a 95% CI of [-18.1, -0.9]
• This means that for every one-unit increase in log(lead concentration), creatinine clearance decreased by 9.5 ml/min on average, if all other variables were kept fixed
Statistical Significance of the Coefficients
• A one-unit increase in log(lead concentration) means a 10-fold increase in lead concentration (the logs are base 10)
• So the average decrease in creatinine clearance corresponding to a 10-fold increase in lead concentration was 9.5 ml/min, and the 95% confidence interval for the decrease was 0.9 ml/min to 18.1 ml/min
• Since the 95% CI does not contain 0, the p-value for this coefficient must be less than 0.05
• This is the p-value for the null hypothesis that the coefficient is zero
• Alternatively, think of this as a comparison of models:
• Compare the full model (including this variable) to the model not including this variable
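The arithmetic behind this interpretation can be sketched as follows. The coefficient and confidence interval are the values quoted above; the function name and the doubling example are illustrative additions, and the calculation assumes base-10 logs, as the 10-fold statement implies.

```python
import math

b1 = -9.5           # coefficient for log10(lead concentration), in ml/min
ci = (-18.1, -0.9)  # 95% confidence interval quoted above


def change_in_clearance(fold_increase, coef=b1):
    """Average change in creatinine clearance (ml/min) for a given fold
    increase in lead concentration, other variables held fixed.
    Assumes the model used base-10 logs."""
    return coef * math.log10(fold_increase)


tenfold = change_in_clearance(10)   # log10(10) = 1, so this is just b1
doubling = change_in_clearance(2)   # a smaller step on the log scale

# If the 95% CI excludes zero, the p-value for the coefficient is below 0.05.
ci_excludes_zero = not (ci[0] <= 0 <= ci[1])

print(tenfold, round(doubling, 2), ci_excludes_zero)
```

A doubling of lead concentration therefore corresponds to a smaller average decrease than a 10-fold increase, because it moves log10(concentration) by only about 0.3 units.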
Interpreting coefficients for “dummy variables”
• One of the variables in the model was really a binary variable
• Has the subject previously taken diuretics?
• Coded as 0 for no and 1 for yes
• The estimate for the coefficient for this variable was -8.8 ml/min
• An increase of one unit in this variable corresponds to a decrease in creatinine clearance of 8.8 ml/min, on average
• Since the only values are 0 and 1, this means that participants who had previously taken diuretics had an average creatinine clearance 8.8 ml/min lower than those who had not, if all other variables are held equal
Interpreting the R² value for the model
• Multiple linear regression reports an R² value
• For our example, R² is 0.27
• This means that 27% of the variation in creatinine clearance is accounted for by the model
• The remaining 73% is due to random scatter, or is associated with variables not included in the model
• Unlike simple linear regression, we cannot plot a graph of the model
• One approach to visualizing the model is to plot the predicted outcome variable from the model against the actual measured value
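The "fraction of variation accounted for" can be computed directly from measured and predicted values. The following is a minimal sketch with made-up numbers; the function name `r_squared` is an illustrative choice, not something from the lecture.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical measured values and model predictions (not study data).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(actual, predicted)
print(round(r2, 2))
```

The same pairs of (predicted, actual) values are what one would put on the two axes of the predicted-versus-measured plot described above.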
Multiple Linear Regression Plot (figure not reproduced)
Variable Selection
• The authors of the article collected much more data
• They stated that other variables did not improve the fit of the model
• Adding additional parameters will almost always increase the R² value
• Should use the sum-of-squares F test explained earlier to test whether there really is an improvement in the model
• Beware of overfitting (explained later)
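For reference, the F statistic used to compare a reduced model with a fuller one can be sketched as below. The sums of squares, sample size, and parameter counts are invented for illustration, and the function name is hypothetical; obtaining a p-value would additionally require comparing the statistic against an F distribution with the two degrees of freedom shown.

```python
def nested_f_statistic(ss_reduced, ss_full, n, k_reduced, k_full):
    """F statistic comparing two nested least-squares models.

    ss_*: residual sum of squares of each model
    n:    number of data points
    k_*:  number of fitted parameters in each model (including the intercept)
    """
    df_reduced = n - k_reduced
    df_full = n - k_full
    # Drop in residual sum of squares per extra parameter ...
    numerator = (ss_reduced - ss_full) / (df_reduced - df_full)
    # ... relative to the residual variance of the fuller model.
    denominator = ss_full / df_full
    return numerator / denominator


# Hypothetical example: adding one variable lowers the residual SS from 120 to 100.
f_stat = nested_f_statistic(120.0, 100.0, n=50, k_reduced=3, k_full=4)
print(round(f_stat, 2))  # compare against an F distribution with (1, 46) df
```

A large F means the extra variable reduced the residual sum of squares by more than would be expected from adding an uninformative parameter.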
Logistic Regression
• Logistic Regression is used when the outcome variable is binary
• i.e. categorical with two possible outcomes
• The general idea is to build a multiple linear model in which the outcome variable is the log of the odds
• i.e. we build a model predicting the log of the odds of one of the two outcomes from the independent variables
• The parameters describe the change in the log odds when a variable changes by one unit
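The relationship between a probability, its odds, and its log odds can be sketched in a few lines. The function names are illustrative; natural logs are assumed, as is conventional for logistic regression.

```python
import math


def log_odds(p):
    """Log of the odds p / (1 - p) for a probability 0 < p < 1."""
    return math.log(p / (1 - p))


def probability(log_odds_value):
    """Inverse of log_odds: convert log odds back to a probability."""
    return 1 / (1 + math.exp(-log_odds_value))


print(log_odds(0.5))                # even odds have log odds of zero
print(log_odds(0.75))               # odds of 3 to 1
print(probability(log_odds(0.2)))   # round trip returns the probability
```

Because log odds can take any real value while probabilities are confined to the interval from 0 to 1, a linear model on the log odds scale never predicts an impossible probability.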
Logistic Regression Example
• We performed chart reviews on 99 post-menopausal women
• Ran a logistic regression for an outcome of diabetes with age at menopause, smoking status, and BMI as independent variables
Logistic Regression Results (output tables not reproduced)
Interpreting Logistic Regression Results
• The "Model Summary" box describes how well the model fits the data
• -2 Log likelihood is computed from the likelihood of our observed data given the model. Since the likelihood must be between 0 and 1, its log is negative, so -2 log likelihood is always positive, and a small value means a better fit. (Our data do not fit the model well.)
• R² cannot be calculated in the same way for logistic regression. The remaining two values give two alternative approaches, and the interpretation of these is similar to a regular R². Again, our data do not fit the model well.
• The "Classification Table" describes the accuracy of using the model as a predictor
• Use the independent variables to compute the predicted odds, and predict the more likely class
• Note that adding more variables will generally improve the apparent accuracy on the data used to fit the model; accuracy should really be tested on an independent data set
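To make the two summaries concrete, the sketch below computes -2 log likelihood and a classification table from a handful of observed outcomes and predicted probabilities. The four data points are invented, the function names are hypothetical, and the 0.5 cutoff is an assumption standing in for "predict the more likely class".

```python
import math


def minus_two_log_likelihood(outcomes, probs):
    """-2 * log likelihood of binary outcomes (0/1) given predicted
    probabilities of outcome 1."""
    log_lik = 0.0
    for y, p in zip(outcomes, probs):
        log_lik += math.log(p) if y == 1 else math.log(1 - p)
    return -2 * log_lik


def classification_table(outcomes, probs, cutoff=0.5):
    """Counts keyed by (observed, predicted) class."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(outcomes, probs):
        predicted = 1 if p >= cutoff else 0
        table[(y, predicted)] += 1
    return table


# Hypothetical outcomes and model probabilities (not the study data).
outcomes = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.4, 0.6]

m2ll = minus_two_log_likelihood(outcomes, probs)
table = classification_table(outcomes, probs)
accuracy = (table[(0, 0)] + table[(1, 1)]) / len(outcomes)
print(m2ll, table, accuracy)
```

Evaluating the same table on data that were not used to fit the model gives a more honest picture of predictive accuracy.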
Interpreting the Logistic Regression Parameters
• The "Variables in the Equation" box gives the parameter estimates, 95% CIs, and p-values
• The parameter for Smoking is 1.204. This means that a one-unit increase in the smoking variable results in an increase of 1.204 in the log odds of diabetes
• Logs here are natural logs, so the odds increase $e^{1.204} = 3.335$-fold
• This is a dummy variable, so a smoker has about 3.3 times the odds of being diabetic compared with a non-smoker
• The parameter for BMI is 0.072; $e^{0.072} = 1.075$, so an increase of one unit in BMI results in a 1.075-fold increase in the odds of being diabetic
• The p-values and 95% CIs show that the parameter for smoking is significant at a significance level of 0.05. BMI has a p-value of 0.055, which narrowly misses that threshold.
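Converting coefficients to odds ratios is a one-line exponentiation, sketched below with the two coefficients quoted above. The dictionary layout and the five-unit BMI example are illustrative additions. Exponentiating the rounded coefficient 1.204 gives roughly 3.33, so the quoted 3.335 presumably comes from an unrounded coefficient in the software output.

```python
import math

# Coefficients quoted above (change in natural-log odds per one-unit increase).
coefficients = {"smoking": 1.204, "BMI": 0.072}

# Exponentiating a coefficient gives the odds ratio for a one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# For a change of several units the coefficient is multiplied first, which is
# the same as raising the one-unit odds ratio to that power.
bmi_5_units = math.exp(5 * coefficients["BMI"])

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
print("BMI +5 units", round(bmi_5_units, 3))
```

Odds ratios multiply rather than add: a five-unit difference in BMI corresponds to the one-unit odds ratio raised to the fifth power.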
Mathematical Model for Logistic Regression
The mathematical setup for logistic regression is:
$\log(\mathrm{odds}_i) = \beta_0 + X_{i,1}\beta_1 + \dots + X_{i,k}\beta_k$
where the variables are
• $\mathrm{odds}_i$: odds of the outcome for subject i
• $X_{i,j}$: value of variable j for subject i
For our model, with S the smoking dummy variable and B the BMI, the estimates give
$\log(\mathrm{odds}) = -3.307 + 1.208\,S + 0.071\,B$
$\mathrm{odds} = e^{-3.307 + 1.208\,S + 0.071\,B}$
$\mathrm{odds} = e^{-3.307}\, e^{1.208\,S}\, e^{0.071\,B} = 0.037 \times 3.347^{S} \times 1.073^{B}$
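The fitted equation can be turned into predicted odds and probabilities, as in this sketch. It assumes S is coded 1 for a smoker and 0 for a non-smoker and that B is BMI; the BMI of 30 is an arbitrary illustrative value, and the function names are hypothetical.

```python
import math

# Estimates from the fitted equation above.
B0, B_SMOKING, B_BMI = -3.307, 1.208, 0.071


def predicted_odds(smoker, bmi):
    """Predicted odds of diabetes; smoker is 1 or 0 (assumed coding)."""
    return math.exp(B0 + B_SMOKING * smoker + B_BMI * bmi)


def predicted_probability(smoker, bmi):
    """Convert the predicted odds to a probability: odds / (1 + odds)."""
    odds = predicted_odds(smoker, bmi)
    return odds / (1 + odds)


# Illustrative comparison at a hypothetical BMI of 30.
odds_smoker = predicted_odds(1, 30)
odds_non_smoker = predicted_odds(0, 30)

print(round(odds_smoker / odds_non_smoker, 3))    # equals exp(1.208)
print(round(predicted_probability(1, 30), 3))
print(round(predicted_probability(0, 30), 3))
```

Whatever BMI is chosen, the ratio of the two predicted odds is the same, which is why the exponentiated smoking coefficient can be read as a single odds ratio.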
Summary
• Multiple Linear Regression fits a dependent variable as a linear model of multiple independent variables
• Provides parameter estimates for each independent variable, along with confidence intervals and p-values
• The null hypothesis for the p-value is that the variable doesn't contribute to the model
• Used for finding the effect of a variable while correcting for confounding variables
• Logistic regression is used when the dependent variable is binary
• Models the log odds as a linear function of the independent variables
• Parameters are the increase in log odds per unit increase in the independent variable