380 likes | 517 Views
Introduction to L ogistic R egression. Jean-Claude Desenclos, Rachid Salmi, Alain Moren, Thomas Grein. Content. Simple and multiple linear regression Simple logistic regression The logistic function Estimation of parameters Interpretation of coefficients Multiple logistic regression
E N D
Introduction to Logistic Regression Jean-Claude Desenclos, Rachid Salmi, Alain Moren, Thomas Grein
Content • Simple and multiple linear regression • Simple logistic regression • The logistic function • Estimation of parameters • Interpretation of coefficients • Multiple logistic regression • Interpretation of coefficients • Coding of variables • Model building
Simple linear regression Table 1 Age and systolic blood pressure (SBP) among 33 adult women
SBP (mm Hg) Age (years) adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974
Simple linear regression • Relation between 2 continuous variables (SBP and age) • Regression coefficient b1 • Measures associationbetween y and x • Amount by which y changes on average when x changes by one unit • Least squares method y Slope x
Multiple linear regression • Relation between a continuous variable and a setofi continuous variables • Partial regression coefficients bi • Amount by which y changes on average when xi changes by one unit and all the other xis remain constant • Measures association between xi and y adjusted for all other xi • Example • SBP versus age, weight, height, etc
Multiple linear regression Predicted Predictor variables Response variable Explanatory variables Outcome variable Covariables Dependent Independent variables
General linear models • Family of regression models • Outcome variable determines choice of model • Uses • Estimate force of association between outcome and covariates • Control of confounding • Model building, risk prediction Outcome Model Continuous Linear regression Counts Poisson regression Survival Cox model Binomial Logistic regression
Logistic regression • Models relationship betweenset of variables xi • dichotomous (yes/no) • categorical (social class, ...) • continuous (age, ...) and • dichotomous (binary) variable Y • Dichotomous outcome very common situation in biology and epidemiology
Logistic regression (1) Table 2 Age and signs of coronary heart disease (CD)
How can we analyse these data? • Compare mean age of diseased and non-diseased • Non-diseased: 38.6 years • Diseased: 58.7 years (p<0.0001) • Linear regression?
Logistic regression (2) Table 3Prevalence (%) of signs of CD according to age group
Dot-plot: Data from Table 3 Diseased % Age group
Logistic function (1) Probability ofdisease x
{ logit of P(y|x) Logistic transformation
Advantages of Logit • Properties of a linear regression model • Logit between - and + • Probability (P) constrained between 0 and 1 • Directly related to odds of disease
Interpretation of coefficient b • b=increase in logarithm of odds ratio for a one unit increase in x • Test of the hypothesis that b=0 (Wald test) • Interval testing
Example • Risk of developing coronary heart disease (CD) by age (<55 and 55+ years)
Fitting equation to the data • Linear regression: Least squares • Logistic regression: Maximum likelihood • Likelihood function • Estimates parameters a and b with property that likelihood (probability) of observed data is higher than for any other values • Practically easier to work with log-likelihood
Maximum likelihood • Iterative computing • Choice of an arbitrary value for the coefficients (usually 0) • Computing of log-likelihood • Variation of coefficients’ values • Reiteration until maximisation (plateau) • Results • Maximum Likelihood Estimates (MLE) for and • Estimates of P(y) for a given value of x
Multiple logistic regression • More than one independent variable • Dichotomous, ordinal, nominal, continuous … • Interpretation of bi • Increase in log-odds for a one unit increase in xi with all the other xis constant • Measures association between xi and log-odds adjusted for all other xi
Effect modification • Effect modification • Can be modelled by including interaction terms
Statistical testing • Question • Does model including given independent variable provide more information about dependent variable than model without this variable? • Three tests • Likelihood ratio statistic (LRS) • Wald test • Score test
Likelihood ratio statistic • Compares two nested models Log(odds) = + 1x1 + 2x2 + 3x3 + 4x4 (model 1) Log(odds) = + 1x1 + 2x2 (model 2) • LR statistic -2 log (likelihood model 2 / likelihood model 1) = -2 log (likelihood model 2) minus -2log (likelihood model 1) LR statistic is a 2 with DF = number of extra parameters in model
Example P Probability for cardiac arrest Exc 1= lack of exercise, 0 = exercise Smk 1= smokers, 0= non-smokers adapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998
Interaction between smoking and exercise? • Product term b3 = -0.4604 (SE 0.5332) Wald test = 0.75 (1df) -2log(L) = 342.092 with interaction term = 342.836 without interaction term LR statistic = 0.74 (1df), p = 0.39 No evidence of any interaction
Coding of variables (1) • Dichotomous variables: yes = 1, no = 0 • Continuous variables • Increase in OR for a one unit change in exposure variable • Logistic model is multiplicative OR increases exponentially with x • If OR = 2 for a oneunit change in exposure and x increases from 2 to 5: OR = 2 x 2 x 2 = 23 = 8 • Verify that OR increases exponentially with x. When in doubt, treat as qualitative variable
Continuous variable? • Relationship between SBP>160 mmHg and body weight • Introduce BW as continuous variable? • Code weight as single variable, eg. 3 equal classes: 40-60 kg =0, 60-80 kg =1, 80-100 kg =2 • Compatible with assumption of multiplicative model • If not compatible, use indicator variables
Coding of variables (2) • Nominal variables or ordinal with unequal classes: • Tobacco smoked: no=0, grey=1, brown=2, blond=3 • Model assumes that OR for blond tobacco = OR for grey tobacco3 • Use indicator variables (dummy variables)
Indicator variables: Type of tobacco • Neutralises artificial hierarchy between classes in the variable "type of tobacco" • No assumptions made • 3 variables (3 df) in model using same reference • OR for each type of tobacco adjusted for the others in reference to non-smoking
Model Building • Depends of study objectives : • single hypothesis testing : estimate the effect of a given variable adjusted for all potential available confounders • exploratory study : identify a set of variables independently associated to the outcome • to predict : predict the outcome with the least varieble possible (parsimony principle) • Judgement criteria • conditional vs non conditional (matching ?) • statistical (p) • what is the hypothesis, what is (are) the variable(s) of interest, what are the confounding variables... • any colinear variables… • biologic pathway and plausibilty…
Model building strategy • Hosmer & Lemeshow approach • Start with a saturated model including variables «biologically » important + variables with p<0.2 • Define variables to keep in the model (forced) • Stepwise backward elimination • Exclusion of least « significant variables » base on statitical test and no important variation of coefficient until obtaining a « satifactory » model • Add and test interation terms • Final model with interaction term if any • Check model fit (residual analysis, diagnotic procedures, Lemeshow test…)
Be careful • Make sure you choosed the correct model • Be careful of the black box ! • So easy to do with computers ! • Coding of variables is a fundamental issue • You should know where you want to go (you are the investigator !) • Be careful of automatic procedures • You should be able to justify why you did one way or another • Do not base your stratgy on statistical test • Ask for advice !
Reference++++ • Hosmer DW, Lemeshow S. Applied logistic regression. Wiley & Sons, New York, 1989