POPLHLTH 304 Regression (modelling) in Epidemiology Simon Thornley (Slides adapted from Assoc. Prof. Roger Marshall)

Which method does not control for confounding?
• Stratification
• Exclusion criteria
• Regression modelling
• Objective assessment of outcomes

Observational epidemiology
• Studies in epidemiology are usually “observational”.
• Myriad factors determine the occurrence of disease.
• Separating the effects of specific factors from those of others (confounding variables) is often difficult.
• Regression models are a useful alternative to stratification.

Stratification difficulty
• Often there are too many confounders: stratifying leads to too many strata, e.g. 4 categories of age, 2 of sex and 4 of ethnicity give 4 × 2 × 4 = 32 strata. Empty cells are problematic.
• We need a better, more statistically efficient way to deal with the problem.
• We want to control for (many) possible confounding variables while estimating the effect (relative risk or odds ratio) of an exposure of interest.
• Building statistical models is one solution.
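
The growth in the number of strata can be checked with a few lines of Python. This is only an illustrative sketch: the category labels are invented, and only the counts (4, 2 and 4) come from the slide. The extra 3-level confounder at the end is also hypothetical.

```python
from itertools import product

# Hypothetical category labels; only the counts (4, 2, 4) come from the slide.
age_groups = ["<40", "40-54", "55-69", "70+"]
sexes = ["female", "male"]
ethnic_groups = ["A", "B", "C", "D"]

# Every combination of age group, sex and ethnicity is one stratum.
strata = list(product(age_groups, sexes, ethnic_groups))
n_strata = len(strata)
print(n_strata)

# Adding one more (hypothetical) 3-level confounder triples the number of strata.
n_with_extra = n_strata * 3
print(n_with_extra)
```

With a fixed sample size, each extra confounder spreads the same subjects over more strata, which is why empty cells become likely.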

What is a [statistical regression] model?
• Usually regarded as a formula that relates an outcome Y to one or more predictors (exposures) X1, X2, … of Y.
• The formula imposes a framework: it is our assumption about how Y is related to X1, X2, … in the real world.
• The model is specified in terms of unknown parameters, which are estimated from data (“model fitting”).

Linear (regression) model
• We may often consider that “Y increases with X”, e.g. blood pressure increases with age.
• We may also consider that it does so “linearly”.
• The data seem to support this, though with much variability.

Straight line model (simple linear regression model) for how Y “depends on X”
[Figure: scatter plot of Y against X with a fitted straight line]
• The straight line is the model structure, or framework.
• Fitting to data involves drawing a “good fitting” line through the points; the line gives the mean of Y for a given X, written E(Y|X).
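
One common way to make “good fitting” precise is least squares, which picks the line that minimises the sum of squared vertical distances to the points. The slide does not say which criterion is intended, so least squares is an assumption here. The ages and blood pressures below are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
# Hypothetical (age, systolic blood pressure) pairs, invented for illustration.
ages = [25, 35, 45, 55, 65, 75]
sbp = [118, 121, 127, 133, 138, 146]


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the line E(Y|X) = b0 + b1*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                # slope: change in mean Y per unit of X
    b0 = mean_y - b1 * mean_x     # intercept: the line passes through the means
    return b0, b1


b0, b1 = fit_line(ages, sbp)
predicted_at_50 = b0 + b1 * 50    # estimated mean Y at X = 50
print(f"intercept={b0:.2f} slope={b1:.3f} E(Y|X=50)={predicted_at_50:.1f}")
```

The fitted line is read as an estimate of the mean of Y at each X, not as a prediction for any one individual.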

Regression
• Describes the relationship between X and Y.
• Y “depends” on X (rather than X depending on Y).
• Y is the dependent (outcome, disease) variable.
• X is the independent (exposure, predictor, covariate) variable.

Consider 2 potential predictors of Y, say X1 and X2. The data can be plotted as scatter points in a 3-dimensional space:
[Figure: 3-D scatter plot with axes Y, X1 and X2]

The analogue of a line in 2-D is a plane in 3-D:
[Figure: plane in 3-D space with axes Y, X1 and X2]

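Straight line model with just X1 is
E(Y | X1) = β0 + β1 X1
Extending this to a plane is
E(Y | X1, X2) = β0 + β1 X1 + β2 X2
Or further
E(Y | X1, X2, X3, …) = β0 + β1 X1 + β2 X2 + β3 X3 + …
Here E(Y | …) means “average Y given …”.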

Binary Y in epidemiology
• In epidemiology, Y is often a binary disease/no disease outcome.
• X1, X2, etc. are risk factors for the disease: one may be an exposure of interest, the others confounders.

Logistic model: the linear model must be modified to account for a binary Y, the occurrence of disease D
• Again, the information on X1, X2, … is collapsed into a risk score Q.
• The relationship between the probability of disease and Q now follows the logistic formula:
P(D | X1, X2, …) = e^Q / (1 + e^Q)

Logistic regression formula is of the form
Probability of CHD = e^Q / (1 + e^Q)
where Q is a weighted sum of risk factors (a linear score). For example:
Q = -5.32 + 1.09*SMOKE + 0.41*SEX
with SEX = 1 if man, 0 if woman, and SMOKE = 1 if smokes, 0 if not.
The values -5.32, 1.09 and 0.41 are estimated from the data and are the “beta-coefficients”.

The model gives a probability for each of the 4 combinations:
• Smoking man: Q = -5.32 + 1.09 × 1 + 0.41 × 1 = -3.82, Prob = e^-3.82 / (1 + e^-3.82) = 0.0214
• Nonsmoking man: Q = -5.32 + 1.09 × 0 + 0.41 × 1 = -4.91, Prob = 0.00732
• Nonsmoking woman: Q = -5.32, Prob = 0.00486
• Smoking woman: Q = -5.32 + 1.09 = -4.23, Prob = 0.01434
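
The four probabilities can be reproduced with a short Python sketch. The coefficients are those used in the worked example (intercept -5.32); the function names `risk_score` and `probability` are illustrative choices, not part of the slides. The last printed digit may differ slightly from the slide, which appears to use coefficients with more decimal places or to truncate.

```python
import math

# Coefficients from the worked example on the slide.
INTERCEPT, B_SMOKE, B_SEX = -5.32, 1.09, 0.41


def risk_score(smoke, sex):
    """Linear score Q; smoke and sex are coded 0/1 as on the slide."""
    return INTERCEPT + B_SMOKE * smoke + B_SEX * sex


def probability(q):
    """Logistic formula e^Q / (1 + e^Q)."""
    return math.exp(q) / (1 + math.exp(q))


labels = {(1, 1): "Smoking man", (0, 1): "Nonsmoking man",
          (0, 0): "Nonsmoking woman", (1, 0): "Smoking woman"}

probs = {}
for (smoke, sex), label in labels.items():
    q = risk_score(smoke, sex)
    probs[(smoke, sex)] = probability(q)
    print(f"{label:17s} Q = {q:.2f}  Prob = {probs[(smoke, sex)]:.5f}")
```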

Relative risk estimates
• RR for smoking (in men) is 0.0214/0.00732 = 2.92
• RR for smoking (in women) is 0.01434/0.00486 = 2.95
• Notice these are also approximately e^1.09 = 2.97, i.e. taking the exponential of a variable's beta-coefficient estimates its RR.
• (Strictly, e^1.09 = 2.97 is the disease odds ratio, but it is approximately equal to the RR when the disease is rare.)
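
The distinction between the relative risk and the odds ratio can be checked numerically. In this sketch, which uses the example's coefficients and illustrative function names, the odds ratio for smoking equals e^1.09 exactly in both sexes, while the relative risk is slightly smaller.

```python
import math

# Coefficients from the worked example on the slides.
INTERCEPT, B_SMOKE, B_SEX = -5.32, 1.09, 0.41


def probability(smoke, sex):
    q = INTERCEPT + B_SMOKE * smoke + B_SEX * sex
    return math.exp(q) / (1 + math.exp(q))


def odds(p):
    return p / (1 - p)


results = {}
for sex, label in ((1, "men"), (0, "women")):
    p_smoker, p_nonsmoker = probability(1, sex), probability(0, sex)
    rr = p_smoker / p_nonsmoker                     # relative risk
    odds_ratio = odds(p_smoker) / odds(p_nonsmoker) # odds ratio
    results[label] = (rr, odds_ratio)
    print(f"{label:6s} RR = {rr:.3f}  OR = {odds_ratio:.3f}")

print(f"exp(1.09) = {math.exp(B_SMOKE):.3f}")
```

The relative risks printed here are computed from unrounded probabilities, so they can differ in the last digit from the slide's figures, which divide rounded probabilities.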

Why the logistic formula?
Answer: P(D | …) is always between 0 and 1 whatever the value of Q, i.e. it behaves as a probability should.
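
A quick numerical check of this property, as a sketch: the function below evaluates e^Q / (1 + e^Q) for very negative, zero and very positive scores. It is written in an algebraically equivalent form that avoids overflow for large positive Q, which is an implementation choice and not something from the slides.

```python
import math


def logistic(q):
    """e^q / (1 + e^q), rearranged so that exp() is never called on a large positive number."""
    if q >= 0:
        return 1 / (1 + math.exp(-q))
    e = math.exp(q)
    return e / (1 + e)


scores = [-20, -5, 0, 5, 20]
values = [logistic(q) for q in scores]
for q, p in zip(scores, values):
    print(f"Q = {q:4d}  P = {p:.10f}")
```

However extreme the score, the result stays strictly between 0 and 1, and it increases as Q increases.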

Can include as many variables in Q as we like:
Q = -5.45 + 1.23*SMOKE + 0.31*SEX + 0.124*AGE - 0.2*ETHNIC …
but the model may be too ambitious, i.e. can a single model really be expected to account accurately for the effects of numerous variables?

Logistic model in epidemiology: controlling for confounding
• Y is the occurrence of disease in a cohort study.
• X1 is the binary exposure of interest.
• X2, X3, … are confounding risk factors.
• β1 (the beta-coefficient of X1) is the effect of X1, “controlling” for the effects of X2, X3, etc.
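
To see what “controlling” means in practice, the sketch below fits a logistic model by Newton-Raphson, one standard way of maximising the likelihood (the slides do not say which algorithm is used). Everything about the data is an assumption made for illustration: the cell sizes are invented so that men smoke more often than women, and the expected number of cases in each cell is generated from the example coefficients (-5.32, 1.09, 0.41). A model with SMOKE alone gives a “crude” coefficient inflated by confounding with SEX; adding SEX recovers 1.09.

```python
import math


def logistic(q):
    return 1 / (1 + math.exp(-q))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, n_iter=25):
    """rows: (x, n, cases) per cell, where x starts with 1 for the intercept."""
    k = len(rows[0][0])
    beta = [0.0] * k
    for _ in range(n_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, n, cases in rows:
            p = logistic(sum(b * xi for b, xi in zip(beta, x)))
            for i in range(k):
                score[i] += x[i] * (cases - n * p)
                for j in range(k):
                    info[i][j] += x[i] * x[j] * n * p * (1 - p)
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta


TRUE_BETA = (-5.32, 1.09, 0.41)  # intercept, SMOKE, SEX (from the slides' example)
# Hypothetical cell sizes keyed by (smoke, sex): smoking is commoner in men.
CELL_SIZES = {(0, 0): 40_000, (0, 1): 20_000, (1, 0): 10_000, (1, 1): 30_000}

rows = []
for (smoke, sex), n in CELL_SIZES.items():
    x = (1, smoke, sex)
    p = logistic(sum(b * xi for b, xi in zip(TRUE_BETA, x)))
    rows.append((x, n, n * p))  # expected (fractional) number of cases

adjusted = fit_logistic(rows)                                          # SMOKE and SEX
crude = fit_logistic([((1, x[1]), n, cases) for x, n, cases in rows])  # SMOKE only
print("adjusted SMOKE coefficient:", round(adjusted[1], 3))
print("crude SMOKE coefficient:   ", round(crude[1], 3))
```

In real work a statistical package would do the fitting; the point of the sketch is only that the SMOKE coefficient changes once SEX is in the model.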

Relative risk
• In fact e^β1 (the exponential of the beta-coefficient of X1) is approximately the relative risk of X1, assumed to be the same for all values of X2, X3, etc. (no effect modification/interaction).
• e.g. if X1 is smoking, X2 is age and X3 is alcohol, the RR for smoking is the same whatever the age or alcohol intake.

Case-control studies
• The development above is for cohort studies, since the probabilities of disease P(D | …) are estimable in a cohort study…
• …but the model can be used for case-control studies too, even though the probabilities of disease are not estimable.
• The exponential of the beta-coefficient can still be used as an RR estimate.
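
The reason this works is that sampling controls at a lower rate than cases rescales the odds of disease by the same factor in exposed and unexposed people, so the odds ratio is unchanged even though the apparent risks are not. A minimal sketch with an invented 2 × 2 cohort table (all counts and the 5% control sampling fraction are hypothetical):

```python
def odds_ratio(a, b, c, d):
    """a: exposed cases, b: exposed non-cases, c: unexposed cases, d: unexposed non-cases."""
    return (a * d) / (b * c)


# Hypothetical full-cohort counts.
a, b, c, d = 200, 9800, 70, 9930
cohort_or = odds_ratio(a, b, c, d)
cohort_risk_exposed = a / (a + b)

# Case-control version: keep every case, sample 5% of non-cases (expected counts).
control_fraction = 0.05
cc_or = odds_ratio(a, b * control_fraction, c, d * control_fraction)
cc_apparent_risk_exposed = a / (a + b * control_fraction)

print(f"cohort OR       = {cohort_or:.3f}")
print(f"case-control OR = {cc_or:.3f}")
print(f"risk in exposed: cohort {cohort_risk_exposed:.3f}, "
      f"case-control sample {cc_apparent_risk_exposed:.3f}")
```

The proportion of exposed subjects who are cases in the sample says nothing about their true risk, but the odds ratio survives the sampling.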

Logistic modelling advantages
• Can adjust for many confounders at once
• Beta-coefficients give odds ratio estimates of relative risk, valid if the disease is rare
• Deals with “interactions” (effect modification) if necessary
• Easy to do on a computer
• Gives confidence intervals, P-values, etc.
• Can be applied to case-control data

Disadvantages
• A model is just a model, not necessarily reality
• Black box approach: can lose touch with the data
• Requires decisions: which variables to include in the model? How to code variables? Continuous or dichotomised?
• ORs are not valid as RRs for a non-rare disease (in a cohort)
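
The last point can be made concrete. Holding the odds ratio fixed at e^1.09 (the smoking example), the sketch below converts it to the relative risk it implies at several baseline risks. The baseline risks are arbitrary illustrative values and `rr_from_or` is a hypothetical helper name.

```python
import math


def rr_from_or(odds_ratio, baseline_risk):
    """Relative risk implied by an odds ratio when the unexposed risk is baseline_risk."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = odds_ratio * baseline_odds
    exposed_risk = exposed_odds / (1 + exposed_odds)
    return exposed_risk / baseline_risk


OR_SMOKE = math.exp(1.09)
baseline_risks = [0.001, 0.01, 0.1, 0.3, 0.5]  # arbitrary illustrative values
implied_rr = [rr_from_or(OR_SMOKE, p0) for p0 in baseline_risks]

print(f"odds ratio = {OR_SMOKE:.2f}")
for p0, rr in zip(baseline_risks, implied_rr):
    print(f"baseline risk {p0:5.3f}  ->  RR = {rr:.2f}")
```

For a rare disease the two measures nearly coincide; as the baseline risk rises, the odds ratio increasingly overstates the relative risk.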

Logistic regression is favoured in epidemiology because:
• It can be used to adjust for many confounders at once
• It enhances statistical power over stratification
• It results in an outcome that is constrained between zero and one (the domain of a probability).

How do you estimate an odds ratio from a logistic model?
• It is equal to the beta coefficient
• It is equal to the exponential of the beta coefficient
• It is equal to the logit of the sum of the product of the variables and the beta coefficients.

Which one of the following statements is true?
• The choice of independent and dependent variable in regression modelling is unimportant
• A regression model estimates the average value of the dependent variable, given the values of a number of independent variables
• Independent variables are outcomes and dependent variables exposures.
