Analysis of Complex Survey Data

Analysis of Complex Survey Data Day 3: Regression

Today’s schedule
• Part I: Basic review of common regressions and when to use them
• Part II: Introduction to
  • PROC REGRESS
  • PROC RLOGIST
  • PROC LOGLINK
  • PROC MULTILOG

Regression • Typically in epidemiologic research, our outcomes fall into four major types: • Continuous • Normally distributed • Skewed • Counts • Binary • Ordinal • Nominal

Continuous outcome, normally distributed • Linear regression

Continuous outcome, right skewed • Poisson regression

Counts • Poisson regression

Binary outcome • Logistic regression

Ordinal
• Polytomous regression, cumulative logit link function
• Likert scales
• Ordered categorical scales (e.g., age group, income bracket)
• The cumulative logit link function assumes that the effect of going from 1 to 2 is the same as the effect of going from 2 to 3 (the proportional odds assumption)
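To make the proportional odds assumption concrete, here is a minimal Python sketch. It assumes the parameterization logit P(Y <= j) = alpha_j + beta * x; the intercepts and slope are made-up values for illustration, not estimates from any survey, and the sketch ignores weights and design variables entirely. It shows that a single beta yields the same cumulative odds ratio at every cutpoint.

```python
import math

def inv_logit(z):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def cumulative_probs(cutpoint_intercepts, beta, x):
    """P(Y <= j | x) at each cutpoint j, assuming logit P(Y <= j) = alpha_j + beta * x."""
    return [inv_logit(alpha + beta * x) for alpha in cutpoint_intercepts]

# Hypothetical parameters for a 3-level ordinal outcome (two cutpoints).
alphas = [-1.0, 0.5]
beta = 0.7

unexposed = cumulative_probs(alphas, beta, x=0)
exposed = cumulative_probs(alphas, beta, x=1)

# One odds ratio per cutpoint; under proportional odds they are all exp(beta).
cumulative_odds_ratios = [odds(p1) / odds(p0) for p0, p1 in zip(unexposed, exposed)]
for j, ratio in enumerate(cumulative_odds_ratios, start=1):
    print(f"cutpoint {j}: cumulative odds ratio = {ratio:.4f}")
```

Both cutpoints give the same ratio because the model has one slope shared across cutpoints; only the intercepts differ.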

Nominal
• Polytomous regression, generalized logit link function
• Race
• Diagnosis (depression versus anxiety versus substance use disorder)
• The generalized logit link function gives a separate estimate for each outcome category compared with a reference category, so the effect for category 2 can differ from the effect for category 3
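The contrast with the cumulative logit can be sketched the same way. The snippet below assumes a 3-category nominal outcome with category 0 as the reference and one intercept and slope per non-reference category; all numbers are hypothetical and survey weighting is ignored. Each non-reference category ends up with its own odds ratio.

```python
import math

def category_probs(intercepts, slopes, x):
    """Category probabilities under a generalized logit model.

    Category 0 is the reference (linear predictor fixed at 0); every other
    category k has its own intercept and slope.
    """
    etas = [0.0] + [a + b * x for a, b in zip(intercepts, slopes)]
    denom = sum(math.exp(eta) for eta in etas)
    return [math.exp(eta) / denom for eta in etas]

# Hypothetical parameters: one (intercept, slope) pair per non-reference category.
intercepts = [-0.5, -1.0]
slopes = [0.9, 0.2]

p_unexposed = category_probs(intercepts, slopes, x=0)
p_exposed = category_probs(intercepts, slopes, x=1)

# Odds ratio for each category versus the reference: exp(slope_k), different for each k.
odds_ratios_vs_reference = [
    (p_exposed[k] / p_exposed[0]) / (p_unexposed[k] / p_unexposed[0])
    for k in (1, 2)
]
print(odds_ratios_vs_reference)
```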

Categorizing your exposure
• Check assumptions regarding the functional form of the relationship between the exposure and the outcome
• E.g., the relationship between age and alcohol use disorders: we would not want to enter age as a continuous variable because we do not think age is linearly related to risk of alcohol use disorders
• If you decide to categorize a continuous variable, cutpoints are best chosen from literature precedent
• Relying on data-driven cutpoints makes your work hard to compare with other work in the literature
• If there is no precedent:
  • Use quartiles, or
  • Break the exposure into small categories and examine the relationship with the outcome in a regression model with no other covariates (on the log-odds scale if using logistic regression)
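As a rough illustration of the no-precedent case, the sketch below cuts a continuous exposure at its quartiles and computes the crude log-odds of the outcome in each category. The ages and outcomes are invented toy values, the variable names are hypothetical, and the calculation is unweighted, so it only shows the shape of the check rather than a design-based estimate.

```python
import math
from bisect import bisect_left
from statistics import quantiles

# Toy data (invented for illustration): age and a 0/1 outcome indicator.
ages = [18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63]
has_disorder = [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]

# Three cutpoints split the exposure into four quartile groups.
cutpoints = quantiles(ages, n=4)

def exposure_category(value):
    """Return 0-3: the quartile group that `value` falls into."""
    return bisect_left(cutpoints, value)

# counts[k] = [non-cases, cases] within quartile group k.
counts = {k: [0, 0] for k in range(4)}
for age, outcome in zip(ages, has_disorder):
    counts[exposure_category(age)][outcome] += 1

# Crude log-odds per group; assumes every group has at least one case and one non-case.
log_odds = {k: math.log(cases / noncases) for k, (noncases, cases) in counts.items()}
for k in sorted(log_odds):
    print(f"quartile group {k}: log-odds = {log_odds[k]:+.3f}")
```

If the log-odds rise or fall in roughly equal steps across groups, a linear term may be defensible; a jump or plateau suggests keeping the categories.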

Choosing covariates
• Most important: DO NOT SKIP THE GROUNDWORK!
  • Check associations with exposure and outcome
  • Check associations among covariates
  • Categorize the covariates appropriately
• When should something be evaluated as a moderator, and when should it be a confounder/covariate?
  • Most of the time, it is clear: do you think that the relationship between exposure and outcome will be the same across levels of the third variable, or do you think it will be different?
  • If you do not have an a priori hypothesis and are just trying to build a solid statistical model, try the variable as a moderator first. If the interaction is significant, leave it in as a moderator.
  • Because interaction terms are sometimes difficult to interpret on their own, consider fitting separate models within subsets (strata) of the moderator.
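The subset-model idea can be previewed with simple 2x2 tables before fitting anything. The sketch below uses invented counts for two strata of a hypothetical third variable and compares the stratum-specific odds ratios with the pooled (crude) one. It is unweighted and does not test the interaction; it only illustrates what differing effects across strata look like.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from the four cells of a 2x2 exposure-by-outcome table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)

# Invented counts: (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).
tables = {
    "stratum A": (30, 70, 10, 90),
    "stratum B": (20, 80, 20, 80),
}

# One odds ratio per stratum, as if a separate model were fit in each subset.
stratum_ors = {name: odds_ratio(*cells) for name, cells in tables.items()}

# Pooling the strata cell by cell gives the crude odds ratio.
pooled_cells = tuple(sum(cells) for cells in zip(*tables.values()))
crude_or = odds_ratio(*pooled_cells)

for name, value in stratum_ors.items():
    print(f"{name}: OR = {value:.2f}")
print(f"crude: OR = {crude_or:.2f}")
```

Here the exposure effect is strong in one stratum and absent in the other, which is the pattern that argues for treating the third variable as a moderator rather than adjusting for it as a covariate.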

LAB 3: Regression in SUDAAN
