Multiple Regression


Control of Confounding Variables • Randomization • Matching • Adjustment • Stratified methods: Direct, Indirect, Mantel-Haenszel • Multiple Regression: Linear, Logistic, Poisson, Cox

Limitations of the Stratified Methods • Can study only one independent variable at a time • Problematic when there are too many variables to adjust for (too many strata) • Limited to categorical variables (if continuous, can categorize, which may result in residual confounding)

How to Investigate Associations Between Variables? • Between two categorical variables: • Contingency table, odds ratio, χ2 • Between a categorical and a continuous variable: • Compare means, t test, ANOVA • Between two continuous variables • Example:

Relationship between air pollution and health status

[Figure: scatter plot of health status (y-axis) by pollution level (x-axis) in 20 geographic areas]

Suppose we now wish to know whether our two variables are linearly related • The question becomes: • Are the data we observed compatible with the two variables being linearly related? That is, • Is the true association between the two variables defined by a straight line, and the scatter we see just random error around the truth?

[Figure: scatter plot of health status (y-axis) by pollution level (x-axis) in 20 geographic areas; r = ?]

[Figure: scatter plot of health status (y-axis) by pollution level (x-axis) in 20 geographic areas; r ≈ 0.7]

Then, the next practical question in our evaluation of whether the relationship is linear: • How can the fit of the data to a straight line be measured? • Correlation Coefficient (Pearson): the extent to which the two variables vary together • Linear Regression Coefficient: most useful when we wish to know the strength of the association

Correlation Coefficient (Pearson) r: ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation) [Figure: three scatter plots illustrating r = 1.0, r = -0.8 and r = 0]

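Linear Regression Coefficient of a Straight Line
$y = \beta_0 + \beta_1 x$
• $\beta_1$ (linear regression coefficient): increase in y per unit increase in x • expresses the strength of the association • allows prediction of the value of y, given x
• $\beta_0$ (intercept): value of y when x = 0
[Figure: straight line with intercept $\beta_0$ at x = 0 and a rise of $\beta_1$ per 1-unit increase in x]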

The trick is to find the “line” ($\beta_0$, $\beta_1$) that best fits the observed data. In linear regression, the least squares approach estimates the line that minimizes the sum of the squared vertical distances between each point and the line.
Health status = $\beta_0 + \beta_1$ (pollution)
Health status = 30.8 + 0.71 (pollution)
[Figure: scatter plot of health status (Y) by pollution level (X)]
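A minimal sketch of the least squares calculation, assuming NumPy and a synthetic data set (the 20 values below are simulated for illustration and are not the data behind the slide). The closed-form estimates are compared with `np.polyfit` as a cross-check.

```python
import numpy as np

# Synthetic (made-up) data for 20 areas; NOT the data shown on the slide.
rng = np.random.default_rng(0)
pollution = np.linspace(10, 105, 20)
health = 30.8 + 0.71 * pollution + rng.normal(0, 8, size=20)

# Closed-form least squares estimates for a straight line:
# b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# b0 = mean_y - b1 * mean_x
x_bar, y_bar = pollution.mean(), health.mean()
b1 = np.sum((pollution - x_bar) * (health - y_bar)) / np.sum((pollution - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Cross-check against NumPy's polynomial fit (degree 1 = straight line).
slope_np, intercept_np = np.polyfit(pollution, health, 1)

print(f"Health status = {b0:.1f} + {b1:.2f} (pollution)")
```

Because the data are simulated with noise, the estimates will be close to, but not exactly, 30.8 and 0.71.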

Simple Linear Regression • The “points” (observations) can be individuals, or conglomerates of individuals (e.g., regions, countries, families) in ecologic studies. • When X is inversely related to Y, b ($\beta$) is negative. Note: when estimating $\beta$ from samples, the notation “b” is used instead of $\beta$.

In epidemiologic studies, the value of the intercept (b0 or $\beta_0$) is frequently irrelevant (X=0 is meaningless for many variables) • E.g., relationship of weight (X) to systolic blood pressure (Y): [Figure: scatter plot of SBP (mmHg) by weight (lb), with a “?” marking the region near weight = 0]

FUNDAMENTAL ASSUMPTION IN THE LINEAR MODEL: x and y are linearly related, i.e., the increase in y per unit increase of x ($\beta_1$) is constant across the entire range of x. E.g., the increase in health status index between pollution levels 40 and 50 is the same as that between pollution levels 90 and 100. [Figure: scatter plot of health status (Y) by pollution level (X)]

FUNDAMENTAL ASSUMPTION IN THE LINEAR MODEL: x and y are linearly related. However, if the data look like this: [Figure: scatter plot of y by x showing a “u-shaped” function] Wrong model!

BOTTOM LINE: LOOK AT THE DATA BEFORE YOU DECIDE ON THE BEST MODEL! • Plot $y_i$ vs. $x_i$ • If non-linear patterns are present: • Use quadratic terms (e.g., age²), logarithmic terms, e.g., log(x), etc. • Categorize and use dummy variables
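A small illustration of why plotting matters, assuming NumPy and a synthetic, noise-free u-shaped data set (all values are hypothetical). A straight line fitted to these data has a slope of essentially zero, whereas adding a quadratic term recovers the curve.

```python
import numpy as np

# Synthetic u-shaped data (illustrative only): y depends on (x - 5)^2.
x = np.linspace(0, 10, 21)
y = 2.0 + 0.5 * (x - 5.0) ** 2

# Straight-line fit: the slope is ~0 even though y clearly depends on x.
slope_linear, intercept_linear = np.polyfit(x, y, 1)

# Model with a quadratic term: y = b0 + b1*x + b2*x^2
# (np.polyfit returns coefficients from the highest power down).
b2, b1, b0 = np.polyfit(x, y, 2)

rss_linear = np.sum((y - (intercept_linear + slope_linear * x)) ** 2)
rss_quadratic = np.sum((y - (b0 + b1 * x + b2 * x ** 2)) ** 2)

print(f"linear slope: {slope_linear:.3f}, residual SS: {rss_linear:.1f}")
print(f"quadratic term b2: {b2:.3f}, residual SS: {rss_quadratic:.1e}")
```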

Other important points to keep in mind • Like any other “sample statistic”, b is subject to error. Formulas to calculate the standard error of b are available in most statistics textbooks. • “Statistical significance” of b (hypothesis testing): • H0: $\beta = 0$ → no linear association between x and y • H1: $\beta \neq 0$ → x and y are linearly related • Test statistic: Wald statistic (z-value) = b/SE(b) • WARNING: THIS TEST IS ONLY SENSITIVE FOR LINEAR ASSOCIATIONS. A NON-SIGNIFICANT RESULT DOES NOT IMPLY THAT x AND y ARE NOT ASSOCIATED, BUT MERELY THAT THEY ARE NOT LINEARLY ASSOCIATED. • Confidence interval (precision) for b: • 95% CI = b ± 1.96 × SE(b)
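The standard error, Wald statistic and confidence interval can be sketched as follows, assuming NumPy, simulated data, and the usual textbook formula for simple linear regression, SE(b) = sqrt(s² / Σ(x − x̄)²) with s² the residual variance on n − 2 degrees of freedom. The 1.96 multiplier follows the slide; with small samples a t distribution is often used instead.

```python
import numpy as np

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
x = np.linspace(10, 105, 20)
y = 30.8 + 0.71 * x + rng.normal(0, 8, size=x.size)

n = x.size
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

# Residual variance (n - 2 degrees of freedom) and standard error of b1.
residuals = y - (b0 + b1 * x)
s2 = np.sum(residuals ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / sxx)

# Wald statistic and 95% confidence interval, as on the slide.
z = b1 / se_b1
ci_low, ci_high = b1 - 1.96 * se_b1, b1 + 1.96 * se_b1

print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, z = {z:.1f}")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")
```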

The regression coefficient (b) is related to the correlation coefficient (r), but the former is generally preferable because: • It gives some sense of the strength of the association, not only the extent to which two variables vary concurrently in a linear fashion. • It allows prediction of Y as a function of X.

Note: functions having different slopes may have the same correlation coefficient [Figure: scatter plots with different slopes but the same correlation coefficient]
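A quick numerical check of this note, assuming NumPy and made-up values: multiplying y by 3 triples the slope but leaves Pearson's r unchanged.

```python
import numpy as np

# Hypothetical data: a straight line plus fixed, made-up deviations.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
deviations = np.array([0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3])
y_a = 1.0 + 0.5 * x + deviations
y_b = 3.0 * y_a          # same pattern, steeper line

r_a = np.corrcoef(x, y_a)[0, 1]
r_b = np.corrcoef(x, y_b)[0, 1]
slope_a = np.polyfit(x, y_a, 1)[0]
slope_b = np.polyfit(x, y_b, 1)[0]

print(f"r: {r_a:.3f} vs {r_b:.3f}; slope: {slope_a:.3f} vs {slope_b:.3f}")
```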

The equation $y = b_0 + b_1 x$ naturally extends to multiple variables (multidimensional space): $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

$y = b_0 + b_1 x_1 + b_2 x_2$
Multiple regression coefficients:
• b1: increment in y per unit increment in x1, after the effect of x2 on y and on x1 has been removed; that is, the effect of x1 on y, adjusted for x2
• b2: increment in y per unit increment in x2, after the effect of x1 on y and on x2 has been removed; that is, the effect of x2 on y, adjusted for x1
• b0: value of y when both x1 and x2 are equal to 0
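To illustrate what "adjusted for x2" means, here is a sketch using NumPy and simulated data in which x2 is related both to x1 and to y (all names and values are hypothetical). The crude coefficient for x1 is inflated by confounding, while the coefficient from the two-variable model is close to the value used in the simulation.

```python
import numpy as np

# Simulated data (illustrative only): x2 confounds the x1-y relationship.
rng = np.random.default_rng(2)
n = 500
x2 = rng.normal(size=n)                        # confounder
x1 = 0.8 * x2 + rng.normal(size=n)             # exposure, correlated with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Crude model: y = b0 + b1*x1
X_crude = np.column_stack([np.ones(n), x1])
b_crude, *_ = np.linalg.lstsq(X_crude, y, rcond=None)

# Adjusted model: y = b0 + b1*x1 + b2*x2
X_adj = np.column_stack([np.ones(n), x1, x2])
b_adj, *_ = np.linalg.lstsq(X_adj, y, rcond=None)

print(f"crude b1:    {b_crude[1]:.2f}")
print(f"adjusted b1: {b_adj[1]:.2f} (simulated with 2.0)")
```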

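Model without interaction: $y = b_0 + b_1 x_1 + b_2 x_2$, where y: lung cancer; x1 = 1 if exposed (smoker), x1 = 0 if unexposed (non-smoker); x2 = 1 if the confounder is present (drinker), x2 = 0 if the confounder is absent (non-drinker)
• Confounder present (x2 = 1): exposed, $y = b_0 + b_1 + b_2$; unexposed, $y = b_0 + b_2$; ARexp = $b_1$
• Confounder absent (x2 = 0): exposed, $y = b_0 + b_1$; unexposed, $y = b_0$; ARexp = $b_1$
Same! (no interaction)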

$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$, where $b_3 x_1 x_2$ is the interaction term
Multiple regression coefficients:
• b1: increment in y per unit increment in x1 in individuals not exposed to x2
• b2: increment in y per unit increment in x2 in individuals not exposed to x1
• b3: additional increment in y in the joint presence of x1 and x2, beyond the sum of their separate effects (b1 + b2); compared to individuals not exposed to x1 and x2, those exposed to both differ by b1 + b2 + b3
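A worked example with two dichotomous variables, assuming NumPy and four hypothetical group means. With a product term, the model reproduces the four means exactly, so each coefficient can be read directly off the means.

```python
import numpy as np

# Hypothetical mean of y in each of the four exposure groups.
x1 = np.array([0, 1, 0, 1], dtype=float)
x2 = np.array([0, 0, 1, 1], dtype=float)
y = np.array([10.0, 15.0, 12.0, 25.0])

# Design matrix: intercept, x1, x2 and the product (interaction) term.
X = np.column_stack([np.ones(4), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.solve(X, y)

print(f"b0 = {b0:.0f}  (mean of y when x1 = 0 and x2 = 0)")
print(f"b1 = {b1:.0f}  (effect of x1 when x2 = 0)")
print(f"b2 = {b2:.0f}  (effect of x2 when x1 = 0)")
print(f"b3 = {b3:.0f}  (excess over b1 + b2 when both are present)")
```

Here the joint effect, 25 − 10 = 15, exceeds the sum of the separate effects, 5 + 2 = 7, by b3 = 8, which is what "more than additive" means in this model.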

Multiple Linear Regression Notes • To obtain least squares estimates of the b’s, one needs matrix algebra…or computers! • Important assumptions: • Linearity • No (additive) interaction, i.e., • The absolute effect of x1 is independent of x2, or • The effects of x1 and x2 are merely additive (i.e., not “less, or more than additive”) • NOTE: if there is interaction, product terms can be introduced in the model to account for it (it is, however, better to do stratified analysis) • Other assumptions: • Observations (i’s) are independent • Homoscedasticity: variance of y is constant across x-values • Normality: for a given value of x, values of y are normally distributed

In Linear Regression (simple or multiple), Independent Variables (x’s) can be: • Continuous • Pollution level (score) • BMI (kg/m2) • Blood pressure (mmHg) • Age (years) • Categorical • Dichotomous (conventionally, one of the values is coded as “1” and the other, as “0”) • Gender (male/female) • Treatment (yes/no) • Smoking (yes/no) • Ordinal • Any continuous variables categorized in percentiles (tertiles, quartiles, etc)

In Linear Regression (simple or multiple), the Dependent Variable (y) can be: • Discrete (yes/no) • Incident cancer • Recurrent cancer • Continuous • Systolic blood pressure (mmHg) • Serum cholesterol (mg/dL) • BMI (kg/m2)

Example of x as a discrete variable (obesity) and y as a continuous variable (systolic blood pressure, mmHg)
$\text{SBP} = b_0 + b_1 x$, where x = 0 if obesity “no”; x = 1 if obesity “yes”
• When x = 0, $\text{SBP} = b_0$
• When x = 1, $\text{SBP} = b_0 + b_1$
• The unit increase in x is from zero to 1
Thus, b1 (regression coefficient or slope) = increase in SBP per unit increase in obesity = average difference in SBP between “obese” and “non-obese” individuals
[Figure: SBP (mmHg, 110 to 160) by obesity (x = 0, x = 1), with b1 shown as the difference between the two group averages]
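This equivalence is easy to verify numerically. The sketch below assumes NumPy and ten made-up SBP values; the fitted intercept equals the mean SBP in the non-obese group and the slope equals the difference between the two group means.

```python
import numpy as np

# Hypothetical data: obesity (0 = no, 1 = yes) and SBP in mmHg.
obese = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
sbp = np.array([118, 122, 125, 130, 120, 135, 142, 150, 138, 145], dtype=float)

# Least squares fit of SBP = b0 + b1 * obese.
b1, b0 = np.polyfit(obese, sbp, 1)

mean_non_obese = sbp[obese == 0].mean()
mean_obese = sbp[obese == 1].mean()

print(f"b0 = {b0:.1f}, mean SBP in non-obese = {mean_non_obese:.1f}")
print(f"b1 = {b1:.1f}, difference in means = {mean_obese - mean_non_obese:.1f}")
```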

Example of x as a discrete variable with more than 2 categories (e.g., educational level) and y as a continuous variable (systolic blood pressure, mmHg) • Ordinal variables (x’s) can be entered into the regression equation as single x’s. Example: $y = b_0 + b_1 x_1$ • Where education (x1) is categorized into “low”, “medium” and “high”. • Thus, x1 = 1 when “low”, x1 = 2 when “medium” and x1 = 3 when “high”

[Figure: SBP (mmHg, 110 to 160) by educational level (Low, Medium, High), with the same decrement b1 between adjacent levels] HOWEVER, the model assumes that the difference in SBP (decrease) between “low” (x1 = 1) and “medium” (x1 = 2) is the same as that between “medium” (x1 = 2) and “high” (x1 = 3): assumption of linearity. Alternative: it’s coming!

Non-ordinal multilevel categorical variable • Race (Asian, Black, Hispanic, White) • Treatment (A, B, C, D) • Smoking (cigarette, pipe, cigar, nonsmoker) How to include these variables in a multiple regression model? “Dummy” or indicator variables: Define the number of dummy dichotomous variables as the number of categories minus one

Use of dummy variables
Example: “Race” categorized as Asian, Black, Hispanic and White. Thus, to model “race”:
$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$
Where x1 = 1 if Asian, x1 = 0 if otherwise; x2 = 1 if Black, x2 = 0 if otherwise; x3 = 1 if Hispanic, x3 = 0 if otherwise
Thus, what is the interpretation of b0, b1, b2, and b3?

Definitions of Dummy Variables • b0 = average value of y in Whites (reference category) • b1 = average difference in y between Asians and Whites • b2 = average difference in y between Blacks and Whites • b3 = average difference in y between Hispanics and Whites

Use of dummy variables when the function is not a straight line

[Figure: SBP (mmHg, 110 to 160) by BMI quintile (1 to 5), showing a non-linear pattern] WRONG MODEL!!!

Model: $\text{SBP} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$
Where x1 = 1 if BMI quintile = 2, x1 = 0 if otherwise; x2 = 1 if BMI quintile = 3, x2 = 0 if otherwise; x3 = 1 if BMI quintile = 4, x3 = 0 if otherwise; x4 = 1 if BMI quintile = 5, x4 = 0 if otherwise
[Figure: SBP (mmHg, 110 to 160) by BMI quintile (1 to 5)]
Note: each b represents the difference between each quintile (2, 3, 4 and 5) and the reference quintile (quintile 1). Thus, the difference is negative for 2, slightly negative for 3, and positive for 4 and 5. Can also obtain the difference between quintiles: for example, b4 – b3 is the difference between quintiles 5 and 4
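A sketch of dummy (indicator) coding, assuming NumPy and ten made-up SBP values arranged so that the pattern across quintiles is not a straight line. Quintile 1 is the reference category, so b0 is its mean and each remaining b is a difference from it.

```python
import numpy as np

# Hypothetical data: BMI quintile (1 to 5) and SBP in mmHg.
quintile = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
sbp = np.array([128, 132, 118, 122, 126, 130, 138, 142, 148, 152], dtype=float)

# One indicator per non-reference quintile (number of categories minus one).
dummies = [(quintile == q).astype(float) for q in (2, 3, 4, 5)]
X = np.column_stack([np.ones(quintile.size)] + dummies)

b, *_ = np.linalg.lstsq(X, sbp, rcond=None)
b0, b1, b2, b3, b4 = b

print(f"b0 (mean SBP in quintile 1): {b0:.1f}")
print(f"b1..b4 (differences from quintile 1): {b1:.1f}, {b2:.1f}, {b3:.1f}, {b4:.1f}")
print(f"b4 - b3 (quintile 5 vs quintile 4): {b4 - b3:.1f}")
```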

Multiple linear regression models of leukocyte count (thousands/mm³) by selected factors, in never smokers, ARIC study, 1986-89 (Nieto et al, AJE 1992;136:525-37)
*Model 1: adjusted for center, education, height, apolipoprotein A-I, glucose and for the other variables shown in the table.
**Model 2: adjusted for the same variables included in Model 1 plus hemoglobin, platelet, uric acid, insulin, HDL, apolipoprotein B, triglycerides, factor VIII, fibrinogen, antithrombin III, protein C antigen and APTT.

Control of Confounding Variables • Random allocation • Matching • Individual • Frequency • Restriction • Adjustment • Direct • Indirect • Mantel-Haenszel • MULTIPLE REGRESSION • Linear model • LOGISTIC MODEL

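AN ALTERNATIVE TO THE LINEAR MODEL when the dependent variable is dichotomous (1/0)
The probability of disease (y) given exposure (x): $P(y \mid x) = \dfrac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$
Or, simplifying: $P = \dfrac{1}{1 + e^{-(b_0 + b_1 x)}}$
[Figure: S-shaped curve of the probability of response (P, from 0 to 1.0) by dose (x)]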
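The behaviour of the logistic function can be sketched as follows, assuming NumPy and hypothetical coefficients. The logistic probabilities always stay between 0 and 1, a straight-line model for a probability can leave that range, and the log of the odds is a linear function of x.

```python
import numpy as np

def logistic(x, b0, b1):
    """Probability of response under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# Hypothetical coefficients and dose values (illustrative only).
b0, b1 = -4.0, 0.8
dose = np.linspace(0, 10, 11)

p_logistic = logistic(dose, b0, b1)
p_linear = -0.2 + 0.15 * dose        # a hypothetical linear model for P

# On the log(odds) scale the logistic model is a straight line.
log_odds = np.log(p_logistic / (1.0 - p_logistic))

print("logistic P range:", p_logistic.min().round(3), "to", p_logistic.max().round(3))
print("linear   P range:", p_linear.min().round(3), "to", p_linear.max().round(3))
```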

EXPONENTS AND LOGARITHMS: Brief Review Notation: (Note: In most epidemiologic literature, lnA is written as logA)

Logs: Brief Review (Cont.) Example: $100 = 1/0.01 = 1/10^{-2} = 10^{2} = 100$
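The formulas on the original review slides did not survive extraction. As a reminder, these are the standard identities that the logistic model relies on (for a, b > 0 in the logarithm rules); they are general facts about exponents and natural logarithms, not a reconstruction of the slide itself.

```latex
\begin{align*}
e^{a}\,e^{b} &= e^{a+b}, & \frac{e^{a}}{e^{b}} &= e^{a-b},\\
\ln(ab) &= \ln a + \ln b, & \ln\!\left(\frac{a}{b}\right) &= \ln a - \ln b,\\
\ln\!\left(e^{a}\right) &= a, & e^{\ln a} &= a.
\end{align*}
```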

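Remember that: $\log(\text{Odds}) = \log\!\left(\dfrac{P}{1-P}\right) = b_0 + b_1 x$
[Figure: straight line of log(Odds) by x, with intercept b0 and a rise of b1 per unit increment in x]
b1 = increment in log(Odds) per unit increment in x. Thus, b1 is the log of the Odds Ratio!!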

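Assume prospective data in which exposure (independent variable) is defined dichotomously (x):
For exposed (x=1): $\log(\text{Odds}_{exp}) = b_0 + b_1$
For unexposed (x=0): $\log(\text{Odds}_{unexp}) = b_0$
Thus: $b_1 = \log(\text{Odds}_{exp}) - \log(\text{Odds}_{unexp}) = \log(OR)$, and $OR = e^{b_1}$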
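A worked example with a hypothetical cohort (30 cases among 100 exposed, 10 cases among 100 unexposed), using only the Python standard library. With a single dichotomous exposure, the coefficients follow directly from the group odds.

```python
import math

# Hypothetical cohort counts (illustrative only).
cases_exp, n_exp = 30, 100
cases_unexp, n_unexp = 10, 100

odds_exp = cases_exp / (n_exp - cases_exp)          # 30 / 70
odds_unexp = cases_unexp / (n_unexp - cases_unexp)  # 10 / 90

b0 = math.log(odds_unexp)            # log(odds) in the unexposed
b1 = math.log(odds_exp) - b0         # difference in log(odds) = log(OR)
odds_ratio = math.exp(b1)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, OR = exp(b1) = {odds_ratio:.2f}")
```

The result matches the cross-product odds ratio from the 2 × 2 table, (30 × 90) / (70 × 10).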

WITH CASE-CONTROL DATA: • Intercept (b0) is uninterpretable • Can obtain unbiased estimates of the regression coefficient (b1) (See Schlesselman, pp. 235-7)

The logistic model extends to the multivariate situation: $\log(\text{Odds}) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
Interpretation of multiple logistic regression coefficients:
• Dichotomous x: b1 = log(OR) for x=1 compared to x=0, after adjustment for the remaining x’s
• Continuous x: b1 = log(OR) for an increment of 1 unit in x, after adjustment for the remaining x’s
• Thus: 10 × b1 = log(OR) for an increment of 10 units of x, after adjustment for the remaining x’s
CAUTION: Assumes linear increase in the log(OR) throughout the entire range of x values
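For a continuous x, rescaling the odds ratio is simple arithmetic. The sketch below uses a hypothetical coefficient of 0.05 per unit of x (for instance, per year of age) and only the Python standard library.

```python
import math

# Hypothetical logistic regression coefficient per 1-unit increment in x.
b1 = 0.05

or_per_1_unit = math.exp(b1)
or_per_10_units = math.exp(10 * b1)

print(f"OR per 1 unit:   {or_per_1_unit:.3f}")
print(f"OR per 10 units: {or_per_10_units:.3f}")
```

Because the model is linear on the log(OR) scale, the odds ratio for 10 units equals the 1-unit odds ratio raised to the 10th power, not multiplied by 10.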

Logistic Regression Using Dummy Variables: Cross-Sectional Association Between Demographic Factors and Depressive State, NHANES, Mexican-Americans Aged 20-74 Years, 1982-4

Generalized Linear Models

Logistic Regression Notes • Popularity of logistic regression results from its predictive ability (predicted probabilities above 1.0 or below 0 are impossible with this model). • Least squares solution for logistic regression does not work. Need maximum likelihood estimates…i.e., computers! • 95% confidence limits for the Odds Ratio: $e^{\,b_1 \pm 1.96 \times SE(b_1)}$
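A minimal sketch of maximum likelihood estimation by Newton-Raphson, assuming NumPy and a hypothetical cohort (30 cases among 100 exposed, 10 cases among 100 unexposed). Statistical packages use refinements of the same idea; this version is only meant to show where the estimates and the confidence limits for the odds ratio come from.

```python
import numpy as np

# Hypothetical cohort: 100 exposed (30 cases) and 100 unexposed (10 cases).
x = np.repeat([1.0, 0.0], [100, 100])
y = np.concatenate([np.repeat([1.0, 0.0], [30, 70]),
                    np.repeat([1.0, 0.0], [10, 90])])
X = np.column_stack([np.ones_like(x), x])   # intercept and exposure

# Newton-Raphson iterations for the maximum likelihood estimates.
b = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ b))                 # fitted probabilities
    gradient = X.T @ (y - p)                         # score vector
    information = X.T @ (X * (p * (1 - p))[:, None]) # information matrix
    step = np.linalg.solve(information, gradient)
    b = b + step
    if np.max(np.abs(step)) < 1e-10:
        break

# Standard errors from the inverse of the information matrix.
se = np.sqrt(np.diag(np.linalg.inv(information)))

odds_ratio = np.exp(b[1])
ci_low = np.exp(b[1] - 1.96 * se[1])
ci_high = np.exp(b[1] + 1.96 * se[1])

print(f"b1 = {b[1]:.3f}, SE(b1) = {se[1]:.3f}")
print(f"OR = {odds_ratio:.2f}, 95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

With one dichotomous exposure, the fitted odds ratio coincides with the cross-product ratio of the 2 × 2 table, which makes the result easy to check by hand.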
