Interactions
• Interaction: Does the relationship between two variables depend on a third variable?
• Does the relationship of age to BP depend on gender?
• Does a certain BP-lowering drug work as well in blacks as in non-blacks?
• Does the relationship between education and income differ by region of the country?
• Sometimes called “effect modification”
Model for FEV Example
Y = b0 + b1X1 + b2X2
X1 = smoking status (1 = smoker, 0 = non-smoker)
X2 = age
Smokers: FEV = b0 + b1 + b2 age
Non-smokers: FEV = b0 + b2 age
FEV (smokers) – FEV (non-smokers) = b1
Assumes the slope of age is the same for smokers and non-smokers
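[Figure: FEV (vertical axis) versus AGE (horizontal axis), one line for non-smokers and one for smokers. Both lines have slope b2 and are separated vertically by b1.]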
Modeling Interaction for FEV Example
Y = b0 + b1X1 + b2X2 + b3X3
X1 = smoking status (1 = smoker, 0 = non-smoker)
X2 = age
X3 = age × smoking status
Smokers: FEV = b0 + b1 + (b2 + b3) age
Non-smokers: FEV = b0 + b2 age
FEV (smokers) – FEV (non-smokers) = b1 + b3 age
Ho: b3 = 0
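Note: A difference in slopes implies that the smoker/non-smoker difference depends on age (and vice versa)
[Figure: FEV (vertical axis) versus AGE (horizontal axis) with non-parallel lines. Non-smokers have slope b2, smokers have slope b2 + b3, and the vertical gap between the lines at a given age is b1 + b3 age.]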
DATA fev;
  INFILE DATALINES;
  INPUT age smk fev;
  agesmk = age*smk;
  DATALINES;
28 1 4.0
30 1 3.9
30 1 3.7
31 1 3.6
PROC REG;
  MODEL fev = age;
  PLOT fev*age;
  WHERE smk=0;
  TITLE 'Non-smokers';
RUN;
PROC REG;
  MODEL fev = age;
  PLOT fev*age;
  WHERE smk=1;
  TITLE 'Smokers';
RUN;
SMOKERS
                   Parameter   Standard
Variable     DF     Estimate      Error   t Value   Pr > |t|
Intercept     1      5.50002    0.36163     15.21     <.0001
age           1     -0.05508    0.00885     -6.22     <.0001

NON-SMOKERS
                   Parameter   Standard
Variable     DF     Estimate      Error   t Value   Pr > |t|
Intercept     1      5.24764    0.38050     13.79     <.0001
age           1     -0.03911    0.00887     -4.41     0.0007

B1 for smokers = -0.05508
B1 for non-smokers = -0.03911
Are these two slopes statistically different from each other?
PROC REG;
  MODEL fev = age smk agesmk;
RUN;

                   Parameter   Standard
Variable     DF     Estimate      Error   t Value   Pr > |t|
Intercept     1      5.24764    0.37846     13.87     <.0001
age           1     -0.03911    0.00882     -4.43     0.0002
smk           1      0.25238    0.52482      0.48     0.6346
agesmk        1     -0.01597    0.01253     -1.27     0.2138

Interpretation:
B(agesmk) = -0.01597 is the difference in slopes between smokers and non-smokers
B(age) = -0.03911 is the slope for non-smokers (smk=0)

SMOKERS
Intercept     1      5.50002    0.36163     15.21     <.0001
age           1     -0.05508    0.00885     -6.22     <.0001
NON-SMOKERS
Intercept     1      5.24764    0.38050     13.79     <.0001
age           1     -0.03911    0.00887     -4.41     0.0007
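As a check on this interpretation, substituting smk = 0 and smk = 1 into the fitted interaction model recovers the two separate regressions (all estimates are copied from the output above):

```latex
\begin{align*}
\text{Non-smokers } (smk = 0):\quad \widehat{FEV} &= 5.24764 - 0.03911\,\text{age} \\
\text{Smokers } (smk = 1):\quad \widehat{FEV} &= (5.24764 + 0.25238) + (-0.03911 - 0.01597)\,\text{age} \\
  &= 5.50002 - 0.05508\,\text{age} \\
\widehat{FEV}_{\text{smokers}} - \widehat{FEV}_{\text{non-smokers}} &= 0.25238 - 0.01597\,\text{age}
\end{align*}
```

The test of Ho: b3 = 0 is the t test on agesmk (t = -1.27, p = 0.2138), so these data do not give significant evidence that the two slopes differ.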
Polynomial Regression: Adding Quadratic Term
Y = b0 + b1X + b2X²
• Can be used if a linear relationship does not hold
• Example: alcohol intake and mortality
• Example: cholesterol and mortality
• Add a quadratic (squared) term
• Can test the hypothesis that the quadratic term is needed
• Ho: b2 = 0
• Ha: b2 ≠ 0
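A minimal SAS sketch of these steps, assuming a hypothetical dataset `mydata` with outcome `y` and predictor `x` (these names are placeholders, not taken from the slides). The MODEL statement in PROC REG takes variable names only, so the squared term is created in a DATA step first. The t test printed for `xsq` is then the test of Ho: b2 = 0.

```sas
/* Hypothetical names: mydata, mydata2, y, x and xsq are placeholders. */
DATA mydata2;
  SET mydata;
  xsq = x * x;        /* quadratic (squared) term */
RUN;

PROC REG DATA = mydata2;
  MODEL y = x xsq;    /* t test on xsq tests Ho: b2 = 0 */
RUN;
```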
Adding Quadratic Term
PLOT mvo2kg*ffbw predicted.*ffbw / OVERLAY;
PROC REG DATA = physfit;
  MODEL mvo2kg = ffbw;

Analysis of Variance
                          Sum of          Mean
Source             DF    Squares        Square   F Value   Pr > F
Model               1      22211         22211      3.33   0.0724
Error              69     460225    6669.93228
Corrected Total    70     482436

Root MSE           81.66965    R-Square   0.0460
Dependent Mean    455.26761    Adj R-Sq   0.0322
Coeff Var          17.93882

Variable     DF     Estimate         SE   t Value   Pr > |t|
Intercept     1    382.51711   41.02856      9.32     <.0001
ffbw          1      0.17710    0.09705      1.82     0.0724
PROC REG DATA = physfit;
  MODEL mvo2kg = ffbw;
  MODEL mvo2kg = ffbw ffbw2;

Analysis of Variance
                          Sum of          Mean
Source             DF    Squares        Square   F Value   Pr > F
Model               2     113179         56589     10.42   0.0001
Error              68     369257    5430.25411
Corrected Total    70     482436

Root MSE           73.69026    R-Square   0.2346
Dependent Mean    455.26761    Adj R-Sq   0.2121
Coeff Var          16.18614

Parameter Estimates
                   Parameter     Standard
Variable     DF     Estimate        Error   t Value   Pr > |t|
Intercept     1    980.95393    150.82611      6.50     <.0001
ffbw          1     -2.68220      0.70406     -3.81     0.0003
ffbw2         1      0.00322   0.00078761      4.09     0.0001

ffbw2 = ffbw * ffbw (computed in a DATA step)
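Two arithmetic checks on the linear and quadratic fits, using only the printed sums of squares. With a single added term, the partial F statistic for that term equals the square of its t statistic (up to rounding), and R-Square is the Model sum of squares divided by the Corrected Total:

```latex
\begin{align*}
F_{\text{ffbw2}} &= \frac{(113179 - 22211)/1}{5430.25411}
                  = \frac{90968}{5430.25411} \approx 16.75
                  \approx (4.09)^2 \\[4pt]
R^2_{\text{linear}} &= \frac{22211}{482436} \approx 0.0460 \\[4pt]
R^2_{\text{quadratic}} &= \frac{113179}{482436} \approx 0.2346
\end{align*}
```

Adding the squared term raises R-Square from 0.0460 to 0.2346, and the test of Ho: b2 = 0 rejects (p = 0.0001).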
Model Selection
• Many predictors are measured; how do you decide which to include in your model?
• Depends on the reason for fitting the model
• Prediction? Examining specific effects?
• Statistical criteria exist, but they should not be used in place of scientific criteria
• Best used in an exploratory context
Statistical principles to use
• Forward, backward, and stepwise selection
• Compare p-values of terms; add/remove based on α = 0.05 or 0.10
• R² methods
• Look for models with the highest R²
• Other methods exist
Possible Uses for Statistical Criteria
• Outcome: Measure of teenage drinking
• Many possible predictors
• Questionnaire on relationships, friends, family, church support, etc.
• Outcome: Echocardiographically determined hypertrophy of the heart
• Many possible ECG predictors
• Computer measurements from ECG
Backward selection procedure
Removes the worst variable, then the second worst, etc.

PROC REG DATA = physfit;
  MODEL mvo2kg = male age hgt wgt ffbw rhr / selection=backward;
RUN;

Final model:
              Parameter    Standard
Variable       Estimate       Error    Type II SS   F Value   Pr > F
Intercept     574.86126    56.50900        167151    103.49   <.0001
male           88.90825    12.02381         88312     54.68   <.0001
age            -6.85862     3.80692    5242.56660      3.25   0.0762
wgt            -6.00865     1.02203         55827     34.56   <.0001
ffbw            0.75073     0.12729         56184     34.79   <.0001
rhr            -0.79442     0.41916    5801.82822      3.59   0.0625
Forward selection procedure
Starts with the best single variable, then adds the next best, etc.

PROC REG DATA = physfit;
  MODEL mvo2kg = male age hgt wgt ffbw rhr / selection=forward;
RUN;

• In this example, forward selection ends up including all terms except height
• Exactly the same model as the one picked by backward selection
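The entry and removal thresholds can be stated explicitly rather than left at their defaults. The sketch below assumes the `SLENTRY=` and `SLSTAY=` options of the PROC REG MODEL statement and uses 0.10 purely as an illustrative cutoff; the slides do not say which thresholds were used.

```sas
/* Illustrative thresholds only; 0.10 is an assumption, not from the slides. */
PROC REG DATA = physfit;
  /* Forward: a variable enters only if it is significant at SLENTRY */
  MODEL mvo2kg = male age hgt wgt ffbw rhr
        / SELECTION = FORWARD SLENTRY = 0.10;
  /* Backward: a variable is removed if it is not significant at SLSTAY */
  MODEL mvo2kg = male age hgt wgt ffbw rhr
        / SELECTION = BACKWARD SLSTAY = 0.10;
  /* Stepwise: uses both thresholds */
  MODEL mvo2kg = male age hgt wgt ffbw rhr
        / SELECTION = STEPWISE SLENTRY = 0.10 SLSTAY = 0.10;
RUN;
```

With a stricter entry threshold, forward and backward selection need not arrive at the same final model.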
“MAXR” method
Selects several models based on maximal R²

PROC REG DATA = physfit;
  MODEL mvo2kg = male age hgt wgt ffbw rhr / selection=maxr;
RUN;

• Gives the “best” models with 1, 2, 3, ... terms
• You choose the best overall among the “best”
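Because R² cannot decrease when a term is added, the largest model always wins on R² alone. The Adj R-Sq values printed in the earlier mvo2kg outputs correct for model size. A sketch of the standard adjustment, taking n = 71 observations (Corrected Total DF + 1) and p predictors, reproduces those printed values:

```latex
\begin{align*}
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \\[4pt]
\text{Linear } (p = 1):\quad
  R^2_{\text{adj}} &= 1 - (1 - 0.0460)\,\frac{70}{69} \approx 0.0322 \\[4pt]
\text{Quadratic } (p = 2):\quad
  R^2_{\text{adj}} &= 1 - (1 - 0.2346)\,\frac{70}{68} \approx 0.2121
\end{align*}
```

When comparing the “best” models of different sizes, the adjusted value is a fairer yardstick than raw R².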
Two general principles to use
• Parsimony: less is more
• Common sense
• Don’t use social security number to predict height!
• Cautionary note
• Models with several variables are often not as good at predicting new data as the fitted model might suggest.