Experimental Statistics - week 13

Multiple Regression: Miscellaneous Topics

Setting: We have a dependent variable Y and several candidate independent variables. Question: Should we use all of them?

Why do we run Multiple Regression?
1. Obtain estimates of individual coefficients in a model (+ or -, etc.)
2. Screen variables to determine which have a significant effect on the model
3. Arrive at the most effective (and efficient) prediction model

The problem: Collinearity among the independent variables
-- high correlation between 2 independent variables
-- one independent variable nearly a linear combination of other independent variables
-- etc.

Effects of Collinearity
• parameter estimates are highly variable and unreliable
  - parameter estimates may even have the opposite sign from what is reasonable
• may have a significant F but none of the t-tests are significant

Variable Selection Techniques
Techniques for “being careful” about which variables are put into the model

Variable Selection Procedures
• Forward Selection
• Backward Elimination
• Stepwise
• Best Subset

Multiple Regression: Analysis Suggestions
1. Include only variables that make sense
2. Force important variables into the model
3. Be wary of variable selection results
   - especially forward selection
4. Examine pairwise correlations among variables
5. Examine pairwise scatterplots among variables
   - identify nonlinearity
   - identify unequal variance problems
   - identify possible outliers
6. Try transformations of variables for
   - correcting nonlinearity
   - stabilizing the variances
   - inducing normality of residuals
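As a small illustration of suggestion 4, the sketch below computes pairwise Pearson correlations in Python (used here for illustration in place of SAS). It uses only the six patient records that are printed in the survival data listing later in these notes, not the full 108-patient data set, so the values are illustrative only. The helper name `pearson` is an assumption of this sketch.

```python
from math import sqrt

# First six records printed in the survival data listing
# (clot, prog, enzyme, liver); the full study has 108 patients.
records = [
    (6.7, 62, 81, 2.59),
    (5.1, 59, 66, 1.70),
    (7.4, 57, 83, 2.16),
    (6.5, 73, 41, 2.01),
    (7.8, 65, 115, 4.30),
    (5.8, 38, 72, 1.42),
]
names = ["clot", "prog", "enzyme", "liver"]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


columns = list(zip(*records))  # one tuple per variable
corr = [[pearson(columns[i], columns[j]) for j in range(4)] for i in range(4)]

# Print the correlation matrix; large off-diagonal values flag possible collinearity.
print("        " + "".join(f"{name:>8}" for name in names))
for name, row in zip(names, corr):
    print(f"{name:>8}" + "".join(f"{r:8.3f}" for r in row))
```

Off-diagonal entries close to +1 or -1 would be a warning sign before putting both variables into the same model.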

SPSS Output from INFANT Data Set

SPSS Output from CAR Data Set

Examples of Nonlinear Data “Shapes” and Linearizing Transformations

Exponential Transformation (Log-Linear): original model and its transformed (linear) form, with curve shapes shown for $\beta_1 > 0$ and $\beta_1 < 0$

Transformed Multiplicative Model (Log-Log)

Square Root Transformation: curve shapes shown for $\beta_1 > 0$ and $\beta_1 < 0$
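For reference, the usual textbook forms of these three models are sketched below. The exact parameterization used on the original slides is an assumption here; in particular, some texts write the exponential model as $Y = \beta_0 e^{\beta_1 x}\varepsilon$ instead.

```latex
\begin{align*}
\text{Exponential (log-linear):}\quad
  & Y = e^{\beta_0 + \beta_1 x}\,\varepsilon
  && \Longrightarrow\quad \ln Y = \beta_0 + \beta_1 x + \ln\varepsilon \\
\text{Multiplicative (log-log):}\quad
  & Y = \beta_0\, x^{\beta_1}\,\varepsilon
  && \Longrightarrow\quad \ln Y = \ln\beta_0 + \beta_1 \ln x + \ln\varepsilon \\
\text{Square root:}\quad
  & Y = \beta_0 + \beta_1 \sqrt{x} + \varepsilon
  && \text{(already linear in } \sqrt{x}\text{)}
\end{align*}
```

In each case the transformed model is linear in its parameters, so it can be fit with ordinary regression on the transformed variables.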

Note:
- transforming Y using the log or square root transformation can help with unequal variance problems
- these transformations may also help induce normality

Scatterplot panels: hmpg vs hp; hmpg vs sqrt(hp); log(hmpg) vs hp; log(hmpg) vs log(hp)

Polynomial Regression:
- basically a multiple regression where the independent variables are powers of a single independent variable
- use SAS to compute the independent variables $x^2, x^3, \ldots, x^p$
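A minimal sketch of the idea in Python (used here for illustration in place of SAS): build the columns $1, x, x^2, \ldots, x^p$ and fit them as an ordinary multiple regression by solving the normal equations. The data are hypothetical and noise-free (generated from $y = 1 + 2x + 3x^2$), and the helper names `solve` and `polyfit` are assumptions of this sketch.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def polyfit(x, y, degree):
    """Least-squares polynomial fit: regress y on 1, x, x^2, ..., x^degree."""
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y


# Hypothetical, noise-free data from y = 1 + 2x + 3x^2
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

coef = polyfit(x, y, 2)
print([round(c, 6) for c in coef])  # estimates of the intercept, x and x^2 coefficients
```

Because the powers of x are usually highly correlated with one another, polynomial models are a common source of the collinearity problems described earlier.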

Outlier Detection
- there are tests for outliers
- throwing away outliers should technically be done only when there is evidence that the values “do not belong”

Use of Dummy Variables in Regression

Example 6.1, Text pages 268-269
Does a drug retain its potency after 1 year of storage?
2 groups: 1) fresh product 2) product stored for 1 year
n = 10 observations from each group (independent samples)

Fresh    Stored
10.2     9.8
10.5     9.6
. . .    . . .

Variable measured is the potency reading.
Question: How would you compare the groups?

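We want to test: $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$
We could use:
- independent groups t-test
- 1-factor ANOVA (with 2 levels of the factor)

1-Factor ANOVA Model: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where $\mu + \alpha_1$ = mean of fresh product
      $\mu + \alpha_2$ = mean of 1-year old product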

data ott269;
  input type $ y;
  datalines;
F 10.2
F 10.5
F 10.3
F 10.8
F 9.8
. . .
S 9.6
S 9.8
S 9.9
;
proc glm;
  class type;
  model y=type;
  means type/lsd;
  title 'ANOVA -- Potency Data - page 269 (t-test)';
run;

ANOVA -- Potency Data - page 269 (t-test)

The GLM Procedure

Class Level Information
Class    Levels    Values
type          2    F S

Dependent Variable: y

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1.45800000     1.45800000      17.95    0.0005
Error              18        1.46200000     0.08122222
Corrected Total    19        2.92000000

R-Square    Coeff Var    Root MSE    potency Mean
0.499315     2.821734    0.284995        10.10000

Source    DF      Type I SS    Mean Square    F Value    Pr > F
type       1     1.45800000     1.45800000      17.95    0.0005

Source    DF    Type III SS    Mean Square    F Value    Pr > F
type       1     1.45800000     1.45800000      17.95    0.0005

Since p = .0005 we reject $H_0$ and conclude that storage time does make a difference.

t Tests (LSD) for y
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

Alpha                            0.05
Error Degrees of Freedom           18
Error Mean Square            0.081222
Critical Value of t           2.10092
Least Significant Difference   0.2678

Means with the same letter are not significantly different.

t Grouping       Mean     N    type
         A    10.3700    10    F
         B     9.8300    10    S

Fresh product has higher potency on average.
Also: estimated difference in means = 10.37 - 9.83 = .54
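The quantities in this output are tied together by simple arithmetic, which the short Python sketch below reproduces from the printed summary values only (the raw potency data are not fully listed in these notes). The variable names are assumptions of this sketch.

```python
from math import sqrt

# Summary values copied from the GLM / LSD output above
mse = 0.081222          # Error Mean Square (18 df)
n_per_group = 10        # observations per group
t_crit = 2.10092        # critical value of t (alpha = 0.05, 18 df)
mean_fresh, mean_stored = 10.37, 9.83

# Standard error of the difference between two group means
se_diff = sqrt(mse * (1 / n_per_group + 1 / n_per_group))

# Least significant difference: smallest difference declared significant
lsd = t_crit * se_diff

# Two-sample (pooled) t statistic; its square equals the ANOVA F with 2 groups
diff = mean_fresh - mean_stored
t_stat = diff / se_diff

print(f"SE of difference = {se_diff:.5f}")
print(f"LSD              = {lsd:.4f}")
print(f"t                = {t_stat:.2f}")
print(f"t squared        = {t_stat ** 2:.2f}")
```

The observed difference of .54 exceeds the LSD, and the squared t statistic agrees with the F value in the ANOVA table, which is why the t-test and the 2-level ANOVA lead to the same conclusion.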

Regression analysis requires the independent variables to be quantitative.

Let’s consider recoding the group membership variable (i.e. F and S) into the numeric scores:
  0 = fresh
  1 = stored one year
and running a regression analysis with this new “dummy” variable as a “quantitative” independent variable. Let’s call the “dummy” variable x.

Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$

data ott269;
  input x y;
  datalines;
0 10.2
0 10.5
0 10.3
0 10.8
0 9.8
. . .
1 9.6
1 9.8
1 9.9
;
proc reg;
  model y=x;
  title 'Regression Analysis -- Potency Data - page 269';
run;

The REG Procedure
Dependent Variable: y

Number of Observations Read    20
Number of Observations Used    20

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1           1.45800        1.45800      17.95    0.0005
Error              18           1.46200        0.08122
Corrected Total    19           2.92000

Root MSE           0.28500    R-Square    0.4993
Dependent Mean    10.10000    Adj R-Sq    0.4715
Coeff Var          2.82173

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              10.37000           0.09012     115.06      <.0001
x             1              -0.54000           0.12745      -4.24      0.0005

Regression Equation: $\hat{y} = 10.37 - 0.54x$

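Note: the regression model is $y = \beta_0 + \beta_1 x + \varepsilon$. On the basis of this model:
- mean for fresh product ($x = 0$): $\mu_1 = \beta_0$
- mean for stored product ($x = 1$): $\mu_2 = \beta_0 + \beta_1$
- so testing $H_0: \beta_1 = 0$ is the same as testing $H_0: \mu_1 = \mu_2$

The estimates agree with the ANOVA results: the intercept 10.37 is the fresh mean, the slope -0.54 is the stored mean minus the fresh mean (9.83 - 10.37), and the t-test for x has the same p-value (0.0005) as the ANOVA F-test.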

Dummy Variables with More than 2 Groups
Example: Balloon Data - 4 groups

Recall: Balloon Data

 1122.4
 2324.6
 3120.3
 4419.8
 5324.3
 6222.2
 7228.5
 8225.7
 9320.2
10119.6
11228.8
12424.0
13417.1
14419.3
15324.2
16115.8
17218.3
18117.5
19418.7
20322.9
21116.3
22414.0
23416.6
24218.1
25218.9
26416.0
27220.1
28322.5
29316.0
30119.3
31115.9
32320.3

Col. 1-2 - observation number
Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue)
Col. 4-7 - inflation time in seconds

“Research Question”: Is the average time required to inflate the balloons the same for each color?

Analysis using 1-factor ANOVA Model with 4 Groups

GLM Procedure
ANOVA --- Balloon Data
Dependent Variable: time

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       126.1512500     42.0504167       3.85    0.0200
Error              28       305.6475000     10.9159821
Corrected Total    31       431.7987500

R-Square    Coeff Var    Root MSE    time Mean
0.292153     16.31069    3.303934     20.25625

Source    DF      Type I SS    Mean Square    F Value    Pr > F
color      3    126.1512500     42.0504167       3.85    0.0200

LSD Results
Grouping      Mean    N    color
       A    22.575    8    2 (yellow)
       A
       A    21.875    8    3 (orange)
       B    18.388    8    1 (pink)
       B
       B    18.188    8    4 (blue)
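The ANOVA table above can be reproduced directly from the 32 balloon records. The following Python sketch (used here for illustration in place of SAS) computes the group means, the model and error sums of squares, and the F statistic; variable names are assumptions of this sketch.

```python
# (color, inflation time) pairs from the balloon data listing,
# in observation order; color 1=pink, 2=yellow, 3=orange, 4=blue.
balloon = [
    (1, 22.4), (3, 24.6), (1, 20.3), (4, 19.8), (3, 24.3), (2, 22.2),
    (2, 28.5), (2, 25.7), (3, 20.2), (1, 19.6), (2, 28.8), (4, 24.0),
    (4, 17.1), (4, 19.3), (3, 24.2), (1, 15.8), (2, 18.3), (1, 17.5),
    (4, 18.7), (3, 22.9), (1, 16.3), (4, 14.0), (4, 16.6), (2, 18.1),
    (2, 18.9), (4, 16.0), (2, 20.1), (3, 22.5), (3, 16.0), (1, 19.3),
    (1, 15.9), (3, 20.3),
]

# Collect the times for each color
groups = {}
for color, time in balloon:
    groups.setdefault(color, []).append(time)

n = len(balloon)                # 32 observations
k = len(groups)                 # 4 colors
grand_mean = sum(t for _, t in balloon) / n
means = {c: sum(v) / len(v) for c, v in groups.items()}

# Between-group (model) and within-group (error) sums of squares
ss_model = sum(len(v) * (means[c] - grand_mean) ** 2 for c, v in groups.items())
ss_error = sum((t - means[c]) ** 2 for c, t in balloon)

ms_model = ss_model / (k - 1)   # 3 df
ms_error = ss_error / (n - k)   # 28 df
f_stat = ms_model / ms_error

for c in sorted(means):
    print(f"color {c}: mean = {means[c]:.4f}")
print(f"SS model = {ss_model:.5f}, SS error = {ss_error:.5f}, F = {f_stat:.2f}")
```

The p-value (0.0200 in the SAS output) requires the F distribution with 3 and 28 degrees of freedom and is not computed in this sketch.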

Dummy Variables
For 4 groups, 3 dummy variables are needed.

x1, x2, x3
0, 0, 0 → group 1
1, 0, 0 → group 2
0, 1, 0 → group 3
0, 0, 1 → group 4

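Dummy Variables for 4 Groups:

Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

The model says:
- The mean for color 1 (i.e. $x_1 = 0, x_2 = 0, x_3 = 0$) is $\beta_0$; notation: $\mu_1 = \beta_0$
- The mean for color 2 (i.e. $x_1 = 1, x_2 = 0, x_3 = 0$) is $\beta_0 + \beta_1$; notation: $\mu_2 = \beta_0 + \beta_1$
- The mean for color 3 (i.e. $x_1 = 0, x_2 = 1, x_3 = 0$) is $\beta_0 + \beta_2$; notation: $\mu_3 = \beta_0 + \beta_2$
- The mean for color 4 (i.e. $x_1 = 0, x_2 = 0, x_3 = 1$) is $\beta_0 + \beta_3$; notation: $\mu_4 = \beta_0 + \beta_3$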


Balloon Data Set with Dummy Variables:

 11 000 22.4
 23 010 24.6
 31 000 20.3
 44 001 19.8
 53 010 24.3
 62 100 22.2
 72 100 28.5
 82 100 25.7
 93 010 20.2
101 000 19.6
112 100 28.8
124 001 24.0
134 001 17.1
144 001 19.3
153 010 24.2
161 000 15.8
172 100 18.3
181 000 17.5
194 001 18.7
203 010 22.9
211 000 16.3
224 001 14.0
234 001 16.6
242 100 18.1
252 100 18.9
264 001 16.0
272 100 20.1
283 010 22.5
293 010 16.0
301 000 19.3
311 000 15.9
323 010 20.3

Col. 1-2 - observation number
Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue)
Col. 5 - X1
Col. 6 - X2
Col. 7 - X3
Col. 9-12 - inflation time in seconds

ANOVA --- Balloon Data using Dummy Variables

The REG Procedure

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         126.15125       42.05042       3.85    0.0200
Error              28         305.64750       10.91598
Corrected Total    31         431.79875

Root MSE           3.30393    R-Square    0.2922
Dependent Mean    20.25625    Adj R-Sq    0.2163
Coeff Var         16.31069

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              18.38750           1.16812      15.74      <.0001
x1            1               4.18750           1.65197       2.53      0.0171
x2            1               3.48750           1.65197       2.11      0.0438
x3            1              -0.20000           1.65197      -0.12      0.9045

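Interpreting the parameter estimates:

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              18.38750           1.16812      15.74      <.0001
x1            1               4.18750           1.65197       2.53      0.0171
x2            1               3.48750           1.65197       2.11      0.0438
x3            1              -0.20000           1.65197      -0.12      0.9045

- x1 (p = 0.0171): reject $H_0: \beta_1 = 0$, i.e. conclude $\mu_1 \neq \mu_2$ (“pink” ≠ “yellow”)
- x2 (p = 0.0438): reject $H_0: \beta_2 = 0$, i.e. conclude $\mu_1 \neq \mu_3$ (“pink” ≠ “orange”)
- x3 (p = 0.9045): do not reject $H_0: \beta_3 = 0$, i.e. we cannot conclude “pink” and “blue” are different

Recall LSD Results
Grouping      Mean    N    color
       A    22.575    8    2 (yellow)
       A
       A    21.875    8    3 (orange)
       B    18.388    8    1 (pink)
       B
       B    18.188    8    4 (blue)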
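To see where these estimates come from, the Python sketch below (used here for illustration in place of SAS) builds the three dummy variables from the color codes, fits the regression by solving the normal equations, and compares the coefficients with the group means. The helper `solve` and the variable names are assumptions of this sketch; standard errors and p-values are not computed.

```python
# (color, inflation time) pairs from the balloon data listing,
# in observation order; color 1=pink, 2=yellow, 3=orange, 4=blue.
balloon = [
    (1, 22.4), (3, 24.6), (1, 20.3), (4, 19.8), (3, 24.3), (2, 22.2),
    (2, 28.5), (2, 25.7), (3, 20.2), (1, 19.6), (2, 28.8), (4, 24.0),
    (4, 17.1), (4, 19.3), (3, 24.2), (1, 15.8), (2, 18.3), (1, 17.5),
    (4, 18.7), (3, 22.9), (1, 16.3), (4, 14.0), (4, 16.6), (2, 18.1),
    (2, 18.9), (4, 16.0), (2, 20.1), (3, 22.5), (3, 16.0), (1, 19.3),
    (1, 15.9), (3, 20.3),
]


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design matrix: intercept, x1 (yellow), x2 (orange), x3 (blue); pink is the baseline
X = [[1.0, float(c == 2), float(c == 3), float(c == 4)] for c, _ in balloon]
y = [t for _, t in balloon]

k = 4
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
b0, b1, b2, b3 = solve(xtx, xty)  # least-squares estimates

# Group means for comparison
means = {}
for color in (1, 2, 3, 4):
    times = [t for c, t in balloon if c == color]
    means[color] = sum(times) / len(times)

print(f"b0 = {b0:.4f}  (pink mean   = {means[1]:.4f})")
print(f"b1 = {b1:.4f}  (yellow-pink = {means[2] - means[1]:.4f})")
print(f"b2 = {b2:.4f}  (orange-pink = {means[3] - means[1]:.4f})")
print(f"b3 = {b3:.4f}  (blue-pink   = {means[4] - means[1]:.4f})")
```

Each dummy coefficient is the difference between that group's mean and the baseline (pink) mean, so each t-test in the regression output compares one color against pink only. Comparisons such as yellow vs. blue are not tested directly under this coding.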

Dummy Variables
We showed that 1-factor ANOVA can be run using regression analysis with dummy variables.

Question: What’s the real benefit of dummy variables?
Answer: Dummy variables can be mixed in with quantitative independent variables to give a combination of regression and ANOVA analyses.

Survival Data
• study using 108 patients in a surgical unit
• researchers interested in predicting the survival time (in days) of patients undergoing a type of liver operation

Independent Variables
clot = blood clotting score
prog = prognostic index
enzyme = enzyme function test score
liver = liver function test score
age = age in years
gender (0 = male, 1 = female)
alch1, alch2 = indicators of alcohol usage
  None:     alch1 = 0, alch2 = 0
  Moderate: alch1 = 1, alch2 = 0
  Heavy:    alch1 = 0, alch2 = 1

Survival Data

DATA survival;
  INPUT clot prog enzyme liver age gender alch1 alch2 survival;
  lgsurv = LOG(survival);  /* natural log of survival time; response for the second model */
  DATALINES;
6.7 62 81 2.59 50 0 1 0 695
5.1 59 66 1.70 39 0 0 0 403
7.4 57 83 2.16 55 0 0 0 710
6.5 73 41 2.01 48 0 0 0 349
7.8 65 115 4.30 45 0 0 1 2343
5.8 38 72 1.42 65 1 1 0 348
. . .
;

Gender: 0=male, 1=female

Alcohol Use    alch1    alch2
None               0        0
Moderate           1        0
Heavy              0        1

/* All eight candidate predictors are listed so that the selection output */
/* can include gender, alch1 and alch2.                                   */
PROC REG;
  MODEL survival=clot prog enzyme liver age gender alch1 alch2/selection=adjrsq;
  output out=new r=ressurv p=predsurv;
RUN;
PROC REG;
  MODEL lgsurv=clot prog enzyme liver age gender alch1 alch2/selection=adjrsq;
  output out=new r=ressvlg p=predsvlg;
RUN;

Adjusted R-Square Selection Method

Dependent Variable: survival

Number in    Adjusted
Model        R-Square    R-Square    Variables in Model
6            0.7611      0.7745      clot prog enzyme liver alch1 alch2
5            0.7606      0.7718      clot prog enzyme liver alch2
7            0.7592      0.7749      clot prog enzyme liver age alch1 alch2
7            0.7591      0.7748      clot prog enzyme liver gender alch1 alch2
6            0.7587      0.7723      clot prog enzyme liver age alch2
6            0.7587      0.7722      clot prog enzyme liver gender alch2
8            0.7571      0.7753      clot prog enzyme liver age gender alch1 alch2
7            0.7568      0.7727      clot prog enzyme liver age gender alch2
5            0.7416      0.7536      clot prog enzyme alch1 alch2

Dependent Variable: log(survival)

Number in    Adjusted
Model        R-Square    R-Square    Variables in Model
6            0.7649      0.7781      clot prog enzyme liver gender alch2
7            0.7634      0.7789      clot prog enzyme liver gender alch1 alch2
5            0.7628      0.7738      clot prog enzyme liver alch2
7            0.7627      0.7782      clot prog enzyme liver age gender alch2
6            0.7614      0.7747      clot prog enzyme liver alch1 alch2
8            0.7612      0.7790      clot prog enzyme liver age gender alch1 alch2
6            0.7605      0.7740      clot prog enzyme liver age alch2
7            0.7591      0.7749      clot prog enzyme liver age alch1 alch2
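The adjusted R-square column can be checked against the R-square column using the standard formula $R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$, where $n = 108$ patients and $p$ is the number of variables in the model. A short Python sketch follows, applied to four rows of the log(survival) table; the function name `adjusted_r2` is an assumption of this sketch. Because the inputs are the printed (rounded) R-square values, the results can differ from the SAS output in the fourth decimal place.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 108  # patients in the survival study

# (number of variables, R-square, adjusted R-square as printed) for log(survival)
rows = [
    (6, 0.7781, 0.7649),
    (7, 0.7789, 0.7634),
    (5, 0.7738, 0.7628),
    (8, 0.7790, 0.7612),
]

computed = []
for p, r2, reported in rows:
    value = adjusted_r2(r2, n, p)
    computed.append(value)
    print(f"p = {p}: R-square = {r2:.4f}, adjusted = {value:.4f} (reported {reported:.4f})")
```

Note how R-square always increases as variables are added (0.7738, 0.7781, 0.7789, 0.7790), while adjusted R-square peaks at the 6-variable model and then falls. That penalty for extra variables is what makes it usable as a selection criterion.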

6-variable model for log(survival) selected by adjusted $R^2$

Dependent Variable: lgsurv

Number of Observations Read    108
Number of Observations Used    108

Analysis of Variance
Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                6          20.33867        3.38978      59.04    <.0001
Error              101           5.79922        0.05742
Corrected Total    107          26.13789

Root MSE          0.23962    R-Square    0.7781
Dependent Mean    6.36909    Adj R-Sq    0.7649
Coeff Var         3.76224

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1               3.91009           0.17992      21.73      <.0001
clot          1               0.06227           0.02023       3.08      0.0027
prog          1               0.01321           0.00158       8.38      <.0001
enzyme        1               0.01387           0.00141       9.84      <.0001
liver         1               0.06695           0.03547       1.89      0.0620
gender        1               0.06659           0.04766       1.40      0.1654
alch2         1               0.28922           0.05983       4.83      <.0001

5-variable model for log(survival) selected by Backward Elimination

Analysis of Variance
Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                5          20.22658        4.04532      69.80    <.0001
Error              102           5.91132        0.05795
Corrected Total    107          26.13789

Root MSE          0.24074    R-Square    0.7738
Dependent Mean    6.36909    Adj R-Sq    0.7628
Coeff Var         3.77976

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1               3.92845           0.18027      21.79      <.0001
clot          1               0.05942           0.02022       2.94      0.0041
prog          1               0.01342           0.00158       8.50      <.0001
enzyme        1               0.01387           0.00142       9.80      <.0001
liver         1               0.07362           0.03531       2.08      0.0396
alch2         1               0.28799           0.06010       4.79      <.0001

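What is the role of the variable “alch2” in the model?

Alcohol use    (alch1, alch2)    mean survival
None           (0,0)             640.5
Moderate       (1,0)             608.4
Heavy          (0,1)             815.2

alch2 = 1 implies heavy
alch2 = 0 implies none or moderate

Because alch1 is not in the selected model, the model does not distinguish between “none” and “moderate”; alch2 separates heavy users from everyone else.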
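Since the response is log(survival), the alch2 coefficient is easier to read after back-transforming. The Python sketch below assumes the log is the natural log (the mean of lgsurv, about 6.37, is consistent with survival times of several hundred days) and uses the estimate from the 5-variable model; the variable names are assumptions of this sketch.

```python
from math import exp

# alch2 estimate from the 5-variable backward-elimination model for log(survival)
b_alch2 = 0.28799

# On the log scale alch2 = 1 adds b_alch2 to the prediction;
# on the original scale that is a multiplicative factor of exp(b_alch2).
ratio = exp(b_alch2)

print(f"Predicted survival is multiplied by about {ratio:.2f} when alch2 = 1")
```

Under that assumption, and holding clot, prog, enzyme and liver fixed, the fitted model predicts survival times roughly one third longer for the heavy group than for the none/moderate group, which is in the same direction as the raw group means in the table above.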
