Experimental Statistics - week 14 Multiple Regression – miscellaneous topics
Polynomial Regression:
- we looked at this briefly in Lab
- basically a multiple regression where the independent variables are powers of a single independent variable
- use SAS to compute the independent variables x^2, x^3, …, x^p
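A minimal sketch of how the powers might be computed in a DATA step and then passed to PROC REG. The data set name `mydata`, the variable names `x` and `y`, and the choice of a cubic are assumptions for illustration only; they do not come from the lab.

```sas
/* Hypothetical input data set "mydata" with variables x and y */
data poly;
  set mydata;
  x2 = x**2;   /* x squared */
  x3 = x**3;   /* x cubed   */
run;

/* Cubic polynomial regression fitted as an ordinary multiple regression */
proc reg data=poly;
  model y = x x2 x3;
run;
quit;
```

Higher-order terms follow the same pattern (x4 = x**4, and so on).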
Outlier Detection
- there are tests for outliers
- throwing away outliers should technically be done only when there is evidence that the values “do not belong”
Example 6.1, Text pages 268-269
Does a drug retain its potency after 1 year of storage?
2 groups:
  1) fresh product
  2) product stored for 1 year
n = 10 observations from each group (independent samples)

  Fresh    Stored
  10.2      9.8
  10.5      9.6
   .         .
   .         .

Variable measured is potency reading.
Question: How would you compare groups?
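We want to test: $H_0: \alpha_1 = \alpha_2$ (the two group means are equal) vs. $H_a: \alpha_1 \ne \alpha_2$
1-Factor ANOVA Model: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where
  $\mu + \alpha_1$ = mean of fresh product
  $\mu + \alpha_2$ = mean of 1-year-old product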
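/* Read group label (F = fresh, S = stored) and potency reading */
data ott269;
  input type $ y;
  datalines;
F 10.2
F 10.5
F 10.3
F 10.8
F 9.8
. . .
S 9.6
S 9.8
S 9.9
;
/* 1-factor ANOVA with LSD comparison of the two group means */
proc glm;
  class type;
  model y=type;
  means type/lsd;
  title 'ANOVA -- Potency Data - page 269 (t-test)';
run;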
ANOVA -- Potency Data - page 269 (t-test)

The GLM Procedure
Class Level Information
  Class    Levels    Values
  type          2    F S

The GLM Procedure
Dependent Variable: y

                               Sum of
Source             DF         Squares     Mean Square    F Value    Pr > F
Model               1      1.45800000      1.45800000      17.95    0.0005
Error              18      1.46200000      0.08122222
Corrected Total    19      2.92000000

R-Square    Coeff Var    Root MSE    potency Mean
0.499315     2.821734    0.284995        10.10000

Source    DF      Type I SS    Mean Square    F Value    Pr > F
type       1     1.45800000     1.45800000      17.95    0.0005

Source    DF    Type III SS    Mean Square    F Value    Pr > F
type       1     1.45800000     1.45800000      17.95    0.0005
Since p = .0005 we reject $H_0$ and conclude that storage time does make a difference.

t Tests (LSD) for y
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

  Alpha                             0.05
  Error Degrees of Freedom            18
  Error Mean Square             0.081222
  Critical Value of t            2.10092
  Least Significant Difference    0.2678

Means with the same letter are not significantly different.

  t Grouping       Mean     N    type
           A    10.3700    10    F
           B     9.8300    10    S

Fresh product has higher potency on average.
Also: estimated difference in means = 10.37 - 9.83 = .54
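The Least Significant Difference in this output can be checked by hand. The sketch below assumes the usual equal-sample-size LSD formula, which is not written out on the slide, with n = 10 observations per group and the MSE and critical t taken from the output above.

```latex
\[
\mathrm{LSD} = t_{\alpha/2,\,18}\sqrt{\frac{2\,\mathrm{MSE}}{n}}
             = 2.10092\sqrt{\frac{2(0.081222)}{10}}
             \approx 2.10092 \times 0.12745
             \approx 0.2678
\]
```

The observed difference in means, 0.54, is larger than 0.2678, which is why the two groups receive different letters.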
Regression analysis requires the independent variables to be quantitative.
Let’s consider recoding the group membership variable (i.e. F and S) into the numeric scores:
  0 = fresh
  1 = stored one year
and running a regression analysis with this new “dummy” variable as a “quantitative” independent variable.
- let’s call the “dummy” variable x.
Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$
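/* Same potency data, with the group recoded as x = 0 (fresh) or 1 (stored) */
data ott269;
  input x y;
  datalines;
0 10.2
0 10.5
0 10.3
0 10.8
0 9.8
. . .
1 9.6
1 9.8
1 9.9
;
/* Simple linear regression of potency on the dummy variable */
proc reg;
  model y=x;
  title 'Regression Analysis -- Potency Data - page 269';
run;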
The REG Procedure
Dependent Variable: y

Number of Observations Read    20
Number of Observations Used    20

Analysis of Variance
                               Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               1         1.45800     1.45800      17.95    0.0005
Error              18         1.46200     0.08122
Corrected Total    19         2.92000

Root MSE           0.28500    R-Square    0.4993
Dependent Mean    10.10000    Adj R-Sq    0.4715
Coeff Var          2.82173

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept     1     10.37000     0.09012     115.06      <.0001
x             1     -0.54000     0.12745      -4.24      0.0005

Regression Equation: $\hat{y} = 10.37 - 0.54x$
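Note: the regression model is $y = \beta_0 + \beta_1 x + \varepsilon$, with x = 0 (fresh) or x = 1 (stored).
On the basis of this model:
  mean of fresh product (x = 0): $\beta_0$
  mean of 1-year-old product (x = 1): $\beta_0 + \beta_1$
so testing $H_0: \beta_1 = 0$ is the same as testing that the two group means are equal.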
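The numbers in the two outputs line up as this model suggests. The worked check below uses only values printed in the GLM and REG listings; the notation $\bar{y}_F$ and $\bar{y}_S$ for the two sample means is introduced here for illustration.

```latex
\begin{align*}
\hat\beta_0 &= 10.37 = \bar{y}_F \\
\hat\beta_0 + \hat\beta_1 &= 10.37 - 0.54 = 9.83 = \bar{y}_S \\
t^2 &= \left(\frac{-0.54}{0.12745}\right)^2 \approx 17.95 = F
\end{align*}
```

So the t-test on the dummy variable and the ANOVA F-test give the same p-value (0.0005).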
Dummy Variables with More than 2 Groups Example: Balloon Data - 4 groups
Recall: Balloon Data

   1122.4
   2324.6
   3120.3
   4419.8
   5324.3
   6222.2
   7228.5
   8225.7
   9320.2
  10119.6
  11228.8
  12424.0
  13417.1
  14419.3
  15324.2
  16115.8
  17218.3
  18117.5
  19418.7
  20322.9
  21116.3
  22414.0
  23416.6
  24218.1
  25218.9
  26416.0
  27220.1
  28322.5
  29316.0
  30119.3
  31115.9
  32320.3

Col. 1-2 - observation number
Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue)
Col. 4-7 - inflation time in seconds

“Research Question”: Is the average time required to inflate the balloons the same for each color?
Analysis using 1-factor ANOVA Model with 4 Groups

GLM Procedure
ANOVA --- Balloon Data
Dependent Variable: time

                               Sum of
Source             DF         Squares     Mean Square    F Value    Pr > F
Model               3     126.1512500      42.0504167       3.85    0.0200
Error              28     305.6475000      10.9159821
Corrected Total    31     431.7987500

R-Square    Coeff Var    Root MSE    time Mean
0.292153     16.31069    3.303934     20.25625

Source    DF      Type I SS    Mean Square    F Value    Pr > F
color      3    126.1512500     42.0504167       3.85    0.0200

LSD Results
  Grouping      Mean    N    color
         A    22.575    8    2 (yellow)
         A
         A    21.875    8    3 (orange)

         B    18.388    8    1 (pink)
         B
         B    18.188    8    4 (blue)
Dummy Variables For 4 groups -- 3 dummy variables needed.
Balloon Data Set with Dummy Variables:

   11 000 22.4
   23 010 24.6
   31 000 20.3
   44 001 19.8
   53 010 24.3
   62 100 22.2
   72 100 28.5
   82 100 25.7
   93 010 20.2
  101 000 19.6
  112 100 28.8
  124 001 24.0
  134 001 17.1
  144 001 19.3
  153 010 24.2
  161 000 15.8
  172 100 18.3
  181 000 17.5
  194 001 18.7
  203 010 22.9
  211 000 16.3
  224 001 14.0
  234 001 16.6
  242 100 18.1
  252 100 18.9
  264 001 16.0
  272 100 20.1
  283 010 22.5
  293 010 16.0
  301 000 19.3
  311 000 15.9
  323 010 20.3

Col. 1-2 - observation number
Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue)
Col. 5 - X1
Col. 6 - X2
Col. 7 - X3
Col. 9-12 - inflation time in seconds
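Rather than typing the dummy variables into the data file, they could be computed in a DATA step. This is a sketch only: it assumes a hypothetical data set `balloon` that already holds the variables `color` (coded 1 to 4) and `time`, and it relies on SAS evaluating a comparison such as `(color = 2)` to 1 when true and 0 when false.

```sas
/* Hypothetical input data set "balloon" with variables color (1-4) and time */
data balloon_dummy;
  set balloon;
  x1 = (color = 2);   /* 1 for yellow, 0 otherwise */
  x2 = (color = 3);   /* 1 for orange, 0 otherwise */
  x3 = (color = 4);   /* 1 for blue,   0 otherwise */
run;

/* Regression on the three dummy variables; pink (color 1) is the baseline */
proc reg data=balloon_dummy;
  model time = x1 x2 x3;
run;
quit;
```

This coding matches the 000 / 100 / 010 / 001 patterns in the listing above.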
ANOVA --- Balloon Data using Dummy Variables

The REG Procedure
Analysis of Variance
                               Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               3       126.15125    42.05042       3.85    0.0200
Error              28       305.64750    10.91598
Corrected Total    31       431.79875

Root MSE           3.30393    R-Square    0.2922
Dependent Mean    20.25625    Adj R-Sq    0.2163
Coeff Var         16.31069

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept     1     18.38750     1.16812      15.74      <.0001
x1            1      4.18750     1.65197       2.53      0.0171
x2            1      3.48750     1.65197       2.11      0.0438
x3            1     -0.20000     1.65197      -0.12      0.9045
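The parameter estimates reproduce the four group means from the ANOVA output. The check below uses only numbers printed in the two listings; the subscripted $\bar{y}$ notation is introduced here for illustration, and the standard-error formulas assume n = 8 balloons per color.

```latex
\begin{align*}
\bar{y}_{\text{pink}}   &= \hat\beta_0 = 18.3875 \\
\bar{y}_{\text{yellow}} &= \hat\beta_0 + \hat\beta_1 = 18.3875 + 4.1875 = 22.575 \\
\bar{y}_{\text{orange}} &= \hat\beta_0 + \hat\beta_2 = 18.3875 + 3.4875 = 21.875 \\
\bar{y}_{\text{blue}}   &= \hat\beta_0 + \hat\beta_3 = 18.3875 - 0.2000 = 18.1875 \\
\mathrm{SE}(\hat\beta_0) &= \sqrt{\mathrm{MSE}/8} = \sqrt{10.91598/8} \approx 1.168 \\
\mathrm{SE}(\hat\beta_j) &= \sqrt{2\,\mathrm{MSE}/8} \approx 1.652, \quad j = 1, 2, 3
\end{align*}
```

Each slope is therefore the difference between that color's mean and the pink mean.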
Recall: Mixed Model
Multiple Comparisons for Fixed Effect (Inspection Level) -- use MSAB in place of MSE:
  $\mathrm{LSD} = t_{\alpha/2,\,v}\sqrt{2\,\mathrm{MS}_{AB}/N}$
where
  ▪ N denotes the # of observations involved in the computation of a marginal mean
  ▪ v denotes the df associated with the AB interaction
General Rule
When comparing means using a multiple comparison procedure (e.g. LSD, Bonferroni, etc.), use the MS that appears in the denominator of the associated F-test.
Note: by default, SAS gives multiple comparison results using MSE.
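One way to apply this rule inside SAS, rather than correcting the output by hand, is the E= option on the MEANS statement of PROC GLM, which names the effect whose mean square should serve as the error term. The sketch below reuses the variable names from the ewe example in this deck and is an illustration, not the analysis shown in class.

```sas
proc glm;
  class group ewe week;
  model milk = group ewe(group) week group*week;
  random ewe(group) / test;
  /* Use MS(ewe(group)) instead of MSE when comparing group means */
  means group / lsd e=ewe(group);
run;
quit;
```

The week means would still be compared with MSE, since MSE is the denominator of the F-test for week.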
Ewe Data – problem 1

PROC GLM;
  class group ewe week;
  TITLE 'Ewe Study';
  model milk=group ewe(group) week group*week;
  random ewe(group)/test;
  means group week/lsd;
  output out=newe r=resmilk;
RUN;

The GLM Procedure
Dependent Variable: milk

                               Sum of
Source             DF         Squares    Mean Square    F Value    Pr > F
Model              23     388848.1481     16906.4412       8.54    <.0001
Error              30      59411.1111      1980.3704
Corrected Total    53     448259.2593

R-Square    Coeff Var    Root MSE    milk Mean
0.867463     21.57157    44.50135     206.2963

Source        DF      Type I SS    Mean Square    F Value    Pr > F
group          2    256803.7037    128401.8519      64.84    <.0001
ewe(group)     6    114788.8889     19131.4815       9.66    <.0001
week           5      2970.3704       594.0741       0.30    0.9090
group*week    10     14285.1852      1428.5185       0.72    0.6983
Ewe Study
The GLM Procedure

Source        Type III Expected Mean Square
group         Var(Error) + 6 Var(ewe(group)) + Q(group,group*week)
ewe(group)    Var(Error) + 6 Var(ewe(group))
week          Var(Error) + Q(week,group*week)
group*week    Var(Error) + Q(group*week)

Ewe Study
The GLM Procedure
Tests of Hypotheses for Mixed Model Analysis of Variance
Dependent Variable: milk

Source     DF    Type III SS    Mean Square    F Value    Pr > F
* group     2         256804         128402       6.71    0.0295
Error       6         114789          19131
Error: MS(ewe(group))
* This test assumes one or more other fixed effects are zero.

Source              DF    Type III SS    Mean Square    F Value    Pr > F
ewe(group)           6         114789          19131       9.66    <.0001
* week               5    2970.370370     594.074074       0.30    0.9090
group*week          10          14285    1428.518519       0.72    0.6983
Error: MS(Error)    30          59411    1980.370370
* This test assumes one or more other fixed effects are zero.
t Tests (LSD) for Group Differences
NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

  Alpha                             0.05
  Error Degrees of Freedom            30
  Error Mean Square              1980.37
  Critical Value of t            2.04227
  Least Significant Difference    30.295

Means with the same letter are not significantly different.

  t Grouping      Mean     N    group
           A    291.67    18    1
           B    204.44    18    2
           C    122.78    18    3
Corrected t Tests (LSD) for Group Differences:
  “Error” Degrees of Freedom =
  “Error” Mean Square =
  Critical Value of t =
  Least Significant Difference =
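One way the corrected values might be worked out, following the General Rule: the F-test for group uses MS(ewe(group)) with 6 df in its denominator, so that mean square replaces MSE. The critical value $t_{.025,6} \approx 2.447$ is taken from a standard t table and is an assumption here, not a number from the SAS output; N = 18 is the number of observations behind each group mean.

```latex
\begin{align*}
\text{``Error'' df} &= 6 \\
\text{``Error'' Mean Square} &= \mathrm{MS}(\text{ewe(group)}) = 19131.48 \\
\text{Critical value of } t &= t_{.025,\,6} \approx 2.447 \\
\mathrm{LSD} &\approx 2.447\sqrt{\frac{2(19131.48)}{18}}
              \approx 2.447 \times 46.11 \approx 112.8
\end{align*}
```

Under this LSD, groups 1 and 2 (difference 87.23) and groups 2 and 3 (difference 81.66) would no longer be declared different, while groups 1 and 3 (difference 168.89) still would.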
Ewe Data: interaction plot – Milk Production by Week
[Figure: interaction plot]
Why non-normal?
[Figure: residuals vs. milk]
Kidney Data
[Figures: Original Model; Log Model; Outliers Removed]

Kidney Data
  Original Model: R² = .855
  Log Model: R² = .866

Kidney Data
  Original Model: R² = .855
  Outliers Removed: R² = .871

Kidney Data
  Log Model: R² = .866
  Log Model – Outliers Removed: R² = .901
Survival Data
[Figures: Original Variables; Log Survival vs Other Original Variables]

Survival Data
[Figures: Original Variables; Survival vs Square of Independent Variables]
Dependent Variable: Survival

Number in    Adjusted
  Model      R-Square    R-Square    Variables in Model
    4         0.6999      0.7112     clot prog enzyme liver
    5         0.6970      0.7112     clot prog enzyme liver age
    3         0.6908      0.6995     clot prog enzyme
    4         0.6878      0.6995     clot prog enzyme age
    3         0.6520      0.6618     prog enzyme liver
    4         0.6487      0.6618     prog enzyme liver age
    2         0.5750      0.5829     prog enzyme
    3         0.5709      0.5829     prog enzyme age

Dependent Variable: Log(Survival)

Number in    Adjusted
  Model      R-Square    R-Square    Variables in Model
    4         0.7122      0.7229     clot prog enzyme liver
    3         0.7115      0.7196     clot prog enzyme
    5         0.7104      0.7239     clot prog enzyme liver age
    4         0.7098      0.7207     clot prog enzyme age
    3         0.6781      0.6871     prog enzyme liver
    4         0.6758      0.6879     prog enzyme liver age
    2         0.6412      0.6479     prog enzyme
    3         0.6388      0.6489     prog enzyme age
    4         0.5357      0.5531     clot enzyme liver age
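Listings ranked by adjusted R-square like these are the kind of output PROC REG produces with the SELECTION=ADJRSQ option. The sketch below shows how such a run might look; the data set name `surv` and the variable name `logsurv` for the log-transformed response are hypothetical, while the predictor names are taken from the listing.

```sas
/* Hypothetical data set "surv"; logsurv is an assumed name for log(survival) */
proc reg data=surv;
  /* All-subsets models ranked by adjusted R-square, original response */
  model survival = clot prog enzyme liver age / selection=adjrsq;
  /* Same search with the log-transformed response */
  model logsurv = clot prog enzyme liver age / selection=adjrsq;
run;
quit;
```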
Grades
“Conditional” – under assumption of good performance on next Thursday’s lab

From Syllabus
GRADE COMPUTATION:
  Exam Grades (75%)
  Daily Assignments (25%)
Final Exam -- optional (scheduled for 8:00 AM – 11:00 AM Friday, May 6)
  -- “in class” exam
  -- will be averaged in equally with the other 2 exams to comprise 75% of grade
  -- can raise or lower final grade
Dummy Variables
We showed that 1-factor ANOVA can be run using regression analysis with dummy variables.
Question: What’s the benefit?
Answer: Dummy variables can be mixed in with regular quantitative variables to give a combination of regression and ANOVA analyses.
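As an illustration of such a combined model, consider three dummy variables for a 4-group factor together with one quantitative predictor. The covariate $z$ and its coefficient $\beta_4$ are hypothetical and are not part of the balloon analysis in these slides.

```latex
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 z + \varepsilon
\]
```

In this sketch each group gets its own intercept ($\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, $\beta_0 + \beta_3$), while all groups share the common slope $\beta_4$ on $z$.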
Dummy Variables for 4 Groups:
For 4 groups -- 3 dummy variables needed.
  0, 0, 0 → group 1
  1, 0, 0 → group 2
  0, 1, 0 → group 3
  0, 0, 1 → group 4
Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept     1     18.38750     1.16812      15.74      <.0001
x1            1      4.18750     1.65197       2.53      0.0171    i.e. conclude “pink” ≠ “yellow”
x2            1      3.48750     1.65197       2.11      0.0438    i.e. conclude “pink” ≠ “orange”
x3            1     -0.20000     1.65197      -0.12      0.9045    i.e. we cannot conclude “pink” and “blue” are different