Stats 330: Lecture 18 Anova Models
Anova Models
• These are linear (regression) models where all the explanatory variables are categorical.
• If there is just one categorical explanatory variable, then we have the “one-way anova” model discussed in STATS 201/8.
• If there are two categorical explanatory variables, then we have the “two-way anova” model, also discussed in STATS 201/8.
• However, we shall regard these as just another type of regression model.
Example: One-way model
• In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures.
• The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance applied.
• On website – carcinogenic substances data.
Data
   ratio      treatment
1   0.08        control
2   0.08 chloralhydrate
3   0.10       diazapan
4   0.10   hydroquinone
5   0.07    econidazole
6   0.17     colchicine
7   0.08        control
8   0.10 chloralhydrate
9   0.08       diazapan
10  0.10   hydroquinone
11  0.08    econidazole
12  0.19     colchicine
13  0.09        control
14  0.08 chloralhydrate
15  0.12       diazapan
. . . more data (300 observations in all)
boxplot(ratio ~ treatment, data = carcin.df, ylab = "ratio", main = "Ratios for different substances")
Distributions skewed?
The model
The response for the j th culture receiving substance i is

  ratioij = mi + eij

where the mean mi depends on the substance i. We make the usual assumptions about the errors eij (normal, equal variance, independent).
Dummy variable form
• Define:
CH = 1 if treatment = chloral hydrate, 0 else
D = 1 if treatment = diazapan, 0 else
H = 1 if treatment = hydroquinone, 0 else
E = 1 if treatment = econidazole, 0 else
C = 1 if treatment = colchicine, 0 else
• Then the mean response is

  mean ratio = b0 + b1 CH + b2 D + b3 H + b4 E + b5 C

where b0 is the control (baseline) mean and b1, …, b5 are the offsets for the other substances.
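These dummy variables are exactly what R constructs internally from the factor. A minimal sketch (a hypothetical three-observation factor, not the real data, but using the same level names) shows this via model.matrix:

```r
# Sketch: R builds the dummy variables for us via model.matrix().
# Hypothetical three observations, same level names as the carcinogen data:
treatment <- factor(c("control", "chloralhydrate", "diazapan"),
                    levels = c("control", "chloralhydrate", "colchicine",
                               "diazapan", "econidazole", "hydroquinone"))
X <- model.matrix(~ treatment)
# The first column is the intercept (the control baseline); the remaining
# columns are 0/1 dummies, one per non-baseline level:
X[, "treatmentchloralhydrate"]  # the "CH" dummy: 0 1 0
X[, "treatmentdiazapan"]        # the "D" dummy:  0 0 1
```

This is why lm with a factor on the right-hand side reproduces the dummy-variable regression above.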
Estimation
• To estimate the offsets and the baseline (control) mean, we use lm as usual. We have to rearrange the levels to make the control the baseline:
carcin.df = read.table(file.choose(), header = T)
carcin.df$treatment = factor(carcin.df$treatment,
                             levels = c("control", "chloralhydrate", "colchicine",
                                        "diazapan", "econidazole", "hydroquinone"))
summary(lm(ratio ~ treatment, data = carcin.df))
lm output
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.23660    0.02037  11.616  < 2e-16 ***
treatmentchloralhydrate  0.03240    0.02880   1.125  0.26158
treatmentcolchicine      0.21160    0.02880   7.346 2.02e-12 ***
treatmentdiazapan        0.04420    0.02880   1.534  0.12599
treatmenteconidazole     0.02820    0.02880   0.979  0.32838
treatmenthydroquinone    0.07540    0.02880   2.618  0.00931 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.144 on 294 degrees of freedom
Multiple R-squared: 0.1903, Adjusted R-squared: 0.1766
F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12
Diagnostic plots (not shown): variances about equal across groups, but the residuals look non-normal – this is addressed next by a transformation.
Analyzing the ¼ power
The skewness suggests transforming the response; we use ratio^(1/4).
WB (Weisberg–Bingham) normality test: previously p = 0.00; for the transformed data p = 0.06.
Normality is better after the transformation.
Analyzing ¼ power: summary
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.68528    0.01244  55.105  < 2e-16 ***
treatmentchloralhydrate  0.01744    0.01759   0.992   0.3222
treatmentcolchicine      0.12120    0.01759   6.891 3.37e-11 ***
treatmentdiazapan        0.02993    0.01759   1.702   0.0898 .
treatmenteconidazole     0.01470    0.01759   0.836   0.4039
treatmenthydroquinone    0.04104    0.01759   2.333   0.0203 *

Residual standard error: 0.08793 on 294 degrees of freedom
Multiple R-squared: 0.1714, Adjusted R-squared: 0.1573
F-statistic: 12.16 on 5 and 294 DF, p-value: 1.008e-10
Testing equality of means
The standard F-test for equality of means is computed using the anova function. Here we compare the equal-means model (the null model) with the different-means model – there is only one term in the model.
> quarter.lm <- lm(ratio^(1/4) ~ treatment, data = carcin.df)
> anova(quarter.lm)
Analysis of Variance Table
Response: ratio^(1/4)
           Df  Sum Sq Mean Sq F value    Pr(>F)
treatment   5 0.47017 0.09403  12.161 1.008e-10 ***
Residuals 294 2.27337 0.00773

Highly significant differences
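For a one-term model, the one-argument anova call is the same F-test as an explicit comparison of the null and different-means models. A sketch on made-up data (standing in for the carcinogen data, which isn't reproduced here) confirms this:

```r
# Sketch: anova(full) for a one-term model gives the same F as anova(null, full)
set.seed(330)  # made-up data, for illustration only
d <- data.frame(treatment = factor(rep(letters[1:6], each = 10)),
                ratio = rgamma(60, shape = 4, rate = 40))
full <- lm(ratio^(1/4) ~ treatment, data = d)  # different-means model
null <- lm(ratio^(1/4) ~ 1, data = d)          # equal-means (null) model
F1 <- anova(full)["treatment", "F value"]      # one-argument form
F2 <- anova(null, full)[2, "F"]                # explicit model comparison
all.equal(F1, F2)  # TRUE
```

Both forms use the residual mean square of the bigger model in the denominator, which is why they agree.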
Oneway plot (s20x package)
> onewayPlot(quarter.lm)
Tukey intervals: all cover the true values simultaneously with 95% probability.
Two factors: example
Experiment to study weight gain in rats
• The response is weight gain over a fixed time period
• This is modelled as a function of diet (variable source: Beef, Cereal, Pork) and amount of feed (variable level: High, Low)
Data
> diets.df
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5   107 Cereal   Low
6    49   Pork   Low
7   102   Beef  High
8    74 Cereal  High
9    79   Pork  High
10   76   Beef   Low
. . .
60 observations in all
Two factors: the model
• If the (continuous) response depends on two categorical explanatory variables, then we assume the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i th level of A and the j th level of B is mij
• The other standard assumptions (equal variance, normality, independence) apply
Decomposition of the means
We usually want to split each “cell mean” into 4 terms:
• A term reflecting the overall baseline level of the response
• A term reflecting the effect of factor A (row effect)
• A term reflecting the effect of factor B (column effect)
• A term reflecting how A and B interact
Mathematically…
• Overall baseline: m11 (the mean when both factors are at their baseline levels)
• Effect of the i th level of factor A (row effect): mi1 - m11 (the i th level of A, at the baseline of B, expressed as a deviation from the overall baseline)
• Effect of the j th level of factor B (column effect): m1j - m11 (the j th level of B, at the baseline of A, expressed as a deviation from the overall baseline)
• Interaction: what’s left over (see next slide)
Interactions
• Each cell (except those in the first row and first column) has an interaction:
interaction = cell mean - baseline - row effect - column effect
• Equivalently:
cell mean = baseline + row effect + column effect + interaction
Notation
• Overall baseline: m = m11
• Main effects of A: ai = mi1 - m11
• Main effects of B: bj = m1j - m11
• AB interactions: abij = mij - mi1 - m1j + m11
• Thus, mij = m + ai + bj + abij
Importance of interactions
• If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B
• In mathematical terms, mij – mi’j doesn’t depend on j
• Equivalently, the effect of changing levels of B is the same for all levels of A
• If the interactions are zero, the relationship between the factors and the response is simple
Why are comparisons simple when interactions are zero?
If all abij = 0, then mij = m + ai + bj, so the effect of changing B from level j to level j' is

  mij - mij' = (m + ai + bj) - (m + ai + bj') = bj - bj'

Doesn’t depend on i!
Splitting up the mean: rats
Factors are level (amount of food) and source (diet). The cell means (implied by the lm fit) are:

             Beef  Cereal   Pork
High (base) 100.0    85.9   99.5
Low          79.2    83.9   78.7

Baseline (Beef, High): 100
Row effect for Low: 79.2 – 100 = -20.8
Col effect for Cereal: 85.9 - 100 = -14.1
Col effect for Pork: 99.5 - 100 = -0.5
Low-Cereal interaction: 83.9 - 79.2 - 85.9 + 100 = 18.8
Low-Pork interaction: 78.7 - 79.2 - 99.5 + 100 = 0
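The arithmetic above can be checked in R. This sketch simply re-uses the six cell means quoted on this slide and applies the decomposition formulas:

```r
# The 2 x 3 table of cell means (rows: level High/Low; cols: source Beef/Cereal/Pork)
mu <- matrix(c(100.0, 85.9, 99.5,
                79.2, 83.9, 78.7),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("High", "Low"), c("Beef", "Cereal", "Pork")))
m  <- mu["High", "Beef"]         # baseline m11
a  <- mu[, "Beef"] - m           # row effects (a_High = 0)
b  <- mu["High", ] - m           # column effects (b_Beef = 0)
ab <- mu - outer(a, b, "+") - m  # interactions ab_ij = m_ij - m_i1 - m_1j + m11
ab["Low", "Cereal"]              # 18.8
ab["Low", "Pork"]                # 0
```

The reconstruction m + ai + bj + abij recovers every cell mean exactly, by construction.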
Exploratory plots
> plot.design(diets.df)
More gain on the High amount of feed and the Beef diet
Fit model
> diets.lm <- lm(gain ~ source + level + source:level, data = diets.df)
> summary(diets.lm)
Coefficients:
                        Estimate Std. Error  t value Pr(>|t|)
(Intercept)            1.000e+02  4.632e+00   21.589  < 2e-16
sourceCereal          -1.410e+01  6.551e+00   -2.152  0.03585
sourcePork            -5.000e-01  6.551e+00   -0.076  0.93944
levelLow              -2.080e+01  6.551e+00   -3.175  0.00247
sourceCereal:levelLow  1.880e+01  9.264e+00    2.029  0.04736
sourcePork:levelLow   -3.052e-14  9.264e+00 -3.29e-15  1.00000

Residual standard error: 14.65 on 54 degrees of freedom
Multiple R-squared: 0.2848, Adjusted R-squared: 0.2185
F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299
Fitting as a regression model
Note that this is equivalent to fitting a regression with dummy variables R2, C2, C3:
R2 = 1 if the obs is in row 2 (level Low), zero otherwise
C2 = 1 if the obs is in column 2 (source Cereal), zero otherwise
C3 = 1 if the obs is in column 3 (source Pork), zero otherwise
The regression is Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3)
Regression summary
> R2 = ifelse(diets.df$level == "Low", 1, 0)
> C2 = ifelse(diets.df$source == "Cereal", 1, 0)
> C3 = ifelse(diets.df$source == "Pork", 1, 0)
> reg.lm = lm(gain ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3), data = diets.df)
> summary(reg.lm)
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 1.000e+02  4.632e+00   21.589  < 2e-16 ***
C2         -1.410e+01  6.551e+00   -2.152  0.03585 *
C3         -5.000e-01  6.551e+00   -0.076  0.93944
R2         -2.080e+01  6.551e+00   -3.175  0.00247 **
I(R2 * C2)  1.880e+01  9.264e+00    2.029  0.04736 *
I(R2 * C3) -2.709e-14  9.264e+00 -2.92e-15  1.00000
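The two parameterisations are the same model, so they give identical fits. A sketch on made-up data (standing in for the rat-diet data, which isn't reproduced here) demonstrates the equivalence:

```r
# Sketch: gain ~ source*level and the explicit dummy regression agree
set.seed(18)  # made-up data, for illustration only
d <- data.frame(source = factor(rep(c("Beef", "Cereal", "Pork"), times = 20)),
                level  = factor(rep(c("High", "Low"), each = 30)),
                gain   = rnorm(60, mean = 90, sd = 15))
fac.lm <- lm(gain ~ source * level, data = d)  # factor (anova) form
R2 <- ifelse(d$level == "Low", 1, 0)           # hand-made dummies
C2 <- ifelse(d$source == "Cereal", 1, 0)
C3 <- ifelse(d$source == "Pork", 1, 0)
reg.lm <- lm(gain ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3), data = d)
all.equal(unname(fitted(fac.lm)), unname(fitted(reg.lm)))  # TRUE
```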
Testing for zero interactions
> anova(diets.lm)
Analysis of Variance Table
Response: gain
             Df  Sum Sq Mean Sq F value    Pr(>F)
source        2   266.5   133.3  0.6211 0.5411319
level         1  3168.3  3168.3 14.7666 0.0003224 ***
source:level  2  1178.1   589.1  2.7455 0.0731879 .
Residuals    54 11586.0   214.6

Some evidence of interaction
Interaction plot
> with(diets.df, interaction.plot(source, level, gain))
Non-parallel lines indicate interaction
Do we need Source in the model?
> model1 <- lm(gain ~ source*level, data = diets.df)  # note the shorthand
> model2 <- lm(gain ~ level, data = diets.df)
> anova(model2, model1)
Analysis of Variance Table
Model 1: gain ~ level
Model 2: gain ~ source * level
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     58 13030.7
2     54 11586.0  4    1444.7 1.6833 0.1673

Not significant! No significant effect of source.
Notation: review
For two factors A and B
• Baseline: m = m11
• A main effects: ai = mi1 - m11
• B main effects: bj = m1j - m11
• AB interactions: abij = mij - mi1 - m1j + m11
• Then mij = m + ai + bj + abij
Zero interaction model
• If we have only one observation per factor-level combination, we can’t estimate both the interactions and the error variance.
• We then have to assume the interactions are zero and fit an “additive model”: gain ~ level + source
• Zero interactions can still be tested (in a reduced form) using the “Tukey one-degree-of-freedom test”.
Possible models for two factors
For two factors A and B the possible models are:
• Y ~ 1 (fit a single mean only)
• Y ~ A (cell means depend on A alone)
• Y ~ B (cell means depend on B alone)
• Y ~ A + B (cell means have no interaction)
• Y ~ A*B (general model, cell means have no restrictions)
In terms of “effects”
The general model is Y ~ A + B + A:B (equivalently Y ~ A*B)
Its mathematical form is E(Yij) = m + ai + bj + abij
• Y ~ 1 implies ai = 0, bj = 0, abij = 0
• Y ~ A implies bj = 0, abij = 0
• Y ~ B implies ai = 0, abij = 0
• Y ~ A + B implies abij = 0
Interpreting Anova
All F-tests essentially compare a model to a sub-model, using an estimate of s2 in the denominator:

  F = [ (RSS(sub-model) - RSS(model)) / (difference in df) ] / s2

The anova function can do this explicitly, as in anova(model1, model2), with the estimate of s2 coming from the bigger model. When we use just one argument, as in anova(model1), the models being compared are selected implicitly.
Interpreting Anova (cont)
• For example, consider a model with 2 factors A and B:
> anova(lm(y ~ A + B + A:B))
Analysis of Variance Table
Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
A          1 12.774  12.774  9.9147 0.003978 **
B          2  4.031   2.015  1.5642 0.227629
A:B        2  6.898   3.449  2.6768 0.086985 .
Residuals 27 34.788   1.288

The residual Mean Sq (1.288) is the full-model estimate of s2.
First line
The first line of the table compares the model y ~ A with a null model (all means the same), using an estimate of s2 = 1.288 from the full model y ~ A + B + A:B.
> model1 <- lm(y ~ A)
> model0 <- lm(y ~ 1)
> anova(model0, model1)
Analysis of Variance Table
Model 1: y ~ 1
Model 2: y ~ A
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     32 58.491
2     31 45.716  1    12.774 8.6623 0.006105 **

The Sum of Sq (12.774) is the numerator of the F-test in line 1 of the full table; the F values differ because anova(model0, model1) estimates s2 from y ~ A rather than from the full model.
Second line
The second line of the table compares the “no interaction” model y ~ A + B with the model y ~ A, using an estimate of s2 from the full model y ~ A + B + A:B.
> model2 <- lm(y ~ A + B)
> model1 <- lm(y ~ A)
> anova(model1, model2)
Analysis of Variance Table
Model 1: y ~ A
Model 2: y ~ A + B
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     31 45.716
2     29 41.685  2     4.031 1.4021 0.2623

Again the Sum of Sq (4.031) matches the numerator of the F-test in line 2; the F values differ because here s2 is estimated from y ~ A + B.
Third line
The third line of the table compares the full model y ~ A + B + A:B with the “no interaction” model y ~ A + B, using an estimate of s2 from the full model.
> model2 <- lm(y ~ A + B)
> model3 <- lm(y ~ A + B + A:B)
> anova(model2, model3)
Analysis of Variance Table
Model 1: y ~ A + B
Model 2: y ~ A + B + A:B
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     29 41.685
2     27 34.788  2     6.898 2.6768 0.08699 .

Here the F value (2.6768) matches line 3 of the full table exactly, because the bigger model in this comparison is the full model.
To summarise:
• Terms are added line by line
• The F-test compares the current model with the previous model
• At each stage, the estimate of s2 is obtained from the full model
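This recipe can be verified directly. The sketch below (on made-up balanced two-factor data, not the diets data) rebuilds line 2 of a sequential anova table by hand, taking s2 from the full model:

```r
# Sketch: rebuild a sequential anova F-statistic by hand
set.seed(330)  # made-up balanced two-factor data, for illustration only
d <- data.frame(A = factor(rep(1:2, each = 18)),
                B = factor(rep(1:3, times = 12)),
                y = rnorm(36))
full <- lm(y ~ A + B + A:B, data = d)
s2   <- summary(full)$sigma^2   # estimate of sigma^2 from the full model
tab  <- anova(full)             # the sequential table
# Line 2 ("B") compares y ~ A + B with y ~ A:
m1 <- lm(y ~ A, data = d)
m2 <- lm(y ~ A + B, data = d)
F.byhand <- ((deviance(m1) - deviance(m2)) / 2) / s2  # (drop in RSS / df) / s2
all.equal(F.byhand, tab["B", "F value"])  # TRUE
```

Note the numerator is the sequential drop in RSS, while the denominator always comes from the full model.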
More than two factors: example
• An experiment was conducted to compare different diets for feeding chickens. The diets depended on 3 variables:
Source of protein (variable protein): either “groundnut” or “soybean”
Level of protein (variable protlevel): either 0, 1 or 2
Level of fish solubles (variable fish): either High or Low
• The response variable was weight gain (variable chickweight)
Data
   chickweight   protein protlevel fish
1         6559 groundnut         0  Low
2         7075 groundnut         0 High
3         6564 groundnut         1  Low
4         7528 groundnut         1 High
5         6738 groundnut         2  Low
6         7333 groundnut         2 High
7         7094   soybean         0  Low
8         8005   soybean         0 High
9         6943   soybean         1  Low
10        7359   soybean         1 High
11        6748   soybean         2  Low
12        6764   soybean         2 High
. . .
24 observations in all
Data characteristics
• There are 3 factors:
protein with 2 levels (groundnut, soybean)
protlevel with 3 levels (0, 1, 2)
fish with 2 levels (High, Low)
• There are 2 x 3 x 2 = 12 factor-level combinations, so 12 means
• Each combination is observed twice, so 24 observations in all
Interactions
Let mijk be the population mean of all observations taken at level i of protein, level j of protlevel and level k of fish.
We can split this mean up into 8 terms:
• An overall baseline: m = m111
• 3 “main effects”, e.g. ai = mi11 - m111
• 3 “two-way interactions”, e.g. (ab)ij = mij1 - mi11 - m1j1 + m111
• A “3-way interaction”:
(abg)ijk = mijk - mij1 - mi1k - m1jk + mi11 + m1j1 + m11k - m111
Interactions (cont)
Then mijk = m + ai + bj + gk + (ab)ij + (bg)jk + (ag)ik + (abg)ijk
As before, if any one of the subscripts i, j, k is 1, then the corresponding interaction is zero.
Interpretation: e.g. if the protlevel × fish and the 3-way interactions are all zero, then the effect of changing levels of fish is the same for all levels of protlevel.
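The decomposition, and the property that an interaction vanishes whenever one of its subscripts is 1, can be checked numerically. This sketch uses an arbitrary 2 × 3 × 2 array of made-up means, not the chicken data:

```r
# Sketch: numerical check of the 3-way decomposition
set.seed(1)  # arbitrary made-up cell means, for illustration only
mu <- array(rnorm(12, mean = 100, sd = 10), dim = c(2, 3, 2))
m  <- mu[1, 1, 1]                       # overall baseline m111
a  <- mu[, 1, 1] - m                    # main effects of factor A
b  <- mu[1, , 1] - m                    # main effects of factor B
g  <- mu[1, 1, ] - m                    # main effects of factor C
ab <- mu[, , 1] - outer(a, b, "+") - m  # two-way interactions (others likewise)
bg <- mu[1, , ] - outer(b, g, "+") - m
ag <- mu[, 1, ] - outer(a, g, "+") - m
abg <- recon <- array(0, dim(mu))
for (i in 1:2) for (j in 1:3) for (k in 1:2) {
  abg[i, j, k] <- mu[i, j, k] - m - a[i] - b[j] - g[k] -
                  ab[i, j] - bg[j, k] - ag[i, k]   # 3-way term (the remainder)
  recon[i, j, k] <- m + a[i] + b[j] + g[k] +
                    ab[i, j] + bg[j, k] + ag[i, k] + abg[i, j, k]
}
max(abs(recon - mu))  # 0: the eight terms reconstruct every cell mean
max(abs(abg[1, , ]))  # 0: interactions vanish when any subscript is 1
```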