Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students

Introduction to Biostatistical Analysis Using RStatistics course for first-year PhD students Session 3 Lecture: Analysis of Variance (ANOVA) Practical: ANOVA Lecturer: Lorenzo Marini DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova. E-mail: lorenzo.marini@unipd.it Tel.: +39 0498272807 http://www.biodiversity-lorenzomarini.eu/

Statistical modelling: more than one parameter Nature of the response variable Generalized Linear Models NORMAL (continuous) POISSON, BINOMIAL … GLM (not covered) General Linear Models Nature of the explanatory variables Categorical Continuous Categorical + continuous ANOVA Regression ANCOVA Session 3 Session 4

ANOVA: aov() ANOVA tests mean differences between groups defined by categorical variables One-way ANOVA ONE factor with 2 or more levels Multi-way ANOVA 2 or more factors, each with 2 or more levels Fertirrigation: 4 levels drug: 4 doses ♀ Gender ♂ ASSUMPTIONS Independence of cases - this is a requirement of the design. Normality - the distributions in each cells are normal [hist(), qq.plot(), shapiro.test()] Homogeneity of variances - the variance of data in groups should be the same (variance homogeneity with fligner.test()).

One-way ANOVA step by step 1. Test normality 2. Test homogeneity of variance within each group 3. Run the ANOVA 4. Reject/accept the H0 (all the means are equal) 2 approaches 5A.Multiple comparison to test differences between the level of factors 5B.Model simplification working with contrasts

One-way ANOVA Maize: 4 varieties (k) One-way ANOVA is used to test for differences among two or more independent groups y: productivity (NORMAL CONTINUOS) x: variety (CATEGORICAL: four levels: x1, x2, x3, x4) Ho: µ1= µ2= µ3= µ4 Ha: At least two means differ ANOVA model c y yi = a + bx2 + cx3 + dx4 d b a c=µ1-µ3 a=µ1 d=µ1-µ4 b=µ1-µ2

One-way ANOVA Number of groups: k = 4 X Number of observations N = 19 Grand mean = 7.80 Sum of squares (SS): deviance SS Total = Σ(yi– grand mean)2 SS Factor = Σ ni(group meani – grand mean)2 SS Error (within group) = Σ(yi – group meani)2 Degree of freedom (df) Total: N – 1 Group: k – 1 Error: N – k

One-way ANOVA: SS explanation SS Total SS Error Grand mean mean3 SS Factor mean4 SS Total = SS Factor +SS Error mean2 mean1

One-way ANOVA SSTotal=SSFactor + SSError SSTotal=SSFactor µ3 µ4 Grand mean µ2 µ1 SS can be divided by the respective df to get a variance MSFactor MS = SS /df Mean squared deviation MSError The pseudo-replication would work here!!!

One-way ANOVA: F test (variance) If the k means are not equal, then the Factor MS in the population will be greater than the population’s Error MS Factor MS Error MS F = How to define the correct F test can be a difficult task with complex design (BE EXTREMELY CAREFUL!!!!) If F calculated is large (e.g. P<0.05), then we can reject Ho All we conclude is that at least two means are different!!! A POSTERIORI MULTIPLE COMPARISONS WORKING WITH CONTRASTS

One-way ANOVA: contrasts Contrasts are the essence of hypothesis testing and model simplification in analysis of variance and analysis of covariance. They are used to compare means or groups of means with other means or groups of means We used contrasts to carry out t test AFTER having found out a significant effect with the F test - We can use contrasts in model simplification (merge similar factor levels) - Often we can avoid post-hoc multiple comparisons

One-way ANOVA: multiple comparisons If F calculated > F critic, then we can reject Ho At least two means are different!!! A POSTERIORI MULTIPLE COMPARISONS (lots of methods!!!) Multiple comparison procedures are then used to determine which means are different from which. Comparing K means involves K(K − 1)/2 pairwise comparisons. E.g. Tukey-Cramer, Duncan, Scheffè…

One-way ANOVA: nonparametric If the assumptions are seriously violated, then one can opt for a nonparametric ANOVA However One-way ANOVA is quite robust even in condition of non-normality and non-homogeneity of variance Kruskal-Wallis test (if k = 2, then it corresponds to the Mann-Whitney test) ANOVA by ranks If there are tied ranks a correction term must be applied kruskal.test()

Multi-way ANOVA Multi-way ANOVA is used when the experimenter wants to study the effects of two or more treatment variables. ASSUMPTIONS Independence of cases - this is a requirement of the design Normality - the distributions in each of the groups are normal Homogeneity of variances - the variance of data in groups should be the same + Equal replication (BALANCED AND ORTHOGONAL DESIGN) X X If you use traditional general linear models just one missing data can affect strongly the results

Fixed vs. random factors If we consider more than one factor we have to distinguish two kinds of effects: Fixed effects: factors are specifically chosen and under control, they are informative (E.g. sex, treatments, wet vs. dry, doses, sprayed or not sprayed) Random effects: factors are chosen randomly within a large population, they are normally not informative (E.g. fields within a site, block within a field, split-plot within a plot, family, parent, brood, individuals within repeated measures) • Random effects mainly occur in two contrasting kinds of circumstances • Observational studies with hierarchical structure • Designed experiments with different spatial or temporal dependence

Fixed vs. random factors Why is it so important to identify fixed vs. random effects? They affect the way to construct the F-test in a multifactorial ANOVA. Their false identification leads to wrong conclusions You can find how to construct your F-test with different combinations of random and fixed effects and with different hierarchical structures (choose a well-known sampling design!!!) If we have both fixed and random effects, then we are working on MIXED MODELS yi = µ + αi (fixed) + ri (random) + ε

Factorial ANOVA: two or more factors Factorial design: two or more factors are crossed. Each combination of the factors are equally replicated and each factor occurs in combination with every level of the other factors 3 levels of irrigation 4 fertilizer 10 10 10 Orthogonal sampling 10 10 10 10 10 10 10 10 10

Factorial ANOVA: Why? • Why use a factorial ANOVA? Why not just use multiple one-way ANOVA’s? With n factors, you’d need to run n one-way ANOVA’s, which would inflate your α-level • However, this could be corrected with a Bonferroni correction • The best reason is that a factorial ANOVA can detect interactions, something that multiple one-way ANOVA’s cannot do

Factorial ANOVA: Interactions Interaction: When the effects of one independent variable differ according to levels of another independent variable E.g. We are testing two factors, Gender (male and female) and Age (young, medium, and old) and their effect on performance If males performance differed as a function of age, i.e. males performed better or worse with age, but females performance was the same across ages, we would say that Age and Gender interact, or that we have an Age x Gender interaction Female Performance Male Young Old Age It is necessary that the slopes differ from one another

Factorial ANOVA: Main effects Main effects: the effect of a factor is independent from any other factors This is what we were looking at with one-way ANOVA’s – if we have a significant main effect of our factor, then we can say that the mean of at least one of the groups/levels of that factor is different than at least one of the other groups/levels Performance Performance Young Old Male Female It is necessary that the intercepts differ

Factorial ANOVA: Two-crossed fixed factor design Examples of ‘good’ ANOVA results Worst case Two-crossed factor design CloneA CloneB Treatment: 3 levels

Factorial ANOVA: Two-crossed factor design Two crossed fixed effects: every level of each factor occurs in combination with every level of the other factors Model 1: two fixed effects Model 2: two random effects (uncommon situation) Model 3: one random and one fixed effect • We can test main effects and interaction: • The main effect of each factor is the effect of each factor independent of (pooling over) the other factors • 2. The interaction between factors is a measure of how the effects of one factor depends on the levels of one or more additional factors (synergic and antagonist effect of the factors) Factor 1 x Factor 2 We can only measure interaction effects in factorial (crossed) designs

Factorial ANOVA: Two-crossed fixed factor design Two crossed fixed effects: Response variable: weight gain in six weeks Factor A: DIET (3 levels: barley, oats, wheat) Factor B: SUPPLEMENT (4 levels: S1, S2, S3, S4) DIET* SUPPLEMENT= 3 x 4 = 12 combinations We have 48 horses to test our two factors: 4 replicates barley+S2 oats+S2 wheat+S2 barley+S1 oats+S1 wheat+S1 barley+S4 oats+S4 wheat+S4 barley+S3 oats+S3 wheat+S3

Factorial ANOVA: Two-crossed fixed factor design The 48 horses must be independent units to be replicates Barley 26.34 23.29 22.46 25.57 Oats 23.29 20.49 19.66 21.86 wheat 19.63 17.40 17.01 19.66 DIET SUPPLEMENT DIET*SUPPLEMENT Barley Oats Wheat F test for main effects and interaction DIET SUPPLEMENT S1 S2 S3 S4 DIET*SUPPLEMENT

MIXED MODELS: Fixed+random effects Pure fixed effect models REQUIRE INDEPENDENCE WHAT IF WE HAVE DEPENDENCE??? Mixed models can deal with spatial or temporal dependence The mixed models included the dependence in the data with appropriate random effects

MIXED MODELS: Split-plotand We can consider random factors to account for the variability related to the environment in which we carry out the experiment Mixed models can deal with spatial or temporal dependence The split-plot design is one of the most useful design in agricultural and ecological research The different treatment are applied to plot with different size organized in a hierarchical structure

Factorial ANOVA: split-plot (mixed model) A B C D Block Irrigation Density Fertilizer Response variable: Crop production in each cell Fixed effects: Irrigation (yes or no) Seed-sowing density sub-plots (low, med, high) Fertilizer (N, P or NP) Random effects: 4 blocks Irrigation within block Density within irrigation plots

Factorial ANOVA: mixed model HIERARCHICAL STRUCTURE The split-plot design i) 4 Blocks No water Water ii) irrigation plot (yes or no) iii) seed-sowing density sub-plots (low, med, high) iv) 3 fertilizer sub-sub-plot (N, P, o NP) Crop production

Factorial ANOVA: mixed model aov() Model formulation Y~ fixed effects + error terms Here you can specify your sampling hierarchy y ~ a*b*c + Error(a/b/c) Block Irrigation Density Fertilizer Uninformative Yield Informative!!! Yield ~ irrigation*density*fertilizer+ Error(block/irrigation/density))

Mixed models: tradition vs. REML Mixed models using traditional ANOVA requires perfect orthogonal and balanced design (THEY WORK WELL WITH THE PROPER SAMPLING) avoid to work with multi-way ANOVA in non-orthogonal sampling designs If something has gone wrong with the sampling In R you can run Mixed models with missing data and unbalanced design (non orthogonal design) using the REML estimation lme4 or NLME

REML: Residual Maximum Likelihood vs. Maximum Likelihood Unbalance, non-orthogonal, multiple sources of error Mixed model: REML • Packages • NLME (quite old) • New Alternative • lme4

Mixed model: REML When should I use REML? For balanced data, both ANOVA and REML will give the same answer. However, ANOVA’s algorithm is much more efficient and so should be used whenever possible.

Generalized Linear Models (GLM) We can use GLMs when the variance is not constant, and/or when the errors are not normally distributed. A GLM has three important properties 1. The error structure 2. The linear predictor 3. The link function

Generalized Linear Models (GLM) Error structure In GLM we can specify error structure different from the normal: - Normal (gaussian) - Poisson errors (poisson) - Binomial errors (binomial) - Gamma errors (Gamma) glm(formula, family = …, link=…, data, …)

Generalized Linear Models (GLM) Linear predictor (η)= predicted value output from a GLM The model relates each observed y to a predicted value The predicted value is obtained by a TRANSFORMATION of the value emerging from the linear predictor β are the parameters estimated for the p explanatory variables x are the values measured for the p explanatory variables Fit of the model = comparison between the linear predictor and the TRANSFORMED y measured The transformation is specified in the LINK FUNCTION

Generalized Linear Models (GLM) Link functions (g) The link function related the mean value of y to its linear predictor The value of η is obtained by transforming the value of y by the link function The predicted value of y is obtained by applying the inverse link function to the η value of y by the link function Typically the output of a GLM is η Need to transform η to get the predicted values Known the distribution start with the canonical link function

Generalized Linear Models (GLM) glm() Model fit Deviance is the measure of goodness-of-fit of GLM Deviance=-2(log-likelihoodcurrent model -log-likelihood saturated model) We aim at reducing the residual deviance n is the size of the sample

Proportion data and binomial error Proportion data and binomial error 1. The data are strictly bounded (0-1) 2. The variance is non-constant (∩-shaped relation with the mean) 3. Errors are non-normal Link function: Logit

Proportion data and binomial error After fitting a binomial-logit GLM must be: Residual df ≈ residual deviance NO YES Fit a quasibinomial The model is adequate Check again NO YES The model is adequate Change distribution 2 Examples 1. The first example concerns sex ratios in insects (the proportion of all individuals that are males). In the species in question, it has been observed that the sex ratio is highly variable, and an experiment was set up to see whether population density was involved in determining the fraction of males. 2. The data consist of numbers dead and initial batch size for five doses of pesticide application, and we wish to know what dose kills 50% of the individuals (or 90% or 95%, as required).

Count data and Poisson error Count data and Poisson error 1. The data are non-negative integers 2. The variance is non-constant (variance = mean) 3. Errors are non-normal The model is fitted with a log link (to ensure that the fitted values are bounded below) and Poisson errors (to account for the non-normality).

Count data and Poisson error After fitting a Poisson-log GLM must be: Residual df ≈ residual deviance NO YES Fit a quasipoisson The model is adequate Check again YES NO The model is adequate Change distribution Example 1. In this example the response is a count of the number of plant species on plots that have different biomass (continuous variable) and different soil pH (high, mid and low).

Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students