Topic 8 – One-Way ANOVA Single Factor Analysis of Variance Reading: 17.1, 17.2, & 17.5 Skim: 12.3, 17.3, 17.4
Overview • Categorical Variables (Factors) • Fixed vs. Random Effects • Review: Two-sample T-test • ANOVA as a generalization of the two-sample T-test • Cell-Means and Factor-Effects ANOVA Models (same model, different form)
Terminology: Factors & Levels • The term factor is generally used to refer to a categorical predictor variable. • Blood Type • Gender • Drug Treatment • Other Examples? • The term levels is used to refer to the specific categories for a factor. • A / B / AB / O (could also consider +/-) • Male / Female
Factors: Fixed or Random? • A factor is fixed if the levels under consideration are the only ones of interest. • The levels of the factor are selected by a non-random process AND are the only levels of interest. • For the time being, all factors that we will consider will be fixed. • Examples?
Factors: Fixed or Random? (2) • A factor is random if the levels under consideration may be regarded as a sample from a larger population. • Not all levels of interest are included in the study – only a random sample. • We want inferences to be applicable to the entire (larger) population of levels. • Examples? • Analysis is a little more complicated; we’ll save this topic for near the end of the course.
Example: Random or Fixed? To study the effect of diet on cattle, an experimenter randomly (and equally) allocates 50 cows to 5 diets (a control and 4 experimental diets). After 1 year, the cows are butchered and the amount of good meat (in pounds) is measured. • Response = ______________ • Cow = _______ Factor • Diet = _______ Factor
Notation • In general, we label our factors A, B, C, etc. • Factor A has levels i = 1, 2, 3, ..., a • Factor B has levels j = 1, 2, 3, ..., b • Factor C has levels k = 1, 2, 3, ..., c • More on notation later; remember for now we are considering single factor ANOVA, so we will have only a “Factor A”.
Comparing Groups Suppose I want to compare heights between men and women. How would I do this?
Notation for Two-Sample Settings • Suppose an SRS (simple random sample) of size n1 is selected from the 1st population, and another SRS of size n2 is selected from the 2nd population.
Estimating Differences • A natural estimator of the difference $\mu_1 - \mu_2$ is the difference between the sample means: $\bar{Y}_1 - \bar{Y}_2$ • If we assume that both populations are normally distributed (or CLT applies) then both sample means and their difference will be normally distributed as well. • Because we are estimating standard deviations, a confidence interval for the difference in means uses the T-distribution.
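CI for Difference • If variances are unknown, then a 95% confidence interval for the difference in means is given by $(\bar{Y}_1 - \bar{Y}_2) \pm t^{*}\, s_p \sqrt{1/n_1 + 1/n_2}$, where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the pooled variance estimate • The critical value is $t^{*} = t_{0.975,\; n_1 + n_2 - 2}$. The degrees of freedom are n1 + n2 – 2.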
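Test for Difference = 0 • Can also be viewed as a hypothesis test • Test statistic for testing whether the difference is zero: $t = \frac{\bar{Y}_1 - \bar{Y}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$, where $s_p$ is the pooled standard deviation • Compare to the critical value used in the CI.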
Conclusions • If the test statistic is of larger magnitude (ignore sign) than the critical value, we reject the hypothesis • There is a significant difference between the two groups • The same conclusion results if the CI doesn’t contain zero. • If the statistic is smaller (CI does contain zero), we fail to reject the hypothesis • Fail to show a difference between the two groups
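The two-sample procedure reviewed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pooled_two_sample` and the height data are made up, the pooled-variance form is assumed because the degrees of freedom are n1 + n2 – 2, and the critical value is passed in from a t table rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_two_sample(sample1, sample2, t_crit):
    """Pooled-variance two-sample t statistic and confidence interval.

    t_crit is the tabled critical value for the chosen confidence level
    with n1 + n2 - 2 degrees of freedom (supplied by the caller).
    """
    n1, n2 = len(sample1), len(sample2)
    diff = mean(sample1) - mean(sample2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    t_stat = diff / se
    ci = (diff - t_crit * se, diff + t_crit * se)
    return {"diff": diff, "df": df, "se": se, "t": t_stat, "ci": ci,
            "reject": abs(t_stat) > t_crit}


# Hypothetical heights in inches (made up for illustration only).
men = [70, 72, 68, 71, 69]
women = [64, 66, 63, 65, 67]

# 2.306 is the tabled 0.975 quantile of the t distribution with 8 df.
result = pooled_two_sample(men, women, t_crit=2.306)
print(result)
```

The `reject` flag and the interval always agree: the test rejects exactly when the interval excludes zero, which is the equivalence stated on the Conclusions slide.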
Comparison of Several Groups Suppose instead of two groups, we have “a” groups that we wish to compare (where a > 2). Note: In Chapter 17, textbook defines the number of groups as “k”. Remember this is just a letter, and the letter we use really has nothing to do with anything in particular. So I’m using a to correspond (consistently) to Factor A.
Multiple treatment model • With a groups (treatments), we could do a(a – 1)/2 pairwise two-sample t-tests. But... • This does not test the equality of all means at once • Multiple tests mean a greater chance of making Type I errors (a Bonferroni correction can get expensive because of the large number of tests). • We usually expect variances to be the same across groups, but it isn’t clear how we should estimate the variance with more than two samples.
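To see how quickly the pairwise approach gets expensive, here is a rough sketch. The helper name `pairwise_test_burden` is hypothetical. The familywise error figure assumes the tests are independent, which pairwise t-tests on shared groups are not, so treat it as an illustration of the trend rather than an exact rate.

```python
from math import comb


def pairwise_test_burden(a, alpha=0.05):
    """Number of pairwise tests among a groups and the resulting error rates."""
    m = comb(a, 2)  # number of distinct pairs of groups: a(a - 1)/2
    # Chance of at least one Type I error IF the m tests were independent.
    fwer_if_independent = 1 - (1 - alpha) ** m
    # Bonferroni: run each test at alpha / m to keep the familywise rate <= alpha.
    bonferroni_alpha = alpha / m
    return m, fwer_if_independent, bonferroni_alpha


for a in (3, 5, 10):
    m, fwer, per_test = pairwise_test_burden(a)
    print(f"a = {a:2d}: {m:2d} tests, "
          f"familywise error (independent tests) ~ {fwer:.3f}, "
          f"Bonferroni per-test alpha = {per_test:.4f}")
```

As the number of groups grows, the per-test level under Bonferroni shrinks fast, which is the sense in which the correction "gets expensive".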
Multiple treatment model (2) • Analysis of Variance (ANOVA) models provide a more efficient way to compare multiple groups. For example, in a single factor ANOVA, • The Model (or ANOVA) F-test will test the equality of all group means at the same time. • There are methods of doing pairwise comparisons that are much more efficient than Bonferroni. • All observations (from all groups) are used to estimate the overall variance (by MSE).
Three Ways to View ANOVA • Cell means model: views observations in terms of their group means • Factor effects model: views observations as the sum of an overall mean and a deviation from that mean related to the particular group to which the observation belongs • As regression, using indicator variables.
ANOVA Model Cell Means Model
ANOVA • ANOVA is generally viewed as an extension of the T-test, used for comparisons of three or more population means. • These populations are denoted by the levels of our factor. • Only one variable, but it has 3+ levels or groups • Hence we call the means of these levels factor level means or simply cell means.
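Cell Means Model • Basic ANOVA Model is: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where the $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$ errors • Notation: • “i” subscript indicates the level of the factor • “j” subscript indicates observation number within the group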
Cell Sizes • For the time being, we will assume that all the cell sizes are the same: $n_i = n$ for every level i • The total sample size will be denoted $N = an$
Assumptions for fixed effects • Random samples have been selected for each level of the factor. All observations are independent. • Response variable is normally distributed for each population (level) and the population variances are the same. • Hence, independence, normality and constant variance • What happened to linearity?
Robustness • ANOVA procedures are generally robust to minor departures from the assumptions (i.e. minor deviations from the assumptions will not affect the performance of the procedure). • For major departures, transformations of the response variable [e.g. Log(Y)] may help. • Transforming the factor (i.e. the predictor) in ANOVA doesn’t help because it’s categorical
Components of Variation • Variation between groups gets “explained” by allowing the groups to have different means. • We know this as SSM, SSR, or now SSA! • Variation within groups is unexplained. • We know this as SSE (it stays the same) • The ratio F = MSM / MSE forms the basis for testing the hypothesis that all group means are the same. (or F = MSA / MSE)
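For reference, the mean squares behind the F ratio can be written out as follows. This uses the standard one-way ANOVA degrees of freedom, with a the number of levels of Factor A and N the total sample size. The symbols are assumed to match the rest of these notes.

```latex
\[
\mathrm{MSA} = \frac{\mathrm{SSA}}{a - 1},
\qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{N - a},
\qquad
F = \frac{\mathrm{MSA}}{\mathrm{MSE}} \sim F_{a-1,\;N-a}
\quad \text{when all group means are equal.}
\]
```

A large F means the between-group variation is large relative to the within-group variation, which is evidence against equal means.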
Variation: Between vs. Within • A convenient way to view the SS • SSA is called the “between” SS because it represents variation between the different groups. It is determined by the squared differences between group means and the grand (overall) mean. • SSE is called the “within” SS because it represents variation within groups. It is determined by the squared differences of observations from their group means.
Quick Comment on Notation • DOT indicates “sum” • BAR indicates “average” or “divide by cell/sample size” • $\bar{Y}_{..}$ is the mean for all observations • $\bar{Y}_{i.}$ is the mean for the observations in Level i of Factor A.
Pictorial Representation • [Figure: Group 1, Group 2, Group 3]
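SS Breakdown (Algebraic) • Break down the difference between an observation and the grand mean into two parts: $Y_{ij} - \bar{Y}_{..} = (\bar{Y}_{i.} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{i.})$ • The first term is the BETWEEN GROUPS part; the second term is the WITHIN GROUPS part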
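Components of Variation (2) • Of course the individual components would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have (with common cell size n): $\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{i} n\,(\bar{Y}_{i.} - \bar{Y}_{..})^2 + \sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_{i.})^2$ • That is, SST = SSA (BETWEEN GROUPS) + SSE (WITHIN GROUPS)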
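The decomposition can be checked numerically. The sketch below uses made-up balanced data (a = 3 groups, n = 4 observations each) and confirms that the cross-product term vanishes, so the total sum of squares splits into the between and within pieces.

```python
from statistics import mean

# Hypothetical balanced data: 3 groups with 4 observations each.
groups = {
    "group1": [4.0, 5.0, 6.0, 5.0],
    "group2": [7.0, 8.0, 9.0, 8.0],
    "group3": [1.0, 2.0, 3.0, 2.0],
}

all_obs = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_obs)
group_means = {g: mean(ys) for g, ys in groups.items()}

# Total: squared deviations of every observation from the grand mean.
sst = sum((y - grand_mean) ** 2 for y in all_obs)
# Between: squared deviations of group means from the grand mean,
# counted once per observation in the group.
ssa = sum(len(ys) * (group_means[g] - grand_mean) ** 2
          for g, ys in groups.items())
# Within: squared deviations of observations from their own group mean.
sse = sum((y - group_means[g]) ** 2
          for g, ys in groups.items() for y in ys)
# Cross-product term that appears when the two-part breakdown is squared.
cross = sum((group_means[g] - grand_mean) * (y - group_means[g])
            for g, ys in groups.items() for y in ys)

print(f"SST = {sst}, SSA = {ssa}, SSE = {sse}, cross-product = {cross}")
```

The cross-product is zero because, within each group, the deviations from the group mean sum to zero.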
Model F Test (Cell Means) • Null Hypothesis: $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$ • Alternative Hypothesis: $H_a$: not all of the $\mu_i$ are equal
Conclusion • If we reject the null hypothesis, we have shown differences between groups (levels) • Remember it does not tell us which groups are different. Only that at least one group is different from at least one other group! • If we fail to reject the null hypothesis, we have failed to show any significant differences with the ANOVA F test • Unfortunately sometimes if we look a little closer (we’ll do this later) we still might find some differences!
Calculations: A Brief Look • We’ll consider these for only a balanced design (cell sizes all the same n). • The purpose in doing this is not that you memorize formulas, but that you further your conceptual understanding of the sums of squares.
Blood Type Example (1) • Suppose we have 3 observations of a certain response variable for each blood type • Want to construct the ANOVA table
Blood Type Example (2) • We can compute the sample means using SAS:
Blood Type Example (3) • SSA (Between) • At this point, we have a choice – to calculate SSE or SST.
Blood Type Example (5) • DF: 4 – 1 = 3 for Factor A • DF: N – 1 = 11 for Total • DF: 11 – 3 = 8 for Error • Mean Squares: MSA = SSA / 3 and MSE = SSE / 8
Blood Type Example (6) • ANOVA Table • F-test is significant, and so we conclude that there is some difference among the means (we just don’t know exactly which means are different).
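The steps of this example can be strung together in a short Python sketch. The response values below are hypothetical and are NOT the data from the slides; they only share the layout (4 blood types, 3 observations each), so the degrees of freedom match while the sums of squares will not. The function name `one_way_anova` is also made up. The critical value is taken from an F table rather than computed.

```python
from statistics import mean


def one_way_anova(groups):
    """Return one-way ANOVA table entries for a dict of label -> observations."""
    all_obs = [y for ys in groups.values() for y in ys]
    a, n_total = len(groups), len(all_obs)
    grand = mean(all_obs)
    # Between-groups sum of squares (Factor A).
    ssa = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in groups.values())
    # Within-groups sum of squares (Error).
    sse = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    df_a, df_e = a - 1, n_total - a
    msa, mse = ssa / df_a, sse / df_e
    return {"df_a": df_a, "df_e": df_e, "df_t": n_total - 1,
            "ssa": ssa, "sse": sse, "sst": ssa + sse,
            "msa": msa, "mse": mse, "f": msa / mse}


# Hypothetical responses for illustration only.
blood = {
    "A": [10.0, 12.0, 11.0],
    "B": [14.0, 15.0, 16.0],
    "AB": [20.0, 19.0, 21.0],
    "O": [12.0, 13.0, 14.0],
}

table = one_way_anova(blood)
f_crit = 4.07  # tabled 0.95 quantile of the F distribution with (3, 8) df

print(f"{'Source':<8}{'DF':>4}{'SS':>10}{'MS':>10}{'F':>10}")
print(f"{'Factor A':<8}{table['df_a']:>4}{table['ssa']:>10.2f}"
      f"{table['msa']:>10.2f}{table['f']:>10.2f}")
print(f"{'Error':<8}{table['df_e']:>4}{table['sse']:>10.2f}{table['mse']:>10.2f}")
print(f"{'Total':<8}{table['df_t']:>4}{table['sst']:>10.2f}")
print("Reject equal means at the 5% level:", table["f"] > f_crit)
```

Only SSA and SSE are computed directly here. SST is obtained by addition, which mirrors the choice mentioned in the example of calculating either SSE or SST once SSA is known.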
SAS Coding • Will use PROC GLM with an important addition: CLASS statement • CLASS statement identifies categorical variables for SAS • Note that failure to use CLASS statement for categorical variable will result in: • SYNTAX ERROR if character variable • INAPPROPRIATE ANALYSIS if class levels are numeric
Residual Diagnostics • Very similar to what we did in regression • Normality plot is the same – keep in mind that most of the tests in ANOVA are robust to minor violations of normality (thanks to the CLT). • In constant variance plot, still may see megaphone shape in RESID vs. PRED if non-constant variance is a problem. • In plots against the factor levels (commonly used), would simply see differing vertical spreads (not megaphone, because generally the labels on the horizontal axis are not “ordered”)
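As a rough numerical companion to the plots against factor levels, the residuals and their spread within each level can be computed directly. The data are hypothetical, and comparing standard deviations this way is only an informal check, not a formal test of constant variance.

```python
from statistics import mean, stdev

# Hypothetical responses by factor level (illustration only).
groups = {
    "A": [10.0, 12.0, 11.0],
    "B": [14.0, 15.0, 16.0],
    "AB": [18.0, 20.0, 22.0],
    "O": [12.0, 13.0, 14.0],
}

# In the cell means model the fitted value is the group mean,
# so each residual is the observation minus its group mean.
residuals = {g: [y - mean(ys) for y in ys] for g, ys in groups.items()}

# Spread of the residuals within each level; very different values
# would show up as differing vertical spreads in a residual plot.
spread = {g: stdev(r) for g, r in residuals.items()}

for g in groups:
    print(f"{g:>2}: residuals = {residuals[g]}, sd = {spread[g]:.3f}")
```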
Model Estimates • In SAS, using /solution as an option in the MODEL statement of PROC GLM, we can get the parameter estimates for our model. • Unfortunately these are not the cell means!
Cell or Group Means • To get each cell mean, just add the intercept to the corresponding parameter estimate
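The relationship between the parameter estimates and the cell means can be imitated in plain Python. This sketch assumes the usual PROC GLM convention that the last level (in sorted order) is the reference level with its estimate set to zero, so the intercept equals that level's mean. The data are hypothetical, and the code does not call SAS.

```python
from statistics import mean

# Hypothetical responses by blood type (illustration only).
groups = {
    "A": [10.0, 12.0, 11.0],
    "B": [14.0, 15.0, 16.0],
    "AB": [20.0, 19.0, 21.0],
    "O": [12.0, 13.0, 14.0],
}

levels = sorted(groups)      # alphabetical ordering of the level labels
reference = levels[-1]       # assumed reference level: the last one

# Intercept = mean of the reference level; each level's estimate is
# its mean minus the intercept (so the reference level gets 0).
intercept = mean(groups[reference])
estimates = {lvl: mean(groups[lvl]) - intercept for lvl in levels}

# Adding the intercept back recovers every cell mean.
cell_means = {lvl: intercept + estimates[lvl] for lvl in levels}

print("Intercept:", intercept)
for lvl in levels:
    print(f"{lvl:>2}: estimate = {estimates[lvl]:+.2f}, "
          f"cell mean = {cell_means[lvl]:.2f}")
```

Choosing a different reference level would change the intercept and every estimate, but the recovered cell means would be the same.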
Model Estimates • The reason for this is that there are infinitely many ways to write down the model for ANOVA. • SAS tells us this by saying ALL estimates are “biased”. So what is SAS actually doing?
ANOVA Model Factor Effects Model (Another convenient view)
A simple example • Three groups: Grand Mean