
Topic 8 – One-Way ANOVA


sheena

Presentation Transcript


  1. Topic 8 – One-Way ANOVA: Single Factor Analysis of Variance • Reading: 17.1, 17.2, & 17.5 • Skim: 12.3, 17.3, 17.4

  2. Overview • Categorical Variables (Factors) • Fixed vs. Random Effects • Review: Two-sample T-test • ANOVA as a generalization of the two-sample T-test • Cell-Means and Factor-Effects ANOVA Models (same model, different form)

  3. Terminology: Factors & Levels • The term factor is generally used to refer to a categorical predictor variable. • Blood Type • Gender • Drug Treatment • Other Examples? • The term levels is used to refer to the specific categories for a factor. • A / B / AB / O (could also consider +/-) • Male / Female

  4. Factors: Fixed or Random? • A factor is fixed if the levels under consideration are the only ones of interest. • The levels of the factor are selected by a non-random process AND are the only levels of interest. • For the time being, all factors that we will consider will be fixed. • Examples?

  5. Factors: Fixed or Random? (2) • A factor is random if the levels under consideration may be regarded as a sample from a larger population. • Not all levels of interest are included in the study – only a random sample. • We want inferences to be applicable to the entire (larger) population of levels. • Examples? • Analysis is a little more complicated; we’ll save this topic for near the end of the course.

  6. Example: Random or Fixed? To study the effect of diet on cattle, an experimenter randomly (and equally) allocates 50 cows to 5 diets (a control and 4 experimental diets). After 1 year, the cows are butchered and the amount of good meat (in pounds) is measured. • Response = ______________ • Cow = _______ Factor • Diet = _______ Factor

  7. Notation • In general, we label our factors A, B, C, etc. • Factor A has levels i = 1, 2, 3, ..., a • Factor B has levels j = 1, 2, 3, ..., b • Factor C has levels k = 1, 2, 3, ..., c • More on notation later; remember for now we are considering single factor ANOVA, so we will have only a “Factor A”.

  8. Comparing Groups Suppose I want to compare heights between men and women. How would I do this?

  9. Notation for Two-Sample Settings • Suppose an SRS (simple random sample) of size n1 is selected from the 1st population, and another SRS of size n2 is selected from the 2nd population.

  10. Estimating Differences • A natural estimator of the difference is the difference between the sample means: $\bar{Y}_1 - \bar{Y}_2$ • If we assume that both populations are normally distributed (or CLT applies) then both sample means and their difference will be normally distributed as well. • Because we are estimating standard deviations, a confidence interval for the difference in means uses the T-distribution.

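  11. CI for Difference • If variances are unknown, then a 95% confidence interval for the difference in means is given by $(\bar{Y}_1 - \bar{Y}_2) \pm t^* \, s_p \sqrt{1/n_1 + 1/n_2}$, where $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$ is the pooled variance. • The critical value is $t^* = t_{0.975,\, n_1 + n_2 - 2}$. The degrees of freedom is n1 + n2 – 2.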

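  12. Test for Difference = 0 • Can also be viewed as a hypothesis test • Test statistic for testing whether the difference is zero: $t = (\bar{Y}_1 - \bar{Y}_2) / (s_p \sqrt{1/n_1 + 1/n_2})$, where $s_p$ is the pooled standard deviation • Compare to critical value used in CI.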

  13. Conclusions • If the test statistic is of larger magnitude (ignore sign) than the critical value, we reject the hypothesis • There is a significant difference between the two groups • The same conclusion results if the CI doesn’t contain zero. • If the statistic is smaller (CI does contain zero), we fail to reject the hypothesis • Fail to show a difference between the two groups
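
The two-sample review above can be sketched in a few lines of Python. This is a minimal illustration, assuming equal population variances (consistent with the n1 + n2 – 2 degrees of freedom); the function name `pooled_t`, the height values, and the use of Python rather than SAS are assumptions made for the example, not part of the course material. The critical value is passed in by the caller, as if read from a t table.

```python
import math
from statistics import mean, variance


def pooled_t(sample1, sample2, t_crit):
    """Pooled two-sample t statistic and CI for mean1 - mean2.

    t_crit is the critical value for n1 + n2 - 2 degrees of freedom,
    supplied by the caller (e.g. looked up in a t table).
    """
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    diff = mean(sample1) - mean(sample2)
    # Pooled variance: df-weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_stat = diff / se
    ci = (diff - t_crit * se, diff + t_crit * se)
    return t_stat, df, ci


# Made-up heights in inches, for illustration only.
men = [70.0, 72.0, 68.0, 71.0, 69.0]
women = [64.0, 66.0, 63.0, 65.0, 67.0]

T_CRIT = 2.306  # t(0.975) with 8 degrees of freedom, from a t table
t_stat, df, ci = pooled_t(men, women, T_CRIT)

# Reject "difference = 0" when |t| exceeds the critical value,
# which happens exactly when the CI excludes zero.
reject = abs(t_stat) > T_CRIT
ci_excludes_zero = not (ci[0] <= 0.0 <= ci[1])
```

With these made-up numbers both decision rules agree, which is the equivalence stated on the Conclusions slide.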

  14. Comparison of Several Groups Suppose instead of two groups, we have “a” groups that we wish to compare (where a > 2). Note: In Chapter 17, textbook defines the number of groups as “k”. Remember this is just a letter, and the letter we use really has nothing to do with anything in particular. So I’m using a to correspond (consistently) to Factor A.

  15. Multiple treatment model • With a groups (treatments), we could do a two-sample t-test for each pair of groups. But... • This does not test the equality of all means at once • Multiple tests mean we have a greater chance of making Type I errors (a Bonferroni correction can get expensive because of the large number of tests). • We usually expect variances to be the same across groups, but it isn’t clear how we should estimate variance with more than two samples.
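
To see why pairwise testing gets expensive, the sketch below counts the pairwise comparisons among a groups and shows what they do to the error rate. The helper name `pairwise_test_burden` is hypothetical, and the familywise rate is computed under the simplifying assumption that the tests are independent; real pairwise t-tests share data, so treat the number as a rough guide only.

```python
from math import comb


def pairwise_test_burden(a, alpha=0.05):
    """Count pairwise tests among a groups and what they cost.

    Returns (number of tests, familywise Type I error rate if the tests
    were independent, Bonferroni per-test significance level).
    """
    m = comb(a, 2)  # a(a-1)/2 distinct pairs of groups
    # Simplifying assumption: tests treated as independent.
    familywise = 1 - (1 - alpha) ** m
    # Bonferroni: split alpha evenly across the m tests.
    bonferroni_alpha = alpha / m
    return m, familywise, bonferroni_alpha


# Five groups (e.g. five diets) already require ten pairwise tests.
m, familywise, per_test = pairwise_test_burden(5)
```

For five groups this gives ten tests, a familywise error rate of roughly 0.40 under the independence assumption, and a Bonferroni per-test level of 0.005.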

  16. Multiple treatment model (2) • Analysis of Variance (ANOVA) models provide a more efficient way to compare multiple groups. For example, in a single factor ANOVA, • The Model (or ANOVA) F-test will test the equality of all group means at the same time. • There are methods of doing pairwise comparisons that are much more efficient than Bonferroni. • All observations (from all groups) are used to estimate the overall variance (by MSE).

  17. Three Ways to View ANOVA • Views observations in terms of their group means: the cell means model • Views observations as the sum of an overall mean and a deviation from that mean related to the particular group to which the observation belongs: the factor effects model • As regression, using indicator variables.

  18. ANOVA Model Cell Means Model

  19. ANOVA • ANOVA is generally viewed as an extension of the T-test, but used for comparisons of three or more population means. • These populations are denoted by the levels of our factor. • Only one variable, but it has 3+ levels or groups • Hence we call the means of these levels factor level means or simply cell means.

  20. Cell Means Model • Basic ANOVA Model is: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$ • Notation: • “i” subscript indicates the level of the factor • “j” subscript indicates observation number within the group

  21. Cell Sizes • For the time being, we will assume that all the cell sizes are the same: $n_i = n$ for every level i • The total sample size will be denoted $N = an$

  22. Assumptions for fixed effects • Random samples have been selected for each level of the factor. All observations are independent. • Response variable is normally distributed for each population (level) and the population variances are the same. • Hence, independence, normality and constant variance • What happened to linearity?

  23. Robustness • ANOVA procedures are generally robust to minor departures from the assumptions (i.e. minor deviations from the assumptions will not affect the performance of the procedure). • For major departures, transformations of the response variable [e.g. Log(Y)] may help. • Transforming the factor (i.e. the predictor) in ANOVA doesn’t help because it’s categorical.

  24. Components of Variation • Variation between groups gets “explained” by allowing the groups to have different means. • We know this as SSM, SSR, or now SSA! • Variation within groups is unexplained. • We know this as SSE (it stays the same) • The ratio F = MSM / MSE (or F = MSA / MSE) forms the basis for testing the hypothesis that all group means are the same.

  25. Variation: Between vs. Within • A convenient way to view the SS • SSA is called the “between” SS because it represents variation between the different groups. It is determined by the squared differences between group means and the grand (overall) mean. • SSE is called the “within” SS because it represents variation within groups. It is determined by the squared differences of observations from their group means.

  26. Quick Comment on Notation • DOT indicates “sum” • BAR indicates “average” or “divide by cell/sample size” • $\bar{Y}_{..}$ is the mean for all observations • $\bar{Y}_{i.}$ is the mean for the observations in Level i of Factor A.

  27. Pictorial Representation [Figure with panels: Group 1, Group 2, Group 3]

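  28. SS Breakdown (Algebraic) • Break down difference between observation and grand mean into two parts: $Y_{ij} - \bar{Y}_{..} = (\bar{Y}_{i.} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{i.})$, where the first term is the BETWEEN GROUPS part and the second is the WITHIN GROUPS part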

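  29. Components of Variation (2) • Of course the individual components would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have: $\sum_i \sum_j (Y_{ij} - \bar{Y}_{..})^2 = \sum_i \sum_j (\bar{Y}_{i.} - \bar{Y}_{..})^2 + \sum_i \sum_j (Y_{ij} - \bar{Y}_{i.})^2$, i.e. SST = SSA (BETWEEN GROUPS) + SSE (WITHIN GROUPS)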
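
The between/within split can be checked numerically. The sketch below is only an illustration in Python (not the course's SAS); the function name `sums_of_squares` and the three small groups are made up for the example.

```python
from statistics import mean


def sums_of_squares(groups):
    """Return (SSA, SSE, SST) for a list of groups of observations."""
    all_obs = [y for g in groups for y in g]
    grand = mean(all_obs)
    # Between: squared distance of each group mean from the grand mean,
    # counted once per observation in that group.
    ssa = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within: squared distance of each observation from its own group mean.
    sse = sum((y - mean(g)) ** 2 for g in groups for y in g)
    # Total: squared distance of each observation from the grand mean.
    sst = sum((y - grand) ** 2 for y in all_obs)
    return ssa, sse, sst


# Made-up data: three groups of three observations.
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]
ssa, sse, sst = sums_of_squares(groups)
```

Here the group means are 5, 8, and 2 around a grand mean of 5, so SSA = 54, SSE = 6, and SST = 60 = SSA + SSE.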

  30. ANOVA Table

  31. Model F Test (Cell Means) • Null Hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$ • Alternative Hypothesis $H_a$: not all of the $\mu_i$ are equal

  32. Conclusion • If we reject the null hypothesis, we have shown differences between groups (levels) • Remember it does not tell us which groups are different. Only that at least one group is different from at least one other group! • If we fail to reject the null hypothesis, we have failed to show any significant differences with the ANOVA F test • Unfortunately sometimes if we look a little closer (we’ll do this later) we still might find some differences!

  33. Calculations: A Brief Look • We’ll consider these for only a balanced design (cell sizes all the same n). • The purpose in doing this is not that you memorize formulas, but that you further your conceptual understanding of the sums of squares.

  34. SS Calculations(Balanced)
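
For reference, the standard sums of squares for a balanced design, written in the dot/bar notation used on the earlier slides (a levels, n observations per level), are sketched below. These are the usual textbook forms and are assumed, not confirmed, to match what this slide displayed.

```latex
\begin{align*}
SSA &= n \sum_{i=1}^{a} (\bar{Y}_{i.} - \bar{Y}_{..})^2 \\
SSE &= \sum_{i=1}^{a} \sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{i.})^2 \\
SST &= \sum_{i=1}^{a} \sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{..})^2 = SSA + SSE
\end{align*}
```

The factor n in SSA appears because each group mean's squared deviation is counted once for every observation in that group.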

  35. Blood Type Example (1) • Suppose we have 3 observations of a certain response variable for each blood type • Want to construct the ANOVA table

  36. Blood Type Example (2) • We can compute the sample means using SAS:

  37. Blood Type Example (3) • SSA (Between) • At this point, we have a choice – to calculate SSE or SST.

  38. Blood Type Example (4)

  39. Blood Type Example (5) • DF: 4 – 1 = 3 for Factor A • DF: N – 1 = 11 for Total • DF: 11 – 3 = 8 for Error • Mean Squares:

  40. Blood Type Example (6) • ANOVA Table • F-test is significant, and so we conclude that there is some difference among the means (we just don’t know exactly which means are different).
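
The steps of this example (sums of squares, degrees of freedom, mean squares, F) can be strung together as below. This is a sketch in Python, not the SAS used in the course, and the observations are invented for illustration; they are not the data from the slides. Only the layout (4 blood types, 3 observations each) mirrors the example, so the degrees of freedom come out as 3, 8, and 11.

```python
from statistics import mean


def one_way_anova(groups):
    """Build the pieces of a one-way ANOVA table from grouped data."""
    a = len(groups)                   # number of factor levels
    n_total = sum(len(g) for g in groups)
    all_obs = [y for g in groups for y in g]
    grand = mean(all_obs)
    ssa = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    sse = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_a, df_e = a - 1, n_total - a
    msa, mse = ssa / df_a, sse / df_e
    return {
        "df_a": df_a, "df_e": df_e, "df_total": n_total - 1,
        "ssa": ssa, "sse": sse, "sst": ssa + sse,
        "msa": msa, "mse": mse, "F": msa / mse,
    }


# Hypothetical responses: 3 observations for each of 4 blood types.
blood = {
    "A": [10.0, 12.0, 14.0],
    "B": [11.0, 13.0, 15.0],
    "O": [15.0, 17.0, 19.0],
    "AB": [20.0, 22.0, 24.0],
}
table = one_way_anova(list(blood.values()))
```

For these invented numbers SSA = 186 and SSE = 32, giving MSA = 62, MSE = 4, and F = 15.5, which would then be compared with an F critical value on 3 and 8 degrees of freedom.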

  41. SAS Coding • Will use PROC GLM with an important addition: CLASS statement • CLASS statement identifies categorical variables for SAS • Note that failure to use CLASS statement for categorical variable will result in: • SYNTAX ERROR if character variable • INAPPROPRIATE ANALYSIS if class levels are numeric

  42. Blood Type Example (SAS)

  43. Residual Diagnostics • Very similar to what we did in regression • Normality plot is the same – keep in mind that most of the tests in ANOVA are robust to minor violations of normality (thanks to the CLT). • In constant variance plot, still may see megaphone shape in RESID vs. PRED if non-constant variance is a problem. • In plots against the factor levels (commonly used), would simply see differing vertical spreads (not megaphone, because generally the labels on the horizontal axis are not “ordered”)

  44. Blood Type (QQ Plot)

  45. Blood Type (Residual Plot)

  46. Model Estimates • In SAS, using /solution as an option in the MODEL statement of PROC GLM, we can get the parameter estimates for our model. • Unfortunately these are not the cell means!

  47. Cell or Group Means • To get each cell mean, add the intercept to the corresponding parameter estimate
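
The relationship between the parameter estimates and the cell means can be sketched as follows. The example assumes reference-level coding in which the last level in sort order has its effect set to zero, which is how PROC GLM's /solution output is usually described; the function name `reference_coded_estimates` and the data are hypothetical, and the code is Python, not SAS.

```python
from statistics import mean


def reference_coded_estimates(data):
    """data: dict mapping level name -> list of observations.

    Returns (intercept, effects) where the last level in sort order is
    the reference: its effect is 0 and the intercept is its mean.
    """
    levels = sorted(data)
    ref = levels[-1]
    intercept = mean(data[ref])
    # Each effect is that level's mean minus the reference level's mean.
    effects = {lvl: mean(data[lvl]) - intercept for lvl in levels}
    return intercept, effects


# Hypothetical data, not the values from the slides.
data = {
    "A": [10.0, 12.0, 14.0],
    "AB": [20.0, 22.0, 24.0],
    "B": [11.0, 13.0, 15.0],
    "O": [15.0, 17.0, 19.0],
}
intercept, effects = reference_coded_estimates(data)

# Adding the intercept back to each effect recovers the cell means.
cell_means = {lvl: intercept + eff for lvl, eff in effects.items()}
```

With "O" as the reference level, the intercept is 17 and the effects are -5, 5, -4, and 0, so the recovered cell means are 12, 22, 13, and 17.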

  48. Model Estimates • The reason for this is that there are infinitely many ways to write down the model for ANOVA. • SAS tells us this by saying ALL estimates are “biased”. So what is SAS actually doing?

  49. ANOVA Model Factor Effects Model (Another convenient view)

  50. A simple example • Three groups: Grand Mean  
