Treatment Comparisons • ANOVA can determine if there are differences among the treatments, but what is the nature of those differences? • Are the treatments measured on a continuous scale? • Look at response surfaces (linear regression, polynomials) • Is there an underlying structure to the treatments? • Compare groups of treatments using orthogonal contrasts or a limited number of preplanned mean comparison tests • Use simultaneous confidence intervals on preplanned comparisons • Are the treatments unstructured? • Use appropriate multiple comparison tests (today’s topic)
Variety Trials • In a breeding program, you need to examine large numbers of selections and then narrow to the best • In the early stages, selection is based on single plants or single rows of related plants; seed and space are limited, so replication is difficult • When numbers have been reduced and there is sufficient seed, you can conduct replicated yield trials and you want to be able to “pick the winner”
Comparison of Means • Pairwise Comparisons • Least Significant Difference (LSD) • Simultaneous Confidence Intervals • Tukey’s Honestly Significant Difference (HSD) • Dunnett Test (making all comparisons to a control) • May be a one-sided or two-sided test • Bonferroni Inequality • Scheffé’s Test – can be used for unplanned comparisons • Other Multiple Comparison Tests - “Data Snooping” • Fisher’s Protected LSD (FPLSD) • Student-Newman-Keuls test (SNK) • Waller and Duncan’s Bayes LSD (BLSD) • False Discovery Rate Procedure • Often misused - intended to be used only for data from experiments with unstructured treatments
Multiple Comparison Tests • Fixed Range Tests – a constant value is used for all comparisons • Application • Hypothesis Tests • Confidence Intervals • Multiple Range Tests – values used for comparison vary across a range of means • Application • Hypothesis Tests
Type I vs Type II Errors • Type I error - saying something is different when it is really the same (false positive) (Paranoia) • the rate at which this type of error is made is the significance level • Type II error - saying something is the same when it is really different (false negative) (Sloth) • the probability of committing this type of error is designated β • the probability that a comparison procedure will pick up a real difference is called the power of the test and is equal to 1 - β • Type I and Type II error rates are inversely related to each other • For a given Type I error rate, the rate of Type II error depends on • sample size • variance • true differences among means
Nobody likes to be wrong... • Protection against Type I is choosing a significance level • Protection against Type II is a little harder because • it depends on the true magnitude of the difference which is unknown • choose a test with sufficiently high power • Reasons for not using LSD to make all possible comparisons • the chance for a Type I error increases dramatically as the number of treatments increases
Pairwise Comparisons • Making all possible pairwise comparisons among t treatments • # of comparisons: t(t-1)/2 • If you have 10 varieties and want to look at all possible pairwise comparisons • that would be t(t-1)/2 = 10(9)/2 = 45 • that’s quite a few more than the t-1 = 9 df for treatments
Comparisonwise vs Experimentwise Error • Comparisonwise error rate (α_C) • measures the proportion of all differences that are expected to be declared real when they are not • Experimentwise error rate (α_E) • the risk of making at least one Type I error among the set (family) of comparisons in the experiment • measures the proportion of experiments in which one or more differences are falsely declared to be significant • the probability of being wrong increases as the number of means being compared increases • Also called the familywise error rate (FWE)
Comparisonwise vs Experimentwise Error • Experimentwise error rate (α_E) • Probability of no Type I errors = (1 - α_C)^x, where x = number of pairwise comparisons • Max x = t(t-1)/2, where t = number of treatments • Probability of at least one Type I error: α_E = 1 - (1 - α_C)^x • Comparisonwise error rate: α_C = 1 - (1 - α_E)^(1/x) • If t = 10, Max x = 45, and α_E = 1 - (1 - 0.05)^45 ≈ 90%
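These relationships are easy to verify numerically; a minimal Python sketch (variable names are ours):

    t = 10                       # number of treatments
    x = t * (t - 1) // 2         # maximum number of pairwise comparisons
    alpha_c = 0.05               # comparisonwise error rate
    alpha_e = 1 - (1 - alpha_c) ** x
    print(x, round(alpha_e, 2))  # 45 0.9 -- about a 90% chance of at least one Type I error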
Least Significant Difference • Calculate a t statistic for testing the difference between two means: t = (Ȳ1 - Ȳ2) / s(Ȳ1 - Ȳ2) • Any difference for which t_calc > t_α would be declared significant • Further, LSD = t_α * s(Ȳ1 - Ȳ2) is the smallest difference for which significance would be declared • Therefore, for equal replication, LSD = t_α * sqrt(2 * MSE / r), where r is the number of observations forming each mean
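For equal replication the LSD follows directly from the error mean square; a minimal sketch in Python using scipy (the function name lsd is ours, not a library routine):

    from scipy import stats

    def lsd(mse, r, df_error, alpha=0.05):
        """LSD = t(alpha, df_error) * sqrt(2 * MSE / r), two-sided."""
        t_crit = stats.t.ppf(1 - alpha / 2, df_error)
        return t_crit * (2 * mse / r) ** 0.5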
Do’s and Don’ts of using LSD • The LSD is a valid test when • making comparisons planned in advance of seeing the data (this includes the comparison of each treatment with the control) • comparing adjacent ranked means • Unless the F test for treatments is significant**, the LSD should not be used for • making all possible pairwise comparisons • making more comparisons than the df for treatments • **Some would say that the LSD should never be used unless the F test from ANOVA is significant
Pick the Winner • A plant breeder wanted to measure resistance to stem rust for six wheat varieties • planted 5 seeds of each variety in each of four pots • placed the 24 pots randomly on a greenhouse bench • inoculated with stem rust • measured seed yield per pot at maturity
Ranked Mean Yields (g/pot)

Variety  Rank  Mean Yield  Difference
F        1     95.3
D        2     94.0         1.3
E        3     75.0        19.0
B        4     69.0         6.0
A        5     50.3        18.7
C        6     24.0        26.3
ANOVA • Compute the LSD at 5% and 1%

Source    df   MS         F
Variety    5   2,976.44   24.80
Error     18     120.00
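The 5% and 1% LSD values can be reproduced from this table (MSE = 120, r = 4 pots per mean, 18 error df); a sketch using scipy:

    from scipy import stats

    mse, r, df_err = 120.0, 4, 18
    se_diff = (2 * mse / r) ** 0.5                          # sqrt(60) = 7.746
    print(round(stats.t.ppf(0.975, df_err) * se_diff, 2))   # 16.27 (5% LSD)
    print(round(stats.t.ppf(0.995, df_err) * se_diff, 2))   # ~22.3 (1% LSD; 22.29 with the tabled t)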
Back to the data... LSD(α = 0.05) = 16.27, LSD(α = 0.01) = 22.29

Variety  Rank  Mean Yield  Difference
F        1     95.3
D        2     94.0         1.3
E        3     75.0        19.0*
B        4     69.0         6.0
A        5     50.3        18.7*
C        6     24.0        26.3**

(* significant at 5%, ** significant at 1%)
Fisher’s protected LSD (FPLSD) • Uses comparisonwise error rate • Computed just like LSD but you don’t use it unless the F for treatments tests significant • So in our example data, any difference between means that is greater than 16.27 is declared to be significant
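A sketch of that protection step in Python, using the example ANOVA values (variable names are ours):

    from scipy import stats

    msv, mse = 2976.44, 120.0     # treatment and error mean squares from the example
    df_trt, df_err, r = 5, 18, 4
    f_calc = msv / mse            # 24.80
    if f_calc > stats.f.ppf(0.95, df_trt, df_err):   # overall F test significant at 5%?
        fplsd = stats.t.ppf(0.975, df_err) * (2 * mse / r) ** 0.5
        print(f"declare pairs differing by more than {fplsd:.2f} significant")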
Tukey’s Honestly Significant Difference (HSD) • From a table of Studentized range values (see handout), select a value of Q_α, which depends on p (the number of means) and v (error df) • Compute: HSD = Q_α * sqrt(MSE/r) • For any pair of means, if the difference is greater than the HSD, it is significant • Uses an experimentwise error rate • Use the Tukey-Kramer test with unequal sample sizes
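A minimal sketch of the HSD computation using scipy’s studentized range distribution (available in scipy >= 1.7; the function name hsd is ours):

    from scipy.stats import studentized_range

    def hsd(mse, r, k, df_error, alpha=0.05):
        """HSD = Q(alpha; k means, error df) * sqrt(MSE / r)."""
        q = studentized_range.ppf(1 - alpha, k, df_error)
        return q * (mse / r) ** 0.5

    print(round(hsd(120.0, 4, 6, 18), 2))  # ~24.6 for the six-variety example (24.59 with the tabled Q)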
Student-Newman-Keuls Test (SNK) • Rank the means from high to low • Compute t-1 significant differences, SNK_j = Q_(α; k, v) * sqrt(MSE/r), using the studentized range values from the HSD table, where j = 1, 2, ..., t-1; k = 2, 3, ..., t; and k = number of means in the range spanned by the comparison • Compare the highest and lowest means • if the difference is less than SNK, no differences are significant • if greater than SNK, compare the next highest mean with the next lowest using the next SNK • Uses an experimentwise error rate for the extremes • Uses a comparisonwise error rate for adjacent means
Using SNK with example data (18 df for error; SNK = Q * s_Ȳ = Q * sqrt(MSE/r)):

k     2      3      4      5      6
Q     2.97   3.61   4.00   4.28   4.49
SNK   16.27  19.77  21.91  23.44  24.59

Variety  Rank  Mean Yield
F        1     95.3
D        2     94.0
E        3     75.0
B        4     69.0
A        5     50.3
C        6     24.0

5 + 4 + 3 + 2 + 1 = 15 comparisons
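The Q and SNK rows above can be regenerated the same way; a sketch assuming scipy >= 1.7 (agreement is to table rounding):

    from scipy.stats import studentized_range

    mse, r, df_err, t = 120.0, 4, 18, 6
    se = (mse / r) ** 0.5                       # sqrt(30) = 5.477
    for k in range(2, t + 1):                   # k = number of means in the range
        q = studentized_range.ppf(0.95, k, df_err)
        print(k, round(q, 2), round(q * se, 2))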
Duncan’s New Multiple-Range Test • Critical value varies depending on the number of means involved in the test

Alpha                      0.05
Error Degrees of Freedom   6
Error Mean Square          113.0833

Number of Means    2      3      4      5      6
Critical Range     26.02  26.97  27.44  27.67  27.78

Means with the same letter are not significantly different.

Duncan Grouping    Mean    N   variety
      A            95.30   2   6
      A            94.00   2   4
    B A            75.00   2   5
    B A            69.00   2   2
    B              50.30   2   1
      C            22.50   2   3
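Duncan’s critical ranges come from the studentized range evaluated at a protection level that grows with k; a hedged sketch, assuming α_k = 1 - (1 - α)^(k-1), which approximates the SAS output above:

    from scipy.stats import studentized_range

    mse, r, df_err, alpha = 113.0833, 2, 6, 0.05
    se = (mse / r) ** 0.5
    for k in range(2, 7):
        a_k = 1 - (1 - alpha) ** (k - 1)        # Duncan's protection level
        q = studentized_range.ppf(1 - a_k, k, df_err)
        print(k, round(q * se, 2))              # ~26.02, 26.97, 27.44, 27.67, 27.78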
Waller-Duncan Bayes LSD (BLSD) • Do the ANOVA and compute F (MST/MSE) with q and f df (corresponds to the table nomenclature) • Choose an error weight ratio, k • k = 100 corresponds to a 5% significance level • k = 500 to a 1% test • Obtain t_b from the table (A7 in Petersen) • depends on k, F, q (treatment df) and f (error df) • Compute BLSD = t_b * sqrt(2 * MSE / r) • Any difference greater than the BLSD is significant • Does not provide complete control of the experimentwise Type I error rate • Reduces Type II error
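Since t_b comes from a Bayesian table rather than the t distribution, only the last step is a calculation; a sketch with a hypothetical table value:

    mse, r = 120.0, 4
    t_b = 2.10                          # hypothetical; read the actual t_b from Table A7
    blsd = t_b * (2 * mse / r) ** 0.5   # BLSD = t_b * sqrt(2 * MSE / r)
    print(round(blsd, 2))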
Bonferroni Inequality • Theory: α_E ≤ X * α_C, where X = number of pairwise comparisons • To get the critical probability value for significance: α_C = α_E / X, where α_E = maximum desired experimentwise error rate • Alternatively, multiply each observed probability value by X and compare it to α_E (values > 1 are set to 1) • Advantages • simple • strict control of Type I error • Disadvantage • very conservative, low power to detect differences
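The “multiply by X and cap at 1” adjustment is available in statsmodels; a sketch with hypothetical raw P values:

    from statsmodels.stats.multitest import multipletests

    raw_p = [0.001, 0.004, 0.021, 0.300]  # hypothetical pairwise P values
    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
    print([(round(p, 3), rej) for p, rej in zip(p_adj, reject)])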
False Discovery Rate • In the plot, bars show P values for simple t tests among means, ordered by rank; the largest differences have the smallest P values • The line represents the critical P values, (i/X) * α_E for i = 1 to X • Reject H0 for all comparisons up to the largest rank i at which the P value falls below the line
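This step-up comparison against (i/X) * α_E is the Benjamini-Hochberg procedure, implemented in statsmodels; a sketch with hypothetical P values:

    from statsmodels.stats.multitest import multipletests

    raw_p = [0.001, 0.004, 0.021, 0.030, 0.042, 0.300]  # hypothetical, already sorted
    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
    print([(round(p, 3), rej) for p, rej in zip(p_adj, reject)])  # first four rejected at alpha_E = 0.05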
Most Popular • The FPLSD test is widely used, and widely abused • The BLSD is preferred by some because • It is a single value and therefore easy to use • It is larger when F indicates that the means are homogeneous and smaller when the means appear to be heterogeneous • The False Discovery Rate (FDR) has nice features • Good experimentwise Type I error control • Good power (Type II error control) • May not be as well known as some other tests • Tukey’s HSD test • Widely accepted and often recommended by statisticians • May be too conservative if Type II error has more serious consequences than Type I error