BIOL 582 Lecture Set 16 Analysis of frequency and categorical data Part I: Goodness of Fit Tests
Until this point, we have concerned ourselves mainly with continuous quantitative response data, somewhat with discrete data that behave as if continuous, and only rarely with categorical response data.
• The emphasis of this lecture topic is how to analyze response data that fall into categories.
• Analyses such as these are best motivated by examples.
• The examples, nomenclature, and coverage of the topic largely follow Chapter 17 of Sokal and Rohlf (2011) Biometry, 4th Edition, but exclude some of the additional detail that the book goes into.
An example of simple Mendelian genetics
• A phenotypic trait that exhibits genetic dominance: white and brown fuzzy bunnies
• Let the gene for coat color be denoted by alleles B (brown) or b (not brown)
• There are three genotypes possible: BB (brown), Bb (brown), and bb (white)
• If a monohybrid cross is performed (i.e., two heterozygous brown bunnies, Bb, are mated), each offspring has one of the following genotypes, each equally likely: BB, Bb, bB, or bb
• Because bB and Bb are the same genotype, the expected genotype ratio of offspring is 1:2:1 for BB:Bb:bb. The expected phenotype ratio is 3:1, brown:white, because of genetic dominance
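The 1:2:1 and 3:1 ratios can be checked by enumerating the cross directly. The R sketch below is illustrative only: the variable names (`alleles`, `offspring`, and so on) are not from the lecture, and it assumes each parent passes on B or b with equal probability.

```r
# Enumerate the four equally likely allele combinations of a Bb x Bb cross.
alleles <- c("B", "b")
offspring <- as.vector(outer(alleles, alleles, paste0))

# bB and Bb are the same genotype, so relabel bB as Bb.
offspring[offspring == "bB"] <- "Bb"

# Genotype counts in a fixed BB, Bb, bb order.
genotype_counts <- table(factor(offspring, levels = c("BB", "Bb", "bb")))

# B is dominant, so only bb offspring are white.
phenotype <- ifelse(offspring == "bb", "white", "brown")
phenotype_counts <- table(factor(phenotype, levels = c("brown", "white")))

print(genotype_counts)   # counts of BB, Bb, bb
print(phenotype_counts)  # counts of brown, white
```

The expected proportion of brown offspring used in the tests that follow, π = 0.75, is the brown count divided by four.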
Some heterozygous brown bunnies are mated and 100 offspring are born, 89 brown and 11 white
• Does this result defy expectation?
• Solution 1: use the binomial probability distribution to get an exact P-value
$$P(k) = \binom{n}{k}\,\pi^{k}(1-\pi)^{n-k}$$
• Here n is the number of subjects and k is the number of subjects with some specified value (e.g., brown color). π is the expected proportion of subjects with the specified value; 1 - π is the expected proportion without the value. For our example,
$$P(89) = \binom{100}{89}\,0.75^{89}\,0.25^{11} = 0.0002564172$$
• Note that this is the probability of finding exactly 89 brown bunnies out of 100 when 75 are expected. It is probably better to find
$$P(k \ge 89) = \sum_{j=89}^{100}\binom{100}{j}\,0.75^{j}\,0.25^{100-j}$$
• This is the probability of finding at least 89 brown bunnies when 75 are expected. It returns a P-value of 0.0003935178.
Solution 1: binomial probability distribution to get an exact p-value, using R
> # For probability that k = 89, when n = 100
> pbinom(89,100,0.75,lower.tail=T) - pbinom(88,100,0.75,lower.tail=T)
[1] 0.0002564172
> # or
> pbinom(88,100,0.75,lower.tail=F) - pbinom(89,100,0.75,lower.tail=F)
[1] 0.0002564172
> # Note that R uses cumulative probability function
> # For probability that k >= 89, when n = 100
> 1- pbinom(88,100,0.75,lower.tail=T)
[1] 0.0003935178
> # or
> pbinom(88,100,0.75,lower.tail=F)
[1] 0.0003935178
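The same two probabilities can be obtained without differencing cumulative probabilities. The sketch below is one possible cross-check, not part of the original slides: it uses `dbinom()` for the point probability and `binom.test()` for the one-sided exact test, and the variable names are chosen here for illustration.

```r
n <- 100      # number of offspring
k <- 89       # number of brown offspring
pi0 <- 0.75   # expected proportion of brown offspring

# Probability of exactly k brown offspring (binomial mass function).
p_exact <- dbinom(k, size = n, prob = pi0)

# Probability of k or more brown offspring, summed term by term.
p_tail <- sum(dbinom(k:n, size = n, prob = pi0))

# The same upper-tail probability from the exact binomial test.
p_binom_test <- binom.test(k, n, p = pi0, alternative = "greater")$p.value

print(c(exact = p_exact, tail = p_tail, binom_test = p_binom_test))
```

`alternative = "greater"` is used because the question posed is whether there are more brown bunnies than expected; a two-sided test would give a larger P-value.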
Solution 2: “Chi-square” test
• Note: this is really a bad name, since the chi-square distribution is a continuous distribution, whereas the test statistic calculated in the following example (as you probably learned in a genetics class) is a sample statistic calculated from discrete frequencies. However, the statistic approximately follows a chi-square distribution.
• χ² denotes the theoretical distribution; X² denotes the sample statistic
$$X^{2} = \sum_{i=1}^{a}\frac{(f_i - \hat{f}_i)^{2}}{\hat{f}_i}$$
• f_i is the observed frequency for category i; f̂_i is the expected frequency, found as πn or (1 - π)n, depending on whether the expected frequency corresponds to the specified category or the unspecified category
• In the example, 0.75*100 = 75 is the expected number of brown bunnies; 0.25*100 = 25 is the expected number of white bunnies
• For the example,
$$X^{2} = \frac{(89-75)^{2}}{75} + \frac{(11-25)^{2}}{25} = 10.453, \quad df = 1, \quad P \approx 0.0012$$
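A short R sketch of this calculation follows. It computes the statistic by hand from the observed and expected frequencies and then compares it with R's built-in `chisq.test()`; the variable names are illustrative, and the statistic is assumed to be the usual sum of squared deviations divided by expected frequencies.

```r
observed <- c(brown = 89, white = 11)
expected_prop <- c(brown = 0.75, white = 0.25)
expected <- expected_prop * sum(observed)   # 75 brown, 25 white

# Sample statistic: squared deviations scaled by expected frequencies.
X2 <- sum((observed - expected)^2 / expected)

# Compare with a chi-square distribution on (number of classes - 1) df.
df <- length(observed) - 1
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

# Built-in equivalent; no continuity correction is applied for a
# vector of counts tested against given proportions.
check <- chisq.test(observed, p = expected_prop)

print(c(X2 = X2, df = df, P = p_value))
print(check)
```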
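Solution 3: “G” (Goodness of Fit) test
$$G = 2\sum_{i=1}^{a} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right)$$
• This equation can also be written as
$$G = 2\ln\left[\prod_{i=1}^{a}\left(\frac{f_i}{\hat{f}_i}\right)^{f_i}\right]$$
• χ² denotes the theoretical distribution with which G is compared
• The interior part is a likelihood ratio: the ratio of the binomial probabilities of the observed data evaluated at k/n and at π.
• For the example,
$$G = 2\left[89\ln\left(\frac{89}{75}\right) + 11\ln\left(\frac{11}{25}\right)\right] = 12.403, \quad df = 1, \quad P \approx 0.0004$$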
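The link between G and the likelihood ratio can be checked numerically. The sketch below assumes the usual form G = 2 Σ f ln(f / f̂) and compares it with twice the log of the ratio of two binomial probabilities, one evaluated at the observed proportion k/n and one at the expected proportion π. Variable names are illustrative.

```r
observed <- c(brown = 89, white = 11)
expected <- c(brown = 0.75, white = 0.25) * sum(observed)

# G computed from observed and expected frequencies.
G <- 2 * sum(observed * log(observed / expected))

# Log likelihood ratio: binomial probability of 89 of 100 at the
# observed proportion (0.89) versus at the expected proportion (0.75).
log_ratio <- dbinom(89, 100, 89 / 100, log = TRUE) -
  dbinom(89, 100, 0.75, log = TRUE)
G_from_ratio <- 2 * log_ratio

# Compare G with a chi-square distribution on 1 df.
p_value <- pchisq(G, df = 1, lower.tail = FALSE)

print(c(G = G, G_from_ratio = G_from_ratio, P = p_value))
```

The binomial coefficients cancel in the ratio, which is why the two routes give the same value.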
Summary of solutions
• Why not always use binomial probability?
• Expected frequencies might not be known, but a reference distribution could be used
• Why did the G test (in this case) have more statistical power?
• Although the G statistic and the “Chi-square” statistic both approximately follow a chi-square distribution with the same df, G is known to follow it more closely (it produces values more consistent with the theoretical distribution). The G test is also a likelihood ratio test, and will have some better properties for more complicated examples (as we will see). In general, the two produce similar results (especially with large sample sizes). Both are also susceptible to problems with small sample sizes, but G is better. Rule of thumb: use G when |O - E| > E for any (O)bserved and (E)xpected values.
Example set-up
• Ecologists are often interested in whether species diversity at local scales differs from species diversity at regional scales
• If one were to sample species in a local area, would the sample comprise the same species in the same proportions as are found in the region?
• One can substitute taxonomic affinities for the following “expected” regional species proportions:
  Species A                     50%
  Species B                     22%
  Species C                     16%
  Species D                      9%
  Species E                      1%
  Species F                      0.5%
  Species G, H, I, J (pooled)    1.5%
• A scientist collects the following numbers of each species in a sampling event at a local place (e.g., pond, lake, river, prairie fragment, sinkhole, etc.); see next page
• Question: is local species diversity the same as regional species diversity?
Example set-up • Knowing that values of 0 can cause problems for the statistical tests, and already being constrained to pool species G, H, I, and J, the scientist summarizes the data as below
Example analysis • Note: There is no way to calculate the binomial probability, as these are not binomial data. But Goodness of fit tests can still be applied.
Example analysis • Just add a few columns… • df = a - 1 for a classes
Example analysis
• Note that there is a correction factor for these tests, as type I error rates tend to be higher than the intended levels. The inflation is more substantial for small sample sizes.
• For the G test, divide G by q (Williams' correction), where n is the total sample size and a is the number of classes:
$$q = 1 + \frac{a^{2} - 1}{6n(a - 1)}$$
• For the X² or G test, the statistics are adjusted by adding 0.5 to or subtracting 0.5 from each observed frequency, whichever brings it closer to the expected frequency (correction for continuity)
• With the Chi-square test, the correction for continuity is easy to perform by taking the absolute difference of observed and expected frequencies and subtracting ½ before squaring.
• Williams' correction for the example
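Both corrections are simple to script. Because the species counts for this example are not reproduced in the text, the sketch below applies the corrections to the bunny counts from the first example instead. It assumes Williams' correction takes the form q = 1 + (a² - 1) / (6nν), with ν = a - 1 degrees of freedom, and applies the continuity correction to X² as described in the last bullet; all variable names are illustrative.

```r
observed <- c(89, 11)                       # bunny counts, first example
expected <- c(0.75, 0.25) * sum(observed)   # 75 and 25

n <- sum(observed)      # total sample size
a <- length(observed)   # number of classes
nu <- a - 1             # degrees of freedom

# Uncorrected G statistic.
G <- 2 * sum(observed * log(observed / expected))

# Williams' correction (assumed form): divide G by q.
q <- 1 + (a^2 - 1) / (6 * n * nu)
G_adj <- G / q

# Continuity-corrected X2: subtract 1/2 from each absolute
# deviation before squaring.
X2_cc <- sum((abs(observed - expected) - 0.5)^2 / expected)

p_G_adj <- pchisq(G_adj, df = nu, lower.tail = FALSE)
p_X2_cc <- pchisq(X2_cc, df = nu, lower.tail = FALSE)

print(c(q = q, G = G, G_adj = G_adj, X2_cc = X2_cc))
print(c(P_G_adj = p_G_adj, P_X2_cc = p_X2_cc))
```

Both corrections shrink the test statistic, so the corrected P-values are larger than the uncorrected ones. With n = 100 the Williams adjustment is small (q is close to 1); it matters more as n decreases.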
Example conclusion • This mythical place has many more rare species than expected, compared to the regional species pool • Note that the pooling of species F, G, H, I, and J might influence the outcome
The first two examples included one frequency distribution and some known or true expectation.
• The first two examples included categorical data
• There are two different ways we can (and will) go
• 1. Goodness of fit tests for continuous frequency data
• 2. Goodness of fit tests for more than one distribution
• We have to start with one of these, so let’s start with 1.
• Before proceeding, it is important to distinguish two kinds of hypotheses that are used as “null” models for frequency expectations. In the previous two examples, the expected frequencies were established by theory (expected genotypes) or by a larger empirical pool of information (species proportions). These are extrinsic hypotheses for the basis of expected frequencies. Intrinsic hypotheses can also be used for estimating frequencies. For example, if we wish to test whether a continuous frequency distribution is normal, we can generate expected frequencies, but we first need to estimate the mean and variance from the sample. Thus, the degrees of freedom for the test are reduced by one for each of these additional parameter estimates.
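As a rough illustration of an intrinsic hypothesis, the R sketch below tests simulated data for normality with a goodness of fit test. Everything in it is an assumption made for illustration: the data are simulated rather than taken from the lecture, the choice of eight equal-probability classes is arbitrary, and the X² statistic is used for simplicity. The point is the degrees of freedom: a - 1, minus 2 for the mean and standard deviation estimated from the sample.

```r
set.seed(582)                        # simulated data, illustration only
x <- rnorm(200, mean = 10, sd = 2)
n <- length(x)

# Class boundaries from a normal distribution fitted to the sample,
# chosen so that each of the a classes has equal expected frequency.
a <- 8
inner_breaks <- qnorm(seq_len(a - 1) / a, mean = mean(x), sd = sd(x))
breaks <- c(-Inf, inner_breaks, Inf)

# Observed and expected frequencies per class.
observed <- as.vector(table(cut(x, breaks = breaks)))
expected <- rep(n / a, a)

# Goodness of fit statistic.
X2 <- sum((observed - expected)^2 / expected)

# Intrinsic hypothesis: two parameters were estimated from the data,
# so df = a - 1 - 2.
df <- a - 1 - 2
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

print(data.frame(observed = observed, expected = expected))
print(c(X2 = X2, df = df, P = p_value))
```

Under an extrinsic hypothesis (mean and variance specified in advance rather than estimated), the same table would be tested on a - 1 = 7 degrees of freedom.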