1. Testing statistical hypotheses: The Chi-square test and how Sir R.A. Fisher caught Mendel cheating
2. The Chi-Square Test "How well does it fit the facts?" In many cases this question can be answered by the χ²-test.
The test was invented in 1900 by Karl Pearson.
The test is used when there are more than two categories of data,
such as the nucleotides A, C, G, T in two DNA sequences, where one checks whether these categories are equally likely.
3. A gambler is accused of using a loaded die but he pleads innocent. A record has been kept for the last 60 throws.
4 3 3 1 2 3 4 6 5 6
2 4 1 3 3 5 3 4 3 4
3 3 4 5 4 5 6 4 5 1
6 4 4 2 3 3 2 4 4 5
6 3 6 2 4 6 4 6 3 2
5 4 6 3 3 3 5 3 1 4
If the gambler is innocent, the numbers in the table should be like 60 random draws with replacement from a box containing the tickets {1,2,3,4,5,6}. Each number should show up about 10 times.
The expected frequency of each number is 60 × 1/6 = 10.
4. Observed Frequencies
Value   Observed frequency   Expected frequency
1        4                   10
2        6                   10
3       17                   10
4       16                   10
5        8                   10
6        9                   10
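As a check on the table, the observed frequencies can be tallied directly from the 60 recorded throws. The following is a minimal Python sketch; the variable names are illustrative and not part of the original slides.

```python
from collections import Counter

# The 60 recorded throws, row by row as listed on slide 3.
throws_text = """
4 3 3 1 2 3 4 6 5 6
2 4 1 3 3 5 3 4 3 4
3 3 4 5 4 5 6 4 5 1
6 4 4 2 3 3 2 4 4 5
6 3 6 2 4 6 4 6 3 2
5 4 6 3 3 3 5 3 1 4
"""
throws = [int(t) for t in throws_text.split()]

counts = Counter(throws)
observed = [counts[face] for face in range(1, 7)]
expected = [len(throws) / 6] * 6   # fair die: each face has chance 1/6

# Print one row per face: value, observed frequency, expected frequency.
for face, obs, exp in zip(range(1, 7), observed, expected):
    print(face, obs, exp)
```

The tally reproduces the observed column above, and the six frequencies add up to 60.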
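5. The χ² statistic
$$\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$$
For the die table, $\chi^2 = \frac{(4-10)^2}{10} + \frac{(6-10)^2}{10} + \frac{(17-10)^2}{10} + \frac{(16-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(9-10)^2}{10} = 14.2$.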
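The statistic for the die can be computed in a few lines. This is a small sketch of Pearson's sum of (observed − expected)² / expected, using the frequencies from the table; the function name `chi_square` is an illustrative choice.

```python
observed = [4, 6, 17, 16, 8, 9]   # frequencies of faces 1..6 from the table
expected = [10] * 6               # 60 throws x 1/6 for a fair die

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

statistic = chi_square(observed, expected)
print(round(statistic, 1))   # 14.2
```

Most of the total comes from faces 3 and 4, which turned up far more often than 10 times, and from face 1, which turned up only 4 times.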
6. The P-value: the observed significance level We need to know the chance that, when a fair die is rolled 60 times and χ² is computed from the observed frequencies, its value turns out to be 14.2 or more.
The answer is P ≈ 1.4%. That is, if the die is fair, there is a 1.4% chance for the χ² statistic to be as big as or bigger than the observed one.
Conclusion: The gambler is in trouble!!!
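The chance described on this slide can also be estimated by brute force: roll a fair die 60 times, compute the statistic, and repeat many times. The sketch below is a simulation under assumed settings (20,000 repetitions, a fixed seed); it is not the method used on the slides, which rely on the χ²-curve. The estimate should land near the 1.4% quoted above, up to simulation noise.

```python
import random

def simulate_p_value(n_trials=20000, n_throws=60, observed_stat=14.2, seed=0):
    """Fraction of simulated fair-die experiments whose statistic is >= observed_stat."""
    rng = random.Random(seed)
    expected = n_throws / 6
    hits = 0
    for _ in range(n_trials):
        counts = [0] * 6
        for _ in range(n_throws):
            counts[rng.randrange(6)] += 1          # one throw of a fair die
        stat = sum((c - expected) ** 2 / expected for c in counts)
        if stat >= observed_stat - 1e-9:           # tolerance guards against float rounding
            hits += 1
    return hits / n_trials

p_sim = simulate_p_value()
print(p_sim)
```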
7. Degrees of freedom Pearson invented the χ²-curves, one curve for each number of degrees of freedom.
In our case, the model is fully specified, i.e., there is no parameter to estimate from the data, so
degrees of freedom = number of terms in χ² − 1 = 6 − 1 = 5
8. The χ²-test P-value For the χ²-test, the P-value is approximately equal to the area to the right of the observed value of the χ² statistic, under the χ²-curve with the appropriate number of degrees of freedom.
9. [Figure: the χ²-curve, with P shown as the area under the curve]
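For the die example (observed statistic 14.2, 5 degrees of freedom) the area to the right of the observed value can be computed numerically. The sketch below uses only the Python standard library and the standard recurrence for the upper incomplete gamma function; the helper name `chi2_sf` is an illustrative choice, and it handles whole-number degrees of freedom and x > 0 only.

```python
import math

def chi2_sf(x, df):
    """Area to the right of x (x > 0) under the chi-square curve with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-z)              # starting point: df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(z))   # starting point: df = 1
    # Raise a toward df/2 one unit at a time using
    # Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        area += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return area

p = chi2_sf(14.2, 5)
print(f"{p:.3f}")   # 0.014
```

This agrees with the P ≈ 1.4% quoted for the gambler's die.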
10. Rule of thumb The approximation given by the χ²-curve can be trusted when the expected frequency in each line of the table is 5 or more. In the die example every expected frequency is 10, so the rule is satisfied.
11. Is Mendel's experimental data too good to be true? Yes! In 1865 Gregor Mendel published an article in which he provided a scientific explanation for heredity, and eventually caused a revolution in biology.
Mendel's experiments were all performed on garden peas. Pea seeds are either yellow or green. Color is a property of the seed.
Mendel bred a pure yellow strain, that is, a strain in which every plant in every generation had only yellow seeds; and separately he bred a pure green strain.
12. Yellow and Green peas He then crossed plants of the pure yellow strain with plants of the pure green strain.
The seeds resulting from a yellow-green cross, and the plants they grow into, are called first-generation hybrids.
First-generation hybrid seeds are all yellow, indistinguishable from seeds of the pure yellow strain. The green seems to have disappeared completely.
These first-generation hybrid seeds grew into first-generation hybrid plants, which Mendel crossed with themselves, producing second-generation hybrid seeds. Some of these second-generation seeds were yellow, but some were green.
So the green disappeared for one generation but reappeared in the second. Even more surprising, the green reappeared in a simple proportion:
Of the second-generation hybrids, about 75% were yellow and about 25% were green.
13. Factors, aka genes To explain it, Mendel postulated the existence of factors, later called genes.
According to Mendel's theory, there were two different variants of a gene, denoted Y and G, which paired up to control seed color. It is the gene pair in the seed, not in the parent, that determines what color the seed will be, and all the cells making up a seed contain the same gene pair.
14. Y is dominant There are four different gene pairs:
Y/Y, Y/G, G/Y, G/G
Gene pairs control seed color by the rule:
Y/Y, Y/G, G/Y make yellow
G/G makes green
As geneticists say, Y is dominant and G is recessive.
15. Randomness A seed grows and becomes a plant. All cells in this plant also carry the seed's color gene pair, with one exception: sex cells, either sperm or eggs, contain only one gene of the pair.
For example, a plant whose ordinary gene pair is Y/Y will produce sperm cells each containing a gene Y; similarly, it will produce egg cells each containing a gene Y.
A plant whose pair is Y/G will produce half of its sperm cells containing Y and half containing G. The same is true of the egg cells.
16. First-generation model explanation Plants of the pure yellow strain have the color pair Y/Y.
Plants of the pure green strain have the color pair G/G.
Crossing a pure yellow with a pure green produces a fertilized egg with the gene pair Y/G; this cell reproduces itself and eventually becomes a seed, in which all the cells have the gene pair Y/G and are yellow in color.
17. Second-generation model explanation A first-generation hybrid seed grows into a first-generation hybrid plant with gene pair Y/G. This plant produces sperm cells of which half will contain the gene Y and the other half will contain the gene G; it also produces eggs of which half will be Y and half G.
When two first-generation hybrids are crossed, each resulting second-generation hybrid seed gets one gene at random from each parent, because it is formed by the random combination of a sperm cell and an egg.
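The 75% / 25% split follows from this chance model by simple enumeration: each Y/G parent contributes Y or G with chance 1/2, so the four (sperm, egg) combinations are equally likely, and only G/G is green. A minimal Python sketch, with illustrative names:

```python
from itertools import product
from fractions import Fraction

def seed_color(gene_pair):
    """Y is dominant: a seed is green only when both genes are G."""
    return "green" if gene_pair == ("G", "G") else "yellow"

# Each first-generation hybrid parent (Y/G) passes on Y or G with chance 1/2,
# so the four (sperm, egg) combinations are equally likely.
pairs = list(product("YG", repeat=2))
yellow = sum(1 for pair in pairs if seed_color(pair) == "yellow")
p_yellow = Fraction(yellow, len(pairs))

print(pairs)
print(p_yellow, 1 - p_yellow)   # 3/4 1/4
```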
18. Mendel's chance model: He was right!
19. Did Mendel's facts fit his model? "Only too well," answered R. A. Fisher.
20. How Fisher used the χ²-test to show that Mendel was cheating For each of Mendel's experiments, Fisher computed the χ² statistic. These experiments were all independent, for they involved different sets of plants. And Fisher pooled the results.
21. Too good to be true For example, if one experiment gives χ² = 5.8 with 5 degrees of freedom, and another independent experiment gives χ² = 3.1 with 2 degrees of freedom, the two together have a pooled χ² = 5.8 + 3.1 = 8.9 with 5 + 2 = 7 degrees of freedom. For Mendel's data, Fisher got a pooled χ² under 42, with 84 degrees of freedom. The area to the left of 42 under the χ²-curve with 84 degrees of freedom is about 4 in 100,000. The agreement between the observed and expected frequencies is too good to be true.
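Both steps on this slide, the pooling and the left-tail area, can be sketched in Python. The pooling just adds the statistics and the degrees of freedom. For the area, the sketch uses a shortcut that holds only for an even number of degrees of freedom: with df = 2m, the area to the left of x equals the chance that a Poisson variable with mean x/2 is at least m. The helper name and the number of summed terms are illustrative choices.

```python
import math

# Pooling independent experiments: add the statistics and the degrees of freedom.
experiments = [(5.8, 5), (3.1, 2)]           # (chi-square, degrees of freedom)
pooled_stat = sum(s for s, _ in experiments)
pooled_df = sum(d for _, d in experiments)
print(round(pooled_stat, 1), pooled_df)      # 8.9 7

def chi2_left_area_even_df(x, df, extra_terms=500):
    """Area to the left of x under the chi-square curve, for even df only.

    For df = 2m this equals P(N >= m), where N is Poisson with mean x/2.
    """
    assert df % 2 == 0, "this shortcut only holds for even degrees of freedom"
    m, lam = df // 2, x / 2.0
    # Sum the Poisson probabilities from m upward; the terms shrink rapidly.
    return sum(math.exp(k * math.log(lam) - lam - math.lgamma(k + 1.0))
               for k in range(m, m + extra_terms))

left = chi2_left_area_even_df(42, 84)
print(left)
```

The printed area is a few parts in 100,000, in line with the figure quoted on the slide.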
22. What does it mean? Suppose millions of scientists were repeating Mendel's experiments. For each scientist, imagine measuring the discrepancy between his observed frequencies and the expected frequencies by the χ² statistic. Then by the laws of chance, about 99,996 out of every 100,000 of these scientists would report a discrepancy between observations and expectations greater than the one reported by Mendel. That leaves two possibilities:
(1) either Mendel's data were massaged,
(2) or he was pretty lucky.
The first is easier to believe.
23. Using the chi-square test To test the null hypothesis that the prescribed probabilities for the nucleotides of a sequence are
$p_i$ for $i = 1, 2, 3, 4$ (aka A, C, G, T),
we apply the χ²-test with the statistic
$$\chi^2 = \sum_{i=1}^{4} \frac{(Y_i - n p_i)^2}{n p_i}$$
Here $Y_i$ is the number of nucleotides in category $i$ in our sequence, and $n$ is the length of the sequence.
If the observed values are such that $\chi^2$ is large, then we reject the null hypothesis.
The formula for $\chi^2$ is a measure of the discrepancy between the observed values $Y_i$ and the respective null hypothesis means $n p_i$.
When the null hypothesis is true and $n$ is large, $\chi^2$ has approximately a chi-square distribution with
4 − 1 = 3 degrees of freedom.
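As an illustration, the test for nucleotides can be sketched in Python. The sequence below is made up for the example (it is not data from the slides), the null hypothesis is taken to be equal probabilities of 1/4, and the function names are illustrative. The tail area uses the closed form for 3 degrees of freedom, so the sketch needs only the standard library.

```python
import math
from collections import Counter

def nucleotide_chi_square(sequence, probs):
    """Chi-square statistic for nucleotide counts against prescribed probabilities."""
    n = len(sequence)
    counts = Counter(sequence)
    # Sum over the four categories of (observed - expected)^2 / expected.
    return sum((counts[base] - n * p) ** 2 / (n * p) for base, p in probs.items())

def chi2_sf_3df(x):
    """Area to the right of x under the chi-square curve with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Hypothetical example: a made-up sequence with 40 A, 20 C, 20 G, 20 T.
sequence = "A" * 40 + "C" * 20 + "G" * 20 + "T" * 20
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

stat = nucleotide_chi_square(sequence, uniform)
p_value = chi2_sf_3df(stat)
print(stat, p_value)
```

For this made-up sequence the statistic is 12 and the P-value is below 1%, so the hypothesis of equally likely nucleotides would be rejected. Every expected count is 25, which satisfies the rule of thumb of 5 or more.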