Chapter 11

Chapter 11 Goodness of Fit Tests

Categorical data • Observations fall into one of a number of mutually exclusive categories • Binomial distribution (two categories: A, B) • Multinomial distribution (several categories: A, B, C, …) • Chi-square distribution

Goodness of Fit Tests • To determine whether a sample could have been drawn from a population with a specified distribution • Based on a comparison of the observed frequencies with the frequencies expected under the specified distribution.

Concepts • The Binomial Test • The Chi-Square Test for Goodness of Fit • Kolmogorov-Smirnov Test • The Chi-Square Test for r x k Contingency Tables

Binomial Test • For data that can be grouped into exactly two categories, e.g. male versus female, diseased versus healthy • To determine whether the sample proportions of the two categories are what would be expected under a given binomial distribution

Binomial Test • Assumptions • Independent random sample of size n • Two mutually exclusive categories • Actual proportion: p, corresponding hypothesized value: p0 • Hypotheses

Binomial Test • Test statistic X: the number of observations falling into the first category (successes); when H0 is true, X is a binomial random variable with distribution B(n, p0) • x: observed value of X

Binomial Test • Cumulative binomial distribution Table C1 (p374) • Note that

Binomial Test • 15 of 20 trees have a 1987 growth ring that is less than half the size of their other growth rings. Do these data support the claim that the severe drought of 1987 in the U.S. affected the growth rate of the majority of established trees? • Hypothesis: • Test statistic: X=15, given α=0.05, p0=0.5 • Inference: the majority of trees have growth rings for 1987 less than half their usual size.

One-sample p test • In southeastern Queensland, pardalotes of race A: 70% • Sample of 18 pardalotes: race A 10 vs race B 8 • Hypothesis: • Test statistic: X=10, given α=0.05, p0=0.7 • Inference: no change in the population proportions of the two races.
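The two exact binomial tests above can be checked with a few lines of Python. This is a minimal sketch: the slides do not show the hypotheses, so it assumes a right-tailed test (Ha: p > 0.5) for the trees and a two-sided test (Ha: p ≠ 0.7) for the pardalotes, where the lower-tail probability is compared with α/2. The helper names `binom_pmf` and `binom_cdf` are illustrative, not from the source.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(x, n, p):
    """Cumulative probability P(X <= x) for X ~ B(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))


alpha = 0.05

# Trees: X = 15 of n = 20, p0 = 0.5; assumed right-tailed test, P(X >= 15).
p_trees = 1 - binom_cdf(14, 20, 0.5)
reject_trees = p_trees < alpha

# Pardalotes: X = 10 of n = 18, p0 = 0.7; assumed two-sided test.
# The observed count is below the expected 12.6, so use the lower tail
# P(X <= 10) and compare it with alpha / 2.
p_pardalotes = binom_cdf(10, 18, 0.7)
reject_pardalotes = p_pardalotes < alpha / 2

print(f"trees:      P(X >= 15) = {p_trees:.4f}, reject H0: {reject_trees}")
print(f"pardalotes: P(X <= 10) = {p_pardalotes:.4f}, reject H0: {reject_pardalotes}")
```

Under these assumptions the tree test rejects H0 and the pardalote test does not, which agrees with the two inferences stated on the slides.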

Normal approximation to the binomial distribution • A r.v. X ~ B(n, p) has mean μ = np and variance σ² = np(1-p). • If np(1-p) > 3, then X is approximately N(μ, σ²) • But it should be noted that

One-sample p test • Pardalotes of race A: 70% • Sample of 180 pardalotes: race A 100 vs race B 80 • Hypothesis: • Test statistic: X=100, given α=0.05, p0=0.7 • Inference: a significant change in the population proportions of the two races.
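For the larger pardalote sample, np0(1-p0) = 37.8 > 3, so the normal approximation applies. The sketch below assumes a two-sided test and uses no continuity correction; whether the original slide applied one is not visible in the transcript. `normal_cdf` is an illustrative helper built on `math.erf`.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


n, x, p0 = 180, 100, 0.7

mean = n * p0                    # mu = n * p0 = 126
variance = n * p0 * (1 - p0)     # sigma^2 = n * p0 * (1 - p0) = 37.8
z = (x - mean) / sqrt(variance)  # standardized test statistic

# Assumed two-sided alternative: double the tail area beyond |z|.
p_two_sided = 2 * normal_cdf(-abs(z))

print(f"z = {z:.3f}, two-sided p = {p_two_sided:.2e}")
```

The statistic lies far beyond ±1.96, consistent with the significant change reported on the slide.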

Chi-Square test • It is used when there are several categories. • It compares the observed frequencies of a discrete, ordinal, or categorical data set with those of some theoretically expected distribution (e.g. binomial, multinomial). • It tests whether an observed set of data agrees with the expected values based on some hypothesis, H0. • The test gives the probability of obtaining a test statistic at least as large as the observed one if H0 applies to our data.

Test for goodness of fit • Assumptions • Independent random sample of size n • A set of k mutually exclusive categories • A specified expected frequency for each category • Hypotheses • H0: the observed frequency distribution is the same as the hypothesized frequency distribution • Ha: the observed and hypothesized frequency distributions are different.

Test Statistic and Theory • Test statistic: $\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$, where $O_i$ is the observed frequency, $E_i$ the expected frequency, and $O_i - E_i$ the difference between the observed and expected frequencies • Observed and expected frequencies equal ⇒ $\chi^2$ small • Right-tailed test; the statistic has an approximate chi-square distribution when H0 is true • Table C.5 (p. 385)

Why chi-square? • Y ~ B(n, p): Y successes (probability p), n-Y failures (probability q) • For n large enough, Y is approximately N(np, npq)

Why chi-square? • Multinomial • Outcome: A1 A2 … Ak • Probability: p1 p2 … pk • Observed: Y1 Y2 … Yk
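The link between the normal approximation and the chi-square statistic can be written out as follows. This is the standard textbook argument, given here as a sketch on the assumption that the slides follow it; the symbols are those used above, with q = 1 - p.

```latex
% Two categories: the squared standardized count splits into one
% (observed - expected)^2 / expected term per category, because
% (n - Y) - nq = -(Y - np) and 1/(np) + 1/(nq) = 1/(npq).
\[
Z^2 = \frac{(Y - np)^2}{npq}
    = \frac{(Y - np)^2}{np} + \frac{\bigl[(n - Y) - nq\bigr]^2}{nq}
    \;\sim\; \chi^2_{1} \quad \text{(approximately, for large } n\text{)}
\]

% k categories (multinomial): the same form with k terms and k - 1
% degrees of freedom when all p_i are specified in advance.
\[
\chi^2 = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i}
    \;\sim\; \chi^2_{k-1} \quad \text{(approximately, for large } n\text{)}
\]
```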

Example: an F2 population • Mirabilis jalapa, a self-pollinating plant • Consider an F2 population in which a single incompletely dominant gene is segregating. • The numbers of the three genotypes AA, Aa, aa are counted, and we want to know whether they are segregating according to Mendel’s law, • i.e. we are testing the null hypothesis (H0) that we have a 1:2:1 ratio.

Analysis
Genotype             AA     Aa     aa     Total
Phenotype            red    pink   white
Expected freqs.      ¼      ½      ¼      1.0
Observed nos. (O)    55     132    53     240
Expected nos. (E)    60     120    60     240
(O-E)                -5     12     -7     0

The extrinsic model • Example: are the data on 240 progeny of self-pollinated four-o’clocks reasonably consistent with the Mendelian model? • Hypotheses: • H0: the data are consistent with a Mendelian model. • Ha: the data are inconsistent with a Mendelian model. • Calculate expected frequencies • Test statistic: $\chi^2 = (-5)^2/60 + 12^2/120 + (-7)^2/60 \approx 2.43$ with 3 - 1 = 2 degrees of freedom • Conclusion: the data support the Mendelian genetic model.
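The Mendelian example can be reproduced from the observed counts and the 1:2:1 ratio alone. In this sketch the p-value uses the closed form exp(-x/2) for the chi-square upper tail, which holds only for 2 degrees of freedom; 5.991 is the standard 5% critical value for 2 degrees of freedom.

```python
from math import exp

observed = [55, 132, 53]  # AA, Aa, aa counts from the four-o'clock data
ratio = [1, 2, 1]         # Mendelian 1:2:1 hypothesis (extrinsic model)

n = sum(observed)
expected = [n * r / sum(ratio) for r in ratio]

# Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Extrinsic model: no parameters estimated from the data, so df = k - 1.
df = len(observed) - 1

# Upper-tail probability; exp(-x / 2) is exact only when df == 2.
p_value = exp(-chi2 / 2)

critical_5pct = 5.991  # chi-square critical value for df = 2, alpha = 0.05
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
print("reject H0" if chi2 > critical_5pct else "do not reject H0")
```

The statistic falls well below the critical value, so H0 is not rejected, in line with the conclusion on the slide.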

The intrinsic model • Example: does the number of landfalling hurricanes per year in the U.S. in 1900-1997 follow a Poisson distribution? • Hypotheses: • H0: the annual number of U.S. landfalling hurricanes follows a Poisson distribution. • Ha: the annual number of U.S. landfalling hurricanes is inconsistent with a Poisson distribution. • Estimate parameters:

Count    Probability    Observed    Expected
0        0.198          18          19.40
1        0.320          34          31.36
2        0.260          24          25.48
3        0.140          16          13.72
4        0.057          3           5.59
5        0.018          1           1.76
>=6      0.007          2           0.69
Expected frequencies < 5: counts 5 and >=6
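The hurricane table can be turned into a chi-square test once the categories with expected frequency below 5 are pooled. The sketch below makes two assumptions that the transcript does not confirm: the small tail categories are merged into the neighbouring category from the right, and one degree of freedom is subtracted for the Poisson mean estimated from the data (intrinsic model). `pool_tail` is an illustrative helper.

```python
observed = [18, 34, 24, 16, 3, 1, 2]
expected = [19.40, 31.36, 25.48, 13.72, 5.59, 1.76, 0.69]


def pool_tail(obs, exp_freq, minimum=5.0):
    """Merge trailing categories until the last expected frequency >= minimum."""
    obs, exp_freq = list(obs), list(exp_freq)
    while len(exp_freq) > 1 and exp_freq[-1] < minimum:
        last_e = exp_freq.pop()
        last_o = obs.pop()
        exp_freq[-1] += last_e
        obs[-1] += last_o
    return obs, exp_freq


pooled_obs, pooled_exp = pool_tail(observed, expected)

chi2 = sum((o - e) ** 2 / e for o, e in zip(pooled_obs, pooled_exp))

# Intrinsic model: one parameter (the Poisson mean) was estimated from
# the data, so df = (number of pooled categories) - 1 - 1.
estimated_parameters = 1
df = len(pooled_obs) - 1 - estimated_parameters

critical_5pct = 7.815  # chi-square critical value for df = 3, alpha = 0.05
print(f"pooled observed = {pooled_obs}")
print(f"chi2 = {chi2:.3f}, df = {df}")
print("reject H0" if chi2 > critical_5pct else "do not reject H0")
```

With this pooling, counts 4, 5 and >=6 form a single category, leaving five categories and 3 degrees of freedom. A different pooling choice would change both the statistic and the degrees of freedom.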

Kolmogorov-Smirnov Test • To determine whether a sample could come from a population with a particular specified distribution • Chi-square test: for discrete or categorical data • Kolmogorov-Smirnov test: for random samples from continuous (Normal) or discontinuous (Binomial) populations.

The arm lengths (radii) of 67 Edmonds sea stars at Polka Point • Sufficiently close to a normal distribution

Kolmogorov-Smirnov Test • Assumptions • Random sample of size n with some unknown distribution function G(x) • A specified hypothesized distribution function F(x) • Hypotheses • H0: G(x)=F(x) for all x • Ha: G(x)≠F(x) for at least one value of x

Kolmogorov-Smirnov Test • (Table columns: intervals of data range, observed CDF, expected CDF) • Statistic K: the largest absolute value of the differences between the cumulative distribution of the sample and the expected distribution. • If K is below the critical value K0, accept H0 • Table C14 • The sea star radii follow N(6.98, 3.988)

Do the following data support the claim that the number of males in a litter is a binomial random variable with p=0.5? • H0: the number of males in each litter is a binomial random variable with p=0.5 and n=6. • The number of males and females is described by a binomial distribution with p=0.5. • Advantage of the KS test over the chi-square test: no need to calculate the density function for the binomial.

Count    Probability    Observed    Expected
0        0.016          3           3.45
1        0.094          16          20.72
2        0.234          53          51.80
3        0.313          78          69.06
4        0.234          53          51.80
5        0.094          18          20.72
>=6      0.016          0           3.45
Expected frequencies < 5: counts 0 and >=6
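The Kolmogorov-Smirnov statistic for the litter data is the largest gap between the observed and hypothesized cumulative distributions. In the sketch below the binomial probabilities for n=6, p=0.5 are recomputed with `math.comb` instead of being read from the rounded table, which is a choice of this example. The critical value from Table C14 is not reproduced here, so the code stops at the statistic.

```python
from math import comb

observed = [3, 16, 53, 78, 53, 18, 0]  # litters with 0, 1, ..., 6 males
n_litters = sum(observed)

litter_size, p = 6, 0.5
probabilities = [
    comb(litter_size, k) * p**k * (1 - p) ** (litter_size - k)
    for k in range(litter_size + 1)
]

# Walk through the categories, accumulating both distribution functions
# and tracking the largest absolute difference between them.
cum_observed = 0.0
cum_expected = 0.0
K = 0.0
for count, prob in zip(observed, probabilities):
    cum_observed += count / n_litters
    cum_expected += prob
    K = max(K, abs(cum_observed - cum_expected))

print(f"n = {n_litters}, K = {K:.4f}")
```

H0 is retained when K is smaller than the tabulated critical value for the given sample size and α.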

r x k Contingency tables • r: the number of categories • k: the number of populations or treatments • Oij: the number of observations of category i in population j • To test whether the distribution of a categorical variable is the same in two or more populations • To test whether there is a relationship or dependency between the row and column variables

Is the shell species independent of whether it is occupied? • The expected number of observations for each cell, based on the assumption that the row and column variables are independent: $E_{ij} = N \times (R_i/N) \times (C_j/N) = R_i C_j / N$, where N is the total sample size, $R_i/N$ is the fraction of row i in the entire sample, and $C_j/N$ is the fraction of column j in the entire sample

Test for contingency table • Assumptions (two different sampling methods) • A random sample, categorized in two ways • k independent random samples, each classified by a categorical variable • Hypotheses • H0: the row and column variables are independent • Ha: the row and column variables are not independent • H0: the distribution of the row categories is the same in all k populations • Ha: the distribution of the row categories is not the same in all k populations

Test statistic and Theory • Test statistic • For large samples (n≥40, Eij≥5): $\chi^2 = \sum_{i}\sum_{j} (O_{ij} - E_{ij})^2 / E_{ij}$ • Observed and expected frequencies equal ⇒ $\chi^2$ small • For small samples
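The expected counts and the test statistic for an r x k table can be computed from the row and column totals. The sketch below uses a small table of made-up counts, since the hermit crab data are not reproduced in this transcript; the function name `chi_square_independence` and the numbers in `hypothetical_table` are illustrative only. The degrees of freedom are the usual (r-1)(k-1).

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for an r x k table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Under independence, E_ij = (row total i) * (column total j) / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical 2 x 3 table (2 categories, 3 populations); not from the source.
hypothetical_table = [
    [20, 30, 50],
    [30, 30, 40],
]

chi2, df, expected = chi_square_independence(hypothetical_table)
print(f"expected = {expected}")
print(f"chi2 = {chi2:.4f}, df = {df}")
```

The statistic is then compared with the chi-square critical value for the computed degrees of freedom, as in the goodness-of-fit test.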

Solution to the example • H0: the status (occupied or not) is independent of the shell species • Ha: the status is not independent of the shell species • (Tables: observed and expected counts) • Reject H0: there is an association between the species of shell and whether hermit crabs occupy it.

2x2 Contingency Tables • A special case of the contingency table • Employ a correction factor for discontinuity
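The correction formula itself is missing from the transcript. One common form is the Yates-corrected statistic for a 2x2 table, sketched below on the assumption that this is the correction the slide refers to. The cell counts are hypothetical and `yates_chi_square` is an illustrative name.

```python
def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shrink |ad - bc| by n / 2 before squaring; never let it go negative.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * corrected**2 / denominator


def plain_chi_square(a, b, c, d):
    """Uncorrected chi-square for the same table, for comparison."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts, not taken from the source.
a, b, c, d = 10, 20, 30, 40

corrected = yates_chi_square(a, b, c, d)
uncorrected = plain_chi_square(a, b, c, d)
print(f"uncorrected = {uncorrected:.4f}, corrected = {corrected:.4f}")
```

The corrected value is never larger than the uncorrected one, so the correction makes the test more conservative. Both are referred to a chi-square distribution with 1 degree of freedom.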

Two sample proportion

Fisher's exact test • Used when the expected value of one cell is < 5

Enrichment analysis • Gene list • Batch annotation • Gene-GO term enrichment analysis • Highlight the most relevant GO terms associated with a given gene list.

EASE Score, a modified Fisher Exact P-Value • In human genome background (30,000 genes total), 40 genes are involved in the p53 signalling pathway. • Fisher Exact P-Value = 0.008 • The EASE Score is a more conservative way to examine the situation. EASE Score = 0.06 (using 3-1 instead of 3).
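The one-sided Fisher exact p-value is a hypergeometric tail probability, and the EASE Score repeats the calculation after removing one hit. The slide does not state the size of the gene list, so the sketch below assumes a hypothetical list of 300 genes, 3 of which fall in the pathway. It also assumes that the list is drawn from the 30,000-gene background, and that the EASE variant removes one gene from the hits, the list, the pathway and the background alike. Other table layouts give slightly different numbers.

```python
from math import comb


def hypergeom_upper_tail(hits, list_size, pathway_size, background):
    """P(X >= hits) when list_size genes are drawn from the background."""
    total = comb(background, list_size)
    max_hits = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, k) * comb(background - pathway_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return favourable / total


background = 30_000  # from the slide
pathway_size = 40    # from the slide
hits = 3             # from the slide ("3-1 instead of 3")
list_size = 300      # ASSUMPTION: not stated in the source

fisher_p = hypergeom_upper_tail(hits, list_size, pathway_size, background)

# EASE-style variant: drop one hit before testing (assumed construction).
ease_score = hypergeom_upper_tail(
    hits - 1, list_size - 1, pathway_size - 1, background - 1
)

print(f"Fisher exact p = {fisher_p:.4f}")
print(f"EASE score     = {ease_score:.4f}")
```

Under these assumptions the two values come out near the 0.008 and 0.06 quoted on the slide, which shows how removing a single hit moves the result from significant to non-significant at α=0.05.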

2x2 Contingency Tables

Extreme Hypothetical Example of Population Stratification • Interested in the LD between allele A and disease.

Population Stratification • (Diagram: sub-population linked to gene and to disease, the latter via RpR) • Confounding bias that may occur if one’s sample comprises sub-populations with different: • allele frequencies; and • disease rates (RpR) • Cases are more likely than controls to arise from the sub-population with the higher baseline disease rate. • Further, cases and controls will have different allele frequencies regardless of whether the locus is causal.

Example of Population Stratification (Cardon & Palmer, 2003)
