Chapter

Chapter 12 Inference on Categorical Data

Section 12.1 Goodness-of-Fit Test

Objective • Perform a goodness-of-fit test

Characteristics of the Chi-Square Distribution • It is not symmetric.

Characteristics of the Chi-Square Distribution • It is not symmetric. • The shape of the chi-square distribution depends on the degrees of freedom, just like Student’s t-distribution.

Characteristics of the Chi-Square Distribution • It is not symmetric. • The shape of the chi-square distribution depends on the degrees of freedom, just like Student’s t-distribution. • As the number of degrees of freedom increases, the chi-square distribution becomes more nearly symmetric.

Characteristics of the Chi-Square Distribution • It is not symmetric. • The shape of the chi-square distribution depends on the degrees of freedom, just like Student’s t-distribution. • As the number of degrees of freedom increases, the chi-square distribution becomes more nearly symmetric. • The values of 2 are nonnegative, i.e., the values of 2 are greater than or equal to 0.

A goodness-of-fit test is an inferential procedure used to determine whether a frequency distribution follows a specific distribution.

Expected Counts Suppose that there are n independent trials of an experiment with k ≥ 3 mutually exclusive possible outcomes. Let p1 represent the probability of observing the first outcome and E1 represent the expected count of the first outcome; p2 represent the probability of observing the second outcome and E2 represent the expected count of the second outcome; and so on. The expected counts for each possible outcome are given by Ei = i = npi for i = 1, 2, …, k

Parallel Example 1: Finding Expected Counts A sociologist wishes to determine whether the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents have been responsible for their grandchildren less than 1 year; 23.9% of grandparents have been responsible for their grandchildren for 1 or 2 years; 17.6% of grandparents have been responsible for their grandchildren 3 or 4 years; and 35.7% of grandparents have been responsible for their grandchildren for 5 or more years. If the sociologist randomly selects 1,000 care-giving grandparents, compute the expected number within each category assuming the distribution has not changed from 2000.

Solution Step 1: The probabilities are the relative frequencies from the 2000 distribution: p<1yr = 0.228 p1-2yr = 0.239 p3-4yr = 0.176 p ≥5yr = 0.357

Solution Step 2: There are n=1,000 trials of the experiment so the expected counts are: E<1yr = np<1yr = 1000(0.228) = 228 E1-2yr = np1-2yr = 1000(0.239) = 239 E3-4yr = np3-4yr =1000(0.176) = 176 E≥5yr= np ≥5yr = 1000(0.357) = 357

Test Statistic for Goodness-of-Fit Tests Let Oirepresent the observed counts of category i, Ei represent the expected counts of category i, k represent the number of categories, and n represent the number of independent trials of an experiment. Then the formula approximately follows the chi-square distribution with k-1 degrees of freedom, provided that • all expected frequencies are greater than or equal to 1 (all Ei ≥ 1) and • no more than 20% of the expected frequencies are less than 5.

CAUTION! Goodness-of-fit tests are used to test hypotheses regarding the distribution of a variable based on a single population. If you wish to compare two or more populations, you must use the tests for homogeneity presented in Section 12.2.

The Goodness-of-Fit Test To test the hypotheses regarding a distribution, we use the steps that follow. Step 1:Determine the null and alternative hypotheses. H0: The random variable follows a certain distribution H1: The random variable does not follow a certain distribution

Step 2:Decide on a level of significance, , depending on the seriousness of making a Type I error.

Step 3: • Calculate the expected counts for each of the k categories. The expected counts are Ei=npifor i = 1, 2, … , k where n is the number of trials and pi is the probability of the ith category, assuming that the null hypothesis is true.

Step 3: • Verify that the requirements for the goodness-of-fit test are satisfied. • All expected counts are greater than or equal to 1 (all Ei ≥ 1). • No more than 20% of the expected counts are less than 5. • Compute the test statistic: Note:Oi is the observed count for the ith category.

CAUTION! If the requirements in Step 3(b) are not satisfied, one option is to combine two or more of the low-frequency categories into a single category.

Classical Approach Step 4:Determine the critical value. All goodness-of-fit tests are right-tailed tests, so the critical value is with k-1 degrees of freedom.

Classical Approach Step 5:Compare the critical value to the test statistic. If reject the null hypothesis.

P-Value Approach Step 4:Use Table VII to obtain an approximate P-value by determining the area under the chi-square distribution with k-1 degrees of freedom to the right of the test statistic.

P-Value Approach Step 5:If the P-value < , reject the null hypothesis.

Step 6:State the conclusion.

Parallel Example 2: Conducting a Goodness-of -Fit Test A sociologist wishes to determine whether the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents have been responsible for their grandchildren less than 1 year; 23.9% of grandparents have been responsible for their grandchildren for 1 or 2 years; 17.6% of grandparents have been responsible for their grandchildren 3 or 4 years; and 35.7% of grandparents have been responsible for their grandchildren for 5 or more years. The sociologist randomly selects 1,000 care-giving grandparents and obtains the following data.

Test the claim that the distribution is different today than it was in 2000 at the  = 0.05 level of significance.

Solution Step 1: We want to know if the distribution today is different than it was in 2000. The hypotheses are then: H0: The distribution for the number of years care-giving grandparents are responsible for their grandchildren is the same today as it was in 2000 H1: The distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000

Solution Step 2: The level of significance is =0.05. Step 3: (a) The expected counts were computed in Example 1.

Solution Step 3: • Since all expected counts are greater than or equal to 5, the requirements for the goodness-of-fit test are satisfied. • The test statistic is

Solution: Classical Approach Step 4: There are k = 4 categories, so we find the critical value using 4-1=3 degrees of freedom. The critical value is

Solution: Classical Approach Step 5: Since the test statistic, is less than the critical value , we fail to reject the null hypothesis.

Solution: P-Value Approach Step 4: There are k = 4 categories. The P-value is the area under the chi-square distribution with 4-1=3 degrees of freedom to the right of . Thus, P-value ≈ 0.09.

Solution: P-Value Approach Step 5: Since the P-value ≈ 0.09 is greater than the level of significance  = 0.05,we fail to reject the null hypothesis.

Solution Step 6: There is insufficient evidence to conclude that the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000 at the  = 0.05 level of significance.

Section 12.2 Tests for Independence and the Homogeneity of Proportions

Objectives • Perform a test for independence • Perform a test for homogeneity of proportions

Objective 1 • Perform a Test for Independence

The chi-square test for independence is used to determine whether there is an association between a row variable and column variable in a contingency table constructed from sample data. The null hypothesis is that the variables are not associated; in other words, they are independent. The alternative hypothesis is that the variables are associated, or dependent.

“In Other Words” In a chi-square independence test, the null hypothesis is always H0: The variables are independent The alternative hypothesis is always H0: The variables are not independent

The idea behind testing these types of claims is to compare actual counts to the counts we would expect if the null hypothesis were true (if the variables are independent). If a significant difference between the actual counts and expected counts exists, we would take this as evidence against the null hypothesis.

If two events are independent, then P(A and B) = P(A)P(B) We can use the Multiplication Principle for independent events to obtain the expected proportion of observations within each cell under the assumption of independence and multiply this result by n, the sample size, in order to obtain the expected count within each cell.

Parallel Example 1: Determining the Expected Counts in a Test for Independence In a poll, 883 males and 893 females were asked “If you could have only one of the following, which would you pick: money, health, or love?” Their responses are presented in the table below. Determine the expected counts within each cell assuming that gender and response are independent. Source: Based on a Fox News Poll conducted in January, 1999

Solution Step 1: We first compute the row and column totals:

Solution Step 2: Next compute the relative marginal frequencies for the row variable and column variable:

Solution Step 3: Assuming gender and response are independent, we use the Multiplication Rule for Independent Events to compute the proportion of observations we would expect in each cell.

Solution Step 4: We multiply the expected proportions from step 3 by 1776, the sample size, to obtain the expected counts under the assumption of independence.

Expected Frequencies in a Chi-Square Test for Independence To find the expected frequencies in a cell when performing a chi-square independence test, multiply the row total of the row containing the cell by the column total of the column containing the cell and divide this result by the table total. That is,

Test Statistic for the Test of Independence Let Oi represent the observed number of counts in the ith cell and Ei represent the expected number of counts in the ith cell. Then approximately follows the chi-square distribution with (r-1)(c-1) degrees of freedom, where r is the number of rows and c is the number of columns in the contingency table, provided that (1) all expected frequencies are greater than or equal to 1 and (2) no more than 20% of the expected frequencies are less than 5.

Chi-Square Test for Independence To test the association (or independence of) two variables in a contingency table: Step 1:Determine the null and alternative hypotheses. H0: The row variable and column variable are independent. H1: The row variable and column variables are dependent.

Chapter

Chapter

Presentation Transcript

Chapter

Chapter

Chapter

chapter:

Chapter

Chapter # - Chapter Title

Chapter

chapter:

Chapter 6 Chapter Review

Chapter # - Chapter Title

CHAPTER 3 CHAPTER REVIEW

Chapter 1. Chapte r 2. Chapter 3. Chapter 4. Chapter 5. Chapter 6. Chapter 7.

Chapter Number Chapter Title

CHAPTER

chapter

Chapter

CHAPTER

Chapter # - Chapter Title

Chapter # - Chapter Title

CHAPTER 5 CHAPTER 2

Chapter 17 Chapter Review

Chapter 18 Chapter 18