Chapter 14: Analysis of Categorical Data

General Objectives: Many types of surveys and experiments result in qualitative rather than quantitative response variables, so that the responses can be classified but not quantified. Data from these experiments consist of the count or number of observations that fall into each of the response categories included in the experiment. In this chapter, we are concerned with methods for analyzing categorical data. ©1998 Brooks/Cole Publishing/ITP
Specific Topics
1. The multinomial experiment
2. Pearson's chi-square statistic
3. A test of specified cell probabilities
4. Contingency tables
5. Comparing several multinomial populations
6. Other applications
7. Assumptions for chi-square tests
14.1 A Description of the Experiment
• Measurements can be qualitative or categorical rather than quantitative, e.g., male or female versus height.
• A quality or characteristic, rather than a numerical value, is measured for each experimental unit, e.g., gender, ethnicity, etc.
• Report a count for the number of measurements that fall into each category, e.g., 6 males and 4 females.
• Examples:
- People can be classified into five income brackets.
- A mouse can respond in one of three ways to a stimulus.
- An M&M can have one of six colors.
- An industrial process manufactures items that can be classified as acceptable, "second quality," or "defective."
The Multinomial Experiment:
- The experiment consists of n identical trials.
- The outcome of each trial falls into one of k categories.
- The probability that the outcome of a single trial falls into a particular category, say category i, is pi and remains constant from trial to trial. Each of the k probabilities must be between 0 and 1, and the sum of all k probabilities is Σpi = 1.
- The trials are independent.
- The experimenter counts the observed number of outcomes in each category, written as O1, O2, …, Ok with O1 + O2 + … + Ok = n.
• In this chapter we extend the idea of the binomial random variable to make inferences about the multinomial parameters p1, p2, …, pk, using a different type of statistic.
• This statistic is called the chi-square (sometimes Pearson's chi-square) statistic.
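The conditions above can be illustrated with a short simulation. This sketch is added here for illustration and is not part of the original text; the function name and the equal cell probabilities are illustrative choices:

```python
import random
from collections import Counter

def simulate_multinomial(n, probs, labels, seed=0):
    """Run n identical, independent trials; each outcome falls into one of
    k categories with fixed probabilities (a multinomial experiment)."""
    rng = random.Random(seed)                   # fixed seed for repeatability
    outcomes = rng.choices(labels, weights=probs, k=n)
    counts = Counter(outcomes)
    return [counts[label] for label in labels]  # O1, O2, ..., Ok

# n = 90 trials over k = 3 equally likely categories; the counts sum to n
observed = simulate_multinomial(90, [1/3, 1/3, 1/3], ["green", "red", "blue"])
print(observed, "sum =", sum(observed))
```

Because the category probabilities stay constant and the trials are independent, the resulting counts satisfy O1 + O2 + … + Ok = n by construction.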
14.2 Pearson's Chi-Square Statistic
In general, the expected number of observations that fall into cell i, written as Ei, can be calculated using the formula Ei = npi for any of the cells i = 1, 2, …, k.
• If the null hypothesis is true, the actual observed cell counts, Oi, should not be too different from the expected cell counts, Ei = npi.
• The Pearson chi-square statistic uses the differences (Oi − Ei), first squaring these differences to eliminate negative contributions, and then forming a weighted average of the squared differences.
Pearson's Chi-Square Test Statistic:
X² = Σ (Oi − Ei)² / Ei
summed over all k cells, with Ei = npi.
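As a sketch (added for illustration, not part of the original text), the statistic can be computed directly from its definition; the function name and the sample counts are illustrative choices:

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square: the sum of (Oi - Ei)^2 / Ei over all k cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustration: n = 90 observations over k = 3 cells with equal probabilities,
# so each expected count is Ei = n * pi = 90 * (1/3) = 30
print(round(chi_square_statistic([20, 39, 31], [30, 30, 30]), 3))  # 6.067
```

Each squared difference is weighted by 1/Ei, so a given discrepancy counts for more in a cell where few observations were expected.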
When n is large, X² has an approximate chi-square probability distribution in repeated sampling.
• You should use a right-tailed statistical test to look for an unusually large value of the test statistic.
• Like the F distribution, its shape is not symmetric and depends on a specific number of degrees of freedom. Use Table 5 in Appendix I.
Determining the Appropriate Number of Degrees of Freedom
1. Start with the number of categories or cells in the experiment.
2. Subtract one degree of freedom for each linear restriction on the cell probabilities. You will always lose one df because p1 + p2 + … + pk = 1.
3. Sometimes the expected cell counts cannot be calculated directly but must be estimated using the sample data. Subtract one degree of freedom for every independent population parameter that must be estimated to obtain the estimated values of Ei.
14.3 Testing Specified Cell Probabilities: The Goodness-of-Fit Test
• The simplest hypothesis concerning the cell probabilities specifies a numerical value for each cell.
• The expected cell counts are easily calculated using the hypothesized probabilities, Ei = npi, and are used to calculate the observed value of the X² test statistic. See Example 14.1.
Example 14.1
A researcher designs an experiment in which a rat is attracted to the end of a ramp that divides, leading to doors of three different colors. The researcher sends the rat down the ramp n = 90 times and observes the choices listed in Table 14.1. Does the rat have (or acquire) a preference for one of the three doors?
Table 14.1
Door:                  Green   Red   Blue
Observed count (Oi):      20    39     31

Solution
If the rat has no preference in the choice of a door, you would expect in the long run that the rat would choose each door an equal number of times. That is, the null hypothesis is
H0: p1 = p2 = p3 = 1/3
versus the alternative hypothesis
Ha: At least one pi is different from 1/3
where pi is the probability that the rat chooses door i, for i = 1, 2, and 3. The expected cell counts are the same for each of the three categories, namely npi = 90(1/3) = 30. The chi-square test statistic can now be calculated as
X² = (20 − 30)²/30 + (39 − 30)²/30 + (31 − 30)²/30 = 6.067
For this example, the test statistic has (k − 1) = 2 degrees of freedom because the only linear restriction on the cell probabilities is that they must sum to 1. Hence, you can use Table 5 in Appendix I to find the bounds for the right-tailed p-value. Since the observed value, X² = 6.067, lies between the tabled values 5.99 (right-tail area .050) and 7.38 (right-tail area .025), the p-value is between .025 and .050. The researcher would report the results as significant at the 5% level (P < .05), meaning that the null hypothesis of no preference is rejected. There is sufficient evidence to indicate that the rat has a preference for one of the three doors.
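The arithmetic of Example 14.1 can be checked with a short script (added here for illustration); 5.99 and 7.38 are the df = 2 chi-square critical values for right-tail areas .050 and .025:

```python
# Rat's observed door choices and expected counts under H0: pi = 1/3, n = 90
observed = [20, 39, 31]
expected = [30, 30, 30]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(x2, 3))   # 6.067
print(x2 > 5.99)      # True: significant at the 5% level, reject H0
print(x2 > 7.38)      # False: the p-value is larger than .025
```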
What more can you say about the experiment once you have determined statistically that the rat has a preference? Look at the data to see where the differences lie. It appears that, for this sample, the rat chooses the blue door about one-third of the time: 31/90 = .344. However, the sample proportions for the other two doors are quite different from one-third. The rat chooses the green door least often, only 22% of the time: 20/90 = .222. The rat chooses the red door most often, 43% of the time: 39/90 = .433. You would summarize the results of the experiment by saying that the rat has a preference for the red door. Can you conclude that the preference is caused by the door color? The answer is no; the cause could be some other physiological or psychological factor that you have not yet explored. Avoid declaring a causal relationship between color and preference.
14.4 Contingency Tables: A Two-Way Classification
• In some situations, the researcher classifies an experimental unit according to two qualitative variables to generate bivariate data.
• Examples
- A defective piece of furniture is classified according to the type of defect and the production shift during which it was made.
- A professor is classified by professional rank and the type of university (public or private) at which she works.
- A patient is classified according to the type of preventative flu treatment he received and whether or not he contracted the flu during the winter.
When two categorical variables are recorded, you can summarize the data by counting the observed number of units that fall into each of the various intersections of category levels, e.g., males in height range i and weight range j, or females of race l who play sport m.
• The resulting counts are displayed in an array called a contingency table. See Example 14.3 and Table 14.3.
Example 14.3
A total of n = 309 furniture defects were recorded, and the defects were classified into four types: A, B, C, or D. At the same time, each piece of furniture was identified by the production shift in which it was manufactured. These counts are presented in a contingency table in Table 14.3.
Table 14.3
                 Type of Defect
Shift      A     B     C     D   Total
1         15    21    45    13      94
2         26    31    34     5      96
3         33    17    49    20     119
Total     74    69   128    38     309

• In the analysis of a contingency table, the objective is to determine whether or not one method of classification is contingent or dependent on the other method.
• If not, the two methods are said to be independent.
The Chi-Square Test of Independence
These are the hypotheses:
- H0: The two methods of classification are independent.
- Ha: The two methods of classification are dependent.
• Consider pij, the probability that an observation falls into row i and column j of the contingency table. If the rows and columns are independent, then
pij = P(observation falls in row i and column j)
    = P(observation falls in row i) × P(observation falls in column j)
    = pi pj
where pi and pj are the unconditional or marginal probabilities of falling into row i or column j, respectively.
To estimate a row probability, use
p̂i = ri / n
• To estimate a column probability, use
p̂j = cj / n
Estimated Expected Cell Count:
Êij = (ri cj) / n
where ri is the total for row i and cj is the total for column j.
• The chi-square test statistic for a contingency table with r rows and c columns is calculated as
X² = Σ (Oij − Êij)² / Êij
and can be shown to have an approximate chi-square distribution with df = (r − 1)(c − 1).
• If the observed value of X² is too large, then the null hypothesis of independence is rejected.
Determining the Appropriate Number of Degrees of Freedom:
Remember the general procedure for determining degrees of freedom:
1. Start with k = r × c categories or cells in the contingency table.
2. Subtract one degree of freedom because all of the rc cell probabilities must sum to 1.
3. You had to estimate (r − 1) row probabilities and (c − 1) column probabilities to calculate the estimated cell counts. Subtract (r − 1) and (c − 1) df.
The total degrees of freedom for the r × c contingency table are
df = rc − 1 − (r − 1) − (c − 1) = (r − 1)(c − 1)
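The row-by-column calculation can be sketched in code (added here for illustration, using the counts of Table 14.3); the function names are illustrative choices:

```python
def expected_counts(table):
    """Estimated expected counts: E_ij = (row total * column total) / n."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Chi-square statistic and df = (r - 1)(c - 1) for an r x c table."""
    expected = expected_counts(table)
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# Table 14.3: furniture defects (columns A, B, C, D) by production shift (rows 1-3)
defects = [[15, 21, 45, 13],
           [26, 31, 34, 5],
           [33, 17, 49, 20]]
x2, df = chi_square_independence(defects)
print(round(x2, 2), df)   # 19.18 6
```

With df = (3 − 1)(4 − 1) = 6, the observed statistic can be compared with the right-tail values in Table 5 of Appendix I.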
14.5 Comparing Several Multinomial Populations: A Two-Way Classification with Fixed Row or Column Totals
• An r × c contingency table results when each of n experimental units is counted as falling into one of the rc cells of a multinomial experiment.
• Each cell represents a pair of category levels: row level i and column level j.
• It might be better to decide ahead of time to survey equal numbers of experimental units in each level of a given variable. See Example 14.6.
Example 14.6
In another flu prevention experiment like Example 14.5, the experimenter decides to search the clinic records for 300 patients in each of the three treatment categories: no vaccine, one shot, and two shots. The n = 900 patients will then be surveyed regarding their winter flu history. The experiment results in a 2 × 3 table with the column totals fixed at 300, shown in Table 14.8. By fixing the column totals, the experimenter no longer has a multinomial experiment with 2 × 3 = 6 cells. Instead, there are three separate binomial experiments, call them 1, 2, and 3, each with a given probability pj of contracting the flu and qj of not contracting the flu. (Remember that for a binomial population pj + qj = 1.)
Table 14.8
           No Vaccine   One Shot   Two Shots   Total
Flu                                               r1
No Flu                                            r2
Total         300          300        300          n

Suppose you used the chi-square test to test for the independence of row and column classifications. If a particular treatment (column level) does not affect the incidence of flu, then each of the three binomial populations should have the same incidence of flu, so that p1 = p2 = p3 and q1 = q2 = q3.
The 2 × 3 classification of Example 14.6 describes a situation in which the chi-square test of independence is equivalent to a test of the equality of c = 3 binomial proportions.
• Tests of this type are called tests of homogeneity.
• Whether or not the columns (or rows) are fixed, the test statistic is calculated as
X² = Σ (Oij − Êij)² / Êij,  with Êij = (ri cj) / n,
which has an approximate chi-square distribution in repeated sampling with df = (r − 1)(c − 1).
Determining the Appropriate Number of Degrees of Freedom:
1. Start with the rc cells in the two-way table.
2. Subtract one degree of freedom for each of the c multinomial populations, whose column probabilities must add to 1, a total of c df.
3. You had to estimate (r − 1) row probabilities, but the column probabilities are fixed in advance and did not need to be estimated. Subtract (r − 1) df.
The total degrees of freedom for the r × c (fixed-column) table are
df = rc − c − (r − 1) = rc − c − r + 1 = (r − 1)(c − 1)
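Because the cell counts of Table 14.8 are not reproduced above, the sketch below (added for illustration) uses hypothetical flu counts; only the fixed column totals of 300 per treatment group come from the example:

```python
# Hypothetical cell counts; by design each treatment column totals 300
flu = [24, 9, 13]           # no vaccine, one shot, two shots (illustrative)
no_flu = [276, 291, 287]

table = [flu, no_flu]
n = 900
row_totals = [sum(flu), sum(no_flu)]
col_totals = [300, 300, 300]

# Same statistic and df = (r - 1)(c - 1) = 2 as the independence test;
# 5.99 is the df = 2 critical value for a right-tail area of .05
x2 = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
         / (row_totals[i] * col_totals[j] / n)
         for i in range(2) for j in range(3))
print(round(x2, 2), "reject H0" if x2 > 5.99 else "do not reject H0")
```

The computation is identical to the contingency-table case; only the interpretation changes, from independence of classifications to homogeneity of the three binomial proportions.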
14.6 The Equivalence of Statistical Tests
• A multinomial experiment with two categories is a binomial experiment.
• The data from two binomial experiments can be displayed as a two-way classification.
• There are statistical tests for these two situations based on the z statistic of Chapter 9:
z = (p̂1 − p̂2) / √( p̂q̂ (1/n1 + 1/n2) )
where p̂ is the pooled estimate of the common proportion under H0 and q̂ = 1 − p̂.
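The equivalence can be verified numerically (a sketch added for illustration; the 2 × 2 counts are hypothetical): the square of the pooled z statistic equals the chi-square statistic for the same data.

```python
from math import sqrt

# Hypothetical data: x1 and x2 successes in two samples of sizes n1 and n2
x1, n1 = 40, 100
x2_successes, n2 = 28, 100

# Pooled two-sample z statistic from Chapter 9
p_pool = (x1 + x2_successes) / (n1 + n2)
q_pool = 1 - p_pool
z = (x1 / n1 - x2_successes / n2) / sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))

# Chi-square statistic for the same data arranged as a 2 x 2 table
table = [[x1, n1 - x1], [x2_successes, n2 - x2_successes]]
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = row[0] + row[1]
chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))

print(round(z ** 2, 4), round(chi2, 4))   # the two values agree: z^2 = X^2
```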
14.7 Other Applications of the Chi-Square Test
• Goodness-of-fit tests: It is possible to design a goodness-of-fit test to determine whether data are consistent with data drawn from a particular probability distribution, possibly the normal, binomial, Poisson, or other distributions.
• Time-dependent multinomials: The chi-square statistic can be used to investigate the rate of change of multinomial proportions over time.
• Multidimensional contingency tables: Instead of only two methods of classification, it is possible to investigate a dependence among three or more classifications.
• Log-linear models: Complex models can be created in which the logarithm of the cell probability (ln pij) is some linear function of the row and column probabilities.
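As one illustration of a goodness-of-fit test against a distribution, the sketch below (added here, with hypothetical counts) fits a Poisson model; treating the open-ended "3 or more" cell as 3 when estimating the mean is a simplification made for this sketch:

```python
from math import exp, factorial

# Hypothetical data: number of items with 0, 1, 2, and 3-or-more defects
observed = [20, 32, 26, 22]      # k = 4 cells, n = 100 items
n = sum(observed)

# Estimate the Poisson mean from the data (one estimated parameter)
lam = sum(i * o for i, o in enumerate(observed)) / n

# Expected counts under the fitted Poisson; the last cell absorbs the upper tail
p = [exp(-lam) * lam ** i / factorial(i) for i in range(3)]
p.append(1 - sum(p))
expected = [n * pi for pi in p]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1       # lose 1 df for sum-to-1 and 1 for estimating lam
print(round(x2, 2), df)
```

Note the extra degree of freedom lost for the estimated Poisson mean, exactly as described in the degrees-of-freedom rules of Section 14.2.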
Assumptions for Pearson's Chi-Square:
- The cell counts O1, O2, …, Ok must satisfy the conditions of a multinomial experiment, or a set of multinomial experiments created by fixing either the row or the column totals.
- The expected cell counts E1, E2, …, Ek should equal or exceed five.
• When you calculate the expected cell counts, if you find that one or more is less than five, these options are available to you:
- Choose a larger sample size n. The larger the sample size, the closer the chi-square distribution will approximate the distribution of your test statistic X².
- It may be possible to combine one or more of the cells with small expected cell counts, thereby satisfying the assumption.
Key Concepts and Formulas
I. The Multinomial Experiment
1. There are n identical trials, and each outcome falls into one of k categories.
2. The probability of falling into category i is pi and remains constant from trial to trial.
3. The trials are independent, Σpi = 1, and we measure Oi, the number of observations that fall into each of the k categories.
II. Pearson's Chi-Square Statistic
X² = Σ (Oi − Ei)² / Ei,  with Ei = npi,
which has an approximate chi-square distribution with degrees of freedom determined by the application.
III. The Goodness-of-Fit Test
1. This is a one-way classification with cell probabilities specified in H0.
2. Use the chi-square statistic with Ei = npi calculated with the hypothesized probabilities.
3. df = k − 1 − (number of parameters estimated in order to find Ei)
4. If H0 is rejected, investigate the nature of the differences using the sample proportions.
IV. Contingency Tables
1. A two-way classification with n observations into r × c cells of a two-way table using two different methods of classification is called a contingency table.
2. The test for independence of classification methods uses the chi-square statistic
X² = Σ (Oij − Êij)² / Êij,  with Êij = (ri cj) / n
3. If the null hypothesis of independence of classifications is rejected, investigate the nature of the dependency using conditional proportions within either the rows or columns of the contingency table.
V. Fixing Row or Column Totals
1. When either the row or column totals are fixed, the test of independence of classifications becomes a test of the homogeneity of cell probabilities for several multinomial experiments.
2. Use the same chi-square statistic as for contingency tables.
3. The large-sample z tests for one and two binomial proportions are special cases of the chi-square statistic.
VI. Assumptions
1. The cell counts satisfy the conditions of a multinomial experiment, or a set of multinomial experiments with fixed sample sizes.
2. All expected cell counts must equal or exceed five so that the chi-square approximation is valid.