Contingency Tables For Tests of Independence
Multinomials Over Various Categories
• So far we have discussed a single qualitative variable with multiple outcomes, without regard to any other variable.
• Now we ask whether two qualitative variables are related, i.e., whether or not they are independent.
EXAMPLES
(1) Can it be concluded that cola preference and gender are dependent?
(2) Can it be concluded that cola preference and age are dependent?
RULE OF 5
• χ² (chi-squared) is only an approximate distribution for the test statistic.
• For the approximation to be "valid": ALL expected counts (e's) should be ≥ 5
• If the rule of 5 is violated, combine some categories so that the condition is met.
COLA PREFERENCE VS. GENDER
• The 1000 cola drinkers were further classified as to whether they were male or female.
COLA            MALE        FEMALE      ROW TOTAL
Coke            240         170         r1 = 410
Pepsi           200         150         r2 = 350
RC              50          30          r3 = 80
Shasta          35          15          r4 = 50
Jolt            75          35          r5 = 110
COLUMN TOTAL    c1 = 600    c2 = 400    n = 1000
HYPOTHESIS TEST: Can We Conclude Cola Preference and Gender Are Dependent?
H0: (NO) Cola preference and gender are independent
HA: (YES) Cola preference and gender are dependent
α = .05
Reject H0 if χ² > χ²_{.05,DF}
• The correct DF = (r-1)(c-1) = (5-1)(2-1) = (4)(1) = 4, where r = # rows and c = # columns
Reject H0 if χ² > χ²_{.05,4} = 9.48773
HOW DO WE GET THE eij's?
Let P(A) = probability a respondent favors Coke
Let P(B) = probability a respondent is male
If H0 is true, the classifications are independent, so P(A and B) = P(A)P(B)
Best guess for P(A): 410/1000 = .41
Best guess for P(B): 600/1000 = .6
Thus P(A and B) ≈ (.41)(.6) = .246
Expected number (Coke and male) = e11 = 1000(.246) = 246
This can be obtained directly as r1c1/n = (410)(600)/1000 = 246
CONTINGENCY TABLES
• Contingency tables are a convenient way of expressing the results when there are two classifications
• A contingency table is the equivalent of a multinomial table for two classifications
• We put the eij's in parentheses under (or next to) the observed counts fij in the table; then we calculate:
χ² = Σ (fij - eij)² / eij, summed over all cells of the table
eij's for Cola vs. Gender
• Coke/Male      e11 = (410)(600)/1000 = 246
• Coke/Female    e12 = (410)(400)/1000 = 164
• Pepsi/Male     e21 = (350)(600)/1000 = 210
• Pepsi/Female   e22 = (350)(400)/1000 = 140
• RC/Male        e31 = (80)(600)/1000 = 48
• RC/Female      e32 = (80)(400)/1000 = 32
• Shasta/Male    e41 = (50)(600)/1000 = 30
• Shasta/Female  e42 = (50)(400)/1000 = 20
• Jolt/Male      e51 = (110)(600)/1000 = 66
• Jolt/Female    e52 = (110)(400)/1000 = 44
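The ten calculations above all apply the same rule, (row total)(column total)/n. A minimal Python sketch of that rule follows; the function name `expected_counts` and the list-of-lists layout are illustrative choices, not part of the original slides.

```python
# Observed counts from the cola vs. gender slide
# (rows: Coke, Pepsi, RC, Shasta, Jolt; columns: male, female).
observed = [
    [240, 170],  # Coke
    [200, 150],  # Pepsi
    [50, 30],    # RC
    [35, 15],    # Shasta
    [75, 35],    # Jolt
]


def expected_counts(table):
    """Return e_ij = (row total)(column total) / n for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print(row)
```

The values are kept as floats and not rounded, in line with the note that expected counts should not be rounded to whole numbers.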
Notes on Calculating e's
• The column totals may be set in advance or may be random based on the survey.
• These eij's were all whole numbers. If they are not, DO NOT ROUND TO WHOLE NUMBERS.
• All these e's are ≥ 5, but suppose e52 were actually 3.
• We might then combine the results from Shasta and Jolt colas.
• This would reduce the number of rows and hence the degrees of freedom.
• e52 is not less than 5 here, so we do not have to do this.
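The rule-of-5 check and the Shasta/Jolt merge described in these notes can be sketched as follows. This is only an illustration: the helper names `satisfies_rule_of_5` and `merge_rows` are hypothetical, and the merge is shown for demonstration even though this table does not need it.

```python
def expected_counts(table):
    """e_ij = (row total)(column total) / n for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def satisfies_rule_of_5(table):
    """True when every expected count is at least 5."""
    return all(e >= 5 for row in expected_counts(table) for e in row)


def merge_rows(table, i, j):
    """Combine rows i and j into a single category (hypothetical helper)."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    kept = [row for k, row in enumerate(table) if k not in (i, j)]
    return kept + [merged]


# Cola vs. gender: Coke, Pepsi, RC, Shasta, Jolt (male, female).
observed = [[240, 170], [200, 150], [50, 30], [35, 15], [75, 35]]

rule_ok = satisfies_rule_of_5(observed)   # no merge is needed when this is True
combined = merge_rows(observed, 3, 4)     # what merging Shasta and Jolt would give
df_combined = (len(combined) - 1) * (len(combined[0]) - 1)
print(rule_ok, combined[-1], df_combined)
```

Merging two rows leaves 4 rows, so the degrees of freedom would drop from (5-1)(2-1) = 4 to (4-1)(2-1) = 3.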
CONTINGENCY TABLE FOR COLA vs. GENDER
(expected counts eij in parentheses beneath the observed counts)
          Men      Women    Total
Coke      240      170      410
          (246)    (164)
Pepsi     200      150      350
          (210)    (140)
RC        50       30       80
          (48)     (32)
Shasta    35       15       50
          (30)     (20)
Jolt      75       35       110
          (66)     (44)
Total     600      400      1000
χ² for Cola vs. Gender
• χ² = (240-246)²/246 + (170-164)²/164 + (200-210)²/210 + (150-140)²/140 + (50-48)²/48 + (30-32)²/32 + (35-30)²/30 + (15-20)²/20 + (75-66)²/66 + (35-44)²/44 = 6.92
• χ² = 6.92 < χ²_{.05,4} = 9.48773
• There is not enough evidence to conclude gender and cola preference are dependent.
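The sum above can be reproduced with a short Python sketch. The function name `chi_squared` is an illustrative choice; the critical value 9.48773 is taken from the slide and not computed here.

```python
def chi_squared(table):
    """Sum of (f_ij - e_ij)^2 / e_ij over all cells, with e_ij = r_i * c_j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (f - e) ** 2 / e
    return stat


# Cola vs. gender: Coke, Pepsi, RC, Shasta, Jolt (male, female).
observed = [[240, 170], [200, 150], [50, 30], [35, 15], [75, 35]]

stat = chi_squared(observed)
critical = 9.48773          # chi-squared critical value for alpha = .05, DF = 4 (from the slide)
reject_h0 = stat > critical
print(round(stat, 2), reject_h0)
```

Since the statistic is below the critical value, `reject_h0` is false, matching the conclusion on the slide.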
COLA PREFERENCE vs. AGE
• Survey results:
          <20     20-40   40-60   >60     TOTAL
Coke      155     140     75      40      410
Pepsi     155     95      75      25      350
RC        30      20      15      15      80
Shasta    20      15      10      5       50
Jolt      40      30      25      15      110
TOTAL     400     300     200     100     1000
HYPOTHESIS TEST
• There are r = 5 rows and c = 4 columns
H0: (NO) Cola preference and age are independent
HA: (YES) Cola preference and age are dependent
α = .05
Reject H0 if χ² > χ²_{.05,DF}
• DF = (r-1)(c-1) = (5-1)(4-1) = (4)(3) = 12
Reject H0 if χ² > χ²_{.05,12} = 21.0261
Sample eij's
• e34 = (Row 3 Total)(Column 4 Total)/(Grand Total) = (80)(100)/1000 = 8
• e41 = (Row 4 Total)(Column 1 Total)/(Grand Total) = (50)(400)/1000 = 20
CONTINGENCY TABLE FOR COLA vs. AGE
(expected counts eij in parentheses beneath the observed counts)
          <20     20-40   40-60   >60     Total
Coke      155     140     75      40      410
          (164)   (123)   (82)    (41)
Pepsi     155     95      75      25      350
          (140)   (105)   (70)    (35)
RC        30      20      15      15      80
          (32)    (24)    (16)    (8)
Shasta    20      15      10      5       50
          (20)    (15)    (10)    (5)
Jolt      40      30      25      15      110
          (44)    (33)    (22)    (11)
Total     400     300     200     100     1000
χ² for Cola vs. Age
• χ² = (155-164)²/164 + (140-123)²/123 + (75-82)²/82 + (40-41)²/41 + … + (40-44)²/44 + (30-33)²/33 + (25-22)²/22 + (15-11)²/11 = 18.72
• χ² = 18.72 < χ²_{.05,12} = 21.0261
• There is not enough evidence to conclude cola preference and age are dependent.
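Because the slide elides the middle terms of the sum, a sketch that adds up all 20 cells row by row may help when checking the total. The dictionary layout and the name `row_contrib` are illustrative assumptions.

```python
# Observed counts for cola vs. age (columns: <20, 20-40, 40-60, >60).
observed = {
    "Coke":   [155, 140, 75, 40],
    "Pepsi":  [155, 95, 75, 25],
    "RC":     [30, 20, 15, 15],
    "Shasta": [20, 15, 10, 5],
    "Jolt":   [40, 30, 25, 15],
}

col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

# Each row's share of the chi-squared statistic.
row_contrib = {}
for cola, counts in observed.items():
    row_total = sum(counts)
    row_contrib[cola] = sum(
        (f - row_total * c / n) ** 2 / (row_total * c / n)
        for f, c in zip(counts, col_totals)
    )

chi_sq = sum(row_contrib.values())
for cola, value in row_contrib.items():
    print(f"{cola:7s} {value:6.3f}")
print(f"chi-squared = {chi_sq:.2f}")
```

The Shasta row contributes nothing because its observed counts equal its expected counts exactly, while the RC row contributes the most.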
Excel
• CHITEST gives the p-value for the test: =CHITEST(Observed Values, Expected Values)
• The expected values (eij's) must be calculated first
• See next slide for an easy way to calculate these values.
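Excel steps for the 5 × 2 table (observed values in B4:C8)
• Row totals: enter =SUM(B4:C4) in D4, then drag to D5:D8
• Column totals: enter =SUM(B4:B8) in B9, then drag to C9:D9
• Expected values: enter =$D4*B$9/$D$9 in B13, drag to C13, then drag B13:C13 to B17:C17
• p-value: =CHITEST(B4:C8,B13:C17)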
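Excel steps for the 5 × 4 table (observed values in B4:E8)
• Row totals: enter =SUM(B4:E4) in F4, then drag to F5:F8
• Column totals: enter =SUM(B4:B8) in B9, then drag to C9:F9
• Expected values: enter =$F4*B$9/$F$9 in B13, drag to E13, then drag B13:E13 to B17:E17
• p-value: =CHITEST(B4:E8,B13:E17)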
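For readers without Excel, the p-value that CHITEST reports can be approximated with a short Python sketch. It does not reproduce Excel's internals. It relies on the closed-form upper tail of the chi-squared distribution, which is a finite sum only when the degrees of freedom are even (true here: 4 and 12). The function names are illustrative.

```python
import math


def chi2_tail_even_df(x, df):
    """P(chi-squared with even df exceeds x), using the finite-sum closed form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def independence_test(table):
    """Return (chi-squared statistic, degrees of freedom, p-value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum(
        (f - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for f, c in zip(row, col_totals)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_tail_even_df(stat, df)


gender = [[240, 170], [200, 150], [50, 30], [35, 15], [75, 35]]
age = [[155, 140, 75, 40], [155, 95, 75, 25], [30, 20, 15, 15],
       [20, 15, 10, 5], [40, 30, 25, 15]]

stat_g, df_g, p_g = independence_test(gender)
stat_a, df_a, p_a = independence_test(age)
print(df_g, round(p_g, 3))
print(df_a, round(p_a, 3))
```

Both p-values exceed α = .05, which agrees with the critical-value comparisons made by hand. For odd degrees of freedom a library routine would be needed; if SciPy is installed, `scipy.stats.chi2_contingency` is one option.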
Review
• Contingency tables allow us to test whether two different classifications are independent
• Excel: CHITEST generates the p-value for the chi-squared test
• Expected Values = (Row Total)(Column Total)/n
• By hand: degrees of freedom = (r-1)(c-1), and the χ² statistic is calculated by:
χ² = Σ (fij - eij)² / eij, summed over all cells of the table