Chapter 9 Analysis of Two-Way Tables
Two-way (i.e. contingency) tables: to classify & analyze categorical data: • Binomial counts: ‘success’ vs. ‘failure’ • Proportions: binomial count divided by total sample size
We’ll later see that inference via two-way tables is an alternative—with advantages & disadvantages—to the z-test for comparing two sample proportions: . prtest hsci, by(white)
An advantage of two-way tables is that they can examine more than two variables. • A disadvantage of two-way tables is that they can only do two-sided hypothesis tests.
Here’s a two-way table:
. tab hsci white, cell

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%
Reading the table (not hsci: 50, 103, total 153; hsci: 5, 42, total 47; column totals 55, 145; sample total 200):
• The row variable: hsci vs. not hsci. The column variable: white vs. nonwhite.
• Cells: each combination of values for the two variables (50, 103, 5, 42).
• Joint distributions: each cell’s percentage of the total sample (50/200=.250; 103/200=.515; 5/200=.025; 42/200=.210).
• The marginal frequencies: the row totals (153, 47) & the column totals (55, 145).
• The marginal distributions: each row total/sample total (76.5%, 23.5%). Each column total/sample total (27.5%, 72.5%).
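The definitions above can be checked by hand. The following is a minimal sketch in Python (used here only as a calculator outside Stata); the variable names are illustrative assumptions, and the counts are the ones shown in the table.

```python
# Observed counts from the hsci-by-white table in the text.
counts = {
    ("not hsci", "nonwhite"): 50,
    ("not hsci", "white"): 103,
    ("hsci", "nonwhite"): 5,
    ("hsci", "white"): 42,
}
rows = ["not hsci", "hsci"]
cols = ["nonwhite", "white"]

n = sum(counts.values())  # total sample size

# Joint distribution: each cell count divided by the total sample size.
joint = {cell: count / n for cell, count in counts.items()}

# Marginal frequencies: the row totals and the column totals.
row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}

# Marginal distributions: each marginal frequency divided by the total.
row_marginal = {r: total / n for r, total in row_totals.items()}
col_marginal = {c: total / n for c, total in col_totals.items()}

print("joint:", joint)
print("row totals:", row_totals, "column totals:", col_totals)
print("row marginal:", row_marginal, "column marginal:", col_marginal)
```

The joint proportions sum to 1, as do each set of marginal proportions.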
Here are the same data displayed as column conditional probabilities:
. tab hsci white, col nofreq

          nonwhite      white      Total
no          90.91%     71.03%     76.50%
yes          9.09%     28.97%     23.50%
Total      100.00%    100.00%    100.00%

• The conditional distributions (i.e. conditional probabilities): Column: divide each column cell count by its column total count.
Here are the same data displayed as row conditional probabilities:
. tab hsci white, row nofreq

            nonwhite      white      Total
not hsci      32.68%     67.32%    100.00%
hsci          10.64%     89.36%    100.00%
Total         27.50%     72.50%    100.00%

• The conditional distributions (i.e. conditional probabilities): Row: divide each row cell count by its row total count.
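As a cross-check on the two Stata displays, here is a small Python sketch that derives the column and row conditional percentages from the raw counts. The dictionary layout and names are assumptions made for illustration.

```python
# Observed counts: outer key is the row (hsci status), inner key the column (race).
counts = {
    "not hsci": {"nonwhite": 50, "white": 103},
    "hsci": {"nonwhite": 5, "white": 42},
}
columns = ["nonwhite", "white"]

# Column totals are the denominators for the column conditional distributions.
col_totals = {c: sum(row[c] for row in counts.values()) for c in columns}

# Column conditional distribution: cell count / its column total (in percent).
col_cond = {
    r: {c: 100 * counts[r][c] / col_totals[c] for c in columns}
    for r in counts
}

# Row conditional distribution: cell count / its row total (in percent).
row_cond = {
    r: {c: 100 * count / sum(row.values()) for c, count in row.items()}
    for r, row in counts.items()
}

for r in counts:
    print(r, "column %:", {c: round(v, 2) for c, v in col_cond[r].items()})
    print(r, "row %:   ", {c: round(v, 2) for c, v in row_cond[r].items()})
```

Each column of `col_cond` sums to 100, and each row of `row_cond` sums to 100, which is a quick way to tell the two displays apart.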
Tip: It’s usually best to compute conditional distributions (i.e. probabilities) across the categories of the explanatory variable. • E.g., tab hsci white, col: computes the conditional distributions across the categories of the explanatory variable race-ethnicity (i.e. white vs. nonwhite). • Alternatively, you may want to compare joint distributions (i.e. cell counts/total sample): tab hsci white, cell
We’ve discussed the following: • row variables • column variables • cells: each combination of values for the two variables. • joint distributions: each cell’s percentage of the total sample.
marginal frequencies • marginal distributions: each marginal frequency/total sample size • column conditional distributions: divide each column cell count by its column total count • row conditional distributions: divide each row cell count by its row total count.
And we’ve said that typically it’s best to compute the conditional distributions (i.e. probabilities) across the categories of the explanatory variable. • Or that it may be preferable to compare joint distributions (i.e. compare the cell probabilities).
Simpson’s Paradox • An NSF study found that the median salary of newly graduated female engineers & scientists was just 73% of the median salary for males. Here are women’s median salaries in the 16 fields as a percentage of male salaries: • 94% 96% 98% 95% 85% 85% 84% 100% 103% 100% 107% 93% 104% 93% 106% 100% • How can it be that, on average, the women earn just 73% of the median salary for males, since no listed % falls below 84%?
Because women are disproportionately located in the lower-paying fields of engineering & science. • That is, ‘field of science & engineering’ is a lurking variable (i.e. an unmeasured confounding variable) that influences the observed association between gender & salary.
Simpson’s Paradox: the reversal of a bivariate relationship due to the influence of a lurking variable. • Aggregating data has the effect of ignoring one or more lurking variables. • Another example: comparing hospital mortality rates. • Yet another: comparing airline on-time rates.
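To make the aggregation effect concrete, here is a minimal Python sketch. The salary figures are HYPOTHETICAL numbers invented purely for illustration (they are not the NSF data); they are chosen so that women earn 98% of the male median within each field, yet far less in the aggregate because most women are in the lower-paying field.

```python
from statistics import median

# HYPOTHETICAL salaries (in thousands), invented for illustration only.
salaries = {
    "higher-paying field": {"men": [100] * 8, "women": [98] * 2},
    "lower-paying field": {"men": [50] * 2, "women": [49] * 8},
}

# Within each field: women's median salary as a share of men's median salary.
within = {
    field: median(groups["women"]) / median(groups["men"])
    for field, groups in salaries.items()
}

# Aggregated over fields: pool everyone, ignoring the lurking variable 'field'.
all_men = [s for groups in salaries.values() for s in groups["men"]]
all_women = [s for groups in salaries.values() for s in groups["women"]]
overall = median(all_women) / median(all_men)

print("within-field ratios:", within)
print("aggregate ratio:", overall)
```

Holding field constant (the within-field ratios) tells a very different story from the pooled comparison, which is the point of the warning about aggregated data.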
Conclusion from Simpson’s Paradox • Always be on the lookout for lurking variables with aggregated data! • A bivariate relationship may change direction when a third, control variable is introduced.
What’s a control variable? • Holding a variable constant makes it a control variable: doing so removes the part of the bivariate relationship that was caused by the control variable. • That is, controlling for a variable neutralizes its influence on the observed relationship. • E.g., controlling for field of science & engineering. • E.g., controlling for race/ethnicity.
To repeat, holding a variable constant removes its statistical effects from the bivariate association being examined. • Doing so ensures (more or less) that a bivariate relationship is assessed apart from the influence of the controlled variable: e.g., the relationship between a Montessori school program & student IQ scores, holding constant social class.
What’s better: statistical control or experimental control? • The answer returns us to the matter of observational study versus experimental study (see Moore/McCabe, chapter 3).
Good experimental design controls for all possible lurking variables. Why? • But statistical control cannot do so. Why not? • Moreover, statistical control is weakened by the imprecision of measurement of variables. • But we can’t experiment on everything.
A bivariate association may not appear until a third, control variable is introduced. • The apparent absence of the bivariate relationship is called spurious non-association. • E.g., no association between years of education & level of income in post-WW II data, until controlling for age of respondents.
Conclusion from Spurious Non-Association • Explore not just bivariate relationships but also multivariate relationships among all the variables of potential practical or theoretical relevance.
Here’s how to add a control variable to a two-way table in Stata: bys female: tab hsci white, cell chi2
male
          nonwhite     white     Total
0               21        41        62
            23.08%    45.05%    68.13%
1                2        27        29
             2.20%    29.67%    31.87%
Total           23        68        91
            25.27%    74.73%   100.00%
Pearson chi2(1) = 7.6120   Pr = 0.006

female
          nonwhite     white     Total
0               29        62        91
            26.61%    56.88%    83.49%
1                3        15        18
             2.75%    13.76%    16.51%
Total           32        77       109
            29.36%    70.64%   100.00%
Pearson chi2(1) = 1.6744   Pr = 0.196
This example introduces a test of statistical significance for two-way tables. • The test is based on the Chi-square statistic.
Note in the example that the two-way table for females tests insignificant (Pr = 0.196), while the table for males tests significant (Pr = 0.006). • How do two-way tables & their test of significance evaluate the data? • They do so by comparing expected & observed cell counts in terms of proportional distributions.
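The two Pearson statistics in the stratified output can be recomputed from the cell counts alone. The sketch below is a hedged Python illustration, not Stata's implementation; the function names `pearson_chi2` and `p_value_df1` are hypothetical helpers, and the P-value shortcut is valid only for 1 degree of freedom (a chi-square variable with 1 df is the square of a standard normal).

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def p_value_df1(chi2):
    """Upper-tail P-value of a chi-square statistic with 1 df ONLY."""
    return math.erfc(math.sqrt(chi2 / 2))

# Counts from the stratified Stata output (rows: hsci 0/1; columns: nonwhite, white).
strata = {
    "male": [[21, 41], [2, 27]],
    "female": [[29, 62], [3, 15]],
}

results = {}
for group, table in strata.items():
    chi2 = pearson_chi2(table)
    results[group] = (chi2, p_value_df1(chi2))
    print(f"{group}: chi2(1) = {chi2:.4f}, Pr = {results[group][1]:.3f}")
```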
Back to the two-way table without the control variable:
. tab hsci white, cell

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%
Describing Relations in Two-Way Tables • The original data must be counts. • Inference for two-way tables: compare the observed cell counts to the expected cell counts; then compute the Chi-square significance test.
We begin by computing the expected cell counts: row total times column total, divided by total sample size. • Premise: the null hypothesis of ‘statistical independence’ (i.e. no association between the variables) characterizes the data.
Expected cell counts: row total times column total, divided by total sample size.

         nonwhite     white     Total
no             50       103       153
yes             5        42        47
Total          55       145       200

. di (153*55)/200
42.075
. di (153*145)/200
110.925
. di (47*55)/200
12.925
. di (47*145)/200
34.075

• How do the expected cell counts compare to the observed cell counts: Do the conditional probabilities appear to be equal for nonwhites & whites across no-hsci & yes-hsci?
Expected count for each cell: its row total times its column total, divided by the total sample size. • Each expected cell count is based on the proportion of the total sample accounted for by its entire row & by its entire column. • The null hypothesis of the Chi-square test is independence (i.e. no association): the conditional distribution of honors science is the same for nonwhites & whites.
That is, each expected cell count reflects the null hypothesis of statistical independence (i.e. no association): • that the proportion of non-white honors science students is simply the proportion of non-white students in the population. • that the proportion of white honors science students is simply the proportion of white students in the population. • What’s the alternative hypothesis?
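The expected-count rule can be sketched in a few lines of Python. This is an illustrative calculation only, with assumed variable names; it reproduces the four `di` results and shows that, under independence, the expected share of honors science students is the same in each column.

```python
# Observed counts (rows: no hsci, yes hsci; columns: nonwhite, white).
observed = [[50, 103], [5, 42]]

row_totals = [sum(row) for row in observed]        # [153, 47]
col_totals = [sum(col) for col in zip(*observed)]  # [55, 145]
n = sum(row_totals)                                # 200

# Expected count for each cell: row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Under the null hypothesis, the hsci share is identical in every column
# and equals the overall hsci share (47/200).
expected_hsci_share = [expected[1][j] / col_totals[j] for j in range(2)]

print("expected counts:", expected)
print("expected hsci share by column:", expected_hsci_share)
```

Comparing these expected counts with the observed 50, 103, 5 and 42 shows where the data depart from independence: fewer nonwhite and more white honors science students than expected.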
Chi-Square Test Assumptions • Random sample • Two categorical variables • Count data • An expected count of at least 5 in 80% of the cells & an expected count of no less than 1 in any cell (best if the expected count is at least 5 in all cells)
If the assumptions are fulfilled, use the Chi-square test: tab hsci white, chi2 • If the numbers of observations per cell don’t meet the assumptions, use ‘Fisher’s exact test’ (a non-parametric test, which may be very slow): tab hsci white, exact
Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts. • It’s therefore a test of independence: Ho: the variables are independent from each other Ha: they are not independent from each other
Step 1: Chi-square = summation over all cells of (observed cell count – expected cell count) squared, divided by the cell’s expected count • Step 2: df = (# rows – 1) × (# columns – 1) • Step 3: Chi-square significance test = compare the Chi-square statistic to the Chi-square distribution with df degrees of freedom to obtain the P-value
Chi-square statistic: non-negative values only • Has a distinct distribution for each degree of freedom (see Moore/McCabe) • Two-sided hypothesis test only
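The three steps can be traced on the hsci-by-white counts with the Python sketch below. It is a hand calculation under stated assumptions, not Stata's code: the names are illustrative, and the P-value line uses an identity that holds only when df = 1.

```python
import math

# Observed counts (rows: no hsci, yes hsci; columns: nonwhite, white).
observed = [[50, 103], [5, 42]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Step 1: sum over all cells of (observed - expected)^2 / expected.
contributions = []
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        contributions.append((obs - exp) ** 2 / exp)
chi2 = sum(contributions)

# Step 2: degrees of freedom = (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 3: P-value from the chi-square distribution with df degrees of freedom.
# Shortcut valid for df == 1 only: P(chi2_1 > x) = erfc(sqrt(x / 2)).
assert df == 1
p_value = math.erfc(math.sqrt(chi2 / 2))

print("cell contributions:", [round(c, 4) for c in contributions])
print(f"chi2({df}) = {chi2:.4f}, Pr = {p_value:.3f}")
```

Every cell contribution is non-negative, so the statistic cannot be negative, and large values in either direction of departure count against the null hypothesis.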
Chi-square Test: To Repeat… • Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts. • That is, it compares the sample distribution with a hypothesized distribution. • It’s a test of statistical independence (Ho: no association; Ha: association).
Step 1: Chi-square = summation over all cells of (observed cell count – expected cell count) squared, divided by the cell’s expected count • Step 2: df = (# rows – 1) × (# columns – 1) • Step 3: Chi-square significance test = compare the Chi-square statistic to the Chi-square distribution with df degrees of freedom to obtain the P-value
Hypothesis Test
Ho: proportion in hsci among whites = proportion in hsci among nonwhites
Ha: proportion in hsci among whites ~= proportion in hsci among nonwhites (i.e. two-sided alternative)
• Chi-square test: two-sided alternative hypothesis only.
. tab hsci white, cell chi2

         nonwhite     white     Total
no             50       103       153
            25.0%     51.5%     76.5%
yes             5        42        47
             2.5%     21.0%     23.5%
Total          55       145       200
            27.5%     72.5%    100.0%
Pearson chi2(1) = 8.7613   Pr = 0.003

• Conclusion: Reject the null hypothesis.
Let’s repeat the earlier example to see what happens when we add a control variable to the two-way table: bys female: tab hsci nonwhite, col chi2