This chapter focuses on estimating the proportion of individuals in a population with a certain characteristic using categorical data analysis. It covers methods for estimating proportions, conducting significance tests, and comparing two population proportions. Various examples and techniques are explored.
Chapter 10 Categorical Data Analysis
Inference for a Single Proportion (p) • Goal: Estimate the proportion of individuals in a population with a certain characteristic (p). This is equivalent to estimating a binomial probability. • Sample: Take an SRS of n individuals from the population and observe y that have the characteristic. The sample proportion is $\hat{p} = y/n$ and has the following sampling properties: mean $E(\hat{p}) = p$, standard error $SE(\hat{p}) = \sqrt{p(1-p)/n}$, and an approximately normal sampling distribution when n is large.
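Large-Sample Confidence Interval for p • Take an SRS of size n from a population where p is the true (unknown) proportion of successes. • Observe y successes and compute $\hat{p} = y/n$ • Set confidence level (1-α) and obtain $z_{\alpha/2}$ from the z-table • Interval: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$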
Example - Ginkgo and Acetazolamide for AMS • Study Goal: Measure the effect of Ginkgo and Acetazolamide on the occurrence of Acute Mountain Sickness (AMS) in Himalayan trekkers • Parameter: p = true proportion of all trekkers receiving Ginkgo & Acetazolamide who would suffer from AMS • Sample Data: n=126 trekkers received G&A, y=18 suffered from AMS
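The large-sample interval can be checked numerically for this sample. The sketch below assumes a 95% confidence level and uses only the Python standard library; the function name `wald_ci` is a hypothetical helper, not something from the slides.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(y, n, conf=0.95):
    """Large-sample confidence interval for a single proportion."""
    p_hat = y / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    se = sqrt(p_hat * (1 - p_hat) / n)             # estimated standard error
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Ginkgo & Acetazolamide sample from the slide: n = 126, y = 18
p_hat, se, (lower, upper) = wald_ci(y=18, n=126)
print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With these data the interval runs from roughly 0.08 to 0.20, so AMS rates anywhere in that range are plausible for trekkers receiving G&A.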
Wilson’s “Plus 4” Method • For moderate to small sample sizes, large-sample methods may not work well with respect to coverage probabilities • Simple approach that works well in practice (n ≥ 10): • Pretend you have 4 extra individuals, 2 successes and 2 failures • Compute the estimated sample proportion in light of the new “data”, as well as its standard error: $\tilde{p} = (y+2)/(n+4)$, $SE(\tilde{p}) = \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$
Example: Lister’s Tests with Antiseptic • Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870) • n=12 patients received antiseptic, y=1 died
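A minimal sketch of the “Plus 4” interval applied to Lister's data, assuming a 95% confidence level. Truncating the limits to the range [0, 1] is a choice made here for illustration, not something stated on the slide, and `plus_four_ci` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

def plus_four_ci(y, n, conf=0.95):
    """'Plus 4' interval: add 2 successes and 2 failures, then proceed as usual."""
    p_tilde = (y + 2) / (n + 4)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    # Assumption: clip the limits so they remain valid proportions
    lower = max(0.0, p_tilde - z * se)
    upper = min(1.0, p_tilde + z * se)
    return p_tilde, se, (lower, upper)

# Lister's antiseptic patients: n = 12, y = 1 death
p_tilde, se, (lower, upper) = plus_four_ci(y=1, n=12)
print(f"p_tilde = {p_tilde:.4f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Here the adjusted estimate is 3/16 and the unclipped lower limit is slightly negative, which is why the clipping step matters for such a small sample.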
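Significance Test for a Proportion • Goal: Test whether a proportion (p) equals some null value p0 • H0: p=p0 vs Ha: p≠p0 (2-sided), or Ha: p>p0 or p<p0 (1-sided) • Test Statistic: $z_{obs} = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}$ • Large-sample test works well when np0 and n(1-p0) are both ≥ 5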
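Ginkgo and Acetaz for AMS • Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A? • H0: p=0.25 Ha: p < 0.25 • Test Statistic: $z_{obs} = (0.143 - 0.25)/\sqrt{0.25(0.75)/126} = -2.78$ • P-value: $P(Z \le -2.78) = .0027$ • Conclusion: Strong evidence that the incidence rate is below 25% (p < 0.25)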
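The one-sided test for the AMS data can be sketched as follows. The function name is hypothetical, and the large-sample check uses the threshold of 5 from the preceding slide.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_lower_z_test(y, n, p0):
    """Large-sample z test of H0: p = p0 against Ha: p < p0."""
    # Large-sample condition: n*p0 and n*(1-p0) should both be at least 5
    if min(n * p0, n * (1 - p0)) < 5:
        raise ValueError("sample too small for the large-sample test")
    p_hat = y / n
    se_null = sqrt(p0 * (1 - p0) / n)    # standard error computed under H0
    z_obs = (p_hat - p0) / se_null
    p_value = NormalDist().cdf(z_obs)    # lower-tail area P(Z <= z_obs)
    return z_obs, p_value

z_obs, p_value = one_sided_lower_z_test(y=18, n=126, p0=0.25)
print(f"z_obs = {z_obs:.2f}, P-value = {p_value:.4f}")
```

Note that the standard error uses the null value p0 rather than the sample proportion, which is what distinguishes the test from the confidence interval calculation.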
Comparing Two Population Proportions • Goal: Compare two populations/treatments wrt a nominal (binary) outcome • Sampling Design: Independent vs Dependent Samples • Methods based on large vs small samples • Contingency tables used to summarize data • Measures of Association: Absolute Risk, Relative Risk, Odds Ratio
Contingency Tables • Tables representing all combinations of levels of explanatory and response variables • Numbers in table represent Counts of the number of cases in each cell • Row and column totals are called Marginal counts
2x2 Tables - Notation
                Outcome Present   Outcome Absent     Group Total
Group 1         y1                n1-y1              n1
Group 2         y2                n2-y2              n2
Outcome Total   y1+y2             (n1+n2)-(y1+y2)    n1+n2
Example - Firm Type/Product Quality • Groups: Not Integrated (Weave only) vs Vertically Integrated (Spin and Weave) Cotton Textile Producers • Outcomes: High Quality (High Count) vs Low Quality (Low Count) Source: Temin (1988)
                        High Quality   Low Quality   Group Total
Not Integrated          33             55            88
Vertically Integrated   5              79            84
Outcome Total           38             134           172
Notation • Proportion in Population 1 with the characteristic of interest: p1 • Sample size from Population 1: n1 • Number of individuals in Sample 1 with the characteristic of interest: y1 • Sample proportion from Sample 1 with the characteristic of interest: $\hat{p}_1 = y_1/n_1$ • Similar notation for Population/Sample 2
Example - Cotton Textile Producers • p1 - True proportion of all non-integrated firms that would produce High quality • p2 - True proportion of all vertically integrated firms that would produce High quality
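Notation (Continued) • Parameter of Primary Interest: p1-p2, the difference in the 2 population proportions with the characteristic (2 other measures given below) • Estimator: $\hat{p}_1 - \hat{p}_2$ • Standard Error (and its estimate): $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$, estimated by $\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$ • Pooled Estimated Standard Error when p1=p2=p: $\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, where $\hat{p} = (y_1+y_2)/(n_1+n_2)$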
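Cotton Textile Producers (Continued) • Parameter of Primary Interest: p1-p2, the difference in the 2 population proportions that produce High quality output • Estimator: $\hat{p}_1 - \hat{p}_2 = 33/88 - 5/84 = 0.375 - 0.060 = 0.315$ • Estimated Standard Error: $\sqrt{0.375(0.625)/88 + 0.060(0.940)/84} = 0.058$ • Pooled Estimated Standard Error when p1=p2=p: $\hat{p} = 38/172 = 0.221$, giving $\sqrt{0.221(0.779)(1/88 + 1/84)} = 0.063$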
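The estimator and both standard errors can be computed directly from the cotton textile counts (33 of 88 non-integrated and 5 of 84 vertically integrated firms producing high quality). This is a sketch with a hypothetical helper name, and the 95% level for the interval is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_summary(y1, n1, y2, n2, conf=0.95):
    """Difference in sample proportions with unpooled and pooled standard errors."""
    p1, p2 = y1 / n1, y2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)         # for the interval
    p_pool = (y1 + y2) / (n1 + n2)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # for testing p1 = p2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return diff, se, se_pooled, (diff - z * se, diff + z * se)

diff, se, se_pooled, (lower, upper) = two_proportion_summary(33, 88, 5, 84)
print(f"diff = {diff:.3f}, SE = {se:.4f}, pooled SE = {se_pooled:.4f}")
print(f"95% CI for p1 - p2: ({lower:.3f}, {upper:.3f})")
```

The whole interval lies above 0 for these data, which points toward p1 > p2.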
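Significance Tests for p1-p2 • Deciding whether p1=p2 can be done by interpreting “plausible values” of p1-p2 from the confidence interval $(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$: • If entire interval is positive, conclude p1 > p2 (p1-p2 > 0) • If entire interval is negative, conclude p1 < p2 (p1-p2 < 0) • If interval contains 0, do not conclude that p1 ≠ p2 • Alternatively, we can conduct a significance test: • H0: p1 = p2 Ha: p1 ≠ p2 (2-sided) Ha: p1 > p2 (1-sided) • Test Statistic: $z_{obs} = (\hat{p}_1 - \hat{p}_2)/\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, with pooled $\hat{p} = (y_1+y_2)/(n_1+n_2)$ • Rejection Region: $|z_{obs}| \ge z_{\alpha/2}$ (2-sided), $z_{obs} \ge z_{\alpha}$ (1-sided) • P-value: $2P(Z \ge |z_{obs}|)$ (2-sided), $P(Z \ge z_{obs})$ (1-sided)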
Example - Cotton Textile Production • Again, there is strong evidence that non-integrated firms are more likely to produce high quality output than integrated firms
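The conclusion can be reproduced with a short sketch of the pooled z test, using the counts from the firm type/product quality table. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(y1, n1, y2, n2):
    """Pooled large-sample z test of H0: p1 = p2."""
    p1, p2 = y1 / n1, y2 / n2
    p_pool = (y1 + y2) / (n1 + n2)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z_obs = (p1 - p2) / se_pooled
    std_normal = NormalDist()
    p_two_sided = 2 * std_normal.cdf(-abs(z_obs))   # 2P(Z >= |z_obs|)
    p_upper = std_normal.cdf(-z_obs)                # P(Z >= z_obs), for Ha: p1 > p2
    return z_obs, p_two_sided, p_upper

z_obs, p_two_sided, p_upper = two_proportion_z_test(33, 88, 5, 84)
print(f"z_obs = {z_obs:.2f}, two-sided P = {p_two_sided:.2e}, one-sided P = {p_upper:.2e}")
```

The tail areas are computed from the lower tail of the standard normal, which avoids subtracting a number very close to 1 when z is large.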
Associations Between Categorical Variables • Case where both explanatory (independent) variable and response (dependent) variable are qualitative • Association: The distributions of responses differ among the levels of the explanatory variable (e.g. Party affiliation by gender)
Contingency Tables • Cross-tabulations of frequency counts where the rows (typically) represent the levels of the explanatory variable and the columns represent the levels of the response variable. • Numbers within the table represent the numbers of individuals falling in the corresponding combination of levels of the two variables • Row and column totals are called the marginal distributions for the two variables
Example - Cyclones Near Antarctica • Period of Study: September 1973 - May 1975 • Explanatory Variable: Region (40-49, 50-59, 60-79) (Degrees South Latitude) • Response: Season (Aut(4), Wtr(5), Spr(4), Sum(8)) (Number of months in parentheses) • Units: Cyclones in the study area • Treating the observed cyclones as a “random sample” of all cyclones that could have occurred Source: Howarth (1983), “An Analysis of the Variability of Cyclones around Antarctica and Their Relation to Sea-Ice Extent”, Annals of the Association of American Geographers, Vol. 73, pp. 519-537
Example - Cyclones Near Antarctica For each region (row) we can compute the percentage of storms occurring during each season, which gives the conditional distribution of season within that region. Of the 1517 cyclones in the 40-49 band, 370 occurred in Autumn, a proportion of 370/1517=.244, or 24.4%.
Example - Cyclones Near Antarctica Graphical Conditional Distributions for Regions
Guidelines for Contingency Tables • Compute percentages for the response (column) variable within the categories of the explanatory (row) variable. Note that in journal articles, rows and columns may be interchanged. • Divide the cell counts by the row (explanatory category) total and multiply by 100 to obtain a percent; the row percents will add to 100 • Give a title and clearly define variables and categories. • Include row (explanatory) total sample sizes
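As an illustration of the row-percent guideline, the sketch below uses the cotton textile counts from the earlier firm type/product quality table, since the full cyclone counts are not reproduced in this text. The dictionary layout and the name `row_percents` are choices made for this example.

```python
def row_percents(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())
        result[group] = {outcome: 100 * count / total for outcome, count in counts.items()}
    return result

# Rows are the explanatory variable (firm type), columns the response (quality)
cotton = {
    "Not Integrated": {"High Quality": 33, "Low Quality": 55},
    "Vertically Integrated": {"High Quality": 5, "Low Quality": 79},
}

percents = row_percents(cotton)
for group, row in percents.items():
    n = sum(cotton[group].values())   # report the row sample size alongside
    cells = ", ".join(f"{outcome}: {pct:.1f}%" for outcome, pct in row.items())
    print(f"{group} (n={n}): {cells}")
```

Each printed row is the conditional distribution of quality for one firm type, and each sums to 100.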
Independence & Dependence • Statistically Independent: Population conditional distributions of one variable are the same across all levels of the other variable • Statistically Dependent: Conditional Distributions are not all equal • When testing, researchers typically wish to demonstrate dependence (alternative hypothesis), and wish to refute independence (null hypothesis)
Pearson’s Chi-Square Test • Can be used for nominal or ordinal explanatory and response variables • Variables can have any number of distinct levels • Tests whether the distribution of the response variable is the same for each level of the explanatory variable (H0: No association between the variables) • r = # of levels of explanatory variable • c = # of levels of response variable
Pearson’s Chi-Square Test • Intuition behind test statistic • Obtain marginal distribution of outcomes for the response variable • Apply this common distribution to all levels of the explanatory variable, by multiplying each proportion by the corresponding sample size • Measure the difference between actual cell counts and the expected cell counts in the previous step
Pearson’s Chi-Square Test • Notation to obtain test statistic • Rows represent explanatory variable (r levels) • Cols represent response variable (c levels)
        1     2     …   c     Total
1       n11   n12   …   n1c   n1.
2       n21   n22   …   n2c   n2.
…       …     …     …   …     …
r       nr1   nr2   …   nrc   nr.
Total   n.1   n.2   …   n.c   n..
Pearson’s Chi-Square Test • Observed frequency (nij): The number of individuals falling in a particular cell • Expected frequency (Eij): The number we would expect in that cell, given the sample sizes observed in the study and the assumption of independence. • Computed by multiplying the row total and the column total, and dividing by the overall sample size: $E_{ij} = n_{i.}n_{.j}/n_{..}$ • Applies the overall marginal probability of the response category to the sample size of the explanatory category
Pearson’s Chi-Square Test • Large-sample test (all Eij > 5) • H0: Variables are statistically independent (No association between variables) • Ha: Variables are statistically dependent (Association exists between variables) • Test Statistic: $X^2_{obs} = \sum_{i}\sum_{j} (n_{ij} - E_{ij})^2/E_{ij}$ • P-value: Area above $X^2_{obs}$ in the chi-squared distribution with (r-1)(c-1) degrees of freedom. (Critical values in Table 8)
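A minimal sketch of the expected counts and the test statistic for a general r x c table follows. It is demonstrated on the 2x2 cotton textile table from earlier in the chapter, because the cyclone cell counts are not reproduced in this text; `pearson_chi_square` is a hypothetical helper name.

```python
def pearson_chi_square(observed):
    """Return (X2, df, expected) for an r x c table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = (row total)(column total) / (overall sample size)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected

cotton = [[33, 55], [5, 79]]   # rows: firm type, columns: high / low quality
x2_obs, df, expected = pearson_chi_square(cotton)
print(f"X2 = {x2_obs:.2f} on {df} df")
print("all expected counts above 5:", all(e > 5 for row in expected for e in row))
```

For a 2x2 table the statistic equals the square of the pooled two-proportion z statistic, so this result agrees with the z test on the same data.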
Example - Cyclones Near Antarctica Observed Cell Counts (nij): Note that overall, (1876/9165)100% = 20.5% of all cyclones occurred in Autumn. If we apply that proportion to the 1517 that occurred in the 40-49S band, we would expect (1876/9165)(1517) = 310.5 to have occurred in the first cell of the table. The full table of Eij:
Example - Cyclones Near Antarctica Computation of $X^2_{obs}$
Example - Cyclones Near Antarctica • H0: Seasonal distribution of cyclone occurrences is independent of latitude band • Ha: Seasonal distributions of cyclone occurrences differ among latitude bands • Test Statistic: $X^2_{obs} = 71.2$ • Rejection Region: $X^2_{obs} \ge \chi^2_{.05,6} = 12.59$ • P-value: Area in chi-squared distribution with (3-1)(4-1)=6 degrees of freedom above 71.2 • From Table 8, $P(\chi^2 \ge 22.46) = .001$, so P < .001
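The table look-ups can be cross-checked without a statistics package. For an even number of degrees of freedom the chi-squared upper-tail area has a closed form, P(χ² > x) = e^{-x/2} Σ_{j<df/2} (x/2)^j / j!. The sketch below relies on that identity, so it only covers even df (which includes the 6 df needed here), and the function name is hypothetical.

```python
from math import exp, factorial

def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of the chi-squared distribution, valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2
    return exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))

print(f"P(chi2_6 > 12.59) = {chi2_upper_tail_even_df(12.59, 6):.4f}")   # critical value check
print(f"P(chi2_6 > 22.46) = {chi2_upper_tail_even_df(22.46, 6):.4f}")   # Table 8 entry check
print(f"P(chi2_6 > 71.2)  = {chi2_upper_tail_even_df(71.2, 6):.2e}")    # observed statistic
```

The first two lines recover the tabled areas of about .05 and .001, and the last shows how far below .001 the P-value for the observed statistic is.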
SPSS Output - Cyclone Example P-value
Misuses of Chi-Squared Test • Expected frequencies too small (all expected counts should be above 5; this is not required of the observed counts) • Dependent samples (the same individuals are in each row, see McNemar’s test) • Can be used for nominal or ordinal variables, but more powerful methods exist when both variables are ordinal and a directional association is hypothesized
Measures of Association • Absolute Risk (AR): p1-p2 • Relative Risk (RR): p1 / p2 • Odds Ratio (OR): o1 / o2 (o = p/(1-p)) • Note that if p1 = p2 (No association between outcome and grouping variables): • AR=0 • RR=1 • OR=1
Relative Risk • Ratio of the probability that the outcome characteristic is present for one group, relative to the other • Sample proportions with characteristic from groups 1 and 2: $\hat{p}_1 = y_1/n_1$, $\hat{p}_2 = y_2/n_2$
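Relative Risk • Estimated Relative Risk: $RR = \hat{p}_1/\hat{p}_2$ • 95% Confidence Interval for Population Relative Risk: $(RR\,e^{-1.96\sqrt{v}},\ RR\,e^{1.96\sqrt{v}})$, where $v = (1-\hat{p}_1)/y_1 + (1-\hat{p}_2)/y_2$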
Relative Risk • Interpretation • Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1 • Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1 • Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1
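A sketch of the relative risk and its interval, assuming the standard large-sample interval built on the log scale with v = (1-p̂1)/y1 + (1-p̂2)/y2. It is illustrated with the cotton textile counts from earlier in the chapter (treating "high quality" as the outcome), since the concussion table is not reproduced in this text. The function name is hypothetical.

```python
from math import exp, sqrt

def relative_risk_ci(y1, n1, y2, n2, z=1.96):
    """Estimated relative risk with a large-sample (log-scale) interval."""
    p1, p2 = y1 / n1, y2 / n2
    rr = p1 / p2
    v = (1 - p1) / y1 + (1 - p2) / y2    # estimated variance of log(RR)
    factor = exp(z * sqrt(v))
    return rr, (rr / factor, rr * factor)

rr, (rr_lower, rr_upper) = relative_risk_ci(33, 88, 5, 84)
print(f"RR = {rr:.2f}, 95% CI = ({rr_lower:.2f}, {rr_upper:.2f})")
print("entire interval above 1:", rr_lower > 1)
```

Because the interval is symmetric on the log scale, it is asymmetric around RR itself, with the upper limit further from the estimate than the lower one.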
Example - Concussions in NCAA Athletes • Units: Game exposures among college soccer players 1997-1999 • Outcome: Presence/Absence of a Concussion • Group Variable: Gender (Female vs Male) • Contingency Table of case outcomes: Source: Covassin, et al (2003)
Example - Concussions in NCAA Athletes There is strong evidence that females have a higher risk of concussion
Odds Ratio • Odds of an event is the probability it occurs divided by the probability it does not occur • Odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2 • Sample odds of the outcome for each group: $odds_1 = y_1/(n_1-y_1)$, $odds_2 = y_2/(n_2-y_2)$
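Odds Ratio • Estimated Odds Ratio: $OR = odds_1/odds_2 = y_1(n_2-y_2)/(y_2(n_1-y_1))$ • 95% Confidence Interval for Population Odds Ratio: $(OR\,e^{-1.96\sqrt{v}},\ OR\,e^{1.96\sqrt{v}})$, where $v = 1/y_1 + 1/(n_1-y_1) + 1/y_2 + 1/(n_2-y_2)$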
Odds Ratio • Interpretation • Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1 • Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1 • Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1
Osteoarthritis in Former Soccer Players • Units: 68 former British professional football players and 136 age/sex matched controls • Outcome: Presence/Absence of Osteoarthritis (OA) • Data: • Of n1=68 former professionals, y1=9 had OA, n1-y1=59 did not • Of n2=136 controls, y2=2 had OA, n2-y2=134 did not • Estimated Odds Ratio: OR = 9(134)/(2(59)) = 10.22 • Conclusion: the entire confidence interval lies above 1, so the odds of OA are higher among former professionals Source: Shepard, et al (2003)
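The odds ratio and its interval for the osteoarthritis data can be sketched as below, assuming the standard large-sample interval on the log scale with v = 1/y1 + 1/(n1-y1) + 1/y2 + 1/(n2-y2). The function name is hypothetical.

```python
from math import exp, sqrt

def odds_ratio_ci(y1, n1, y2, n2, z=1.96):
    """Estimated odds ratio with a large-sample (log-scale) interval."""
    odds1 = y1 / (n1 - y1)
    odds2 = y2 / (n2 - y2)
    odds_ratio = odds1 / odds2
    # Estimated variance of log(OR): sum of reciprocal cell counts
    v = 1 / y1 + 1 / (n1 - y1) + 1 / y2 + 1 / (n2 - y2)
    factor = exp(z * sqrt(v))
    return odds_ratio, (odds_ratio / factor, odds_ratio * factor)

# Former professionals: 9 of 68 with OA; controls: 2 of 136 with OA
odds_ratio, (or_lower, or_upper) = odds_ratio_ci(9, 68, 2, 136)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({or_lower:.2f}, {or_upper:.2f})")
print("entire interval above 1:", or_lower > 1)
```

The interval is very wide because only 2 controls had OA, and that small cell dominates the variance term.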