Categorical Data

Categorical Data

Example: 1,073 subjects of both genders were recruited for a study where the onset of severe chest pain is recorded for each subject. • Variables: • Onset of severe chest pain (+ve / –ve) • Gender (male / female) Categorical Data Analysis • To identify any association between two categorical data.

Chi-Square Test • Commonly denoted as 2 • Useful in testing for independence between categorical variables (e.g. genetic association between cases / controls • Assumptions • Sufficiently large data in each cell in the cross-tabulation table.

Small Cell Counts • In general, require(a) Smallest expected count is 1 or more(b) At least 80% of the cells have an expected count of 5 or more • Yate’s Continuity CorrectionProvides a better approximation of the test statistic when the data is dichotomous (2  2)

Goodness-of-fit Test • Null hypothesis of a hypothesized distribution for the data. • Expected frequencies calculated under the hypothesized distribution. For example: The number of outbreaks of flu epidemics is charted over the period 1500 to 1931, and the number of outbreaks each year is tabulated. The variable of interest counts the number of outbreaks occurring in each year of that 432 year period. E.g. there were 223 years with no flu outbreaks.

Goodness-of-fit Test • Hypotheses:H0: Data follows a Poisson distribution with mean 0.692H1: Data does not follow a Poisson distribution with mean 0.692 Note: Mean 0.692 is obtained from the sample mean. Expected frequency for X = 0 = 432  P(X = 0), where X ~ Poisson(0.692) Test Statistic , with df = (6 – 1). This yields a p-value of 0.99, indicating that we will almost certainly be wrong if we reject the null hypothesis.

Test for Independence • Most common usage for Pearson’s Chi-square statistic. • Expected frequencies calculated by: • Degrees of freedom = (r – 1)  (c – 1)

Chi-Square Test

Quantification of Effect • 2-test identifies whether there is significant association between the two categorical variables. • But does not quantify the strength and direction of the association. • Need odds ratio to do this. • Odds ratio defines “how many times more likely” it is to be in one category compared to the other: • Example: For the previous example on severe chest pain, males are about 1.4 times more likely to experience severe chest pains than females.

Odds Ratio

Confidence Intervals of Odds Ratio • Not straightforward to obtain confidence intervals of odds ratio (due to complexity in obtaining the variance) • Straightforward to obtain the variance of the logarithm of odds ratio. • Odds ratio is always reported together with the p-values (obtained from Pearson’s Chi-square test), and the corresponding confidence intervals.

Case Study on Lung Cancer and Smoking Odds and Odds Ratio Odds Ratio (OR) = (1301/56)/(1205/152) = 2.93 Pearson’s Chi-square = 47.985, on df = 1 p-value = 0 Var[log(OR)] = = 0.026 95% Confidence interval= = (2.14, 4.02)

More Examples

c2 TEST FOR TREND ORsmoker = 1.52 (0.88, 2.63), p = 0.180 ORex-smoker= 2.11 (1.00, 4.51), p = 0.081 with non-smoker as reference category.

Procedure for Categorical Data Analysis • Summarise data using cross-tabulation tables, with percentages • Perform a chi-square of independence to test for association between the two categorical variables • Quantify any significant association using odds ratios • Always report odds ratios with corresponding 95% confidence interval

Case Study on Lung Cancer and Smoking • Chi-square test statistic = 46.991 • p-value = 7.13  10-12 • Odds ratio = 2.93 • 95% CI = (2.14, 4.02)

Exegesis on Epidemiology Case-Control Study • Compare affected and unaffected individuals • Usually retrospective in nature • Temporal sequence cannot be established (timing for the onset of the disease) • No information on population incidence of the disease Cohort Study • Usually random sampling of subjects within the population • Prospective, retrospective or both • Long follow-up; loss to follow-up • Costly to conduct • Temporal sequence can be established • Provides information on population incidence of the disease

Categorical Data