Categorical Data Analysis & Logistic Regression
Outline • Two-way contingency tables: RR, Odds ratio, Chi-square tests • Three-way contingency tables: Conditional independence, Homogeneous association, Common odds ratio • Logistic regression: Dichotomous response • Logistic regression: Polytomous response
First example: Aspirin & heart attacks • Clinical trial: $2 \times 2$ table of aspirin use and myocardial infarction (MI) • Test whether regular intake of aspirin reduces mortality from cardiovascular disease • Data set • Prospective sampling design: Cohort studies, Clinical trials
Second example: Smoking & heart attacks • Case-control study: table of smoking status and MI • Compare ever-smokers with nonsmokers in terms of the proportion who suffered MI • Data set • Retrospective sampling design: Case-control study, Cross-sectional design • Remark: Observational studies vs. experimental study
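Comparing proportions in a $2 \times 2$ table • Difference: $\pi_1 - \pi_2$ • Relative risk: $RR = \pi_1 / \pi_2$ • Useful when both proportions are close to 0 or 1 • In that case, RR is more informative than the difference • $RR = 1$: Response is independent of group
Example (revisited) • 1st example • $\hat\pi_1 - \hat\pi_2$ = 0.0171 - 0.0094 = 0.0077, 95% CI=(0.005, 0.011) • Taking aspirin appears to reduce the risk of heart attack • $\widehat{RR}$ = 0.0171/0.0094 = 1.82, 95% CI=(1.43, 2.30) • Risk of MI is at least 43% higher for the placebo group • 2nd example • $\hat\pi_1 - \hat\pi_2$, $\widehat{RR}$: Not estimable for MI given smoking status; the values can be computed but are meaningless, because the case-control design fixes the numbers of cases and controls • Estimate proportions in the reverse direction • Proportion of smokers given MI status: one estimate for women who suffered MI, one for women who did not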
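The difference of proportions and the relative risk quoted for the aspirin example can be reproduced with a few lines of Python. This is a minimal sketch: the cell counts are an assumption (the slide's table was not preserved; these are the counts usually reported for this study in the Agresti textbook listed in the references), and the Wald-type standard errors are one common choice, not necessarily the one used for the slides.

```python
import math

# ASSUMED counts (the slide's data table did not survive extraction).
placebo_mi, placebo_n = 189, 11034   # placebo group: MI cases, group size
aspirin_mi, aspirin_n = 104, 11037   # aspirin group: MI cases, group size

p1 = placebo_mi / placebo_n          # estimated MI proportion, placebo
p2 = aspirin_mi / aspirin_n          # estimated MI proportion, aspirin
z = 1.96                             # normal quantile for a 95% interval

# Difference of proportions with a Wald interval
diff = p1 - p2
se_diff = math.sqrt(p1 * (1 - p1) / placebo_n + p2 * (1 - p2) / aspirin_n)
diff_ci = (diff - z * se_diff, diff + z * se_diff)

# Relative risk with an interval built on the log scale
rr = p1 / p2
se_log_rr = math.sqrt((1 - p1) / (placebo_n * p1) + (1 - p2) / (aspirin_n * p2))
rr_ci = (math.exp(math.log(rr) - z * se_log_rr),
         math.exp(math.log(rr) + z * se_log_rr))

print(f"difference = {diff:.4f}, 95% CI = ({diff_ci[0]:.3f}, {diff_ci[1]:.3f})")
print(f"RR = {rr:.2f}, 95% CI = ({rr_ci[0]:.2f}, {rr_ci[1]:.2f})")
```

Under the assumed counts, the printed values should agree with the slide's figures up to rounding.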
Association measure: odds ratio • Def'n: $\theta = \dfrac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}$ • Meaning • $\theta = 1$: When two variables are independent, i.e., $\pi_1 = \pi_2$ • $\theta > 1$: When odds of success (in row 1) > (in row 2) • $0 < \theta < 1$: When odds of success (in row 1) < (in row 2) • Remark: When both variables are responses, $\theta = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}$ (called the cross-product ratio) using joint probabilities
Properties of odds ratio • Values of $\theta$ farther from 1 in a given direction represent stronger association • When one value is the inverse of the other (e.g., $\theta = 4$ and $\theta = 1/4$), the two values represent the same strength of association, but in opposite directions • Not changed when the table orientation reverses (rows become columns) • Unnecessary to identify one classification as a response variable
Example (revisited) • 1st example • $\hat\theta = 1.83$, 95% CI=(1.44, 2.33) • Estimated odds of MI are 83% higher for the placebo group • 2nd example • $\hat\theta \approx 3.8$, which serves as a rough estimate of RR because MI is rare • Women who had ever smoked were about four times as likely to suffer MI as women who had never smoked
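The odds ratio, its log-scale confidence interval, and the orientation properties listed above can be checked numerically. The sketch below assumes the same aspirin counts as the textbook version of the example (the slide's table was not preserved), and the helper names `odds_ratio` and `odds_ratio_ci` are illustrative, not from the slides.

```python
import math

def odds_ratio(table):
    """Sample odds ratio (cross-product ratio) of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def odds_ratio_ci(table, z=1.96):
    """Confidence interval for the odds ratio, built on the log scale."""
    (a, b), (c, d) = table
    log_or = math.log(odds_ratio(table))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# ASSUMED counts: rows = (placebo, aspirin), columns = (MI, no MI)
aspirin_table = [[189, 10845], [104, 10933]]

theta_hat = odds_ratio(aspirin_table)
ci_low, ci_high = odds_ratio_ci(aspirin_table)

# Transposing the table (rows become columns) leaves the odds ratio unchanged
transposed = [list(col) for col in zip(*aspirin_table)]
theta_transposed = odds_ratio(transposed)

# Swapping the two rows inverts it: same strength, opposite direction
theta_swapped = odds_ratio(aspirin_table[::-1])

print(f"odds ratio = {theta_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"transposed: {theta_transposed:.2f}, rows swapped: {theta_swapped:.2f}")
```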
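Independence tests • Hypothesis: $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$ • Two chi-square tests • Under $H_0$, estimated expected frequency $\hat\mu_{ij} = n_{i+} n_{+j} / n$ • Pearson's $X^2 = \sum_{i,j} (n_{ij} - \hat\mu_{ij})^2 / \hat\mu_{ij}$ • Likelihood ratio (LR) statistic $G^2 = 2 \sum_{i,j} n_{ij} \log(n_{ij} / \hat\mu_{ij})$ • For a large sample, $X^2$ and $G^2$ follow a chi-squared null distribution with $df = (I-1)(J-1)$ • Remark: When $\hat\mu_{ij} \ge 5$, the chi-squared approximation is good. If not, apply Fisher's exact test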
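Both independence statistics follow directly from the observed counts and the estimated expected frequencies. The following is a minimal sketch in plain Python; the function names are illustrative, the aspirin counts are assumed (the slide's table was not preserved), and the p-value helper only covers the $df = 1$ case, where the chi-squared tail reduces to a normal tail.

```python
import math

def independence_tests(table):
    """Return (Pearson X^2, LR statistic G^2, df) for an I x J table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G^2
                g2 += 2 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with df = 1."""
    return math.erfc(math.sqrt(x / 2))

# ASSUMED counts: rows = (placebo, aspirin), columns = (MI, no MI)
table = [[189, 10845], [104, 10933]]
x2, g2, df = independence_tests(table)
p_value = chi2_sf_df1(x2)

print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}, p-value = {p_value:.2g}")
```

For tables with more than one degree of freedom, a general chi-squared tail function (for example from a statistics library) would be needed in place of `chi2_sf_df1`.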
Example: AZT use & AIDS • Three-way table of AZT use, race, and development of AIDS symptoms • Study on the effects of AZT in slowing the development of AIDS symptoms • Data set
Three questions of interest in a three-way table • Conditional independence? When controlling for race, AZT treatment and development of AIDS symptoms are independent • Use Cochran-Mantel-Haenszel (CMH) test • Summarize the information from partial tables • Homogeneous association? Odds ratios between AZT treatment and development of AIDS symptoms are the same for each race • Use Breslow-Day test • Common odds ratio? Use Mantel-Haenszel estimate
Example (AZT use & AIDS revisited) • CMH=6.8 (df=1) with p-value=0.0091 • Reject conditional independence • Breslow-Day=1.39 (df=1) with p-value=0.2384 • No evidence against homogeneous association • Common odds ratio=0.49 • For each race, estimated odds of developing symptoms are about half as high for those who took AZT
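The CMH statistic and the Mantel-Haenszel common odds ratio are simple sums over the partial tables. The sketch below assumes the stratum counts from the textbook version of this example (the slide's data table was not preserved) and uses the CMH statistic without a continuity correction; the Breslow-Day test is omitted because it needs an iterative fit.

```python
import math

# ASSUMED partial tables, one per race:
# rows = (AZT yes, AZT no), columns = (symptoms yes, symptoms no)
partial_tables = {
    "white": [[14, 93], [32, 81]],
    "black": [[11, 52], [12, 43]],
}

sum_observed = sum_expected = sum_variance = 0.0
mh_numerator = mh_denominator = 0.0

for (a, b), (c, d) in partial_tables.values():
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Mean and variance of the (1,1) cell under conditional independence
    sum_observed += a
    sum_expected += row1 * col1 / n
    sum_variance += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    # Pieces of the Mantel-Haenszel common odds ratio estimate
    mh_numerator += a * d / n
    mh_denominator += b * c / n

cmh = (sum_observed - sum_expected) ** 2 / sum_variance
cmh_p_value = math.erfc(math.sqrt(cmh / 2))   # chi-squared tail, df = 1
mh_odds_ratio = mh_numerator / mh_denominator

print(f"CMH = {cmh:.2f} (df = 1), p-value = {cmh_p_value:.4f}")
print(f"Mantel-Haenszel common odds ratio = {mh_odds_ratio:.2f}")
```

Under the assumed counts, the results should be close to the CMH value of 6.8 and the common odds ratio of 0.49 reported on the slide; small differences in the p-value come from rounding the statistic.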
Overview of types of generalized linear models (GLMs) • Three components: Random component (response variable), Linear predictor (linear combination of covariates), Link function • Types of GLMs
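Logistic regression with a quantitative covariate • Model: $\text{logit}[\pi(x)] = \log\dfrac{\pi(x)}{1-\pi(x)} = \alpha + \beta x$ • Other representations • Odds $= \dfrac{\pi(x)}{1-\pi(x)} = \exp(\alpha + \beta x) = e^{\alpha}(e^{\beta})^{x}$ • Odds at level $x+1$ equals the odds at $x$ multiplied by $e^{\beta}$ • $\pi(x) = \dfrac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$ • Curve ascends ($\beta > 0$) or descends ($\beta < 0$) • The rate of change increases as $|\beta|$ increases
Example: Horseshoe crabs • Binary response • $Y = 1$ if a female crab has at least one satellite; $Y = 0$ otherwise • Covariate $x$: female crab's width (cm) • Data set
Goodness-of-fit tests • Working model: $M$; number of settings: $N$; number of parameters in $M$: $p$ • Hypothesis $H_0$: $M$ fits the data • Pearson's statistic: $X^2 = \sum (\text{observed} - \text{fitted})^2 / \text{fitted}$ • Deviance statistic: $G^2 = 2 \sum \text{observed} \cdot \log(\text{observed} / \text{fitted})$ • $X^2$ and $G^2$ approximately follow a chi-square null distribution with $df = N - p$
Inference for parameters • Interval estimation: $\hat\beta \pm z_{\alpha/2}\, SE(\hat\beta)$ • Two significance tests of $H_0: \beta = 0$ • Wald test: Use $z^2 = (\hat\beta / SE(\hat\beta))^2$ • Likelihood ratio test: Use $-2(L_0 - L_1)$, where $L_0$ and $L_1$ are the maximized log-likelihoods under $H_0$ and under the fitted model • Both tests have a large-sample chi-squared null distribution with $df = 1$
Example (Horseshoe crabs revisited) • Fitted model: $\text{logit}[\hat\pi(x)] = \hat\alpha + \hat\beta x$ with $\hat\beta = 0.497$ • $\hat\pi(x)$: larger at larger width ($\hat\beta > 0$) • There is a 64% increase in estimated odds of a satellite for each centimeter increase in width ($e^{\hat\beta} = e^{0.497} = 1.64$) • Goodness of fit: Pearson's $X^2$ with p-value=0.506; deviance $G^2$ with p-value=0.4012 • 95% CI for $\beta$=(0.298, 0.697) • Significance test: Wald=23.9 (df=1) with p-value < 0.0001; LRT=31.3 (df=1) with p-value < 0.0001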
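The multiplicative interpretation of the slope can be verified numerically. The sketch below uses hypothetical parameter values (`alpha = -3.0`, `beta = 0.5` are illustrative, not estimates from any data set in these slides) and checks that a one-unit increase in the covariate multiplies the odds by $e^{\beta}$.

```python
import math

def success_probability(x, alpha, beta):
    """pi(x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(x, alpha, beta):
    """Odds of success at covariate value x."""
    p = success_probability(x, alpha, beta)
    return p / (1.0 - p)

# HYPOTHETICAL parameter values chosen only for illustration
alpha, beta = -3.0, 0.5

odds_multiplier = odds(5.0, alpha, beta) / odds(4.0, alpha, beta)
median_level = -alpha / beta   # covariate value at which pi(x) = 0.5

print(f"odds(x+1) / odds(x) = {odds_multiplier:.4f}, exp(beta) = {math.exp(beta):.4f}")
print(f"pi(x) = 0.5 at x = {median_level}")
```

The ratio does not depend on the starting value of the covariate, which is why $e^{\beta}$ summarizes the effect with a single number.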
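The Wald statistic and the odds interpretation can be recovered from the reported confidence interval alone. This is a rough sketch: the standard error is backed out of the 95% interval (an assumption that the interval is $\hat\beta \pm 1.96\,SE$), and $\hat\beta = 0.497$ is taken as the interval midpoint rounded to three decimals.

```python
import math

# Values reported on the slide
ci_low, ci_high = 0.298, 0.697
beta_hat = 0.497              # midpoint of the reported interval (rounded)
z = 1.96

# ASSUMPTION: the interval is a Wald interval, so SE = half-width / z
standard_error = (ci_high - ci_low) / (2 * z)

wald = (beta_hat / standard_error) ** 2
wald_p_value = math.erfc(math.sqrt(wald / 2))   # chi-squared tail, df = 1

odds_multiplier = math.exp(beta_hat)
odds_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"SE = {standard_error:.3f}, Wald = {wald:.1f}, p-value = {wald_p_value:.1e}")
print(f"odds multiplier per cm = {odds_multiplier:.2f}, "
      f"95% CI = ({odds_ci[0]:.2f}, {odds_ci[1]:.2f})")
```

Exponentiating the interval endpoints gives an interval for the multiplicative effect on the odds; the recovered Wald value differs slightly from 23.9 only because the inputs are rounded.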
Logistic regression with qualitative predictors: AIDS symptoms data • Use indicator variables for representing categories of predictors • Logits implied by indicator variables
Logistic regression with qualitative predictors: AIDS symptoms data • AZT coefficient = difference between two logits (i.e., log of odds ratio) at a fixed category of race • Homogeneous association model: no AZT-by-race interaction, so the AZT odds ratio is the same for each race
Equivalence of contingency table & logistic regression • Conditional independence: CMH test vs. Wald or likelihood ratio test that the AZT coefficient is zero • Homogeneous association: Breslow-Day test vs. Goodness-of-fit test • Common odds ratio estimate: Mantel-Haenszel estimate vs. exponentiated estimate of the AZT coefficient
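The correspondence between the Mantel-Haenszel estimate and the logistic regression coefficient can be illustrated by fitting the main-effects model directly. The sketch below is a bare-bones Newton-Raphson fit in plain Python, not the procedure used for the slides; the grouped counts are assumed from the textbook version of the AZT example, and the helper names `solve` and `fit_logit` are hypothetical.

```python
import math

# ASSUMED grouped data: (AZT indicator, white indicator, symptoms, group size)
groups = [
    (1, 1, 14, 107),
    (0, 1, 32, 113),
    (1, 0, 11, 63),
    (0, 0, 12, 55),
]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logit(groups, iterations=25):
    """ML fit of logit(pi) = b0 + b1*azt + b2*white by Newton-Raphson."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for azt, white, successes, total in groups:
            x = [1.0, float(azt), float(white)]
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(3):
                score[i] += x[i] * (successes - total * p)
                for j in range(3):
                    info[i][j] += total * p * (1 - p) * x[i] * x[j]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

intercept, azt_coef, race_coef = fit_logit(groups)
azt_odds_ratio = math.exp(azt_coef)

print(f"AZT coefficient = {azt_coef:.3f}, exp(coefficient) = {azt_odds_ratio:.2f}")
```

Under the assumed counts, the exponentiated AZT coefficient should land very close to the Mantel-Haenszel common odds ratio of 0.49 reported earlier, which is the equivalence the slide describes.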
Logistic regression with mixed predictors: Horseshoe crabs data • Fitted logit equations, one per color: medium light, medium, medium dark • Each effect is interpreted controlling for the other predictor
Logistic regression: polytomous response • Model categorical responses with more than two categories • Two ways • Use generalized logits for a nominal response • Use cumulative logits for an ordinal response • Notation • $J$: number of categories • $\{\pi_1, \ldots, \pi_J\}$: response probabilities with $\sum_j \pi_j = 1$
Generalized logit model: nominal response • Baseline-category logit: Pair each category with a baseline category • $\log(\pi_j / \pi_J)$, $j = 1, \ldots, J-1$, when $J$ is the baseline • Model with a predictor $x$: $\log(\pi_j / \pi_J) = \alpha_j + \beta_j x$ • The effects vary according to the category paired with the baseline • These $J-1$ equations determine equations for all other pairs of categories • E.g., for a pair of categories $(a, b)$: $\log(\pi_a / \pi_b) = (\alpha_a - \alpha_b) + (\beta_a - \beta_b) x$ • Remark: Parameter estimates for a given pair of categories are the same no matter which category is the baseline
Example: Alligator food choice • 59 alligators sampled in Lake George, Florida • Response: Primary food type found in alligator's stomach • Fish (1), Invertebrate (2), Other (3, baseline category) • Predictor: alligator length, which ranges from 1.24 to 3.89 m • ML prediction equations • Larger alligators seem to select fish rather than invertebrates • Independence test: Food choice & length • LRT=16.8006 (df=2) with p-value=0.0002
Cumulative logit model: ordinal response • Logit of a cumulative probability: $\text{logit}[P(Y \le j)] = \log\dfrac{P(Y \le j)}{1 - P(Y \le j)} = \log\dfrac{\pi_1 + \cdots + \pi_j}{\pi_{j+1} + \cdots + \pi_J}$, $j = 1, \ldots, J-1$ • Categories 1 to $j$: combined, Categories $j+1$ to $J$: combined • Cumulative proportional odds model with a predictor $x$: $\text{logit}[P(Y \le j)] = \alpha_j + \beta x$ • The effect of $x$ is identical for all cumulative logits • Any one curve for $P(Y \le j)$ is identical to any of the others shifted to the right or shifted to the left • For two values $x_1$ and $x_2$, $\text{logit}[P(Y \le j \mid x_1)] - \text{logit}[P(Y \le j \mid x_2)] = \beta(x_1 - x_2)$ is the log of the cumulative odds ratio • Proportional to the difference between the $x$ values • Same for each cumulative probability
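Response probabilities follow from the baseline-category logits by exponentiating and normalizing, and the logit for any other pair is a difference of two baseline logits. The sketch below uses hypothetical coefficients (they are illustrative values, not the fitted alligator equations, which were not preserved in the slides).

```python
import math

def baseline_category_probs(alphas, betas, x):
    """Probabilities (pi_1, ..., pi_J) from J-1 logits log(pi_j/pi_J) = alpha_j + beta_j*x."""
    linear = [a + b * x for a, b in zip(alphas, betas)]
    denominator = 1.0 + sum(math.exp(v) for v in linear)
    probs = [math.exp(v) / denominator for v in linear]
    probs.append(1.0 / denominator)   # baseline category J
    return probs

# HYPOTHETICAL coefficients for a three-category response (category 3 is the baseline)
alphas = [1.5, 5.0]
betas = [-0.1, -2.0]
x = 2.0

probs = baseline_category_probs(alphas, betas, x)

# Logit for the non-baseline pair (1, 2), computed two ways
pair_logit_from_probs = math.log(probs[0] / probs[1])
pair_logit_from_params = (alphas[0] - alphas[1]) + (betas[0] - betas[1]) * x

print("probabilities:", [round(p, 4) for p in probs])
print(f"log(pi_1/pi_2): {pair_logit_from_probs:.4f} vs {pair_logit_from_params:.4f}")
```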
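The proportional odds property can be checked numerically: with a common slope, the log cumulative odds ratio between two covariate values is the same at every cutpoint. The sketch below uses hypothetical cutpoints and slope (illustrative values only, for a five-category response).

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

def cumulative_probs(alphas, beta, x):
    """P(Y <= j) for j = 1, ..., J-1 under logit[P(Y <= j)] = alpha_j + beta*x."""
    return [expit(a + beta * x) for a in alphas]

def category_probs(alphas, beta, x):
    """Category probabilities obtained by differencing the cumulative probabilities."""
    cumulative = cumulative_probs(alphas, beta, x) + [1.0]
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, len(cumulative))]

# HYPOTHETICAL parameters: increasing cutpoints alpha_1 < ... < alpha_4, common slope beta
alphas = [-2.0, -0.5, 0.5, 2.0]
beta = 1.0

# Log cumulative odds ratios comparing x = 1 with x = 0, one per cutpoint
log_odds_ratios = [
    logit(p1) - logit(p0)
    for p1, p0 in zip(cumulative_probs(alphas, beta, 1.0),
                      cumulative_probs(alphas, beta, 0.0))
]
probs_at_one = category_probs(alphas, beta, 1.0)

print("log cumulative odds ratios:", [round(v, 6) for v in log_odds_ratios])
print("category probabilities at x = 1:", [round(p, 4) for p in probs_at_one])
```

The cutpoints must be increasing so that the differenced category probabilities stay positive.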
Example: Political ideology & party affiliation • Response: Political ideology on a five-point ordinal scale • Predictor: Political party (Democratic, Republican)
Example: Political ideology & party affiliation • Parameter inference • $\hat\beta = 0.975$, $SE \approx 0.13$ • Democrats tend to be more liberal than Republicans • Wald=57.1 (df=1) with p-value < 0.0001 • Strong evidence of an association • 95% CI for $\beta$=(0.72, 1.23) or $e^{\beta}$=(2.1, 3.4) • Odds of a more liberal response are at least twice as high for Democrats as for Republicans • Goodness-of-fit • Test statistic with p-value=0.2957: the model fits adequately
Other logit forms for ordinal response categories • Adjacent-categories logit: $\log(\pi_{j+1} / \pi_j)$, $j = 1, \ldots, J-1$ • Adjacent-categories logits determine the logits for all pairs of response categories • Continuation-ratio logit • Form 1: $\log\dfrac{\pi_1 + \cdots + \pi_j}{\pi_{j+1}}$, $j = 1, \ldots, J-1$ • Contrast each category with a grouping of categories from lower levels of response scale • Form 2: $\log\dfrac{\pi_j}{\pi_{j+1} + \cdots + \pi_J}$, $j = 1, \ldots, J-1$ • Contrast each category with a grouping of categories from higher levels of response scale
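The alternative logits are all simple functions of the response probabilities. The sketch below computes them for a hypothetical probability vector; the specific orientation of each logit (which category sits in the numerator) is an assumption, since the slide's formulas were not preserved.

```python
import math

# HYPOTHETICAL response probabilities for J = 4 ordered categories
probs = [0.1, 0.2, 0.4, 0.3]
J = len(probs)

# Adjacent-categories logits: log(pi_{j+1} / pi_j)
adjacent = [math.log(probs[j + 1] / probs[j]) for j in range(J - 1)]

# Continuation-ratio logits, form 1: category j+1 against all lower categories
continuation_lower = [math.log(sum(probs[:j + 1]) / probs[j + 1]) for j in range(J - 1)]

# Continuation-ratio logits, form 2: category j against all higher categories
continuation_higher = [math.log(probs[j] / sum(probs[j + 1:])) for j in range(J - 1)]

# Adjacent-categories logits determine the logit for any pair, e.g. (1, J)
logit_first_vs_last = sum(adjacent)

print("adjacent:", [round(v, 4) for v in adjacent])
print("continuation-ratio (lower):", [round(v, 4) for v in continuation_lower])
print("continuation-ratio (higher):", [round(v, 4) for v in continuation_higher])
```

Summing consecutive adjacent-categories logits gives the logit for a non-adjacent pair, which is the sense in which they determine the logits for all pairs.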
Summary • Two-way contingency tables: RR, Odds ratio, Chi-square tests • Three-way contingency tables: Conditional independence, Homogeneous association, Common odds ratio • Logistic regression: Dichotomous response • Logistic regression: Polytomous response
References • Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley: New York. (A 2nd edition is also available.) • Stokes, M.E., Davis, C.S., and Koch, G.G. (2000). Categorical Data Analysis Using the SAS System, 2nd ed. SAS Institute Inc.: Cary, NC.