Analysis of Categorical Data Nick Jackson University of Southern California Department of Psychology 10/11/2013
Overview • Data Types • Contingency Tables • Logit Models • Binomial • Ordinal • Nominal
Things not covered (but still fit into the topic) • Matched pairs/repeated measures • McNemar’s Chi-Square • Reliability • Cohen’s Kappa • ROC • Poisson (Count) models • Categorical SEM • Tetrachoric Correlation • Bernoulli Trials
Data Types (Levels of Measurement)
Continuous/Quantitative
Discrete/Categorical/Qualitative:
Nominal/Multinomial: • Properties: • Values arbitrary (no magnitude) • No direction (no ordering) • Example: • Race: 1=AA, 2=Ca, 3=As • Measures: • Mode, relative frequency
Rank Order/Ordinal: • Properties: • Values semi-arbitrary (no magnitude?) • Have direction (ordering) • Example: • Likert Scales (LICK-URT): • 1-5, Strongly Disagree to Strongly Agree • Measures: • Mode, relative frequency, median • Mean?
Binary/Dichotomous/Binomial: • Properties: • 2 Levels • Special case of Ordinal or Multinomial • Examples: • Gender (Multinomial) • Disease (Y/N) • Measures: • Mode, relative frequency • Mean?
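The "Measures" bullets above can be tried out on small coded variables. This is a minimal sketch using only the Python standard library; the three data lists are invented for illustration and are not from the deck. It also suggests one answer to the "Mean?" question for binary data: with 0/1 coding, the mean is simply the proportion of 1s.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical coded responses, invented for illustration only.
race = [1, 2, 2, 3, 2, 1]            # nominal: codes are arbitrary labels
agree = [1, 2, 2, 3, 4, 4, 4, 5]     # ordinal: 1=Strongly Disagree .. 5=Strongly Agree
disease = [0, 1, 0, 0, 1, 0, 0, 0]   # binary: 1=Yes, 0=No

# Nominal: only the mode and relative frequencies are meaningful.
race_mode = mode(race)
race_rel_freq = {code: count / len(race) for code, count in Counter(race).items()}

# Ordinal: values are ordered, so the median is meaningful as well.
agree_median = median(agree)

# Binary coded 0/1: the mean equals the proportion of "Yes" responses.
disease_mean = mean(disease)

print(race_mode, race_rel_freq, agree_median, disease_mean)
```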
Code 1.1 Contingency Tables • Often called Two-way tables or Cross-Tab • Have dimensions I x J • Can be used to test hypotheses of association between categorical variables
Contingency Tables: Test of Independence • Chi-Square Test of Independence (χ2) • Calculate χ2: $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$ • Where: Oi = Observed Freq, Ei = Expected Freq, n = number of cells in table • Determine DF: (I-1) * (J-1) • Compare to χ2 critical value for given DF.
Example table margins: R1=156, R2=664, N=820; C1=265, C2=331, C3=264
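The three steps above (calculate χ2, determine DF, compare to the critical value) can be sketched in a few lines of Python. The cell counts are the 2x2 alcohol use by depression counts that appear in the measures-of-association slide of this deck, since the cell counts for the 2x3 table are not given. Expected frequencies are computed as row total * column total / N, and 3.841 is the standard χ2 critical value for DF=1 at α=0.05.

```python
# 2x2 counts from the deck's alcohol use (rows) by depression (columns) example.
observed = [[25, 10],
            [20, 45]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Pearson chi-square: sum over cells of (O - E)^2 / E.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n_total  # expected under independence
        chi2 += (o_ij - e_ij) ** 2 / e_ij

# Degrees of freedom: (I - 1) * (J - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_05_DF1 = 3.841  # chi-square critical value for df=1, alpha=0.05
reject_h0 = chi2 > CRITICAL_05_DF1

print(f"chi2 = {chi2:.2f}, df = {df}, reject H0 at 0.05: {reject_h0}")
```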
Code 1.2 Contingency Tables: Test of Independence • Pearson Chi-Square Test of Independence (χ2) • H0: No Association • HA: Association….where, how? • Not appropriate when expected cell frequencies (Ei) < 5 • Use Fisher’s Exact Test instead
Example table margins: R1=156, R2=664, N=820; C1=265, C2=331, C3=264
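For small expected counts, an exact test can be computed directly from the hypergeometric distribution. The sketch below is one common way to define a two-sided Fisher exact p-value for a 2x2 table (summing all tables that are no more likely than the observed one); the function name and the example counts are hypothetical and chosen only so that every expected frequency is below 5.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum every table that is no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Hypothetical small table: every expected count is 2, so Pearson's test is unsuitable.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Fisher exact two-sided p = {p_value:.4f}")
```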
Contingency Tables • 2x2

                             Disorder (Outcome)
                             Yes      No       Total
Risk Factor/Exposure  Yes    a        b        a+b
                      No     c        d        c+d
Total                        a+c      b+d      a+b+c+d
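Contingency Tables: Measures of Association

                    Depression
                    Yes      No       Total
Alcohol Use  Yes    a=25     b=10     35
             No     c=20     d=45     65
Total               45       55       100

• Probability: $a/(a+b) = 25/35 = 0.714$ for alcohol users; $c/(c+d) = 20/65 = 0.308$ for nonusers
• Contrasting Probability (relative risk): $0.714 / 0.308 = 2.32$ • Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.
• Odds: $a/b = 25/10 = 2.50$ for alcohol users; $c/d = 20/45 = 0.444$ for nonusers
• Contrasting Odds (odds ratio): $(a/b)/(c/d) = ad/bc = 5.625$ • The odds of depression were about 5.6 times greater in alcohol users compared to nonusers.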
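The two contrasts on this slide can be reproduced directly from the four cell counts. A minimal sketch; the variable names are illustrative only.

```python
# Cell counts from the slide: rows are alcohol use (Yes/No), columns depression (Yes/No).
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # nonusers: depressed, not depressed

# Probabilities (risks) of depression in each exposure group.
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed

# Odds of depression in each exposure group.
odds_exposed = a / b
odds_unexposed = c / d
odds_ratio = odds_exposed / odds_unexposed   # same as (a * d) / (b * c)

print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.3f}")
```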
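Why Odds Ratios? • Multiply the non-depressed column by i, for i = 1 to 45:

                    Depression
                    Yes      No        Total
Alcohol Use  Yes    a=25     b=10*i    25 + 10*i
             No     c=20     d=45*i    20 + 45*i
Total               45       55*i      45 + 55*i

• The odds ratio ad/bc does not depend on i; the relative risk does.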
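The scaling experiment in this table is easy to run. The sketch below, under the assumption that only the non-depressed column is multiplied by i as shown, recomputes both measures for i = 1 to 45: the odds ratio stays fixed while the relative risk drifts toward it as depression becomes rarer in the sample.

```python
a, c = 25, 20   # depressed counts stay fixed

results = []
for i in range(1, 46):
    b, d = 10 * i, 45 * i                        # non-depressed counts scaled by i
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)               # the factor i cancels
    results.append((i, relative_risk, odds_ratio))

first, last = results[0], results[-1]
print(f"i={first[0]}: RR={first[1]:.2f}, OR={first[2]:.3f}")
print(f"i={last[0]}: RR={last[1]:.2f}, OR={last[2]:.3f}")
```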
The Generalized Linear Model • General Linear Model (LM) • Continuous Outcomes (DV) • Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA • Generalized Linear Model (GLM) • John Nelder and Robert Wedderburn • Maximum Likelihood Estimation • Continuous, Categorical, and Count outcomes. • Distribution Family and Link Functions • Error distributions that are not normal
Logistic Regression • “This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.) • Binary Response • Predicting Probability (related to the Probit model) • Assume (the usual): • Independence • NOT Homoscedasticity or Normal Errors • Linearity (in the Log Odds) • Also….adequate cell sizes.
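Logistic Regression • The Model • In terms of probability of success π(x): $\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$ • In terms of Logits (Log Odds): $\operatorname{logit}[\pi(x)] = \ln\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$ • Logit transform gives us a linear equation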
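The claim that the logit transform gives a linear equation can be checked numerically. In this sketch ALPHA and BETA are hypothetical values chosen for illustration: probabilities follow an S-shaped curve in x, but their logits rise by exactly BETA for every 1 unit increase in x.

```python
from math import exp, log


def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return log(p / (1 - p))


def inv_logit(z):
    """Probability corresponding to a log odds value z."""
    return 1 / (1 + exp(-z))


ALPHA, BETA = -2.0, 0.5   # hypothetical intercept and slope

xs = [0, 1, 2, 3]
probs = [inv_logit(ALPHA + BETA * x) for x in xs]   # nonlinear in x
logits = [logit(p) for p in probs]                  # linear in x

# Successive differences on the logit scale are all equal to BETA.
steps = [logits[k + 1] - logits[k] for k in range(len(logits) - 1)]
print(probs, steps)
```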
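Code 2.1 Logistic Regression: Example • The Output as Logits • Outcome frequencies: Not Depressed 672 (81.95%), Depressed 148 (18.05%) • Logits: H0: β=0 • What does H0: β=0 mean? (For the intercept: log odds of 0, i.e. odds of 1 and a probability of 0.5) • Conversion to Probability: $e^{-1.51} / (1 + e^{-1.51}) \approx 0.18$ • Conversion to Odds: $e^{-1.51} \approx 0.22$ • Also = 0.1805/0.8195 = 0.22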
Code 2.2 Logistic Regression: Example • The Output as ORs • Outcome frequencies: Not Depressed 672 (81.95%), Depressed 148 (18.05%) • Odds Ratios: H0: OR=1 (equivalent to β=0 on the logit scale) • Conversion to Probability: odds/(1+odds) = 0.220/1.220 ≈ 0.18 • Conversion to Logit (log odds!) • Ln(OR) = logit • Ln(0.220)=-1.51
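The conversions between probability, odds, and logit in this example can be reproduced from the two outcome frequencies alone, since an intercept-only logistic model simply recovers the sample proportion. A short sketch:

```python
from math import exp, log

n_not_depressed, n_depressed = 672, 148   # frequencies from the slide

# Probability -> odds -> logit.
p = n_depressed / (n_depressed + n_not_depressed)   # about 0.1805
odds = p / (1 - p)                                  # about 0.220
logit_value = log(odds)                             # about -1.51

# And back again: logit -> odds -> probability.
odds_back = exp(logit_value)
p_back = odds_back / (1 + odds_back)

print(f"p = {p:.4f}, odds = {odds:.3f}, logit = {logit_value:.2f}")
```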
Code 2.3 Logistic Regression: Example • Logistic Regression w/ Single Continuous Predictor • AS LOGITS: Interpretation: A 1 unit increase in age results in a 0.013 increase in the log-odds of depression. Hmmmm….I have no concept of what a log-odds is. Interpret as something else. Logit > 0, so as age increases the risk of depression increases. • AS OR: OR = e^0.013 = 1.013. For a 1 unit increase in age, the odds of depression are multiplied by 1.013. We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].
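The logit-to-OR conversion for the age coefficient takes only a couple of lines. The 10 unit contrast at the end is an added illustration, not something reported on the slide; it shows that effects multiply on the odds scale rather than add.

```python
from math import exp

BETA_AGE = 0.013   # logit coefficient for age, from the slide

# Odds ratio for a 1 unit increase, and the corresponding percent change in odds.
or_per_unit = exp(BETA_AGE)
pct_change = (or_per_unit - 1) * 100

# Illustration only: a 10 unit increase multiplies the odds by exp(10 * beta).
or_per_10_units = exp(10 * BETA_AGE)

print(f"OR = {or_per_unit:.3f}, % change = {pct_change:.1f}%, "
      f"OR per 10 units = {or_per_10_units:.3f}")
```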
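Logistic Regression: GOF • Overall Model Likelihood-Ratio Chi-Square • Omnibus test for the model • Overall model fit? • Relative to other models • Compares specified model with Null model (no predictors) • χ2 = -2*(LL0-LL1), where LL0 and LL1 are the log-likelihoods of the null and specified models • DF = K, the number of predictor parameters estimated beyond the intercept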
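The likelihood-ratio statistic can be sketched as follows. The null log-likelihood is computed from the depression frequencies in this deck (148 depressed, 672 not), which is what an intercept-only model yields. The fitted log-likelihood is NOT given in the slides, so `ll_model` below is an assumption chosen so that the statistic equals the 2.47 reported in the deck's output. The p-value shortcut with `erfc` applies to DF=1 only.

```python
from math import erfc, log, sqrt

n_depressed, n_not_depressed = 148, 672
p_hat = n_depressed / (n_depressed + n_not_depressed)

# Log-likelihood of the null (intercept-only) model.
ll_null = n_depressed * log(p_hat) + n_not_depressed * log(1 - p_hat)

# ASSUMPTION: hypothetical fitted log-likelihood, set so that the LR
# statistic matches the 2.47 shown in the deck's output.
ll_model = ll_null + 2.47 / 2

lr_chi2 = -2 * (ll_null - ll_model)
df = 1   # one predictor added to the null model

# Upper-tail chi-square probability; this closed form holds for df = 1 only.
p_value = erfc(sqrt(lr_chi2 / 2))

print(f"LL0 = {ll_null:.2f}, LR chi2 = {lr_chi2:.2f}, p = {p_value:.3f}")
```

With the rounded statistic of 2.47 the p-value comes out near 0.116, in line with the reported p=0.1162.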
Code 2.4 Logistic Regression: GOF (Summary Measures) • Pseudo-R2 • Not the same meaning as linear regression. • There are many of them (Cox and Snell/McFadden) • Only comparable within nested models of the same outcome. • Hosmer-Lemeshow • Models with Continuous Predictors • Do the predicted probabilities agree with the observed outcomes? (χ2) • H0: Good Fit for Data, so we want p>0.05 • Order the predicted probabilities, group them (g=10) by quantiles, then compute a Chi-Square of Group * Outcome. DF=g-2 • Conservative (rarely rejects the null) • Pearson Chi-Square • Models with categorical predictors • Similar to Hosmer-Lemeshow • ROC-Area Under the Curve • Predictive accuracy/Classification
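The Hosmer-Lemeshow recipe (order the predicted probabilities, group them, compare observed with expected events per group) can be sketched in plain Python. This is a simplified illustration, not the exact routine any statistics package uses: it splits the sorted cases into equal-sized chunks and ignores tie handling. The function name and the toy data are hypothetical.

```python
def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (statistic, df) for 0/1 outcomes y and fitted probabilities p_hat."""
    pairs = sorted(zip(p_hat, y))            # order cases by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        if n_g == 0:
            continue
        observed = sum(y_i for _, y_i in chunk)    # events observed in the group
        expected = sum(p_i for p_i, _ in chunk)    # events the model expects
        # Combined contribution of the event and non-event cells.
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_g))
    return stat, groups - 2


# Toy data: 9 groups of 10 cases with predicted probability k/10 and exactly
# k events each, so observed equals expected and the statistic is about 0.
p_hat = [k / 10 for k in range(1, 10) for _ in range(10)]
y = [1 if m < k else 0 for k in range(1, 10) for m in range(10)]

stat, df = hosmer_lemeshow(y, p_hat, groups=9)
print(f"H-L chi2 = {stat:.4f}, df = {df}")
```

A large statistic relative to a χ2 distribution with g-2 degrees of freedom would indicate poor calibration.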
Code 2.5 Logistic Regression: GOF (Diagnostic Measures) • Outliers in Y (Outcome) • Pearson Residuals • Square root of the contribution to the Pearson χ2 • Deviance Residuals • Square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model. • Outliers in X (Predictors) • Leverage (Hat Matrix/Projection Matrix) • Maps the influence of observed on fitted values • Influential Observations • Pregibon’s Delta-Beta influence statistic • Similar to Cook’s-D in linear regression • Detecting Problems • Residuals vs Predictors • Leverage vs Residuals • Boxplot of Delta-Beta
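For individual 0/1 observations, the two residuals named above have simple closed forms. The sketch below assumes ungrouped binary data and a fitted probability p for each case; the function names are illustrative.

```python
from math import log, sqrt


def pearson_residual(y, p):
    """Signed square root of the case's contribution to the Pearson chi-square."""
    return (y - p) / sqrt(p * (1 - p))


def deviance_residual(y, p):
    """Signed square root of the case's contribution to the deviance."""
    log_lik = log(p) if y == 1 else log(1 - p)   # log-likelihood of this case
    sign = 1 if y == 1 else -1
    return sign * sqrt(-2 * log_lik)


# A depressed case (y=1) that the model considered unlikely (p=0.1)
# produces large residuals of both kinds.
print(pearson_residual(1, 0.1), deviance_residual(1, 0.1))
```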
Logistic Regression: GOF • L-R χ2 (df=1): 2.47, p=0.1162 • H-L GOF: Number of Groups: 10, H-L Chi2: 7.12, DF: 8, P: 0.5233 • McFadden’s R2: 0.0030
Code 2.6 Logistic Regression: Diagnostics • Linearity in the Log-Odds • Use a lowess (loess) plot • Depressed vs Age
Code 2.7 Logistic Regression: Example • Logistic Regression w/ Single Categorical Predictor • AS OR: Interpretation: The odds of depression for males are 0.299 times the odds for females. We could also say: The odds of depression are (1-0.299=.701) 70.1% lower in males compared to females. Or…why not just make males the reference so the OR is greater than 1. Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.
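Switching the reference category only inverts the odds ratio, which on the logit scale is just a sign flip of the coefficient. A small sketch using the OR from this example:

```python
from math import exp, log

OR_MALE_VS_FEMALE = 0.299   # from the slide (females are the reference)

# Percent reduction in odds for males relative to females.
pct_lower = (1 - OR_MALE_VS_FEMALE) * 100

# Making males the reference inverts the odds ratio.
or_female_vs_male = 1 / OR_MALE_VS_FEMALE

# On the logit scale the coefficient simply changes sign.
beta_male = log(OR_MALE_VS_FEMALE)
beta_female = -beta_male

print(f"{pct_lower:.1f}% lower; inverse OR = {or_female_vs_male:.2f}; "
      f"beta: {beta_male:.3f} vs {beta_female:.3f}")
```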
Ordinal Logistic Regression • Also called Ordered Logistic or Proportional Odds Model • Extension of Binary Logistic Model • >2 Ordered responses • New Assumption! • Proportional Odds • BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese) • The predictor’s effect on the outcome is the same across levels of the outcome. • Bmi3grp (1 vs 2,3) = B(age) • Bmi3grp (1,2 vs 3) = B(age)
Ordinal Logistic Regression • The Model • A latent variable model (Y*) • j= number of levels-1 • From the equation we can see that the odds ratio is assumed to be independent of the category j
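The independence of the odds ratio from the category j can be demonstrated numerically. The sketch below assumes one common parameterization of the proportional odds model, P(Y ≤ j | x) = inv_logit(κ_j − βx), which may differ in sign convention from the equation on the original slide. The cutpoints are hypothetical; the slope 0.012 is the blood pressure coefficient from the deck's ordinal example.

```python
from math import exp


def inv_logit(z):
    return 1 / (1 + exp(-z))


CUTPOINTS = [1.0, 2.5]   # hypothetical thresholds kappa_1 < kappa_2 (3 categories)
BETA = 0.012             # slope taken from the deck's blood pressure example


def category_probs(x):
    """P(Y = 1), P(Y = 2), P(Y = 3) under the assumed parameterization."""
    cumulative = [inv_logit(k - BETA * x) for k in CUTPOINTS] + [1.0]
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, len(cumulative))]


def odds_higher(x, j):
    """Odds of being above the j-th cutpoint (0-based) at predictor value x."""
    p_at_or_below = inv_logit(CUTPOINTS[j] - BETA * x)
    return (1 - p_at_or_below) / p_at_or_below


x0 = 120.0   # hypothetical blood pressure value
odds_ratios = [odds_higher(x0 + 1, j) / odds_higher(x0, j)
               for j in range(len(CUTPOINTS))]

# Both splits (1 vs 2,3 and 1,2 vs 3) give the same OR, exp(BETA).
print(category_probs(x0), odds_ratios)
```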
Code 3.1 Ordinal Logistic Regression Example • AS LOGITS: For a 1 unit increase in Blood Pressure there is a 0.012 increase in the log-odds of being in a higher BMI category. • AS OR: For a 1 unit increase in Blood Pressure the odds of being in a higher BMI category are 1.012 times greater.
Code 3.2 Ordinal Logistic Regression: GOF • Assessing Proportional Odds Assumptions • Brant Test of Parallel Regression • H0: Proportional Odds, thus want p >0.05 • Tests each predictor separately and overall • Score Test of Parallel Regression • H0: Proportional Odds, thus want p >0.05 • Approx Likelihood-ratio test • H0: Proportional Odds, thus want p >0.05
Code 3.3 Ordinal Logistic Regression: GOF • Pseudo R2 • Diagnostic Measures • Performed on the j-1 binomial logistic regressions
Multinomial Logistic Regression • Also called multinomial logit/polytomous logistic regression. • Same assumptions as the binary logistic model • >2 non-ordered responses • Or you’ve failed to meet the proportional odds (parallel regression) assumption of the Ordinal Logistic model
Multinomial Logistic Regression • The Model: $\ln\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j x, \quad j = 1, \dots, J-1$ • j = levels for the outcome • J = reference level • where x is a fixed setting of an explanatory variable • Notice how it appears we are estimating a Relative Risk and not an Odds Ratio. It’s actually an OR. • Similar to conducting separate binary logistic models, but with better type 1 error control
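The baseline-category logit model can be sketched by turning one linear predictor per non-reference level into probabilities. Everything numeric here is an assumption for illustration: the intercepts, the "Other" category, and its slope are invented, and the Evangelical slope is set to ln(1.218) on the assumption that the 21.8% change quoted in the deck's religious preference example corresponds to an OR of 1.218.

```python
from math import exp, log

REFERENCE = "Catholic"   # reference level J

# Hypothetical (intercept, slope) pairs for each non-reference level.
COEFS = {
    "Evangelical": (-1.0, log(1.218)),   # slope assumed from the deck's 21.8% figure
    "Other": (0.2, -0.1),                # invented third category
}


def probabilities(x):
    """Category probabilities at predictor value x."""
    scores = {cat: exp(a + b * x) for cat, (a, b) in COEFS.items()}
    scores[REFERENCE] = 1.0              # reference level has linear predictor 0
    total = sum(scores.values())
    return {cat: s / total for cat, s in scores.items()}


def odds_vs_reference(x, category):
    """pi_j(x) / pi_J(x): odds of a category relative to the reference level."""
    probs = probabilities(x)
    return probs[category] / probs[REFERENCE]


# A 1 unit increase in x multiplies the odds versus the reference by exp(beta_j).
ratio = odds_vs_reference(3, "Evangelical") / odds_vs_reference(2, "Evangelical")
print(probabilities(2), ratio)
```

The ratio applies to the odds of one category against the reference, not to the category's probability, which also depends on the other levels.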
Code 4.1 Multinomial Logistic Regression Example • Does degree of supernatural belief indicate a religious preference? • AS OR: For a 1 unit increase in supernatural belief, there is a (OR-1 = % change) 21.8% increase in the odds of being an Evangelical compared to Catholic.
Multinomial Logistic Regression GOF • Limited GOF tests. • Look at LR Chi-square and compare nested models. • “Essentially, all models are wrong, but some are useful” –George E.P. Box • Pseudo R2 • Similar to Ordinal • Perform tests on the j-1 binomial logistic regressions
Resources • “Categorical Data Analysis” by Alan Agresti • UCLA Stat Computing: http://www.ats.ucla.edu/stat/