STA617 Advanced Categorical Data Analysis

• Instructor: Changxing Ma, Department of Biostatistics, 716 Kimball, University at Buffalo. Phone: (716) 829-2758. Email: cxma@buffalo.edu
• Days, Time: M W, 9:00 AM - 10:20 AM. Dates: 08/25/2014 - 12/05/2014. Room: Kimball 126
• Office Hours: Monday and Wednesday 10:30-11:30 in Room 716 Kimball, or by appointment.

STA617
• Course Homepage: http://www.acsu.buffalo.edu/~cxma/STA617/
• Text: Categorical Data Analysis by Alan Agresti (Second Edition, 2002, Wiley, or new edition). Homepage from the author: http://www.stat.ufl.edu/~aa/cda/cda.html
• Content: Loglinear models, models for matched pairs, analyzing repeated categorical response data, and generalized linear mixed models. We will cover Chapters 8-12 of the textbook.
• Computing: SAS

Grading
• Total 300 points:
  Homework: 100 points
  Project 1: 50 points
  Project 2 or Midterm: 50 points
  Final project/presentation: 100 points
• 5 homework sets, each worth 25 points; the top 4 scores count toward the final homework grade.

Outline of Topics: Part I: Chapters 7, 8 and 9 (logistic and loglinear models)
8. Loglinear Models for Contingency Tables
8.1 Loglinear Models for Two-Way Tables
8.2 Loglinear Models for Independence and Interaction in Three-Way Tables
8.3 Inference for Loglinear Models
8.4 Loglinear Models for Higher Dimensions
8.5 The Loglinear-Logit Model Connection
8.6 Loglinear Model Fitting: Likelihood Equations and Asymptotic Distributions
8.7 Loglinear Model Fitting: Iterative Methods and their Application

Outline of Topics:
9. Building and Extending Loglinear / Logit Models
9.1 Association Graphs and Collapsibility
9.2 Model Selection and Comparison
9.3 Diagnostics for Checking Models
9.4 Modeling Ordinal Associations

Outline of Topics: Part II: Models for discrete longitudinal data (matched pairs)
10. Models for Matched Pairs
10.1 Comparing Dependent Proportions
10.2 Conditional Logistic Regression for Binary Matched Pairs
10.3 Marginal Models for Square Contingency Tables
10.4 Symmetry, Quasi-symmetry, and Quasi-independence
10.5 Measuring Agreement Between Observers
10.6 Bradley-Terry Model for Paired Preferences
10.7 Marginal Models and Quasi-symmetry Models for Matched Sets

Outline of Topics: marginal modeling, GEE, PROC GLIMMIX
11. Analyzing Repeated Categorical Response Data
11.1 Comparing Marginal Distributions: Multiple Responses
11.2 Marginal Modeling: Maximum Likelihood Approach
11.3 Marginal Modeling: Generalized Estimating Equations Approach
11.4 Quasi-likelihood and Its GEE Multivariate Extension: Details
11.5 Markov Chains: Transitional Modeling

Outline of Topics: subject-specific models, random-effects models; PROC GLIMMIX, NLMIXED
12. Random Effects: Generalized Linear Mixed Models for Categorical Responses
12.1 Random Effects Modeling of Clustered Categorical Data
12.2 Binary Responses: Logistic-Normal Model
12.3 Examples of Random Effects Models for Binary Data
12.4 Random Effects Models for Multinomial Data
12.5 Multivariate Random Effects Models for Binary Data
12.6 GLMM Fitting, Inference, and Prediction

Chapter 8: Loglinear models for contingency tables

Two-Way Contingency Tables and Their Distributions
• Table 2.1, a 2 × 3 contingency table, is from a report on the relationship between aspirin use and heart attacks.

Aspirin and Myocardial Infarction Example
• The study randomly assigned 1360 patients who had already suffered a stroke to an aspirin treatment (one low-dose tablet a day) or to a placebo treatment.
• Follow-up: 3 years.

8.1 Loglinear Models for Two-way Tables

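8.1.4 Alternative parameter constraints
• Different sets of constraints (for example, sum-to-zero constraints, or setting the parameter for the last level of each factor to zero) give different parameter estimates.
• The estimates are different, but contrasts are unique, such as the difference between two main-effect parameters of the same factor.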
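As a sketch of the point about constraints, the independence loglinear model for an I × J table and two common constraint choices can be written as below. The symbols follow the usual Chapter 8 notation (μ_ij for the expected cell count); the choice of rows 1 and 2 in the last line is only an illustration.

```latex
% Independence model for an I x J table
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y},
\qquad i = 1,\dots,I,\quad j = 1,\dots,J

% Constraint choice 1: sum-to-zero
\sum_{i} \lambda_i^{X} = \sum_{j} \lambda_j^{Y} = 0

% Constraint choice 2: last level set to zero
% (the default coding for CLASS variables in PROC GENMOD)
\lambda_I^{X} = \lambda_J^{Y} = 0

% A contrast that is the same under either choice
\lambda_1^{X} - \lambda_2^{X}
  = \log \mu_{1j} - \log \mu_{2j}
  = \log \frac{\mu_{1j}}{\mu_{2j}}
```

The individual values of the λ parameters change with the constraint, but the difference in the last line is a log ratio of expected counts, so it does not depend on which constraint is used.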

8.1.5 Multinomial models for cell probabilities
• The intercept parameter cancels when the cell probabilities are written as expected counts divided by their total, because this parameter relates purely to the total sample size, which is random in the Poisson model but fixed in the multinomial model.
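The cancellation of the intercept can be seen by writing the cell probabilities for the independence model explicitly. This is a sketch in the standard notation, with μ_ij the Poisson expected count and π_ij the multinomial cell probability; the summation indices a and b are introduced here only for the illustration.

```latex
\pi_{ij}
  = \frac{\mu_{ij}}{\sum_{a}\sum_{b} \mu_{ab}}
  = \frac{\exp\!\left(\lambda + \lambda_i^{X} + \lambda_j^{Y}\right)}
         {\sum_{a}\sum_{b} \exp\!\left(\lambda + \lambda_a^{X} + \lambda_b^{Y}\right)}
  = \frac{\exp\!\left(\lambda_i^{X} + \lambda_j^{Y}\right)}
         {\sum_{a}\sum_{b} \exp\!\left(\lambda_a^{X} + \lambda_b^{Y}\right)}
```

The factor exp(λ) appears in both numerator and denominator, so the multinomial probabilities depend only on the row and column parameters.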

8.2 Loglinear Models for Independence and Interaction in Three-Way Tables (example)
• Table 8.3 refers to a 1992 survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio.
• 2276 students in their final year of high school were asked whether they had ever used alcohol, cigarettes, or marijuana.

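8.2 Loglinear Models for Independence and Interaction in Three-Way Tables
• Three-way contingency tables: conditional independence and homogeneous association.
• 8.2.1 Types of independence: a three-way I × J × K table cross-classifies responses X, Y, and Z, with Poisson cell means {μ_ijk} or a multinomial distribution with cell probabilities {π_ijk}.
• The three variables are mutually independent when π_ijk = π_i++ π_+j+ π_++k for all i, j, and k (Section 2.3).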
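For orientation, the main types of independence in a three-way table correspond to the following loglinear models. This is a sketch in the standard notation for expected counts μ_ijk; the model symbols in the comments, such as (XZ, YZ), list the highest-order terms included.

```latex
% Mutual independence, model (X, Y, Z)
\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}

% Joint independence of Y from (X, Z), model (Y, XZ)
\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
               + \lambda_{ik}^{XZ}

% Conditional independence of X and Y given Z, model (XZ, YZ)
\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
               + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}
```

Each model is nested in the next: dropping the two-factor terms from (XZ, YZ) one at a time leads back to mutual independence.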

8.2.2 Homogeneous association and three-factor interaction
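In the standard notation for expected counts μ_ijk in an I × J × K table, the two models named in this heading can be sketched as follows.

```latex
% Homogeneous association, model (XY, XZ, YZ):
% all two-factor terms, no three-factor term
\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
               + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}

% Saturated model (XYZ): adds the three-factor interaction
\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
               + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}
               + \lambda_{ijk}^{XYZ}
```

Under (XY, XZ, YZ) each conditional odds ratio between two variables is the same at every level of the third variable; the saturated model lets those odds ratios vary and fits the observed counts perfectly.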

8.2.3 Interpreting model parameters
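As an illustration of how the two-factor parameters are read, consider a 2 × 2 × K table under the homogeneous association model (XY, XZ, YZ). The sketch below uses the standard notation, with θ_XY(k) denoting the conditional XY odds ratio at level k of Z.

```latex
\log \theta_{XY(k)}
  = \log \frac{\mu_{11k}\,\mu_{22k}}{\mu_{12k}\,\mu_{21k}}
  = \lambda_{11}^{XY} + \lambda_{22}^{XY} - \lambda_{12}^{XY} - \lambda_{21}^{XY}
```

The right-hand side does not involve k, which is what homogeneous association means. When parameters at the last level of each factor are set to zero, the expression reduces to the single parameter λ_11^XY.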

8.2.4 Alcohol, cigarette, and marijuana use example
• Table 8.3 refers to a 1992 survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio.
• 2276 students in their final year of high school were asked whether they had ever used alcohol, cigarettes, or marijuana.

SAS code
/* Data: Table 8.3, p. 322 */
data drugs;
  input a c m count @@;
  datalines;
1 1 1 911   1 1 2 538   1 2 1 44   1 2 2 456
2 1 1 3     2 1 2 43    2 2 1 2    2 2 2 279
;
run;

/* Poisson loglinear model with all two-factor terms: (AC, AM, CM) */
proc genmod data=drugs;
  class a c m;
  model count = a c m a*m a*c c*m / dist=poi link=log lrci type3 obstats;
  ods output obstats=obstats;
run;

/* Fit one loglinear model and save its fitted values.
   model    = association terms added to the main effects (may be empty)
   varmodel = name of the variable that will hold the fitted values */
%macro modelbuild(model, varmodel);
  proc genmod data=drugs;
    class a c m;
    model count = a c m &model / dist=poi link=log lrci type3 obstats;
    ods output obstats=obstats;
  run;
  data obstats&varmodel;
    set obstats (rename=(Pred=&varmodel));
    label &varmodel="Predicted &varmodel";
    keep a c m count Observation &varmodel;
  run;
%mend;

%modelbuild(, A_C_M);
%modelbuild(A*C, AC_M);
%modelbuild(A*M C*M, AM_CM);
%modelbuild(A*C A*M C*M, AC_AM_CM);
%modelbuild(A*C A*M C*M A*C*M, ACM);

/* Put the fitted values of all five models side by side */
data all;
  merge obstatsA_C_M obstatsAC_M obstatsAM_CM obstatsAC_AM_CM obstatsACM;
  by Observation;
run;

SAS output

8.3 Inference for Loglinear Models
8.3.1 Chi-squared goodness-of-fit tests
• As usual, X² and G² test whether a model holds by comparing cell fitted values with observed counts.
• The residual df equals the number of cells minus the number of model parameters: df = N − p.
• Table 8.6 shows results of testing fit for several loglinear models for the student survey data (see Table 8.3).
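For reference, the two statistics and the df rule can be sketched as follows, writing n_i for the observed count and \hat{\mu}_i for the fitted value in cell i (an indexing introduced here for compactness). The df illustration uses the 2 × 2 × 2 student survey table, which has N = 8 cells.

```latex
% Likelihood-ratio (deviance) and Pearson statistics
G^2 = 2 \sum_{i=1}^{N} n_i \log \frac{n_i}{\hat{\mu}_i},
\qquad
X^2 = \sum_{i=1}^{N} \frac{(n_i - \hat{\mu}_i)^2}{\hat{\mu}_i}

% Residual degrees of freedom for the 2 x 2 x 2 table (N = 8)
% (A, C, M):     p = 1 + 3     = 4  ->  df = 8 - 4 = 4
% (AM, CM):      p = 1 + 3 + 2 = 6  ->  df = 8 - 6 = 2
% (AC, AM, CM):  p = 1 + 3 + 3 = 7  ->  df = 8 - 7 = 1
% (ACM):         p = 8              ->  df = 0 (saturated)
\mathrm{df} = N - p
```

With binary factors, each main effect and each two-factor term contributes one non-redundant parameter, which gives the counts in the comments.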

/* General version: uses the global macro variables &data and &maineffect,
   saves fitted values, and appends G2, X2, and df to the data set allfit */
%macro modelbuild(model, varmodel);
  proc genmod data=&data;
    class &maineffect;
    model count = &maineffect &model / dist=poi link=log lrci type3 obstats;
    ods output obstats=obstats Modelfit=Modelfit;
  run;
  data obstats&varmodel;
    set obstats (rename=(Pred=&varmodel));
    label &varmodel="Predicted &varmodel";
    /* keep the classification variables of the current data set */
    keep &maineffect count Observation &varmodel;
  run;
  /* copy the fit statistics into macro variables */
  data _NULL_;
    set Modelfit;
    if Criterion='Deviance' then call symput('G2', Value);
    if Criterion='Scaled Pearson X2' then do;
      call symput('chi2', Value);
      call symput('df', DF);
    end;
  run;
  data newfit;
    length model $ 50;
    model="&varmodel";
    G2=&G2; chi2=&chi2; DF=&DF;
  run;
  data allfit;
    set allfit newfit;
  run;
%mend;

%let maineffect=a c m;
%let data=drugs;

/* start with an empty results data set */
data allfit;
run;

%modelbuild(, A_C_M);
%modelbuild(C*M, A_CM);
%modelbuild(A*M, C_AM);
%modelbuild(A*C, M_AC);
%modelbuild(A*C A*M, AC_AM);
%modelbuild(A*C C*M, AC_CM);
%modelbuild(A*M C*M, AM_CM);
%modelbuild(A*C A*M C*M, AC_AM_CM);
%modelbuild(A*C A*M C*M A*C*M, ACM);

/* P-value of the likelihood-ratio goodness-of-fit test */
data allfit;
  set allfit;
  pvalue=1-CDF('CHISQUARE', G2, DF);
run;

/* Table 8.6, p. 324 */
proc print data=allfit;
run;

SAS output

8.3.2 Inference about conditional association
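A common way to test a conditional association is to compare two nested loglinear models. The sketch below uses generic labels M_0 (the simpler model) and M_1 (the model that adds the term of interest); these labels are introduced here for illustration.

```latex
% Likelihood-ratio comparison of nested models M_0 \subset M_1
G^2(M_0 \mid M_1) = G^2(M_0) - G^2(M_1),
\qquad
\mathrm{df} = \mathrm{df}(M_0) - \mathrm{df}(M_1)

% Example for the 2 x 2 x 2 table: testing \lambda^{AC} = 0
% M_0 = (AM, CM), df = 2;  M_1 = (AC, AM, CM), df = 1
G^2\big[(AM, CM) \mid (AC, AM, CM)\big]
  = G^2(AM, CM) - G^2(AC, AM, CM),
\qquad \mathrm{df} = 2 - 1 = 1
```

The difference is referred to a chi-squared distribution with the stated df; a large value indicates that the AC conditional association is needed.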

8.4 LOGLINEAR MODELS FOR HIGHER DIMENSIONS
• Loglinear models for three-way tables extend to multiway tables.
• As the number of dimensions increases, some complications arise.
• One is the increase in the number of possible association and interaction terms, making model selection more difficult.
• Another is the increase in the number of cells. Section 9.8 of the textbook shows that this can cause difficulties with the existence of estimates and the appropriateness of asymptotic theory.

8.4.1 Four-Way Contingency Tables
• Four-way table: W, X, Y, and Z.
• The loglinear model with all two-factor associations and no higher-order terms is denoted by (WX, WY, WZ, XY, XZ, YZ).
• Each pair of variables is conditionally dependent, with the same odds ratios at each combination of categories of the other two variables.
• The absence of a two-factor term implies conditional independence, given the other two variables.
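Written out, the model (WX, WY, WZ, XY, XZ, YZ) has the following form. This is a sketch in the standard notation, with μ_hijk the expected count in the cell at level h of W, i of X, j of Y, and k of Z.

```latex
\log \mu_{hijk}
  = \lambda + \lambda_h^{W} + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
  + \lambda_{hi}^{WX} + \lambda_{hj}^{WY} + \lambda_{hk}^{WZ}
  + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}
```

Dropping one two-factor term, for example λ^{WX}, gives the model in which W and X are conditionally independent given Y and Z.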

8.4.2 Automobile Accident Example
• 68,694 passengers in autos and light trucks involved in accidents in the state of Maine in 1991.
• Variables: gender G, location of accident L, seat-belt use S, and injury I.

/* Data: Table 8.8, p. 327 (read before the models are fitted) */
data autoaccident;
  input G $ L $ S $ x1 x2;
  I="No ";  count=x1; output;
  I="Yes";  count=x2; output;
  drop x1 x2;
  datalines;
Female Urban No 7287 996
Female Urban Yes 11587 759
Female Rural No 3246 973
Female Rural Yes 6134 757
Male Urban No 10381 812
Male Urban Yes 10969 380
Male Rural No 6123 1084
Male Rural Yes 6693 513
;
run;

%let maineffect=G L S I;
%let data=autoaccident;

/* macro names must match the data set names used in the merge below */
%modelbuild(G*I G*L G*S I*L I*S L*S, GI_GL_GS_IL_IS_LS);
%modelbuild(G*L*S G*L G*S L*S G*I I*L I*S, GLS_GI_IL_IS);

data all;
  merge obstatsGI_GL_GS_IL_IS_LS obstatsGLS_GI_IL_IS;
  by Observation;
run;

/* Table 8.8, p. 327 */
proc print data=all;
run;

SAS output

/* reset the results data set */
data allfit;
run;

%modelbuild(, model1);
%modelbuild(G*I G*L G*S I*L I*S L*S, model2);
%modelbuild(G|I|L G|I|S G|L|S I|L|S, model3);
%modelbuild(G|I|L G*S I*S L*S, model4);
%modelbuild(G|I|S G*L I*L L*S, model5);
%modelbuild(G|L|S G*I I*L I*S, model6);
%modelbuild(I|L|S G*I G*L G*S, model7);

data allfit;
  set allfit;
  pvalue=1-CDF('CHISQUARE', G2, DF);
run;

/* Table 8.9, p. 327 */
proc print data=allfit;
run;

Loglinear model fits SAS:
