Topic 2 LOGIT analysis of contingency tables
Contingency table
• A cross-classification table containing two or more classification variables; the purpose is to determine whether these variables are related.

Change in stock prices over the year, by change in stock prices in January (expected frequencies in parentheses):

                 Year UP      Year DOWN    TOTAL
January UP       22 (16.1)    1 (6.9)      23
January DOWN     6 (11.9)     11 (5.1)     17
TOTAL            28           12           40
A table of this sort can be used to test whether, as some financial analysts suggest, January is a good predictor of whether stock prices will go up or down over the entire year.
• H0: whether or not stock prices go up over the entire year is the same regardless of their behaviour in January
• H1: otherwise
• Expected frequencies are shown in parentheses in the table
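As a worked illustration of where the parenthesised values come from: under H0 the expected frequency of a cell is its row total times its column total, divided by the grand total. The symbols $R_i$, $C_j$, $n$ and $E_{ij}$ are notation introduced here, not taken from the slides.

```latex
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
E_{11} = \frac{23 \times 28}{40} = 16.1, \quad
E_{12} = \frac{23 \times 12}{40} = 6.9, \quad
E_{21} = \frac{17 \times 28}{40} = 11.9, \quad
E_{22} = \frac{17 \times 12}{40} = 5.1
```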
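Pearson's Chi-square statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies in cell $(i,j)$, and r and c are respectively the numbers of rows and columns in the table. Under H0 the statistic is approximately Chi-square distributed with (r-1)(c-1) degrees of freedom.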
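In our example, the statistic computed from the table is about 16.96 with (2-1)(2-1) = 1 degree of freedom, well above the 5% critical value of 3.84, so we reject the null. In other words, based on this evidence, the probability that stock prices will go up during the whole year does not seem to be independent of whether or not they go up in January.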
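The test can be reproduced by hand in a short DATA step. This is only a sketch: the observed counts and expected frequencies are copied from the table, and the variable names CHISQ, DF and P are chosen here for illustration.

```sas
/* Sketch: Pearson chi-square for the 2x2 stock table, computed by hand. */
/* Observed and expected frequencies are copied from the table.          */
DATA _NULL_;
  CHISQ = (22 - 16.1)**2 / 16.1
        + ( 1 -  6.9)**2 /  6.9
        + ( 6 - 11.9)**2 / 11.9
        + (11 -  5.1)**2 /  5.1;
  DF = (2 - 1) * (2 - 1);         /* (r-1)(c-1) with r = c = 2 */
  P  = 1 - PROBCHI(CHISQ, DF);    /* upper-tail p-value        */
  PUT CHISQ= DF= P=;              /* written to the SAS log    */
RUN;
```

CHISQ should come out at about 16.96, which is the value to compare with the Chi-Square line that the CHISQ option of PROC FREQ prints.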
/* F = cell count; YP = 1 if prices rose over the year; JP = 1 if prices rose in January */
DATA STOCK;
  INPUT F YP JP;
  DATALINES;
22 1 1
6 1 0
1 0 1
11 0 0
;
PROC FREQ DATA=STOCK;
  WEIGHT F;
  TABLES YP*JP/CHISQ CMH;
RUN;
Two Way Table
• Consider the following SAS program and output:

DATA PENALTY;
  INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
  INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC GENMOD DATA=PENALTY DESCENDING;
  MODEL DEATH=BLACKD/D=B;
RUN;
But suppose we don't have individual-level data. All we have is the following table of counts:

             DEATH=1    DEATH=0    TOTAL
BLACKD=0     22         52         74
BLACKD=1     28         45         73
/* One line per cell of the table; F is the cell count */
DATA CONT1;
  INPUT F BLACKD DEATH;
  DATALINES;
22 0 1
28 1 1
52 0 0
45 1 0
;
PROC GENMOD DATA=CONT1 DESCENDING;
  FREQ F;
  MODEL DEATH=BLACKD/D=B;
RUN;
Results are identical to those obtained previously.
• Alternatively, we can run the program

/* One line per value of BLACKD; DEATH = number of events, TOTAL = number of trials */
DATA CONT1;
  INPUT DEATH TOTAL BLACKD;
  DATALINES;
22 74 0
28 73 1
;
PROC GENMOD DATA=CONT1;
  MODEL DEATH/TOTAL=BLACKD/D=B;
RUN;
Points to note:
• Instead of replicating the observations, GENMOD treats the variable DEATH as having a Binomial distribution with the number of trials given by TOTAL.
• Deviance is 0. Why? The deviance is a likelihood ratio statistic that compares the fitted model with a saturated model. In the program above, the fitted model is itself saturated: it has two parameters for two data lines.
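The zero deviance can be checked by hand. With two data lines and two parameters, the maximum likelihood estimates have a closed form and the fitted probabilities reproduce the observed proportions exactly. The sketch below uses only the counts from the table; the names ODDS0, ODDS1, B0, B1, P0 and P1 are illustrative.

```sas
/* Sketch: closed-form logit estimates for the saturated 2x2 model. */
DATA _NULL_;
  ODDS0 = 22 / 52;                  /* odds of DEATH=1 when BLACKD=0           */
  ODDS1 = 28 / 45;                  /* odds of DEATH=1 when BLACKD=1           */
  B0 = LOG(ODDS0);                  /* intercept                               */
  B1 = LOG(ODDS1 / ODDS0);          /* coefficient of BLACKD (log odds ratio)  */
  P0 = 1 / (1 + EXP(-B0));          /* fitted probability, equals 22/74        */
  P1 = 1 / (1 + EXP(-(B0 + B1)));   /* fitted probability, equals 28/73        */
  PUT B0= B1= P0= P1=;
RUN;
```

Because P0 and P1 equal the observed proportions 22/74 and 28/73, the fitted and saturated likelihoods coincide and the deviance is 0.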
Three Way Table
• Consider the cross-classification table of race, gender and possession of a driver's license for a sample of 17- and 18-year-olds.
/* YES, NO = numbers with and without a license in each WHITE x MALE cell */
DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES+NO;
  DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL=WHITE MALE/D=B;
RUN;
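Deviance = 0.0583 on 1 degree of freedom (four data lines, three parameters), with a p-value of 0.8092.
• The p-value can be obtained by executing the SAS program:

DATA _NULL_;
  CHI = 1 - PROBCHI(0.0583,1);
  PUT CHI;
RUN;

• The saturated model differs from the fitted model only by the WHITE*MALE term, so there is no evidence of an interaction between the explanatory variables.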
To see this more explicitly, let us fit the model with the interaction term:

DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES+NO;
  DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL=WHITE MALE WHITE*MALE/D=B;
RUN;
Interpretation
• The coefficient of MALE is 0.6478. Exponentiating the coefficient yields 1.91 => the estimated odds of having a driver's license are nearly twice as large for males as for females, after adjusting for racial differences.
• For WHITE, the highly significant adjusted odds ratio is exp[-1.3135]=0.269, indicating that the odds of having a driver's license for whites are a little more than ¼ of the odds for blacks.
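The odds ratios quoted above can be reproduced by exponentiating the reported coefficients. This is a minimal sketch using only the two estimates given in the text; OR_MALE, OR_WHITE and OR_BLACK are names chosen here.

```sas
/* Sketch: convert the reported logit coefficients into odds ratios. */
DATA _NULL_;
  OR_MALE  = EXP( 0.6478);     /* about 1.91: males versus females       */
  OR_WHITE = EXP(-1.3135);     /* about 0.269: whites versus blacks      */
  OR_BLACK = 1 / OR_WHITE;     /* same comparison read the other way     */
  PUT OR_MALE= OR_WHITE= OR_BLACK=;
RUN;
```

Taking the reciprocal is sometimes easier to explain: an odds ratio of 0.269 for whites versus blacks is the same as odds roughly 3.7 times higher for blacks than for whites.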
Four Way Table
• Slightly more complicated with four-way tables because more interactions are possible.
• Consider the following table.
• Our goal is to estimate a LOGIT model for the dependence of working class identification on the other three variables.
/* WORKING = number identifying with the working class out of TOTAL respondents */
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL/D=B;
RUN;
The terms missing from this model are the interactions: three 2-way interactions and one 3-way interaction. Because 3-way interactions cannot be interpreted easily, let's see if we can get by with just the 2-way interactions.
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL
        FRANCE*MANUAL FRANCE*FAMANUAL MANUAL*FAMANUAL/D=B;
RUN;
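The same model can probably be written more compactly. The sketch below assumes that the bar (|) and at (@) effect shorthand is accepted by the MODEL statement of PROC GENMOD in your SAS release. This is not something the slides show, so check the fitted terms in the output against the explicit version.

```sas
/* Sketch (assumption: bar/@ shorthand is supported by PROC GENMOD).    */
/* A|B|C@2 expands to all main effects plus all 2-way interactions.     */
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE|MANUAL|FAMANUAL@2 / D=B;
RUN;
```

Dropping the @2 would request the 3-way interaction as well, which gives the saturated model: eight parameters for eight data lines.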
Examining the Wald Chi-squares, we find that FRANCE*FAMANUAL is highly significant, but the other two interaction terms are not, so we drop them and refit.
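For reference, the Wald Chi-square for a single coefficient is the squared ratio of the estimate to its standard error. The symbols below are notation introduced here, not taken from the slides.

```latex
W_j = \left( \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)} \right)^{2}
\;\sim\; \chi^2_{1} \quad \text{approximately, under } H_0 : \beta_j = 0
```

At the 5% level, a term is judged significant when $W_j$ exceeds 3.84.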
DATA WORKING;
  INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
  DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
  MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*FAMANUAL/D=B;
RUN;
Interpretations of results
• Coefficient for MANUAL: exp(2.5155) = 12.4 => manual workers have odds of identification with the working class that are more than 12 times the odds for non-manual workers.
• Coefficient for FRANCE*FAMANUAL: because of the interaction, the effect of FAMANUAL depends on the country.
• If FRANCE=0, the coefficient of FAMANUAL, -0.3802, represents the effect of FAMANUAL when the respondent lives in the U.S.
• If FRANCE=1, the effect of FAMANUAL is the sum of the FAMANUAL and FRANCE*FAMANUAL coefficients, 1.13, when the respondent lives in France; exp[1.13]=3.1.
• In France, men whose fathers had a manual occupation have odds of identification that are more than three times the odds for men whose fathers did not have a manual occupation.
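Writing the fitted model out makes the country-specific effect explicit. The $\beta$ symbols are notation introduced here; only the values -0.3802 and 1.13 come from the text, and the implied interaction coefficient is approximate because 1.13 is rounded.

```latex
\operatorname{logit}(p) = \beta_0 + \beta_1\,\mathrm{FRANCE} + \beta_2\,\mathrm{MANUAL}
  + \beta_3\,\mathrm{FAMANUAL} + \beta_4\,(\mathrm{FRANCE} \times \mathrm{FAMANUAL})

\text{effect of FAMANUAL} = \beta_3 + \beta_4\,\mathrm{FRANCE} =
\begin{cases}
\beta_3 = -0.3802, & \mathrm{FRANCE} = 0 \\
\beta_3 + \beta_4 = 1.13, & \mathrm{FRANCE} = 1
\end{cases}
\qquad\Rightarrow\qquad \beta_4 \approx 1.51
```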
Overdispersion
• Refers to a situation in which the model fits the data poorly.
• Causes of overdispersion:
   • Incorrectly specified model: more interactions or nonlinearity are needed in the model.
   • Lack of independence of observations due to unobserved heterogeneity at the group level.
DATA POSTDOC;
  INPUT NIH DOCS PDOCS;
  DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
2.106 29 10
. . .
2.329 5 2
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
PROC GENMOD DATA=POSTDOC;
  MODEL PDOCS/DOCS=NIH /D=B;
RUN;
Note that the deviance and Pearson χ² clearly indicate model mis-specification.
• Because there's only one independent variable, we don't have the option of putting in interactions.
• One can try allowing for nonlinearity by including powers of NIH in the model, but that won't help.
• It is quite possible that the lack of fit is due to a lack of independence in the observations.
There are many characteristics of biochemistry departments besides NIH funding that may have some bearing on whether their graduates seek and get postdoctoral training. Examples are the prestige of the department, whether the department is in an agricultural or medical school, the age of the department and so on.
• Lack of independence of this kind produces what is called extra-binomial variation: the variance of the dependent variable is greater than what is expected under the assumption of a binomial distribution.
Besides producing a large deviance, extra-binomial variation can result in underestimates of the standard errors and overestimates of the Chi-square statistics.
• Method of adjustment: divide the Pearson Chi-square statistic by its degrees of freedom, take the square root, and multiply all the standard errors by that number.
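Written out, the adjustment rescales by the Pearson Chi-square divided by its degrees of freedom, which is what the PSCALE option applies. The symbols $X^2$, $\hat{\phi}$, $N$ and $k$ are notation introduced here.

```latex
\hat{\phi} = \frac{X^2_{\text{Pearson}}}{N - k}, \qquad
\operatorname{se}_{\text{adj}}(\hat{\beta}_j) = \sqrt{\hat{\phi}}\;\operatorname{se}(\hat{\beta}_j), \qquad
W_j^{\text{adj}} = \frac{W_j}{\hat{\phi}}
```

Here $N$ is the number of data lines (binomial observations) and $k$ the number of estimated parameters. The coefficient estimates themselves are unchanged; only the standard errors and test statistics are rescaled.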
DATA POSTDOC;
  INPUT NIH DOCS PDOCS;
  DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
. . .
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
/* PSCALE rescales the standard errors using the Pearson Chi-square */
PROC GENMOD DATA=POSTDOC;
  MODEL PDOCS/DOCS=NIH /D=B PSCALE;
RUN;