320 likes | 467 Views
Stats 330: Lecture 29. More on Contingency Tables. Plan of the day. In today’s lecture we continue our discussion of contingency tables. Topics Conditional and unconditional Odds ratios The product multinomial sampling model Testing goodness of fit.
E N D
Stats 330: Lecture 29 More on Contingency Tables
Plan of the day In today’s lecture we continue our discussion of contingency tables. Topics • Conditional and unconditional Odds ratios • The product multinomial sampling model • Testing goodness of fit
Odds ratios in three dimensional tables: Florida murders Conditional table for Victim = White OR = 23*265/(39*105) = 1.488400 log(OR) = 0.3977
Example: Florida murders glm(formula = counts ~ victim * defendant * dp, family = poisson, data = murder.df) Coefficients: Estimate (Intercept) 2.5649 victimw 0.5705 defendantw -2.5649 dpn 2.7081 victimw:defendantw 3.0930 victimw:dpn -1.1896 defendantw:dpn 0.2364 victimw:defendantw:dpn 0.1613 0.2364 + 0.1613 = 0.3977
Estimating odds ratios: 3 dimensional table In a 3-dimensional table, fitting the maximal model: • The log of the conditional odds ratio equals the corresponding interaction in the maximal model, plus the 3 factor interaction • The log of the conditional odds ratio equals the corresponding interaction in the homogeneous association model (3 f.i. is zero) • Also can calculate OR directly from the conditional tables, as on previous slide.
Example: Florida murders glm(formula = counts ~ victim * defendant * dp, family = poisson, data = murder.df) Coefficients: Estimate (Intercept) 2.5649 victimw 0.5705 defendantw -2.5649 dpn 2.7081 victimw:defendantw 3.0930 victimw:dpn -1.1896 defendantw:dpn 0.2364 victimw:defendantw:dpn 0.1613 0.2364 + 0.1613 = 0.3977
The product multinomial sampling model • So far, we have assumed that we are sampling at random from a single population. All the factors have been “responses” (although we didn’t analyse them as such; we used the “Poisson trick”) • Nevertheless, we were examining the joint distribution of the factors.
The product multinomial sampling model (cont) • Sometimes, we want to think of one factor A say, as representing different populations. We might want to compare the distribution of another factor B in the different populations, by taking a separate sample from each population Thus, some of the margins are fixed, not random. • In this case, we want to study the conditional distribution of B, given A.
Literary Example • Many literary scholars have used sentence length distributions (and similar statistical measures) to characterise the work of classical writers, with a view to attributing authorship. • For this to be sensible, the consistency of sentence length distributions across the whole body of work of a writer has to be established. • The table on the next slide gives the sentence length distributions on samples from each of 9 works by the Greek historian Herodotus. We want to test if the population distributions in the 9 works are identical. • Here, the factor B is the number of words, and the factor A the work.
The product multinomial sampling model (cont) • This type of sampling, where some margins are fixed, is called product multinomial sampling. • The likelihood for this type of sampling is the same as before, so we can ignore the fact that some margins are fixed. • The distribution of B will be the same for all levels of A if and only if A and B are independent. • Thus, to see if the distribution of B is the same for all levels of A, we test the independence of A and B in the joint distribution – ie test for an interaction between A and B.
Literary example: points to note • 9 random samples, not 1 • Bottom totals are fixed • Product-multinomial: 9 separate multinomials, get likelihood by multiplying them together • Can test if all 9 distributions are the same by • Fitting Poisson model using length and work as categorical explanatory variables, count as a response • Testing for an interaction between length and work.
Getting the data into R • Often the data is presented in the form of a table. • We need to transform it into a data frame with 3 variables count, work and length. • Can do this in various ways, but using expand.grid is perhaps the easiest.
The data in table form > temp.df Length W1 W2 W3 W4 W5 W6 W7 W8 W9 1 1-5 16 6 7 8 13 8 10 9 13 2 6-10 35 33 48 38 47 30 36 42 45 3 11-14 47 34 37 49 43 35 46 43 54 4 15-20 39 39 34 38 32 41 38 33 28 5 21-25 24 26 21 24 26 29 23 27 23 6 26-30 16 22 23 15 14 30 19 18 14 7 31-35 11 19 6 15 9 10 10 10 6 8 36-40 5 5 11 8 2 3 4 9 9 9 41-45 4 4 6 2 5 6 4 4 4 10 46-50 1 6 2 0 4 5 5 3 0 11 50+ 2 6 5 3 5 3 5 2 4 12 Total 200 200 200 200 200 200 200 200 200
Creating the input data frame length.labels<-as.vector(temp.df[-12,1]) work.labels<-paste(“W”,1:9, sep=“”) input.df<-data.frame(expand.grid(lengths=length.labels, work=work.labels), counts=unlist(temp.df[-12,-1]), row.names=1:99) Ouch! What are these !! > as.vector(temp.df[-12,1]) [1] "1-5""6-10""11-14""15-20""21-25""26-30“ "31-35“ "36-40" "41-45" [10] "46-50" "50+“ > paste("W",1:9,sep="") [1] "W1" "W2" "W3" "W4" "W5" "W6" "W7" "W8" "W9“ > unlist(temp.df[-12,-1]) all the counts in the form of a vector, column by column
Form of input.df > input.df lengths work counts 1 1-5 W1 16 2 6-10 W1 35 3 11-14 W1 47 4 15-20 W1 39 5 21-25 W1 24 6 26-30 W1 16 7 31-35 W1 11 8 36-40 W1 5 9 41-45 W1 4 10 46-50 W1 1 11 50+ W1 2 12 1-5 W2 6 13 6-10 W2 33 And so on . . .
Graphical comparison of frequencies • Use barplots, also mosaic plots (see 201/8) • Need to • calculate frequencies • “transpose the matrix” • label rows and columns • Draw the plot frequencies<-t(as.matrix(temp.df[-12,-1]))/200 dimnames(frequencies)<-list(work.labels, length.labels) barplot(frequencies,beside=T, legend.text=work.labels, space=c(0,5),xlab = "sentence length", ylab="frequencies”,col=heat.colors(9))
barplot(frequencies,beside=T, legend.text=work.labels, space=c(0,5),xlab = "sentence length", ylab="frequencies”,col=heat.colors(9))
Mosaic plot my.table = matrix(input.df$counts, nrow=11, ncol=9) dimnames(my.table) = list(length.labels, work.labels) mosaicplot(t(my.table), main="mosaic plot of counts")
Analysis Maximal model > herodotus.glm<-glm(counts~work*lengths,family=poisson,data=input.df) > anova(herodotus.glm, test="Chisq") Analysis of Deviance Table Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 98 1253.82 work 8 0.00 90 1253.82 1.00 lengths 10 1159.75 80 94.07 6.925e-243 work:lengths 80 94.07 0 1.816e-05 0.13 > 1-pchisq(94.07,80) [1] 0.1345596 Interaction is non-significant, indicating no significant differences between works
Testing goodness of fit • In the previous example, we concluded that the distribution of sentence lengths was the same over the 9 works studied. • This raises the question, can we model the common distribution by some standard distribution such as the Poisson or the negative binomial? • We will look at the general problem of deciding if a given distribution fits a set of count data.
General solution • Suppose the observations are counts, and we can represent the data in the form of a contingency table, as in the last example. • We want to test if the distribution of the count data is some given distribution, say the Poisson. • Using this distribution, we can work out the probability of “hitting” a particular cell, and hence calculate the log-likelihood log Lmod. • Using the deviance, we can test if the Poisson model is adequate.
Example Flying bombs on London • In 1944, the German army fired 6725 “flying bombs” at London. • The British Government needed to determine how accurate these bombs were. • The locations of the bomb hits in an area of South London were recorded and analysed to determine this.
Data collection Area in South London divided up into a square 24 x 24 grid, 576 squares in all Number of bomb hits in each square recorded
Rationale • If the aiming was random, the hits should make a random pattern in the 576 squares. • In this case, theory indicates that the distribution of the number of hits in a square should be Poisson. • We can check how accurately the German forces could aim by checking if the distribution is Poisson.
Analysis • The total number of squares is 576. • The mean number of hits per square is [(0´229) + (1´211) + (2´93) + (3´35) + (4´7) + (5´1)] /576=0.9288 no.of.squares<-c(0,1,2,3,4,5) no.of.hits<-c(229,211,93,35,7,1) mean.hits<-sum(no.of.squares*no.of.hits)/ sum(no.of.hits) > mean.hits [1] 0.9288194
Analysis (cont) • The Poisson probabilities are calculated using dpois: poisson.probs<-dpois(0:5,mean.hits) freqs<-no.of.hits/576 Can compare the Poisson and maximal probabilities with a barplot
Analysis (cont) • The log-likelihoods for the maximal and Poisson models are of the form Siyi log pi where pi is estimated by the frequencies and Poisson probabilities respectively poisson.probs<-dpois(0:5,mean.hits) freqs<-no.of.hits/576 Deviance<-2*(sum(no.of.hits*log(freqs)) - sum(no.of.hits*log(poisson.probs))) > Deviance [1] 1.499450 > 1-pchisq(Deviance,4) [1] 0.8267388
Conclusions • The agreement is remarkably good, with a large p-value • Bombs seem to be aimed randomly • Degrees of freedom are 4 • maximal model has m -1=6 -1=5 df, since there are 5 unknown parameters p1, … p5 • Poisson model has 1 df, since there is 1 unknown parameter, the Poisson mean