Modeling Correlated/Clustered Multinomial Data. Justin Newcomer Department of Mathematics and Statistics University of Maryland, Baltimore County. Probability and Statistics Day, April 28, 2007.
Joint research with Professor Nagaraj K. Neerchal, UMBC, and Jorge G. Morel, PhD, P&G Pharmaceuticals, Inc.
Motivation
Example – Forest Pollen Count, Mosimann (1962)
• In the analysis of forest pollen, the frequencies of different kinds of pollen grains are counted at various levels of a sediment core
• These counts are then used to reconstruct past vegetation changes in the area from which the core was taken
Motivation
Example – Forest Pollen Count, Mosimann (1962)
• Four arboreal types of fossil forest pollen (pine, fir, oak, and alder) were counted in the Bellas Artes core from the Valley of Mexico
• At various levels of the core, pollen was classified in clusters of 100 pollen grains
• The data: [table not preserved in this transcript]
Motivation
The Multinomial Model
• The probability function for the count vector $T = (T_1, \ldots, T_k)$ from $m$ observations:
$$P(T = t) = \frac{m!}{t_1! \cdots t_k!} \prod_{i=1}^{k} \pi_i^{t_i}, \qquad \sum_{i=1}^{k} t_i = m$$
• Key assumptions:
  • Each observation can be classified by exactly one of $k$ possible outcomes, with probabilities $\pi_1, \ldots, \pi_k$
  • All observations are independent of each other
• In our example, each pollen count comes from a cluster of 100 pollen grains, so the individual observations within a cluster can be expected to be correlated
• Such correlation violates the independence assumption of the multinomial model
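As a small illustration of the multinomial probability function, the sketch below evaluates its logarithm for a single cluster. The function name, the counts, and the proportions are hypothetical and are not taken from the pollen data.

```python
import math

def multinomial_log_pmf(counts, probs):
    """Log of m!/(t_1! ... t_k!) * prod(pi_i ** t_i) for one cluster."""
    m = sum(counts)
    # log of the multinomial coefficient, via log-gamma to avoid overflow
    log_coef = math.lgamma(m + 1) - sum(math.lgamma(t + 1) for t in counts)
    # categories with zero count contribute nothing to the product
    log_prob = sum(t * math.log(p) for t, p in zip(counts, probs) if t > 0)
    return log_coef + log_prob

# Hypothetical cluster of 100 grains and hypothetical proportions
counts = [80, 5, 10, 5]
probs = [0.80, 0.05, 0.10, 0.05]
print(multinomial_log_pmf(counts, probs))
```

Working on the log scale keeps the computation stable for clusters of 100 observations, where the factorials are far too large for direct evaluation.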
Motivation
Problem Statement
• How can we properly model these data and estimate the proportions of pollen grains?
• What are the effects of using the wrong model?
Overdispersion (Extra Variation)
Overview
• Data exhibit variances larger than those permitted by the multinomial model
• Usually caused by a lack of independence or clustering of experimental units
• “Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception.” (McCullagh and Nelder, 1989)
Overdispersion (Extra Variation)
Multinomial Overdispersion
• Usually characterized by the first two moments:
$$E(T) = m\pi, \qquad \mathrm{Var}(T) = m\{1 + \rho^2(m - 1)\}\{\mathrm{Diag}(\pi) - \pi\pi'\}$$
• The quantity $\{1 + \rho^2(m - 1)\}$ is known as the design effect (Kish, 1965)
• The parameter $\rho^2$ is known as the “intra-class” or “intra-cluster” correlation
• We use $\rho^2$ to denote a positive intra-cluster correlation, which corresponds to overdispersion
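To make the design effect concrete, the sketch below computes $1 + \rho^2(m - 1)$ and the corresponding inflated covariance matrix $m\{1 + \rho^2(m - 1)\}\{\mathrm{Diag}(\pi) - \pi\pi'\}$. The function names and the numerical values are hypothetical.

```python
import numpy as np

def design_effect(rho2, m):
    """Variance inflation factor 1 + rho^2 (m - 1) for clusters of size m."""
    return 1.0 + rho2 * (m - 1)

def overdispersed_cov(pi, rho2, m):
    """Multinomial covariance m (Diag(pi) - pi pi') scaled by the design effect."""
    pi = np.asarray(pi, dtype=float)
    multinomial_cov = m * (np.diag(pi) - np.outer(pi, pi))
    return design_effect(rho2, m) * multinomial_cov

# Hypothetical values: intra-cluster correlation 0.05, clusters of 100 grains
pi = [0.80, 0.05, 0.10, 0.05]
print(design_effect(0.05, 100))
print(overdispersed_cov(pi, 0.05, 100))
```

Even a small intra-cluster correlation gives a large inflation when the clusters are large, because the factor grows linearly in $m$.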
Parameter Estimation
• How can we properly model these data and estimate the proportions of pollen grains?
• Moment based (easily implemented in SAS – Proc Genmod)
  • Quasi-likelihood
  • Generalized estimating equations
• Likelihood based (not currently in SAS – must write your own code)
  • Finite mixture distribution
  • Dirichlet multinomial distribution
Quasi-Likelihood Estimation
Wedderburn (1974), Cox and Snell (1989)
• Here we assume that overdispersion occurs through inflation of the variances by a constant factor
• Estimate the systematic structure of the model via maximum likelihood procedures
• Inflate the variance by a suitable constant
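One common way to carry out these steps, sketched below under stated assumptions, is to fit the naive multinomial proportions, estimate the constant factor as the Pearson chi-square statistic divided by its degrees of freedom, and scale the naive covariance by that factor. The function name, the choice of $(n - 1)(k - 1)$ degrees of freedom for an intercept-only model, and the data are assumptions for illustration, not the presenter's code or the pollen data.

```python
import numpy as np

def quasi_likelihood_fit(counts):
    """Pooled proportions, Pearson dispersion estimate, and inflated covariance.

    counts: (n_clusters, k) array of category counts per cluster.
    """
    counts = np.asarray(counts, dtype=float)
    n, k = counts.shape
    m = counts.sum(axis=1)                       # cluster sizes
    pi_hat = counts.sum(axis=0) / m.sum()        # naive multinomial MLE
    expected = np.outer(m, pi_hat)
    pearson = ((counts - expected) ** 2 / expected).sum()
    phi_hat = pearson / ((n - 1) * (k - 1))      # assumed degrees of freedom
    naive_cov = (np.diag(pi_hat) - np.outer(pi_hat, pi_hat)) / m.sum()
    return pi_hat, phi_hat, phi_hat * naive_cov

# Hypothetical counts for four clusters of 100 grains each
data = [[80, 5, 10, 5], [70, 10, 15, 5], [90, 2, 5, 3], [85, 5, 5, 5]]
pi_hat, phi_hat, cov = quasi_likelihood_fit(data)
print(pi_hat, phi_hat)
print(np.sqrt(np.diag(cov)))  # inflated standard errors
```

The point estimates are unchanged from the naive fit; only the standard errors are rescaled, by the square root of the estimated factor.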
Generalized Estimating Equations (GEE)
Liang and Zeger (1986), Zeger and Liang (1986)
• Extension of quasi-likelihood to clustered and longitudinal data: [equation not preserved in this transcript]
• The generalized estimating equations are: [equation not preserved in this transcript]
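For reference, the standard form of the estimating equations in Liang and Zeger (1986) can be written as below. The notation is an assumption chosen to match the rest of these slides: $T_j$ is the response vector for cluster $j$, $\mu_j(\beta)$ is its mean under regression parameter $\beta$, $A_j$ is the diagonal matrix of variance functions, $R(\alpha)$ is a working correlation matrix, and $\phi$ is a scale parameter.

```latex
\sum_{j=1}^{n} D_j^{\top} V_j^{-1} \bigl\{ T_j - \mu_j(\beta) \bigr\} = 0,
\qquad
D_j = \frac{\partial \mu_j}{\partial \beta^{\top}},
\qquad
V_j = \phi \, A_j^{1/2} R(\alpha) A_j^{1/2}
```

When $R(\alpha)$ is the identity matrix, these reduce to the quasi-likelihood score equations for independent observations.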
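Likelihood Models for Correlated Multinomial
Dirichlet Multinomial Distribution, Mosimann (1962)
• Multinomial distribution with a Dirichlet prior on the probability vector:
$$T \mid P \sim \text{Multinomial}(P, m), \qquad P \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k)$$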
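Likelihood Models for Correlated Multinomial
Dirichlet Multinomial Distribution, Mosimann (1962)
• It can be shown that, with Dirichlet parameters $\alpha_1, \ldots, \alpha_k$ and $\alpha_+ = \sum_{i=1}^{k} \alpha_i$,
$$P(T = t) = \frac{m!}{t_1! \cdots t_k!} \, \frac{\Gamma(\alpha_+)}{\Gamma(m + \alpha_+)} \prod_{i=1}^{k} \frac{\Gamma(t_i + \alpha_i)}{\Gamma(\alpha_i)}$$
• If we let $\pi_i = \alpha_i / \alpha_+$ and $\rho^2 = 1 / (1 + \alpha_+)$, then the moments of the Dirichlet multinomial distribution are given by
$$E(T) = m\pi, \qquad \mathrm{Var}(T) = m\{1 + \rho^2(m - 1)\}\{\mathrm{Diag}(\pi) - \pi\pi'\}$$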
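The two-stage construction of the Dirichlet multinomial can be simulated directly, as in the sketch below. It assumes the parameterization $\alpha_i = \pi_i (1 - \rho^2) / \rho^2$, which gives mean proportions $\pi$ and intra-cluster correlation $\rho^2 = 1/(1 + \sum_i \alpha_i)$. The function name and the parameter values are hypothetical.

```python
import numpy as np

def sample_dirichlet_multinomial(pi, rho2, m, n_clusters, seed=0):
    """Draw n_clusters count vectors of size m from a Dirichlet multinomial."""
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    alpha = pi * (1.0 - rho2) / rho2                  # assumed parameterization
    probs = rng.dirichlet(alpha, size=n_clusters)     # one probability vector per cluster
    return np.array([rng.multinomial(m, p) for p in probs])

# Hypothetical proportions and intra-cluster correlation
samples = sample_dirichlet_multinomial([0.80, 0.05, 0.10, 0.05], 0.05, 100, 2000)
print(samples.mean(axis=0) / 100)   # close to the mean proportions
print(samples.var(axis=0))          # larger than the multinomial variance m*pi*(1-pi)
```

Comparing the empirical variances with $m\pi_i(1 - \pi_i)$ shows the extra variation that the naive multinomial model ignores.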
Likelihood Models for Correlated Multinomial
Finite Mixture of Multinomials, Morel & Neerchal (1993)
• Can be represented as: $T = YN + (X \mid N)$
• $N \sim \text{Binomial}(\rho, m)$, $Y \sim \text{Multinomial}(\pi, 1)$, $N \perp Y$
• $(X \mid N) \sim \text{Multinomial}(\pi, m - N)$ if $N < m$
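The representation $T = YN + (X \mid N)$ translates almost line by line into a simulation, sketched below. The function name and the parameter values are hypothetical. Note that NumPy's `binomial` takes the number of trials first, which is the reverse of the argument order written on the slide.

```python
import numpy as np

def sample_finite_mixture(pi, rho, m, n_clusters, seed=0):
    """Draw count vectors T = Y*N + X for clusters of size m."""
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    out = np.zeros((n_clusters, len(pi)), dtype=int)
    for j in range(n_clusters):
        n_copy = rng.binomial(m, rho)          # N: units that share the category of Y
        y = rng.multinomial(1, pi)             # Y: one-hot indicator of a single category
        x = rng.multinomial(m - n_copy, pi)    # X | N: remaining units; all zeros if N == m
        out[j] = y * n_copy + x
    return out

# Hypothetical proportions and rho
samples = sample_finite_mixture([0.80, 0.05, 0.10, 0.05], 0.2, 100, 2000)
print(samples.mean(axis=0) / 100)
```

Each cluster total is $N + (m - N) = m$, so every simulated row sums to the cluster size.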
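Likelihood Models for Correlated Multinomial
Finite Mixture of Multinomials, Morel & Neerchal (1993)
• It can be shown that: [equation not preserved in this transcript]
• If [conditions not preserved in this transcript], then the moments of the finite mixture distribution are given by: [equation not preserved in this transcript]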
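The standard results for this model, as they follow from the representation $T = YN + (X \mid N)$, can be written as below. The notation is an assumption: $e_i$ denotes the $i$-th unit vector and $X_i$ is an auxiliary multinomial vector. Conditional on $Y = e_i$, each of the $m$ units falls in category $i$ with probability $\rho$ or is drawn from $\pi$ with probability $1 - \rho$, which gives the mixture form.

```latex
P(T = t) = \sum_{i=1}^{k} \pi_i \, P(X_i = t),
\qquad
X_i \sim \text{Multinomial}\bigl((1 - \rho)\pi + \rho\, e_i,\; m\bigr)

E(T) = m\pi,
\qquad
\mathrm{Var}(T) = m \{ 1 + \rho^2 (m - 1) \} \{ \mathrm{Diag}(\pi) - \pi\pi' \}
```

The first two moments have the same form as the overdispersed multinomial moments with design effect $1 + \rho^2(m - 1)$.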
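Maximum Likelihood Estimation
Overview
• Computed using the Fisher scoring algorithm:
$$\hat{\theta}^{(r+1)} = \hat{\theta}^{(r)} + I\bigl(\hat{\theta}^{(r)}\bigr)^{-1} S\bigl(\hat{\theta}^{(r)}\bigr)$$
where $S(\theta)$ is the score vector and $I(\theta)$ is the Fisher information matrix
• The Fisher information matrix (FIM) plays an important role
• Can be computationally challenging
• Approximations are available
• The Dirichlet multinomial FIM can be computed using marginal beta-binomial moments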
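The mechanics of Fisher scoring can be seen in a one-parameter toy problem, sketched below. This is not the finite mixture or Dirichlet multinomial likelihood from the talk: it assumes independent binomial counts with a common success probability on the logit scale, where the score and the information have simple closed forms. The function name and the data are hypothetical.

```python
import math

def fisher_scoring_logit(successes, m, theta0=0.0, tol=1e-10, max_iter=50):
    """Fisher scoring for theta = logit(p) with counts ~ Binomial(m, p)."""
    n = len(successes)
    total = sum(successes)
    theta = theta0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-theta))
        score = total - n * m * p          # derivative of the log-likelihood in theta
        info = n * m * p * (1.0 - p)       # expected (Fisher) information for theta
        step = score / info
        theta += step                      # theta_new = theta + I^{-1} S
        if abs(step) < tol:
            break
    return theta

# Hypothetical counts of one pollen type in four clusters of 100 grains
theta_hat = fisher_scoring_logit([80, 70, 90, 85], 100)
p_hat = 1.0 / (1.0 + math.exp(-theta_hat))
print(theta_hat, p_hat)
```

In the multi-parameter models of the talk the same update applies with a score vector and an information matrix, and it is the computation of that matrix that becomes demanding.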
Maximum Likelihood Estimation
Example – Forest Pollen Count, Mosimann (1962)
• Maximum likelihood estimation results under the finite mixture and Dirichlet multinomial distributions: [results table not preserved in this transcript]
• Parameters: $\pi_1$ (pine), $\pi_2$ (fir), $\pi_3$ (oak), $\pi_4$ (alder), with $\pi_4 = 1 - (\pi_1 + \pi_2 + \pi_3)$
• The naïve model underestimates the standard errors
• The FM model gives smaller standard errors for the estimates of $\pi$
Maximum Likelihood Estimation
Simulation Study
• What are the effects of using the wrong model?
• After each simulation, we calculate the average of the determinants of the estimated variance-covariance matrices from each model
• A comparison of these averages gives insight into which model may be more efficient
Maximum Likelihood Estimation
Simulation Study
• The Joint Asymptotic Relative Efficiency (JARE) can be used to summarize the simulation results, as it indicates which estimator would have the smaller asymptotic variance
• For a vector parameter, JARE is the ratio of the determinants of the asymptotic variance-covariance matrices
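A ratio of determinants is straightforward to compute once the two variance-covariance matrices are available, as in the sketch below. The direction of the ratio (estimator A over estimator B) is an assumption, since the slide does not state it, and the matrices are hypothetical.

```python
import numpy as np

def jare(cov_a, cov_b):
    """Ratio det(cov_a) / det(cov_b) of two variance-covariance matrices.

    Under this (assumed) orientation, values below 1 favour estimator A.
    """
    return np.linalg.det(np.asarray(cov_a, dtype=float)) / np.linalg.det(
        np.asarray(cov_b, dtype=float)
    )

# Hypothetical 2x2 covariance matrices for two competing estimators
cov_a = [[0.010, 0.002], [0.002, 0.008]]
cov_b = [[0.015, 0.003], [0.003, 0.012]]
print(jare(cov_a, cov_b))
```

The determinant summarizes the joint variability of all parameter estimates in a single number, which is why it is used for vector parameters.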
Conclusions
• If we observe correlated/clustered multinomial data, using the naïve multinomial model causes the standard errors to be underestimated, which leads to erroneous inferences and inflated Type I error rates
• If the data truly come from a finite mixture distribution, then estimation under this model clearly outperforms the Dirichlet multinomial in terms of efficiency
• If we are unsure of the distribution, the FM model may underestimate the standard errors, and the Dirichlet multinomial model provides a safe alternative
Future Work
Extension to Include Covariates
• Covariates can be included and linked to the model parameters through “link” functions, as in the Generalized Linear Model (GLM) framework
• Obtain expressions for the efficiency of the likelihood models relative to GEE
Simulation Study
• Use simulations to see whether the likelihood models achieve gains in efficiency over GEE
• Does the inclusion of covariates change our conclusions?
• Does the choice of link function have an influence?
References
Cox, D.R. and Snell, E.J. (1989) Analysis of Binary Data. 2nd Ed. New York: Chapman and Hall.
Kish, L. (1965) Survey Sampling. New York: John Wiley & Sons.
Liang, K.Y. and Zeger, S.L. (1986) “Longitudinal data analysis using generalized linear models.” Biometrika 73: 13-22.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd Ed. London: Chapman and Hall.
Morel, J.G. and Nagaraj, N.K. (1993) “A finite mixture distribution for modelling multinomial extra variation.” Biometrika 80: 363-371.
Mosimann, J.E. (1962) “On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions.” Biometrika 49: 65-82.
Neerchal, N.K. and Morel, J.G. (1998) “Large cluster results for two parametric multinomial extra variation models.” Journal of the American Statistical Association 93: 1078-1087.
Wedderburn, R.W.M. (1974) “Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method.” Biometrika 61: 439-447.
Zeger, S.L. and Liang, K.Y. (1986) “Longitudinal data analysis for discrete and continuous outcomes.” Biometrics 42: 121-130.