Clustering in Generalized Linear Mixed Model Using Dirichlet Process Mixtures

Ya Xue, Xuejun Liao. April 1, 2005

Introduction • Concept drift fits within the framework of the generalized linear mixed model, but raises a new question: how to exploit the structure of the auxiliary data. • Mixtures with a countably infinite number of components can be handled in a Bayesian framework by employing Dirichlet process priors.

Outline • Part I: generalized linear mixed model • Generalized linear model (GLM) • Generalized linear mixed model (GLMM) • Advanced applications • Bayesian feature selection in GLMM • Part II: nonparametric method • Chinese restaurant process • Dirichlet process (DP) • Dirichlet process mixture models • Variational inference for Dirichlet process mixtures

Part I Generalized Linear Mixed Model

Generalized Linear Model (GLM) • A linear model specifies the relationship between a dependent (or response) variable Y and a set of predictor variables, the Xs, so that $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_q X_q + \varepsilon$. • GLM is a generalization of normal linear regression models to the exponential family (normal, Poisson, Gamma, binomial, etc.).

Generalized Linear Model (GLM) GLM differs from the linear model in two major respects: • The distribution of Y can be non-normal, and does not have to be continuous. • Y can still be predicted from a linear combination of the Xs, but they are "connected" via a link function.

Generalized Linear Model (GLM) DDE Example: binomial distribution • Scientific interest: does DDE exposure increase the risk of cancer? Test on rats. Let i index rat. • Dependent variable: $y_i = 1$ if rat $i$ develops cancer, $y_i = 0$ otherwise. • Independent variable: dose of DDE exposure, denoted by $x_i$.

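Generalized Linear Model (GLM) • Likelihood function of $y_i$: $p(y_i \mid p_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$, where $p_i$ is the probability that $y_i = 1$. • Choosing the canonical (logit) link $\log \frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i$, with intercept $\beta_0$ and slope $\beta_1$, the likelihood function becomes $p(y_i \mid x_i) = \frac{\exp\{y_i (\beta_0 + \beta_1 x_i)\}}{1 + \exp(\beta_0 + \beta_1 x_i)}$.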
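As a rough illustration of the binomial GLM with the canonical (logit) link, the sketch below evaluates the Bernoulli log-likelihood for a dose-response model. The function names, the coefficient values and the toy data are assumptions made for this example; they are not taken from the DDE bioassay.

```python
import math

def sigmoid(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta0, beta1, doses, outcomes):
    """Bernoulli log-likelihood with p_i = sigmoid(beta0 + beta1 * x_i)."""
    total = 0.0
    for x, y in zip(doses, outcomes):
        p = sigmoid(beta0 + beta1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical toy data (dose, cancer indicator); NOT the real bioassay.
doses = [0.0, 0.5, 1.0, 1.5, 2.0]
outcomes = [0, 0, 1, 0, 1]

ll_flat = log_likelihood(0.0, 0.0, doses, outcomes)   # no dose effect
ll_dose = log_likelihood(-1.5, 1.5, doses, outcomes)  # risk increasing with dose
```

On this toy data the dose-dependent coefficients give a higher log-likelihood than the flat model, which is the kind of comparison the scientific question calls for.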

GLMM – Basic Model Returning to the DDE example, 19 labs all over the world participated in this bioassay. • There are unmeasured factors that vary between the different labs. • For example, rodent diet. • GLMM is an extension of the generalized linear model that adds random effects to the linear predictor (Schall 1991).

GLMM – Basic Model • The previous linear predictor is modified as: $\eta_{ij} = x_{ij}^T \beta + z_{ij}^T b_i$, where $i$ indexes lab and $j$ indexes rat within lab $i$. • $\beta$ are “fixed” effects - parameters common to all rats. • $b_i$ are “random” effects - deviations for lab $i$.

GLMM – Basic Model • If we choose xij = zij , then all the regression coefficients are assumed to vary for the different labs. • If we choose zij = 1, then only the intercept varies for the different labs (random intercept model).
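The two choices of z_ij can be compared with a small sketch, assuming the usual GLMM linear predictor eta_ij = x_ij' beta + z_ij' b_i. The lab names, coefficient values and helper name below are hypothetical and serve only as an illustration.

```python
def linear_predictor(x, beta, z, b):
    """eta_ij = x_ij' beta + z_ij' b_i for a single rat."""
    fixed_part = sum(xk * bk for xk, bk in zip(x, beta))
    random_part = sum(zk * bk for zk, bk in zip(z, b))
    return fixed_part + random_part

beta = [-1.0, 2.0]                    # hypothetical fixed effects: intercept, dose slope
lab_effects = {"A": [0.3, -0.5],      # hypothetical lab deviations b_i
               "B": [-0.2, 0.1]}
x = [1.0, 0.5]                        # covariates of one rat: [1, dose]

# Random intercept model: z_ij = 1, so only the intercept deviates per lab.
eta_intercept = {lab: linear_predictor(x, beta, [1.0], b[:1])
                 for lab, b in lab_effects.items()}

# z_ij = x_ij: every regression coefficient deviates per lab.
eta_full = {lab: linear_predictor(x, beta, x, b)
            for lab, b in lab_effects.items()}
```

The same rat receives a different linear predictor in each lab; under the random intercept model the labs differ only by a constant shift.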

GLMM - Implementation • Gibbs sampling Disadvantage: slow convergence. Solution: hierarchical centering reparametrisation (Gelfand 1994; Gelfand 1995) • Deterministic methods are only available for logit and probit models. • EM algorithm (Anderson 1985) • Simplex method (Im 1988)

GLMM – Advanced Applications • Nested GLMM: within each lab, rats were group-housed with three rats per cage. Let i index lab, j index cage and k index rat. • Crossed GLMM: for all labs, four dose protocols were applied to different rats. Let i index lab, j index rat and k indicate the protocol applied to rat (i, j).

GLMM – Advanced Applications • Nested GLMM: within each lab, rats were group-housed with three rats per cage. Two-level GLMM: level I – lab, level II – cage. • Crossed GLMM: for all labs, four dose protocols were applied to different rats. • Rats are sorted into 19 groups by lab. • Rats are sorted into 4 groups by protocol.

GLMM – Advanced Applications • Temporal/spatial statistics: Account for correlation between the random effects at different times/locations. • Dynamic latent variable model (Dunson 2003) Let i index patient and t index follow-up time,

GLMM – Advanced Applications • Spatially varying coefficient processes (Gelfand 2003): random effects are modeled as a spatially correlated process. Possible application: a landmine field where landmines tend to be close together.

Bayesian Feature Selection in GLMM Simultaneous selection of fixed and random effects in GLMM (Cai and Dunson 2005) • Mixture prior:

Bayesian Feature Selection in GLMM • Fixed effects: choose mixture priors for the fixed effects coefficients. • Random effects: reparameterization • LDU decomposition of the random effect covariance • Choose mixture prior for the elements in the diagonal matrix.

Missing Identification in GLMM • Data table of DDE bioassay (excerpt; first column: lab):
……
Berlin 1 0.01 0.00 34.10 40.90 37.50
Berlin 1 0.01 0.00 35.70 35.60 32.10
Tokyo 0 0.01 0.00 56.50 28.90 27.10
Tokyo 1 0.01 0.00 51.50 29.90 25.90
……
• What if the first column is missing? • This is an unusual case in statistics, so few people work on it. • But this is the problem we have to solve for concept drift.

Concept Drift • Figure: primary data and auxiliary data. • If we treat the drift variable as a random variable, concept drift is a random intercept model - a special case of GLMM.

Clustering in Concept Drift • Figure: K = 51 clusters (including 0) out of 300 auxiliary data points; bin resolution = 1.

Clustering in Concept Drift • There are intrinsic clusters in the auxiliary data with respect to drift value. • “The simplest explanation is best.” (Occam's razor) • Why don’t we instead give each cluster a random effect variable?

Clustering in Concept Drift • In usual statistical applications, we know which individuals share the same random effect. • However, in concept drift, we do not know which individuals (data points or features) share the same random intercept. • Can we train the classifier and cluster the auxiliary data simultaneously? This is a new problem we aim to solve.

Clustering in Concept Drift • How many clusters (K) should we include in our model? • Does choosing K actually make sense? • Is there a better way?

Part II Nonparametric Method

Nonparametric method • Parametric method: the forms of the underlying density functions are assumed known. • Nonparametric methods form a wide category, e.g. NN, minmax, bootstrapping... • Nonparametric Bayesian method: makes use of the Bayesian calculus without prior parameterized knowledge.

Cornerstones of NBM • The Dirichlet process (DP) allows flexible structures to be learned and allows sharing of statistical strength among sets of related structures. • The Gaussian process (GP) allows sharing in the context of multiple nonparametric regressions (a separate seminar on GP is suggested).

Chinese Restaurant Process • Chinese restaurant process (CRP) is a distribution on partitions of integers. • CRP is used to represent uncertainty over the number of components in a mixture model.

Chinese Restaurant Process • Unlimited number of tables • Each table has an unlimited capacity to seat customers.

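Chinese Restaurant Process The (m+1)th customer sits at a table drawn from the following distribution: $p(\text{occupied table } i \mid \text{previous customers}) = \frac{m_i}{\alpha + m}$, $p(\text{next unoccupied table} \mid \text{previous customers}) = \frac{\alpha}{\alpha + m}$, where $m_i$ is the number of previous customers at table $i$ and $\alpha$ is a parameter.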
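The seating rule can be simulated in a few lines. This is a minimal sketch assuming the standard CRP rule, in which an occupied table i is chosen with probability m_i / (alpha + m) and a new table with probability alpha / (alpha + m). The parameter name alpha and the function names are assumptions of this example.

```python
import random

def crp_seating_probs(table_counts, alpha):
    """Seating probabilities for the next customer: one entry per occupied
    table, plus a final entry for opening a new table."""
    m = sum(table_counts)
    probs = [m_i / (alpha + m) for m_i in table_counts]
    probs.append(alpha / (alpha + m))
    return probs

def simulate_crp(num_customers, alpha, seed=0):
    """Seat customers one at a time and return the final table sizes."""
    rng = random.Random(seed)
    table_counts = []
    for _ in range(num_customers):
        probs = crp_seating_probs(table_counts, alpha)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        if choice == len(table_counts):
            table_counts.append(1)      # open a new table
        else:
            table_counts[choice] += 1   # join an occupied table
    return table_counts

probs = crp_seating_probs([3, 1], alpha=1.0)   # two tables, four customers seated
tables = simulate_crp(100, alpha=1.0)
```

With tables of sizes 3 and 1 and alpha = 1, the next customer joins them with probabilities 3/5 and 1/5, and opens a new table with probability 1/5. The number of occupied tables is not fixed in advance; it grows with the data.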

Chinese Restaurant Process Example: The probability that next customer sits at table

Chinese Restaurant Process • CRP yields an exchangeable distribution on partitions of integers, i.e., the specific ordering of the customers is irrelevant. • An infinite set of random variables is said to be infinitely exchangeable if for every finite subset $\{x_1, \dots, x_n\}$, we have $p(x_1, \dots, x_n) = p(x_{\sigma(1)}, \dots, x_{\sigma(n)})$ for any permutation $\sigma$.
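Exchangeability can be checked numerically on a small case. The sketch below assumes the standard CRP seating rule with parameter alpha (a name chosen for this example) and computes the probability of a full seating sequence. Two arrival orders that produce the same partition, one table of two customers and one table of one, receive the same probability.

```python
def crp_sequence_prob(tables, alpha):
    """Probability of a seating sequence, where tables[n] is the table label
    chosen by customer n and labels are arbitrary hashable values."""
    counts = {}
    prob = 1.0
    for m, t in enumerate(tables):          # m = number of previous customers
        if t in counts:
            prob *= counts[t] / (alpha + m) # join an occupied table
        else:
            prob *= alpha / (alpha + m)     # open a new table
        counts[t] = counts.get(t, 0) + 1
    return prob

alpha = 1.0
p_first_two_together = crp_sequence_prob(["a", "a", "b"], alpha)
p_first_and_last_together = crp_sequence_prob(["a", "b", "a"], alpha)
```

Both values equal alpha / ((alpha + 1)(alpha + 2)), which is 1/6 for alpha = 1. The probability depends only on the table sizes, not on the order of arrival.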

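Dirichlet Process $G_0$: any probability measure on the reals; $(A_1, \dots, A_k)$: a finite partition of the reals. A process $G$ is a Dirichlet process if the following equation holds for all partitions: $(G(A_1), \dots, G(A_k)) \sim \mathrm{Dir}(\alpha G_0(A_1), \dots, \alpha G_0(A_k))$, where $\alpha$ is a concentration parameter. Note: Dir - Dirichlet distribution, DP - Dirichlet process.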
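The defining property can be illustrated for one fixed partition. The sketch below assumes a standard normal base measure G0 and a partition of the reals into intervals given by cut points; both choices, and all function names, are assumptions of this example. It draws the vector of masses that G assigns to the partition cells, using the standard construction of a Dirichlet draw from normalized gamma variates.

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal, used here as the assumed base measure G0."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def base_measure_masses(cut_points):
    """G0(A_k) for the intervals defined by strictly increasing cut points."""
    edges = [0.0] + [normal_cdf(c) for c in cut_points] + [1.0]
    return [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]

def sample_partition_masses(alpha, cut_points, rng):
    """One draw of (G(A_1), ..., G(A_k)) ~ Dir(alpha*G0(A_1), ..., alpha*G0(A_k)),
    obtained by normalizing independent gamma variates."""
    shapes = [alpha * mass for mass in base_measure_masses(cut_points)]
    gammas = [rng.gammavariate(shape, 1.0) for shape in shapes]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
cuts = [-1.0, 0.0, 1.0]                 # four intervals covering the reals
masses = base_measure_masses(cuts)
draw = sample_partition_masses(5.0, cuts, rng)
```

A larger alpha concentrates the drawn masses around the base-measure masses; a small alpha tends to put most of the mass on a few cells.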

Dirichlet Process • Denote a sample from the Dirichlet process as $G \sim \mathrm{DP}(\alpha, G_0)$. • G is a distribution. • Denote a sample from the distribution G as $\theta_n \sim G$. • Figure: graphical model for a DP generating the parameters $\theta_1, \dots, \theta_N$.

Dirichlet Process Properties of DP:

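Dirichlet Process The marginal probabilities for a new $\theta_{n+1}$, with $G$ integrated out: $\theta_{n+1} \mid \theta_1, \dots, \theta_n \sim \frac{\alpha}{\alpha + n} G_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{\theta_i}$, where $\alpha$ is the concentration parameter, $G_0$ is the base measure and $\delta_{\theta}$ is a point mass at $\theta$. This is the Chinese restaurant process.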

DP Mixtures $G \sim \mathrm{DP}(\alpha, G_0)$, $\theta_n \mid G \sim G$, $x_n \mid \theta_n \sim F(\theta_n)$. If F is a normal distribution, this is a Gaussian mixture model.
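A generative sketch of a DP mixture with normal F is given below. It assumes the Chinese restaurant (urn) view of the DP with concentration parameter alpha, a normal base measure over cluster means, and a fixed observation standard deviation. These modeling choices and all names are assumptions of this example, not the model used in the talk.

```python
import random

def sample_dp_gaussian_mixture(n, alpha, base_mean=0.0, base_sd=10.0,
                               obs_sd=1.0, seed=0):
    """Draw n observations from a DP mixture of normals via the urn scheme."""
    rng = random.Random(seed)
    cluster_means, cluster_sizes = [], []
    labels, data = [], []
    for _ in range(n):
        # Existing cluster k has weight m_k; a new cluster has weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_means):
            # New cluster: draw its mean from the base measure G0.
            cluster_means.append(rng.gauss(base_mean, base_sd))
            cluster_sizes.append(0)
        cluster_sizes[k] += 1
        labels.append(k)
        data.append(rng.gauss(cluster_means[k], obs_sd))  # x_n ~ F(theta_n)
    return data, labels, cluster_means

data, labels, means = sample_dp_gaussian_mixture(200, alpha=1.0)
num_clusters = len(means)
```

The number of mixture components is an outcome of the sampling rather than an input, which is what removes the need to choose K in advance.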

Applications of DP • Infinite Gaussian Mixture Model (Rasmussen 2000) • Infinite Hidden Markov Model (Beal 2002) • Hierarchical Topic Models and the Nested Chinese Restaurant Process (Blei 2004)

Implementation of DP Gibbs sampling • If G0 is a conjugate prior for the likelihood given by F: (Escobar 1995) • Non-conjugate prior: (Neal 1998)

Variational Inference for DPM • The goal is to compute the predictive density under the DP mixture. • We also minimize the KL divergence between p and a variational distribution q. • This algorithm is based on the stick-breaking representation of the DP. (I suggest a separate seminar on the stick-breaking view of DP and variational DP.)
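The stick-breaking representation can be sketched as follows. The example assumes the standard construction, v_k ~ Beta(1, alpha) and pi_k = v_k * prod_{j<k} (1 - v_j), truncated at a fixed level by setting the last v to 1 so that the weights sum to one. The truncation level and function name are assumptions of this example, and the sketch covers only the representation, not the variational updates.

```python
import random

def stick_breaking_weights(alpha, truncation, seed=0):
    """Mixture weights from a truncated stick-breaking construction."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0                      # length of stick not yet broken off
    for k in range(truncation):
        # The last break takes the whole remaining stick (v = 1).
        v = 1.0 if k == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

weights = stick_breaking_weights(alpha=2.0, truncation=10)
```

A small alpha tends to put most of the mass on the first few components, while a large alpha spreads it over many.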

Open Questions • Can we apply ideas of infinite models beyond identifying the number of states or components in a mixture? • Under what conditions can we expect these models to give consistent estimates of densities? • ... • Specific to our problem: the model is non-conjugate because of the sigmoid function.
