350 likes | 361 Views
This study aims to identify underlying groups of patients with bipolar disorder based on their response trajectories and predict group membership based on patient covariates. The goal is to estimate response trajectories for each class and determine if there are genetic differences among patients in different classes.
E N D
Longitudinal Latent Class Models with Application to Outcomes for Bipolar Disorder January 24, 2008
Motivating Case Study: The STEP-BD Study “Systematic Treatment Enhancement Program” • Longitudinal, multi-site study of ~ 3700 patients diagnosed with Type-I and Type-II bipolar disorder • Administered an enhanced, systematic treatment program across sites • Recorded outcomes for up to 3 years on a variety of psychological and functional instruments (MADRS, YMRS, CLINSTAT) More at www.stepbd.org
Study Aims 1) Identify underlying groups of patients based on their response trajectories • Do some patients improve quickly? • Do others stay symptomatic? • Do some improve then relapse? • For example, recent work (Perlis, 2006) has shown that 60% of symptomatic patients achieve remission within 2-year period • Yet, only 50% of these stay recovered for a full 2 years • Do patients cluster into groups based on these trajectories? • For example, can we identify a “long-term” recovery group? • Can we predict these groups through a regression model?
Study Aims (Cont’d) 2) Given that distinct groups exist, we want to estimate the response trajectories for each class 3) Use patient covariates to predict group membership • Can we predict who is likely to belong to a particular trajectory class? 4) Long-term aim: determine if there are genetic differences among patients in different classes
Latent Class Models for Longitudinal Data • A common approach is to fit a parametric regression model relating Y to covariates X via class-specific parameters θk (k=1,…,K) • The marginal model (1) is a mixture of class specific densities, f, and class membership probabilities, πk • Typically assume f belongs to same distributional family for all components • In particular, πk = Pr(Ci = k), where Ci is a latent class-indicator variable for subject i • Model (1) is called a latent class or finite mixture model • Sometimes “latent class” reserved for discrete Y
Latent Class vs. Random Effects Model • Model (1) can be thought of as a discrete counterpart to the usual, continuous random effects model: • In (1) the random effect distribution, g, is discrete rather than continuous
Aims of Latent Class Modeling in Longitudinal Setting • Estimate class-specific regression parameters, θk • Conditional on data, predict class-membership probabilities for each subject: • These will be determined by subject’s response vector, • Potentially by additional covariates as well
References • McLachlan G. and Peel D. (2000). Finite Mixture Models. (Wiley). • Fruhwirth-Schnatter S. (2006). Finite Mixture and Markov Switching Models. (Springer) • Nagin D. (1999). Analyzing Developmental Trajectories. Psychological Methods, 4(2), 139-157. • Muthén et al. (2002). General Growth Mixture Modeling for Randomized Preventive Interventions. Biostatistics, 3(4), 459-475. • Elliott et al. (2005). Using Bayesian Latent Growth Curve Model to Identify Trajectories of Positive Affect and Negative Events Following Myocardial Infarction. Biostatistics, 6(1), 119-143. “Growth mixture model” is a finite mixture of random effects models; also called the “heterogeneity model” (Verbeke and Molenberghs (2000))
Software • Mplus (Muthén and Muthén). http://www.statmodel.com/ • SAS Proc Traj http://www.andrew.cmu.edu/user/bjones/index.htm • R: Flexmix and Bayemix packages • WinBUGS Panel data only (?)
A Simple Simulation • Consider the following two-class model • So, subject i has observed data (yi, ti) and latent variable, Linear Trend for Class 1 Quadratic Trend for Class 2 For simplicity, assume conditional independence and homoscedasticity π1 and π2 may vary across subjects as a function of covariates (e.g., older subject more likely to be in class 1, a priori). See below.
Simulation Model Simulated Responses (n=100) β1=c(2, .5), π1=.43 Class 1 Class 2 β2=c(1, 2, -.5), π2=.57 σ2=1 t
Estimation • Aims are to estimate β1, β2 and σ2 • Conditional on yi, predict probability that subject i belongs to class k, • If we knew class allocations, Ci’s, then we could estimate parameters via typical regression techniques • In a sense, Ci’s are like missing data
Complete-Data Estimation • Consider complete-data likelihood, • If we knew this, we could obtain ML estimates of model parameters • For example, is the typical least squares estimate based on subjects from class 1 • Likewise, Sample estimate of the true population proportion in class 1
Frequentist Estimation (Cont’d) • Since we don’t know Ci’s, we can use the EM algorithm: • E-Step: For each subject, compute expected complete data log-likelihood using current model estimates MLEs from previous iteration Introduce indicator variables, Dik=1 if Ci=k, 0 o.w.
M-Step • For details on E-Step, see: • Fruhwirth-Schnatter (2006) p. 50-51; • Also Redner and Walker (1984) and Meng (1997) • M-Step: Obtain parameter estimates, , by maximizing expected complete-data log-likelihood • Note: EM Algorithm used by R Flexmix, Mplus and Proc Traj
Bayesian Inference • In the Bayesian setting, we place a prior distribution on π, as well as other model parameters • Use MCMC (e.g., Gibbs sampling) to obtain posterior samples for parameters 1) At each iteration, we update class indicators, Ci, from their full conditional 2) Conditional on Ci, update other model parameters using standard MCMC techniques Data augmentation step
Prior Distribution • For our two-class linear model, typical conjugate priors might be: Di = 0 if class 1, 1 if class 2 π2 = probability of class 2 membership. Here assumed same for all subjects. π2 = probability of class 2 membership. Could be a function of covariates and additional parameters, α. In this case, we would have π2i ; that is, a different prior probability for each covariate profile – SEE BELOW
Posterior Computation • Gibbs Sampler: Bayes’ Theorem Update
Simulation Example (Cont’d) β1=c(2,.5), π1=.43 Class 1 Class 2 β2=c(1,2,-.5), π1=.57 σ2=1 • In addition, assume that the prior class probabilities are related to subjects’ age ~ N(40, 16) via a probit model: Older subjects less likely to be in class 2; more likely in class 1
Simulation Results • Ran Gibbs Sampler in R and WinBUGS • 5,000 iterations, burn-in of 1,500 • For R program, used Albert and Chib (1993) data augmentation procedure to estimate α • Compared various initial values • In both ML and Bayes estimation, inits do matter!
Results Class 1 Class 2
Subject-Specific Class Probabilities • Subject 15: y = (2.01,3.91,2.70,3.46,5.25) age=39.14 true class = 1 posterior π1 ≈ 1.0 • Subject 9: y = (2.42, 1.99, 3.83, 3.64, 3.39) age=37.9 true class=2 posterior π2 = .18 • Two subjects out of 100 were “misclassified” • prob of being in true class < .5 • Average posterior proportion of in Class 2 = .56 Recall: younger class 1 Correctly classified “Misclassified”
Other Software • WinBUGS gives similar results • SAS Proc Traj influenced by choice initial values • Results reasonable if inits chosen carefully ─ Observed -- Predicted
Proc Traj (Default Inits) ─ Observed -- Predicted
Non-Identifiability Due to Overfitting • If we fit too many classes to the data, our model may be non-identifiable • Non-Identifiability can arise in two ways: 1) Some πk = 0 (empty classes) • No information in the data to estimate πk • MLE regularity conditions fail; improper posterior under vague prior for π 2) θk = θj for some j ≠ k • Split a single class into two identical classes, with distinct membership probabilities • Can estimate π but not θk • Need to address the two types separately See Fruhwirth-Schnatter (2006) section 1.3
Non-Identifiability Due to Label Switching • K! different arrangements of components yield same marginal likelihood • For example, • Loosely, it doesn’t matter which classes we label 1 and 2 • Not a problem for ML estimation; same MLEs under both labelings • However, in Bayesian setting, can cause problems when trying to sample component specific parameters • For example, we may sample μ1 ≈ 1 at some iterations and μ1 ≈ 3 at others
Example: f(y)=.5N(μ1 ,1) + .5N(μ2 ,1) Note: vague N(0, 100) prior on μ1 and μ2,, beta(1,1) on π Posterior Densities and Iteration Plots for μ1 (n=100) Samples for μ1 never drift far from 1 Occasionally, draw μ1’s ≈ 2.5 and μ2’s≈ 1 and stay there, since it’s a posterior mode beta(1,1) prior can yield posterior π1≈ 0 Here π1|y ≈ 0 sample from prior for μ1
Solution: Formal Identifiability Constraints • Assume a priori constraints among parameters • Ensure nonempty classes, choose prior that bounds π away from 0 • π ~Dir(4,…,4) • For any two classes, make sure there’s at least one parameter θj that differentiates the classes • If classes are well-distinguished by data, unconstrained model probably okay • If classes are not distinct and n is small, the model is not well-identified, and computational problems can arise – need a constraint • In Bayesian context, can implement identifiability constraint via constrained prior distribution on parameters • However, arbitrary constraints do not necessarily avoid label switching (Celeux, 1998; Stephens, 2000) Ensures no empty classes Ensures non- identical classes
Constrained Permutation Sampler • Fruhwirth-Shnatter (2006) recommends a constrained permutation sampler • Randomly permute parameters at each iteration, according to assumed constraint • If constraint is valid, then no label switching • If constraint not valid, label switching will arise
Determining the Number of Classes (K) • “Mode hunting” plots (Fruhwirth-Schnatter (2006)) • Plot posterior draws from unconstrained permutation sampler for each pair of classes (k,k’) If K is correct, should get K(K-1) distinct clusters, bounded away from line, μ1 = μ2 Evidence of potential overfitting; recall, n=100
One-Component Model, μ1 = μ2 = 1 • What happens if we fit a two-component model to a model simulated under one component, f(y|μ)=N(μ,1)? Posterior Mode/MLE μ-hat=1under one-component mixture
Other Strategies for determining K • Compare BICK for various choices of K • Choose K that minimizes BIC • Reoder and Wasserman (1997) • Fraley and Raftery (1998) • Bayes Factors • Posterior Predictive Checks • Choose a test statistic, T • At each MCMC iteration, compute T(yf) based on a posterior predictive sample, yf • Plot posterior distribution of T(yf) given y for model MK • If model is underfitting, the observed sample test statistic, T(yobs) , may lie outside, or in tails of posterior distribution, p[T(yf)|y,MK] • Dimension-switching MCMC such as reversible jump (Green, 1995)
Appendix: Related Methods • Related methods include: • Cluster analysis: choose partition that minimizes within-cluster variance of response, Y • Model-Based Clustering: Assume that Y arises from a finite-mixture distribution with K components – use model to predict class memberships • Banfield and Raftery (1993) Biometrics • Fraley and Raftery (1998) The Computer Journal • Mclust package in R • Tree-based methods (e.g., CART): sequentially split covariates, X, in a way that “optimizes” the within-cluster homogeneity of Y • Nonparametric procedure • Segal (1992). Tree-Structured Methods for Longitudinal Data. Journal of the American Statistical Association, Vol. 87 (418), 407-418 • Extends splitting rule to accommodate repeated measures • Rpart and Tree packages in R (may not accommodate repeated measures) • In a sense, latent class regression models combine regression aspect of CART with finite mixture aspect of MBC
Tree Diagram for Simulated Example • CART yields 9 classes, some with small n Classifies by visits rather than subjects – i.e., ignores repeated measures – may not be meaningful Visit 0 Visit > 0 Mean Y Youngest Patients