460 likes | 724 Views
Cross-Sectional Mixture Modeling. Shaunna L. Clark Advanced Genetic Epidemiology Statistical Workshop October 23, 2012. Outline . What is a mixture? Introduction to LCA (LPA) Basic Analysis IdeasPlan and Issues How to choose the number of classes How do we implement mixtures in OpenMx ?
E N D
Cross-Sectional Mixture Modeling Shaunna L. Clark Advanced Genetic Epidemiology Statistical Workshop October 23, 2012
Outline • What is a mixture? • Introduction to LCA (LPA) • Basic Analysis Ideas\Plan and Issues • How to choose the number of classes • How do we implement mixtures in OpenMx? • Factor Mixture Model • What do classes mean for twin modeling?
Homogeneity Vs. Heterogeneity • Most models assume homogeneity • i.e. Individuals in a sample all follow the same model • What have seen so far (for the most part) • But not always the case • Ex: Sex, Age, Patterns of Substance Abuse
What is Mixture Modeling Used to model unobserved heterogeneity by identifying different subgroups of individuals Ex: IQ, Religiosity
Latent Class Analysis (LCA) Also known as Latent Profile Analysis (LPA) if you have continuously distributed variables
Latent Class Analysis Introduced by Lazarsfeld & Henry, Goodman, Clogg, Dayton & Mcready • Setting • Cross-sectional data • Multiple items measuring a construct • 12 items measuring the construct of Cannabis Abuse/Dependence • Hypothesized construct represented as latent class variable (categorical latent variable) • Different categories of Cannabis Abuse\Dependence patterns • Aim • Identify items that indicate classes well • Estimateproportion of sample in each class (class probability) • Classify individuals into classes (posterior probabilities)
Latent Class Analysis Cont’d C x1 x2 x3 x4 x5
Latent Class Analysis Model Dichotomous (0/1) indicators u: u1, u2, ... , ur Categorical latent variable c: c = k ; k = 1, 2, ... , K Marginal probability for item uj = 1, (probability item uj =1 is the sum over all class of the product of the probability of being in class k and the probability of endorsing item uj given that you are in class k)
Joint Probabilities • Joint probability of all u’s, assuming conditional independence: • Probability of observing a given response pattern is equal to the sum over all classes of the product of being in a given class and the probability of observing a response on item 1 given that you are in latent class k, . . . (repeat for each item)
Posterior Probabilities • Probability of being inclass k given your response pattern • Used to assign most likely class membership • Based on highest posterior probability
Model Testing • Log-likelihood ratio χ2 test (LLRT) • Overall test against the data with H1 being the unrestricted multinomial • Problem: Not distributed as χ2 due to boundary conditions • Don’t use it!!! (McLachlan& Peele, 2000) • Information Criteria • Akaike Information Criteria, AIC (Akaike,1974) AIC = 2h-2ln(L) • Bayesian Information Criteria, BIC (Schwartz, 1978) BIC = -2ln(L)+h*ln(n) • Where L = log-likelihood, h = number of parameters, n = sample size • Chose model with lowest value of IC
Other Tests • Since can’t do LLRT, use test which approximate the difference in LL values between k and k-1 class models. • Vuong-Lo-Mendell-Rubin, LMR-LRT (Lo, Mendell, & Rubin, 2001) • Parametric bootstrapped LRT, BLRT (McLachlan, 1987) • P-value is probability that H0 is true • H0: k-1 classes; H1: k classes • A low p-value indicates a preference for the estimated model (i.e. k classes) • Look for the first time the p-value is non-significant or greater than 0.05
Analysis Plan • Fit model with 1-class • Everyone in same class • Sometimes simple is better • Fit LCA models 2-K classes • Chose best number of classes Seems simple right???
Not really . . .Lots of known issues in Mixture Analysis • Global vs. Local Maximum Log Likelihood Log Likelihood Global Global Local Local Parameter Parameter • Use multiple sets of random starting values to make sure have global solution. Make sure that best LL value has replicated
Determining the number of classes: Class Enumeration • No agreed upon way to determine the correct number of latent classes • Statistical comparisons (i.e. ICs, LRTs) • Interpretability and usefulness of classes • Substantive theory • Relationship to auxiliary variables • Predictive validity of classes • Class size • Quality of Classifications (not my favorite) • Classification table based on posterior probabilities • Entropy - A value close to 1 indicates good classification in that many individuals have posterior probabilities close to 0 or 1
Suggested Strategy • Nylund et al. (2007), Tofighi & Enders (2008), among others • Simulation studies comparing tests and information criteria described previously • Suggest: • Use BIC and LMR to narrow down the number of plausible models • Then run BLRT on those models because BLRT can be computationally intensive
OpenMX: LCA Example Script LCA_example.R
Mixtures in OpenMx • Specify class-specific models • Create MxModel objects for each class • Specify class probabilities • Create an MxMatrix of class probabilities\proportions • Specify model-wide objective function • Pull everything together in a parent model with data • Weighted sum of the class models • Estimate entire model Note: One of potentially many ways to do this
Class Specific Models nameList <- names(<dataset>) class1 <- mxModel("Class1", mxMatrix("Iden", name = "R", nrow = nvar, ncol = nvar, free=FALSE), mxMatrix("Full", name = "M", nrow = 1, ncol = nvar, free=FALSE), mxMatrix("Full", name = "ThresholdsClass1", nrow = 1, ncol = nvar, dimnames = list("Threshold",nameList), free=TRUE), mxFIMLObjective(covariance="R", means="M", dimnames=nameList, thresholds="ThresholdsClass1",vector=TRUE)) Repeat for every class in your model Don’t be like me, make sure to change class numbers
Define the Model lcamodel <- mxModel("lcamodel", class1, class2, mxData(vars, type="raw"), Next, specify class membership probabilities
Class Membership Probabilities • When specifying need to remember: • Class probabilities must be positive • Must sum to a constant - 1 mxMatrix("Full", name = "ClassMembershipProbabilities", nrow = nclass, ncol = 1, free=TRUE, labels = c(paste("pclass", 1:nclass, sep=""))), mxBounds(c(paste("pclass", 1:nclass, sep="")),0,1), mxMatrix("Iden", nrow = 1, name = "constraintLHS"), mxAlgebra(sum(ClassMembershipProbabilities), name = "constraintRHS"), mxConstraint(constraintLHS == constraintRHS),
Model-wide objective function • Weighted sum of individual class likelihoods • Weights are class probabilities • So for two classes:
Model Wide Objective Function Cont’d mxAlgebra( -2*sum(log(pclass1%x%Class1.objective + pclass2%x%Class2.objective)), name="lca"), mxAlgebraObjective("lca")) ) Now we run the model: model <- mxRun(lcamodel) And we wait and wait and wait till it’s done.
Profile Plot • One way to interpret the classes is to plot them. • In our example we had binary items, so the thresholds are what distinguishes between classes • Can plot the thresholds • Or you can plot the probabilities • More intuitive • Easier for non-statisticians to understand
Profile Plots in R\OpenMx #Pulling out thresholds class1T <- model@output$matrices$Class1.ThresholdsClass1 class2T <- model@output$matrices$Class2.ThresholdsClass2 #Converting threshold to probabilities class1P<-t(1/(1+exp(-class1T))) class2P<-t(1/(1+exp(-class2T)))
Profile Plots Cont’D plot(class1P, type="o", col="blue",ylim=c(0,1),axes=FALSE, ann=FALSE) axis(1,at=1:12,lab=nameList) axis(2,las=1,at=c(0,0.2,0.4,0.6,0.8,1)) box() lines(class2P,type="o", pch=22, lty=2, col="red") title(main="LCA 2 Class Profile Plot", col.main="black",font.main=4) title(xlab="DSM Items", col.lab="black") title(ylab="Probability", col.lab="black") legend("bottomright",c("Class 1","Class 2"), cex=0.8, col=c("blue","red"),pch=21:22,lty=1:2)
OpenMx Exercise • Unfortunately, it takes long time for these to run so not feasible to do in this session • However, I’ve run the 2-, 3-, and 4- class LCA models for this data and (hopefully) the .Rdata files are posted on the website • Exercise: Using the .Rdata files • Determine which model is better according to AIC\BIC • Want the lowest value • Make a profile plot of the best solution and interpret the classes • What kind of substances users are there?
Code to pull out LL and compute AIC\BIC #Pull out LL LL_2c <- model@output$Minus2LogLikelihood LL_2cnsam = 1878 #parameters npar<- (nclass-1) + (nthresh*nvar*nclass npar #Compute AIC & BIC AIC_2c = 2*npar + LL_2c AIC_2c BIC_2c = LL_2c + (npar*log(nsam)) BIC_2c
Problem with LCA • Once in a class, everyone “looks” the same. • In the context of substance abuse, unlikely that every user will have the same patterns of use • Withdrawal, tolerance, hazardous use • There is variation within a latent class • Severity • One proposed solution is the factor mixture model • Uses a latent class variables to classify individuals and latent factor to model severity
Factor Mixture Model σ2F F C λ3 λ1 λ2 λ4 λ5 x1 x2 x3 x4 x5 Classes can be indicated by item thresholds (categorical)\ item means (continuous) or factor mean and variance
General Factor Mixture Model yik = Λkηik + εik, ηik = αk+ ζik, where, ζik~ N(0, Ψk) • Similar to the FA model, except many parameters can be class varying as indicated by the subscript k • Several variations of this model which differ in terms of the measurement invariance • Lubke & Neale (2005), Clark et al. (2012)
How do we do this in OpenMx? • You’ll have to wait till tomorrow! • Factor Mixture Model is a generalization of the Growth Mixture Model we’ll talk about tomorrow afternoon.
Mixtures & Twin Models How do we combine the ACDE model and mixtures?
Option 1: ACE Directly on the Classes 1.0 (MZ) / 0.5 (DZ) 1.0 cB CA CB cA eA eB aA aB x1B x1A x2B x2A x3B x3A x4A x4B
What would this look like for the FMM? 1.0 (MZ) / 0.5 (DZ) 1.0 1.0 (MZ) / 0.5 (DZ) 1.0 aA aA CA CB cB eA cA aA aB eB cA cA eA eA FB FA x1B x1A x2B x2A x3B x3A x4A x4B
FMM & ACE CONT’D • One of many possible ways to do FMM & ACE in the same model • Can also have class specific ACE on the factors • Each class has own heritability From Muthén et al. (2006)
Issue with Option 1 • Model is utilizes the liability threshold model to “covert” the latent categorical variable, C, to a latent normal variable • This requires that classes are ordered • Ex: high, medium, low users • Don’t always have nicely ordered classes • Models are VERY time intensive • Take a vacation for a week or two
Option 2: Three-Step Method • Estimate mixture model • Assign individuals into their most likely latent class based on the posterior probabilities of class membership • Use the observed, categorical variable of assigned class membership as the phenotype in a liability threshold model version of ACE analysis Note: Requires ordered classes
Option 2a • Contingency table analysis using most likely class membership • Concordance between twins in terms of most likely class membership • If your classes are not ordered • Odds Ratio • Excess twin concordance due to stronger genetic relationship can be represented by the OR for MZ twins compared to the OR for DZ twins. • Place restrictions on the contingency table to test specific hypotheses • Mendelian segregation, only shared environmental effects • Eaves (1993)
Issues with Option 2 • Potential for biased parameter estimates and underestimated standard errors • Assigned membership ignores fractional class membership suggested by posterior probabilities • Treat the classification as not having any sampling error • Good option when entropy is high\ well separated classes
Selection of Cross-Sectional Mixture Genetic Analysis Writings • Latent Class Analysis • Eaves, 1993; Muthén et al., 2006; Clark, 2010 • Factor Mixture Analysis • Neale & Gillespie, 2005 (?); Clark, 2010; Clark et al. (in preparation) • Additional References • McLachlan, Do, & Ambroise, 2004 • Mixtures in Substance Abuse • Gillespie (2011, 2012) • Great cannabis examples