Latent Class Analysis in SAS: Promise, Problems, and Programming

David M. Thompson
Department of Biostatistics and Epidemiology
College of Public Health, OUHSC

Latent class analysis (LCA)
• LCA validates classification in the absence of a gold standard for decision-making.
• Incorporation into SAS is recent.
Invited paper 192-2007

LCA and Patient Classification
Patient classification is part of many clinical decisions.
• Diagnosis
• Prognosis

Patient classification in the absence of a gold standard
Diagnosis
• Diagnostic categories may be emerging or unclear.
Prognosis
• Predicting rehabilitation outcomes
• Counseling patients and families regarding expectations

Outline
• LCA defined
• SAS approaches to LCA
• Producing standard errors
• Curing the problem of fracturing of estimates
• Limitations of LCA

Latent class analysis (LCA)
• LCA is a parallel to factor analysis, but for categorical responses.
• Like factor analysis, LCA addresses the complex pattern of association that appears among observations…

…and attributes the pattern to a set of latent (underlying, unobserved) factors or classes.

A complex pattern of responses emerged when undergraduates made ethical decisions in response to four stimulus scenarios.
Stouffer, S.A., & Toby, J. (1951). Role conflict and personality. American Journal of Sociology, 56, 395-406.

LCA predicts latent class membership such that observed responses are independent within each latent class.

LCA estimates
• Latent class prevalences
• Conditional probabilities: probabilities of a specific response, given class membership, e.g. P(A-P acc | LC 1), P(St.Mkt.Info | LC 2)

Conditional probabilities are analogous to sensitivities and specificities, but are calculated in the absence of a gold standard.
Examples: P(A-P acc | LC 1), P(St.Mkt.Info | LC 2)

LC parameter estimates for Stouffer and Toby data

Indicators’ informativeness defined by differences in conditional probabilities

LCA works on the unconditional contingency table (a table with no information on LC membership).

LCA’s goal is to produce a complete (conditional) table that assigns counts for each latent class.

Assumptions of LCA
• Exhaustiveness: $\pi^{ABCD}_{ijkl} = \sum_{t} \pi^{ABCDX}_{ijklt}$
• Conditional (local) independence: $\pi^{ABCDX}_{ijklt} = \pi^{ABCD|X}_{ijkl|t}\,\pi^{X}_{t} = \pi^{A|X}_{it}\,\pi^{B|X}_{jt}\,\pi^{C|X}_{kt}\,\pi^{D|X}_{lt}\,\pi^{X}_{t}$
(Goodman’s probabilistic parameterization of an LC model with 4 observed variables)
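The two assumptions can be illustrated numerically. The following is a minimal sketch in Python (used here only for illustration, not the SAS code of the presentation); the prevalences and conditional probabilities are hypothetical values, not estimates from the Stouffer and Toby data. It computes each response-profile probability as a sum over latent classes of the class prevalence times the product of the conditional probabilities.

```python
from itertools import product

def profile_probability(profile, prevalences, cond_probs):
    """P(A=i, B=j, C=k, D=l) = sum over t of P(X=t) * product of P(item | X=t).

    prevalences[t]   : P(X=t)
    cond_probs[t][m] : P(indicator m = 1 | X=t), binary indicators assumed
    """
    total = 0.0
    for t, prev in enumerate(prevalences):
        joint = prev
        for m, response in enumerate(profile):
            p1 = cond_probs[t][m]
            joint *= p1 if response == 1 else 1.0 - p1
        total += joint  # exhaustiveness: sum over all latent classes
    return total

# Hypothetical two-class parameters for four binary indicators A, B, C, D.
prevalences = [0.3, 0.7]
cond_probs = [
    [0.9, 0.8, 0.8, 0.7],  # class 1
    [0.2, 0.3, 0.1, 0.2],  # class 2
]

# Unconditional table of probabilities for all 16 response profiles.
table = {p: profile_probability(p, prevalences, cond_probs)
         for p in product([0, 1], repeat=4)}
```

The 16 profile probabilities sum to one, which is what exhaustiveness requires of the unconditional table.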

ML approach to LC estimation
• The probability of obtaining count $n_{ijklt}$ for response profile $\{i,j,k,l\}$ in latent class $t$ is $(\pi^{ABCDX}_{ijklt})^{n_{ijklt}}$.
• The likelihood of obtaining a set of counts for several response profiles is
  $L = \prod_{i}\prod_{j}\prod_{k}\prod_{l}\prod_{t} (\pi^{ABCDX}_{ijklt})^{n_{ijklt}}$
  $\log L = \sum_{i}\sum_{j}\sum_{k}\sum_{l}\sum_{t} n_{ijklt}\,\ln(\pi^{ABCDX}_{ijklt})$

ML approach to LC estimation
• Because LC membership (X=t) is unobserved, the likelihood function and likelihood surface are complex.
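One way to see the complexity, written as a sketch in Goodman's notation and assuming four indicators: when only the counts $n_{ijkl}$ of the unconditional table are observed, the sum over latent classes moves inside the logarithm, so the log likelihood no longer separates into one term per parameter.

```latex
\log L_{\mathrm{obs}}
  = \sum_{i}\sum_{j}\sum_{k}\sum_{l} n_{ijkl}\,
    \ln\!\Big( \sum_{t} \pi^{X}_{t}\,
      \pi^{A|X}_{it}\,\pi^{B|X}_{jt}\,\pi^{C|X}_{kt}\,\pi^{D|X}_{lt} \Big)
```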

EM algorithm maximizes L when some data (X) are unobserved
• “M” step produces ML estimates from the complete table
• “E” step uses parameter estimates to update expected values for cell counts $n_{ijklt}$ in the complete contingency table
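The alternation of the two steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the SAS-IML, DATA step, or CATMOD code of the presentation: indicators are binary, the counts are hypothetical, the loop starts from a random assignment of each response profile to one class, and the names `fit_lca`, `m_step`, `e_step` and the small constant `EPS` are inventions of this sketch.

```python
import math
import random
from itertools import product

EPS = 1e-9  # keeps estimates off the 0/1 boundary (an assumption of this sketch)

def joint_probs(profile, prev, cond):
    """P(profile, X=t) for each class t under conditional independence."""
    out = []
    for t, p_t in enumerate(prev):
        pr = p_t
        for m, response in enumerate(profile):
            pr *= cond[t][m] if response == 1 else 1.0 - cond[t][m]
        out.append(pr)
    return out

def m_step(complete, n_items, n_classes):
    """ML estimates from the complete table {(profile, class): count}."""
    total = sum(complete.values())
    prev, cond = [], []
    for t in range(n_classes):
        class_n = sum(c for (p, k), c in complete.items() if k == t)
        prev.append(class_n / total)
        row = []
        for m in range(n_items):
            ones = sum(c for (p, k), c in complete.items()
                       if k == t and p[m] == 1)
            est = ones / class_n if class_n > 0 else 0.5
            row.append(min(max(est, EPS), 1.0 - EPS))
        cond.append(row)
    return prev, cond

def e_step(counts, prev, cond):
    """Split each observed count across classes by P(X=t | profile)."""
    complete = {}
    for profile, n in counts.items():
        joint = joint_probs(profile, prev, cond)
        denom = sum(joint)
        for t, pr in enumerate(joint):
            complete[(profile, t)] = n * pr / denom
    return complete

def log_likelihood(counts, prev, cond):
    """Observed-data log likelihood of the unconditional table."""
    return sum(n * math.log(sum(joint_probs(p, prev, cond)))
               for p, n in counts.items())

def fit_lca(counts, n_classes=2, n_iter=40, seed=0):
    rng = random.Random(seed)
    n_items = len(next(iter(counts)))
    # Initial fill-in: assign each response profile to one random class.
    complete = {}
    for profile, n in counts.items():
        chosen = rng.randrange(n_classes)
        for t in range(n_classes):
            complete[(profile, t)] = float(n) if t == chosen else 0.0
    history = []
    for _ in range(n_iter):
        prev, cond = m_step(complete, n_items, n_classes)
        history.append(log_likelihood(counts, prev, cond))
        complete = e_step(counts, prev, cond)
    return prev, cond, history

# Hypothetical counts for the 16 profiles of four binary indicators.
counts = dict(zip(product([0, 1], repeat=4),
                  [40, 12, 10, 6, 11, 5, 6, 8, 9, 6, 5, 9, 6, 10, 12, 45]))
prev, cond, history = fit_lca(counts, seed=1)
```

The `history` list records the observed-data log likelihood after each M step; it should not decrease from one iteration to the next, which is a convenient check that the loop is wired correctly. Different seeds give different starting assignments and may give different solutions.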

EM algorithm requires initial estimates
• 1st “E” step: provides initial estimates to “fill in” missing information on LC membership
• “M” step and “E” step then alternate
• Functions achieved in SAS-IML or conventional DATA steps

EM algorithm instituted using SAS-IML or conventional DATA steps
• 1st “E” step: randomly assigns each response profile to one latent class
• “M” step
• “E” step

Alternative approach using SAS PROC CATMOD
moon.ouhsc.edu/dthompso/latent%20variable%20research/lvr.htm
• 1st “E” step: SAS DATA step randomly assigns each response profile to one latent class
• “M” step: PROC CATMOD
• “E” step: SAS DATA step

Other approaches
• PROC LCA, Methodology Center of Penn State University: methcenter.psu.edu/lca/
• LC regression macros: K. Bandeen-Roche, Johns Hopkins

EM algorithm does not produce standard errors
Strategies include:
• Converting CATMOD’s loglinear parameter SE into SE for probabilities
• Bootstrapping SE
• Obtaining SE from multiple solutions

Strategy 1: Convert SE obtained from CATMOD’s loglinear model

SE of loglinear parameters can be converted to SE of probabilities (after Heinen, 1996)
But probabilities are complex nonlinear functions of their loglinear counterparts:
• latent class prevalences: $P(X=t) = \exp(\lambda^{X}_{t}) \,/\, \sum_{x} \exp(\lambda^{X}_{x})$
• conditional probabilities: $P(A=i \mid X=t) = P(AX)/P(X) = \exp(\lambda^{A}_{i} + \lambda^{AX}_{it}) \,/\, \sum_{a} \exp(\lambda^{A}_{a} + \lambda^{AX}_{at})$
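Both formulas have the same normalized-exponential form, so one conversion routine covers them. The sketch below, in Python for illustration only, pairs that conversion with a first-order (delta method) approximation of the SE. The delta method is a swapped-in, generic technique and may differ in detail from Heinen's derivation; the input values and the covariance matrix are hypothetical, and the function names are inventions of this sketch.

```python
import math

def softmax(lams):
    """Probabilities exp(lam_i) / sum_j exp(lam_j), computed stably."""
    top = max(lams)
    exps = [math.exp(l - top) for l in lams]
    total = sum(exps)
    return [e / total for e in exps]

def delta_method_se(lams, cov):
    """Approximate SE of each probability from the covariance of the lams.

    Uses the Jacobian dp_i/dlam_j = p_i * (1[i == j] - p_j) and
    Var(p_i) ~= sum over a, b of J[i][a] * cov[a][b] * J[i][b].
    """
    p = softmax(lams)
    k = len(p)
    jac = [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(k)]
           for i in range(k)]
    ses = []
    for i in range(k):
        var = sum(jac[i][a] * cov[a][b] * jac[i][b]
                  for a in range(k) for b in range(k))
        ses.append(math.sqrt(max(var, 0.0)))
    return p, ses

# Hypothetical linear predictors lam_a^A + lam_at^AX for a binary indicator
# under sum-to-zero coding, with a hypothetical covariance matrix.
lams = [0.4, -0.4]
cov = [[0.04, -0.04],
       [-0.04, 0.04]]
p, ses = delta_method_se(lams, cov)
```

For a conditional probability, the inputs are the sums $\lambda^{A}_{a} + \lambda^{AX}_{at}$ over the categories $a$, so their covariance matrix has to be assembled from the covariances of the individual loglinear estimates before calling the routine.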

Strategy 2: Bootstrap parameter estimates and SE
• Generate an initial LCA solution and use its parameter estimates to generate a complete (conditional) contingency table.
• From the complete table, generate B bootstrapped unconditional tables.
• Perform LCA on each table, producing B sets of parameter estimates.
• The mean and SD of these constitute, respectively, the parameter estimates and SE.
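The table-generation and summary steps above can be sketched as follows, in Python for illustration only. The sketch assumes binary indicators and hypothetical parameter values; it draws a latent class and then each response for every simulated subject (which is equivalent to sampling from the complete table) and then discards the class label to obtain an unconditional table. Refitting LCA to each table is left to whatever routine is in use; `bootstrap_tables` and `mean_and_sd` are names invented here.

```python
import random
import statistics
from itertools import product

def bootstrap_tables(prev, cond, n, n_boot, seed=0):
    """Generate n_boot unconditional tables of size n from fitted parameters."""
    rng = random.Random(seed)
    classes = list(range(len(prev)))
    n_items = len(cond[0])
    profiles = list(product([0, 1], repeat=n_items))
    tables = []
    for _b in range(n_boot):
        table = {p: 0 for p in profiles}
        for _s in range(n):
            t = rng.choices(classes, weights=prev)[0]   # latent class
            profile = tuple(1 if rng.random() < cond[t][m] else 0
                            for m in range(n_items))     # responses given class
            table[profile] += 1                          # class label is dropped
        tables.append(table)
    return tables

def mean_and_sd(estimates):
    """Bootstrap point estimate (mean) and SE (SD) for one parameter."""
    return statistics.mean(estimates), statistics.stdev(estimates)

# Hypothetical initial solution for two classes and four binary indicators.
prev = [0.3, 0.7]
cond = [[0.9, 0.8, 0.8, 0.7],
        [0.2, 0.3, 0.1, 0.2]]
tables = bootstrap_tables(prev, cond, n=200, n_boot=20, seed=1)
```

Each table in `tables` would then be passed to the LCA routine, and `mean_and_sd` applied to the B estimates of each parameter.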

Bootstrapping
• Creating multiple samples by resampling repeatedly from the original sample
• Bootstrapped samples typically chosen randomly, with replacement, so n equals that of the original sample
• Statistical operation repeated on each bootstrapped sample

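Efficient bootstrapping code (Barker, 2005)

    data boot;
      do bootsamp = 1 to 100;            /* 100 bootstrap samples            */
        do i = 1 to nobs;                /* each the size of the original    */
          /* ceil() gives 1..nobs; round() could return 0, an invalid POINT= value */
          pick = ceil(ranuni(0) * nobs);
          set original nobs=nobs point=pick;
          output;
        end;
      end;
      stop;                              /* required when reading with POINT= */
    run;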

Bootstrapped estimates

Strategy 3: Generate multiple solutions from different starting values

Estimates and SE from multiple solutions, each from a different initial assignment of response profiles

Estimates of conditional probability P(A=1|X=1)
[Figure panels: P(A=1|X=1) from multiple estimates; P(A=1|X=1) from bootstrapped estimates]
The repeated solutions approach may be more useful than bootstrapping because it explicitly accounts for LCA’s sensitivity to initial estimates.

Multiple solutions and bootstrapping approaches are useful, but present a new challenge: “fracturing” of the distributions of LC estimates.
[Figure: distribution of multiple estimates of the conditional probability P(A=1|X=1) (above) and of P(A=1|X=2) (below)]

What fractures the distributions?
• Latent classes have no intrinsic meaning.
• Identification of LC membership is flexible.
• LCA can attribute a vector of parameter estimates to LC X=1 for one solution, and to LC X=2 for the next.

How to resolve fracturing
• Simulation studies confirm that vectors of parameter estimates are individually coherent.
• Consistent assignment of vectors to the appropriate latent classes should cure fracturing.
• What rule leads to consistent assignment?

Rule: Reflect all estimates in a vector into the half-plane most heavily populated by conditional probabilities of the most informative indicator.
In this example, D is the most informative indicator, so estimates for every parameter are reflected into indicator D’s more heavily populated (upper left) half-plane.
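For a two-class model the rule amounts to swapping the class labels of any solution that falls on the minority side of the diagonal for the most informative indicator. The sketch below is one possible reading of the rule, in Python for illustration only; the three solutions are hypothetical, informativeness is taken to be the average absolute difference in conditional probabilities between classes, and the function names are inventions of this sketch.

```python
import statistics

def most_informative(solutions):
    """Index of the indicator with the largest mean |P(.|X=1) - P(.|X=2)|."""
    n_items = len(solutions[0][1][0])
    gaps = [statistics.mean(abs(cond[0][m] - cond[1][m])
                            for _prev, cond in solutions)
            for m in range(n_items)]
    return gaps.index(max(gaps))

def reflect_solutions(solutions, key):
    """Relabel classes so every solution lies in the majority half-plane.

    solutions : list of (prev, cond) pairs from a two-class model
    key       : index of the most informative indicator
    """
    above = sum(1 for _prev, cond in solutions if cond[1][key] > cond[0][key])
    keep_above = 2 * above >= len(solutions)
    aligned = []
    for prev, cond in solutions:
        is_above = cond[1][key] > cond[0][key]
        if is_above != keep_above:
            prev, cond = prev[::-1], cond[::-1]  # swap the two class labels
        aligned.append((prev, cond))
    return aligned

# Three hypothetical solutions; the second has its class labels switched.
solutions = [
    ([0.30, 0.70], [[0.20, 0.30, 0.10, 0.10], [0.80, 0.70, 0.80, 0.90]]),
    ([0.69, 0.31], [[0.79, 0.72, 0.81, 0.92], [0.22, 0.28, 0.12, 0.08]]),
    ([0.32, 0.68], [[0.21, 0.31, 0.09, 0.12], [0.81, 0.69, 0.79, 0.88]]),
]
key = most_informative(solutions)
aligned = reflect_solutions(solutions, key)

# Estimate and SE of P(A=1|X=1) across the aligned solutions.
a_given_1 = [cond[0][0] for _prev, cond in aligned]
estimate = statistics.mean(a_given_1)
se = statistics.stdev(a_given_1)
```

Because whole vectors are relabeled together, every parameter in a solution moves with the most informative indicator, which is what keeps each vector internally coherent.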

Distribution of estimates after reflection
[Figure panels: P(A=1|X=1); P(A=1|X=2)]

With the fracturing problem solved, the multiple solutions approach is an attractive strategy to overcome the EM algorithm’s inability to produce standard errors.

Limitations of LCA
• Sample size must support detection of weak latent structures, those with:
    • Rare latent class(es)
    • Uninformative indicators

Limitations of LCA
• Fit statistics primarily assess conditional independence and so don’t alert the analyst when LCA is struggling to characterize weak latent structure.

Limitations of LCA
• Violations of assumption of conditional independence
    • conditional (or residual) dependence

Conditional dependence
• leads to poor estimation
    • Overestimation of informativeness of both correlated indicators
    • Overestimation of prevalence of other LC
• leads to poor model fit
    • Analyst may respond by positing additional latent classes, which complicates interpretation.
    • Model’s applicability is limited when modifications increasingly capitalize on the data’s idiosyncrasies.

Assessing conditional dependence
Z scores compare observed log odds ratios for pairs of indicators with those expected under conditional independence (Garrett & Zeger, 2000)

    Pairs of        ______Log odds______
    indicators      Expected    Observed    ASE        z
    a b             0.2993      0.7270      0.3557     1.2024
    a c             0.3630      0.7953      0.3557     1.2154
    a d             0.7847      0.5312      0.4796    -0.5285
    b c             0.6534      0.5586      0.2760    -0.3435
    b d             1.2871      1.3876      0.3430     0.2929
    c d             1.6395      1.6994      0.3685     0.1626

Large z scores arouse suspicion that pairs of indicators are conditionally dependent.
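The z column can be reproduced from the other three, which shows how the comparison is made: z is the observed log odds minus the expected log odds, divided by the ASE. The short Python sketch below (for illustration only) recomputes z from the values in the table; agreement is to rounding, since the tabled inputs are themselves rounded. The cutoff of 1.96 is an assumption of this sketch, not a threshold stated in the presentation.

```python
# (expected, observed, ASE) for each pair of indicators, copied from the table.
log_odds = {
    ("a", "b"): (0.2993, 0.7270, 0.3557),
    ("a", "c"): (0.3630, 0.7953, 0.3557),
    ("a", "d"): (0.7847, 0.5312, 0.4796),
    ("b", "c"): (0.6534, 0.5586, 0.2760),
    ("b", "d"): (1.2871, 1.3876, 0.3430),
    ("c", "d"): (1.6395, 1.6994, 0.3685),
}

def z_score(expected, observed, ase):
    """Standardized gap between observed and expected log odds ratios."""
    return (observed - expected) / ase

z = {pair: z_score(*values) for pair, values in log_odds.items()}

# Hypothetical cutoff for flagging pairs that look conditionally dependent.
CUTOFF = 1.96
flagged = [pair for pair, value in z.items() if abs(value) > CUTOFF]
```

With these values no pair exceeds the assumed cutoff, although the pairs involving indicator a with b and with c have the largest z scores.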

Accounting for conditional dependence
• Pairwise conditional dependence can be incorporated into a revised model.
• Patterns of dependence and independence are flexibly expressed in both LCA parameterizations:
    • Probabilistic (Goodman): $\pi^{ABCDX}_{ijklt} = \pi^{A|X}_{it}\,\pi^{B|X}_{jt}\,\pi^{C|X}_{kt}\,\pi^{D|X}_{lt}\,\pi^{X}_{t}$
    • Loglinear (Haberman): $\ln \pi^{ABCDX}_{ijklt} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{X}_{t} + \lambda^{AX}_{it} + \lambda^{BX}_{jt} + \lambda^{CX}_{kt} + \lambda^{DX}_{lt}$

Accounting for conditional dependence
• Take advantage of CATMOD’s loglinear modeling capabilities in the M step.
• The standard M step that assumes conditional independence:

    ods output estimates=mu;              /* save parameter estimates to data set mu */
    proc catmod order=data;
      weight count;
      model a*b*c*d*x=_response_ / wls addcell=.1;
      /* main effects plus each indicator's association with latent class x */
      loglin a b c d x a*x b*x c*x d*x;
    run;
    quit;
    ods output close;

Accounting for conditional dependence
• Modifying the CATMOD M step to model conditional dependence between indicators B and C:

    ods output estimates=mu;              /* save parameter estimates to data set mu */
    proc catmod order=data;
      weight count;
      model a*b*c*d*x=_response_ / wls addcell=.1;
      /* b*c and b*c*x add a B-C association that may differ by latent class */
      loglin a b c d x a*x b*x c*x d*x b*c b*c*x;
    run;
    quit;
    ods output close;
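Read against the LOGLIN statement, the added terms b*c and b*c*x correspond to two extra sets of loglinear parameters. As a sketch in Haberman's notation (the symbol on the left-hand side is assumed to be the cell probability), the expanded model would be:

```latex
\ln \pi^{ABCDX}_{ijklt}
  = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k}
    + \lambda^{D}_{l} + \lambda^{X}_{t}
    + \lambda^{AX}_{it} + \lambda^{BX}_{jt} + \lambda^{CX}_{kt} + \lambda^{DX}_{lt}
    + \lambda^{BC}_{jk} + \lambda^{BCX}_{jkt}
```

Setting $\lambda^{BC}_{jk}$ and $\lambda^{BCX}_{jkt}$ to zero recovers the conditional independence model.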

Concluding remarks
• LCA is a potentially valuable tool in clinical epidemiology for clarifying ill-defined diagnostic and prognostic classifications.
• Recent work brings LCA into SAS’ analytic framework.

In any approach to LCA, sensitivity to initial estimates requires caution
• Employ repeated solutions from different initial estimates
• E-M loop should iterate between 3 and 40 times
• Probe assumption of conditional independence
• At least four indicators needed
• Expanded model can account for dependence
