Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial • Latent Tree Models Part III: Learning Algorithms • Nevin L. Zhang • Dept. of Computer Science & Engineering, The Hong Kong Univ. of Sci. & Tech. • http://www.cse.ust.hk/~lzhang

Learning Latent Tree Models • To determine: • 1. Number of latent variables • 2. Cardinality of each latent variable • 3. Model structure • 4. Probability distributions • Model selection: items 1, 2, 3 • Parameter estimation: item 4

Light Bulb Illustration • Run interactive program “LightBulbIllustration.jar” • Illustrate the possibility of inferring latent variables and latent structures from observed co-occurrence patterns.

Part III: Learning Algorithms • Introduction • Search-based algorithms • Algorithms based on variable clustering • Distance-based algorithms • Empirical comparisons • Spectral methods for parameter estimation

Search Algorithms • A search algorithm explores the space of regular models, guided by a scoring function: • Start with an initial model • Iterate until the model score ceases to increase • Modify the current model in various ways to generate a list of candidate models. • Evaluate the candidate models using the scoring function. • Pick the best candidate model • What scoring function to use? How do we evaluate candidate models? • This is the model selection problem.
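The search loop above can be sketched generically. This is only a minimal illustration, not the tutorial's implementation: `hill_climb`, `candidates`, and `score` are hypothetical names, and integers stand in for models so that the example stays runnable.

```python
from typing import Callable, Iterable, TypeVar

M = TypeVar("M")

def hill_climb(initial: M,
               candidates: Callable[[M], Iterable[M]],
               score: Callable[[M], float]) -> M:
    """Greedy search: move to the best candidate until the score stops increasing."""
    current, current_score = initial, score(initial)
    while True:
        best, best_score = None, current_score
        for cand in candidates(current):      # generate and evaluate candidates
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                      # no candidate improves the score
            return current
        current, current_score = best, best_score

# Toy stand-in for a model space: integers whose neighbours are x-1 and x+1.
toy_best = hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
```

In a real latent tree learner, `candidates` would apply the search operators to the current model and `score` would run EM and return a model selection score such as BIC.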

Model Selection Criteria • Bayesian score: posterior probability P(m|D) • P(m|D) = P(m) P(D|m) / P(D) = P(m) ∫ P(D|m, θ) P(θ|m) dθ / P(D) • BIC score: large-sample approximation of the Bayesian score • BIC(m|D) = log P(D|m, θ*) – (d/2) log N • d: number of free parameters; N: sample size. • θ*: MLE of θ, estimated using the EM algorithm. • Likelihood term of BIC: measures how well the model fits the data. • Second term: penalty for model complexity. • The use of the BIC score indicates that we are looking for a model that fits the data well and, at the same time, is not overly complex.
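As a small worked example of the BIC formula, the sketch below computes the score from a maximized log-likelihood, a parameter count, and a sample size. The function name `bic_score` and all numbers are hypothetical; in practice the log-likelihood would come from running EM.

```python
import math

def bic_score(max_loglik: float, num_free_params: int, sample_size: int) -> float:
    """BIC(m|D) = log P(D|m, theta*) - (d/2) log N."""
    return max_loglik - 0.5 * num_free_params * math.log(sample_size)

# Hypothetical numbers: the more complex model fits slightly better,
# but its larger penalty gives it the lower BIC score.
simple_model = bic_score(max_loglik=-5200.0, num_free_params=21, sample_size=1000)
complex_model = bic_score(max_loglik=-5190.0, num_free_params=43, sample_size=1000)
```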

Model Selection Criteria • AIC (Akaike, 1974): AIC(m|D) = log P(D|m, θ*) – d • Holdout likelihood • Data => training set, validation set. • Model parameters are estimated based on the training set. • Quality of the model is measured using likelihood on the validation set. • Cross validation: too expensive

Search Algorithms • Double hill climbing (DHC), (Zhang 2002, 2004) • 7 manifest variables. • Single hill climbing (SHC), (Zhang and Kocka 2004) • 12 manifest variables • Heuristic SHC (HSHC), (Zhang and Kocka 2004) • 50 manifest variables • EAST, (Chen et al 2011) • 100+ manifest variables

Double Hill Climbing (DHC) • Two search procedures • One for model structure • One for cardinalities of latent variables. • Very inefficient. Tested only on data sets with 7 or fewer variables. (Zhang 2004) • DHC tested on synthetic and real-world data sets, together with BIC, AIC, and Holdout likelihood respectively. • Best models found when BIC was used. • So subsequent work based on BIC.

Single Hill Climbing (SHC) • Determines both model structure and cardinalities of latent variables using a single search procedure. • Uses five search operators • Node Introduction (NI) • Node Deletion (ND) • Node Relocation (NR) • State Introduction (SI) • State Deletion (SD)

Node Introduction (NI) • NI involves a latent variable Y and some of its neighbors • It introduces a new node Y’ to mediate Y and the neighbors. • The cardinality of Y’ is set at |Y| • Example: • Y2 introduced to mediate Y1 and its neighbors X1 and X2 • The cardinality of Y2 is set at |Y1|

Node Relocation (NR) • NR involves a latent variable Y, a neighbor Z of Y, and another neighbor Y’ of Y that is also a latent variable. • It relocates Z from Y to Y’. • Example: • X3 is relocated from Y1 to Y2

Node Deletion (ND) • ND involves a latent variable Y and a neighbor Y’ of Y that is also a latent variable. • It removes Y, and reconnects the other neighbors of Y to Y’. • Example: • Y2 is removed w.r.t. Y1.

State Introduction/Deletion • State introduction (SI) • Increase the number of states of a latent variable by 1 • State deletion (SD) • Reduce the number of states of a latent variable by 1.

Single Hill Climbing (SHC) • Start with an initial model (LCM) • At each step: • Construct all possible candidate models using NI, ND, NR, SI and SD • Evaluate them one by one • Pick the best one • Still inefficient • Tested on data with no more than 12 variables. • Reason • Too many candidate models • Too expensive to run EM on all of them

The EAST Algorithm • Scale up SHC • Idea 1: Restrict NI to involve only two neighbors of the latent variable it operates on

Reachability • With the restriction, how do we go from the model on the left to the model on the right? • First apply NI, and then NR

Idea 2: Reducing Number of Candidate Models • Do not use ALL the operators at once. • How? • BIC: BIC(m|D) = log P(D|m, θ*) – (d/2) log N • Improve the two terms alternately • NI and SI improve the likelihood term: • Let m’ be obtained from m using NI or SI • Then m’ includes m, hence its maximized likelihood is at least as high: log P(D|m’, θ’*) >= log P(D|m, θ*) • SD and ND reduce the penalty term.

The EAST Algorithm (Chen et al. AIJ 2011) • Start with a simple initial model • Repeat until model score ceases to improve • Expansion: • Search with node introduction (NI) and state introduction (SI) • Each NI operation is followed by NR operations to compensate for the restriction on NI. (See Slide 17, Reachability) • Adjustment: • Search with NR • Simplification: • Search with node deletion (ND) and state deletion (SD) • EAST: Expansion, Adjustment, Simplification until Termination

Idea 3: Parameter Value Inheritance • m: current model; • m’: candidate model generated by applying a search operator on m. • The two models share many parameters • m: (θ1, θ2); m’: (θ1, λ2); • When evaluating m’, inherit the values of the shared parameters θ1 from m, and estimate only the new parameters λ2: λ2* = arg max_{λ2} log P(D|m’, θ1, λ2)

Avoid Local Optimum at the Expansion Phase • NI: Increases structure complexity. • SI: Increases variable complexity. • Key Issue at the expansion phase: • Tradeoff between structure complexity and variable complexity

Operation Granularity • NI and SI are of different granularities • p = 100 • SI: 101 more parameters • NI: 2 more parameters • Huge disparity in granularity • Penalty term in BIC insufficient to handle • SI always preferred initially, • Quick increase in variable complexity • Leading to local optimum in model score
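The parameter counts on this slide can be reproduced under one reading of "p = 100". The slide does not state the setup, so the following is an assumption: the current model is a latent class model with 100 binary observed variables and one binary latent variable. The function names are hypothetical.

```python
def lcm_free_params(num_observed: int, latent_states: int, observed_states: int = 2) -> int:
    """Free parameters of a latent class model: P(Y) plus one P(X|Y) per observed X."""
    prior = latent_states - 1
    conditionals = num_observed * latent_states * (observed_states - 1)
    return prior + conditionals

def ni_extra_free_params(latent_states: int) -> int:
    """Extra parameters when NI adds Y' with |Y'| = |Y| between Y and two neighbours.

    Only P(Y'|Y) is new; P(X|Y') has the same size as the P(X|Y) it replaces.
    """
    return latent_states * (latent_states - 1)

# State introduction: the latent variable goes from 2 to 3 states.
si_increase = lcm_free_params(100, 3) - lcm_free_params(100, 2)
# Node introduction with a binary latent variable.
ni_increase = ni_extra_free_params(2)
```

Under this assumption the two increments come out as 101 and 2, matching the slide.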

Dealing with Operation Granularity • EAST does not use BIC when choosing between candidate models produced by NI and SI. • Instead, it uses the cost-effectiveness principle • That is, select the candidate model with the highest improvement ratio, the increase in model score per unit increase in model complexity: IR(m’, m|D) = [BIC(m’|D) – BIC(m|D)] / [d(m’) – d(m)], where d(·) is the number of free parameters. • The denominator is larger for operations that increase the number of model parameters more. • Can be justified using the Likelihood Ratio Test (LRT) • It picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to LRT.

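Likelihood Ratio Test (LRT) (source: Wikipedia) • Test statistic: D = 2 [log(likelihood of alternative model) – log(likelihood of null model)] • The alternative model includes the null model, and hence fits the data at least as well as the null model. • Whether it fits significantly better is determined by the p-value of the difference D, which approximately follows a chi-squared distribution with d2 – d1 degrees of freedom, where d1 and d2 are the numbers of free parameters of the null and alternative models.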

Likelihood Ratio Test (LRT) • The D value required for a given p-value increases roughly linearly with d2 – d1 • The ratio D/(d2 – d1) is closely related to the p-value • It is a measure of the strength of evidence in favor of the alternative model.

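Likelihood Ratio Test & Improvement Ratio • With m as the null model, m’ as the alternative, and d1, d2 their numbers of free parameters: [BIC(m’|D) – BIC(m|D)] / (d2 – d1) = [log P(D|m’, θ’*) – log P(D|m, θ*)] / (d2 – d1) – (1/2) log N • Second term is constant • First term is exactly ½ * D / (d2 – d1) • Loosely speaking, the cost-effectiveness principle picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to LRT.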
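The relation between the improvement ratio and the LRT statistic can be checked numerically. This is a sketch with hypothetical log-likelihoods and parameter counts; the improvement ratio is taken to be the BIC increase divided by the increase in the number of free parameters, and `improvement_ratio` and `lrt_statistic` are illustrative names.

```python
import math

def bic(loglik: float, d: int, n: int) -> float:
    return loglik - 0.5 * d * math.log(n)

def improvement_ratio(loglik_new, d_new, loglik_old, d_old, n):
    """Increase in BIC per additional free parameter."""
    return (bic(loglik_new, d_new, n) - bic(loglik_old, d_old, n)) / (d_new - d_old)

def lrt_statistic(loglik_alt: float, loglik_null: float) -> float:
    """D = 2 * (log-likelihood of alternative - log-likelihood of null)."""
    return 2.0 * (loglik_alt - loglik_null)

N = 1000                       # hypothetical sample size
current = (-5200.0, 201)       # (maximized log-likelihood, free parameters)
si_cand = (-4800.0, 302)       # hypothetical SI candidate: +101 parameters
ni_cand = (-5190.0, 203)       # hypothetical NI candidate: +2 parameters

ir_si = improvement_ratio(*si_cand, *current, N)
ir_ni = improvement_ratio(*ni_cand, *current, N)

# Same quantity via the LRT statistic: 0.5 * D / (d2 - d1) - 0.5 * log N
ir_si_via_lrt = 0.5 * lrt_statistic(si_cand[0], current[0]) / (si_cand[1] - current[1]) - 0.5 * math.log(N)
```

With these made-up numbers, plain BIC prefers the SI candidate while the improvement ratio prefers the NI candidate, which is the situation the cost-effectiveness principle is meant to handle.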

Search Process on Danish Beer Data

EAST used on medical survey data • A few dozen variables • Hundreds to thousands of observations • Model quality important • (Xu et al 2013)

Part III: Learning Algorithms • Introduction • Search-based algorithms • Algorithms based on variable clustering • Distance-based algorithms • Empirical comparisons • Spectral methods for parameter estimation

Algorithms Based on Variable Clustering • Key Idea • Group variables into clusters • Introduce a latent variable for each cluster • For discrete variables, mutual information is used as the similarity measure • Algorithms • BIN-G: Harmeling & Williams, PAMI 2011 • Bridged-islands (BI) algorithm: Liu et al. MLJ, 2013
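For reference, the empirical mutual information between two discrete variables can be computed from co-occurrence counts. This is a minimal sketch using natural logarithms; the function name and the toy data are illustrative.

```python
import math
from collections import Counter

def mutual_information(xs, ys) -> float:
    """Empirical MI(X; Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # (count/n) / ((px/n) * (py/n)) simplifies to count * n / (px * py)
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi

mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # fully dependent
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent
```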

The BIN-G Algorithm • Learns binary tree models • L ← all observed variables • Loop • Remove from L the pair of variables with the highest mutual information • Introduce a new latent variable as the parent of the pair • Add the new latent variable to L

Two Issues • Learn LCM: cardinality of the new latent variable and parameters • Let |H1|=1 and increase it gradually until termination • For each case, run EM to optimize model parameters and calculate the BIC score • Return the LCM with the highest BIC score • Determine MI between the new latent variable and others • Convert the new latent variable into an observed variable via imputation (hard assignment) • Then calculate MI(H1; X3), MI(H1; X4) • NOTE: if some latent variables have cardinality 1, they can be removed from the model, resulting in a forest instead of a tree.

Result of BIN-G on subset of 20 Newsgroups Dataset

The BI Algorithm • Learns non-binary trees. • Partitions all observed variables into clusters, with some clusters having >2 variables • Introduces a latent variable for each variable cluster • Links up the latent variables to get a global tree model • The result is a flat latent tree model in the sense that each latent variable is directly connected to at least one observed variable.

BI Step 1: Partition the Observed Variables • Identify a cluster of variables such that, • Variables in the cluster are closely correlated, and • The correlations can be properly modeled using a latent variable. • Remove the cluster and repeat the process. • Eventually obtain a partition of the observed variables.

Obtaining Variable Clusters • Sketch of algorithm for identifying the first variable cluster • L ← all observed variables • S ← pair of variables with highest mutual information • Loop • X ← variable in L with highest MI with S • S ← S ∪ {X}, L ← L \ {X} • Perform uni-dimensionality test on S • If the test fails, stop the loop and pick the first cluster of variables. • The procedure is repeated on the remaining variables to get more clusters.
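The loop above can be sketched as follows. Two things here are assumptions rather than part of the slide: the MI between a variable and the set S is taken to be its largest pairwise MI with a member of S, and the uni-dimensionality test is passed in as a callable (`ud_test`) instead of being implemented. The function returns the grown set S and whether every test passed; extracting the final cluster after a failed test is left to the caller.

```python
from itertools import combinations

def grow_cluster(variables, mi, ud_test):
    """Grow S from the highest-MI pair until the UD-test fails or variables run out."""
    remaining = list(variables)
    seed = max(combinations(remaining, 2), key=lambda pair: mi(*pair))
    cluster = list(seed)
    remaining = [v for v in remaining if v not in cluster]
    while remaining:
        # Assumption: MI between X and the set S is the max pairwise MI.
        x = max(remaining, key=lambda v: max(mi(v, s) for s in cluster))
        cluster.append(x)
        remaining.remove(x)
        if not ud_test(cluster):
            return cluster, False      # test failed on the current S
    return cluster, True

# Hypothetical pairwise MI values for five variables.
table = {frozenset(k): v for k, v in {
    ("X1", "X2"): 0.9, ("X1", "X3"): 0.5, ("X2", "X3"): 0.4, ("X1", "X4"): 0.6,
    ("X2", "X4"): 0.3, ("X3", "X4"): 0.2, ("X1", "X5"): 0.1, ("X2", "X5"): 0.1,
    ("X3", "X5"): 0.7, ("X4", "X5"): 0.1}.items()}
toy_mi = lambda a, b: table[frozenset((a, b))]
# Stand-in test that fails as soon as S has more than three variables.
grown, passed = grow_cluster(["X1", "X2", "X3", "X4", "X5"], toy_mi, lambda s: len(s) <= 3)
```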

Uni-Dimensionality (UD) Test • Test whether the correlations among variables in a set S can be properly modeled using a single latent variable • Example: S={X1, X2, X3, X4, X5} • Learn two models • m1: Best LCM, i.e., LTM with one latent variable • m2: Best LTM with two latent variables • Can be done using EAST • UD-test passes if and only if m2 is not significantly better than m1 • Rationale: if the use of two latent variables does not give a significantly better model, then the use of one latent variable is appropriate.

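Bayes Factor (source: Wikipedia) • The Bayes factor K compares two models M1 and M2 through the ratio of their marginal likelihoods P(D|M) • Unlike a likelihood-ratio test, the models do not need to be nested • Strength of evidence in favor of M2 depends on the value of K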

Bayes Factor and UD-Test (source: Wikipedia) • The statistic U = BIC(m2|D) – BIC(m1|D) is a large sample approximation of the logarithm of the Bayes factor of m2 against m1 • Strength of evidence in favor of two latent variables depends on U • In the UD-test, a threshold is set on U • Conclude a single latent variable if there is no strong evidence for >1 latent variables

UD-Test and Variable Cluster • Initially, S={X1, X2} • X3 and X4 are added to S, and the UD-test passes • Next add X5 • S = {X1, X2, X3, X4, X5}, • m2 is significantly better than m1 • UD-test fails • The first variable cluster is: {X1, X2, X4} • Picked because, of the two groups of variables in m2, it is the one that contains the initial variables X1 and X2

BI Step 2: Latent Variable Introduction • Introduce a latent variable for each variable cluster. • Optimize the cardinalities of latent variables and parameters

BI Step 3: Link up Latent Variables • Bridging the “islands” using Chow-Liu’s Algorithm (1968) • Estimate the joint distribution of each pair of latent variables Y and Y’: P(Y, Y’|D, m, m’) = C Σ_{d∈D} P(Y|d, m) P(Y’|d, m’), where C is a normalization constant and m and m’ are the LCMs that contain Y and Y’ respectively. • Calculate MI(Y;Y’) • Find the maximum spanning tree for MI values
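The last step, finding a maximum spanning tree over pairwise MI values, can be sketched with Prim's algorithm. The MI values below are hypothetical, and `maximum_spanning_tree` is an illustrative helper, not code from the BI implementation.

```python
def maximum_spanning_tree(nodes, weight):
    """Prim's algorithm for a maximum spanning tree; weight(a, b) must be symmetric."""
    in_tree = [nodes[0]]
    outside = list(nodes[1:])
    edges = []
    while outside:
        # Pick the heaviest edge that connects the tree to a node outside it.
        a, b = max(((u, v) for u in in_tree for v in outside),
                   key=lambda e: weight(*e))
        edges.append((a, b))
        in_tree.append(b)
        outside.remove(b)
    return edges

# Hypothetical MI values between four latent variables; unlisted pairs get 0.05.
mi_values = {frozenset(("Y1", "Y2")): 0.5, frozenset(("Y2", "Y3")): 0.4,
             frozenset(("Y1", "Y3")): 0.1, frozenset(("Y3", "Y4")): 0.3}
latent_mi = lambda a, b: mi_values.get(frozenset((a, b)), 0.05)
tree_edges = maximum_spanning_tree(["Y1", "Y2", "Y3", "Y4"], latent_mi)
```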

BI Step 4: Global Adjustment • Improvement based on global consideration • Run EM to optimize parameters for the whole model • For each latent variable Y and each observed variable X, calculate their joint distribution under the whole model • Re-estimate MI(Y; X) based on the above distribution • Let Y* be the latent variable with the highest MI(Y; X) • If Y* is not currently the neighbor of X, make it so.

Result of BI on subset of 20 Newsgroups Dataset

Part III: Learning Algorithms • Introduction • Search-based algorithms • Algorithms based on variable clustering • Distance-based algorithms • Empirical comparisons • Spectral methods for parameter estimation

Distance-Based Algorithms • Define a distance between variables that is additive over trees • Estimate distances between observed variables from data • Infer model structure from those distance estimates • Assumptions: • Latent variables have equal cardinality, and it is known. • In some cases, it equals the cardinality of observed variables. • Or, all variables are continuous. • Focus on two algorithms • Recursive grouping (Choi et al, JMLR 2011) • Neighbor Joining (Saitou & Nei, 1987, Studier and Keppler, 1988) • Slides based on Choi et al, 2010: www.ece.nus.edu.sg/stfpage/vtan/latentTree_slides.pdf

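Information Distance • Information distance between two discrete variables Xi and Xj (Lake 1994): d_ij = –log ( |det J_ij| / sqrt(det M_i det M_j) ), where J_ij is the joint probability matrix of Xi and Xj, and M_i is the diagonal matrix of the marginal probabilities of Xi • When both variables are binary: d_ij = –log |ρ_ij|, where ρ_ij is the correlation coefficient of Xi and Xj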
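A small numerical check for binary variables, assuming the determinant-based definition d_ij = –log(|det J_ij| / sqrt(det M_i det M_j)) from Choi et al., with J_ij the joint probability matrix and M_i the diagonal matrix of marginals. The chain X1 → X2 → X3 and its probabilities are hypothetical; the sketch computes the three pairwise distances so that additivity along the chain can be inspected.

```python
import math

def information_distance(joint) -> float:
    """joint[a][b] = P(Xi=a, Xj=b) for binary Xi, Xj."""
    det_joint = joint[0][0] * joint[1][1] - joint[0][1] * joint[1][0]
    pi = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]   # marginal of Xi
    pj = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]   # marginal of Xj
    return -math.log(abs(det_joint) / math.sqrt(pi[0] * pi[1] * pj[0] * pj[1]))

# Hypothetical Markov chain X1 -> X2 -> X3 over binary variables.
p1 = [0.6, 0.4]
t12 = [[0.8, 0.2], [0.3, 0.7]]     # t12[a][b] = P(X2=b | X1=a)
t23 = [[0.9, 0.1], [0.25, 0.75]]   # t23[b][c] = P(X3=c | X2=b)

j12 = [[p1[a] * t12[a][b] for b in range(2)] for a in range(2)]
p2 = [j12[0][b] + j12[1][b] for b in range(2)]
j23 = [[p2[b] * t23[b][c] for c in range(2)] for b in range(2)]
j13 = [[sum(j12[a][b] * t23[b][c] for b in range(2)) for c in range(2)] for a in range(2)]

d12, d23, d13 = (information_distance(j) for j in (j12, j23, j13))
```

On this chain, `d13` agrees with `d12 + d23` up to floating-point error, which is the additivity property the distance-based algorithms rely on.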

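Additivity of Information Distance on Trees • Erdos, Szekely, Steel, & Warnow, 1999 • The information distance between two nodes equals the sum of the information distances along the edges of the path between them.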

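Testing Node Relationships • If j is a leaf and i is its parent, then the information distances satisfy d_jk = d_ij + d_ik for every other node k, i.e., d_jk – d_ik = d_ij • This implies the difference is a constant. • It does not change with k. • The equality does not hold • if j is not a leaf, or • i is not the parent of j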

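Testing Node Relationships • If i and j are leaves with the same parent, the paths from i and from j to any other node k differ only in their first edges, so d_ik – d_jk is a constant. • It does not change with k. • It is between –d_ij and d_ij • This property allows us to determine leaf nodes that are siblings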
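Both tests can be tried on exact additive distances. The tree and its edge lengths below are hypothetical, and `relation` is an illustrative helper rather than the recursive grouping procedure itself. It uses the difference d_jk – d_ik: a leaf j with parent i gives d_ij for every k, and two sibling leaves give a constant strictly between –d_ij and d_ij.

```python
# Hypothetical tree: latent nodes h1, h2; leaves a, b under h1 and c, d under h2.
edges = {("h1", "a"): 0.5, ("h1", "b"): 0.7, ("h1", "h2"): 0.4,
         ("h2", "c"): 0.3, ("h2", "d"): 0.6}

def tree_distances(edges):
    """All-pairs path lengths in a tree (distances are additive along paths)."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    dist = {}
    for src in adj:
        dist[src] = {src: 0.0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u].items():
                if v not in dist[src]:
                    dist[src][v] = dist[src][u] + w
                    stack.append(v)
    return dist

def relation(i, j, nodes, d, tol=1e-9):
    """Classify the pair (i, j) from the differences d_jk - d_ik over all other k."""
    diffs = [d[j][k] - d[i][k] for k in nodes if k not in (i, j)]
    if all(abs(x - d[i][j]) < tol for x in diffs):
        return "i is parent of leaf j"
    if all(abs(x + d[i][j]) < tol for x in diffs):
        return "j is parent of leaf i"
    if max(diffs) - min(diffs) < tol and abs(diffs[0]) < d[i][j] - tol:
        return "leaf siblings"
    return "neither"

dist = tree_distances(edges)
nodes = sorted(dist)
```

For example, `relation("h1", "a", nodes, dist)` reports a parent and leaf pair, while `relation("a", "b", nodes, dist)` reports sibling leaves. With distances estimated from data, the exact equality checks would need to be replaced by tolerance-based tests.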
