AAAI 2014 Tutorial. Latent Tree Models Part III: Learning Algorithms. Nevin L. Zhang, Dept. of Computer Science & Engineering, The Hong Kong Univ. of Sci. & Tech. http://www.cse.ust.hk/~lzhang
Learning Latent Tree Models To determine: 1. Number of latent variables 2. Cardinality of each latent variable 3. Model structure 4. Probability distributions • Model selection: items 1, 2, 3 • Parameter estimation: item 4
Light Bulb Illustration • Run interactive program “LightBulbIllustration.jar” • Illustrate the possibility of inferring latent variables and latent structures from observed co-occurrence patterns.
Part III: Learning Algorithms • Introduction • Search-based algorithms • Algorithms based on variable clustering • Distance-based algorithms • Empirical comparisons • Spectral methods for parameter estimation
Search Algorithms • A search algorithm explores the space of regular models, guided by a scoring function: • Start with an initial model. • Iterate until the model score ceases to increase: (a) modify the current model in various ways to generate a list of candidate models; (b) evaluate the candidate models using the scoring function; (c) pick the best candidate model. • What scoring function should be used? How do we evaluate candidate models? • This is the model selection problem.
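The search loop described on this slide can be sketched generically. This is a minimal illustration, not the tutorial's implementation: `hill_climb`, `neighbors`, and `score` are hypothetical names, and the "models" in the toy run are plain integers rather than latent tree models.

```python
from typing import Callable, Iterable, TypeVar

M = TypeVar("M")


def hill_climb(initial: M,
               neighbors: Callable[[M], Iterable[M]],
               score: Callable[[M], float]) -> M:
    """Greedy search: move to the best candidate until no candidate improves the score."""
    current, current_score = initial, score(initial)
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            # The score has ceased to increase, so stop.
            return current
        current, current_score = best, best_score


# Toy stand-in (assumption): a "model" is an integer and the score peaks at 7.
def toy_score(m: int) -> float:
    return -(m - 7) ** 2


def toy_neighbors(m: int) -> list:
    return [m - 1, m + 1]


result = hill_climb(0, toy_neighbors, toy_score)
```

In a real latent tree search, `neighbors` would apply the search operators to the current model and `score` would run EM and compute a model selection score, which is where almost all of the cost lies.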
Model Selection Criteria • Bayesian score: posterior probability P(m|D): P(m|D) = P(m) P(D|m) / P(D) = P(m) ∫ P(D|m, θ) P(θ|m) dθ / P(D) • BIC score: large sample approximation of the Bayesian score: BIC(m|D) = log P(D|m, θ*) – (d/2) log N • d: number of free parameters; N: sample size. • θ*: MLE of θ, estimated using the EM algorithm. • Likelihood term of BIC: measures how well the model fits the data. • Second term: penalty for model complexity. • The use of the BIC score indicates that we are looking for a model that fits the data well and, at the same time, is not overly complex.
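As a small worked example of the BIC formula above, the sketch below counts the free parameters of a latent class model (one latent variable with all observed variables as children) and applies the penalty. The function names are hypothetical, and the log-likelihood value in the example is made up for illustration; in practice it would come from EM.

```python
import math
from typing import Sequence


def lcm_free_parameters(latent_card: int, observed_cards: Sequence[int]) -> int:
    """Free parameters of an LCM: P(Y) plus P(X_i | Y) for every observed X_i."""
    prior = latent_card - 1
    conditionals = sum(latent_card * (card - 1) for card in observed_cards)
    return prior + conditionals


def bic(loglik: float, d: int, n: int) -> float:
    """BIC(m|D) = log P(D|m, theta*) - (d/2) log N."""
    return loglik - 0.5 * d * math.log(n)


# Hypothetical numbers: a 2-state latent variable over four binary observed variables.
d_small = lcm_free_parameters(2, [2, 2, 2, 2])   # 1 + 4 * 2 * 1
d_large = lcm_free_parameters(3, [2, 2, 2, 2])   # 2 + 4 * 3 * 1
score_small = bic(-1000.0, d_small, 500)
score_large = bic(-998.0, d_large, 500)          # slightly better fit, more parameters
```

With these illustrative numbers, the larger model gains 2 units of log-likelihood but pays for 5 extra parameters, so its BIC score is lower.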
Model Selection Criteria • AIC (Akaike, 1974): AIC(m|D) = log P(D|m, θ*) – d • Holdout likelihood • Data => training set, validation set. • Model parameters are estimated on the training set. • Quality of the model is measured by its likelihood on the validation set. • Cross validation: too expensive
Search Algorithms • Double hill climbing (DHC), (Zhang 2002, 2004) • 7 manifest variables. • Single hill climbing (SHC), (Zhang and Kocka 2004) • 12 manifest variables • Heuristic SHC (HSHC), (Zhang and Kocka 2004) • 50 manifest variables • EAST, (Chen et al 2011) • 100+ manifest variables
Double Hill Climbing (DHC) • Two search procedures • One for model structure • One for cardinalities of latent variables. • Very inefficient. Tested only on data sets with 7 or fewer variables. (Zhang 2004) • DHC tested on synthetic and real-world data sets, together with BIC, AIC, and Holdout likelihood respectively. • Best models found when BIC was used. • So subsequent work based on BIC.
Single Hill Climbing (SHC) • Determines both model structure and cardinalities of latent variables using a single search procedure. • Uses five search operators • Node Introduction (NI) • Node Deletion (ND) • Node Relocation (NR) • State Introduction (SI) • State Deletion (SD)
Node Introduction (NI) • NI involves a latent variable Y and some of its neighbors. • It introduces a new node Y’ to mediate between Y and those neighbors. • The cardinality of Y’ is set at |Y|. • Example: • Y2 is introduced to mediate between Y1 and its neighbors X1 and X2. • The cardinality of Y2 is set at |Y1|.
Node Relocation (NR) • NR involves a latent variable Y, a neighbor Z of Y, and another neighbor Y’ of Y that is also a latent variable. • It relocates Z from Y to Y’. • Example: • X3 is relocated from Y1 to Y2
Node Deletion (ND) • ND involves a latent variable Y and a neighbor Y’ of Y that is also a latent variable. • It removes Y and reconnects the other neighbors of Y to Y’. • Example: • Y2 is removed with respect to Y1.
State Introduction/Deletion • State introduction (SI) • Increase the number of states of a latent variable by 1 • State deletion (SD) • Reduce the number of states of a latent variable by 1.
Single Hill Climbing (SHC) • Start with an initial model (LCM) • At each step: • Construct all possible candidate models using NI, ND, NR, SI and SD • Evaluate them one by one • Pick the best one • Still inefficient • Tested on data with no more than 12 variables. • Reason • Too many candidate models • Too expensive to run EM on all of them
The EAST Algorithm • Scale up SHC • Idea 1: Restrict NI to involve only two neighbors of the latent variable it operates on
Reachability • With this restriction, how do we get from the model on the left to the model on the right? • First apply NI, and then NR.
Idea 2: Reducing Number of Candidate Models • Do not use ALL the operators at once. • How? • BIC: BIC(m|D) = log P(D|m, θ*) – (d/2) log N • Improve the two terms alternately. • NI and SI improve the likelihood term: • Let m’ be obtained from m using NI or SI. • Then m’ includes m, and hence has a maximized likelihood at least as high: log P(D|m’, θ’*) >= log P(D|m, θ*) • SD and ND reduce the penalty term.
The EAST Algorithm (Chen et al. AIJ 2011) • Start with a simple initial model • Repeat until model score ceases to improve • Expansion: • Search with node introduction (NI) and state introduction (SI) • Each NI operation is followed by NR operations to compensate for the restriction on NI. (See the Reachability slide.) • Adjustment: • Search with NR • Simplification: • Search with node deletion (ND) and state deletion (SD) • EAST: Expansion, Adjustment, Simplification until Termination
Idea 3: Parameter Value Inheritance • m: current model. • m’: candidate model generated by applying a search operator to m. • The two models share many parameters. • m: (θ1, θ2); m’: (θ1, λ2). • When evaluating m’, inherit the values of the shared parameters θ1 from m, and estimate only the new parameters λ2: λ2* = arg max over λ2 of log P(D|m’, θ1, λ2)
Avoid Local Optimum at the Expansion Phase • NI: Increases structure complexity. • SI: Increases variable complexity. • Key Issue at the expansion phase: • Tradeoff between structure complexity and variable complexity
Operation Granularity • NI and SI are of different granularities • p = 100 • SI: 101 more parameters • NI: 2 more parameters • Huge disparity in granularity • Penalty term in BIC insufficient to handle • SI always preferred initially, • Quick increase in variable complexity • Leading to local optimum in model score
Dealing with Operation Granularity • EAST does not use BIC when choosing between candidate models produced by NI and SI. • Instead, it uses the cost-effectiveness principle. • That is, select the candidate model with the highest improvement ratio: the increase in model score per unit increase in model complexity. • The denominator is larger for operations that increase the number of model parameters more. • Can be justified using the Likelihood Ratio Test (LRT) • It picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to LRT.
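The difference between picking by raw BIC and picking by improvement ratio can be seen in a small sketch. It assumes the ratio is the BIC gain divided by the increase in the number of free parameters; the `Candidate` type, function names, and all numbers are hypothetical and chosen only to mirror the 101-versus-2 parameter disparity from the Operation Granularity slide.

```python
import math
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    loglik: float  # maximized log-likelihood of the candidate model
    d: int         # number of free parameters of the candidate model


def bic(loglik: float, d: int, n: int) -> float:
    return loglik - 0.5 * d * math.log(n)


def improvement_ratio(cand: Candidate, cur_loglik: float, cur_d: int, n: int) -> float:
    """Increase in BIC per additional free parameter (assumes cand.d > cur_d)."""
    gain = bic(cand.loglik, cand.d, n) - bic(cur_loglik, cur_d, n)
    return gain / (cand.d - cur_d)


def pick_by_ratio(cands: List[Candidate], cur_loglik: float, cur_d: int, n: int) -> Candidate:
    return max(cands, key=lambda c: improvement_ratio(c, cur_loglik, cur_d, n))


# Hypothetical current model and two expansion candidates.
N, CUR_LOGLIK, CUR_D = 1000, -5000.0, 50
candidates = [
    Candidate("SI", loglik=-4500.0, d=CUR_D + 101),  # large step, many parameters
    Candidate("NI", loglik=-4980.0, d=CUR_D + 2),    # small step, few parameters
]
best_by_bic = max(candidates, key=lambda c: bic(c.loglik, c.d, N))
best_by_ratio = pick_by_ratio(candidates, CUR_LOGLIK, CUR_D, N)
```

With these made-up numbers, the SI candidate has the larger total BIC gain, while the NI candidate gains more per added parameter, so the two criteria disagree.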
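Likelihood Ratio Test (LRT) • Test statistic: D = 2 × (maximized log-likelihood of the alternative model – maximized log-likelihood of the null model). • The alternative model includes the null model, and hence fits the data at least as well as the null model. • Whether it fits significantly better is determined by the p-value of the difference D, which approximately follows a chi-squared distribution with d2 – d1 degrees of freedom, where d1 and d2 are the numbers of free parameters of the null and alternative models. (Source: Wikipedia)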
Likelihood Ratio Test (LRT) • The D value required for a given p-value increases roughly linearly with d2 – d1. • The ratio D/(d2 – d1) is closely related to the p-value. • It is a measure of the strength of evidence in favor of the alternative model.
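Likelihood Ratio Test & Improvement Ratio • Improvement ratio of a candidate m’ (with d2 free parameters) over the current model m (with d1 free parameters): [BIC(m’|D) – BIC(m|D)] / (d2 – d1) = [log P(D|m’, θ’*) – log P(D|m, θ*)] / (d2 – d1) – (1/2) log N • The second term is constant. • The first term is exactly ½ × D / (d2 – d1), where D is the LRT statistic. • Loosely speaking, the cost-effectiveness principle picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to LRT.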
EAST used on medical survey data • A few dozen variables • Hundreds to thousands of observations • Model quality is important • (Xu et al 2013)
Part III: Learning Algorithms • Introduction • Search-based algorithms • Algorithms based on variable clustering • Distance-based algorithms • Empirical comparisons • Spectral methods for parameter estimation
Algorithms based on Variable Clustering • Key Idea • Group variables into clusters • Introduce a latent variable for each cluster • For discrete variables, mutual information is used as the similarity measure • Algorithms • BIN-G: Harmeling & Williams, PAMI 2011 • Bridged-islands (BI) algorithm: Liu et al. MLJ, 2013
The BIN-G Algorithm • Learns binary tree models • L ← all observed variables • Loop • Remove from L the pair of variables with the highest mutual information • Introduce a new latent variable as the parent of the pair • Add the new latent variable to L
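The pair-selection step of the loop above relies on empirical mutual information between discrete variables. A minimal sketch, assuming the data is given as a dictionary from variable name to a list of observed values (the names `mutual_information` and `closest_pair` and the toy data are illustrative only):

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def mutual_information(xs: List[int], ys: List[int]) -> float:
    """Empirical mutual information (in nats) between two discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def closest_pair(data: Dict[str, List[int]]) -> Tuple[str, str]:
    """Return the pair of variable names with the highest empirical MI."""
    return max(combinations(sorted(data), 2),
               key=lambda p: mutual_information(data[p[0]], data[p[1]]))


# Toy data (assumption): X2 copies X1, while X3 is independent of both.
data = {
    "X1": [0, 0, 1, 1, 0, 1, 0, 1],
    "X2": [0, 0, 1, 1, 0, 1, 0, 1],
    "X3": [0, 1, 0, 1, 0, 1, 1, 0],
}
pair = closest_pair(data)
```

BIN-G would then learn a latent class model over the selected pair, impute the new latent variable, and repeat the same selection with that variable added to the pool.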
Two Issues • Learn LCM: cardinality of the new latent variable and parameters • Let |H1|=1 and increase it gradually until termination • For each case, run EM to optimize model parameters and calculate the BIC score • Return the LCM with the highest BIC score • Determine MI between the new latent variable and others • Convert the new latent variable into an observed variable via imputation (hard assignment) • Then calculate MI(H1; X3), MI(H1; X4) • NOTE: if some latent variables have cardinality 1, they can be removed from the model, resulting in a forest instead of a tree.
The BI Algorithm • Learns non-binary trees. • Partitions all observed variables into clusters, with some clusters having >2 variables • Introduces a latent variable for each variable cluster • Links up the latent variables to get a global tree model • The result is a flat latent tree model, in the sense that each latent variable is directly connected to at least one observed variable.
BI Step 1: Partition the Observed Variables • Identify a cluster of variables such that, • Variables in the cluster are closely correlated, and • The correlations can be properly modeled using a latent variable. • Remove the cluster and repeat the process. • Eventually obtain a partition of the observed variables.
Obtaining Variable Clusters • Sketch of the algorithm for identifying the first variable cluster • L ← all observed variables • S ← pair of variables with highest mutual information • Loop • X ← variable in L with highest MI with S • S ← S ∪ {X}, L ← L \ {X} • Perform the uni-dimensionality test on S • If the test fails, stop the loop and pick the first cluster of variables. • The procedure is repeated on the remaining variables to get more clusters.
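The growth loop above can be sketched as follows. Two points are assumptions rather than something stated on the slide: the MI between a variable X and the set S is taken as the maximum MI between X and any member of S, and the uni-dimensionality test is passed in as a black-box callback `ud_test`. The function name and toy MI values are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def grow_cluster(variables: List[str],
                 mi: Dict[FrozenSet[str], float],
                 ud_test: Callable[[List[str]], bool]) -> Tuple[List[str], bool]:
    """Grow the working set S until the UD-test fails or no variables remain.

    Returns (S, passed), where passed is False if the last UD-test failed.
    """
    remaining = set(variables)
    first_pair = max(mi, key=lambda pair: mi[pair])   # pair with highest MI
    working = sorted(first_pair)
    remaining -= set(working)
    while remaining:
        # Assumption: MI(X, S) = max over Z in S of MI(X; Z).
        nxt = max(sorted(remaining),
                  key=lambda v: max(mi[frozenset((v, s))] for s in working))
        working.append(nxt)
        remaining.remove(nxt)
        if not ud_test(working):
            return working, False
    return working, True


# Toy MI table (assumption) and a stand-in UD-test that fails once X4 joins.
toy_mi = {
    frozenset(("X1", "X2")): 0.9, frozenset(("X1", "X3")): 0.5,
    frozenset(("X2", "X3")): 0.4, frozenset(("X1", "X4")): 0.1,
    frozenset(("X2", "X4")): 0.1, frozenset(("X3", "X4")): 0.2,
}
cluster, passed = grow_cluster(["X1", "X2", "X3", "X4"], toy_mi,
                               lambda s: "X4" not in s)
```

The sketch stops at the point where the test fails; choosing the actual cluster from the two-latent-variable model learned in the failing test is not shown.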
Uni-Dimensionality (UD) Test • Test whether the correlations among variables in a set S can be properly modeled using a single latent variable • Example: S={X1, X2, X3, X4, X5} • Learn two models • m1: Best LCM, i.e., LTM with one latent variable • m2: Best LTM with two latent variables • Can be done using EAST • UD-test passes if and only if BIC(m2|D) – BIC(m1|D) ≤ δ for a threshold δ. • If the use of two latent variables does not give a significantly better model, then the use of one latent variable is appropriate.
Bayes Factor • Unlike in a likelihood-ratio test, the models do not need to be nested. • Strength of evidence in favor of M2 depends on the value of K. (Source: Wikipedia)
Bayes Factor and UD-Test • The statistic U = BIC(m2|D) – BIC(m1|D) is a large sample approximation of the log Bayes factor of m2 against m1. • Strength of evidence in favor of two latent variables depends on U. • In the UD-test, a threshold on U is usually fixed in advance. • Conclude a single latent variable if there is no strong evidence for more than one latent variable. (Source: Wikipedia)
UD-Test and Variable Cluster • Initially, S={X1, X2} • X3, X4 added to S, and UD-test passes • Next add X5 • S = {X1, X2, X3, X4, X5}, • m2 is significantly better than m1 • UD-test fails • The first variable cluster is: {X1, X2, X4} • Picked because it contains the initial variables X1 and X2
BI Step 2: Latent Variable Introduction • Introduce a latent variable for each variable cluster. • Optimize the cardinalities of the latent variables and the parameters.
BI Step 3: Link up Latent Variables • Bridge the “islands” using Chow-Liu’s algorithm (1968) • Estimate the joint distribution of each pair of latent variables Y and Y’: m and m’ are the LCMs that contain Y and Y’ respectively. • Calculate MI(Y;Y’) • Find the maximum spanning tree for the MI values
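The last step, finding a maximum spanning tree over pairwise MI values, can be sketched with Kruskal's algorithm and a small union-find. The function name and the MI values below are hypothetical; estimating MI(Y;Y’) itself is not shown.

```python
from typing import Dict, List, Tuple


def maximum_spanning_tree(nodes: List[str],
                          weights: Dict[Tuple[str, str], float]) -> List[Tuple[str, str]]:
    """Kruskal's algorithm on edges sorted by decreasing weight."""
    parent = {v: v for v in nodes}

    def find(v: str) -> str:
        # Find the component representative, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for (u, v), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding the edge does not create a cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree


# Hypothetical MI values between three latent variables.
mi_values = {("Y1", "Y2"): 0.30, ("Y2", "Y3"): 0.25, ("Y1", "Y3"): 0.05}
edges = maximum_spanning_tree(["Y1", "Y2", "Y3"], mi_values)
```

Here the weakest edge, Y1 to Y3, is dropped because the two stronger edges already connect all three latent variables.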
BI Step 4: Global Adjustment • Improvement based on global consideration • Run EM to optimize parameters for the whole model • For each latent variable Y and each observed variable X, calculate: • Re-estimate MI(Y; X) based on the above distribution • Let Y* be the latent variable with the highest MI(Y; X) • If Y* is not currently the neighbor of X, make it so.
Part III: Learning Algorithms • Introduction • Search-based algorithms • Algorithms based on variable clustering • Distance-based algorithms • Empirical comparisons • Spectral methods for parameter estimation
Distance-Based Algorithms • Define distances between variables that are additive over trees • Estimate distances between observed variables from data • Infer model structure from those distance estimates • Assumptions: • Latent variables have equal cardinality, and it is known. • In some cases, it equals the cardinality of observed variables. • Or, all variables are continuous. • Focus on two algorithms • Recursive grouping (Choi et al, JMLR 2011) • Neighbor Joining (Saitou & Nei, 1987, Studier and Keppler, 1988) • Slides based on Choi et al, 2010: www.ece.nus.edu.sg/stfpage/vtan/latentTree_slides.pdf
Information Distance • Information distance between two discrete variables Xi and Xj (Lake 1994) • When both variables are binary:
Additivity of Information Distance on Trees • Erdos, Szekely, Steel, & Warnow, 1999
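Additivity can be checked numerically on a three-node chain. The sketch below assumes the determinant-based definition of information distance used by Choi et al. (2011), d(i, j) = –log( |det J| / sqrt(det Mi · det Mj) ), where J is the joint probability matrix of Xi and Xj and Mi, Mj are diagonal matrices of their marginals. The function names and the probability tables are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def info_distance(joint: Matrix) -> float:
    """Information distance for two binary variables; joint[a][b] = P(Xi=a, Xj=b)."""
    det_j = joint[0][0] * joint[1][1] - joint[0][1] * joint[1][0]
    pi = [sum(row) for row in joint]                       # marginal of Xi
    pj = [joint[0][b] + joint[1][b] for b in range(2)]     # marginal of Xj
    # det of a diagonal marginal matrix is the product of the marginal probabilities.
    return -math.log(abs(det_j) / math.sqrt(pi[0] * pi[1] * pj[0] * pj[1]))


def chain_joints(p1: List[float], t12: Matrix, t23: Matrix):
    """Pairwise joints of a Markov chain X1 - X2 - X3 (t[a][b] = P(next=b | prev=a))."""
    j12 = [[p1[a] * t12[a][b] for b in range(2)] for a in range(2)]
    p2 = [j12[0][b] + j12[1][b] for b in range(2)]
    j23 = [[p2[b] * t23[b][c] for c in range(2)] for b in range(2)]
    j13 = [[sum(j12[a][b] * t23[b][c] for b in range(2)) for c in range(2)]
           for a in range(2)]
    return j12, j23, j13


# Illustrative chain parameters (assumption).
j12, j23, j13 = chain_joints([0.6, 0.4],
                             [[0.9, 0.1], [0.2, 0.8]],
                             [[0.7, 0.3], [0.1, 0.9]])
d12, d23, d13 = info_distance(j12), info_distance(j23), info_distance(j13)
```

Since X2 lies on the path between X1 and X3, the distance d13 equals d12 + d23 up to floating-point error.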
Testing Node Relationships • This implies the difference is a constant. • It does not change with k. • Equality not true • if j is not a leaf, or • i is not the parent of j
Testing Node Relationships • This implies the difference is a constant. • It does not change with k. • It lies strictly between –d(i, j) and d(i, j), where d(i, j) is the information distance between i and j. • This property allows us to determine leaf nodes that are siblings
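The two relationship tests can be combined into one check on the difference d(i, k) – d(j, k) over all other observed nodes k. This is a minimal sketch assuming exact additive distances (with estimated distances a tolerance would need tuning); the function name, the returned labels, and the toy tree are hypothetical.

```python
from typing import Dict, FrozenSet, List


def relationship(i: str, j: str,
                 d: Dict[FrozenSet[str], float],
                 others: List[str],
                 tol: float = 1e-9) -> str:
    """Classify the pair (i, j) from the differences d(i,k) - d(j,k) over other nodes k."""
    dij = d[frozenset((i, j))]
    phis = [d[frozenset((i, k))] - d[frozenset((j, k))] for k in others]
    if max(phis) - min(phis) > tol:
        return "neither"                       # the difference changes with k
    phi = phis[0]
    if abs(phi - dij) <= tol:
        return "i is a leaf, j is its parent"  # d(i,k) = d(i,j) + d(j,k)
    if abs(phi + dij) <= tol:
        return "j is a leaf, i is its parent"
    return "leaf siblings"                     # constant, strictly inside (-dij, dij)


# Toy tree (assumption): leaves X1, X2 under hidden H1; leaves X3, X4 under hidden H2.
# Edge lengths: X1-H1 = 1.0, X2-H1 = 2.0, H1-H2 = 0.5, X3-H2 = 1.5, X4-H2 = 1.0.
dist = {
    frozenset(("X1", "X2")): 3.0, frozenset(("X1", "X3")): 3.0,
    frozenset(("X1", "X4")): 2.5, frozenset(("X2", "X3")): 4.0,
    frozenset(("X2", "X4")): 3.5, frozenset(("X3", "X4")): 2.5,
}
same_parent = relationship("X1", "X2", dist, ["X3", "X4"])
different_parent = relationship("X1", "X3", dist, ["X2", "X4"])
```

For X1 and X2 the difference is –1.0 for both choices of k, which is constant and inside (–3.0, 3.0), so they are reported as siblings; for X1 and X3 the difference changes with k.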