AAAI 2014 Tutorial. Latent Tree Models Part IV: Applications. Nevin L. Zhang Dept. of Computer Science & Engineering The Hong Kong Univ. of Sci. & Tech. http://www.cse.ust.hk/~lzhang. Applications of Latent Tree Analysis (LTA). What can LTA be used for:
AAAI 2014 Tutorial: Latent Tree Models, Part IV: Applications. Nevin L. Zhang, Dept. of Computer Science & Engineering, The Hong Kong Univ. of Sci. & Tech. http://www.cse.ust.hk/~lzhang
Applications of Latent Tree Analysis (LTA) • What can LTA be used for: • Discovery of co-occurrence patterns in binary data • Discovery of correlation patterns in general discrete data • Discovery of latent variable/structures • Multidimensional clustering • Topic detection in text data • Probabilistic modelling • Applications • Analysis of survey data • Market survey data, social survey, medical survey data • Analysis of text data • Topic detection • Approximate probabilistic inference
Part IV: Applications • Approximate Inference in Bayesian Networks • Analysis of social survey data • Topic detection in text data • Analysis of medical symptom survey data • Software
LTMs for Probabilistic Modelling • Attractive Representation of Joint Distributions • Computationally very simple to work with. • Represent complex relationships among observed variables. • What does the structure look like without the latent variables?
Approximate Inference in Bayesian Networks • In a Bayesian network over observed variables, exact inference can be computationally prohibitive. • Two-phase approximate inference (Wang et al. AAAI 2008): • Offline • Sample a data set from the original network • Learn a latent tree model (secondary representation) from the sample • Online • Make inferences using the latent tree model (fast).
Empirical Evaluations • Alternatives • LTM (1k), LTM (10k), LTM (100k): with different sample sizes for Phase 1. • CL (100k): Phase 1 learns a Chow-Liu tree • LCM (100k): Phase 1 learns a latent class model • Loopy Belief Propagation (LBP) • Original networks • ALARM, INSURANCE, MILDEW, BARLEY, etc. • Evaluation: • 500 random queries • Quality of approximation measured using KL divergence from the exact answer.
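The evaluation metric above can be sketched as follows. This is a minimal illustration, assuming each query answer is a discrete posterior distribution and that the divergence is taken as KL(exact || approximate); the function names and toy numbers are hypothetical, not from the tutorial.

```python
import math

def kl_divergence(exact, approx, eps=1e-12):
    """KL(exact || approx) for two discrete distributions over the same states."""
    if len(exact) != len(approx):
        raise ValueError("distributions must have the same number of states")
    total = 0.0
    for p, q in zip(exact, approx):
        if p > 0.0:
            # Clamp q so that a zero in the approximation does not divide by zero.
            total += p * math.log(p / max(q, eps))
    return total

def average_kl(exact_answers, approx_answers):
    """Mean KL divergence over a batch of queries."""
    pairs = list(zip(exact_answers, approx_answers))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Two toy queries: the first is approximated imperfectly, the second exactly.
exact_answers = [[0.7, 0.3], [0.2, 0.5, 0.3]]
approx_answers = [[0.6, 0.4], [0.2, 0.5, 0.3]]
mean_kl = average_kl(exact_answers, approx_answers)
```

A smaller average means the approximate method stays closer to exact inference over the query set.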
Empirical Results [Figure: result plots, with panels grouped as sparse and dense] • C: cardinality of latent variables • When C is large enough, LTM achieves good approximation in all cases. • Better than LBP on g, d, h • Better than CL on d, h. • Key advantage: the online phase is 2 to 3 orders of magnitude faster than exact inference
Part IV: Applications • Approximate Inference in Bayesian networks • Analysis of social survey data • Topic detection • Analysis of medical symptom survey data • Software
Social Survey Data
// Survey on corruption in Hong Kong and performance of the anti-corruption agency -- ICAC
// 31 questions, 1200 samples
C_City: s0 s1 s2 s3 // very common, quite common, uncommon, very uncommon
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3 // totally intolerable, intolerable, tolerable, totally tolerable
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2 // yes, no, depends
LeaveContactInfo: s0 s1 // yes, no
I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
I_Effectiveness: s0 s1 s2 s3 s4 // very e, e, a, in-e, very in-e
I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
…..
-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0
-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0
-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0
….
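A record in the listing above could be read along the following lines. This is only a sketch under two assumptions that the slide does not state: that -1 marks an unanswered question, and that the trailing 1.0 is a record weight. The function name is hypothetical.

```python
def parse_record(line, n_vars=31):
    """Split one data row into state indices (None for -1) and a weight.

    Assumptions (not stated on the slide): -1 means "missing",
    and the last field is a record weight.
    """
    fields = line.split()
    if len(fields) != n_vars + 1:
        raise ValueError(f"expected {n_vars + 1} fields, got {len(fields)}")
    values = [None if f == "-1" else int(f) for f in fields[:n_vars]]
    weight = float(fields[n_vars])
    return values, weight

row = ("-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 "
       "1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0")
values, weight = parse_record(row)
```

Each non-missing entry is the index of a state (s0, s1, ...) of the corresponding question.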
Latent Structure Discovery Y2: Demographic info; Y3: Tolerance toward corruption; Y4: ICAC performance; Y5: Change in level of corruption; Y6: Level of corruption; Y7: ICAC accountability
Multidimensional Clustering Y2=s0: Low income youngsters; Y2=s1: Women with no/low income; Y2=s2: people with good education and good income; Y2=s3: people with poor education and average income.
Multidimensional Clustering • Y3=s0: people who find corruption totally intolerable; 57% • Y3=s1: people who find corruption intolerable; 27% • Y3=s2: people who find corruption tolerable; 15% • Interesting finding: • Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus • Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus • Y3=s0: Same attitude toward C-Gov and C-Bus • People who are tough on corruption are equally tough toward C-Gov and C-Bus. • People who are lenient about corruption are more lenient toward C-Bus than toward C-Gov.
Multidimensional Clustering • Who are the toughest toward corruption among the 4 groups? • Y2=s2 (good education and good income): the least tolerant; 4% tolerable • Y2=s3 (poor education and average income): the most tolerant; 32% tolerable • The other two classes are in between. • Summary: Latent tree analysis of social survey data can reveal • Interesting latent structures • Interesting clusters • Interesting relationships among the clusters.
Part IV: Applications • Approximate Inference • Analysis of social survey data • Topic detection (Analysis of text data) • Analysis of medical symptom survey data • Software
Latent Tree Models for Topic Detection • Basics • Aggregation of miniature topics • Topic extraction and characterization • Empirical results
What is a topic in LTA? LTM for toy text data • Topic: State of a latent variable; a soft collection of documents • Characterized by: Conditional probability of a word given the latent state, or, equivalently, document frequency of the word in the collection: # docs containing the word / total # of docs in the topic • Probabilities of all words for a topic (in a column) do not sum to 1. • Y1=2: OOP; Y1=1: Programming; Y1=0: background • Background topics for other latent variables not shown.
How are topics and documents related? • Topic: A collection of documents • A document is a member of a topic • Can belong to multiple topics with different probabilities • Probabilities for each document (in each row) do not sum to 1. • D97, D115, D205, D528 are documents from the toy text data • Table shows: • D97 is a web page on OOP from U of Wisconsin Madison • D528 is a web page on AI from U of Texas Austin
LTA Differs from Latent Dirichlet Allocation (LDA) • LDA Topic: Distribution over vocabulary • Probability that a writer would use a word when writing about the topic: # of times the word is used / total # of word-occurrences used for the topic • Probabilities for a topic (in a column) sum to 1 • In LDA a document is a mixture of topics (LTA: a topic is a collection of documents) • Probabilities in each row sum to 1
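The two count ratios contrasted above can be illustrated on a toy corpus. This is a sketch only: the documents and membership probabilities are made up, and the "LDA-style" ratio is computed from document-level weights as a stand-in (real LDA assigns individual word occurrences to topics during inference).

```python
from collections import Counter

docs = [
    ["object", "class", "java", "class"],
    ["java", "program", "compile"],
    ["object", "java", "object"],
]
# Hypothetical probability that each document belongs to the topic.
membership = [1.0, 0.5, 1.0]

def lta_word_probs(docs, membership):
    """LTA view: weighted # docs containing the word / weighted # docs in topic."""
    size = sum(membership)
    weighted_df = Counter()
    for words, m in zip(docs, membership):
        for w in set(words):  # a word counts once per document
            weighted_df[w] += m
    return {w: c / size for w, c in weighted_df.items()}

def lda_style_word_probs(docs, membership):
    """LDA-style view: weighted # occurrences of the word / all weighted occurrences."""
    counts = Counter()
    for words, m in zip(docs, membership):
        for w in words:  # every occurrence counts
            counts[w] += m
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

lta = lta_word_probs(docs, membership)
lda = lda_style_word_probs(docs, membership)
```

Here `lda` sums to 1 over the vocabulary, while `lta` does not: every word is scored independently, so "java", which appears in every member document, gets probability 1.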
Latent Tree Models for Topic Detection • Basics • Aggregation of miniature topics • Topic extraction and characterization • Empirical results
Latent Tree Model for a Subset of Newsgroup Data • Latent variables give miniature topics. • Intuitively, more interesting topics can be detected if we combine • Z11, Z12, Z13 • Z14, Z15, Z16 • Z17, Z18, Z19 • The BI algorithm produces flat models: each latent variable is directly connected to at least one observed variable.
Hierarchical Latent Tree Analysis (HLTA) • Convert the latent variables into observed ones via hard assignment. • Afterwards, Z11-Z19 become observed. • Run BI on Z11-Z19
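The hard-assignment step can be sketched as follows, assuming the posterior P(Z | document) of each latent variable has already been computed from the level-1 model. The function name and the toy posteriors are illustrative, not from the tutorial.

```python
def hard_assign(posteriors):
    """Map each document's posterior over the states of Z to its most probable state.

    posteriors: one list per document, giving P(Z = s | document) for each state s.
    Returns one state index per document (ties go to the lowest index).
    """
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]

# Hypothetical posteriors for two latent variables over three documents.
posteriors = {
    "Z11": [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    "Z12": [[0.3, 0.7], [0.3, 0.7], [0.95, 0.05]],
}

# After this step Z11 and Z12 are ordinary observed columns,
# so a flat learner such as BI can be run on them.
observed = {name: hard_assign(rows) for name, rows in posteriors.items()}
```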
Hierarchical Latent Tree Analysis (HLTA) • Stack the model for Z11-Z19 on top of the model for the words • Repeat until no more than 2 latent variables remain or a predetermined level is reached.
Hierarchical Latent Tree Analysis (HLTA) • Recall from Part II: edge orientations cannot be determined based solely on data. • Here the hierarchical structure is introduced to improve model interpretability: data + interpretability → hierarchical structure. • It does not necessarily improve model fit. • The result is called a hierarchical latent tree model (HLTM)
Latent Tree Models for Topic Detection • Basics • Aggregation of miniature topics • Topic extraction and characterization • Empirical results
Semantic Base • Interpreting states of Z21 • Z11, Z12, and Z13 introduced because of co-occurrence of • “computer”, “Science”; • “card”, “display”, …., “video”; and • “dos” , “windows” • Z21 introduced because of correlations among Z11, Z12, Z13 • So, interpretation of the states of Z21 is to be based on the words in the sub-tree rooted at Z21. They form the semantic base of Z21.
Effective Semantic Base • The semantic base might be too large to handle. • Effective base: the subset of the semantic base that matters. • Sort the variables Xi from the semantic base in descending order of I(Z; Xi). • I(Z; X1, …, Xi): mutual information between Z and the first i variables • Estimated via sampling; increases with i. • I(Z; X1, …, Xm): mutual information between Z and all m variables in the semantic base • Information coverage of the first i variables: I(Z; X1, …, Xi) / I(Z; X1, …, Xm) • Effective semantic base: • Set of leading variables with information coverage higher than a certain level, e.g., 95%. (Chen et al. AIJ 2012)
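The selection procedure above can be sketched as follows. This is a minimal illustration that uses a plug-in (empirical-frequency) estimate of mutual information computed from paired samples of Z and the word variables; how those samples are drawn from the model is outside the sketch, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(z, xs):
    """Plug-in estimate of I(Z; X) from paired samples; xs holds hashable tuples."""
    n = len(z)
    pz, px, pzx = Counter(z), Counter(xs), Counter(zip(z, xs))
    return sum((c / n) * math.log(c * n / (pz[a] * px[b]))
               for (a, b), c in pzx.items())

def effective_semantic_base(z, columns, coverage=0.95):
    """Return the leading variables whose information coverage reaches `coverage`.

    columns: dict mapping a variable name to its sampled values (aligned with z).
    """
    # Sort variables by pairwise mutual information with Z, highest first.
    order = sorted(
        columns,
        key=lambda name: mutual_information(z, [(v,) for v in columns[name]]),
        reverse=True,
    )

    def joint(names):
        # One tuple per sample, combining the values of the chosen variables.
        return list(zip(*(columns[name] for name in names)))

    total = mutual_information(z, joint(order))  # I(Z; X1, ..., Xm)
    if total <= 0.0:
        return order
    for i in range(1, len(order) + 1):
        if mutual_information(z, joint(order[:i])) / total >= coverage:
            return order[:i]
    return order

# Toy samples: "windows" determines Z, "dos" is noisier, "the" is uninformative.
z = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "the": [0, 1, 0, 1, 0, 1, 0, 1],
    "dos": [0, 0, 0, 1, 1, 1, 1, 1],
    "windows": [0, 0, 0, 0, 1, 1, 1, 1],
}
base = effective_semantic_base(z, columns)
```

On this toy data a single variable already carries all the information about Z, so the effective base is much smaller than the three-variable semantic base.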
Effective semantic bases are typically smaller than semantic bases. • Z22: Semantic base: 10 variables; effective semantic base: 8 variables • Differences are much larger in models with hundreds of variables. • Words at the front are more informative in distinguishing between the states of the latent variable. [Figure: upper panel, information coverage; lower panel, mutual information]
Topic Characterizations • HLTA characterizes latent states (topics) using probabilities of words from the effective semantic base • Words are sorted NOT by probability, but by mutual information • Topic Z22=s1 is characterized using words that • Occur with high probability in documents on the topic, and • Occur with low probability in documents NOT on the topic. • LDA, HLDA, … • Topics are characterized using the words that occur with highest probability in the topic. • These are not necessarily the best words to distinguish the topic from other topics.
Latent Tree Models for Topic Detection • Basics • Aggregation of miniature topics • Topic extraction and characterization • Empirical results
Empirical Results • Show the results of HLTA on real-world data • Compare HLTA with HLDA and LDA
NIPS Data • 1,740 papers published at NIPS between 1988 and 1999. • Vocabulary: • 1,000 words selected using average TF-IDF. • HLTA produced a model with 382 latent variables, arranged on 5 levels. • Level 1: 279; Level 2: 72; Level 3: 21; Level 4: 8; Level 5: 2 • Example topics on next few slides • Topic characterizations, topic sizes, • Topic groups, topic group labels. • For details: http://www.cse.ust.hk/~lzhang/ltm/index.htm
HLTA Topics: Level-3 • reinforcement markov speech hmm transition 0.20 markov speech speaker hmms hmm 0.14 speech hmm speaker hmms markov 0.13 reinforcement sutton barto policy actions 0.10 reinforcement sutton barto actions policy • cells neurons cortex firing visual 0.17 visual cells cortical cortex activity 0.27 cells cortex cortical activity visual 0.33 neurons neuron synaptic synapses 0.18 membrane potentials spike spikes firing 0.15 firing spike membrane spikes potentials 0.18 circuit voltage circuits vlsi chip 0.26 dynamics dynamical attractor stable attractors • ….. • likelihood bayesian statistical gaussian conditional 0.34 likelihood bayesian statistical conditional 0.16 gaussian covariance variance matrix 0.21 eigenvalues matrix gaussian covariance • trained classification classifier regression classifiers 0.25 validation regression svm machines 0.07 svm machines vapnik regression 0.38 trained test table train testing 0.30 classification classifier classifiers class cl • images image pixel pixels object 0.25 images image pixel pixels texture 0.16 receptive orientation objects object 0.21 object objects perception receptive • hidden propagation layer backpropagation units 0.40 hidden backpropagation multilayer architecture architectures 0.40 propagation layer units back net
HLTA Topics: Level-2 • reinforcement sutton barto actions policy 0.12 transition states reinforcement reward 0.10 reinforcement policy reward states 0.14 trajectory trajectories path adaptive 0.12 actions action control controller agent 0.09 sutton barto td critic moore • markov speech hmm speaker hmms 0.14 markov stochastic hmms sequence hmm 0.10 hmm hmms sequence markov stochastic 0.15 speech language word speaker acoustic 0.06 speech speaker acoustic word language 0.16 delay cycle oscillator frame sound 0.10 frame sound delay oscillator cycle 0.14 strings string length symbol
HLTA Topics: Level-2 • regression validation vapnik svm machines 0.24 regression svm vapnik margin kernel 0.05 svm vapnik margin kernel regression 0.19 validation cross stopping pruning 0.07 machines boosting machine boltzmann • classification classifier classifiers class classes 0.28 classification classifier classifiers class 0.24 discriminant label labels discrimination 0.13 handwritten digit character digits • trained test table train testing 0.38 trained test table train testing 0.44 experiments correct improved improvement correctly • … • likelihood bayesian statistical conditional posterior 0.34 likelihood statistical conditional density 0.35 entropy variables divergence mutual 0.19 probabilistic bayesian prior posterior 0.11 bayesian posterior prior bayes 0.15 mixture mixtures experts latent 0.14 mixture mixtures experts hierarchical 0.34 estimate estimation estimating estimated 0.21 estimate estimation estimates estimated • gaussian covariance matrix variance eigenvalues 0.09 matrix pca gaussian covariance variance 0.23 gaussian covariance variance matrix pca 0.09 pca gaussian matrix covariance variance 0.18 eigenvalues eigenvalue eigenvectors ij 0.15 blind mixing ica coefficients inverse
HLTA Topics: Level-1 • mixture mixtures experts hierarchical latent 0.19 mixture mixtures 0.34 multiple individual missing hierarchical 0.15 hierarchical sparse missing multiple 0.07 experts expert 0.32 weighted sum • estimate estimation estimated estimates estimating 0.38 estimate estimation estimated estimating 0.19 estimate estimates estimation estimated 0.29 estimator true unknown 0.33 sample samples 0.40 assumption assume assumptions assumed 0.27 observations observation observed • … • likelihood statistical conditional density log 0.30 likelihood conditional log em maximum 0.42 statistical statistics 0.19 density densities • entropy variables variable divergence mutual 0.16 entropy divergence mutual 0.31 variables variable • bayesian posterior probabilistic prior bayes 0.19 bayesian prior bayes posterior priors 0.09 bayesian posterior prior priors bayes 0.29 probabilistic distributions probabilities 0.16 inference gibbs sampling generative 0.19 mackay independent averaging ensemble 0.08 belief graphical variational 0.09 monte carlo 0.09 uk ac • Reason for aggregating miniature topics: many Level-1 topics correspond to trivial word co-occurrences and are not meaningful.
HLTA Topics: Level-4 & 5 Level 4 • visual cortex cells neurons firing 0.34 cells cortex firing neurons visual 0.28 cells neurons cortex firing visual 0.41 approximation gradient optimization 0.29 algorithms optimal approximation 0.39 likelihood bayesian statistical gaussian • images image trained hidden pixel 0.22 regression classification classifier 0.29 trained classification classifier classifiers 0.02 classification classifier regression 0.28 learn learned structure feature features 0.23 feature features structure learn learned 0.24 images image pixel pixels object 0.13 reinforcement transition markov speech 0.14 speech hmm markov transition 0.40 hidden propagation layer backpropagation units Level 5 • visual cortex cells neurons firing 0.37 visual cortex firing neurons cells 0.39 visual cells firing cortex neurons 0.25 images image pixel hidden trained 0.09 hidden trained images image pixel 0.20 trained hidden images image pixel 0.15 image images pixel trained hidden
Summary of HLTA Results on NIPS Data • Level 1: 279 latent variables • Many capture trivial word co-occurrence patterns • Level 2: 72 latent variables • Meaningful topics, and meaningful topic groups • Level 3: 21 latent variables • Meaningful topics, and meaningful topic groups • More general than Level 2 topics • Level 4: 8 latent variables • Meaningful topics, very general • Level 5: 2 latent variables • Too few • In applications, one can choose to output the topics at a certain level according to the desired number of topics. • For NIPS data, either level-2 topics or level-3 topics.
HLDA Topics • control optimal algorithms approximation step policy action reinforcement states actions experts mixture em expert gaussian convergence gradient batch descent means control controller nonlinear series forward distance tangent vectors euclidean distances robot reinforcement position control path bias variance regression learner exploration blocks block length basic experiment td evaluation features temporal expert path reward light stimuli paths Long hmms recurrent matrix term channel call cell channels rl • image images recognition pixel feature video motion visual speech recognition face images faces recognition facial ocular dominance orientation cortical cortex character characters pca coding field resolution false true detection context • …. units hidden layer unit weight • gaussian log density likelihood estimate margin kernel support xi bound generalization student weight teacher optimal gaussian bayesian kernel evidence posterior chip analog circuit neuron voltage classifier rbf class classifiers classification speech recognition hmm context word ica independent separation source sources image images matching level object tree trees node nodes boosting variables variable bayesian conditional family face strategy differential functional weighting source grammar sequences polynomial regression derivative em machine annealing max min • regression prediction selection criterion query validation obs generalization cross pruning mlp risk classifier classification confidence loss song transfer bounds wt principal curve eq curves rules
LDA Topics units unit hidden connections connected hmm markov probabilities hidden hybrid object objects recognition view shape robot environment goal grid world entropy natural statistical log statistics experts expert gating architecture jordan trajectory arm inverse trajectories hand sequence step sequences length s gaussian density covariance densities positive negative instance instances np target detection targets false normal activity active module modules brain mixture likelihood em log maximum channel stage channels call routing term long scale factor range … inputs outputs trained produce actual dynamics dynamical stable attractor synaptic synapses inhibitory excitatory correlation power correlations cross states stochastic transition dynamic basis rbf radial gaussian centers solution constraints solutions constraint type elements group groups element edge light intensity edges contour recurrent language string symbol strings propagation back rumelhart bp hinton ii region regions iii chain graph matching annealing match context mlp letter nn letters fig eq proposed fast proc variables variable belief conditional i pp vol ca eds ieee
Comparisons between HLTA and HLDA HLDA Topics • gaussian log density likelihood estimate margin kernel support xi bound generalization student weight teacher optimal gaussian bayesian kernel evidence posterior chip analog circuit neuron voltage classifier rbf class classifiers classification speech recognition hmm context word • control optimal algorithms approximation step policy action reinforcement states actions experts mixture em expert gaussian convergence gradient batch descent means control controller nonlinear series forward distance tangent vectors euclidean distances robot reinforcement position control path bias variance regression learner exploration blocks block length basic experiment HLTA Topics • likelihood bayesian statistical conditional posterior 0.34 likelihood statistical conditional density 0.35 entropy variables divergence mutual 0.19 probabilistic bayesian prior posterior 0.11 bayesian posterior prior bayes 0.15 mixture mixtures experts latent 0.14 mixture mixtures experts hierarchical • reinforcement sutton barto actions policy 0.12 transition states reinforcement reward 0.10 reinforcement policy reward states 0.14 trajectory trajectories path adaptive 0.12 actions action control controller agent 0.09 sutton barto td critic moore • HLTA topics have sizes, HLDA/LDA topics do not • HLTA produces a better hierarchy • HLTA gives better topic characterizations
Measure of Topic Quality • Suppose a topic t is described using M words • The topic coherence score for t is defined in Mimno et al. (2011), cited below. • Idea • The words for a topic would tend to co-occur. • Given a list of words, the more often the words co-occur, the better the list is as a definition of a topic. • Note: • Score decreases with M. • Topics to be compared should be described using the same number of words. • Reference: D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272, 2011.
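For reference, the coherence score of Mimno et al. (2011) for a topic with word list v_1, …, v_M is the sum, over all pairs l < m, of log[(D(v_m, v_l) + 1) / D(v_l)], where D(v) is the number of documents containing v and D(v, v') the number containing both. The sketch below computes it on a toy corpus; the function name and the documents are illustrative, and it assumes every listed word occurs in at least one document.

```python
import math

def topic_coherence(top_words, documents):
    """Coherence of a word list: sum over pairs l < m of
    log((D(v_m, v_l) + 1) / D(v_l)), with D counting documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score

documents = [
    ["speech", "hmm", "speaker"],
    ["speech", "hmm"],
    ["image", "pixel"],
    ["image", "pixel", "speech"],
]
coherent = topic_coherence(["speech", "hmm"], documents)
mixed = topic_coherence(["speech", "pixel"], documents)
```

Word lists whose members rarely share documents receive a lower (more negative) score, which is why `mixed` comes out below `coherent` here.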
HLTA Found More Coherent Topics than LDA and HLDA • HLTA (L3-L4): All non-background topics from Levels 3 and 4: 47 • HLTA (L2-L3-L4): All non-background topics from Levels 2, 3 and 4: 140 • LDA was instructed to find two sets of topics, with 47 and 140 topics • HLDA found more: 179. • HLDA-s: A subset of the HLDA topics was sampled for fair comparison.
Comparisons in Terms of Model Fit • Regard LDA, HLDA and HLTA as methods for text modeling • Build a probabilistic model for the corpus • Evaluation: • Per-document held-out log-likelihood (-log(perplexity)). • Measures performance of the model in predicting unseen data • Data: • NIPS: 1,740 papers from NIPS, 1,000 words. • JACM: 536 abstracts from J of ACM, 1,809 words. • NEWSGROUP: 20,000 newsgroup posts, 1,000 words.
HLTA results are robust w.r.t. the UD-test threshold • The values 1, 3, 5 are from the literature on Bayes factors (see Part III) • LDA produced by far the worst models in all cases. • HLTA outperformed HLDA on NIPS, tied on JACM, and was beaten on Newsgroup • Caution: A better model does not imply better topics • Running time on NIPS: • LDA: 3.6 hours, HLTA: 17 hours, HLDA: 68 hours.
Summary • LDA, HLDA • Topic: Distribution over vocabulary • Topics don't have sizes • Characterization: Words that occur with high probability in the topic • Document: A mixture of topics • HLTA • Topic: Collection of documents • Topics have sizes • Characterization: Words that occur with high probability in the topic and low probability in other documents • Document: A member of a topic; can belong to multiple topics with probability 1. • HLTA produces a better hierarchy than HLDA • HLTA produces more coherent topics than LDA and HLDA
Word Selection for HLTA • HLTA detects word co-occurrence patterns, and partitions documents based on those patterns. • Whether the co-occurrence patterns are meaningful to the user depends on what words are selected for analysis. • If only stop words are used, no method can find meaningful topics • If a mixture of “meaningful” and “not meaningful” words is used, how can they be separated? • For the NIPS data, 1,000 words were selected using average TF-IDF, and meaningful co-occurrence patterns were detected. • This might not work on other data sets. (For UAI 2014 abstracts, words were selected manually based on DF: auai.org/uai2014/submissionStats.shtml ) • Is it possible to identify thematically meaningful words based on frequencies alone? If not, other factors need to be considered.
Part IV: Applications • Approximate Inference in Bayesian networks • Analysis of social survey data • Topic detection • Analysis of medical symptom survey data • Software
Background of Research • Common practice in China, and increasingly in the Western world: • Patients with a Western medicine (WM) disease are divided into several traditional Chinese medicine (TCM) classes • Different classes are treated differently using TCM treatments. • Example: • WM disease: Depression • TCM classes: • Liver-Qi Stagnation (肝气郁结). Treatment principle: soothe the liver and relieve stagnation (疏肝解郁). Prescription: Chaihu Shugan San (柴胡疏肝散) • Deficiency of Liver Yin and Kidney Yin (肝肾阴虚). Treatment principle: nourish the kidney and the liver (滋肾养肝). Prescription: Xiaoyao San combined with Liuwei Dihuang Wan (逍遥散合六味地黄丸) • Vacuity of both heart and spleen (心脾两虚). Treatment principle: replenish qi and strengthen the spleen (益气健脾). Prescription: Guipi Tang (归脾汤) • ….
Key Question • How should patients with a WM disease be divided into subclasses from the TCM perspective? • What are the TCM classes? • What are the characteristics of each TCM class? • How can different TCM classes be differentiated? • Important for • Clinical practice • Research • Randomized controlled trials for efficacy • Modern biomedical understanding of TCM concepts • No consensus: different doctors/researchers use different schemes. This is a key weakness of TCM.
Key Idea • Our objective: • Provide an evidence-based method for TCM patient classification • Key idea • Cluster analysis of symptom data => empirical partition of patients • Check whether the partition corresponds to a TCM class concept • Key technology: Multidimensional clustering • Motivation for developing latent tree analysis