An Introduction to LDA Tools Kuan-Yu Chen Institute of Information Science, Academia Sinica
References • D. M. Blei et al., “Latent Dirichlet allocation,” Journal of Machine Learning Research, 3, pp. 993–1022, January 2003. • D. Blei and J. Lafferty, “Topic models,” in A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications. Taylor and Francis, 2009. • T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, 42, pp. 177–196, 2001. • T. Griffiths and M. Steyvers, “Finding scientific topics,” in Proc. of the National Academy of Sciences, 2004. • X. Wei and W. B. Croft, “LDA-based document models for ad-hoc retrieval,” in Proc. of ACM SIGIR, 2006.
Outline • A Brief Review of Mixture Models • Unigram Model • Mixture of Unigrams • Probabilistic Latent Semantic Analysis • Latent Dirichlet Allocation • LDA Tools • GibbsLDA++ • VB-EM source code from Blei • Examples
Unigram Model & Mixture of Unigrams • Unigram model • Under the unigram model, the words of every document are drawn independently from a single multinomial distribution: $p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$ • Mixture of unigrams • Under this mixture model, each document is generated by first choosing a topic $z$ and then generating $N$ words independently from the conditional multinomial $p(w \mid z)$: $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
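The difference between the two models is easiest to see by computing a document likelihood under each. The following is a minimal Python sketch, not taken from the slides: the function names, the toy vocabulary, and the probability values are all illustrative assumptions.

```python
import math

def unigram_log_likelihood(doc, word_probs):
    """log p(w) = sum_n log p(w_n): every word comes from one multinomial."""
    return sum(math.log(word_probs[w]) for w in doc)

def mixture_log_likelihood(doc, topic_priors, topic_word_probs):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z): one topic per document."""
    per_topic = [
        math.log(p_z) + unigram_log_likelihood(doc, word_probs)
        for p_z, word_probs in zip(topic_priors, topic_word_probs)
    ]
    # Log-sum-exp keeps the sum over topics numerically stable.
    largest = max(per_topic)
    return largest + math.log(sum(math.exp(x - largest) for x in per_topic))

# Hypothetical toy data: two topics over a three-word vocabulary.
doc = ["goal", "match", "goal"]
sports = {"goal": 0.5, "match": 0.4, "vote": 0.1}
politics = {"goal": 0.1, "match": 0.1, "vote": 0.8}
```

With a single topic of prior probability 1, the mixture reduces to the unigram model, which is a quick sanity check on the two formulas.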
Probabilistic Latent Semantic Analysis • Probabilistic latent semantic analysis (PLSA/PLSI) • The PLSA model attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from only one topic: $p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$ • $p(z \mid d)$ serves as the mixture weights of the topics for a particular document $d$
Latent Dirichlet Allocation • The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words • LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$: • Choose $N \sim \mathrm{Poisson}(\xi)$ • Choose $\theta \sim \mathrm{Dir}(\alpha)$ • For each of the $N$ words $w_n$: • Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$ • Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
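The generative process above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions: all function names are hypothetical, words and topics are represented as integer indices, the Dirichlet draw uses normalized Gamma variates, and the Poisson draw uses Knuth's simple method, which is only suitable for small means.

```python
import math
import random

def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for small means."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(rng, probs):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(rng, alpha, beta, mean_length):
    """Follow the slide: N ~ Poisson, theta ~ Dir(alpha), then z_n and w_n."""
    n_words = sample_poisson(rng, mean_length)
    theta = sample_dirichlet(rng, alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(rng, theta)    # topic for this word position
        w = sample_categorical(rng, beta[z])  # word drawn from row z of beta
        topics.append(z)
        words.append(w)
    return theta, topics, words
```

Here `beta` is a list of k rows, each a distribution over the V vocabulary entries, matching the topic-by-word matrix used in the rest of the deck.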
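Latent Dirichlet Allocation • Several simplifying assumptions are made: • The dimensionality $k$ of the Dirichlet distribution (and thus the number of topics) is assumed known and fixed • The word probabilities are parameterized by a $k \times V$ matrix $\beta$, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$, which we treat as a fixed quantity that is to be estimated • The Poisson assumption on document length is not critical; more realistic document length distributions can be used as needed • Note that the document length $N$ is independent of all the other data generating variables ($\theta$ and $\mathbf{z}$)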
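Latent Dirichlet Allocation • Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by: $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$ • Integrating over $\theta$ and summing over $\mathbf{z}$, we obtain the marginal distribution of a document: $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$ • Taking the product of the marginal probabilities of the $M$ documents, we obtain the probability of a corpus: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$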
Latent Dirichlet Allocation • The key inferential problem is that of computing the posterior distribution of the hidden variables given a document: $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$ • Unfortunately, this distribution is intractable to compute in general • Although exact inference on the posterior is intractable, a wide variety of approximate inference algorithms can be considered for LDA
Latent Dirichlet Allocation - VBEM • The basic idea of convexity-based variational inference is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood • A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed
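Latent Dirichlet Allocation - VBEM • This family is characterized by the following variational distribution: $q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$ • The desideratum of finding a tight lower bound on the log likelihood translates directly into the following optimization problem: $(\gamma^{*}, \phi^{*}) = \arg\min_{(\gamma, \phi)} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$ • That is, the tightest bound is found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior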
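The link between the lower bound and the KL divergence can be written out explicitly. The following is a sketch in the notation of Blei et al. (2003), where $q$ is the variational distribution with free parameters $\gamma$ and $\phi$, $L$ denotes the lower bound, and $\Psi$ is the digamma function:

```latex
\begin{align*}
\log p(\mathbf{w} \mid \alpha, \beta)
  &= \log \int \sum_{\mathbf{z}} q(\theta, \mathbf{z} \mid \gamma, \phi)\,
     \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}
          {q(\theta, \mathbf{z} \mid \gamma, \phi)} \, d\theta \\
  &\ge \mathrm{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
     - \mathrm{E}_q\!\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]
   \;=\; L(\gamma, \phi; \alpha, \beta) \\[4pt]
\log p(\mathbf{w} \mid \alpha, \beta)
  &= L(\gamma, \phi; \alpha, \beta)
   + D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\,
           p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big) \\[4pt]
\phi_{ni} &\propto \beta_{i w_n} \exp\{\Psi(\gamma_i)\}, \qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
\end{align*}
```

Because the left-hand side of the second identity does not depend on $\gamma$ or $\phi$, maximizing the bound $L$ is the same as minimizing the KL divergence. The last line gives the coordinate updates that are iterated per document until convergence.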
GibbsLDA++ • GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference • The main page of GibbsLDA++ is: http://gibbslda.sourceforge.net/ • We can download this tool from: http://sourceforge.net/projects/gibbslda/ • It needs to be compiled in a Linux/Cygwin environment
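To make the Gibbs sampling idea concrete, the following is a small Python sketch of collapsed Gibbs sampling for LDA in the style of Griffiths and Steyvers (2004). It is not the GibbsLDA++ source code: the function name, the argument names, and the use of symmetric scalar `alpha` and `beta` are assumptions made for illustration, and documents are assumed to be lists of integer word IDs.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi, assignments)."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # counts n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # counts n_{k,w}
    topic_total = [0] * n_topics                              # counts n_k

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial topic assignment for every word position.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, z_doc):
            update(d, w, k, +1)
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it from the
                # conditional distribution given all other assignments.
                update(d, w, assignments[d][i], -1)
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k_new
                update(d, w, k_new, +1)

    # Point estimates of the document-topic and topic-word distributions.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
             for counts, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
           for k in range(n_topics)]
    return theta, phi, assignments
```

The returned `theta` and `phi` play the same roles as the document-by-topic and topic-by-word matrices that the tool writes out after estimation.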
GibbsLDA++ • Extract “GibbsLDA++-0.2.tar.gz” • Run Cygwin • Switch the current directory to “/GibbsLDA++-0.2” • Execute the commands “make clean” and then “make all” • We then have an executable file “lda.exe” in the “/GibbsLDA++-0.2/src” directory
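An Example of GibbsLDA++ • Format of the training corpus: the first line gives the total number of documents; each following line is one document, written as a space-separated list of words (word1 word2 …) • In the sample below, 2265 is the total number of documents, the second line is Doc1, the third line is Doc2, and so on
2265
40889 44022 10092 2471 9800 ….
31677 653 657 17998 1788 …...
1521 15820 3015 48825 2690 …..
42763 7680 38280 2913 42763 …..
42763 2997 732 42472 3844 …..
2572 1583 2584 44400 3015 …..
. . .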
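A corpus in this layout can be produced with a short script. The sketch below assumes the documents are already tokenized into lists of strings (or word IDs); the function name is hypothetical.

```python
def format_gibbslda_corpus(docs):
    """First line: number of documents; then one space-separated document per line."""
    lines = [str(len(docs))]
    lines.extend(" ".join(str(token) for token in doc) for doc in docs)
    return "\n".join(lines) + "\n"

# Hypothetical two-document corpus using word IDs as tokens.
sample_text = format_gibbslda_corpus([["40889", "44022", "10092"], ["31677", "653"]])
```

The resulting string can be written to a text file and passed to the tool as the training data file.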
An Example of GibbsLDA++ • LDA Parameter Estimation • Command: lda.exe -est -dfile Gibbs_TDT2_Text.txt -alpha 6.25 -beta 0.1 -ntopics 8 -niters 2000 • Parameter Settings • -est: estimate the LDA model from scratch • -dfile: the input training data • -alpha: the Dirichlet hyper-parameter on the per-document topic distributions • -beta: the Dirichlet hyper-parameter on the per-topic word distributions • -ntopics: the number of latent topics • -niters: the number of Gibbs sampling iterations
An Example of GibbsLDA++ • The outputs of Gibbs sampling estimation in GibbsLDA++ include the following files: • model.others: This file contains some parameters of the LDA model • model.phi: This file contains the word-topic distributions, p(word | topic) (topic-by-word matrix) • model.theta: This file contains the topic-document distributions, p(topic | document) (document-by-topic matrix) • model.tassign: This file contains the topic assignments for the words in the training data • wordmap.txt: This file contains the mapping between words and their integer IDs
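Once the model files are written, the matrices can be inspected with a few lines of Python. This sketch assumes that the .theta and .phi files are plain text with one matrix row per line and whitespace-separated probabilities; that layout, and the function names, are assumptions to check against the actual output files.

```python
def parse_matrix(lines):
    """Parse whitespace-separated floats, one matrix row per non-empty line."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]

def dominant_topics(theta):
    """Index of the most probable topic for each document row of theta."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

def top_word_ids(phi_row, n=10):
    """IDs of the n most probable words in one topic row of phi."""
    return sorted(range(len(phi_row)), key=lambda w: phi_row[w], reverse=True)[:n]
```

For example, `parse_matrix(open("model.theta"))` would load the document-by-topic matrix under the layout assumed here, and the word IDs returned by `top_word_ids` can be translated back to words through the word map file.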
VB-EM source code from Blei • Blei implemented Latent Dirichlet Allocation (LDA) using VB-EM for parameter estimation and inference • The main page of the source code is: http://www.cs.princeton.edu/~blei/lda-c/index.html • We can download this tool from: http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz • It needs to be compiled in a Linux/Cygwin environment
VB-EM source code from Blei • Extract “lda-c-dist.tgz” • Run Cygwin • Switch the current directory to “/lda-c-dist” • Execute the command “make” • We then have an executable file “lda.exe” in the “/lda-c-dist” directory
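An Example of LDA • Format of the training corpus: each line is one document; it starts with the number of unique words in the document, followed by word-id:count pairs, where count is the number of times that word appears in the document • In the sample below, the first line is Doc1 (77 unique words), the second line is Doc2 (72 unique words), and so on
77 508:1 596:3 612:2 709:1 713:1 …..
72 508:2 596:5 597:1 653:1 657:3 …..
88 457:1 508:1 572:2 596:6 795:1 …..
62 457:1 508:1 596:2 657:1 732:1 …..
53 336:4 341:1 457:1 596:1 657:1 …..
. . .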
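Converting a tokenized document into this sparse count format is mechanical. The sketch below assumes each document is a list of integer word IDs; the function name is hypothetical, and sorting the pairs by word ID is a choice made here to mirror the sample rather than a stated requirement of the tool.

```python
from collections import Counter

def to_ldac_line(word_ids):
    """Format one document as '<unique words> <id>:<count> <id>:<count> ...'."""
    counts = Counter(word_ids)
    pairs = " ".join(f"{wid}:{counts[wid]}" for wid in sorted(counts))
    return f"{len(counts)} {pairs}".rstrip()

# Hypothetical document with three distinct word IDs.
example_line = to_ldac_line([508, 596, 596, 612, 596, 612])
```

Writing one such line per document yields a training file in the layout described above.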
An Example of LDA • LDA Parameter Estimation • The input format can be expressed as: lda.exe est [alpha] [k] [settings] [data] [initialization] [directory] • [alpha]: The hyper-parameter of LDA • [k]: The number of latent topics • [settings]: The settings file • [data]: The input training data • [initialization]: Specifies how the topics will be initialized • [directory]: The output directory • Command: lda.exe est 6.25 8 ./settings.txt Blei_TDT2_Text.txt random ./
An Example of LDA • The settings file contains several experimental settings: • var max iter: The maximum number of variational inference iterations for a single document • var convergence: The convergence criterion for variational inference • em max iter: The maximum number of VB-EM iterations • em convergence: The convergence criterion for VB-EM • alpha: Set to “fixed” or “estimate” • Example settings file:
var max iter 20
var convergence 1e-6
em max iter 100
em convergence 1e-4
alpha estimate
An Example of LDA • The saved models are in three files: • <iteration>.other: This file contains alpha and some other statistical information about the LDA model • <iteration>.beta: This file contains the log of the topic distributions over words (topic-by-word matrix) • <iteration>.gamma: This file contains the variational posterior Dirichlet parameters of each document (document-by-topic matrix)
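Both matrices need a small transformation before they can be read as probabilities: the .beta values are logs, and each .gamma row holds Dirichlet parameters rather than proportions. The sketch below assumes one whitespace-separated row per line in each file; the function names are hypothetical. It uses the fact that the mean of a Dirichlet distribution with parameters γ is γ_i divided by the sum of all γ_j.

```python
import math

def read_rows(lines):
    """Parse whitespace-separated floats, one row per non-empty line."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]

def topic_proportions(gamma_row):
    """Expected topic mixture of a document: gamma_i / sum_j gamma_j."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]

def top_word_ids(log_beta_row, n=10):
    """IDs of the n words with the highest log probability in one topic."""
    order = sorted(range(len(log_beta_row)),
                   key=lambda w: log_beta_row[w], reverse=True)
    return order[:n]

def word_probabilities(log_beta_row):
    """Undo the log to recover p(word | topic) for one topic."""
    return [math.exp(v) for v in log_beta_row]
```

Applying `topic_proportions` to every row of the .gamma file gives a document-by-topic matrix that is directly comparable to the theta output of a Gibbs sampler.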