Topic Model: Latent Dirichlet Allocation
Ouyang Ruofei, May 10, 2013
Introduction
data = latent pattern + noise
Parameters describe the latent pattern; inference is the task of recovering them from the noisy data.
Introduction
Parametric model: the number of parameters is fixed w.r.t. sample size.
Nonparametric model: the number of parameters grows with sample size; the parameter space is infinite-dimensional.
Clustering
1. Ironman  2. Thor  3. Hulk
Each data point gets an indicator variable naming its cluster.
Dirichlet process
Observed so far: Ironman 3 times, Thor 2 times, Hulk 2 times.
Even without the likelihood, we know:
1. There are three clusters.
2. The empirical distribution over the three clusters.
These counts tell us what to expect of a new data point.
Dirichlet process
Example: Dirichlet distribution Dir(Ironman, Thor, Hulk) with parameters (a1, a2, a3):
pdf: p(θ1, θ2, θ3) ∝ θ1^(a1-1) · θ2^(a2-1) · θ3^(a3-1), with θ1 + θ2 + θ3 = 1
mean: E[θk] = ak / (a1 + a2 + a3)
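The mean formula above can be checked with a small sketch. This is an illustrative snippet (function and variable names are my own, not from the slides): it draws one Dirichlet sample via normalised Gamma draws and computes the analytic mean ak / Σa.

```python
import random

rng = random.Random(0)

def sample_dirichlet(alphas):
    """Draw one sample from Dir(alphas) by normalising independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

alphas = [3.0, 2.0, 2.0]                   # pseudo counts: Ironman, Thor, Hulk
theta = sample_dirichlet(alphas)           # one random probability vector
mean = [a / sum(alphas) for a in alphas]   # analytic mean: a_k / sum(a)
```

Each sampled `theta` is a valid probability vector over the three clusters, and the analytic mean for Ironman is 3/7.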
Dirichlet process
Conjugate prior: the Dirichlet distribution is conjugate to the multinomial.
Multinomial likelihood: observed counts (n1, ..., nK)
Dirichlet prior: Dir(a1, ..., aK)
Posterior: Dir(a1 + n1, ..., aK + nK)
Example: the prior parameters act as pseudo counts added to the observed counts.
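The pseudo-count view of conjugacy is one line of code. A minimal sketch (helper name is mine, not from the slides):

```python
def dirichlet_posterior(prior, counts):
    """Dirichlet-multinomial conjugacy: observed counts simply
    add to the prior's pseudo counts."""
    return [a + n for a, n in zip(prior, counts)]

# Dir(1, 1, 1) prior over (Ironman, Thor, Hulk), then 3/2/2 sightings:
posterior = dirichlet_posterior([1.0, 1.0, 1.0], [3, 2, 2])
print(posterior)  # [4.0, 3.0, 3.0]
```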
Dirichlet process
In our Avengers model, K = 3 (Ironman, Thor, Hulk).
However, a new character can always show up, and a Dirichlet distribution with fixed K cannot model him.
Dirichlet process: K = infinity.
"Nonparametric" here means an infinite number of clusters.
Dirichlet process
G ~ DP(α, G0)
α: concentration parameter (pseudo counts in each cluster)
G0: base distribution of each cluster (a distribution template)
Given any partition (A1, ..., Ak) of the space: (G(A1), ..., G(Ak)) ~ Dir(αG0(A1), ..., αG0(Ak))
A Dirichlet process is a distribution over distributions.
Dirichlet process
Construct the Dirichlet process by the Chinese restaurant process (CRP):
In a restaurant there are an infinite number of tables.
Customer 1 sits at an unoccupied table with p = 1.
Customer N sits at occupied table k with p = nk / (N - 1 + α), and at a new table with p = α / (N - 1 + α), where nk is the number of customers already at table k.
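The seating rule above can be simulated directly. A minimal sketch (function name and seed are my own choices): each customer joins table k with probability proportional to nk, or opens a new table with probability proportional to α.

```python
import random

def crp(n_customers, alpha, rng):
    """Seat customers one by one: occupied table k with prob n_k / (i + alpha),
    a new table with prob alpha / (i + alpha), where i customers are seated."""
    tables = []                        # tables[k] = number of customers at table k
    for i in range(n_customers):
        weights = tables + [alpha]     # existing tables plus the new-table option
        r = rng.uniform(0.0, i + alpha)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)           # open a new table
        else:
            tables[k] += 1
    return tables

seating = crp(100, alpha=1.0, rng=random.Random(42))
```

With α = 1, a hundred customers typically end up on a handful of tables: a "rich get richer" clustering with no K fixed in advance.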
Dirichlet process
Customers : data points
Tables : clusters
Dirichlet process
Train the model by Gibbs sampling.
Gibbs sampling
Gibbs sampling is an MCMC method for obtaining a sequence of samples from a multivariate distribution.
The intuition is to turn one multivariate problem into a sequence of univariate problems.
In the Dirichlet process:
Multivariate: p(z1, ..., zN), the joint over all cluster assignments
Univariate: p(zi | z-i), one assignment given all the others
Gibbs sampling
Gibbs sampling pseudo code: sweep over the variables, resampling each one from its conditional given all the others, and repeat.
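The pseudo code can be sketched as a generic sweep. This is an illustrative skeleton, not the slides' exact code; the toy conditional `agree` is a made-up example in which each variable prefers to match the majority of the others.

```python
import random

def gibbs_sweeps(z_init, sample_conditional, n_sweeps, rng):
    """Generic Gibbs sampler: repeatedly resample each coordinate z[i]
    from its conditional p(z_i | z_-i), holding the others fixed."""
    z = list(z_init)
    samples = []
    for _ in range(n_sweeps):
        for i in range(len(z)):
            z[i] = sample_conditional(i, z, rng)
        samples.append(list(z))
    return samples

def agree(i, z, rng):
    """Toy binary conditional: z_i follows a smoothed majority vote of z_-i."""
    others = z[:i] + z[i + 1:]
    p_one = (sum(others) + 1.0) / (len(others) + 2.0)
    return 1 if rng.random() < p_one else 0

chain = gibbs_sweeps([0, 1, 0, 1], agree, n_sweeps=100, rng=random.Random(7))
```

In the Dirichlet process the conditional would instead be the CRP seating probability for zi given the other assignments.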
Topic model
A document is a mixture of topics.
The topics are latent variables, but we can read the words: topics generate words.
Topic model
zij: the topic of observed word xij.
The sampler keeps two counts, both computed from all the other topics and other words (excluding the current word):
word/topic count: how often each word is assigned to each topic
topic/doc count: how often each topic occurs in each document
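The two count tables determine the conditional for one word's topic. A sketch under the usual collapsed-Gibbs form (the function name and toy counts are mine; alpha and beta are the Dirichlet smoothing parameters discussed later in the deck):

```python
def topic_conditional(v, d, n_wk, n_k, n_dk, alpha, beta, W):
    """p(z = k | all other assignments) for word v in document d,
    using counts that exclude the word itself:
        p(k) ∝ (n_wk[v][k] + beta) / (n_k[k] + W*beta) * (n_dk[d][k] + alpha)
    """
    K = len(n_k)
    weights = [(n_wk[v][k] + beta) / (n_k[k] + W * beta) * (n_dk[d][k] + alpha)
               for k in range(K)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical counts for a 2-topic, 3-word-vocabulary model:
n_wk = [[2, 0], [1, 1], [0, 3]]   # word/topic counts
n_k  = [3, 4]                     # total words assigned to each topic
n_dk = [[2, 1]]                   # topic/doc counts for document 0
probs = topic_conditional(0, 0, n_wk, n_k, n_dk, alpha=0.5, beta=0.1, W=3)
```

Here word 0 has been seen only under topic 0, and topic 0 dominates the document, so the conditional strongly favours topic 0.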
Topic model
Apply the Dirichlet process in the topic model:
Learn the distribution of topics in a document.
Learn the distribution of topics for a word.
Topic model
The sampler maintains two count tables: a topic/doc table and a word/topic table.
Topic model
Latent Dirichlet allocation: each word in a document draws its own topic from the document's topic distribution.
Dirichlet mixture model: each document draws a single topic shared by all of its words.
LDA example
Documents:
d1: ipad apple itunes
d2: apple mirror queen
d3: queen joker ladygaga
d4: queen ladygaga mirror
Vocabulary w: ipad apple itunes mirror queen joker ladygaga
Topics (in fact, the topics are latent):
t1: product
t2: story
t3: poker
LDA example
Start with a random topic assignment for every word (topic number in parentheses):
d1: ipad(1) apple(2) itunes(3)
d2: apple(2) mirror(1) queen(2)
d3: queen(3) joker(3) ladygaga(1)
d4: queen(2) ladygaga(1) mirror(2)
One Gibbs step: remove the current assignment of one word (here, queen in d3), so d3 becomes joker(3) ladygaga(1) queen(?), then resample its topic from the conditional given all the other assignments:
d3: joker(3) ladygaga(1) queen(2)
Sweeping over all words this way, the assignments gradually settle into coherent topics.
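The whole procedure fits in a short collapsed Gibbs sampler on the toy corpus. A sketch under assumed hyperparameters (alpha = beta = 0.1 and the iteration count are my choices, not from the slides; topic labels are arbitrary):

```python
import random

docs = [["ipad", "apple", "itunes"],
        ["apple", "mirror", "queen"],
        ["queen", "joker", "ladygaga"],
        ["queen", "ladygaga", "mirror"]]
K, alpha, beta = 3, 0.1, 0.1
vocab = sorted({w for doc in docs for w in doc})
W = len(vocab)
wid = {w: i for i, w in enumerate(vocab)}

rng = random.Random(1)
z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random init

# Build the word/topic, topic/doc, and per-topic count tables.
n_wk = [[0] * K for _ in range(W)]
n_dk = [[0] * K for _ in docs]
n_k = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_wk[wid[w]][k] += 1; n_dk[d][k] += 1; n_k[k] += 1

for _ in range(200):                       # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, v = z[d][i], wid[w]
            # Remove this word's current assignment from the counts.
            n_wk[v][k] -= 1; n_dk[d][k] -= 1; n_k[k] -= 1
            # Resample its topic from the collapsed conditional.
            probs = [(n_wk[v][t] + beta) / (n_k[t] + W * beta) * (n_dk[d][t] + alpha)
                     for t in range(K)]
            r, acc = rng.uniform(0.0, sum(probs)), 0.0
            for t, p in enumerate(probs):
                acc += p
                if r <= acc:
                    break
            z[d][i] = t
            n_wk[v][t] += 1; n_dk[d][t] += 1; n_k[t] += 1
```

After the sweeps, words that co-occur (ipad/apple/itunes vs. queen/ladygaga/mirror) tend to share a topic, mirroring the hand-worked example above.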
Further
Dirichlet distribution prior: K topics, fixed in advance (supervised choice of K).
Dirichlet process prior: an infinite number of topics (unsupervised).
Alpha mainly controls the probability of a topic with little training data in a document.
Beta mainly controls the probability of a topic with little training data for a word.
Further
The bag-of-words assumption is unrealistic: models such as TNG (topical n-grams) and biLDA relax it.
LDA loses power-law behavior: the Pitman-Yor language model addresses this.
David Blei has done an extensive survey on topic models: http://home.etf.rs/~bfurlan/publications/SURVEY-1.pdf
Q&A