640 likes | 764 Views
Unsupervised and Weakly-Supervised Probabilistic Modeling of Text. Ivan Titov April 23 2010. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A A A A A A A A A. Outline. Topic Models: An Example of Latent Dirichlet Allocation
E N D
Unsupervised and Weakly-Supervised Probabilistic Modeling of Text Ivan Titov April 23 2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAAAAAA
Outline • Topic Models: An Example of Latent Dirichlet Allocation • Learning : Expectation Maximization Algorithms • Learning: Gibbs sampling
Problem set-up • Given: a collection of documents • Want: • Detect key ‘topics’ discussed in the collection • For each document detect which ‘topics’ are discussed there • Requirements: • No supervision (documents are not labeled) • Probabilistic methods
Motivation • Visualization of collections: • what are the topics discussed? • which documents discuss a topic? • Opinion mining: • what is sentiment towards a product aspects? • what are important aspects of a product? • Dimensionality reduction: • for information retrieval • for document classification • Summarization: • ensuring topic coverage in a summary • . . .
Latent Semantic Analysis [Deerwester et al., 1990] • Decomposition of the coccurence matrix • Approximate the co-occurence matrix Hope: terms having common meaning are mapped to the same direction - Hope: documents discussing similar topics have similar representation - Non-zero inner products between documents with non-overlapping terms
Latent Semantic Analysis [Deerwester et al., 1990] • Optimal rank k approximation (Frobenius norm):
Latent Semantic Analysis [Deerwester et al., 1990] • Not motivated probabilistically (no clean underlying probability model) • No obvious interpretation of directions
Probabilistic LSA [Hofmann, 99] • Parameters: • Distributions of topics in document P(z | d), for every d • Distribution of words for every topic, P(w | z), z 2 {1, …K} • Generative story: • For each document d • For each word occurrence i in document d • Select topic zi for the word from P(zi | d) • Generate word wi from P(wi | zi) • Note: • Words in the same documents can be generated from different topics
Probabilistic LSA : Example … Delaysdue to the volcanic ashcloud will affect Formula1 teams’ preparations … Document 1 Generative story: Given parameters generate text … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to the volcanic ashcloud will affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to the volcanic ashcloud will affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to the volcanic ashcloud will affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ashcloud will affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanicashcloud will affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ashcloud will affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwill affect Formula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendtheceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendthe ceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendthe ceremonydue to delays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendthe ceremonydue todelays caused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendthe ceremonydue todelayscaused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
Probabilistic LSA : Example … Delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations … Document 1 … Obamawill not attendthe ceremonydue todelayscaused by eruption … Document 2 Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
PLSA: Example • Note: • Words in the same documents can be generated from different topics • Does not take into account order, the following to texts are guaranteed to have the same probability under the model delaysdue to thevolcanic ash cloudwillaffectFormula1 teams’ preparations to ash due Formula1affect volcanic the willteams’ cloud preparations delays
We considered: Document 1 ? Document 2 ? Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party …
In fact we are solving reverse problem: … Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations … Document 1 ? We are going to talk about learning in a moment … Obama will not attend the ceremony due to delays caused by eruption … Document 2 ? Sport P(w | z = 2) Eruption P(w | z = 1) Politics P(w | z = 3) football teams ball preparations Formula1 … delays volcanic volcano ash cloud … 0.004 0.002 0.002 0.001 0.001 … 0.005 0.001 0.001 0.0001 0.0001 … 0.003 0.002 0.001 0.001 0.001 … Obama Merkel ceremony attend party … ? ? ?
Directed Graphical Models Doc 1 Doc M … … … NM N1 - Roughly, arrows denote conditional dependencies - A generative story corresponds to a topological order on the graph
Plate Notation Doc 1 Doc M … … … NM N1 N M
Plate Notation Generated from P(z | d) N M Generate from P(w | z)
If they are variables, maybe we can generate them from some variables? Prior distribution for word distributions N K M Prior distribution for topic distributions in documents This is essentially the Latent Dirichlet Allocation (LDA) model (Blei et al, 2001): Hierarchical Bayesian generalization of PLSA
Latent Dirichlet Allocation • Parameters: two scalars: and • Generative story: • For draw word distrib-s • For each document d • Draw topic distribs • For each word occurrence i in document d • Select topic zi for the word from • Generate word wi from We will talk about semantics of these parameters later Dirichlet distribution: think of it as a “distribution of over distributions”
Dirichlet distribution • Defines a distribution over p1,..,pK-1 such as: • - can be regarded as a distribution • “Distribution over distributions” (K-1 dimensional simplex) • Form of the Dirichlet distribution (with symmetric prior) Normalization (Beta function)
Meaning of hyperparameter • Consider case K= 2 (i.e., Beta distribution), pdf: ® = 0.5 : prefers “biased” distributions Often, we want sparse distributions: though the collection may represent 1,000 topics, each document discusses 2-3 topics. In this case we use small ® Can be regarded as “pseudo counts”: ®=2 means that you observed 1 example of each class ® = 2 : prefers smooth distributions
Summary so far • We defined a generative model of document collections: • Now, first we will consider how to do MAP estimation, i.e. • Then we will consider Bayesian methods Expectation maximization algorithm
Expectation Maximization • EM is a class of algorithms that is used to estimate parameters in the presence of missing attributes. • Non-convex optimization, therefore: • converges to a local maximum of the maximum likelihood function (or posterior distribution if we incorporate the prior distribution). • can be very sensitive to the starting point In LDA we do not observe from each topic z a word is generated
Three coin example • We observe a series of coin tosses generated in the following way: • A person has three coins. • Coin 0: probability of Head is ® • Coin 1: probability of Head p • Coin 2: probability of Head q • Consider the following coin-tossing scenario
Three coin example • Scenario: • Toss coin 0 (do not show it to anyone!). • If Head – toss coin 1 M times; • Else -- toss coin 2 M times. • Only the series of tosses are observed • HHHT, HTHT, HHHT, HTTH • What are the parameters of the coins ? (®, p, q)
Three coin example • Scenario: • Toss coin 0 (do not show it to anyone!). • If Head – toss coin 1 M times; • Else -- toss coin 2 M times. • Only the series of tosses are observed • HHHT, HTHT, HHHT, HTTH • What are the parameters of the coins ? (®, p, q) • There is no closed form solution to the problem
Key Intuition • If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin1 and which from Coin2, that would be trivial.
Key Intuition • If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin1 and which from Coin2, there was no problem. • Instead, use an iterative approach for estimating the parameters: • Guess the probability that a data point came from Coin 1/2 • Generate fictional labels, weighted according to this probability. • Re-estimate the initial parameter setting: set them to maximize the likelihood of these augmented data. • This process can be iterated and can be shown to converge to a local maximum of the likelihood function
EM-algorithm: coins (E-step) • We will assume (for a minute) that we know the parameters and use it to estimate which Coin it is • Then, we will use the estimation for the tossed Coin, to estimate the most likely parameters and so on... • What is the probability that the ith data point came from Coin1 ?
EM-algorithm: coins (M-step) • At this point we would like to compute the likelihood of the data, and find the parameters which maximize it • We will maximize the likelihood of the data (n data points) • But one of the variables: the coin name is hidden: • Instead on each step we maximize expectation of the likelihood over the coin name:
EM-algorithm: coins (M-step) • Continue:
EM-algorithm: coins • Now fine the most likely parameters by setting derivatives to zero: Note that are fixed (computed from the previous estimate of parameters)