Style & Topic Language Model Adaptation Using HMM-LDA Bo-June (Paul) Hsu, James Glass
Outline • Introduction • LDA • HMM-LDA • Experiments • Conclusions
Introduction • An effective LM needs not only to account for the casual speaking style of lectures but also to accommodate the topic-specific vocabulary of the subject matter • Available training corpora rarely match the target lecture in both style and topic • In this paper, syntactic state and semantic topic assignments are investigated using the HMM-LDA model
LDA • A generative probabilistic model of a corpus • The topic mixture is drawn from a conjugate Dirichlet prior • PLSA • LDA • Model parameters (the slide's formulas are sketched below)
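The formulas that accompanied the PLSA/LDA comparison did not survive extraction. A minimal sketch of the standard forms, assuming the usual notation (topics $z$, topic mixture $\theta$, Dirichlet hyperparameters $\alpha$, $\beta$):

```latex
% PLSA: the topic mixture p(z|d) is a separate parameter for each document
p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)

% LDA: the topic mixture \theta is drawn from a conjugate Dirichlet prior
p(\mathbf{w} \mid \alpha, \beta) =
  \int p(\theta \mid \alpha) \prod_{i=1}^{N} \sum_{z_i} p(z_i \mid \theta)\, p(w_i \mid z_i, \beta)\, d\theta
```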
Markov chain Monte Carlo • A class of algorithms for sampling from probability distributions, based on constructing a Markov chain that has the desired distribution as its stationary distribution • The most common application of these algorithms is numerically calculating multi-dimensional integrals • An ensemble of "walkers" moves around randomly • The Markov chain is constructed so as to have the integrand as its equilibrium distribution
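A toy illustration of the idea above (not from the paper): a random-walk Metropolis sampler whose equilibrium distribution is a chosen target density, used to estimate an expectation numerically. The target, step size, and sample count are arbitrary choices for the example.

```python
import numpy as np

def metropolis_expectation(log_target, f, x0=0.0, steps=50_000, step=0.5, seed=0):
    """Estimate E[f(x)] under the (unnormalized) target density via MCMC."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(steps):
        proposal = x + step * rng.standard_normal()   # the "walker" moves randomly
        # accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return np.mean([f(s) for s in samples])

# Example: E[x^2] under a standard normal distribution (exact value: 1.0)
print(metropolis_expectation(lambda x: -0.5 * x * x, lambda x: x * x))
```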
LDA • Estimate the posterior over topic assignments, $p(\mathbf{z} \mid \mathbf{w})$ • Integrate out the parameters $\theta$ and $\phi$ • Gibbs sampling (conditional sketched below)
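The conditional that the Gibbs sampler draws from was stripped during extraction; the standard collapsed form (Griffiths & Steyvers notation, with $n_{-i}$ denoting counts that exclude token $i$, $W$ the vocabulary size, and $T$ the number of topics) is:

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}\;
  \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```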
Markov chain Monte Carlo (cont.) • Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling
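A minimal collapsed Gibbs sampler for LDA, as a hedged sketch rather than the authors' implementation; the corpus format (a list of word-id lists), hyperparameter values, and function name are assumptions made for the example.

```python
import numpy as np

def lda_gibbs(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndt = np.zeros((D, T))   # document-topic counts
    ntw = np.zeros((T, V))   # topic-word counts
    nt = np.zeros(T)         # total words assigned to each topic
    z = []                   # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # p(z_i = t | z_-i, w), up to a normalizing constant
                p = (ntw[:, w] + beta) / (nt + V * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    return z, ndt, ntw
```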
HMM+LDA • HMMs generate documents purely based on syntactic relations among unobserved word classes • Short-range dependencies • Topic models generate documents based on semantic correlations between words, independent of word order • Long-range dependencies • A major advantage of generative models is modularity • Different models are easily combined • Words can be generated by a mixture of models or a product of models • Only a subset of words, the content words, exhibit long-range dependencies • Replace one of the probability distributions over words used in the syntactic model with the semantic (topic) model
HMM+LDA (cont.) • Notation: • A sequence of words $\mathbf{w} = (w_1, \ldots, w_n)$ • A sequence of topic assignments $\mathbf{z} = (z_1, \ldots, z_n)$ • A sequence of classes $\mathbf{c} = (c_1, \ldots, c_n)$ • $c_i = 1$ denotes the semantic class • The $z$th topic is associated with a distribution over words $\phi^{(z)}$ • Each class $c \neq 1$ is associated with a distribution over words $\phi^{(c)}$ • Each document $d$ has a distribution over topics $\theta^{(d)}$ • Transitions between classes $c_{i-1}$ and $c_i$ follow a distribution $\pi^{(c_{i-1})}$
HMM+LDA (cont.) • A document is generated as follows: • Sample $\theta^{(d)}$ from a Dirichlet($\alpha$) prior • For each word $w_i$ in document $d$: • Draw $z_i$ from $\theta^{(d)}$ • Draw $c_i$ from $\pi^{(c_{i-1})}$ • If $c_i = 1$, then draw $w_i$ from $\phi^{(z_i)}$, else draw $w_i$ from $\phi^{(c_i)}$
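A small sketch of this generative process, assuming the parameters have already been drawn; class 0 plays the role of the semantic class here, and all names and shapes are choices made for the example.

```python
import numpy as np

def generate_document(n_words, theta, pi, phi_topic, phi_class, rng):
    """theta: topic mixture for the document, shape (T,)
       pi: class transition matrix, shape (C, C); class 0 is the semantic class
       phi_topic: per-topic word distributions, shape (T, V)
       phi_class: per-class word distributions, shape (C, V)"""
    words, c_prev = [], 0
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)           # draw z_i from theta^(d)
        c = rng.choice(pi.shape[1], p=pi[c_prev])     # draw c_i from pi^(c_{i-1})
        if c == 0:                                    # semantic class: topic model emits the word
            w = rng.choice(phi_topic.shape[1], p=phi_topic[z])
        else:                                         # syntactic class: HMM state emits the word
            w = rng.choice(phi_class.shape[1], p=phi_class[c])
        words.append(w)
        c_prev = c
    return words
```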
HMM+LDA (cont.) • Inference • $\phi^{(z)}$ are drawn from a Dirichlet($\beta$) prior • $\theta^{(d)}$ are drawn from a Dirichlet($\alpha$) prior • The rows of the transition matrix $\pi$ are drawn from a Dirichlet($\gamma$) prior • $\phi^{(c)}$ are drawn from a Dirichlet($\delta$) prior • All Dirichlet distributions are assumed to be symmetric
HMM+LDA (cont.) • Gibbs Sampling
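The sampling equations on this slide were lost in extraction. A hedged reconstruction of the topic conditional, following the Griffiths et al. formulation that HMM-LDA is based on ($n_{-i}$ counts exclude token $i$; $T$ topics, $W$ vocabulary words; class 1 is the semantic class):

```latex
P(z_i \mid \mathbf{z}_{-i}, \mathbf{c}, \mathbf{w}) \;\propto\;
  \frac{n^{(d_i)}_{-i,z_i} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
  \times
  \begin{cases}
    \dfrac{n^{(w_i)}_{-i,z_i} + \beta}{n^{(\cdot)}_{-i,z_i} + W\beta} & c_i = 1 \\[2ex]
    1 & c_i \neq 1
  \end{cases}
```

The class conditional $P(c_i \mid \mathbf{c}_{-i}, \mathbf{z}, \mathbf{w})$ similarly combines the emission probability of $w_i$ under class $c_i$ (or under topic $z_i$ when $c_i = 1$) with the transition counts into and out of class $c_i$.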
HMM-LDA Analysis • Lectures corpus • 3 undergraduate subjects in math, physics, and computer science • 10 CS lectures for the development set, 10 CS lectures for the test set • Textbook corpus • The CS course textbook • Divided into 271 topic-cohesive documents at every section heading • Run the Gibbs sampler on the two datasets • Lectures: 2,800 iterations, Textbook: 2,000 iterations • Use the lowest-perplexity model as the final model
HMM-LDA Analysis (cont.) • Semantic topics (Lectures): examples include Magnetism, Machine learning, Childhood memories, and Linear algebra • <laugh>: a cursory examination of the data suggests that speakers talking about children tend to laugh more during the lecture • Although it may not be desirable to capture speaker idiosyncrasies in the topic mixtures, HMM-LDA has clearly demonstrated its ability to capture distinctive semantic topics in a corpus
HMM-LDA Analysis (cont.) • Semantic topics (Textbook) • In a topically coherent paragraph, 6 of the 7 instances of the words "and" and "or" (underlined in the original slide) are correctly classified • Multi-word topic key phrases can be identified for n-gram topic models • This demonstrates the context-dependent labeling ability of the HMM-LDA model
HMM-LDA Analysis (cont.) • Syntactic states (Lectures): state 20 is the topic state; other states correspond to prepositions, conjunctions, verbs, and hesitation disfluencies • As demonstrated on spontaneous speech, HMM-LDA yields syntactic states that correspond well to part-of-speech labels, without requiring any labeled training data
Discussions • Although MCMC techniques converge to the global stationary distribution, convergence cannot be guaranteed from observation of the perplexity alone • Unlike EM algorithms, random sampling may temporarily decrease the model likelihood • The number of iterations was chosen to be at least double the point at which the perplexity first appeared to converge
Language Modeling Experiments • Baseline model: interpolated Lecture + Textbook trigram model (using modified Kneser-Ney discounting) • Topic-deemphasized style (trigram) model (Lectures): deemphasize the observed occurrences of topic words and, ideally, redistribute these counts to all potential topic words • The counts of topic-to-style word transitions are not altered
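A hedged sketch of the count adjustment described above: n-grams ending in a word that HMM-LDA labeled as a topic word have their counts scaled down, while transitions from topic words into style words are left alone. The scale factor and data structures are assumptions, not values from the paper.

```python
def deemphasize_topic_counts(ngram_counts, topic_words, scale=0.1):
    """ngram_counts: dict mapping word tuples to counts
       topic_words: set of words labeled as topic (semantic-class) words"""
    adjusted = {}
    for ngram, count in ngram_counts.items():
        # shrink counts of n-grams that end in a topic word; n-grams ending
        # in a style word (including topic-to-style transitions) are unchanged
        adjusted[ngram] = count * scale if ngram[-1] in topic_words else count
    return adjusted
```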
Language Modeling Experiments (cont.) • The Textbook model should ideally receive higher weight in contexts containing topic words • Domain trigram model (Textbook): emphasize sequences that contain a topic word in the context by doubling their counts
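The complementary adjustment for the domain model, again as a hedged sketch: n-grams whose context (everything but the final word) contains a topic-labeled word get their counts doubled, as the slide describes.

```python
def emphasize_topic_contexts(ngram_counts, topic_words):
    """Double the count of any n-gram with a topic word in its context."""
    adjusted = {}
    for ngram, count in ngram_counts.items():
        context = ngram[:-1]
        adjusted[ngram] = count * 2 if any(w in topic_words for w in context) else count
    return adjusted
```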
Language Modeling Experiments (cont.) • Unsmoothed topical trigram model: apply HMM-LDA with 100 topics to identify representative words and their associated contexts for each topic • Topic mixtures for all models • Mixture weights were tuned on the individual target lectures (a cheating condition) • 15 of the 100 topics account for over 90% of the total weight
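One plausible way to gather the per-topic counts behind such a model, sketched under the assumption that each token carries an HMM-LDA class and topic label; the function and variable names are illustrative only.

```python
from collections import defaultdict

def topic_trigram_counts(tokens, topics, is_semantic):
    """tokens: word sequence; topics: topic id per token;
       is_semantic: True where HMM-LDA assigned the token to the semantic class"""
    counts = defaultdict(lambda: defaultdict(int))   # topic -> trigram -> count
    for i in range(2, len(tokens)):
        if is_semantic[i]:
            trigram = (tokens[i - 2], tokens[i - 1], tokens[i])
            counts[topics[i]][trigram] += 1
    return counts
```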
Language Modeling Experiments (cont.) • Since the topic distribution shifts over a long lecture, modeling a lecture with fixed weights may not be optimal • Instead, update the mixture distribution by linearly interpolating it with the posterior topic distribution given the current word
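A minimal sketch of that update rule; the interpolation weight `lam` is an assumption, not a value reported in the paper.

```python
import numpy as np

def update_mixture(weights, word_topic_probs, lam=0.95):
    """weights: current topic mixture, shape (T,), sums to 1
       word_topic_probs: p(word | topic) for the observed word, shape (T,)"""
    posterior = weights * word_topic_probs
    posterior /= posterior.sum()                        # posterior topic distribution given the word
    new_weights = lam * weights + (1.0 - lam) * posterior
    return new_weights / new_weights.sum()              # renormalize for numerical safety
```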
Language Modeling Experiments (cont.) • The variation of topic mixtures over a lecture: a review of the previous lecture -> an example of computation using accumulators -> a focus on streams as a data structure, with an intervening example that finds pairs i and j that sum to a prime
Language Modeling Experiments (cont.) • Experimental results
Conclusions • HMM-LDA shows great promise for finding structure in unlabeled data, from which more sophisticated models can be built • Speaker-specific adaptation will be investigated in future work