BAYESIAN LEARNING OF N-GRAM STATISTICAL LANGUAGE MODELING Shuanhu Bai and Haizhou Li Institute for Infocomm Research, Republic of Singapore
Outline • Introduction • N-gram Model • Bayesian Learning • QB Estimation for Incremental Learning • Continuous N-gram Model • Bayesian Learning • QB Estimation for Incremental Learning • Experimental Results • Conclusions
Introduction • Even with ample training data, n-gram language models are still far from optimal • Studies show that they are extremely sensitive to changes in style, topic or genre • LM adaptation aims at bridging the mismatch between the models and the test domain • A typical n-gram LM is trained under the maximum likelihood estimation (MLE) criterion
Introduction (cont.) • One typical adaptation technique, deleted interpolation, combines the flat but reliable general model (the baseline model) with the sharp but volatile domain-specific model (see the sketch below) • In this paper, we study the Bayesian learning formulation for n-gram LM adaptation • Under the Bayesian learning framework, an incremental adaptation procedure is also proposed for dynamic updating of cache-based n-grams
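A minimal sketch of the deleted-interpolation combination, assuming a single interpolation weight \lambda (the notation is illustrative, not taken from the paper):

P_{adapt}(w \mid h) = \lambda\, P_{domain}(w \mid h) + (1 - \lambda)\, P_{general}(w \mid h), \qquad 0 \le \lambda \le 1.

In practice \lambda is typically tuned on held-out data from the target domain.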
N-gram Model • N-gram model • The quality of a given n-gram LM on a corpus D of size T is commonly assessed by its log-likelihood on D • Unigram & bigram estimates
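For reference, the standard quantities these bullets point to, written in conventional notation (an assumption, not copied from the slides): the log-likelihood of D = w_1 \ldots w_T under an n-gram model, and the ML unigram and bigram estimates,

\log P(D) = \sum_{t=1}^{T} \log P(w_t \mid w_{t-n+1}^{\,t-1}), \qquad
P(w_i) = \frac{c(w_i)}{T}, \qquad
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}.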
N-gram Model (cont.) • MLE • Smoothing • Backoff • cache
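A hedged sketch of the smoothing devices listed above: Katz-style backoff to the unigram, and a cache model interpolated with the static model (the discount d, backoff weight \alpha(\cdot), cache weight \lambda_c and cache size |cache| are assumed notation),

P_{BO}(w_i \mid w_{i-1}) =
\begin{cases}
d\, \dfrac{c(w_{i-1} w_i)}{c(w_{i-1})} & c(w_{i-1} w_i) > 0 \\
\alpha(w_{i-1})\, P(w_i) & \text{otherwise}
\end{cases}
\qquad
P(w_i \mid h) = \lambda_c\, \frac{c_{cache}(w_i)}{|cache|} + (1 - \lambda_c)\, P_{static}(w_i \mid w_{i-1}).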
Bayesian Learning for N-gram Model • Dirichlet prior (conjugate to the multinomial) • The probability of generating a text corpus is obtained by integrating over the parameter space • MAP estimate
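A hedged reconstruction of the standard Dirichlet/MAP formulation for one bigram row \theta_h = (\theta_{h,1}, \ldots, \theta_{h,I}) with hyperparameters \alpha_{h,i} (notation assumed):

g(\theta_h) \propto \prod_{i=1}^{I} \theta_{h,i}^{\alpha_{h,i}-1}, \qquad
P(D) = \int P(D \mid \theta)\, g(\theta)\, d\theta, \qquad
\hat{\theta}_{h,i}^{\,MAP} = \frac{c(h\, w_i) + \alpha_{h,i} - 1}{c(h) + \sum_{j=1}^{I} \alpha_{h,j} - I}.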
QB Estimation for Incremental Learning for N-gram Model • It is of practical use to devise an incremental learning mechanism that adapts both the parameters and the prior knowledge over time • Sub-corpora Dn = {D1, D2, …, Dn} • The updating of parameters iterates between reproducible prior and posterior estimates
ML • MAP • QB
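A hypothetical Python sketch of the quasi-Bayes recursion these estimates compare: because the Dirichlet prior is conjugate, each sub-corpus D_n simply adds its bigram counts to the hyperparameters, and the MAP formula is then reused with the updated prior. The class and parameter names (QBBigram, alpha0) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

class QBBigram:
    def __init__(self, vocab, alpha0=1.1):
        self.vocab = list(vocab)
        self.I = len(self.vocab)
        # Dirichlet hyperparameters alpha[h][w]; a flat initialisation > 1
        # keeps the MAP mode well defined (illustrative assumption).
        self.alpha = defaultdict(lambda: defaultdict(lambda: alpha0))

    def update(self, sub_corpus):
        """Quasi-Bayes step: absorb one sub-corpus D_n by adding its bigram
        counts to the hyperparameters (the Dirichlet posterior is again
        Dirichlet, so it serves as the prior for the next sub-corpus)."""
        for h, w in zip(sub_corpus, sub_corpus[1:]):
            self.alpha[h][w] += 1

    def prob(self, w, h):
        """MAP bigram estimate P(w | h) under the current hyperparameters."""
        a_hw = self.alpha[h][w]
        a_h = sum(self.alpha[h][v] for v in self.vocab)
        return (a_hw - 1.0) / (a_h - self.I)


# Usage: feed sub-corpora as they arrive and query the adapted model.
lm = QBBigram(vocab={"the", "stock", "market", "rose"})
lm.update(["the", "stock", "market", "rose"])
print(lm.prob("stock", "the"))
```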
Continuous N-gram Model • The continuous n-gram model is also called the aggregate Markov model • We introduce a hidden variable z that ranges over Z "soft" word classes • Z = 1 reduces to the unigram model; Z = I recovers the bigram model • The continuous bigram model has two obvious advantages over the discrete bigram: • Parameters: I × I → 2 × I × Z • EM can be applied to estimate the parameters under the MLE criterion
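A reconstruction of the aggregate (class-mixture) bigram referenced above, in standard notation (assumed, not copied from the slides): the prediction is mediated by the hidden class z,

P(w_t \mid w_{t-1}) = \sum_{z=1}^{Z} P(z \mid w_{t-1})\, P(w_t \mid z),

so the model is parameterised by the two I × Z matrices P(z \mid w) and P(w \mid z) rather than the full I × I bigram table.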
Continuous N-gram Model (cont.) • Parameters
Bayesian Learning for Continuous N-gram Model • Prior • After the EM algorithm, the re-estimate can be interpreted as a smoothing between the known priors and the current observations (the cache corpus)
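A hedged sketch of the MAP M-step behind this smoothing interpretation, where \gamma_t(z) is the E-step posterior of class z for the t-th bigram event and \beta_{z,w} the Dirichlet hyperparameter on P(w \mid z) (symbols are assumptions):

\hat{P}(w \mid z) = \frac{\sum_{t:\, w_t = w} \gamma_t(z) + \beta_{z,w} - 1}{\sum_{t} \gamma_t(z) + \sum_{w'} \bigl(\beta_{z,w'} - 1\bigr)}.

The hyperparameters act as pseudo-counts from the prior (the general corpus), weighed against the expected counts from the current observations (the cache corpus).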
QB Estimation for Incremental Learning for Continuous N-gram Model • Updating of the parameters • Initial parameters
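The corresponding quasi-Bayes recursion, sketched under the same assumed notation: after processing sub-corpus D_n the hyperparameters absorb its expected counts and serve as the prior for D_{n+1}, with \beta^{(0)} set from the baseline training corpus,

\beta_{z,w}^{(n)} = \beta_{z,w}^{(n-1)} + \sum_{t \in D_n:\, w_t = w} \gamma_t(z).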
Experimental Results • Corpus • A: 60 million words from LDC98T30, finance and business • B: 20 million words from LDC98T30, sports and fashion, for incremental training • C: A+B, for adaptation • D: 20 million words in the same domains as C (open test set) • Vocabulary: 50,000 words from A + B
Conclusions • We propose a Bayesian learning approach to n-gram language modeling • It provides an interpretation of language model smoothing and adaptation as a weighting between prior knowledge and current observations • The Dirichlet conjugate prior leads not only to a batch adaptation procedure but also to a quasi-Bayes incremental learning strategy for on-line language modeling