BAYESIAN LEARNING OF N-GRAM STATISTICAL LANGUAGE MODELING Shuanhu Bai and Haizhou Li Institute for Infocomm Research, Republic of Singapore
Outline • Introduction • N-gram Model • Bayesian Learning • QB Estimation for Incremental Learning • Continuous N-gram Model • Bayesian Learning • QB Estimation for Incremental Learning • Experimental Results • Conclusions
Introduction • Even with ample training data, n-gram language models are still far from optimal • Studies show that they are extremely sensitive to changes in style, topic or genre • LM adaptation aims at bridging the mismatch between the models and the test domain • A typical n-gram LM is trained under the maximum likelihood estimation (MLE) criterion
Introduction (cont.) • One typical adaptation technique, deleted interpolation, combines the flat but reliable general model (the baseline model) with the sharp but volatile domain-specific model (see the sketch below) • In this paper, we study the Bayesian learning formulation for n-gram LM adaptation • Under the Bayesian learning framework, an incremental adaptation procedure is also proposed for dynamic updating of cache-based n-grams
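A minimal sketch of the deleted-interpolation combination, assuming a single interpolation weight \lambda (the notation is illustrative, not taken from the paper):

P_{adapt}(w \mid h) = \lambda\, P_{domain}(w \mid h) + (1 - \lambda)\, P_{general}(w \mid h), \qquad 0 \le \lambda \le 1.

In practice \lambda is typically tuned on held-out data from the target domain.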
N-gram Model • N-gram model • The quality of a given n-gram LM on a corpus D of size T is commonly assessed by its log-likelihood on D • Unigram & bigram estimates
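For reference, the standard quantities these bullets point to, written in conventional notation (an assumption, not copied from the slides): the log-likelihood of D = w_1 \ldots w_T under an n-gram model, and the ML unigram and bigram estimates,

\log P(D) = \sum_{t=1}^{T} \log P(w_t \mid w_{t-n+1}^{\,t-1}), \qquad
P(w_i) = \frac{c(w_i)}{T}, \qquad
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}.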
N-gram Model (cont.) • MLE • Smoothing • Backoff • cache
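A hedged sketch of the smoothing devices listed above: Katz-style backoff to the unigram, and a cache model interpolated with the static model (the discount d, backoff weight \alpha(\cdot), cache weight \lambda_c and cache size |cache| are assumed notation),

P_{BO}(w_i \mid w_{i-1}) =
\begin{cases}
d\, \dfrac{c(w_{i-1} w_i)}{c(w_{i-1})} & c(w_{i-1} w_i) > 0 \\
\alpha(w_{i-1})\, P(w_i) & \text{otherwise}
\end{cases}
\qquad
P(w_i \mid h) = \lambda_c\, \frac{c_{cache}(w_i)}{|cache|} + (1 - \lambda_c)\, P_{static}(w_i \mid w_{i-1}).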
Bayesian Learning for N-gram Model • Dirichlet prior (conjugate to the multinomial) • The probability of generating a text corpus is obtained by integrating over the parameter space • MAP estimate
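A hedged reconstruction of the standard Dirichlet/MAP formulation for one bigram row \theta_h = (\theta_{h,1}, \ldots, \theta_{h,I}) with hyperparameters \alpha_{h,i} (notation assumed):

g(\theta_h) \propto \prod_{i=1}^{I} \theta_{h,i}^{\alpha_{h,i}-1}, \qquad
P(D) = \int P(D \mid \theta)\, g(\theta)\, d\theta, \qquad
\hat{\theta}_{h,i}^{\,MAP} = \frac{c(h\, w_i) + \alpha_{h,i} - 1}{c(h) + \sum_{j=1}^{I} \alpha_{h,j} - I}.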
QB Estimation for Incremental Learning for N-gram Model • It is of practical use to devise an incremental learning mechanism that adapts both the parameters and the prior knowledge over time • Sub-corpora Dn = {D1, D2, …, Dn} • The updating of parameters iterates between reproducible prior and posterior estimates
ML • MAP • QB
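A hypothetical Python sketch of the quasi-Bayes recursion these estimates compare: because the Dirichlet prior is conjugate, each sub-corpus D_n simply adds its bigram counts to the hyperparameters, and the MAP formula is then reused with the updated prior. The class and parameter names (QBBigram, alpha0) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

class QBBigram:
    def __init__(self, vocab, alpha0=1.1):
        self.vocab = list(vocab)
        self.I = len(self.vocab)
        # Dirichlet hyperparameters alpha[h][w]; a flat initialisation > 1
        # keeps the MAP mode well defined (illustrative assumption).
        self.alpha = defaultdict(lambda: defaultdict(lambda: alpha0))

    def update(self, sub_corpus):
        """Quasi-Bayes step: absorb one sub-corpus D_n by adding its bigram
        counts to the hyperparameters (the Dirichlet posterior is again
        Dirichlet, so it serves as the prior for the next sub-corpus)."""
        for h, w in zip(sub_corpus, sub_corpus[1:]):
            self.alpha[h][w] += 1

    def prob(self, w, h):
        """MAP bigram estimate P(w | h) under the current hyperparameters."""
        a_hw = self.alpha[h][w]
        a_h = sum(self.alpha[h][v] for v in self.vocab)
        return (a_hw - 1.0) / (a_h - self.I)


# Usage: feed sub-corpora as they arrive and query the adapted model.
lm = QBBigram(vocab={"the", "stock", "market", "rose"})
lm.update(["the", "stock", "market", "rose"])
print(lm.prob("stock", "the"))
```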
Continuous N-gram Model • The continuous n-gram model is also called the aggregate Markov model • We introduce a hidden variable z that ranges over Z "soft" word classes • Z = 1 reduces to the unigram model; Z = I recovers the bigram model • The continuous bigram model has two obvious advantages over the discrete bigram: • Parameters: I × I → 2 × I × Z • EM can be applied to estimate the parameters under the MLE criterion
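A reconstruction of the aggregate (class-mixture) bigram referenced above, in standard notation (assumed, not copied from the slides): the prediction is mediated by the hidden class z,

P(w_t \mid w_{t-1}) = \sum_{z=1}^{Z} P(z \mid w_{t-1})\, P(w_t \mid z),

so the model is parameterised by the two I × Z matrices P(z \mid w) and P(w \mid z) rather than the full I × I bigram table.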
Continuous N-gram Model (cont.) • Parameters
Bayesian Learning for Continuous N-gram Model • Prior • After the EM algorithm, the re-estimate can be interpreted as a smoothing between the known priors and the current observations (the cache corpus)
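A hedged sketch of the MAP M-step behind this smoothing interpretation, where \gamma_t(z) is the E-step posterior of class z for the t-th bigram event and \beta_{z,w} the Dirichlet hyperparameter on P(w \mid z) (symbols are assumptions):

\hat{P}(w \mid z) = \frac{\sum_{t:\, w_t = w} \gamma_t(z) + \beta_{z,w} - 1}{\sum_{t} \gamma_t(z) + \sum_{w'} \bigl(\beta_{z,w'} - 1\bigr)}.

The hyperparameters act as pseudo-counts from the prior (the general corpus), weighed against the expected counts from the current observations (the cache corpus).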
QB Estimation for Incremental Learning for Continuous N-gram Model • Updating of the parameters • Initial parameters
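The corresponding quasi-Bayes recursion, sketched under the same assumed notation: after processing sub-corpus D_n the hyperparameters absorb its expected counts and serve as the prior for D_{n+1}, with \beta^{(0)} set from the baseline training corpus,

\beta_{z,w}^{(n)} = \beta_{z,w}^{(n-1)} + \sum_{t \in D_n:\, w_t = w} \gamma_t(z).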
Experimental Results • Corpus • A: 60 million words from LDC98T30, finance and business • B: 20 million words from LDC98T30, sports and fashion, for incremental training • C: A+B, for adaptation • D: 20 million words in the same domains as C (open test set) • Vocabulary: 50,000 words from A + B
Conclusions • We propose a Bayesian learning approach to n-gram language modeling • It provides an interpretation of language model smoothing and adaptation as a weighting between prior knowledge and current observations • The Dirichlet conjugate prior leads not only to a batch adaptation procedure but also to a quasi-Bayes incremental learning strategy for on-line language modeling