Statistical Language Modeling for Speech Recognition and Information Retrieval Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University
Outline • What is Statistical Language Modeling • Statistical Language Modeling for Speech Recognition, Information Retrieval, and Document Summarization • Categorization of Statistical Language Models • Main Issues for Statistical Language Models • Conclusions
What is Statistical Language Modeling ? • Statistical language modeling (LM) aims to capture the regularities in human natural language and to quantify the acceptability of a given word sequence Adopted from Joshua Goodman’s public presentation file
What is Statistical LM Used for ? • It has continuously been a focus of active research in the speech and language community over the past three decades • It has also been introduced to information retrieval (IR) problems, providing an effective and theoretically attractive probabilistic framework for building IR systems • Other application domains • Machine translation • Input method editor (IME) • Optical character recognition (OCR) • Bioinformatics • etc.
Statistical LM for Speech Recognition • Speech recognition: finding the word sequence, out of the many possible word sequences, that has the maximum posterior probability given the input speech utterance • W* = argmax_W P(W|X) = argmax_W P(X|W) P(W), where P(W) is the language modeling component and P(X|W) is the acoustic modeling component
Statistical LM for Information Retrieval • Information retrieval (IR): identifying information items or “documents” within a large collection that best match (are most relevant to) a “query” provided by a user to describe the user’s information need • Query-likelihood retrieval model: a query is considered to be generated from a “relevant” document that satisfies the information need • Estimate the likelihood of each document in the collection being the relevant document and rank the documents accordingly, i.e., by P(D|Q) ∝ P(Q|D) P(D): the query likelihood times the document prior probability (a minimal ranking sketch is given below)
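A minimal sketch of query-likelihood ranking with unigram document models smoothed by a collection model. The toy documents, the smoothing weight lam=0.8, and the function names are illustrative assumptions, not part of the original slides; the slides describe the general n-gram case, of which the unigram model used here is the simplest instance.

from collections import Counter
import math

def unigram_model(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_log_likelihood(query_tokens, doc_model, coll_model, lam=0.8):
    # log P(Q|D): linearly smooth the document model with the collection model
    return sum(math.log(lam * doc_model.get(w, 0.0) +
                        (1 - lam) * coll_model.get(w, 1e-9))
               for w in query_tokens)

docs = {
    "d1": "the president visited new york yesterday".split(),
    "d2": "query likelihood models rank documents for retrieval".split(),
}
coll_model = unigram_model([w for toks in docs.values() for w in toks])
doc_models = {d: unigram_model(toks) for d, toks in docs.items()}

query = "query likelihood retrieval".split()
ranked = sorted(doc_models, reverse=True,
                key=lambda d: query_log_likelihood(query, doc_models[d], coll_model))
print(ranked)   # 'd2' is ranked above 'd1' for this query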
Statistical LM for Document Summarization • Estimate the likelihood of each sentence of the document being included in the summary and rank the sentences accordingly, i.e., by P(S|D) ∝ P(D|S) P(S) • The sentence generative probability (or document likelihood) P(D|S) can be taken as a relevance measure between the document and the sentence • The sentence prior probability P(S) is, to some extent, a measure of the importance of the sentence itself
n-Gram Language Models (1/3) • For a word sequence W = w_1 w_2 … w_m, the probability P(W) can be decomposed into a product of conditional probabilities (multiplication rule): P(W) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) … P(w_m|w_1 … w_{m-1}) • E.g., P(w_1 w_2 w_3) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) • However, it is impossible to estimate and store P(w_i|w_1 … w_{i-1}) if the history is long (the curse of dimensionality)
n-Gram Language Models (2/3) • n-gram approximation: P(w_i|w_1 … w_{i-1}) ≈ P(w_i|w_{i-n+1} … w_{i-1}), i.e., the history is truncated to the preceding n-1 words • Also called (n-1)-order Markov modeling • The most prevailing language model • E.g., trigram modeling: P(w_i|w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}) • How do we find the probabilities? (maximum likelihood estimation) • Get real text, and start counting (empirically); note that a count, and hence a probability, may be zero (a counting sketch is given below)
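A minimal sketch of maximum likelihood trigram estimation by counting. The toy corpus, the "<s>" padding tokens, and the function name are illustrative assumptions.

from collections import Counter

def mle_trigram(tokens):
    # P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2, w3): c / bi[(w1, w2)] for (w1, w2, w3), c in tri.items()}

tokens = "<s> <s> the president visited new york </s>".split()
probs = mle_trigram(tokens)
print(probs[("the", "president", "visited")])   # 1.0 in this tiny corpus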
n-Gram Language Models (3/3) • Minimum Word Error (MWE) Discriminative Training • Given a training set of observation sequences, the MWE criterion aims to minimize the expected word errors over these observation sequences • The MWE objective function can be optimized with respect to the language model probabilities using the Extended Baum-Welch (EBW) algorithm
n-Gram-Based Retrieval Model (1/2) • Each document is a probabilistic generative model consisting of a set of n-gram distributions for predicting the query • Document models can be optimized by the expectation-maximization (EM) or minimum classification error (MCE) training algorithms, given a set of query and relevant-document pairs • Features: • 1. A formal mathematical framework • 2. Uses collection statistics rather than heuristics • 3. The retrieval system can be gradually improved through usage
n-Gram-Based Retrieval Model (2/2) • MCE training • Given a query and a desired relevant document, define a classification error function (“>0” means misclassified; “<=0” means a correct decision) • Transform the error function into a loss function • Iteratively update the weighting parameters • All irrelevant documents in the answer set can also be taken into consideration
n-Gram-Based Summarization Model (1/2) • Each sentence of the spoken document is treated as a probabilistic generative model of n-grams, while the spoken document is the observation • The sentence model is estimated from the sentence itself • The collection model is estimated from a large corpus and is interpolated with the sentence model so that every word in the vocabulary has some probability (a scoring sketch is given below)
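A minimal sketch of scoring a document under an interpolated unigram sentence model, where the sentence model is estimated from the sentence and the collection model from a larger corpus. The interpolation weight lam=0.7, the toy text, and the function names are illustrative assumptions.

from collections import Counter
import math

def unigram(tokens):
    c = Counter(tokens)
    n = sum(c.values())
    return {w: cnt / n for w, cnt in c.items()}

def doc_log_likelihood(doc_tokens, sent_model, coll_model, lam=0.7):
    # log P(D|S) = sum over words in D of log[ lam*P(w|S) + (1-lam)*P(w|C) ]
    return sum(math.log(lam * sent_model.get(w, 0.0) +
                        (1 - lam) * coll_model.get(w, 1e-9))
               for w in doc_tokens)

sentence = "the president announced a new policy".split()
document = "the president announced a policy about language technology".split()
collection = sentence + document + "other text in the corpus".split()
print(doc_log_likelihood(document, unigram(sentence), unigram(collection)))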
n-Gram-Based Summarization Model (2/2) • Relevance Model (RM) • Used to improve the estimation of the sentence models • Each sentence has its own associated relevance model, constructed from the subset of documents in the collection that are relevant to the sentence • The relevance model is then linearly combined with the original sentence model to form a more accurate sentence model
Categorization of Statistical Language Models (2/4) 1. Word-based LMs • The n-gram model is usually the basic model of this category • Many other models in this category are designed to overcome the major drawback of n-gram models • That is, to capture long-distance word dependence information without rapidly increasing the model complexity • E.g., the mixed-order Markov model and the trigger-based language model 2. Word class (or topic)-based LMs • These models are similar to the n-gram model, but the relationship among words is constructed via (latent) word classes • Once the relationship is established, the probability of a decoded word given the history words can easily be determined • E.g., the class-based n-gram model, the aggregate Markov model, and the word topical mixture model (WTMM)
Categorization of Statistical Language Models (3/4) 3. Structure-based LMs • Under the constraints of a grammar, rules for a sentence may be derived and represented as a parse tree • We can then select among candidate words using the sentence patterns or the head words of the history • E.g., the structured language model 4. Document class (or topic)-based LMs • Words are aggregated in a document to represent some topics (or concepts). During speech recognition, the history is considered an incomplete document and the associated latent topic distributions can be discovered on the fly • The decoded words related to most of the topics that the history probably belongs to can therefore be selected • E.g., the mixture-based language model, probabilistic latent semantic analysis (PLSA), and latent Dirichlet allocation (LDA)
Categorization of Statistical Language Models (4/4) • Ironically, the most successful statistical language modeling techniques use very little knowledge of what language really is • A word sequence may be treated as a sequence of arbitrary symbols, with no deep structure, intention, or thought behind it • F. Jelinek said “put language back into language modeling” • “Closing remarks” presented at the 1995 Language Modeling Summer Workshop, Baltimore
Main Issues for Statistical Language Models • Evaluation • How can you tell a good language model from a bad one? • Run a speech recognizer or adopt other statistical measurements • Smoothing • Deal with the data sparseness of real training data • Various approaches have been proposed • Adaptation • The subject matter and lexical characteristics of the linguistic contents of utterances or documents (e.g., news articles) can be very diverse and often change with time • LMs should be adapted accordingly • Caching: if you say something, you are likely to say it again later • Adjust word frequencies observed in the current conversation
Evaluation (1/7) • The two most common metrics for evaluating a language model • Word Recognition Error Rate (WER) • Perplexity (PP) • Word Recognition Error Rate • Requires the participation of a speech recognition system (slow!) • Needs to deal with the combination of acoustic probabilities and language model probabilities (penalizing or weighting between them)
Evaluation (2/7) • Perplexity • Perplexity is the geometric average of the inverse language model probability (it measures language model difficulty, not acoustic difficulty/confusability) • It can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model • For trigram modeling over a test text of N words: PP = [ Π_{i=1}^{N} P(w_i | w_{i-2} w_{i-1}) ]^{-1/N} (a computation sketch is given below)
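A minimal sketch of computing test-set perplexity for a given model. The conditional-probability callback, the bigram order, and the uniform toy model are illustrative assumptions; the uniform 10-word model reproduces the digit-recognition example (perplexity 10) on a later slide.

import math

def perplexity(test_tokens, cond_prob, order=2):
    # PP = exp( -(1/N) * sum_i log P(w_i | history_i) )
    log_sum, n = 0.0, 0
    for i in range(order - 1, len(test_tokens)):
        history = tuple(test_tokens[i - order + 1:i])
        log_sum += math.log(cond_prob(test_tokens[i], history))
        n += 1
    return math.exp(-log_sum / n)

uniform = lambda w, h: 1.0 / 10          # uniform model over a 10-word vocabulary
print(perplexity("one two three four five six".split(), uniform))   # 10.0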
Evaluation (3/7) • More about Perplexity • Perplexity is an indication of the complexity of the language if we have an accurate estimate of the word-sequence probability • A language with higher perplexity means that the number of words branching from a previous word is larger on average • A language with perplexity L has roughly the same difficulty as another language in which every word can be followed by L different words with equal probabilities • Examples: • Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10 • Ask a speech recognizer to recognize names at a large institute (10,000 persons) – hard – perplexity 10,000
Evaluation (4/7) • More about Perplexity (Cont.) • Training-set perplexity: measures how the language model fits the training data • Test-set perplexity: evaluates the generalization capability of the language model • When we say perplexity, we mean “test-set perplexity”
Evaluation (5/7) • Is a language model with lower perplexity always better? • The true (optimal) model for the data has the lowest possible perplexity • The lower the perplexity, the closer we are to the true model • Typically, perplexity correlates well with speech recognition word error rate • It correlates better when both models are trained on the same data • It doesn’t correlate well when the training data changes • The 20,000-word continuous speech recognition task on the Wall Street Journal (WSJ) corpus has a perplexity of about 128 ~ 176 (trigram) • The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20
Evaluation (6/7) • The perplexity of bigram models with different vocabulary sizes
Evaluation (7/7) • A rough rule of thumb (recommended by Rosenfeld) • A reduction of 5% in perplexity is usually not practically significant • A 10% ~ 20% reduction is noteworthy, and usually translates into some improvement in application performance • A perplexity improvement of 30% or more over a good baseline is quite significant • Note: perplexity cannot always reflect the difficulty of a speech recognition task (e.g., tasks of recognizing 10 isolated words using IBM ViaVoice)
Smoothing (1/3) • The maximum likelihood (ML) estimates of language model probabilities were shown previously, e.g.: • Trigram probabilities: P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}) • Bigram probabilities: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
Smoothing (2/3) • Data Sparseness • Many actually possible events (word successions) in the test set may not be observed in the training set/data • E.g., with bigram modeling: if P(read|Mulan) = 0, then P(Mulan read a book) = 0, so P(W) = 0 and P(X|W)P(W) = 0 • Whenever such a string occurs during the speech recognition task, an error will be made
Smoothing (3/3) • Smoothing • Assign all strings (or events/word successions) a nonzero probability, even if they never occur in the training data • Smoothing tends to make distributions flatter by adjusting low probabilities upward and high probabilities downward
Smoothing: Simple Models • Add-one smoothing: pretend each trigram occurs once more than it actually does, i.e., P(w_i | w_{i-2} w_{i-1}) = (C(w_{i-2} w_{i-1} w_i) + 1) / (C(w_{i-2} w_{i-1}) + V) • Add-delta smoothing: P(w_i | w_{i-2} w_{i-1}) = (C(w_{i-2} w_{i-1} w_i) + δ) / (C(w_{i-2} w_{i-1}) + δV) • These work badly! DO NOT DO THESE TWO. (a sketch is given below)
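A minimal sketch of add-delta smoothing on bigrams (delta = 1 gives add-one). The toy corpus, the vocabulary handling, and the function name are illustrative assumptions; as the slide notes, these smoothers perform poorly in practice and are shown only to make the mechanics concrete.

from collections import Counter

def add_delta_bigram(tokens, vocab, delta=1.0):
    # P(w2|w1) = (C(w1 w2) + delta) / (C(w1) + delta*|V|)
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    V = len(vocab)
    def prob(w2, w1):
        return (bi[(w1, w2)] + delta) / (uni[w1] + delta * V)
    return prob

tokens = "the cat sat on the mat".split()
p = add_delta_bigram(tokens, set(tokens), delta=1.0)
print(p("cat", "the"), p("dog", "the"))   # the unseen bigram still gets nonzero probability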
Smoothing: Back-Off Models • The general form for n-gram back-off: P_BO(w_i | w_{i-n+1} … w_{i-1}) = P_smoothed(w_i | w_{i-n+1} … w_{i-1}) if C(w_{i-n+1} … w_i) > 0, and α(w_{i-n+1} … w_{i-1}) · P_BO(w_i | w_{i-n+2} … w_{i-1}) otherwise • α(·) is a normalizing/scaling factor chosen to make the conditional probabilities sum to 1 • I.e., back off from the smoothed n-gram to the (n-1)-gram when the n-gram count is zero
Smoothing: Interpolated Models • The general form for interpolated n-gram smoothing: P_INT(w_i | w_{i-n+1} … w_{i-1}) = λ · P_ML(w_i | w_{i-n+1} … w_{i-1}) + (1 - λ) · P_INT(w_i | w_{i-n+2} … w_{i-1}) • The key difference between back-off and interpolated models • For n-grams with nonzero counts, interpolated models use information from lower-order distributions while back-off models do not • Moreover, in interpolated models, n-grams with the same counts can have different probability estimates (a sketch is given below)
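A minimal sketch of a linearly interpolated trigram model mixing trigram, bigram, and unigram ML estimates. The fixed interpolation weights (0.6, 0.3, 0.1) and the toy corpus are illustrative assumptions; in practice the weights would be tuned on held-out data.

from collections import Counter

class InterpolatedTrigram:
    def __init__(self, tokens, lambdas=(0.6, 0.3, 0.1)):
        self.l3, self.l2, self.l1 = lambdas
        self.uni = Counter(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        self.n = len(tokens)

    def prob(self, w, h2, h1):
        # P(w | h2 h1) = l3*P_ML(w|h2 h1) + l2*P_ML(w|h1) + l1*P_ML(w)
        p3 = self.tri[(h2, h1, w)] / self.bi[(h2, h1)] if self.bi[(h2, h1)] else 0.0
        p2 = self.bi[(h1, w)] / self.uni[h1] if self.uni[h1] else 0.0
        p1 = self.uni[w] / self.n
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1

lm = InterpolatedTrigram("the cat sat on the mat and the cat slept".split())
print(lm.prob("sat", "the", "cat"))   # 0.46 on this toy corpus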
Caching (1/2) • The basic idea of caching is to accumulate the n-grams dictated so far in the current document/conversation and use them to create a dynamic n-gram model • Trigram interpolated with a unigram cache • Trigram interpolated with a bigram cache (a sketch of the unigram-cache case is given below)
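A minimal sketch of a static trigram model interpolated with a unigram cache. The cache weight, the stand-in uniform base model, and the class/method names are illustrative assumptions.

from collections import Counter

class UnigramCacheLM:
    def __init__(self, base_prob, cache_weight=0.1):
        # base_prob(w, h2, h1): any static trigram probability function
        self.base_prob = base_prob
        self.mu = cache_weight
        self.cache = Counter()

    def observe(self, w):
        # accumulate the words dictated so far in the current document/conversation
        self.cache[w] += 1

    def prob(self, w, h2, h1):
        # P(w|h) = (1-mu)*P_static(w|h) + mu*C_cache(w)/|cache|
        p_cache = self.cache[w] / sum(self.cache.values()) if self.cache else 0.0
        return (1 - self.mu) * self.base_prob(w, h2, h1) + self.mu * p_cache

uniform_trigram = lambda w, h2, h1: 1.0 / 1000        # stand-in static model
cached = UnigramCacheLM(uniform_trigram, cache_weight=0.1)
for w in "i swear to tell the truth".split():
    cached.observe(w)
print(cached.prob("truth", "the", "whole"))           # boosted by the cache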
Caching (2/2) • Real Life of Caching • Someone says “I swear to tell the truth” • The system hears “I swerve to smell the soup” • Someone then says “The whole truth”, and, with the cache, the system hears “The toll booth” – errors are locked in, because the cache remembers! • Caching works well when users correct as they go; it works poorly, or even hurts, without corrections Adopted from Joshua Goodman’s public presentation file
LM Integrated into Speech Recognition • Theoretically, the decoder should pick the word sequence maximizing P(X|W)P(W) • Practically, the language model is a better-behaved predictor, while the acoustic probabilities aren’t “real” probabilities, so the two scores are weighted against each other • Insertions are also penalized; a common formulation is shown below
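A commonly used practical decoding score (a standard formulation assumed here, not quoted from the slides), with language model weight α and word insertion penalty β:

\hat{W} = \arg\max_{W}\Big[\log P(X \mid W) + \alpha \log P(W) - \beta\,|W|\Big]

where |W| is the number of words in the hypothesis; α is typically greater than 1 to compensate for the much larger dynamic range of the acoustic log-likelihoods, and β controls the insertion rate.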
Known Weaknesses in Current n-Gram LMs • Brittleness Across Domains • Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained • E.g., conversations vs. news broadcasts, fiction vs. politics • Language model adaptation • In-domain or contemporary text corpora/speech transcripts • Static or dynamic adaptation • Local contextual (n-gram) or global semantic/topical information • False Independence Assumption • In order to remain trainable, n-gram modeling assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words • (n-1)-order Markov modeling
n-Gram Language Model Adaptation (1/4) • Count Merging • The n-gram conditional probabilities for each history form a multinomial distribution over the vocabulary • The parameters form sets of independent Dirichlet distributions, one per possible n-gram history, with corresponding hyperparameters • The MAP estimate maximizes the posterior distribution of the parameters
n-Gram Language Model Adaptation (2/4) • Count Merging (cont.) • Maximize the posterior distribution of the parameters subject to the sum-to-one constraint • Differentiate with respect to the parameters and the Lagrange multiplier
n-Gram Language Model Adaptation (3/4) • Count Merging (cont.) • Parameterization of the prior distribution (I) • The resulting adaptation formula for Count Merging combines weighted n-gram counts from the background corpus and the adaptation corpus (see the sketch after the next slide)
n-Gram Language Model Adaptation (4/4) • Model Interpolation • Parameterization of the prior distribution (II) • The resulting adaptation formula for Model Interpolation linearly combines the background model probabilities and the adaptation model probabilities (a sketch contrasting the two schemes is given below)
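A minimal sketch contrasting count merging and model interpolation for bigram adaptation. The mixing parameters (beta, lam), the toy background/adaptation corpora, and the function names are illustrative assumptions, not the slides' exact parameterization.

from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def count_merging(w, h, bg, ad, beta=5.0):
    # merge counts: P(w|h) = (C_bg(h,w) + beta*C_ad(h,w)) / (C_bg(h) + beta*C_ad(h))
    (bg_bi, bg_uni), (ad_bi, ad_uni) = bg, ad
    den = bg_uni[h] + beta * ad_uni[h]
    return (bg_bi[(h, w)] + beta * ad_bi[(h, w)]) / den if den else 0.0

def model_interpolation(w, h, bg, ad, lam=0.5):
    # interpolate probabilities: P(w|h) = lam*P_bg(w|h) + (1-lam)*P_ad(w|h)
    (bg_bi, bg_uni), (ad_bi, ad_uni) = bg, ad
    p_bg = bg_bi[(h, w)] / bg_uni[h] if bg_uni[h] else 0.0
    p_ad = ad_bi[(h, w)] / ad_uni[h] if ad_uni[h] else 0.0
    return lam * p_bg + (1 - lam) * p_ad

background = bigram_counts("news text about politics and economy".split() * 10)
adaptation = bigram_counts("breaking news about a typhoon".split())
print(count_merging("about", "news", background, adaptation))
print(model_interpolation("about", "news", background, adaptation))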
Class-based Bigram Model • (Hidden Markov models for) the class-based bigram model (Jelinek et al., 1992) • Nondeterministic class assignment: each word may belong to several classes, and the class sequence is treated as hidden and summed over • Deterministic class assignment: each word belongs to exactly one class, giving P(w_i | w_{i-1}) = P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)) • Estimation of the class bigram and within-class word probabilities by relative frequencies, e.g., P(c_i | c_{i-1}) = C(c_{i-1} c_i) / C(c_{i-1}) and P(w | c) = C(w) / C(c)
Aggregate Markov Model (1/2) • An alternative approach to class-based bigram LMs (Saul & Pereira, 1997): P(w_t | w_{t-1}) = Σ_c P(c | w_{t-1}) P(w_t | c) • Models are trained by maximizing the log-likelihood of the training corpus
Aggregate Markov Model (2/2) • Model training using the EM algorithm • Expectation: for each bigram (w_1, w_2), compute the class posterior q(c | w_1, w_2) ∝ P(c | w_1) P(w_2 | c) • Maximization: re-estimate P(c | w_1) ∝ Σ_{w_2} n(w_1, w_2) q(c | w_1, w_2) and P(w_2 | c) ∝ Σ_{w_1} n(w_1, w_2) q(c | w_1, w_2) (a sketch is given below)
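A minimal sketch of EM training for an aggregate Markov model P(w2|w1) = Σ_c P(c|w1) P(w2|c). The class count C=2, the random initialization, the fixed number of iterations, and the toy 4-word bigram count matrix are illustrative assumptions.

import numpy as np

def train_aggregate_markov(counts, C=2, iters=50, seed=0):
    # counts: V x V matrix of bigram counts n(w1, w2)
    V = counts.shape[0]
    rng = np.random.default_rng(seed)
    p_c_w1 = rng.dirichlet(np.ones(C), size=V)        # P(c | w1), shape (V, C)
    p_w2_c = rng.dirichlet(np.ones(V), size=C)        # P(w2 | c), shape (C, V)
    for _ in range(iters):
        # E-step: q(c | w1, w2) proportional to P(c|w1) * P(w2|c)
        joint = p_c_w1[:, :, None] * p_w2_c[None, :, :]          # (V, C, V)
        q = joint / joint.sum(axis=1, keepdims=True)
        weighted = counts[:, None, :] * q                        # n(w1,w2) * q(c|w1,w2)
        # M-step: re-estimate both parameter sets from expected counts
        p_c_w1 = weighted.sum(axis=2)
        p_c_w1 /= p_c_w1.sum(axis=1, keepdims=True)
        p_w2_c = weighted.sum(axis=0)
        p_w2_c /= p_w2_c.sum(axis=1, keepdims=True)
    return p_c_w1, p_w2_c

counts = np.array([[0, 3, 1, 0],
                   [2, 0, 0, 4],
                   [1, 0, 0, 2],
                   [0, 5, 1, 0]], dtype=float)
p_c_w1, p_w2_c = train_aggregate_markov(counts, C=2)
print(p_c_w1 @ p_w2_c)    # model bigram matrix: P(w2|w1) = sum_c P(c|w1) P(w2|c)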
Mixed-order Markov Model (1/2) • Probability distribution: combines skip-k transition matrices P_k(w_t | w_{t-k}), k = 1 … m, over the history positions w_{t-1}, w_{t-2}, …, w_{t-m} • Can be viewed as a coin-toss process over the history positions • Position 1 is used with probability λ_1(w_{t-1}) • Position 2 is used with probability [1 - λ_1(w_{t-1})] λ_2(w_{t-2}) • Position 3 is used with probability [1 - λ_1(w_{t-1})][1 - λ_2(w_{t-2})] λ_3(w_{t-3}) • Position m is used with probability Π_{j<m} [1 - λ_j(w_{t-j})] λ_m(w_{t-m})
Mixed-order Markov Model (2/2) • Model training using the EM algorithm • Expectation: compute the posterior probability of which history position generated each word • Maximization: re-estimate the λ’s and the skip-k transition distributions from these posteriors • The raw counts of k-separated bigrams give good initial estimates
Conclusions • Statistical language modeling has been demonstrated to be an effective probabilistic framework for NLP, ASR, and IR-related applications • Many issues remain to be solved for statistical language modeling, e.g., • Unknown word (or spoken term) detection • Discriminative training of language models • Adaptation of language models across different domains and genres • Fusion of various (or different levels of) features for language modeling • Positional information? • Rhetorical (structural) information?
References • J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication 42(1), 93-108, 2004 • X. Liu, W. B. Croft. Statistical language modeling for information retrieval. Annual Review of Information Science and Technology 39, Chapter 1, 2005 • R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, August 2000 • J. Goodman. A bit of progress in language modeling, extended version. Microsoft Research Technical Report MSR-TR-2001-72, 2001 • H.S. Chiu, B. Chen. Word topical mixture models for dynamic language model adaptation. ICASSP 2007 • J.W. Kuo, B. Chen. Minimum word error based discriminative training of language models. Interspeech 2005 • B. Chen, H.M. Wang, L.S. Lee. Spoken document retrieval and summarization. Advances in Chinese Spoken Language Processing, Chapter 13, 2006
Maximum Likelihood Estimate (MLE) for n-Grams (1/2) • Given a training corpus T and the language model, n-grams with the same history are collected together • Essentially, the distribution of the sample counts with the same history is a multinomial distribution • E.g., from a (Chinese) corpus such as “…陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …” (“… President Chen Shui-bian visited New York, USA … President Chen Shui-bian stated in Panama …”), estimate P(總統|陳水扁), i.e., P(President | Chen Shui-bian)
Maximum Likelihood Estimate (MLE) for n-Grams (2/2) • Take the logarithm of the corpus likelihood: log P(T) = Σ_{h,w} C(h, w) log P(w | h) • For any history h, maximize Σ_w C(h, w) log P(w | h) subject to Σ_w P(w | h) = 1, which yields the relative-frequency estimate P(w | h) = C(h, w) / C(h)