Language Model Major role: Language models help a speech recognizer estimate how likely a word sequence is, independent of the acoustics. Many candidate word sequences can be eliminated, and the more plausible ones can be given higher probabilities.
LM This lets the recognizer make the right guess when two different sentences sound the same. For example: • It’s fun to recognize speech • It’s fun to wreck a nice beach
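LM The Bayesian rule: $\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\,P(W)}{P(A)}$, where $A$ is the acoustic observation and $W$ a candidate word sequence. $P(A)$ is the same for every candidate, so maximizing $P(W \mid A)$ means maximizing $P(A \mid W)\,P(W)$. Today we look at the language model term $P(W)$.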
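A minimal sketch of how the acoustic score and the language model score combine when choosing between candidates. The numbers are invented for illustration: `acoustic` stands for the acoustic likelihood P(A|W) and `lm` for the language model probability P(W), and neither comes from a real recognizer.

```python
import math

# Hypothetical scores for two candidates that sound alike (all values invented).
# acoustic[W] plays the role of P(A|W), lm[W] the role of P(W).
acoustic = {
    "it's fun to recognize speech": 0.30,
    "it's fun to wreck a nice beach": 0.32,
}
lm = {
    "it's fun to recognize speech": 1e-6,
    "it's fun to wreck a nice beach": 1e-9,
}

def decode(candidates):
    """Return the candidate W that maximizes log P(A|W) + log P(W)."""
    return max(candidates, key=lambda w: math.log(acoustic[w]) + math.log(lm[w]))

best = decode(list(acoustic))
print(best)
```

With these made-up numbers the acoustics slightly prefer the second sentence, but the language model term outweighs that preference.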
LM The ultimate goal is a speech recognizer that performs as well as a human being. A lot of research on this has been done in psychology. • The *eel was on the shoe • The *eel was on the car (* marks a masked sound: listeners report "heel" in the first sentence and "wheel" in the second.) People are capable of adjusting to the right context, which • removes ambiguities • limits possible words Very good language models already exist for dedicated applications (e.g. medical, where much of the language is standardized)
classification Language models used in speech recognition can be classified into the following categories: • Uniform models: every word has probability 1 / V, where V is the size of the vocabulary • Finite state machines • Grammar models: they use context-free grammars • Stochastic models: they determine the probability of a word from its preceding words (e.g. n-grams)
CFG A grammar is defined by: G = (V, T, P, S) where: V is the set of all non-terminal symbols. T is the set of all terminal symbols. P is the set of productions (production rules). S is a special symbol called the start symbol. Example of rules: • S -> NP VP • VP -> VERB NP • NP -> NOUN • NP -> NAME • NOUN -> speech • NAME -> Julie | Ethan • VERB -> loves | chases
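As an illustration, the example rules can be written as a Python dictionary and expanded from the start symbol. This is a sketch under two assumptions: the verb rule is read as VP -> VERB NP, and the names `RULES` and `expand` are introduced here only for the example.

```python
from itertools import product

# Production rules P; keys are non-terminals, any other symbol is a terminal.
RULES = {
    "S": [["NP", "VP"]],
    "VP": [["VERB", "NP"]],          # assumption: VP -> VERB NP
    "NP": [["NOUN"], ["NAME"]],
    "NOUN": [["speech"]],
    "NAME": [["Julie"], ["Ethan"]],
    "VERB": [["loves"], ["chases"]],
}

def expand(symbol):
    """Return every word tuple derivable from symbol.

    Only terminates for non-recursive grammars such as this toy one.
    """
    if symbol not in RULES:           # terminal symbol
        return [(symbol,)]
    results = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol, left to right.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results

sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
```

The grammar also generates sentences such as "speech chases Julie", which shows that a small CFG defines what is syntactically legal, not what is plausible.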
CFG Parsing • Bottom up: start with the input sentence and try to reach the start symbol • Top down: start with the start symbol and try to reach the input sentence by applying the appropriate rules. Left recursion (A -> Aa) is a problem for top-down parsing. Advantage of bottom up: “What is the weather forecast for this afternoon?” Many parsing algorithms are available from computer science. Problem: people don’t follow the rules of grammar strictly, especially in spoken language. Creating a grammar that covers all these constructions is infeasible.
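Bottom-up parsing can be sketched with a CYK-style table for the toy grammar of the CFG slide. This is one possible bottom-up algorithm, not necessarily the one the slides had in mind; it assumes the verb rule is VP -> VERB NP, and the names `BINARY`, `UNARY`, `LEXICON` and `accepts` are made up for the example.

```python
from itertools import product

# Toy grammar, split by rule shape (assumes VP -> VERB NP).
BINARY = {("NP", "VP"): "S", ("VERB", "NP"): "VP"}    # A -> B C
UNARY = {"NOUN": "NP", "NAME": "NP"}                   # A -> B
LEXICON = {"speech": "NOUN", "Julie": "NAME", "Ethan": "NAME",
           "loves": "VERB", "chases": "VERB"}

def accepts(words):
    """Bottom-up recognition: True if the words can be reduced to S."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that span words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        tag = LEXICON.get(word)
        if tag is None:               # word outside the vocabulary
            return False
        table[i][i].add(tag)
        if tag in UNARY:
            table[i][i].add(UNARY[tag])
    for span in range(2, n + 1):      # grow spans from short to long
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):     # split point between the two halves
                for left, right in product(table[i][k], table[k + 1][j]):
                    parent = BINARY.get((left, right))
                    if parent:
                        table[i][j].add(parent)
    return "S" in table[0][n - 1]

print(accepts("Julie loves speech".split()))
```

Because the table is filled from the words upward, the partial constituents found along the way remain available even when the whole input does not reduce to S.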
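probabilistic CFG A mixture between formal language and probabilistic models is the PCFG. If there are m rules for the left-hand side non-terminal node $A$, namely $A \to \alpha_1, \ldots, A \to \alpha_m$, then the probability of these rules is $P(A \to \alpha_j \mid G) = \frac{C(A \to \alpha_j)}{\sum_{i=1}^{m} C(A \to \alpha_i)}$, where $C$ denotes the number of times each rule is used.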
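A small sketch of estimating rule probabilities from usage counts: each rule's count is divided by the total count of all rules sharing its left-hand side. The rule applications below are hypothetical, and `rule_probabilities` is a name chosen for this example.

```python
from collections import Counter, defaultdict

def rule_probabilities(rule_uses):
    """Estimate rule probabilities from observed rule applications.

    rule_uses: iterable of (lhs, rhs) pairs, one per time a rule was used.
    Returns {(lhs, rhs): count / total count of rules with the same lhs}.
    """
    counts = Counter(rule_uses)
    totals = defaultdict(int)
    for (lhs, _rhs), count in counts.items():
        totals[lhs] += count
    return {rule: count / totals[rule[0]] for rule, count in counts.items()}

# Hypothetical counts, e.g. gathered from a small parsed corpus.
uses = ([("NP", ("NAME",))] * 3
        + [("NP", ("NOUN",))] * 1
        + [("VP", ("VERB", "NP"))] * 2)
probs = rule_probabilities(uses)
print(probs[("NP", ("NAME",))])
```

The probabilities of all rules with the same left-hand side sum to 1 by construction.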
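Stochastic language models In formal language theory, P(W) can be regarded as 1 if the word sequence is accepted and as 0 if it is rejected. N-grams: $P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$, where $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability that $w_i$ will follow, given that the word sequence $w_1, \ldots, w_{i-1}$ was presented previously.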
N-grams Unigram: $P(w_i)$ Bigram: $P(w_i \mid w_{i-1})$ Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
N-gram example To calculate the trigram probability $P(\text{here} \mid \text{I am}) = C(\text{I am here}) / C(\text{I am})$, we need both the number of times "am" is preceded by "I" and the number of times "here" is preceded by "I am." All four candidates sound the same; the right decision can only be made by the language model.
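The counting can be sketched as follows. The corpus is a hypothetical toy sentence, and `ngram_counts` and `trigram_prob` are names introduced for this example; the estimate is the plain relative frequency, with no smoothing.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def trigram_prob(tokens, w1, w2, w3):
    """Relative-frequency estimate P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)."""
    trigrams = ngram_counts(tokens, 3)
    bigrams = ngram_counts(tokens, 2)
    if bigrams[(w1, w2)] == 0:
        return 0.0                    # history never seen in the corpus
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# Hypothetical toy corpus: "I am" occurs 3 times, "I am here" twice.
corpus = "I am here and I am there and I am here".split()
p = trigram_prob(corpus, "I", "am", "here")
print(p)
```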
training Training is done on very large training sets with millions of words. Still, many legal word sequences will never be seen during training. Because it is infeasible to train on every possible sequence of words, P(W) will come out as zero for some legal sequences.
training Solutions to overcome this problem • A practical approach is to assume this probability depends only on an equivalence class. For example, group all nouns in an equivalence class. • A technique called smoothing adjusts very low and very high probabilities, so 0 and 1 won’t occur anymore.
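The slide does not name a particular smoothing method. As one simple illustration, the sketch below uses add-one (Laplace) smoothing on bigrams, which adds 1 to every bigram count so that unseen word pairs get a small non-zero probability. The toy corpus and the name `add_one_bigram` are assumptions made for the example.

```python
from collections import Counter

def add_one_bigram(tokens):
    """Return prob(word, prev): bigram probability with add-one smoothing."""
    vocab = set(tokens)
    histories = Counter(tokens[:-1])              # times each word starts a bigram
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(word, prev):
        # (C(prev word) + 1) / (C(prev) + V), with V the vocabulary size
        return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))

    return prob

# Hypothetical toy corpus.
tokens = "john read a book john read her book".split()
p = add_one_bigram(tokens)
print(p("read", "john"), p("book", "john"))
```

Here "john book" never occurs in the corpus, yet it receives probability 1/7 instead of 0, while the probabilities of all words following "john" still sum to 1.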
evaluation The most common metric for an LM is the word recognition error rate. This requires a complete speech recognition (SR) system. Another metric is perplexity.
perplexity Encode the text W using $-\log_2 P(W)$ bits. Then the cross-entropy H(W) is: $H(W) = -\frac{1}{N} \log_2 P(W)$, where N is the length of the text. The perplexity is then defined as: $PP(W) = 2^{H(W)}$
example Training set: • John read her book • I read a different book • John read a book by Mulan
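example Counting bigrams in the training set gives P(John|<s>) = 2/3, P(read|John) = 2/2, P(a|read) = 2/3, P(book|a) = 1/2 and P(</s>|book) = 2/3. These bigram probabilities help us estimate the probability for the sentence as: P(John read a book) = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) = 0.148 Then cross entropy: $-\frac{1}{4} \log_2(0.148) = 0.689$ So perplexity = $2^{0.689} = 1.61$ Comparison: Wall Street Journal text (5000 words) has a bigram perplexity of 128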
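The numbers of this example can be reproduced with a short script. It is a sketch using unsmoothed relative-frequency bigram estimates with sentence boundary markers `<s>` and `</s>`. Following the slide, N is taken as 4 (the number of words), even though the product has five factors. The function names are introduced for the example.

```python
import math
from collections import Counter

training = [
    "John read her book",
    "I read a different book",
    "John read a book by Mulan",
]

history_counts = Counter()   # how often each token starts a bigram
bigram_counts = Counter()
for sentence in training:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    history_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """Relative-frequency estimate P(word | prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]

def sentence_prob(sentence):
    """Product of the bigram probabilities, including <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return math.prod(bigram_prob(w, prev) for prev, w in zip(tokens, tokens[1:]))

p = sentence_prob("John read a book")
n = 4                                    # words in the sentence, as on the slide
cross_entropy = -math.log2(p) / n
perplexity = 2 ** cross_entropy
print(round(p, 3), round(cross_entropy, 3), round(perplexity, 2))  # 0.148 0.689 1.61
```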
evaluation High perplexity means that the number of words that can follow a given word is larger on average. Low perplexity does not guarantee good performance. For example, recognizing the letters B, C, D, E, G, P, T has a perplexity of only 7, but perplexity does not take into account that these letters are acoustically confusable.
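The value 7 follows if each of the seven letters is assumed equally likely, which is the uniform model with V = 7. The sketch below checks this; the test sequence is arbitrary and the function name `perplexity` is chosen for the example.

```python
import math

def perplexity(word_probs):
    """2 ** H, with H = -(1/N) * sum of log2 p over the N word probabilities."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

letters = ["B", "C", "D", "E", "G", "P", "T"]
# Uniform model: every letter has probability 1 / V, whatever came before it.
test_sequence = ["B", "T", "E", "D", "P"]
pp = perplexity([1 / len(letters) for _ in test_sequence])
print(pp)
```

Under a uniform model the perplexity equals the vocabulary size regardless of the test text, which is why it says nothing about how easily the letters are confused acoustically.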