Language Models for Speech Recognition
Speech Recognition • A = a1, a2, …, aT : sequence of acoustic vectors • Find the word sequence W = w1, w2, …, wn so that: W* = argmax_W P(W | A) = argmax_W P(A | W) P(W) • The task of a language model is to make available to the recognizer adequate estimates of the probabilities P(W)
N-gram models • Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word: P(wi | w1, …, wi-1) ≈ P(wi | wi-N+1, …, wi-1) • N=3 trigrams: P(wi | wi-2, wi-1) • N=2 bigrams: P(wi | wi-1) • N=1 unigrams: P(wi)
Parameter estimation: Maximum Likelihood Estimator • N=3 trigrams: P(wi | wi-2, wi-1) = c(wi-2 wi-1 wi) / c(wi-2 wi-1) • N=2 bigrams: P(wi | wi-1) = c(wi-1 wi) / c(wi-1) • N=1 unigrams: P(wi) = c(wi) / N • This will assign zero probabilities to unseen events
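A minimal Python sketch of these maximum-likelihood estimates. The toy corpus and function names are illustrative assumptions, not from the slides:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# Count unigrams, bigrams and trigrams from the training text.
unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_unigram(w):
    return unigrams[w] / N

def p_bigram(w, prev):
    # ML estimate c(prev, w) / c(prev); zero for unseen events.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_trigram(w, prev2, prev1):
    hist = bigrams[(prev2, prev1)]
    return trigrams[(prev2, prev1, w)] / hist if hist else 0.0

print(p_bigram("cat", "the"))          # 2/3
print(p_trigram("sat", "the", "cat"))  # 1/2
```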
Number of Parameters • For a vocabulary of size V, a 1-gram model has V - 1 independent parameters • A 2-gram model has V^2 - 1 independent parameters • In general, an n-gram model has V^n - 1 independent parameters • Typical values for a moderate-size vocabulary of 20,000 words are about 2×10^4 (1-gram), 4×10^8 (2-gram) and 8×10^12 (3-gram)
Number of Parameters • |V| = 60,000, N = 35M words (Eleftherotypia daily newspaper) • In a typical training text, roughly 80% of trigrams occur only once • Good-Turing estimate: ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams
Problems • Data sparseness: we do not have enough data to train the model parameters • Solutions • Smoothing techniques: accurately estimate probabilities in the presence of sparse data • Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off) • Build compact models: they have fewer parameters to train and thus require less data • equivalence classification of words (e.g. grammatical categories (noun, verb, adjective, preposition), semantic labels (city, name, date))
Smoothing • Make distributions more uniform • Redistribute probability mass from higher to lower probabilities
Additive Smoothing • For each n-gram that occurs r times, pretend that it occurs r+1 times • e.g. bigrams: P(wi | wi-1) = (c(wi-1 wi) + 1) / (c(wi-1) + |V|)
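A small sketch of add-one smoothing for bigrams, again on an assumed toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
vocab = set(corpus)
V = len(vocab)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_add_one(w, prev):
    # Add one to every bigram count, including the unseen ones,
    # so no event gets zero probability.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add_one("cat", "the"))   # seen bigram
print(p_add_one("mat", "cat"))   # unseen bigram, still > 0
```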
Good-Turing Smoothing • For any n-gram that occurs r times, pretend that it occurs r* times: r* = (r+1) n_{r+1} / n_r, where n_r is the number of n-grams which occur exactly r times • To convert this count to a probability we just normalize: p_r = r* / N • Total probability of unseen n-grams: n_1 / N
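A sketch of the Good-Turing count adjustment. The fallback when a count-of-counts is zero is an assumption for this toy example; a real implementation would smooth the n_r values (e.g. Simple Good-Turing):

```python
from collections import Counter

corpus = "a b a c a b d a b c e".split()
bigrams = Counter(zip(corpus, corpus[1:]))
N = sum(bigrams.values())

# n[r] = number of distinct bigrams observed exactly r times
n = Counter(bigrams.values())

def gt_count(r):
    # r* = (r + 1) * n_{r+1} / n_r ; fall back to r when n_r or n_{r+1}
    # is zero (toy-data shortcut, not part of the method proper).
    if n[r] == 0 or n[r + 1] == 0:
        return r
    return (r + 1) * n[r + 1] / n[r]

# Discounted probability of a bigram seen r times: p_r = r* / N
p_seen_once = gt_count(1) / N
# Total probability mass reserved for unseen bigrams: n_1 / N
p_unseen_total = n[1] / N
print(p_seen_once, p_unseen_total)
```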
Jelinek-Mercer Smoothing (linear interpolation) • Good-Turing assigns the same probability to all n-grams with the same count • Intuitively, an unseen n-gram should inherit probability mass from its lower-order statistics • Interpolate a higher-order model with a lower-order model: P_interp(wi | wi-1) = λ P_ML(wi | wi-1) + (1 - λ) P_ML(wi) • Given fixed p_ML, it is possible to search efficiently for the λ that maximizes the probability of some held-out data using the Baum-Welch algorithm
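A minimal interpolation sketch; the fixed λ = 0.7 and the toy corpus are assumptions for illustration, whereas in practice λ is tuned on held-out data:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_ml_unigram(w):
    return unigrams[w] / N

def p_ml_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interp(w, prev, lam=0.7):
    # Linear interpolation of the bigram ML estimate with the unigram
    # ML estimate; lam is fixed here, but would normally be optimized
    # on held-out data (e.g. with Baum-Welch / EM).
    return lam * p_ml_bigram(w, prev) + (1 - lam) * p_ml_unigram(w)

print(p_interp("dog", "the"))   # seen bigram
print(p_interp("dog", "sat"))   # unseen bigram, still nonzero
```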
Katz Smoothing (backing-off) • For those events which have been observed in the training data we assume some reliable (discounted) estimate of the probability • For the remaining unseen events we back off to some less specific distribution, e.g. P(wi | wi-1) = α(wi-1) P(wi) when c(wi-1 wi) = 0 • α(wi-1) is chosen so that the total probability sums to 1
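A back-off sketch in the Katz style. The fixed discount D = 0.5 stands in for the Good-Turing discount Katz actually uses, and the toy corpus is assumed; the point is how α(prev) renormalizes the backed-off mass:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)
D = 0.5  # illustrative fixed discount (Katz uses Good-Turing discounts)

def p_unigram(w):
    return unigrams[w] / N

def p_backoff(w, prev):
    if bigrams[(prev, w)] > 0:
        # Seen event: keep a (discounted) relative-frequency estimate.
        return (bigrams[(prev, w)] - D) / unigrams[prev]
    # Unseen event: back off to the unigram, scaled by alpha(prev) so the
    # conditional distribution still sums to one (assumes some words are
    # unseen after prev).
    left_over = D * sum(1 for v in vocab if bigrams[(prev, v)] > 0) / unigrams[prev]
    unseen_mass = sum(p_unigram(v) for v in vocab if bigrams[(prev, v)] == 0)
    alpha = left_over / unseen_mass
    return alpha * p_unigram(w)

# The conditional distribution given "the" sums to 1.
print(sum(p_backoff(w, "the") for w in vocab))
```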
Witten-Bell Smoothing • Model the probability of new events, estimating the probability of seeing such a new event from how often new events appeared as we proceed through the training corpus • In the unigram case this gives T / (N + T), where T is the total number of word types in the corpus
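A conditional (bigram) sketch of the Witten-Bell idea, under the same toy-corpus assumption; here T is the number of distinct word types that have followed the history, and T/(N+T) is the mass reserved for new events (a full model would redistribute that mass via a lower-order distribution):

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def wb_bigram(w, prev):
    # T = number of distinct word types that have followed `prev`;
    # each first occurrence of a follower was a "new event".
    T = sum(1 for b in bigrams if b[0] == prev)
    Nh = unigrams[prev]
    if bigrams[(prev, w)] > 0:
        return bigrams[(prev, w)] / (Nh + T)
    # Total new-event mass after `prev` (not yet split among unseen words).
    return T / (Nh + T)

print(wb_bigram("cat", "the"))  # seen follower of "the"
print(wb_bigram("sat", "the"))  # total new-event mass after "the"
```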
Absolute Discounting • Subtract a constant D from each nonzero count
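A sketch of absolute discounting in its interpolated form, with an assumed constant D = 0.5 and toy corpus; the discounted mass is handed to the unigram distribution:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
N = len(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
D = 0.5  # the constant discount (illustrative value)

def p_abs(w, prev):
    # Subtract D from every nonzero bigram count and give the collected
    # mass to the unigram distribution, weighted by lambda(prev).
    followers = sum(1 for b in bigrams if b[0] == prev)
    lam = D * followers / unigrams[prev]
    discounted = max(bigrams[(prev, w)] - D, 0) / unigrams[prev]
    return discounted + lam * unigrams[w] / N

vocab = set(corpus)
print(sum(p_abs(w, "the") for w in vocab))  # sums to 1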
Kneser-Ney • Lower-order distribution not proportional to the number of occurrences of a word, but to the number of different words that it follows
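A sketch of the Kneser-Ney continuation distribution, combined with absolute discounting as above (D = 0.5 and the toy corpus are assumptions):

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
D = 0.5

# Continuation count: for each word, how many *different* words it follows.
continuation = Counter(w for (_, w) in bigrams)
total_bigram_types = len(bigrams)

def p_continuation(w):
    # Lower-order distribution based on the number of distinct left
    # contexts of w, not on its raw frequency.
    return continuation[w] / total_bigram_types

def p_kn(w, prev):
    followers = sum(1 for b in bigrams if b[0] == prev)
    lam = D * followers / unigrams[prev]
    return max(bigrams[(prev, w)] - D, 0) / unigrams[prev] + lam * p_continuation(w)

# "mat" is fairly frequent but only ever follows "the", so its continuation
# probability is smaller than its raw unigram relative frequency.
print(p_continuation("mat"), unigrams["mat"] / len(corpus))
```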
Measuring Model Quality • Consider the language as an information source L, which emits a sequence of symbols wi from a finite alphabet (the vocabulary) • The quality of a language model M can be judged by its cross entropy with regard to the distribution PT(x) of some hitherto unseen text T: H(PT; M) = - Σ_x PT(x) log2 PM(x) • Intuitively speaking, cross entropy is the entropy of T as “perceived” by the model M
Perplexity • Perplexity: PP = 2^H, where H is the cross entropy • In a language with perplexity X, every word can be followed by X different words with equal probabilities
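A short sketch computing cross entropy (bits per word) and perplexity of a test text; the add-one unigram model, the toy training text and the test text are all assumptions chosen only to keep the example self-contained:

```python
import math
from collections import Counter

train = "the cat sat on the mat the dog sat".split()
test = "the cat sat".split()

N = len(train)
V = len(set(train))
unigrams = Counter(train)

def p_add_one(w):
    # Any smoothed model would do; add-one keeps the example short.
    return (unigrams[w] + 1) / (N + V)

# Cross entropy of the test text under the model, in bits per word.
H = -sum(math.log2(p_add_one(w)) for w in test) / len(test)
perplexity = 2 ** H
print(H, perplexity)
```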
Elements of Information Theory • Entropy: H(X) = - Σ_x p(x) log2 p(x) • Mutual Information (pointwise): PMI(x, y) = log2 [ p(x, y) / (p(x) p(y)) ] • Kullback-Leibler (KL) divergence: D(p || q) = Σ_x p(x) log2 [ p(x) / q(x) ]
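These quantities are easy to compute directly; a small sketch with assumed example distributions p and q:

```python
import math

def entropy(p):
    # H(p) = -sum_x p(x) log2 p(x)
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def pmi(p_xy, p_x, p_y):
    # Pointwise mutual information of one event pair (x, y).
    return math.log2(p_xy / (p_x * p_y))

def kl(p, q):
    # D(p || q) = sum_x p(x) log2(p(x)/q(x)); q must cover the support of p.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1/3, "b": 1/3, "c": 1/3}
print(entropy(p), kl(p, q), pmi(0.2, 0.4, 0.4))
```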
The Greek Language • Highly inflectional language • A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage
Class-based Models • Some words are similar to other words in their meaning and syntactic function • Group words into classes • Fewer parameters • Better estimates
Class-based n-gram models • Suppose that we partition the vocabulary into G classes • This model produces text by first generating a string of classes g1, g2, …, gn and then converting them into the words wi, i = 1, 2, …, n with probability p(wi | gi) (for bigrams: p(wi | wi-1) = p(wi | gi) p(gi | gi-1)) • An n-gram model has V^n - 1 independent parameters (216×10^12 for V = 60,000, n = 3) • A class-based model has G^n - 1 + V - G parameters (about 10^9): G^n - 1 of an n-gram model for a vocabulary of size G, plus V - G of the form p(wi | gi)
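A class-based bigram sketch; the hand-made word-to-class map and the toy corpus are assumptions standing in for classes produced by tagging or clustering:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
# Hypothetical partition of the vocabulary into classes.
word2class = {"the": "DET", "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "on": "PREP"}

classes = [word2class[w] for w in corpus]
class_unigrams = Counter(classes)
class_bigrams = Counter(zip(classes, classes[1:]))
word_given_class = Counter(zip(classes, corpus))

def p_class_bigram(w, prev):
    g, g_prev = word2class[w], word2class[prev]
    # p(w | prev) = p(g | g_prev) * p(w | g)
    p_g = class_bigrams[(g_prev, g)] / class_unigrams[g_prev]
    p_w = word_given_class[(g, w)] / class_unigrams[g]
    return p_g * p_w

# "the dog" and "the mat" share statistics through the class NOUN,
# so rare words borrow strength from frequent words of the same class.
print(p_class_bigram("dog", "the"), p_class_bigram("mat", "the"))
```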
Defining Classes • Manually • Use part-of-speech labels assigned by linguistic experts or a tagger • Use stem information • Automatically • Cluster words as part of an optimization procedure, e.g. maximize the log-likelihood of a test text
Agglomerative Clustering • Bottom-up clustering • Start with a separate cluster for each word • Merge the pair of clusters for which the loss in average mutual information (MI) is least
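A brute-force sketch of this greedy procedure, assuming a toy corpus, a target of 4 classes, and average MI measured over adjacent class pairs; efficient implementations (e.g. Brown clustering) use incremental update formulas instead of recomputing MI for every candidate merge:

```python
from collections import Counter
from itertools import combinations
import math

corpus = "the cat sat on the mat the dog sat on a mat a dog ran".split()

def avg_mi(cluster_of):
    # Average mutual information between adjacent classes:
    #   sum_{g1,g2} p(g1,g2) * log2( p(g1,g2) / (p(g1) p(g2)) )
    seq = [cluster_of[w] for w in corpus]
    uni = Counter(seq)
    bi = Counter(zip(seq, seq[1:]))
    n_uni, n_bi = len(seq), len(seq) - 1
    return sum(c / n_bi * math.log2((c / n_bi) /
               ((uni[a] / n_uni) * (uni[b] / n_uni)))
               for (a, b), c in bi.items())

# Start with a separate cluster for each word type.
cluster_of = {w: w for w in set(corpus)}

# Greedily merge the pair whose merge loses the least average MI.
while len(set(cluster_of.values())) > 4:
    best = None
    for c1, c2 in combinations(set(cluster_of.values()), 2):
        trial = {w: (c1 if c == c2 else c) for w, c in cluster_of.items()}
        mi = avg_mi(trial)
        if best is None or mi > best[0]:
            best = (mi, trial)
    cluster_of = best[1]

print(cluster_of)
```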
Example • Syntactic classes • verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν • nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα • adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος • Semantic classes • last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης • countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία • numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο • Some not so well-defined classes • ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε • εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος
Stem-based Classes • άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα, • βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν • εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται • εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές • ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά • καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών • μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά
Example • Interpolate class-based and word-based models: P(wi | wi-1) = λ P_word(wi | wi-1) + (1 - λ) P_class(wi | wi-1)
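A self-contained sketch of that combination; the toy corpus, the hypothetical word classes and λ = 0.6 are illustrative assumptions:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog ran on the mat".split()
# Hypothetical word classes (e.g. from POS tags or automatic clustering).
word2class = {"the": "DET", "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "ran": "VERB", "on": "PREP"}
classes = [word2class[w] for w in corpus]

uni_w, bi_w = Counter(corpus), Counter(zip(corpus, corpus[1:]))
uni_g, bi_g = Counter(classes), Counter(zip(classes, classes[1:]))
w_given_g = Counter(zip(classes, corpus))

def p_word(w, prev):
    return bi_w[(prev, w)] / uni_w[prev]

def p_class(w, prev):
    g, gp = word2class[w], word2class[prev]
    return (bi_g[(gp, g)] / uni_g[gp]) * (w_given_g[(g, w)] / uni_g[g])

def p_interp(w, prev, lam=0.6):
    # lam is illustrative; it would normally be tuned on held-out data.
    return lam * p_word(w, prev) + (1 - lam) * p_class(w, prev)

# "cat ran" is unseen as a word bigram, but NOUN -> VERB is well attested,
# so the interpolated model still assigns it probability.
print(p_word("ran", "cat"), p_interp("ran", "cat"))
```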
Where do we go from here? • Use syntactic information to capture long-distance dependencies, e.g. in “The dog on the hill barked” the verb depends on “dog”, not on the adjacent “hill” • Constraints