Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies

Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies Jun Wu Advisor: Sanjeev Khudanpur Department of Computer Science Johns Hopkins University Baltimore, MD 21218 April, 2001 NSF STIMULATE Grant No. IRI-9618874 Center for Language and Speech Processing, The Johns Hopkins University.

Outline
• Language modeling in speech recognition
• The maximum entropy (ME) principle
• Semantic (topic) dependencies in natural language
• Syntactic dependencies in natural language
• ME models with topic and syntactic dependencies
• Conclusion and future work
• Topic assignment during test (15 min)
• Role of syntactic head (15 min)
• Training ME models in an efficient way (1 hour)

Motivation
Example:
• A research team led by two Johns Hopkins scientists ___ found the strongest evidence yet that a virus may …
  • have
  • has
  • his

Language Models in Speech Recognition
• Role of language models

Language Modeling in Speech Recognition
• N-gram models: $P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$.
• In practice, N = 1, 2, 3, or 4. Even these values of N pose a data sparseness problem. For a vocabulary of size $|V|$, a trigram model has on the order of $|V|^3$ free parameters. There are millions of unseen bigrams and billions of unseen trigrams for which we need an estimate of the probability $P(w_i \mid w_{i-2}, w_{i-1})$.
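The size of the parameter space can be checked with a few lines of arithmetic. The sketch below counts the free parameters of an unsmoothed N-gram model; the function name is hypothetical, and the vocabulary size of 22,000 is borrowed from the Switchboard setup slide purely as an example.

```python
def free_parameters(vocab_size: int, n: int) -> int:
    """Free parameters of an unsmoothed n-gram model.

    Each of the vocab_size ** (n - 1) histories carries a distribution over
    vocab_size words, of which vocab_size - 1 probabilities are free.
    """
    return vocab_size ** (n - 1) * (vocab_size - 1)


# 22,000 is an example value (the closed Switchboard vocabulary).
for order in (1, 2, 3):
    print(order, free_parameters(22_000, order))
```

Even for this modest vocabulary the trigram count is on the order of 10^13, far more than any training corpus can cover.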

Smoothing Techniques
• Relative frequency estimates: $f(w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]}$
• Deleted interpolation: Jelinek et al. 1980.
• Back-off: Katz 1987, Witten-Bell 1990, Ney et al. 1994.
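As a rough illustration of the first two bullets, the sketch below computes relative frequency estimates from counts and linearly interpolates the trigram, bigram and unigram estimates. The toy corpus, the function names and the fixed interpolation weights are assumptions for illustration only; in deleted interpolation the weights would be estimated on held-out data rather than fixed by hand.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rel_freq(count, history_count):
    """Relative frequency estimate; 0.0 when the history was never seen."""
    return count / history_count if history_count > 0 else 0.0


def interpolated_prob(u, v, w, c1, c2, c3, lambdas=(0.2, 0.3, 0.5)):
    """Linear interpolation of unigram, bigram and trigram relative frequencies.

    History counts are read from the lower-order n-gram tables, which is a
    slight approximation at the very end of the corpus.
    """
    l1, l2, l3 = lambdas  # hypothetical fixed weights that sum to one
    total = sum(c1.values())
    p_uni = rel_freq(c1[(w,)], total)
    p_bi = rel_freq(c2[(v, w)], c1[(v,)])
    p_tri = rel_freq(c3[(u, v, w)], c2[(u, v)])
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


tokens = "the contract ended with a loss the contract ended with a gain".split()
c1, c2, c3 = (ngram_counts(tokens, n) for n in (1, 2, 3))
p_seen = interpolated_prob("with", "a", "loss", c1, c2, c3)
p_unseen = interpolated_prob("with", "a", "contract", c1, c2, c3)
print(p_seen, p_unseen)
```

The unseen trigram "with a contract" still receives a non-zero probability through the unigram term, which is the purpose of smoothing.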

Measuring the Quality of Language Models
• Word Error Rate:
  • Reference: The contract ended with a loss of *** seven cents.
  • Hypothesis: A contract ended with * loss of some even cents.
  • Scores: S C C C D C C I S C (S = substitution, D = deletion, I = insertion, C = correct)
• Perplexity: $H_{P_L}(P) = -\sum_{W} P(W) \log_2 P_L(W)$, $\mathrm{PPL} = 2^{H_{P_L}(P)}$
• Perplexity measures the average number of words that can follow a given history under a language model.
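Both measures are easy to compute on the slide's own example. The sketch below obtains the word error count with a standard Levenshtein alignment and the perplexity from per-word model probabilities; the function names are hypothetical, and the perplexity helper assumes the probabilities assigned by the model to each test word are already available.

```python
import math


def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


def perplexity(word_probs):
    """2 ** (average negative log2 probability per word)."""
    entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** entropy


ref = "the contract ended with a loss of seven cents"
hyp = "a contract ended with loss of some even cents"
print(word_errors(ref, hyp), wer(ref, hyp))
```

For the example pair the alignment yields two substitutions, one deletion and one insertion, i.e. 4 errors over 9 reference words.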


Experimental Setup for Switchboard
• American English conversations over the telephone.
• Vocabulary: 22K (closed).
• LM training set: 1100 conversations, 2.1M words.
• Test set: WS97 dev-test set.
  • 19 conversations (2 hours), 18K words.
  • PPL = 79 (back-off trigram model).
  • State-of-the-art systems: 30-35% WER.
• Evaluation: 100-best list rescoring.
[Figure: Speech → Speech Recognizer (Baseline LM) → 100 Best Hyp → Rescoring (New LM) → 1 hypothesis]
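N-best rescoring can be sketched in a few lines: each first-pass hypothesis keeps its acoustic score, the new language model supplies a fresh log-probability, and the hypothesis with the best combined score is returned. The function names, the language model weight and the toy scoring function below are all assumptions for illustration, not the settings used in these experiments.

```python
def rescore(nbest, lm_logprob, lm_weight=12.0):
    """Return the word sequence with the highest combined score.

    nbest: list of (words, acoustic_logprob) pairs from the first pass.
    lm_logprob: function mapping a word list to a log-probability.
    lm_weight: hypothetical language model scaling factor.
    """
    def total(item):
        words, acoustic_logprob = item
        return acoustic_logprob + lm_weight * lm_logprob(words)

    return max(nbest, key=total)[0]


def toy_lm_logprob(words):
    """Hypothetical stand-in for a real LM: -1 per word, extra penalty for 'with loss'."""
    penalty = 3.0 if ("with", "loss") in zip(words, words[1:]) else 0.0
    return -float(len(words)) - penalty


nbest = [
    ("a contract ended with loss".split(), -100.0),
    ("the contract ended with a loss".split(), -101.0),
]
best = rescore(nbest, toy_lm_logprob)
print(" ".join(best))
```

With the language model switched off (weight 0) the acoustically best hypothesis wins; with it switched on, the second hypothesis is preferred.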

Experimental Setup for Broadcast News
• American English television broadcast.
• Vocabulary: open (>100K).
• LM training set: 125K stories, 130M words.
• Test set: Hub-4 96 dev-test set.
  • 21K words.
  • PPL = 174 (back-off trigram model).
  • State-of-the-art systems: 25% WER.
• The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

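The Maximum Entropy Principle
• The maximum entropy (ME) principle: when we make inferences based on incomplete information, we should choose the probability distribution that has the maximum entropy permitted by the information we do have.
• Example (dice): Let $p_i$, $i = 1, \dots, 6$, be the probability that the facet with $i$ dots faces up. Seek the model $P = (p_1, \dots, p_6)$ that maximizes $H(P) = -\sum_{i} p_i \log p_i$ subject to $\sum_{i} p_i = 1$. From the Lagrangian $L(P, \alpha) = -\sum_{i} p_i \log p_i + \alpha \left(\sum_{i} p_i - 1\right)$, setting $\partial L / \partial p_i = -\log p_i - 1 + \alpha = 0$ gives $p_i = e^{\alpha - 1}$. So all $p_i$ are equal; choose $\alpha$ so that they sum to one: $p_i = 1/6$.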

The Maximum Entropy Principle (Cont.)
• Example 2: Seek probability distribution with constraints. ( is the empirical distribution.) The feature: Empirical expectation: Maximize subject to So
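A small numerical sketch shows how a constraint shapes the maximum entropy solution. It assumes a hypothetical constraint, a prescribed mean number of dots on a die, which is not necessarily the constraint used in Example 2. Under that assumption the solution has the exponential form $p_i \propto e^{\lambda i}$, and the single multiplier $\lambda$ can be found by bisection because the mean grows monotonically with $\lambda$.

```python
import math


def maxent_die(target_mean, faces=6, tol=1e-12):
    """Maximum entropy distribution on 1..faces with a prescribed mean.

    target_mean must lie strictly between 1 and faces.
    """
    def dist(lam):
        weights = [math.exp(lam * i) for i in range(1, faces + 1)]
        z = sum(weights)  # normalization constant
        return [w / z for w in weights]

    def mean(lam):
        return sum(i * p for i, p in enumerate(dist(lam), start=1))

    lo, hi = -50.0, 50.0  # mean(lam) increases with lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


uniform = maxent_die(3.5)  # no information beyond the fair-die mean
skewed = maxent_die(4.5)   # a higher mean tilts mass toward large faces
print(uniform)
print(skewed)
```

With the mean of a fair die (3.5) the constraint carries no extra information and the result is the uniform distribution, matching the dice example.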

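Maximum Entropy Language Modeling
• Use the short-hand notation $u = w_{i-2}$, $v = w_{i-1}$, $w = w_i$.
• For words u, v, w, define a collection of binary features, one per unigram, bigram and trigram: $f_{w}$, $f_{v,w}$ and $f_{u,v,w}$, each equal to 1 when the named words occur in the corresponding positions and 0 otherwise.
• Obtain their target expectations from the training data.
• Find $P^* = \arg\max_{P} H(P)$ subject to the model expectation of every feature matching its target.
• It can be shown that $P^*(w \mid u, v) = \frac{e^{\lambda(w)} \cdot e^{\lambda(v, w)} \cdot e^{\lambda(u, v, w)}}{Z(u, v)}$.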
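To make the model form concrete, the sketch below evaluates a conditional probability from a table of feature weights. It assumes the product-of-exponentials form with unigram, bigram and trigram weights (the same form the topic model slide uses, without the topic factor); the weight values, the three-word vocabulary and the function name are hypothetical.

```python
import math


def me_trigram_prob(w, u, v, lam, vocab):
    """P(w | u, v) under an ME model with unigram, bigram and trigram features.

    lam maps feature keys (w,), (v, w), (u, v, w) to weights; features that
    are absent from the table contribute a weight of 0.
    """
    def score(x):
        return (lam.get((x,), 0.0)
                + lam.get((v, x), 0.0)
                + lam.get((u, v, x), 0.0))

    z = sum(math.exp(score(x)) for x in vocab)  # normalizer Z(u, v)
    return math.exp(score(w)) / z


vocab = ["have", "has", "his"]
lam = {  # hypothetical weights, for illustration only
    ("has",): 0.5,
    ("scientists", "has"): -1.0,
    ("hopkins", "scientists", "have"): 0.3,
}
probs = {w: me_trigram_prob(w, "hopkins", "scientists", lam, vocab) for w in vocab}
print(probs)
```

The normalizer has to be summed over the whole vocabulary for every history, which is where the training cost of ME models comes from.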

Advantages and Disadvantage of Maximum Entropy Language Modeling
• Advantages:
  • Creates a “smooth” model that satisfies all empirical constraints.
  • Incorporates various sources of information in a unified language model.
• Disadvantage:
  • Computational complexity of the model parameter estimation procedure.

Training an ME Model
• Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
• Della Pietra et al. 1996: Unigram Caching and Improved Iterative Scaling (IIS).
• Wu and Khudanpur 2000: Hierarchical Training Methods.
  • For N-gram models and many other models, the training time per iteration is bounded by a quantity of the same order as that of training a back-off model.
  • A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
  • See Wu and Khudanpur ICSLP 2000 for details.
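For orientation, here is a minimal sketch of plain GIS on a toy conditional model. It is not the hierarchical method of Wu and Khudanpur. It uses the step size 1/C, where C is the largest number of active features for any (history, word) pair, and omits the usual correction feature, which is a simplifying assumption. The toy events, the feature set and all names are hypothetical.

```python
import math
from collections import Counter


def train_gis(events, vocab, feature_fn, iterations=500):
    """Fit feature weights so that model feature counts match empirical counts.

    events: list of (history, word) pairs.
    feature_fn(history, word): list of active binary feature keys.
    """
    hist_counts = Counter(h for h, _ in events)
    empirical = Counter()
    for h, w in events:
        empirical.update(feature_fn(h, w))
    # C bounds the number of active features of any (history, word) pair.
    c = max(len(feature_fn(h, w)) for h in hist_counts for w in vocab)
    lam = {k: 0.0 for k in empirical}

    def cond_probs(h):
        scores = [math.exp(sum(lam.get(k, 0.0) for k in feature_fn(h, w)))
                  for w in vocab]
        z = sum(scores)
        return [s / z for s in scores]

    for _ in range(iterations):
        model = Counter()  # expected feature counts under the current model
        for h, count in hist_counts.items():
            for w, p in zip(vocab, cond_probs(h)):
                for k in feature_fn(h, w):
                    model[k] += count * p
        for k in lam:
            lam[k] += math.log(empirical[k] / model[k]) / c
    return lam, cond_probs


BIGRAM_FEATURES = {("a", "loss")}  # hypothetical feature selection


def feature_fn(h, w):
    feats = [("uni", w)]
    if (h, w) in BIGRAM_FEATURES:
        feats.append(("bi", h, w))
    return feats


vocab = ["loss", "gain", "contract"]
events = ([("a", "loss")] * 3 + [("a", "gain")]
          + [("the", "loss"), ("the", "gain")] + [("the", "contract")] * 2)
lam, cond_probs = train_gis(events, vocab, feature_fn)
print(dict(zip(vocab, cond_probs("a"))))
```

Every iteration loops over all histories and the full vocabulary to compute the expected counts; that inner loop is the cost the faster training methods listed above aim to reduce.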

Motivation for Exploiting Semantic and Syntactic Dependencies
• N-gram models only take local correlation between words into account.
• Several dependencies in natural language with longer and sentence-structure dependent spans may compensate for this deficiency.
• Need a model that exploits topic and syntax.
Example: Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange.


Training a Topic Sensitive Model
• Cluster the training data by topic.
  • TF-IDF vector (excluding stop words).
  • Cosine similarity.
  • K-means clustering.
• Select topic dependent words: $f_t(w) \cdot \log \frac{f_t(w)}{f(w)} > \text{threshold}$
• Estimate an ME model with topic unigram constraints:
  $P(w_i \mid w_{i-2}, w_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(\text{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, \text{topic})}$
  where $\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid \text{topic}) = \frac{\#[\text{topic}, w_i]}{\#[\text{topic}]}$
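The ingredients of the clustering and word selection steps can be sketched as follows. The code builds sparse TF-IDF vectors, compares them by cosine similarity, and applies the selection rule $f_t(w) \log (f_t(w) / f(w)) > \text{threshold}$, reading $f_t$ and $f$ as relative frequencies within the topic and in the whole corpus. The K-means loop itself is omitted; the toy documents, the threshold value and all function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, stop_words=frozenset()):
    """docs: list of token lists. Returns one sparse TF-IDF dict per document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc) if w not in stop_words)
    vectors = []
    for doc in docs:
        tf = Counter(w for w in doc if w not in stop_words)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_words(topic_tokens, all_tokens, threshold):
    """Words with f_t(w) * log(f_t(w) / f(w)) > threshold.

    topic_tokens must be drawn from all_tokens so that f(w) > 0.
    """
    ft, f = Counter(topic_tokens), Counter(all_tokens)
    nt, n = len(topic_tokens), len(all_tokens)
    return {w for w, c in ft.items()
            if (c / nt) * math.log((c / nt) / (f[w] / n)) > threshold}


docs = [
    "stocks market futures exchange".split(),
    "market futures contract loss".split(),
    "game team score win".split(),
]
vecs = tfidf_vectors(docs)
finance_tokens = docs[0] + docs[1]
selected = topic_words(finance_tokens, finance_tokens + docs[2], threshold=0.08)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]), selected)
```

Only the words selected this way receive a topic-dependent feature $\lambda(\text{topic}, w_i)$, which keeps the number of additional parameters small.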

Recognition Using a Topic-Sensitive Model
• Detect the current topic from
  • Recognizer’s N-best hypotheses vs. reference transcriptions.
  • Using N-best hypotheses causes little degradation (in perplexity and WER).
• Assign a new topic for each
  • Conversation vs. utterance.
  • Topic assignment for each utterance is better than topic assignment for the whole conversation.
• See Khudanpur and Wu ICASSP’99 paper and Florian and Yarowsky ACL’99 for details.
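Topic assignment at test time can be sketched as a nearest-centroid decision: the words of the recognizer's hypotheses for an utterance (or a whole conversation) are turned into a vector and compared with each topic centroid. The centroid values, the use of raw word counts instead of TF-IDF weights, and the fallback to no topic when nothing matches are assumptions made for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_topic(hyp_words, centroids, min_similarity=0.0):
    """Pick the topic whose centroid is closest to the hypothesized words.

    Returns None when no centroid exceeds min_similarity, in which case a
    topic-independent model could be used instead (an assumption).
    """
    vec = {}
    for w in hyp_words:
        vec[w] = vec.get(w, 0.0) + 1.0
    best = max(centroids, key=lambda t: cosine(vec, centroids[t]))
    return best if cosine(vec, centroids[best]) > min_similarity else None


centroids = {  # hypothetical topic centroids
    "finance": {"market": 1.0, "futures": 1.0, "contract": 0.5},
    "sports": {"game": 1.0, "team": 1.0},
}
print(assign_topic("the futures contract ended".split(), centroids))
print(assign_topic("uh huh yeah".split(), centroids))
```

Calling the function once per utterance rather than once per conversation corresponds to the finer-grained assignment the slide reports as better.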

Performance of the Topic Model
• The ME model with only N-gram constraints duplicates the performance of the corresponding back-off model.
• The topic-dependent ME model reduces the perplexity by 7% and WER by 0.7% absolute.

Content Words vs. Stop Words
• 1/5 of tokens in the test data are content-bearing words.
• The WER of the baseline trigram model is relatively high for content words.
• Topic dependencies are much more helpful in reducing WER of content words (1.4%) than they are for stop words (0.6%).

A Syntactic Parse and Syntactic Heads
[Figure: parse tree of the sentence "The contract ended with a loss of 7 cents after …" with POS tags DT NN VBD IN DT NN IN CD NNS; each constituent is labeled with its head word (contract/NP, loss/NP, cents/NP, of/PP, with/PP, ended/VP, ended/S’).]

Exploiting Syntactic Dependencies
• All sentences in the training set are parsed by a left-to-right parser.
• A stack of parse trees $T_i$ for each sentence prefix is generated.
[Figure: partial parse of "The contract ended with a loss of 7 cents after" (POS tags DT NN VBD IN DT NN IN CD NNS), marking the predicted word $w_i$, the two preceding words $w_{i-2}$, $w_{i-1}$, and the two preceding exposed heads $h_{i-2}$ = contract ($nt_{i-2}$ = NP) and $h_{i-1}$ = ended ($nt_{i-1}$ = VP).]


Exploiting Syntactic Dependencies (Cont.)
• A probability is assigned to each word as:
  $P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_1^{i-1}, T_i) \cdot \rho(T_i \mid W_1^{i-1})$
  $= \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) \cdot \rho(T_i \mid W_1^{i-1})$
• It is assumed that most of the useful information is embedded in the 2 preceding words and 2 preceding heads.
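The sum over the parse stack is a weighted average of per-parse predictions. In the sketch below each partial parse contributes its context (the two preceding words, heads and non-terminal labels) and a score; the scores are normalized over the stack to play the role of $\rho(T_i \mid W_1^{i-1})$. The contexts, scores and the lookup-table model are hypothetical placeholders for a real parser and ME model.

```python
def word_prob(word, parses, cond_prob):
    """Mixture of per-parse predictions for the next word.

    parses: list of (context, score) pairs, one per partial parse in the
        stack; scores are normalized here so that the weights sum to one.
    cond_prob(word, context): conditional word probability given a context.
    """
    total = sum(score for _, score in parses)
    return sum(score / total * cond_prob(word, context)
               for context, score in parses)


# Contexts are (w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}).
ctx_a = ("7", "cents", "contract", "ended", "NP", "VP")
ctx_b = ("7", "cents", "loss", "of", "NP", "PP")
table = {("after", ctx_a): 0.4, ("after", ctx_b): 0.1}  # hypothetical values


def toy_cond_prob(word, context):
    return table.get((word, context), 0.0)


p_after = word_prob("after", [(ctx_a, 0.6), (ctx_b, 0.2)], toy_cond_prob)
print(p_after)
```

Because the words $w_{i-2}$, $w_{i-1}$ are the same in every parse, only the head and non-terminal parts of the context actually vary across the stack.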

Training a Syntactic ME Model
• Estimate an ME model with syntactic constraints:
  $P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})}$
  where
  $\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]}$
  $\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]}$
  $\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}$
• See Chelba and Jelinek ACL’98 and Wu and Khudanpur ICASSP’00 for details.

Experimental Results of Syntactic LMs
• Non-terminal constraints and syntactic constraints together reduce the perplexity by 6.3% and WER by 1.0% absolute compared to those of trigrams.

Head Words inside vs. outside 3gram Range
[Figure: two partial parses of "The contract ended with a loss of 7 cents after". In the first, the exposed heads $h_{i-2}$ = contract (NP) and $h_{i-1}$ = ended (VP) lie outside the trigram window $w_{i-2}$, $w_{i-1}$ of the predicted word $w_i$. In the second, for the prefix "The contract ended with a loss", the heads fall inside the trigram window.]

Syntactic Heads inside vs. outside Trigram Range
• 1/4 of syntactic heads are outside trigram range.
• The WER of the baseline trigram model is relatively high when syntactic heads are beyond trigram range.
• Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.4%) than when they are within trigram range.

Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
• Probabilities are assigned as:
  $P(w_i \mid W_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic}) \cdot \rho(T_i \mid W_1^{i-1})$
• The ME composite model is trained:
  $P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} \cdot e^{\lambda(\text{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic})}$
• Only marginal constraints are necessary.

Overall Experimental Results
• Baseline trigram WER is 38.5%.
• Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
• Syntactic heads result in a 6% reduction in perplexity and a 1.0% absolute reduction in WER.
• Topic-dependent constraints and syntactic constraints together reduce the perplexity by 13% and WER by 1.5% absolute. The gains from topic and syntactic dependencies are nearly additive.

Content Words vs. Stop Words
• The topic sensitive model reduces WER by 1.4% on content words, which is twice as much as the overall improvement (0.7%).
• The syntactic model improves WER on both content words and stop words evenly.
• The composite model has the advantage of both models and reduces WER on content words more significantly (2.1%).

Head Words inside vs. outside 3gram Range
• The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
• The topic model helps when the trigram is inappropriate.
• The WER reduction for the syntactic model (1.4%) is more than the overall reduction (1.0%) when head words are outside trigram range.
• The WER reduction for the composite model (2.2%) is more than the overall reduction (1.5%) when head words are inside trigram range.

Nominal Speed-up
• The hierarchical training methods achieve a nominal speed-up of
  • two orders of magnitude for Switchboard, and
  • three orders of magnitude for Broadcast News.

Real Speed-up
• The real speed-up is 15- to 30-fold for the Switchboard task:
  • 30 for the trigram model.
  • 25 for the topic model.
  • 15 for the composite model.
• This simplification of the training procedure makes it possible to implement ME models for large corpora.
  • 40 minutes for the trigram model,
  • 2.3 hours for the topic model.

More Experimental Results: Topic Dependent Models for Broadcast News
• ME models are created for the Broadcast News corpus (130M words).
• The topic dependent model reduces the perplexity by 10% and WER by 0.6% absolute.
• The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

Concluding Remarks
• Non-local and syntactic dependencies have been successfully integrated with N-grams. Their benefits have been demonstrated in the speech recognition application.
  • Switchboard: 13% reduction in PPL, 1.5% (absolute) in WER. (Eurospeech99 best student paper award.)
  • Broadcast News: 10% reduction in PPL, 0.6% in WER. (Topic constraints only; syntactic constraints in progress.)
• The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models.
  • Nominal speedup: 100-1000 fold.
  • “Real” speedup: 15+ fold.
• A general-purpose toolkit for ME models is being developed for public release.

Acknowledgement
• I thank my advisor Sanjeev Khudanpur, who led me to this field and has always given me wise advice and help when needed, and David Yarowsky, who gave generous help during my Ph.D. program.
• I thank Radu Florian and David Yarowsky for their help with topic detection and data clustering, Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) for the SWBD experimental results reported here, and Shankar Kumar and Vlasios Doumpiotis for their help in generating N-best lists for the BN experiments.
• I thank everyone in the NLP lab and CLSP for their assistance with my thesis work.
• This work is supported by the National Science Foundation through a STIMULATE grant (IRI-9618874).
