
Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies

Jun Wu. Advisor: Sanjeev Khudanpur. Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218. May 2001.


Presentation Transcript


1. Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies
Jun Wu
Advisor: Sanjeev Khudanpur
Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218
May 2001
NSF STIMULATE Grant No. IRI-9618874

2. Outline
• Motivation
• Semantic (topic) dependencies in natural language
• Syntactic dependencies in natural language
• ME models with topic and syntactic dependencies
• Training ME models in an efficient way (1 hour)
• Conclusion and future work

3. Outline
• Motivation
• Semantic (topic) dependencies in natural language
• Syntactic dependencies in natural language
• ME models with topic and syntactic dependencies
• Training ME models in an efficient way (5 mins)
• Conclusion and future work

4. Exploiting Semantic and Syntactic Dependencies
• N-gram models only take local correlation between words into account.
• Several dependencies in natural language with longer and sentence-structure dependent spans may compensate for this deficiency.
• Need a model that exploits topic and syntax.
Example: Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange.


7. Maximum Entropy Language Modeling
• Model: P(w_i | w_{i-1}, w_{i-2}, …) = exp( Σ_j λ_j f_j(w_i, w_{i-1}, w_{i-2}, …) ) / Z(w_{i-1}, w_{i-2}, …), where Z normalizes over the vocabulary.
• For each source of dependencies, define a collection of binary features f_j.
• Obtain their target expectations from the training data.
• Find the model with maximum entropy among all models that satisfy these empirical constraints.
• It can be shown that the maximum entropy model has the exponential form above and is also the maximum likelihood model within that exponential family.
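As a rough illustration of the exponential form above, the sketch below computes P(w | history, topic) from a bag of binary features and their weights. The vocabulary, feature names and weight values are hypothetical, not the feature set used in the thesis.

```python
import math
from collections import defaultdict

# Hypothetical ME model: each active binary feature contributes its weight
# lambda_j to the score of word w; Z(history, topic) sums over the vocabulary.
VOCAB = ["the", "contract", "ended", "loss", "exchange"]

lambdas = defaultdict(float)           # missing features have weight 0
lambdas[("unigram", "contract")] = 0.4
lambdas[("bigram", "the", "contract")] = 1.1
lambdas[("topic", "FINANCE", "contract")] = 0.9

def active_features(history, topic, w):
    """Binary features that fire for this (history, topic, word) triple."""
    feats = [("unigram", w), ("topic", topic, w)]
    if len(history) >= 1:
        feats.append(("bigram", history[-1], w))
    if len(history) >= 2:
        feats.append(("trigram", history[-2], history[-1], w))
    return feats

def me_prob(history, topic, w):
    """P(w | history, topic) = exp(sum of active lambdas) / Z(history, topic)."""
    def score(word):
        return math.exp(sum(lambdas[f] for f in active_features(history, topic, word)))
    z = sum(score(v) for v in VOCAB)
    return score(w) / z

print(me_prob(["the"], "FINANCE", "contract"))
```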

8. Advantages and Disadvantage of Maximum Entropy Language Modeling
• Advantages:
  • Creating a “smooth” model that satisfies all empirical constraints.
  • Incorporating various sources of information in a unified language model.
• Disadvantage:
  • Computational complexity of the model parameter estimation procedure. (solved!)

9. Training a Topic-Sensitive Model
• Cluster the training data by topic:
  • TF-IDF vector (excluding stop words),
  • cosine similarity,
  • K-means clustering.
• Select topic-dependent words w: f_t(w) · log( f_t(w) / f(w) ) > threshold.
• Estimate an ME model with topic unigram constraints:

P(w_i | w_{i-1}, w_{i-2}, topic) = exp( λ(w_i) + λ(w_{i-1}, w_i) + λ(w_{i-2}, w_{i-1}, w_i) + λ(topic, w_i) ) / Z(w_{i-2}, w_{i-1}, topic),

where the topic unigram constraints are

Σ_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i | topic) = #[topic, w_i] / #[topic].
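A minimal sketch of the topic-word selection step above, assuming the training conversations have already been clustered (the clustering itself would use TF-IDF vectors, cosine similarity and K-means as listed). The toy corpus, stop-word list and threshold value are illustrative only.

```python
import math
from collections import Counter

# Hypothetical clustered corpus: topic id -> list of tokenized documents.
clusters = {
    0: [["the", "market", "futures", "exchange", "fell"]],
    1: [["the", "game", "was", "played", "at", "home"]],
}
STOP_WORDS = {"the", "was", "at"}
THRESHOLD = 0.001  # illustrative value, not the one used in the thesis

# Global relative frequency f(w), excluding stop words.
global_counts = Counter(
    w for docs in clusters.values() for d in docs for w in d if w not in STOP_WORDS
)
total = sum(global_counts.values())

topic_words = {}
for t, docs in clusters.items():
    counts = Counter(w for d in docs for w in d if w not in STOP_WORDS)
    n_t = sum(counts.values())
    selected = []
    for w, c in counts.items():
        f_t = c / n_t
        f = global_counts[w] / total
        # Keep w as topic-dependent if f_t(w) * log(f_t(w) / f(w)) > threshold.
        if f_t * math.log(f_t / f) > THRESHOLD:
            selected.append(w)
    topic_words[t] = selected

print(topic_words)
```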


12. Recognition Using a Topic-Sensitive Model
• Detect the current topic from:
  • recognizer’s N-best hypotheses vs. reference transcriptions;
  • using N-best hypotheses causes little degradation (in perplexity and WER).
• Assign a new topic for each:
  • conversation vs. utterance;
  • topic assignment for each utterance is better than topic assignment for the whole conversation.
• See Khudanpur and Wu (ICASSP’99) and Florian and Yarowsky (ACL’99) for details.
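For the topic-detection step, a sketch along these lines could assign a topic to each utterance by comparing a vector built from the recognizer's N-best hypotheses against the K-means topic centroids. The centroids, hypotheses and stop-word list below are placeholders, not the actual Switchboard data.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_topic(nbest_hypotheses, centroids, stop_words):
    """Pick the topic whose centroid is closest (cosine) to the N-best word counts."""
    counts = Counter(
        w for hyp in nbest_hypotheses for w in hyp.split() if w not in stop_words
    )
    return max(centroids, key=lambda t: cosine(counts, centroids[t]))

# Placeholder centroids (topic id -> term-weight dict) and N-best list.
centroids = {"FINANCE": {"contract": 2.0, "exchange": 1.5}, "SPORTS": {"game": 2.0}}
nbest = ["the contract ended with a loss", "the contract ended with a las"]
print(assign_topic(nbest, centroids, {"the", "with", "a"}))
```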


15. Experimental Setup for Switchboard
• The experiments are based on the WS97 dev test set.
• Vocabulary: 22K (closed).
• LM training set: 1,100 conversations, 2.1M words.
• AM training set: 60 hours of speech data.
• Acoustic model: state-clustered cross-word triphone model.
• Front end: 13 MF-PLP + Δ + ΔΔ, per-conversation-side CMS.
• Test set: 19 conversations (2 hours), 18K words.
• No speaker adaptation.
• The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

16. Topic Assignment During Testing: Reference Transcriptions vs. Hypotheses
• Even with a WER of over 38%, there is only a small loss in perplexity and a negligible loss in WER when the topic assignment is based on recognizer hypotheses instead of the correct transcriptions.
• Comparisons with the oracle indicate that there is little room for further improvement.


20. ME Method vs. Interpolation
• The ME model with only topic-dependent unigram constraints outperforms the interpolated topic-dependent trigram model.
• The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

21. Topic Model vs. Cache-Based Model
• The cache-based model reduces the perplexity, but increases the WER.
• The cache-based model introduces (0.6%) more repeated errors than the trigram model does.
• The cache model may not be practical when the baseline WER is high.


23. Summary of Topic-Dependent Language Modeling
• We significantly reduce both the perplexity (7%) and WER (0.7% absolute) by incorporating a small number of topic constraints with N-grams using the ME method.
• Using N-best hypotheses causes little degradation (in perplexity and WER).
• Topic assignment at the utterance level is better than at the conversation level.
• The ME method is more effective than linear interpolation in combining topic dependencies with N-grams.
• The topic-dependent model is better than the cache-based model in reducing WER when the baseline is poor.

24. A Syntactic Parse and Syntactic Heads
[Figure: syntactic parse of “The contract ended with a loss of 7 cents after …” with POS tags (DT NN VBD IN DT NN IN CD NNS …); head words are percolated up the tree, e.g. “contract” heads the NP, “loss” heads the object NP, and “ended” heads the VP and S’.]

25. Exploiting Syntactic Dependencies
[Figure: partial parse of the prefix “The contract ended with a loss of 7 cents after”; the two exposed heads preceding the predicted word are h_{i-2} = contract (nt_{i-2} = NP) and h_{i-1} = ended (nt_{i-1} = VP), alongside the two preceding words w_{i-2}, w_{i-1}.]
• All sentences in the training set are parsed by a left-to-right parser.
• A stack of parse trees T_i for each sentence prefix is generated.
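To make the notion of exposed heads concrete, here is a small sketch that percolates head words up a bracketed parse using simple head rules. The head rules and the toy tree are illustrative assumptions, not the parser or head table actually used in the thesis.

```python
# Each node is (label, children) for constituents or (tag, word) for leaves.
# Hypothetical head rules: which child label supplies the head of a constituent.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNS"], "PP": ["IN"]}

def head_word(node):
    label, rest = node
    if isinstance(rest, str):          # leaf: (POS tag, word)
        return rest
    children = rest
    for wanted in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == wanted:
                return head_word(child)
    return head_word(children[-1])     # fall back to the rightmost child

# Partial parse of "The contract ended": the exposed constituents are an NP and a VP.
np = ("NP", [("DT", "The"), ("NN", "contract")])
vp = ("VP", [("VBD", "ended")])
exposed = [np, vp]
h_prev2, h_prev1 = head_word(exposed[-2]), head_word(exposed[-1])
nt_prev2, nt_prev1 = exposed[-2][0], exposed[-1][0]
print(h_prev2, h_prev1, nt_prev2, nt_prev1)   # contract ended NP VP
```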

26. Exploiting Syntactic Dependencies (Cont.)
• A probability is assigned to each word as:

P(w_i | W_{i-1}) = Σ_{T_i ∈ S_i} P(w_i | W_{i-1}, T_i) · ρ(T_i | W_{i-1})
               = Σ_{T_i ∈ S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) · ρ(T_i | W_{i-1})

• It is assumed that most of the useful information is embedded in the two preceding words and the two preceding heads.
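A sketch of the word-probability computation above, assuming we already have, for the current prefix, a stack of partial parses with weights ρ(T_i | W_{i-1}) and a conditional model over the reduced context; both are stubbed out here with placeholder values.

```python
# Each stack entry: (rho, context), where context = (w2, w1, h2, h1, nt2, nt1)
# and rho is the normalized weight rho(T_i | W_{i-1}) of that partial parse.

def p_word_given_context(word, context):
    """Stub for P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})."""
    return 0.1  # placeholder probability

def p_next_word(word, parse_stack):
    """P(w_i | W_{i-1}) = sum over parses of P(w_i | reduced context) * rho(T_i | W_{i-1})."""
    return sum(rho * p_word_given_context(word, ctx) for rho, ctx in parse_stack)

stack = [
    (0.7, ("with", "a", "contract", "ended", "NP", "VP")),
    (0.3, ("with", "a", "ended", "a", "VP", "DT")),
]
print(p_next_word("loss", stack))
```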


28. Training a Syntactic ME Model
• Estimate an ME model with syntactic constraints:

P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = exp( λ(w_i) + λ(w_{i-1}, w_i) + λ(w_{i-2}, w_{i-1}, w_i) + λ(h_{i-1}, w_i) + λ(h_{i-2}, h_{i-1}, w_i) + λ(nt_{i-1}, w_i) + λ(nt_{i-2}, nt_{i-1}, w_i) ) / Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}),

where the marginal constraints are

Σ_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i | w_{i-2}, w_{i-1}) = #[w_{i-2}, w_{i-1}, w_i] / #[w_{i-2}, w_{i-1}]
Σ_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i | h_{i-2}, h_{i-1}) = #[h_{i-2}, h_{i-1}, w_i] / #[h_{i-2}, h_{i-1}]
Σ_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i | nt_{i-2}, nt_{i-1}) = #[nt_{i-2}, nt_{i-1}, w_i] / #[nt_{i-2}, nt_{i-1}]

• See Chelba and Jelinek (ACL’98) and Wu and Khudanpur (ICASSP’00) for details.
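The target expectations on the right-hand sides above are relative frequencies over the parsed training data. A sketch of accumulating them, with a toy parsed corpus in place of the real one, might look like this.

```python
from collections import Counter

# Toy "parsed" training data: per position, (w2, w1, h2, h1, nt2, nt1, w).
events = [
    ("the", "contract", "the", "contract", "DT", "NP", "ended"),
    ("with", "a", "contract", "ended", "NP", "VP", "loss"),
]

word_hist, head_hist, nt_hist = Counter(), Counter(), Counter()
word_marg, head_marg, nt_marg = Counter(), Counter(), Counter()

for w2, w1, h2, h1, nt2, nt1, w in events:
    word_hist[(w2, w1)] += 1
    word_marg[(w2, w1, w)] += 1
    head_hist[(h2, h1)] += 1
    head_marg[(h2, h1, w)] += 1
    nt_hist[(nt2, nt1)] += 1
    nt_marg[(nt2, nt1, w)] += 1

# Target expectation for one head-word trigram constraint:
# #[h_{i-2}, h_{i-1}, w_i] / #[h_{i-2}, h_{i-1}]
target = head_marg[("contract", "ended", "loss")] / head_hist[("contract", "ended")]
print(target)
```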

29. Experimental Results of Syntactic LMs
• Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute.
• Head word N-gram constraints result in a 6% reduction in perplexity and 0.8% absolute in WER.
• Non-terminal and head word constraints together reduce the perplexity by 6.3% and WER by 1.0% absolute.


32. ME vs. Interpolation
• The ME model is more effective in using syntactic dependencies than the interpolation model.

33. Head Words inside vs. outside 3gram Range
[Figure: partial parses of “The contract ended with a loss …” illustrating positions where the exposed head words (h_{i-2}, h_{i-1}) coincide with the two preceding words, and positions such as predicting “loss” after “with a”, where the heads “contract” and “ended” lie outside the trigram window.]

34. Syntactic Heads inside vs. outside Trigram Range
• The WER of the baseline trigram model is relatively high when syntactic heads are beyond trigram range.
• Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.5%) than when they are within trigram range.
• However, non-terminal N-gram constraints help almost evenly in both cases.
  • Can this gain be obtained from a POS class model too?
• The WER reduction for the model with both head word and non-terminal constraints (1.4%) is more than the overall reduction (1.0%) when head words are beyond trigram range.


37. Contrasting the Smoothing Effect of the NT Class LM vs. the POS Class LM
• An ME model with part-of-speech (POS) N-gram constraints is built as:

P(w_i | w_{i-1}, w_{i-2}, pos_{i-1}, pos_{i-2}) = exp( λ(w_i) + λ(w_{i-1}, w_i) + λ(w_{i-2}, w_{i-1}, w_i) + λ(pos_{i-1}, w_i) + λ(pos_{i-2}, pos_{i-1}, w_i) ) / Z(w_{i-1}, w_{i-2}, pos_{i-1}, pos_{i-2})

• The POS model reduces PPL by 4% and WER by 0.5%.
• The overall gains from POS N-gram constraints are smaller than those from NT N-gram constraints.
• Syntactic analysis seems to perform better than just using the two previous word positions.

38. POS Class LM vs. NT Class LM
• When the syntactic heads are beyond trigram range, the trigram coverage in the test set is relatively low.
• The back-off effect of the POS N-gram constraints is effective in reducing WER in this case.
• NT N-gram constraints work in a similar manner. Overall, they are more effective, perhaps because they are linguistically more meaningful.
• Performance improves further when lexical head words are applied on top of the non-terminals.

39. Summary of Syntactic Language Modeling
• Syntactic heads in the language model are complementary to N-grams: the model improves significantly when the syntactic heads are beyond N-gram range.
• Head word constraints provide syntactic information; non-terminals mainly provide a smoothing effect.
• Non-terminals are linguistically more meaningful predictors than POS tags, and are therefore more effective in supplementing N-grams.
• The syntactic model reduces perplexity by 6.3% and WER by 1.0% (absolute).

40. Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
• Probabilities are assigned as:

P(w_i | W_{i-1}) = Σ_{T_i ∈ S_i} P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) · ρ(T_i | W_{i-1})

• The ME composite model is trained as:

P(w_i | w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) = exp( λ(w_i) + λ(w_{i-1}, w_i) + λ(w_{i-2}, w_{i-1}, w_i) + λ(h_{i-1}, w_i) + λ(h_{i-2}, h_{i-1}, w_i) + λ(nt_{i-1}, w_i) + λ(nt_{i-2}, nt_{i-1}, w_i) + λ(topic, w_i) ) / Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)

• Only marginal constraints are necessary.


42. Overall Experimental Results
• Baseline trigram WER is 38.5%.
• Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
• Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.
• Topic-dependent constraints and syntactic constraints together reduce the perplexity by 13% and WER by 1.5% absolute. The gains from topic and syntactic dependencies are nearly additive.


45. Content Words vs. Stop Words
• The topic-sensitive model reduces WER by 1.4% on content words, which is twice as much as the overall improvement (0.7%).
• The syntactic model improves WER more on stop words than on content words. Why?
  • Many content words do not have syntactic constraints.
• The composite model has the advantages of both models and reduces WER on content words more significantly (2.1%).

46. Head Words inside vs. outside 3gram Range
• The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
• The topic model helps when the trigram is inappropriate.
• The WER reduction for the syntactic model (1.4%) is more than the overall reduction (1.0%) when head words are outside trigram range.
• The WER reduction for the composite model (2.2%) is more than the overall reduction (1.5%) when head words are inside trigram range.

47. Training an ME Model
• Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
• Della Pietra et al. 1996: unigram caching and Improved Iterative Scaling (IIS).
• Wu and Khudanpur 2000: hierarchical training methods.
  • For N-gram models and many other models, the training time per iteration is strictly bounded by a quantity of the same order as that of training a back-off model.
  • A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
• See Wu and Khudanpur (ICSLP 2000) for details.
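For reference, one Generalized Iterative Scaling step has the classic form sketched below (a textbook sketch, not the hierarchical implementation from the thesis): each weight moves by (1/C) · log of the ratio between the empirical and model expectations of its feature, where C bounds the number of active features per event. The feature names and expectation values are made up for illustration.

```python
import math

def gis_update(lambdas, empirical_expect, model_expect, C):
    """One GIS iteration: lambda_j += (1/C) * log(E_emp[f_j] / E_model[f_j])."""
    for j in lambdas:
        if empirical_expect[j] > 0 and model_expect[j] > 0:
            lambdas[j] += (1.0 / C) * math.log(empirical_expect[j] / model_expect[j])
    return lambdas

# Illustrative numbers: two features, at most C = 4 features active per event.
lambdas = {"unigram:loss": 0.0, "topic:FINANCE,loss": 0.0}
emp = {"unigram:loss": 0.02, "topic:FINANCE,loss": 0.005}
mod = {"unigram:loss": 0.03, "topic:FINANCE,loss": 0.002}
print(gis_update(lambdas, emp, mod, C=4))
```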

48. Experimental Setup for Broadcast News
• American English television broadcasts.
• Vocabulary: open (>100K).
• LM training set: 125K stories, 130M words.
• AM training set: 70 hours of speech data.
• Acoustic model: state-clustered cross-word triphone model.
• Trigram model: T>1, B>2, 9.1M constraints.
• Front end: 13 MFCC + Δ + ΔΔ.
• No speaker adaptation.
• Test set: Hub-4 ’96 dev-test set, 21K words.
• The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.

49. Number of Operations and Nominal Speed-up
• Baseline: IIS + unigram caching (Della Pietra et al.).
• The hierarchical training methods achieve a nominal speed-up of:
  • two orders of magnitude for Switchboard, and
  • three orders of magnitude for Broadcast News.

50. Real Running Time
• The real speed-up is 15–30 fold for the Switchboard task:
  • 30× for the trigram model,
  • 25× for the topic model,
  • 15× for the composite model.
• This simplification of the training procedure makes it possible to implement ME models for large corpora:
  • 40 minutes for the trigram model,
  • 2.3 hours for the topic model.
