This paper explores the use of maximum entropy (ME) models to incorporate semantic, syntactic, and collocational dependencies in language modeling. The efficient training of these models is also discussed, including hierarchical training techniques. The aim is to create a model that takes into account longer and sentence-structure dependent spans in natural language. The paper concludes with future research directions.
Maximum Entropy Language Modeling with Semantic, Syntactic and Collocational Dependencies Jun Wu Department of Computer Science and Center for Language and Speech Processing Johns Hopkins University, Baltimore, MD 21218 August 20, 2002
Outline • Motivation • Semantic (Topic) dependencies • Syntactic dependencies • Maximum entropy (ME) models with topic and syntactic dependencies • Training ME models in an efficient way • Hierarchical training (N-gram) • Generalized hierarchical training (syntactic model) • Divide-and-conquer (topic-dependent model) • Conclusion and future work
Exploiting Semantic and Syntactic Dependencies • N-gram models only take local correlation between words into account. • Several dependencies in natural language with longer and sentence-structure-dependent spans may compensate for this deficiency. • Need a model that exploits topic and syntax. Example: Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange.
Training a Topic Sensitive Model
• Cluster the training data by topic.
• TF-IDF vector (excluding stop words).
• Cosine similarity.
• K-means clustering (K~70 in SWBD, ~100 in BN).
• Select topic dependent words:
$$f_t(w) \log \frac{f_t(w)}{f(w)} \ge \text{threshold}$$
• Estimate an ME model with N-gram and topic unigram constraints:
$$P(w_i \mid w_{i-2}, w_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(\text{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, \text{topic})}$$
where
$$\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid \text{topic}) = \frac{\#[\text{topic}, w_i]}{\#[\text{topic}]}$$
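The clustering steps listed on this slide (TF-IDF vectors, cosine similarity, K-means) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the deterministic seeding with the first k documents, and the fixed iteration count are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs, stop_words=frozenset()):
    """Map each tokenized document to a sparse TF-IDF vector (dict: word -> weight)."""
    docs = [[w for w in d if w not in stop_words] for d in docs]
    df = Counter(w for d in docs for w in set(d))   # document frequency
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_cosine(vectors, k, iterations=10):
    """K-means with cosine similarity; the first k vectors seed the centroids (assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)   # Counter.update adds the weights
                centroids[c] = {w: x / len(members) for w, x in total.items()}
    return labels
```

With four toy documents, two about finance and two about sports, `kmeans_cosine(tfidf_vectors(docs), 2)` groups them by subject. The resulting cluster label of each document would then play the role of `topic` in the model above.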
Experimental Setup • Switchboard • WS97 dev test set. • Vocabulary: 22K (closed), • LM training set: 1100 conversations, 2.1M words, • AM training set: 60 hours of speech data, • Acoustic model: state-clustered cross-word triphone model, • Front end: 13 MF-PLP + Δ + Δ Δ , per conv. side CMS, • Test set: 19 conversations (2 hours), 18K words, • No speaker adaptation. • The evaluation is based on rescoring 100-best lists of the first pass speech recognition.
Experimental Setup (Cont.) • Broadcast News • Hub-4 96 eval set. • Vocabulary: 64K, • LM training set: 125K stories, 130M words, • AM training set: 72 hours of speech data, • Acoustic model: state-clustered cross-word triphone model, • Front end: 13 MFCC + Δ + Δ Δ , • Test set: 2 hours, 22K words, • No speaker adaptation. • The evaluation is based on rescoring 100-best lists of the first pass speech recognition.
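Both setups evaluate language models by rescoring N-best lists from the first recognition pass. A minimal sketch of that step is shown below. The score combination (acoustic log-probability plus a weighted LM log-probability plus a per-word penalty) is a common convention and an assumption here; the slides do not state the weights used, and `rescore_nbest`, `lm_weight` and `word_penalty` are hypothetical names.

```python
def rescore_nbest(hypotheses, lm_logprob, lm_weight=1.0, word_penalty=0.0):
    """Pick the best hypothesis from an N-best list under a new language model.

    hypotheses: list of (words, acoustic_logprob) pairs from the first pass.
    lm_logprob: function mapping a word list to its LM log-probability.
    """
    def total(hyp):
        words, acoustic = hyp
        return acoustic + lm_weight * lm_logprob(words) + word_penalty * len(words)
    return max(hypotheses, key=total)
```

For example, a hypothesis with a slightly worse acoustic score can win once the new language model assigns it a much higher probability.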
Experimental Results (Switchboard) • Baseline trigram model: • PPL-79, WER-38.5%. • Using N-best hypotheses causes little degradation. • Topic assignment based on utterances brings a slightly better result than that based on whole conversations. • Topic dependencies reduce perplexity by 7% and WER by 0.7% absolute.
Experimental Results (Broadcast News) • Utterance-level topic detection based on 10-best lists. • The ME trigram model duplicates the performance of the corresponding backoff model. • Topic dependencies help reduce perplexity by 10% and WER by 0.6% (absolute).
Exploiting Syntactic Dependencies
• All sentences in the training set are parsed by a left-to-right parser.
• A stack of parse trees for each sentence prefix is generated.
[Figure: partial parse of "The contract ended with a loss of 7 cents after", with POS tags DT NN VBD IN DT NN IN CD NNS. Exposed head words: h_{i-2} = contract (nt_{i-2} = NP) and h_{i-1} = ended (nt_{i-1} = VP). The figure also marks w_{i-2}, w_{i-1} and w_i.]
Exploiting Syntactic Dependencies (Cont.)
• A probability is assigned to each word as:
$$P(w_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_1^{i-1}, T_i) \cdot \rho(T_i \mid w_1^{i-1})$$
$$= \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) \cdot \rho(T_i \mid w_1^{i-1})$$
[Figure: partial parse of "The contract ended with a loss of 7 cents after", with h_{i-2} = contract (NP) and h_{i-1} = ended (VP).]
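The sum over the parse stack can be sketched in a few lines. This is an illustration under stated assumptions: each parse is represented by the context it exposes plus an unnormalized weight, the weights are normalized here to obtain ρ, and `word_prob` and `cond_prob` are hypothetical names.

```python
def word_prob(word, parses, cond_prob):
    """P(w_i | w_1^{i-1}) as a rho-weighted sum over the partial parses T_i in S_i.

    parses: list of (context, weight) pairs, one per partial parse;
            the weights are normalized to sum to one, giving rho(T_i | w_1^{i-1}).
    cond_prob: function (word, context) -> P(word | context).
    """
    total_weight = sum(weight for _, weight in parses)
    return sum(cond_prob(word, ctx) * weight / total_weight for ctx, weight in parses)
```

With two parses weighted 3:1 and conditional probabilities 0.2 and 0.04, the result is 0.75 × 0.2 + 0.25 × 0.04 = 0.16.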
Training a Syntactic ME Model
• Estimate an ME model with syntactic constraints:
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})}$$
where
$$\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]}$$
$$\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]}$$
$$\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}$$
• See Khudanpur and Wu CSL’00 and Chelba and Jelinek ACL’98 for details.
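To make the model form concrete, the sketch below evaluates the conditional probability by multiplying the exponentiated weights of the active features and normalizing over the vocabulary. The feature-key layout, the context dictionary keys (`w1` for w_{i-1}, `w2` for w_{i-2}, and so on) and the name `me_prob` are assumptions made for this example; features without a weight are treated as having λ = 0.

```python
import math

def me_prob(word, context, lambdas, vocab):
    """P(word | context) for a log-linear model with the seven feature types above.

    context: dict with keys w1, w2 (previous words), h1, h2 (head words),
             nt1, nt2 (non-terminal labels); index 1 means position i-1.
    lambdas: dict mapping feature keys to weights; missing features weigh 0.
    """
    def score(w):
        keys = [
            ("w", w),
            ("ww", context["w1"], w),
            ("www", context["w2"], context["w1"], w),
            ("hw", context["h1"], w),
            ("hhw", context["h2"], context["h1"], w),
            ("nw", context["nt1"], w),
            ("nnw", context["nt2"], context["nt1"], w),
        ]
        return math.exp(sum(lambdas.get(k, 0.0) for k in keys))

    z = sum(score(w) for w in vocab)   # the denominator Z(context)
    return score(word) / z
```

Note that the denominator requires a pass over the whole vocabulary for every distinct context, which is the cost the later slides set out to reduce.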
Experimental Results for Switchboard • Baseline Katz back-off trigram model: PPL 79, WER 38.5%. • Interpolated model: PPL reduced by 4%, WER reduced by 0.6%. • Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.7% absolute. • Head word N-gram constraints result in similar improvements. • Non-terminal constraints and syntactic constraints together reduce the perplexity by 6.3% and WER by 1.0% absolute. • ME model achieves better performance than interpolated model (Chelba & Jelinek).
Experimental Results for Broadcast News* • Baseline trigram model: PPL 214, WER 35.3%. • Interpolated model: PPL reduced by 7%, WER reduced by 0.6%. • Non-terminal (NT) N-gram constraints alone reduce perplexity by 5% and WER by 0.4% absolute. • Head word N-gram constraints result in similar improvements. • Non-terminal constraints and syntactic constraints together reduce the perplexity by 7% and WER by 0.7% absolute. • ME model achieves slightly better performance than interpolated model. *14M words of data
Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
• Probabilities are assigned as:
$$P(w_i \mid w_1^{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic}) \cdot \rho(T_i \mid w_1^{i-1})$$
• The ME composite model is trained:
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic}) = \frac{e^{\lambda(w_i)} \cdot e^{\lambda(w_{i-1}, w_i)} \cdot e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \cdot e^{\lambda(h_{i-1}, w_i)} \cdot e^{\lambda(h_{i-2}, h_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-1}, w_i)} \cdot e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} \cdot e^{\lambda(\text{topic}, w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, \text{topic})}$$
• Only marginal trigram-like constraints are necessary.
Overall Experimental Results for SWBD • Baseline trigram WER is 38.5%. • Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute. • Syntactic heads result in a 6% reduction in perplexity and a 1.0% absolute reduction in WER. • Topic-dependent constraints and syntactic constraints together reduce the perplexity by 13% and WER by 1.5% absolute. • The gains from topic and syntactic dependencies are nearly additive.
Overall Experimental Results for BN • The improvements are repeated on the 14M-word Broadcast News task: • Topic-dependent constraints and syntactic constraints individually reduce WER by 0.7% absolute. • Together they reduce the WER by 1.2%. • This WER is lower than that of the trigram model trained with 130M words. • The gains from topic and syntactic dependencies are nearly additive.
Advantages and Disadvantages of the Maximum Entropy Method • Advantages: • Creating a “smooth” model that satisfies all empirical constraints. • Incorporating various sources of information (e.g. topic and syntax) in a unified language model. • Disadvantages: • High computational complexity of the model parameter estimation procedure. • Heavy computational load in using ME models during recognition.
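Estimating Model Parameters Using GIS
• Trigram model:
$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\alpha_{w_i}^{g_1(w_i)} \cdot \alpha_{w_{i-1}, w_i}^{g_2(w_{i-1}, w_i)} \cdot \alpha_{w_{i-2}, w_{i-1}, w_i}^{g_3(w_{i-2}, w_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1})}$$
where
$$Z(w_{i-2}, w_{i-1}) = \sum_{w_i} \alpha_{w_i}^{g_1(w_i)} \cdot \alpha_{w_{i-1}, w_i}^{g_2(w_{i-1}, w_i)} \cdot \alpha_{w_{i-2}, w_{i-1}, w_i}^{g_3(w_{i-2}, w_{i-1}, w_i)}$$
• Generalized Iterative Scaling (GIS) can be used to compute the $\alpha$’s.
• Estimating each unigram, bigram and trigram feature parameter needs […], […] and […] respectively. In total, the complexity is […]. The first term dominates the computation. E.g., in Switchboard, […].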
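A toy version of GIS helps show where the cost comes from. The sketch below trains a bigram-history model (one history word rather than two, to keep it short) with unigram and seen-bigram indicator features. It is an assumption-laden illustration, not the author's code: the slack feature of textbook GIS is omitted, the constant C is set to the maximum number of active features (2), and `gis_train` is a hypothetical name.

```python
import math

def gis_train(data, iterations=30):
    """Fit P(w | h) with unigram and seen-bigram indicator features by GIS.

    data: list of (history, word) pairs.
    Returns (lambdas, trace); trace holds the log-likelihood before each update.
    """
    vocab = sorted({w for _, w in data})
    seen = set(data)

    def feats(h, w):
        # Nested features: the unigram feature always fires, the bigram one if seen.
        return [("u", w)] + ([("b", h, w)] if (h, w) in seen else [])

    C = 2.0  # upper bound on the number of active features per (h, w)
    lambdas = {f: 0.0 for h in {h for h, _ in data} for w in vocab for f in feats(h, w)}
    empirical = dict.fromkeys(lambdas, 0.0)
    for h, w in data:
        for f in feats(h, w):
            empirical[f] += 1.0

    trace = []
    for _ in range(iterations):
        expected = dict.fromkeys(lambdas, 0.0)
        loglik = 0.0
        for h, w in data:
            # Denominator Z(h): a sum over the whole vocabulary for every event.
            scores = {v: math.exp(sum(lambdas[f] for f in feats(h, v))) for v in vocab}
            z = sum(scores.values())
            loglik += math.log(scores[w] / z)
            for v in vocab:
                for f in feats(h, v):
                    expected[f] += scores[v] / z
        trace.append(loglik)
        # GIS update: move each weight by log(empirical / expected) / C.
        for f in lambdas:
            lambdas[f] += math.log(empirical[f] / expected[f]) / C
    return lambdas, trace
```

The inner loops visit every vocabulary word for every training event, so the cost per iteration grows with the vocabulary size times the number of histories. In this sketch α corresponds to exp(λ).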
The Computation of Denominators
• For each history $(w_{i-2}, w_{i-1})$, we need to compute $Z(w_{i-2}, w_{i-1})$.
• Computing the denominator for all histories takes […] time.
• Computing the expectation for all unigram features needs the same amount of time:
$$E[g_1(w_i)] = \sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}) \cdot \frac{\alpha_{w_i}^{g_1(w_i)} \cdot \alpha_{w_{i-1}, w_i}^{g_2(w_{i-1}, w_i)} \cdot \alpha_{w_{i-2}, w_{i-1}, w_i}^{g_3(w_{i-2}, w_{i-1}, w_i)}}{Z(w_{i-2}, w_{i-1})}$$
• $Z(w_{i-2}, w_{i-1})$ requires a sum over all $w_i$ for a given history;
• $E[\cdot]$ requires a sum over all histories for a given $w_i$.
• Any simplification made on the calculation of denominators can be applied to feature expectation.
• We focus on the computation of denominators.
State-of-the-Art Implementation • Della Pietra et al. suggest […]. • The straightforward implementation needs […]. • SWBD: 6 billion vs 8 trillion (~1300 fold). • For a given history, only a few words have conditional (bigram or trigram) features activated.
State-of-the-Art Implementation (cont’d) Unigram-Caching • Unigram-caching (Della Pietra et al.): […] • Complexity: […]. • In practice, […]. • E.g. SWBD: […], 120 million vs 6 billion (~50 fold).
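The formula on this slide did not survive, so the following is only a sketch of the general idea behind unigram caching, under stated assumptions: the sum of all unigram terms is computed once, and for each history only the few words with an active bigram feature contribute a correction. A bigram-history model is used for brevity, and `z_naive`, `z_cached` and `followers` are hypothetical names.

```python
def z_naive(history, vocab, alpha_uni, alpha_bi):
    """Denominator by direct summation over the whole vocabulary."""
    return sum(alpha_uni[w] * alpha_bi.get((history, w), 1.0) for w in vocab)

def z_cached(history, unigram_sum, alpha_uni, alpha_bi, followers):
    """Denominator from a cached unigram sum plus sparse corrections.

    unigram_sum: sum of alpha_uni over the vocabulary, computed once.
    followers: dict mapping a history to the words that have a bigram
               feature active after it (usually a short list).
    """
    correction = sum(
        alpha_uni[w] * (alpha_bi[(history, w)] - 1.0)
        for w in followers.get(history, ())
    )
    return unigram_sum + correction
```

Both functions return the same value. The cached version touches only the words in `followers[history]` instead of the whole vocabulary, which matches the slide's observation that few words have conditional features active for a given history.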
Hierarchical Training • Unigram caching is still too “slow” for large corpora, e.g. BN […]. • Computational complexity […], which is the same as that of training a back-off trigram model. • This method can be extended to N-gram models, with the training time per iteration exactly the same as that of the empirical estimation.
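The slide's own derivation is not preserved, so the sketch below is one plausible reading of hierarchical training, offered as an assumption: because N-gram features are nested, the denominator for a trigram history (u, v) can be split into a unigram sum shared by all histories, a bigram correction shared by all u for a given v, and a trigram correction over the few words with a trigram feature. The function name and data layout are hypothetical.

```python
def hierarchical_denominators(alpha_uni, alpha_bi, alpha_tri):
    """Return Z(u, v) for every history (u, v) that has a trigram feature.

    alpha_uni: {w: alpha}, alpha_bi: {(v, w): alpha}, alpha_tri: {(u, v, w): alpha}.
    Words without a bigram or trigram feature contribute a factor of 1.
    """
    s0 = sum(alpha_uni.values())          # shared by every history
    bi_corr = {}                          # shared by every u for a given v
    for (v, w), a in alpha_bi.items():
        bi_corr[v] = bi_corr.get(v, 0.0) + alpha_uni[w] * (a - 1.0)
    tri_corr = {}                         # specific to the full history (u, v)
    for (u, v, w), a in alpha_tri.items():
        base = alpha_uni[w] * alpha_bi.get((v, w), 1.0)
        tri_corr[(u, v)] = tri_corr.get((u, v), 0.0) + base * (a - 1.0)
    return {h: s0 + bi_corr.get(h[1], 0.0) + c for h, c in tri_corr.items()}
```

Each feature is visited once, so the work is proportional to the number of unigram, bigram and trigram features rather than to the number of histories times the vocabulary size. That is in the spirit of the slide's claim that the cost matches training a back-off trigram model.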
Speed-up of the Hierarchical Training Method • Baseline: unigram-caching (Della Pietra et al.). • The hierarchical training method achieves: • a nominal speed-up of two orders of magnitude for Switchboard, and three orders of magnitude for Broadcast News; • a real speed-up of 30-fold for SWBD and 85-fold for BN.
Feature Hierarchy
• N-gram Model
• Syntactic Model
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{1}{Z} \, \alpha_{w_i}^{g_1(w_i)} \, \alpha_{w_{i-1}, w_i}^{g_2(w_{i-1}, w_i)} \, \alpha_{h_{i-1}, w_i}^{g_3(h_{i-1}, w_i)} \, \alpha_{nt_{i-1}, w_i}^{g_4(nt_{i-1}, w_i)} \, \alpha_{w_{i-2}, w_{i-1}, w_i}^{g_5(w_{i-2}, w_{i-1}, w_i)} \, \alpha_{h_{i-2}, h_{i-1}, w_i}^{g_6(h_{i-2}, h_{i-1}, w_i)} \, \alpha_{nt_{i-2}, nt_{i-1}, w_i}^{g_7(nt_{i-2}, nt_{i-1}, w_i)}$$
Generalized Hierarchical Training
• Syntactic Model
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{1}{Z} \, \alpha_{w_i}^{g_1(w_i)} \, \alpha_{w_{i-1}, w_i}^{g_2(w_{i-1}, w_i)} \, \alpha_{h_{i-1}, w_i}^{g_3(h_{i-1}, w_i)} \, \alpha_{nt_{i-1}, w_i}^{g_4(nt_{i-1}, w_i)} \, \alpha_{w_{i-2}, w_{i-1}, w_i}^{g_5(w_{i-2}, w_{i-1}, w_i)} \, \alpha_{h_{i-2}, h_{i-1}, w_i}^{g_6(h_{i-2}, h_{i-1}, w_i)} \, \alpha_{nt_{i-2}, nt_{i-1}, w_i}^{g_7(nt_{i-2}, nt_{i-1}, w_i)}$$
Speed-up of the Generalized Hierarchical Training • Evaluation is based on the syntactic model. • The generalized hierarchical training (GHT) method achieves: • a nominal speed-up of about two orders of magnitude for Switchboard and Broadcast News; • a real speed-up of 17-fold for Switchboard. • Training the syntactic model for the subset of Broadcast News is impossible without GHT, and it needs 480 CPU-hours per iteration even after the speed-up.
Training Topic Models and Composite Models by Hierarchical Training and Divide-and-Conquer • Topic-dependent models can be trained by divide-and-conquer. • Partition the training data into parts, train the model on each part, and then collect the partial feature expectations. • Divide-and-conquer can be used together with hierarchical training. • Real training time for topic-dependent models: • SWBD: 21 CPU-hours → 0.5 CPU-hours. • BN: (~85) CPU-hours → 2.3 CPU-hours.
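The divide-and-conquer step relies on the fact that feature expectations are sums over training events, so they can be accumulated per partition and then added. A minimal sketch, with hypothetical function names and a caller-supplied model distribution and feature function (both assumptions):

```python
from collections import Counter

def partial_expectations(part, model_dist, features):
    """Model feature expectations accumulated over one partition of the data.

    part: iterable of contexts (for example, histories within one topic).
    model_dist: function context -> {word: probability}.
    features: function (context, word) -> list of active feature keys.
    """
    expected = Counter()
    for context in part:
        for word, p in model_dist(context).items():
            for f in features(context, word):
                expected[f] += p
    return expected

def combined_expectations(parts, model_dist, features):
    """Sum the partial expectations; each part could be processed independently."""
    total = Counter()
    for part in parts:
        total.update(partial_expectations(part, model_dist, features))
    return total
```

Splitting the data by topic and summing the partial results gives the same expectations as a single pass over all the data, while each part only needs the features relevant to its own topic.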
Simplify the Computation in Calculating Probabilities Using ME Models • Converting ME N-gram models to ARPA back-off format. • Speedup: 1000+ fold. • Approximating denominator for topic-dependent models. • Speedup: 400+ fold. • Almost the same speech recognition accuracy. • Caching the last accessed histories. • Speedup: 5 fold.
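The last item, caching the most recently accessed histories, can be sketched with the standard library's LRU cache. This is an illustration under assumptions: `compute_z` stands for whatever function evaluates the denominator for a history, histories are hashable tuples, and the cache size is arbitrary.

```python
from functools import lru_cache

def make_cached_denominator(compute_z, maxsize=1024):
    """Wrap a denominator function with an LRU cache keyed by history.

    compute_z: function taking a hashable history (e.g. a tuple of words)
               and returning Z(history).
    """
    return lru_cache(maxsize=maxsize)(compute_z)
```

During N-best rescoring the same histories recur across hypotheses, so repeated lookups return the stored value instead of recomputing the sum over the vocabulary.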