Topic-Dependent-Class-Based N-Gram Language Model Welly Naptali, Masatoshi Tsuchiya, and Seiichi Nakagawa, Member, IEEE • IEEE TRANSACTIONS 2012 Presenter: 郝柏翰
Outline • Introduction • TDC-based n-gram language model • Experimental Results • Conclusion
TDC-based n-gram language model • The model is based on the belief that noun relations contain latent topic information. Hence, a semantic extraction method is employed along with a clustering method to reveal and define topics based on nouns only. • Given a word sequence, a fixed-size window is used to observe noun occurrences in the context history to decide the topic through voting. • Finally, the topic is integrated as a part of the word sequence in the n-gram model.
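The windowed voting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the noun-to-topic table, the function names, and the tie-breaking rule are all assumptions, and in the actual model the topics come from semantic extraction and clustering rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical noun-to-topic table (assumption for illustration only).
NOUN_TOPIC = {
    "bank": "finance",
    "loan": "finance",
    "stock": "finance",
    "river": "nature",
    "tree": "nature",
}


def vote_topic(window, noun_topic=NOUN_TOPIC, default="unknown"):
    """Return the topic with the most votes among known nouns in the window."""
    votes = Counter(noun_topic[w] for w in window if w in noun_topic)
    if not votes:
        return default  # no known noun observed in the window
    # Ties are resolved in favor of the topic seen first (an assumption).
    return votes.most_common(1)[0][0]


def tdc_context(history, n, m):
    """Build the conditioning context for the next word.

    The last n-1 words form the near n-gram history; the m words
    before them form the window used for topic voting.
    """
    split = max(0, len(history) - (n - 1))
    near = history[split:]
    outer = history[max(0, split - m):split]
    return vote_topic(outer), tuple(near)


history = ["the", "bank", "approved", "a", "loan", "near", "the", "river"]
topic, near = tdc_context(history, n=3, m=5)
```

With this history, the window covers "bank approved a loan near", so the two finance nouns decide the topic, and the topic is then used together with the near history ("the", "river") as the n-gram context.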
TDC-based n-gram language model • The standalone TDC model suffers from a shrinking training corpus. Therefore, to achieve good results, it needs to be interpolated with a word-based n-gram that serves as a general model.
Soft Clustering • In the LSA space, VQ is applied to cluster the nouns into topics. The VQ algorithm is iterated using the cosine similarity between nouns until the desired number of clusters is reached. • A confidence measure γ is defined as the distance between a word vector and its class centroid. • Previously, each noun w_i was mapped to only one topic class C_i. This is known as hard clustering. • To make the model more robust, soft clustering is performed so that a noun may belong to multiple topics.
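The difference between hard and soft clustering can be sketched as below. This is a hedged illustration under stated assumptions: the two-dimensional vectors, the centroid table, and the `top_k` cutoff are hypothetical, and cosine similarity is used here as a stand-in for the confidence score γ, whose exact definition is not given in these slides.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_assign(vector, centroids, top_k=2):
    """Return the top_k (topic, similarity) pairs, most similar first.

    top_k=1 corresponds to hard clustering; top_k>1 lets a noun
    belong to several topics (the cutoff rule is an assumption).
    """
    scored = [(topic, cosine(vector, c)) for topic, c in centroids.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical topic centroids in a toy 2-D "LSA space".
centroids = {
    "finance": [1.0, 0.0],
    "nature": [0.0, 1.0],
    "sports": [-1.0, 0.0],
}
noun_vector = [0.8, 0.6]

hard = soft_assign(noun_vector, centroids, top_k=1)
soft = soft_assign(noun_vector, centroids, top_k=2)
```

Here the noun is closest to "finance" but also reasonably close to "nature", so soft clustering keeps both memberships while hard clustering keeps only the first.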
Soft Voting • A TDC with window size m leads to an LM in which the probability of a word sequence W = w_1, w_2, w_3, …, w_N is defined by • Z is the topic class obtained by observing the m words in the outer context, i.e., the words just before the near n-gram
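In symbols, the description above can be sketched as follows. This is one plausible way to write it, not a formula taken from the slides: the indexing convention and the subscript on Z are assumptions, chosen so that the near history has n-1 words and the voting window has exactly m words.

```latex
P(W) \approx \prod_{i=1}^{N} P\left(w_i \mid Z_i,\; w_{i-n+1}^{\,i-1}\right),
\qquad
Z_i = \text{topic class voted from } w_{i-n-m+1}^{\,i-n}
```

Under this convention, w_{i-n+1}^{i-1} is the near n-gram history and w_{i-n-m+1}^{i-n} is the window of m words that precedes it.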
Soft Voting • F is the voting score for a given window size defined as follows: • where
Interpolation • Word-Based N-Gram: A word-based N-gram is used as the LM that captures local constraints, and it is combined with the TDC through linear interpolation. • Cache-Based LM: A cache-based LM is based on the notion that a word that has appeared in a document is more likely to appear again in the same document.
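Both ideas on this slide can be illustrated with a short sketch. It is a minimal example under stated assumptions: the weight `lam`, the input probabilities, and the unigram form of the cache are hypothetical choices, and the way the slides scale the TDC with the cache is not reproduced here.

```python
from collections import Counter


def interpolate(p_tdc, p_ngram, lam):
    """Linear interpolation: lam * P_TDC + (1 - lam) * P_ngram."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_tdc + (1.0 - lam) * p_ngram


class UnigramCache:
    """Relative frequency of the words seen so far in the current document."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def add(self, word):
        """Record one more occurrence of `word` in the document."""
        self.counts[word] += 1
        self.total += 1

    def prob(self, word):
        """Cache probability; 0.0 for unseen words or an empty cache."""
        return self.counts[word] / self.total if self.total else 0.0


# Hypothetical probabilities for one word from the two models.
p = interpolate(p_tdc=0.2, p_ngram=0.1, lam=0.25)

cache = UnigramCache()
for w in ["bank", "loan", "bank"]:
    cache.add(w)
p_cache = cache.prob("bank")
```

A word that has already occurred twice in three tokens gets a cache probability of 2/3, which is how the cache raises the score of recently used words.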
Interpolation • There are two ways of combining the LMs, i.e., scaling the TDC before or after it is linearly interpolated with the word-based N-gram. • Before: • After:
Experimental Results • All results show very significant improvements, especially for the standalone model. The standalone model gives a 47.0% relative reduction in perplexity, while the interpolated model gives a 7.4% relative reduction.
Experimental Results • The best perplexity achieved by TDC*CACHE+NGRAM is 87.0. • That gives a 22.0% relative improvement against the word-based 3-gram • and a 9.6% relative improvement against the TDC without a cache-based LM combination.
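The relative figures above follow the usual definition (baseline minus improved, divided by baseline). The sketch below shows the arithmetic; the helper names are hypothetical, and the baseline perplexities it derives (roughly 111.5 and 96.2) are back-computed from the percentages on this slide, not values reported in the slides.

```python
def relative_reduction(baseline, improved):
    """Relative perplexity reduction: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline


def implied_baseline(improved, reduction):
    """Baseline perplexity implied by an improved value and a relative reduction."""
    return improved / (1.0 - reduction)


best = 87.0  # perplexity of TDC*CACHE+NGRAM from the slide

# Back-computed (not reported) baselines implied by the stated percentages.
ngram_ppl = implied_baseline(best, 0.220)
tdc_no_cache_ppl = implied_baseline(best, 0.096)

# Sanity check: recomputing the reduction recovers the stated percentage.
check = relative_reduction(ngram_ppl, best)
```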
Conclusion • A TDC is a topic-dependent LM with unsupervised topic extraction that employs semantic analysis and voting on nouns. • We demonstrated that a TDC with soft clustering and/or soft voting in the training and/or test phases improved performance. • We also demonstrated that incorporating a cache-based LM improved the TDC further. • The only drawback of the TDC LM is that the number of parameters increases when soft voting is performed in the training phase.