SEMI-SUPERVISED LEARNING OF LANGUAGE MODEL USING UNSUPERVISED TOPIC MODEL
Shuanhu Bai, Chien-Lin Huang, Bin Ma and Haizhou Li
Outline • Introduction • Learning method • Learning strategies • Weighted topic decomposition • Weighted N-gram counts • Experiments • Conclusions
Introduction • Efforts to build domain-specific LMs have mostly focused on obtaining training texts from various sources such as the Web. • Although most irrelevant data can be filtered out by search engines, the data collected from the Web cannot be used directly without additional clean-up. • These text selection schemes can be regarded as text or sentence classification methods. Texts or sentences falling into the domain class are used for domain-specific LM training.
Introduction (cont.) • Here, they introduce a novel semi-supervised learning (SSL) method for learning domain-specific LMs from general-domain data using a PLSA topic model. • The latent topics of the PLSA topic model serve as the medium for learning. The topic decomposition (TD) mechanism of PLSA is used to derive the topic distribution of the documents. • In this way, given a small domain-specific dataset DI, we can build a domain-specific LM by mining a larger general-domain dataset DG for "useful" information.
Learning strategies • In the framework of the PLSA topic model, the distribution of a word w in a document d can be described as a mixture of topics t: p(w | d) = Σt p(w | t) p(t | d)  (1) • The model is estimated over D, a process also known as topic decomposition, where D is the combined dataset • We assume that the topic distribution of incoming documents can simply be modeled by p(t | DI) during decoding
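As a concrete illustration of topic decomposition, the sketch below folds a document into an already-trained PLSA model, estimating p(t | d) by EM while keeping the topic-word distributions p(w | t) fixed so that p(w | d) = Σt p(w | t) p(t | d). This is not code from the paper; the function name and array layout are assumptions for illustration only.

```python
import numpy as np

def topic_decompose(doc_counts, p_w_given_t, n_iter=50):
    """Estimate p(t|d) for one document by EM folding-in, keeping the
    topic-word distributions p(w|t) fixed, so that
    p(w|d) = sum_t p(w|t) * p(t|d)."""
    V, T = p_w_given_t.shape                    # vocabulary size, number of topics
    p_t_given_d = np.full(T, 1.0 / T)           # start from a uniform topic mix
    for _ in range(n_iter):
        # E-step: posterior p(t | d, w) for every vocabulary word
        post = p_w_given_t * p_t_given_d        # (V, T)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate p(t|d) from the expected topic counts
        topic_counts = doc_counts @ post        # (T,)
        p_t_given_d = topic_counts / (topic_counts.sum() + 1e-12)
    return p_t_given_d
```

The same routine, applied to the in-domain documents, would give the p(t | DI) used during decoding.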
Weighted Topic Decomposition • We may have a small amount of domain-specific text and plenty of general-domain text. If we combine the two datasets equally and feed them into the training process, the general-domain data may overwhelm the domain-specific data. • The new weighting parameter introduced into the likelihood function decreases the contribution of the general-domain data when its value is below one. • When we apply this learning strategy to the PLSA framework specified by Eq.(1), the log-likelihood of the general-domain data in Eq.(4) can be expanded so that each general-domain document's contribution is scaled by this weight.
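A minimal sketch of how such a per-document weight might enter PLSA training, assuming the weight simply scales each general-domain document's counts in the EM statistics. The function name, the dense EM implementation, and the exact placement of the weight are assumptions, not the paper's formulation.

```python
import numpy as np

def weighted_plsa(counts, doc_weights, n_topics, n_iter=100, seed=0):
    """PLSA training in which each document's counts are scaled by a weight
    (1.0 for in-domain documents, a value < 1 for general-domain ones), so
    the large general-domain corpus does not overwhelm the in-domain set.

    counts      : (D, V) term-document count matrix
    doc_weights : (D,) per-document weights
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_w_t = rng.random((V, n_topics)); p_w_t /= p_w_t.sum(axis=0, keepdims=True)
    p_t_d = rng.random((D, n_topics)); p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    weighted = counts * doc_weights[:, None]           # scale counts once up front
    for _ in range(n_iter):
        # E-step: p(t | d, w) for every (document, word) pair
        post = p_t_d[:, None, :] * p_w_t[None, :, :]   # (D, V, T)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from weighted expected counts
        exp_counts = weighted[:, :, None] * post        # (D, V, T)
        p_w_t = exp_counts.sum(axis=0)
        p_w_t /= p_w_t.sum(axis=0, keepdims=True) + 1e-12
        p_t_d = exp_counts.sum(axis=1)
        p_t_d /= p_t_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_t, p_t_d
```

The dense (D, V, T) arrays are only for readability; a practical implementation would iterate over the sparse nonzero counts.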
Weighted N-gram Counts • In the PLSA topic model, the mixture components are word unigram models. • Let hw represent a word n-gram, where h stands for a word sequence of length n-1 (h is empty for a word unigram) and w is an arbitrary word. • If we regard p(t|d) as the result of a soft classification of the documents, then the count of the n-gram hw with respect to topic t, taking the weighting factor for dataset DG into consideration, can be expressed as a sum of its per-document counts weighted by p(t|d) and by that factor.
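A rough sketch of how such weighted, topic-conditioned n-gram counts could be accumulated, treating p(t|d) as a soft document label and scaling general-domain documents by their weight. All names here are illustrative and the exact formula in the paper may differ.

```python
from collections import Counter, defaultdict

def topic_ngram_counts(docs, p_t_d, weights, order=3):
    """Accumulate soft, topic-conditioned n-gram counts: each n-gram
    occurrence in document d contributes weight * p(t|d) to the count
    table of topic t.

    docs    : list of token lists
    p_t_d   : per-document topic distributions (length-T sequences)
    weights : per-document weights (1.0 in-domain, < 1 for general-domain)
    """
    n_topics = len(p_t_d[0])
    counts = [defaultdict(float) for _ in range(n_topics)]   # one table per topic
    for tokens, topics, w in zip(docs, p_t_d, weights):
        grams = Counter(tuple(tokens[i:i + order])
                        for i in range(len(tokens) - order + 1))
        for gram, c in grams.items():
            for t in range(n_topics):
                counts[t][gram] += w * topics[t] * c
    return counts
```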
The modeling is then conducted on this mixture of counts rather than on a mixture of probabilities.
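One plausible reading of this mixture-of-counts step, continuing the hypothetical helpers above: the per-topic count tables are mixed with the in-domain topic distribution p(t | DI) to yield adapted counts, from which a standard n-gram LM can then be estimated.

```python
from collections import defaultdict

def adapted_counts(topic_counts, p_t_domain):
    """Mix per-topic n-gram count tables with the in-domain topic
    distribution p(t | D_I) to obtain adapted counts for LM estimation."""
    mixed = defaultdict(float)
    for t, weight in enumerate(p_t_domain):
        for gram, count in topic_counts[t].items():
            mixed[gram] += weight * count
    return mixed
```

The adapted counts could then be passed to an ordinary n-gram estimation toolkit in place of raw in-domain counts.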
Conclusions • The technique presented in this paper provides a PLSA-based n-gram weighting scheme. • The application of this weighting method extends well beyond language modeling: it can be used to derive other domain-specific language-use statistics, such as term extraction and co-occurrence mining. • The technique can also serve as a smoothing method for domain-specific knowledge.