Presentation Transcript


  1. SEMI-SUPERVISED LEARNING OF LANGUAGE MODEL USING UNSUPERVISED TOPIC MODEL. Shuanhu Bai, Chien-Lin Huang, Bin Ma and Haizhou Li

  2. Outline • Introduction • Learning method • Learning strategies • Weighted topic decomposition • Weighted N-gram counts • Experiments • Conclusions

  3. Introduction • Efforts to build domain-specific LMs have mostly focused on obtaining training texts from various sources, such as the Web. • Although most irrelevant data can be filtered out by search engines, data collected from the Web cannot be used directly without additional clean-up. • These text selection schemes can be regarded as text or sentence classification methods. Texts or sentences falling into the domain class are used for domain-specific LM training.

  4. Introduction (cont.) • Here, the authors introduce a novel semi-supervised learning (SSL) method for learning domain-specific LMs from general-domain data using a PLSA topic model. • The latent topics of the PLSA model serve as the medium for learning. The topic decomposition (TD) mechanism of PLSA is used to derive the topic distribution of the documents. • In this way, given a small domain-specific dataset DI, we can build a domain-specific LM by mining a larger general-domain dataset DG for "useful" information.

  5. Learning strategies • In the framework of the PLSA topic model, the distribution of a word w in a document d can be described as a mixture of topics t: p(w|d) = Σt p(w|t) p(t|d) (Eq. 1) • The model is trained on the combined dataset D = DI ∪ DG, a process also known as topic decomposition (TD) • We assume that the topic distribution of incoming documents can simply be modeled by p(t|DI) during decoding
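To make the decomposition concrete, here is a minimal Python sketch of the mixture in Eq. (1). It assumes the PLSA parameters p(w|t) and p(t|d) are already available; the random Dirichlet matrices below are stand-ins for values learned by EM training, and the dimensions are arbitrary.

```python
import numpy as np

# Hypothetical sizes: V words, K latent topics, D documents.
V, K, D = 5000, 50, 200
rng = np.random.default_rng(0)

# Stand-ins for trained PLSA parameters.
p_w_t = rng.dirichlet(np.ones(V), size=K).T   # p(w|t): V x K, columns sum to 1
p_t_d = rng.dirichlet(np.ones(K), size=D).T   # p(t|d): K x D, columns sum to 1

# Eq. (1): p(w|d) = sum_t p(w|t) p(t|d) -- each document's word distribution
# is a topic-weighted mixture of the per-topic unigram models.
p_w_d = p_w_t @ p_t_d                         # V x D
assert np.allclose(p_w_d.sum(axis=0), 1.0)    # each column is a distribution
```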

  6. Weighted Topic Decomposition • We may have a small amount of domain-specific text and plenty of general-domain text. If we combine the two datasets with equal weight and feed them into the training process, the general-domain data may overwhelm the domain-specific data. • A new parameter λ introduced into the likelihood function decreases the contribution of the general-domain data, with values of λ between 0 and 1 • When we apply this learning strategy to the PLSA framework specified by Eq. (1), the log-likelihood of the general-domain data in Eq. (4) can be expanded as: LG = λ Σd∈DG Σw n(d, w) log Σt p(w|t) p(t|d)
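As an illustration of how such a weight can enter PLSA training, the sketch below performs one EM iteration in which each general-domain document's counts are scaled by λ in the M-step; the variable names, layout, and single-weight scheme are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def weighted_plsa_step(n, p_w_t, p_t_d, lam, n_dom):
    """One EM iteration of PLSA with general-domain counts down-weighted.

    n:      D x V matrix of word counts n(d, w)
    p_w_t:  V x K matrix p(w|t);  p_t_d: K x D matrix p(t|d)
    n_dom:  docs 0..n_dom-1 belong to DI (weight 1), the rest to DG (weight lam)
    """
    D, V = n.shape
    w = np.where(np.arange(D) < n_dom, 1.0, lam)        # per-document weight
    # E-step: p(t|d,w) proportional to p(w|t) p(t|d)
    post = p_w_t.T[None, :, :] * p_t_d.T[:, :, None]    # D x K x V
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step for p(w|t): weighted counts w(d) * n(d,w)
    wn = w[:, None] * n                                 # D x V
    nt = np.einsum('dv,dkv->kv', wn, post)              # expected counts per topic
    new_p_w_t = (nt / (nt.sum(axis=1, keepdims=True) + 1e-12)).T
    # M-step for p(t|d): lam cancels inside each document's normalization
    nd = np.einsum('dv,dkv->dk', n, post)               # D x K
    new_p_t_d = (nd / (nd.sum(axis=1, keepdims=True) + 1e-12)).T
    return new_p_w_t, new_p_t_d
```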

  7. Weighted N-gram Counts • In the PLSA topic model, the mixture components are word unigram models. • Let hw represent a word n-gram, where h stands for a word history of length n−1 (h is empty for a word unigram) and w is an arbitrary word. • If we regard p(t|d) as the result of a soft classification of the documents, then the count of n-gram hw with respect to topic t, taking the weighting factor λ for dataset DG into consideration, can be expressed as: c(hw, t) = Σd∈DI c(hw, d) p(t|d) + λ Σd∈DG c(hw, d) p(t|d)
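A sketch of this soft, weighted counting follows: each document contributes its n-gram counts to every topic in proportion to p(t|d), with general-domain documents scaled by λ. The input format (token lists paired with topic posteriors and a DI/DG flag) and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict

def topic_ngram_counts(docs, n, lam):
    """Weighted topic-conditional n-gram counts c(hw, t).

    docs: iterable of (tokens, p_t_d, is_domain) where tokens is a list of
          words, p_t_d the document's topic posterior, and is_domain marks
          membership in DI (True) vs. DG (False).
    """
    c = defaultdict(Counter)                  # c[t][hw]
    for tokens, p_t_d, is_domain in docs:
        w = 1.0 if is_domain else lam         # lam down-weights DG documents
        grams = Counter(tuple(tokens[i:i + n])
                        for i in range(len(tokens) - n + 1))
        for t, p in enumerate(p_t_d):         # spread counts over topics by p(t|d)
            for hw, cnt in grams.items():
                c[t][hw] += w * p * cnt
    return c
```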

  8. The modeling will be conducted based on a mixture of counts instead of a mixture of probabilities: the domain-specific count of each n-gram is obtained by mixing the topic-conditional counts with the target-domain topic weights, c(hw, DI) = Σt p(t|DI) c(hw, t), and the n-gram probabilities are then estimated from these counts in the usual maximum-likelihood way.
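Continuing the sketch, the topic-conditional counts can be mixed with the target-domain topic weights p(t|DI) to produce domain-specific n-gram counts, from which conditional probabilities follow by the usual maximum-likelihood ratio. The function and argument names below are illustrative assumptions.

```python
from collections import Counter

def domain_lm(c, p_t_DI):
    """Mix topic-conditional counts c[t][hw] with weights p(t|DI) into p(w|h)."""
    mixed = Counter()
    for t, grams in c.items():
        for hw, cnt in grams.items():
            mixed[hw] += p_t_DI[t] * cnt      # c(hw, DI) = sum_t p(t|DI) c(hw, t)
    hist = Counter()
    for hw, cnt in mixed.items():
        hist[hw[:-1]] += cnt                  # history count c(h) = sum_w c(hw)
    # Maximum-likelihood estimate p(w|h) = c(hw) / c(h)
    return {hw: cnt / hist[hw[:-1]] for hw, cnt in mixed.items()}
```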

  9. Experiments

  10. Conclusions • The technique presented in this paper provides a PLSA-based n-gram weighting scheme. • The applicability of this weighting method extends far beyond language modeling tasks; it can also be used to gather other domain-specific language-use statistics, such as term extraction and co-occurrence mining. • The technique can also serve as a smoothing method for domain-specific knowledge.
