Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University)
WWW 2008
NLG Seminar 2008/12/31, Reporter: Kai-Jie Ko
Motivation
• Many classification tasks over short segments of text and Web data, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy due to data sparseness
Previous work to overcome data sparseness
• Employ search engines to expand and enrich the context of the data
• Drawback: time consuming!
• Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources
• Drawback: these approaches use only the user-defined categories and concepts in those repositories, which is not general enough
(a) Choose a universal dataset
• Must be large and rich enough to cover the words and concepts related to the classification problem
• Wikipedia & MEDLINE are chosen in this paper
• Use topic-oriented keywords to crawl Wikipedia with a maximum hyperlink depth of 4 (see the sketch after this list)
• 240 MB
• 71,968 documents
• 882,376 paragraphs
• 60,649 vocabulary terms
• 30,492,305 words
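The paper's crawler itself is not published; the following is a minimal sketch of a depth-limited breadth-first crawl in Python, where the seed URLs, the regex-based link extraction, and the page cap are all illustrative assumptions:

```python
# Hedged sketch of a depth-limited BFS crawl (not the authors' actual crawler).
from collections import deque
from urllib.parse import urljoin
import re

import requests  # third-party: pip install requests

LINK_RE = re.compile(r'href="(/wiki/[^":#]+)"')  # internal Wikipedia links only

def crawl(seed_urls, max_depth=4, max_pages=100_000):
    """Follow hyperlinks breadth-first, stopping at max_depth (4 in the paper)."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = html
        if depth < max_depth:
            for path in LINK_RE.findall(html):
                link = urljoin("https://en.wikipedia.org", path)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages

# Seeds would be pages found via topic-oriented keywords, e.g.:
# crawl(["https://en.wikipedia.org/wiki/Business"], max_depth=4)
```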
• Ohsumed: a test collection of medical journal abstracts to assist IR research
• 156 MB
• 233,442 abstracts
(b) Doing topic analysis for the universal dataset
• Use GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling
• The number of topics ranges from 10, 20, ... to 100, 150, and 200
• The hyperparameters alpha and beta were set to 0.5 and 0.1, respectively
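The paper drives GibbsLDA++ from the command line; as a rough Python stand-in, here is a hedged sketch using gensim's LdaModel (which uses online variational Bayes rather than Gibbs sampling) with the same hyperparameter settings; the toy paragraphs are assumptions:

```python
# Sketch of the topic-analysis step with gensim (the paper uses GibbsLDA++,
# which does collapsed Gibbs sampling; the hyperparameters map the same way).
from gensim import corpora
from gensim.models import LdaModel

# Assumed input: each paragraph of the universal dataset as a token list.
paragraphs = [
    ["bank", "credit", "loan", "market", "stock"],
    ["patient", "treatment", "clinical", "therapy", "disease"],
    # ... ~880k paragraphs in the real Wikipedia universal dataset
]

dictionary = corpora.Dictionary(paragraphs)
corpus = [dictionary.doc2bow(p) for p in paragraphs]

# num_topics was varied over 10, 20, ..., 100, 150, 200; alpha = 0.5 and
# eta (gensim's name for beta) = 0.1 match the settings reported above.
lda = LdaModel(corpus, id2word=dictionary, num_topics=100,
               alpha=0.5, eta=0.1, passes=10)
```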
(c) Building a moderate-size labeled training dataset
• Words/terms in this dataset should be relevant to as many hidden topics as possible
(d) Doing topic inference for training and future data
• Transform the original data into a set of hidden topics (a sketch follows)
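Continuing the gensim-based sketch (the `lda` model and `dictionary` are assumed from the previous block; the paper instead runs GibbsLDA++ inference on its estimated model), topic inference and integration for one snippet might look like this:

```python
# Infer hidden topics for a short snippet and integrate them as pseudo-words.
snippet = "cheap flights hotel booking travel deals".split()
bow = dictionary.doc2bow(snippet)

# Per-document topic distribution: (topic_id, probability) pairs.
topic_dist = lda.get_document_topics(bow, minimum_probability=0.05)

# A sparse snippet is then represented by its words plus its hidden topics,
# so two snippets with no common words can still share "topicN" features.
enriched = snippet + [f"topic{t}" for t, _ in topic_dist]
```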
Snippet word co-occurrence
• This shows the sparseness of Web snippets: only a small fraction of words are shared by two or three different snippets
Shared topics among snippets after inference
• After topic inference and integration, snippets are related in a more semantic way (see the sketch below)
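To make the before/after contrast concrete, here is a small illustrative sketch (invented snippets; `lda` and `dictionary` carried over from the earlier blocks) comparing surface word overlap with topic overlap:

```python
# Two same-domain snippets that share no surface words.
s1 = "laptop notebook price review".split()
s2 = "cheap computer hardware deals".split()

def topics(tokens, cutoff=0.05):
    """Set of hidden topics assigned to a snippet above a probability cut-off."""
    return {t for t, p in lda.get_document_topics(dictionary.doc2bow(tokens))
            if p > cutoff}

shared_words = set(s1) & set(s2)         # typically empty: word-level sparseness
shared_topics = topics(s1) & topics(s2)  # often non-empty after inference
print(shared_words, shared_topics)
```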
(e) Building the classifier
• Choose from different learning methods
• Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique
• Train the classifier on the integrated training data (a sketch follows)
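The paper trains a maximum entropy classifier; the sketch below uses scikit-learn's LogisticRegression (a MaxEnt model) on topic-enriched snippets as a hedged stand-in, with purely illustrative training data:

```python
# Train a MaxEnt-style classifier on snippets enriched with topic pseudo-words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed inputs: enriched snippets (words + "topicN" tokens) and their labels.
train_docs = ["laptop notebook review topic7 topic12",
              "symptom therapy clinical topic3"]
train_labels = ["computers", "health"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Future snippets go through the same topic inference before prediction.
print(clf.predict(vec.transform(["cheap notebook hardware deals topic7"])))
```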
Evaluation
• Domain disambiguation for Web search results: classify Google search snippets into different domains, such as Business, Computers, Health, etc.
• Disease classification for medical abstracts: classify each MEDLINE medical abstract into one of five disease categories, such as those related to neoplasms and the digestive system
Domain disambiguation for Web search results
• Google snippets are collected as training and test data; the search phrases used to obtain the two sets are totally exclusive
• Results of 5-fold cross-validation on the training data
• Reduces classification error by 19% on average
Disease Classification for Medical Abstracts with MEDLINE Topics
• The proposed method requires only 4,500 training examples to reach the accuracy of a baseline that uses 22,500 training examples!
Conclusion
Advantages of the proposed framework:
• A good method for classifying sparse and previously unseen data
• Utilizes the large universal dataset, expanding the coverage of the classifier: topics from the external data cover many terms/words that do not appear in the training dataset
• Easy to implement: only a small set of labeled training examples is needed to attain high accuracy