Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University)
WWW 2008
NLG Seminar 2008/12/31, Reporter: Kai-Jie Ko
Motivation
• Many classification tasks over short segments of text and Web data, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy due to data sparseness
Previous work to overcome data sparseness
• Employ search engines to expand and enrich the context of the data
• Drawback: time consuming!
• Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources
• Drawback: these approaches use only the user-defined categories and concepts in those repositories, which is not general enough
(a) Choose a universal dataset
• Must be large and rich enough to cover the words and concepts related to the classification problem
• Wikipedia & MEDLINE are chosen in this paper
• Use topic-oriented keywords to crawl Wikipedia with a maximum hyperlink depth of 4 (see the sketch after this list)
• 240 MB
• 71,968 documents
• 882,376 paragraphs
• 60,649 vocabulary terms
• 30,492,305 words
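The paper's crawler itself is not published; the following is a minimal sketch of a depth-limited breadth-first crawl in Python, where the seed URLs, the regex-based link extraction, and the page cap are all illustrative assumptions:

```python
# Hedged sketch of a depth-limited BFS crawl (not the authors' actual crawler).
from collections import deque
from urllib.parse import urljoin
import re

import requests  # third-party: pip install requests

LINK_RE = re.compile(r'href="(/wiki/[^":#]+)"')  # internal Wikipedia links only

def crawl(seed_urls, max_depth=4, max_pages=100_000):
    """Follow hyperlinks breadth-first, stopping at max_depth (4 in the paper)."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = html
        if depth < max_depth:
            for path in LINK_RE.findall(html):
                link = urljoin("https://en.wikipedia.org", path)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages

# Seeds would be pages found via topic-oriented keywords, e.g.:
# crawl(["https://en.wikipedia.org/wiki/Business"], max_depth=4)
```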
• Ohsumed: a test collection of medical journal abstracts to assist IR research
• 156 MB
• 233,442 abstracts
(b) Doing topic analysis for the universal dataset
• Use GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling
• The number of topics ranges from 10, 20, ... to 100, 150, and 200
• The hyperparameters alpha and beta were set to 0.5 and 0.1, respectively
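The paper drives GibbsLDA++ from the command line; as a rough Python stand-in, here is a hedged sketch using gensim's LdaModel (which uses online variational Bayes rather than Gibbs sampling) with the same hyperparameter settings; the toy paragraphs are assumptions:

```python
# Sketch of the topic-analysis step with gensim (the paper uses GibbsLDA++,
# which does collapsed Gibbs sampling; the hyperparameters map the same way).
from gensim import corpora
from gensim.models import LdaModel

# Assumed input: each paragraph of the universal dataset as a token list.
paragraphs = [
    ["bank", "credit", "loan", "market", "stock"],
    ["patient", "treatment", "clinical", "therapy", "disease"],
    # ... ~880k paragraphs in the real Wikipedia universal dataset
]

dictionary = corpora.Dictionary(paragraphs)
corpus = [dictionary.doc2bow(p) for p in paragraphs]

# num_topics was varied over 10, 20, ..., 100, 150, 200; alpha = 0.5 and
# eta (gensim's name for beta) = 0.1 match the settings reported above.
lda = LdaModel(corpus, id2word=dictionary, num_topics=100,
               alpha=0.5, eta=0.1, passes=10)
```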
(c) Building a moderate-size labeled training dataset
• Words/terms in this dataset should be relevant to as many hidden topics as possible
(d) Doing topic inference for training and future data
• Transform the original data into a set of hidden topics (a sketch follows)
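Continuing the gensim-based sketch (the `lda` model and `dictionary` are assumed from the previous block; the paper instead runs GibbsLDA++ inference on its estimated model), topic inference and integration for one snippet might look like this:

```python
# Infer hidden topics for a short snippet and integrate them as pseudo-words.
snippet = "cheap flights hotel booking travel deals".split()
bow = dictionary.doc2bow(snippet)

# Per-document topic distribution: (topic_id, probability) pairs.
topic_dist = lda.get_document_topics(bow, minimum_probability=0.05)

# A sparse snippet is then represented by its words plus its hidden topics,
# so two snippets with no common words can still share "topicN" features.
enriched = snippet + [f"topic{t}" for t, _ in topic_dist]
```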
Snippet word co-occurrence
• This shows the sparseness of Web snippets: only a small fraction of words are shared by two or three different snippets
Shared topics among snippets after inference
• After topic inference and integration, snippets are related in a more semantic way (see the sketch below)
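To make the before/after contrast concrete, here is a small illustrative sketch (invented snippets; `lda` and `dictionary` carried over from the earlier blocks) comparing surface word overlap with topic overlap:

```python
# Two same-domain snippets that share no surface words.
s1 = "laptop notebook price review".split()
s2 = "cheap computer hardware deals".split()

def topics(tokens, cutoff=0.05):
    """Set of hidden topics assigned to a snippet above a probability cut-off."""
    return {t for t, p in lda.get_document_topics(dictionary.doc2bow(tokens))
            if p > cutoff}

shared_words = set(s1) & set(s2)         # typically empty: word-level sparseness
shared_topics = topics(s1) & topics(s2)  # often non-empty after inference
print(shared_words, shared_topics)
```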
(e) Building the classifier
• Choose from different learning methods
• Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique
• Train the classifier on the integrated training data (a sketch follows)
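The paper trains a maximum entropy classifier; the sketch below uses scikit-learn's LogisticRegression (a MaxEnt model) on topic-enriched snippets as a hedged stand-in, with purely illustrative training data:

```python
# Train a MaxEnt-style classifier on snippets enriched with topic pseudo-words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed inputs: enriched snippets (words + "topicN" tokens) and their labels.
train_docs = ["laptop notebook review topic7 topic12",
              "symptom therapy clinical topic3"]
train_labels = ["computers", "health"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Future snippets go through the same topic inference before prediction.
print(clf.predict(vec.transform(["cheap notebook hardware deals topic7"])))
```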
Evaluation
• Domain disambiguation for Web search results: classify Google search snippets into different domains, such as Business, Computers, Health, etc.
• Disease classification for medical abstracts: classify each MEDLINE medical abstract into one of five disease categories, such as those related to neoplasms and the digestive system
Domain disambiguation for Web search results
• Google snippets are collected as training and test data; the search phrases used to obtain the two sets are totally exclusive
• Results of 5-fold cross-validation on the training data
• Reduces classification error by 19% on average
Disease Classification for Medical Abstracts with MEDLINE Topics
• The proposed method requires only 4,500 training examples to reach the accuracy of a baseline that uses 22,500 training examples!
Conclusion
Advantages of the proposed framework:
• A good method for classifying sparse and previously unseen data
• Utilizes the large universal dataset, expanding the coverage of the classifier: topics from the external data cover many terms/words that do not appear in the training dataset
• Easy to implement: only a small set of labeled training examples is needed to attain high accuracy