Text Classification Using Stochastic Keyword Generation. Cong Li, Ji-Rong Wen and Hang Li. Microsoft Research Asia. August 22nd, 2003.
Outline • Introduction • Text Classification Using Stochastic Keyword Generation • Experimental Results • Conclusion and Future Work
Introduction • Supervised Text Classification • Question: how can additional data be used in training to improve performance? • New Text Classification Problem • Summaries of the texts, which are more indicative of their contents, are available in training • Note: summaries are not available at classification time • Example: classification at a help desk
Example • Email • When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Categories • Empty Outlook Message • Cannot Open Word File • Summary • receive emails; some emails have no subject and message body
New Text Classification Problem • Spaces • Users’ emails: space X • Categories: space Y • Engineers’ summaries (for training): space S • Assumption • Summaries are much easier to classify than the original emails
Text Classification Using SKG • Conventional text classification: email x ∈ X → classification → category y ∈ Y • Text classification using SKG: email x ∈ X → SKG → probability vector of x → classification → category y ∈ Y • Example email (x ∈ X): When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Example probability vector: (email 0.75, receive 0.68, subject 0.45, body 0.45, …) • Example category (y ∈ Y): Empty Outlook Message
Stochastic Keyword Generation • Generating Keywords from a Given Text • Stochastic Keyword Generation (SKG) • Generate keywords and their conditional probabilities of occurrence given the text • Example • Text: When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Generated keywords: emails 0.75, receive 0.68, subject 0.45, body 0.45
SKG Model • [Figure: new text x]
Model for Each Keyword • [Figure: new text x]
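The per-keyword idea can be sketched as one binary model per summary keyword, each estimating the probability that the keyword would occur in the summary of a given text. This is a minimal sketch under stated assumptions: the slides do not say which model is used for each keyword, so a two-class naive Bayes with Laplace smoothing is assumed here, and the function names (`train_keyword_model`, `keyword_probability`, `train_skg`, `generate_keywords`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_keyword_model(pairs, keyword):
    """Collect word counts for texts whose summary does / does not contain `keyword`.

    `pairs` is a list of (text_tokens, summary_tokens). Assumption: naive Bayes
    is used per keyword; the slides do not specify the model.
    """
    vocab = {w for text, _ in pairs for w in text}
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, summary in pairs:
        label = keyword in summary
        docs[label] += 1
        counts[label].update(text)
    return {"vocab": vocab, "counts": counts, "docs": docs}

def keyword_probability(model, text):
    """Estimate P(keyword occurs in the summary | text) from the fitted counts."""
    vocab, counts, docs = model["vocab"], model["counts"], model["docs"]
    n_docs = docs[True] + docs[False]
    log_score = {}
    for label in (True, False):
        total = sum(counts[label].values())
        # Laplace-smoothed class prior
        score = math.log((docs[label] + 1) / (n_docs + 2))
        for w in text:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Normalise the two log scores into a probability
    top = max(log_score.values())
    pos = math.exp(log_score[True] - top)
    neg = math.exp(log_score[False] - top)
    return pos / (pos + neg)

def train_skg(pairs):
    """Fit one model for every word that appears in any training summary."""
    keywords = sorted({w for _, summary in pairs for w in summary})
    return {k: train_keyword_model(pairs, k) for k in keywords}

def generate_keywords(models, text):
    """Map a new text to {keyword: estimated probability}."""
    return {k: keyword_probability(m, text) for k, m in models.items()}

# Toy (text, summary) pairs, invented for illustration only
pairs = [
    (["email", "blank", "message"], ["email", "body"]),
    (["word", "file", "open"], ["word", "file"]),
    (["email", "notice", "blank"], ["email", "subject"]),
]
models = train_skg(pairs)
probs = generate_keywords(models, ["email", "blank"])
```

On this toy data the new text shares words with the two email-related training texts, so the keyword "email" receives a higher probability than "word". Summaries are needed only in `train_skg`; `generate_keywords` takes the text alone, matching the setting where summaries are unavailable at classification time.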
Learning Using SKG • [Figure: SKG → classification]
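The second stage of this pipeline, classification on top of keyword probability vectors, might look like the sketch below. It assumes the probability vectors have already been produced by some SKG model and trains a plain one-vs-rest perceptron with a fixed margin. This is a stand-in chosen for brevity and is not claimed to be PAM as defined in Li et al. 2002 or the SVM used in the experiments. The function name, the margin value, and the probabilities for "word" and "file" are made up for illustration; the other four values in the first vector come from the example slide.

```python
def train_margin_perceptron(vectors, labels, margin=0.1, epochs=20):
    """One-vs-rest perceptron that updates whenever a score is within `margin`.

    `vectors` are dicts {keyword: probability}; `labels` are category names.
    Returns a function mapping a probability vector to a predicted category.
    """
    classes = sorted(set(labels))
    keys = sorted({k for v in vectors for k in v})
    weights = {c: {k: 0.0 for k in keys} for c in classes}
    bias = {c: 0.0 for c in classes}

    def score(c, v):
        # Keywords unseen during training contribute nothing
        return bias[c] + sum(weights[c][k] * p for k, p in v.items() if k in weights[c])

    for _ in range(epochs):
        for v, y in zip(vectors, labels):
            for c in classes:
                target = 1.0 if c == y else -1.0
                # Update on mistakes and on correct answers that lack the margin
                if target * score(c, v) <= margin:
                    for k, p in v.items():
                        weights[c][k] += target * p
                    bias[c] += target

    return lambda v: max(classes, key=lambda c: score(c, v))

# Keyword probability vectors; "word" and "file" values are invented
vectors = [
    {"email": 0.75, "receive": 0.68, "subject": 0.45, "body": 0.45, "word": 0.05, "file": 0.10},
    {"email": 0.10, "receive": 0.05, "subject": 0.05, "body": 0.10, "word": 0.80, "file": 0.70},
]
labels = ["Empty Outlook Message", "Cannot Open Word File"]

predict = train_margin_perceptron(vectors, labels)
predicted = [predict(v) for v in vectors]
```

The point of the design is that the classifier never sees raw words: its features are the keyword probabilities, so any classifier that accepts real-valued features could be substituted here.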
Data in Experiments • Data of the Help Desk of Microsoft • 2517 texts from 52 categories • About 10000 unique words in texts • About 1500 unique words in summaries • Conducted stopword removal, but not stemming • Training/Test Split • 5-fold cross validation
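A 5-fold split over the 2517 texts can be sketched as follows. The slides state only that 5-fold cross validation was used; the shuffling, the seed, and the absence of per-category stratification are assumptions of this sketch, and `k_fold_indices` is a hypothetical helper name.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(2517, k=5))
fold_sizes = [len(test) for _, test in splits]
```

Each text lands in exactly one test fold, so every text is used for testing once and for training four times. With 52 categories and 2517 texts, some categories may be small, which is one reason a stratified split could be preferable in practice.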
Experimental Settings • Classifiers • Linear SVM (Platt 1998; Dumais et al. 1998) • Perceptron algorithm with margins (PAM) (Li et al. 2002) • Methods • Text classification using SKG • Methods for comparison: • Prior • Texts for training • Summaries for training • Texts plus summaries for training • Deterministic keyword generation (DKG)
Discussion • Email (x ∈ X): When getting emails I get a notice that an email has been received but when I try to view the message it is blank. I have also tried to run the repair program off the install disk but that it did not take care of the problem. • Summary (s ∈ S): receive emails; some emails have no subject and message body • SKG → probability vector of x: (email 0.75, receive 0.68, subject 0.45, body 0.45, …) • Classification → category (y ∈ Y): Empty Outlook Message
Conclusion and Future Work • Conclusion • Text classification using SKG significantly outperforms the methods that do not use it • Future Work • Theoretical analysis of the problem and the proposed method • Application to different settings