Text Categorization With Support Vector Machines: Learning With Many Relevant Features

By Thorsten Joachims. Presented by Meghneel Gore.

Presentation Transcript


  1. Text Categorization With Support Vector Machines: Learning With Many Relevant Features By Thorsten Joachims Presented By Meghneel Gore

  2. Goal of Text Categorization • Classify documents into a number of pre-defined categories. • Documents can be in multiple categories • Documents can be in none of the categories

  3. Applications of Text Categorization • Categorization of news stories for online retrieval • Finding interesting information from the WWW • Guiding a user's search through hypertext

  4. Representation of Text • Removal of stop words • Reduction of each word to its stem • Preparation of the feature vector

  5. Representation of Text [Figure: a document mapped to its document vector of stem counts: 2 Comput, 1 Process, 2 Buy, 3 Memory, ...]
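
The three preprocessing steps on slide 4 and the document vector on slide 5 can be sketched roughly as follows. This is only an illustration, not the pipeline used in the paper: the stop-word list, the suffix list, and the `crude_stem` helper are hypothetical stand-ins (a real system would use a full stop-word list and a proper stemmer such as Porter's), and the sample sentence is made up so that the counts match the slide.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "for", "is", "more", "of", "the", "this", "to"}

# Hypothetical suffix list for a crude stemmer (assumption, not a real stemming algorithm).
SUFFIXES = ("ation", "ing", "ers", "er", "es", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text: str) -> Counter:
    """Map raw text to a sparse vector of stem -> term frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())                     # tokenize
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]   # drop stop words, then stem
    return Counter(stems)                                            # count occurrences per stem


doc = "This computer is processing. Buy memory, buy more memory: memory for the computers."
vector = document_vector(doc)
print(vector)
```

Storing the vector as a `Counter` keeps only the non-zero entries, which fits the sparse document vectors mentioned later in the talk.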

  6. What's Next... • Appropriateness of support vector machines for this application • Support vector machine theory • Conventional learning methods • Experiments • Results • Conclusions

  7. Why SVMs? • High dimensional input space • Few irrelevant features • Sparse document vectors • Text categorization problems are linearly separable

  8. Support Vector Machines Visualization of a Support Vector Machine

  9. Support Vector Machines • Structural risk minimization

  10. Support Vector Machines • We define a structure of hypothesis spaces H_i such that their respective VC dimensions d_i increase

  11. Support Vector Machines • Lemma [Vapnik, 1982]: Consider hyperplanes as hypotheses

  12. Support Vector Machines • If all example vectors are contained in a hypersphere of radius R and it is required that

  13. Support Vector Machines • Then this set of hyperplanes has a VC dimension d bounded by

  14. Support Vector Machines • Minimize • Such that
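
The formulas for slides 11 to 14 are not present in this transcript. For orientation, the lemma and the resulting training problem read roughly as follows in Joachims' paper; the symbols w, b, d_i, y_i, A and n are taken from that formulation rather than from the slides, with n understood here as the dimensionality of the input space and y_i as labels in {-1, +1}.

```latex
% Lemma (attributed to Vapnik, 1982): hyperplanes as hypotheses
\[
h(\vec{d}) = \operatorname{sign}\{\vec{w}\cdot\vec{d} + b\}
\]
% If all example vectors lie in a ball of radius R and it is required that
\[
|\vec{w}\cdot\vec{d}_i + b| \ge 1 \quad \text{for all examples } \vec{d}_i,
\qquad \|\vec{w}\| = A,
\]
% then the VC dimension d of this set of hyperplanes is bounded by
\[
d \le \min\left(\left[R^2 A^2\right],\, n\right) + 1 .
\]
% Training problem for linearly separable data (slide 14):
\[
\text{minimize } \|\vec{w}\|
\quad \text{such that} \quad
\forall i:\; y_i\,(\vec{w}\cdot\vec{d}_i + b) \ge 1 .
\]
```

Read together, the bound depends on the norm of w rather than on the number of features, which is why minimizing that norm controls capacity even in a high-dimensional input space.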

  15. Conventional Learning Methods • Naïve Bayes classifier • Rocchio algorithm • K-nearest Neighbors • Decision tree classifier

  16. Naïve Bayes Classifier • Consider a document vector with attributes a1, a2… an with target values v • Bayesian approach:

  17. Naïve Bayes Classifier • We can rewrite that using Bayes theorem as

  18. Naïve Bayes Classifier • Naïve Bayes method assumes that the attributes are conditionally independent given the target value
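
The equations for slides 16 to 18 are missing from this transcript. A standard textbook form of the derivation they describe is sketched below; the set V of possible target values and the index j are notation assumed here, while a_1, ..., a_n and v follow slide 16.

```latex
% Bayesian approach: pick the most probable target value given the attributes
\[
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
\]
% Rewritten with Bayes' theorem; the denominator does not depend on v_j
\[
v_{MAP} = \arg\max_{v_j \in V}
\frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)}
= \arg\max_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)
\]
% Naive Bayes: attributes are conditionally independent given the target value
\[
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\]
```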

  19. Experiments • Datasets • Performance measures • Results

  20. Datasets • Reuters-21578 dataset: 9,603 training documents, 3,299 test documents • Ohsumed corpus: 10,000 training documents, 10,000 test documents

  21. Performance Measures • Precision: probability that a document predicted to be in class ‘x’ truly belongs to that class • Recall: probability that a document belonging to class ‘x’ is classified into that class • Precision/recall break-even point
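
The measures above can be sketched in a few lines for a single class. The function names and the toy scores are hypothetical. The break-even computation uses a simple shortcut that is an assumption here, not necessarily the procedure used in the paper: if exactly as many documents are predicted positive as are truly positive, precision and recall share the same denominator and therefore coincide.

```python
def precision_recall(predicted, actual):
    """Precision and recall for one class.

    predicted, actual: parallel sequences of booleans (one entry per document).
    """
    true_positives = sum(p and a for p, a in zip(predicted, actual))
    n_predicted = sum(predicted)   # documents assigned to the class
    n_actual = sum(actual)         # documents that truly belong to the class
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    return precision, recall


def breakeven_point(scores, actual):
    """Precision and recall when the top-k scored documents are predicted
    positive, with k equal to the number of truly positive documents."""
    k = sum(actual)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    predicted = [i in top_k for i in range(len(scores))]
    return precision_recall(predicted, actual)


# Toy classifier scores and true labels for five documents (made-up data).
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
actual = [True, False, True, True, False]
p, r = breakeven_point(scores, actual)
print(p, r)
```

Reporting a single break-even value avoids having to pick an arbitrary decision threshold when comparing classifiers.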

  22. Results Precision/recall break-even point on Ohsumed dataset

  23. Results Precision/recall break-even point on Reuters dataset

  24. Conclusions • Introduces SVMs for text categorization • Theoretical and empirical evidence that SVMs are well suited for text categorization • Consistent improvement in accuracy over other methods
