Text Categorization With Support Vector Machines: Learning With Many Relevant Features. By Thorsten Joachims. Presented by Meghneel Gore.
Goal of Text Categorization • Classify documents into a number of pre-defined categories. • Documents can be in multiple categories • Documents can be in none of the categories
Applications of Text Categorization • Categorization of news stories for online retrieval • Finding interesting information from the WWW • Guiding a user's search through hypertext
Representation of Text • Removal of stop words • Reduction of each word to its stem • Preparation of the feature vector
Representation of Text [Figure: a document mapped to its document vector of stemmed-term counts, e.g. 2 comput, 1 process, 2 buy, 3 memory, ...]
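The three preparation steps (stop-word removal, stemming, building the feature vector) can be sketched roughly as follows. This is a minimal illustration, not the preprocessing used in the paper: the stop-word list, the suffix list, and the helper names `crude_stem` and `document_vector` are all assumptions made for the example, and the suffix stripping only stands in for a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "with", "more"}

# Crude suffix list standing in for a real stemming algorithm (hypothetical).
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    if word.endswith("ss"):
        return word  # leave words such as "process" untouched
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text):
    """Map raw text to a sparse vector of stem counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


doc = "Computers process memory. Buying a computer with more memory is computing."
vector = document_vector(doc)
print(vector)
```

Only the stems that actually occur are stored, which mirrors the sparsity of document vectors: most entries of the full vocabulary-sized vector are zero.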
What's Next... • Appropriateness of support vector machines for this application • Support vector machine theory • Conventional learning methods • Experiments • Results • Conclusions
Why SVMs? • High-dimensional input space • Few irrelevant features • Sparse document vectors • Most text categorization problems are linearly separable
Support Vector Machines Visualization of a Support Vector Machine
Support Vector Machines • Structural risk minimization
Support Vector Machines • We define a structure of hypothesis spaces $H_i$ such that their respective VC dimensions $d_i$ increase: $H_1 \subset H_2 \subset \dots \subset H_i \subset \dots$ with $d_i \le d_{i+1}$ for all $i$
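Support Vector Machines • Lemma [Vapnik, 1982] Consider hyperplanes $h(\vec{d}) = \mathrm{sign}(\vec{w} \cdot \vec{d} + b)$ as hypotheses
Support Vector Machines If all example vectors $\vec{d}_i$ are contained in a hypersphere of radius $R$ and it is required that $|\vec{w} \cdot \vec{d}_i + b| \ge 1$ for all examples, with $\|\vec{w}\| = A$
Support Vector Machines • Then this set of hyperplanes has a VC dimension $d$ bounded by $d \le \min([R^2 A^2], n) + 1$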
Support Vector Machines • Minimize $\|\vec{w}\|$ • Such that $y_i (\vec{w} \cdot \vec{d}_i + b) \ge 1$ for all $i$
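In the usual hard-margin formulation, a hyperplane (w, b) is admissible when y_i (w · d_i + b) ≥ 1 holds for every training example, and among admissible hyperplanes the one with the smallest norm ||w|| is preferred, since its geometric margin 1/||w|| is largest. The sketch below only checks the constraint and compares norms for a few hand-picked candidates; it is not a solver, and the toy data, the candidate hyperplanes, and the helper names are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(w):
    """Euclidean norm ||w||."""
    return math.sqrt(dot(w, w))


def satisfies_margin_constraints(w, b, examples):
    """True if y * (w . d + b) >= 1 for every (d, y) pair."""
    return all(y * (dot(w, d) + b) >= 1 for d, y in examples)


# Toy, linearly separable training set: (document vector, label in {+1, -1}).
examples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((1.0, -1.0), -1)]

# Hand-picked candidate hyperplanes (w, b); purely illustrative.
candidates = {
    "wide": ((0.5, 0.5), -1.0),
    "narrow": ((1.0, 1.0), -2.0),
    "too_flat": ((0.25, 0.25), -0.5),
}

feasible = {
    name: norm(w)
    for name, (w, b) in candidates.items()
    if satisfies_margin_constraints(w, b, examples)
}
best = min(feasible, key=feasible.get)
print(best, "has margin", 1.0 / feasible[best])
```

Here `too_flat` violates the constraint, and of the two admissible candidates `wide` has the smaller norm and therefore the larger margin. An actual SVM trainer searches over all (w, b) with a quadratic programming solver rather than over a fixed list.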
Conventional Learning Methods • Naïve Bayes classifier • Rocchio algorithm • k-nearest neighbors • Decision tree classifier
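Naïve Bayes Classifier • Consider a document vector with attributes $a_1, a_2, \dots, a_n$ and target values $v_j \in V$ • Bayesian approach: $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n)$
Naïve Bayes Classifier • We can rewrite that using Bayes' theorem as $v_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j) \, P(v_j)$ (the denominator $P(a_1, a_2, \dots, a_n)$ does not depend on $v_j$ and can be dropped)
Naïve Bayes Classifier • The naïve Bayes method assumes that the attributes are conditionally independent given the target value: $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)$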
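Under the independence assumption, classification reduces to picking the target value that maximizes the prior times the product of per-attribute probabilities. A minimal sketch for word tokens is shown below. It works in log space to avoid underflow and uses add-one (Laplace) smoothing, which is an assumption of this example rather than something stated on the slides; the toy training data and function names are likewise illustrative.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (tokens, label). Returns the counts needed to classify."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the label maximizing log P(v) + sum_i log P(a_i | v)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothed estimate of P(token | label).
            p = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (invented for illustration).
training = [
    (["wheat", "farm", "export"], "grain"),
    (["wheat", "harvest"], "grain"),
    (["bank", "interest", "rate"], "money"),
    (["rate", "bank", "loan"], "money"),
]
model = train_naive_bayes(training)
print(classify(["wheat", "export", "rate"], model))
print(classify(["bank", "loan"], model))
```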
Experiments • Datasets • Performance measures • Results
Datasets • Reuters-21578 dataset • 9603 training examples • 3299 testing documents • Ohsumed Corpus • 10000 training documents • 10000 testing examples
Performance Measures • Precision • Probability that a document predicted to be in class ‘x’ truly belongs to that class • Recall • Probability that a document belonging to class ‘x’ is classified into that class • Precision/recall breakeven point
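The two measures and the breakeven point can be computed as in the sketch below. The breakeven routine uses one simple approach, assumed here for illustration: rank the documents by classifier score and predict as positive exactly as many documents as are truly positive, at which point precision and recall coincide (ties in the scores are broken by sort order). The scores and labels are invented, and the function names are hypothetical.

```python
def precision_recall(predicted, actual):
    """predicted, actual: equal-length lists of booleans for one class."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    predicted_pos = sum(1 for p in predicted if p)
    actual_pos = sum(1 for a in actual if a)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def breakeven_point(scores, actual):
    """Precision and recall when the top-k scored documents are predicted positive,
    with k equal to the number of truly positive documents."""
    k = sum(1 for a in actual if a)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    predicted = [i in top_k for i in range(len(scores))]
    return precision_recall(predicted, actual)


# Invented classifier scores and true labels for six documents.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
actual = [True, False, True, True, False, False]

# Fixed threshold: four documents predicted positive, three of them correct.
at_threshold = precision_recall([s >= 0.35 for s in scores], actual)
print(at_threshold)

# Breakeven: three documents predicted positive, two of them correct.
print(breakeven_point(scores, actual))
```

Lowering the threshold raises recall at the cost of precision, and the breakeven point summarizes that trade-off in a single number.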
Results Precision/recall break-even point on Ohsumed dataset
Results Precision/recall break-even point on Reuters dataset
Conclusions • Introduces SVMs for text categorization • Theoretical and empirical evidence that SVMs are well suited for text categorization • Consistent improvement in accuracy over other methods