Support Vector Machines Classification with A Very Large-scale Taxonomy Advisor: Dr. Hsu Presenter: Chien-Shing Chen Author: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma SIGKDD, 2004
Outline • Motivation • Objective • Introduction • Dataset characteristics • Complexity Analysis • Effectiveness Analysis • Experimental Settings • Conclusions • Personal Opinion
Motivation • Very large-scale classification taxonomies: hundreds of thousands of categories, deep hierarchies, and a skewed category distribution over documents • It remains an open question whether state-of-the-art text categorization technologies scale to such taxonomies • This work evaluates SVMs for web-page classification over the full taxonomy of the Yahoo! categories
Objective • Examine both scalability and effectiveness • A data analysis of the Yahoo! taxonomy • Development of a scalable system for large-scale text categorization • Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical (flat) settings for classification • Threshold tuning algorithms, compared with respect to time complexity and accuracy of SVMs
Introduction • TC (text categorization) methods: SVMs, kNN, Naive Bayes, … • In recent years, TC problems have grown larger and larger in scale • This paper answers the scaling question from the viewpoints of scalability and effectiveness
SVM • Two variants: flat SVMs and hierarchical SVMs • Hierarchical SVMs exploit the structure of the taxonomy tree; flat SVMs ignore it
SVM • Find the optimal separating hyperplane between the two classes by maximizing the margin between the classes' closest points (the support vectors)
SVM • Multi-class classification • Basically, SVMs can only solve binary classification problems, so a multi-class problem is decomposed into binary sub-classifiers • One-against-all: N two-class classifiers (true class vs. all other classes) • One-against-one: N(N−1)/2 pairwise classifiers
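The two decomposition strategies above differ only in how many binary classifiers they require; a minimal sketch (function names are illustrative, not from the paper):

```python
# Number of binary sub-classifiers needed to decompose an N-class
# problem under each strategy (illustrative helper names).

def one_against_all(n_classes: int) -> int:
    """One classifier per class: the class vs. all remaining classes."""
    return n_classes

def one_against_one(n_classes: int) -> int:
    """One classifier per unordered pair of classes."""
    return n_classes * (n_classes - 1) // 2
```

At the scale of this dataset (292,216 categories), one-against-one would need over 42 billion classifiers, which is why one-against-all (one-against-rest) is the practical choice for the flat baseline.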
SVM • Let {(x_1, y_1), …, (x_n, y_n)}, y_i ∈ {−1, +1}, be a set of n labeled training documents • Linear discriminant function: f(x) = w·x + b • Corresponding classification function: h(x) = sign(f(x)) • Margin of a weight vector w on example (x_i, y_i): γ_i = y_i (w·x_i + b) / ||w||
SVM • Optimal separation: minimize (1/2)||w||² subject to y_i (w·x_i + b) ≥ 1 for all i • Soft-margin formulation: minimize (1/2)||w||² + C Σ_i ξ_i subject to y_i (w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, extended to the multi-class case via the decompositions above
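As a concrete illustration of the soft-margin objective, here is a minimal stochastic sub-gradient (Pegasos-style) trainer for a linear SVM in pure Python; this is a sketch for intuition, not the solver used in the paper, and it omits the bias term b for brevity:

```python
# Minimal Pegasos-style trainer: stochastic sub-gradient descent on
# (lam/2)||w||^2 + average hinge loss. Illustrative only.
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """X: list of feature lists; y: labels in {-1, +1}. Returns weights w."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Sub-gradient of the regularizer shrinks w every step
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:  # example violates the margin: move toward it
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    """Classification function h(x) = sign(w . x), with sign(0) = +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

On a toy linearly separable set such as X = [[1, 2], [2, 3], [-1, -2], [-2, -1]] with y = [1, 1, -1, -1], the learned w separates the two classes through the origin.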
Dataset: first characteristic • The full domain of the Yahoo! Directory • 292,216 categories • 792,601 documents
Dataset: second characteristic • Over 76% of the Yahoo! categories have fewer than 5 labeled documents • The proportion of such "rare categories" increases at deeper hierarchy levels • 36% of categories at deep levels are rare categories
Dataset: third characteristic • Many documents have multiple labels • A document has 2.23 labels on average • The largest number of labels for a single document is 31
Dataset • Split the Yahoo! Directory into a training set and a testing set with a ratio of 7:3 • Remove categories containing only one labeled document • Result: 132,199 categories • 492,617 training documents and 275,364 testing documents
Complexity and Effectiveness • Flat SVMs, with the one-against-rest strategy • N: the number of training documents • M: the number of categories • T: the average training time per SVM model (each of the M binary models is trained on all N documents) • Total training cost ≈ M · T
Complexity and Effectiveness • Hierarchical SVMs • m_i: the number of categories defined at the i-th level • j: the size-based rank of a category • n_ij: the number of training documents for the j-th category at the i-th level • n_i1: the number of training documents for the most common category at the i-th level • θ_i: a level-specific parameter
Complexity and Effectiveness • The power law n_ij ≈ n_i1 · j^(−θ_i) was used to approximate the number of training documents per category at the i-th level
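The power-law size model above can be sketched directly; n_i1 and θ_i are level-specific inputs that would be estimated from the data:

```python
# Hedged sketch of the level-wise power-law model: the j-th largest
# category at level i is estimated to have n_i1 * j**(-theta_i)
# training documents (theta_i is a level-specific parameter).

def estimated_size(n_i1: float, j: int, theta_i: float) -> float:
    """Estimated training-set size of the j-th largest category."""
    return n_i1 * j ** (-theta_i)

def estimated_level_total(n_i1: float, m_i: int, theta_i: float) -> float:
    """Sum of estimated category sizes over the m_i categories at level i."""
    return sum(estimated_size(n_i1, j, theta_i) for j in range(1, m_i + 1))
```

Summing these per-level estimates is what lets the training cost of the hierarchical SVMs be bounded analytically.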
Complexity and Effectiveness • For the testing phase of hierarchical SVMs • Pachinko-machine search: start from the root; at each step, open the most likely subcategory of the current category, until a leaf is reached
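The top-down search can be sketched as a simple greedy descent; `tree` and `score` below are illustrative stand-ins for the taxonomy and the per-category classifier confidence (a real system would also apply a threshold at each level):

```python
# Minimal sketch of Pachinko-machine search: starting at the root,
# repeatedly descend into the highest-scoring child until a leaf.

def pachinko_search(tree, score, node="root"):
    """tree: dict mapping a category to its list of child categories.
    score: function(category) -> classifier confidence for the document."""
    children = tree.get(node, [])
    while children:                      # descend until a leaf
        node = max(children, key=score)  # open the most likely child
        children = tree.get(node, [])
    return node
```

For example, with tree = {"root": ["Arts", "Science"], "Science": ["Math", "CS"]} and scores favoring Science then CS, the search returns the leaf "CS" after evaluating only the classifiers along one root-to-leaf path, which is what makes testing scale on deep taxonomies.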
Complexity of SVM Classification with Threshold Tuning • SCut: tune a threshold for each category so that the classifier's performance on that category is optimal, then fix the per-category thresholds when applying the classifier to new documents in the test set • RCut: for each document, sort categories by score and assign YES to each of the t top-ranking categories
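A minimal sketch of the two thresholding rules, assuming the per-category scores for one document are given as a dict (for SCut, the per-category thresholds would come from prior tuning):

```python
# RCut: assign YES to the t top-ranking categories for one document.
def rcut(scores: dict, t: int) -> set:
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])

# SCut: assign YES to every category whose score clears that
# category's own pre-tuned threshold (thresholds are then fixed).
def scut(scores: dict, thresholds: dict) -> set:
    return {c for c, s in scores.items() if s >= thresholds[c]}
```

RCut fixes the number of labels per document (useful given the 2.23-label average above), while SCut adapts per category but requires tuning one threshold per category, which is costly with 132,199 categories.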
Effectiveness Analysis • Compared to the scalability analysis, classification effectiveness is not as clear or predictable; it can be affected by many other factors • Potential problems for SVMs: noisy and imbalanced training data • We cannot expect the performance of hierarchical SVMs to be very good in this setting
Experimental Results • 10 machines, each with four 3GHz CPUs and 4 GB of memory
Conclusions • This work scales text categorization algorithms to very large problems, especially large-scale Web taxonomies
Conclusions • Drawback: lower performance at deep levels • Applications: combine SVMs with a concept hierarchy tree; apply to text or other domains; Pachinko-machine search • Future work: learn SVM kernels for this task?