Support Vector Machines Classification with A Very Large-scale Taxonomy Advisor: Dr. Hsu Presenter: Chien-Shing Chen Author: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma SIGKDD, 2004
Outline • Motivation • Objective • Introduction • Dataset characteristics • Complexity Analysis • Effectiveness Analysis • Experimental Settings • Conclusions • Personal Opinion
Motivation • Very large-scale classification taxonomies: hundreds of thousands of categories, deep hierarchies, and a skewed category distribution over documents • It remains an open question whether state-of-the-art text categorization technologies scale to such taxonomies • This work evaluates SVMs for web-page classification over the full taxonomy of the Yahoo! categories
Objective • Examine both scalability and effectiveness • A data analysis of the Yahoo! taxonomy • Development of a scalable system for large-scale text categorization • Theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical (flat) settings for classification • Threshold tuning algorithms, compared with respect to time complexity and accuracy of SVMs
Introduction • TC (text categorization) methods: SVMs, kNN, Naive Bayes, … • In recent years, TC problems have grown larger and larger in scale • This paper answers the scaling question from the viewpoints of scalability and effectiveness
SVM • Two variants: flat SVMs and hierarchical SVMs • Hierarchical SVMs exploit the structure of the taxonomy tree; flat SVMs ignore it
SVM • Find the optimal separating hyperplane between the two classes by maximizing the margin between the classes' closest points (the support vectors)
SVM • Multi-class classification • Basically, SVMs can only solve binary classification problems, so a multi-class problem is decomposed into binary sub-classifiers • One-against-all: N two-class classifiers (true class vs. all other classes) • One-against-one: N(N−1)/2 pairwise classifiers
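The two decomposition strategies above differ only in how many binary classifiers they require; a minimal sketch (function names are illustrative, not from the paper):

```python
# Number of binary sub-classifiers needed to decompose an N-class
# problem under each strategy (illustrative helper names).

def one_against_all(n_classes: int) -> int:
    """One classifier per class: the class vs. all remaining classes."""
    return n_classes

def one_against_one(n_classes: int) -> int:
    """One classifier per unordered pair of classes."""
    return n_classes * (n_classes - 1) // 2
```

At the scale of this dataset (292,216 categories), one-against-one would need over 42 billion classifiers, which is why one-against-all (one-against-rest) is the practical choice for the flat baseline.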
SVM • Let {(x_1, y_1), …, (x_n, y_n)}, y_i ∈ {−1, +1}, be a set of n labeled training documents • Linear discriminant function: f(x) = w·x + b • Corresponding classification function: h(x) = sign(f(x)) • Margin of a weight vector w on example (x_i, y_i): γ_i = y_i (w·x_i + b) / ||w||
SVM • Optimal separation: minimize (1/2)||w||² subject to y_i (w·x_i + b) ≥ 1 for all i • Soft-margin formulation: minimize (1/2)||w||² + C Σ_i ξ_i subject to y_i (w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, extended to the multi-class case via the decompositions above
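As a concrete illustration of the soft-margin objective, here is a minimal stochastic sub-gradient (Pegasos-style) trainer for a linear SVM in pure Python; this is a sketch for intuition, not the solver used in the paper, and it omits the bias term b for brevity:

```python
# Minimal Pegasos-style trainer: stochastic sub-gradient descent on
# (lam/2)||w||^2 + average hinge loss. Illustrative only.
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """X: list of feature lists; y: labels in {-1, +1}. Returns weights w."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Sub-gradient of the regularizer shrinks w every step
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:  # example violates the margin: move toward it
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    """Classification function h(x) = sign(w . x), with sign(0) = +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

On a toy linearly separable set such as X = [[1, 2], [2, 3], [-1, -2], [-2, -1]] with y = [1, 1, -1, -1], the learned w separates the two classes through the origin.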
Dataset: first characteristic • The full domain of the Yahoo! Directory • 292,216 categories • 792,601 documents
Dataset: second characteristic • Over 76% of the Yahoo! categories have fewer than 5 labeled documents • The proportion of such "rare categories" increases at deeper hierarchy levels • 36% of categories at deep levels are rare categories
Dataset: third characteristic • Many documents have multiple labels • A document has 2.23 labels on average • The largest number of labels for a single document is 31
Dataset • Split the Yahoo! Directory into a training set and a testing set with a ratio of 7:3 • Remove categories containing only one labeled document • Result: 132,199 categories • 492,617 training documents and 275,364 testing documents
Complexity and Effectiveness • Flat SVMs, with the one-against-rest strategy • N: the number of training documents • M: the number of categories • T: the average training time per SVM model (each of the M binary models is trained on all N documents) • Total training cost ≈ M · T
Complexity and Effectiveness • Hierarchical SVMs • m_i: the number of categories defined at the i-th level • j: the size-based rank of a category • n_ij: the number of training documents for the j-th category at the i-th level • n_i1: the number of training documents for the most common category at the i-th level • θ_i: a level-specific parameter
Complexity and Effectiveness • The power law n_ij ≈ n_i1 · j^(−θ_i) was used to approximate the number of training documents per category at the i-th level
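The power-law size model above can be sketched directly; n_i1 and θ_i are level-specific inputs that would be estimated from the data:

```python
# Hedged sketch of the level-wise power-law model: the j-th largest
# category at level i is estimated to have n_i1 * j**(-theta_i)
# training documents (theta_i is a level-specific parameter).

def estimated_size(n_i1: float, j: int, theta_i: float) -> float:
    """Estimated training-set size of the j-th largest category."""
    return n_i1 * j ** (-theta_i)

def estimated_level_total(n_i1: float, m_i: int, theta_i: float) -> float:
    """Sum of estimated category sizes over the m_i categories at level i."""
    return sum(estimated_size(n_i1, j, theta_i) for j in range(1, m_i + 1))
```

Summing these per-level estimates is what lets the training cost of the hierarchical SVMs be bounded analytically.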
Complexity and Effectiveness • For the testing phase of hierarchical SVMs • Pachinko-machine search: start from the root; at each step, open the most likely subcategory of the current category, until a leaf is reached
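The top-down search can be sketched as a simple greedy descent; `tree` and `score` below are illustrative stand-ins for the taxonomy and the per-category classifier confidence (a real system would also apply a threshold at each level):

```python
# Minimal sketch of Pachinko-machine search: starting at the root,
# repeatedly descend into the highest-scoring child until a leaf.

def pachinko_search(tree, score, node="root"):
    """tree: dict mapping a category to its list of child categories.
    score: function(category) -> classifier confidence for the document."""
    children = tree.get(node, [])
    while children:                      # descend until a leaf
        node = max(children, key=score)  # open the most likely child
        children = tree.get(node, [])
    return node
```

For example, with tree = {"root": ["Arts", "Science"], "Science": ["Math", "CS"]} and scores favoring Science then CS, the search returns the leaf "CS" after evaluating only the classifiers along one root-to-leaf path, which is what makes testing scale on deep taxonomies.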
Complexity of SVM Classification with Threshold Tuning • SCut: tune a threshold for each category so that the classifier's performance on that category is optimal, then fix the per-category thresholds when applying the classifier to new documents in the test set • RCut: for each document, sort categories by score and assign YES to each of the t top-ranking categories
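A minimal sketch of the two thresholding rules, assuming the per-category scores for one document are given as a dict (for SCut, the per-category thresholds would come from prior tuning):

```python
# RCut: assign YES to the t top-ranking categories for one document.
def rcut(scores: dict, t: int) -> set:
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])

# SCut: assign YES to every category whose score clears that
# category's own pre-tuned threshold (thresholds are then fixed).
def scut(scores: dict, thresholds: dict) -> set:
    return {c for c, s in scores.items() if s >= thresholds[c]}
```

RCut fixes the number of labels per document (useful given the 2.23-label average above), while SCut adapts per category but requires tuning one threshold per category, which is costly with 132,199 categories.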
Effectiveness Analysis • Compared to the scalability analysis, classification effectiveness is not as clear or predictable; it can be affected by many other factors • Potential problems for SVMs: noisy and imbalanced training data • We cannot expect the performance of hierarchical SVMs to be very good in this setting
Experimental Results • 10 machines, each with four 3GHz CPUs and 4 GB of memory
Conclusions • This work scales text categorization algorithms to very large problems, especially large-scale Web taxonomies
Conclusions • Drawback: lower performance at deep levels • Applications: combine SVMs with a concept hierarchy tree; apply to text or other domains; Pachinko-machine search • Future work: learn SVM kernels for this task?