STUDENT RESEARCH SYMPOSIUM 2005 Title: Strategically using Pairwise Classification to Improve Category Prediction Presenter: Pinar Donmez Advisors: Carolyn Penstein Rosé, Jaime Carbonell LTI, SCS Carnegie Mellon University
Outline • Problem Definition • Overview: Multi-label text classification methods • Motivation for Ensemble Approaches • Technical Details: Selective Concentration Classifiers • Formal Evaluation • Conclusions and Future Work
Problem Definition • Multi-label Text Classification (TC): 1-1 mapping of documents to pre-defined categories • Problems with TC: • Limited training data • Poor choice of features • Flaws in the learning process • Goal: Improve predictions on unseen data
Multi-label TC Methods • ECOC • Boosting • Pairwise Coupling and Latent Variable Approach
ECOC • Recall Problems with TC: • Limited training data • Poor choice of features • Flaws in the learning process • Result: Poor classification performance • ECOC: • Encode each class in a code vector • Encode each example in a code vector • Calculate the probability of each bit being 1 using, e.g., decision trees, neural networks, etc. • Combine these probabilities into a vector • To classify a given example, calculate the distance between this vector and each of the codewords of the classes
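Below is a minimal sketch of the ECOC decoding step, assuming scikit-learn-style binary classifiers (one hypothetical `bit_classifiers[j]` per code bit) and a hypothetical `codewords` table mapping each class to its bit vector; the L1 distance and probability estimators are illustrative choices, not necessarily those used in the talk.

```python
import numpy as np

def ecoc_predict(x, codewords, bit_classifiers):
    # Estimate P(bit_j = 1 | x) for every bit position using the per-bit models.
    bit_probs = np.array([clf.predict_proba([x])[0][1] for clf in bit_classifiers])
    # Assign x to the class whose codeword is closest to the probability vector.
    best_class, best_dist = None, float("inf")
    for label, code in codewords.items():
        dist = np.abs(bit_probs - np.array(code)).sum()   # L1 distance
        if dist < best_dist:
            best_class, best_dist = label, dist
    return best_class
```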
Boosting • Main idea: Evolve a set of weights over the training set
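Boosting comes in several variants; as one concrete illustration of "evolving weights over the training set", here is a minimal AdaBoost-style update (an assumed example, not necessarily the variant the talk has in mind).

```python
import numpy as np

def boosting_round(weights, y_true, y_pred):
    # One AdaBoost-style round: misclassified examples gain weight.
    miss = (y_true != y_pred).astype(float)
    err = np.sum(weights * miss) / np.sum(weights)
    err = np.clip(err, 1e-10, 1 - 1e-10)                 # guard against degenerate errors
    alpha = 0.5 * np.log((1 - err) / err)                # weak learner's vote
    weights = weights * np.exp(alpha * (2 * miss - 1))   # up-weight mistakes, down-weight hits
    return weights / weights.sum(), alpha
```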
Pairwise Coupling • K classes, N observations: • X = (f1, f2, f3, …, fp) is an observation with p features • The K=2 case is generally easier than the K>2 case – only one decision boundary has to be learned • Friedman's rule for the K-class problem (K>2): Max-wins Rule – each of the K(K–1)/2 pairwise classifiers votes for the class it prefers, and the class with the most votes wins
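A small sketch of the max-wins rule, assuming a hypothetical `pairwise_models` dict keyed by class pairs whose `predict` returns one of the two classes:

```python
from itertools import combinations

def max_wins_predict(x, pairwise_models, classes):
    # Each of the K(K-1)/2 pairwise classifiers casts one vote; most votes wins.
    votes = {c: 0 for c in classes}
    for ci, cj in combinations(classes, 2):
        winner = pairwise_models[(ci, cj)].predict([x])[0]   # returns ci or cj
        votes[winner] += 1
    return max(votes, key=votes.get)
```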
Latent Variable Approach* • Usage of hidden variables that tell whether the corresponding model is good at capturing particular patterns of the data • Decision is based on the posterior probability P(y | x) = Σi P(y | x, Mi) P(Mi | x), where P(Mi | x) is the likelihood that the i-th model should be used for class prediction given input x, and P(y | x, Mi) is the probability of y given input x and the i-th model * Y. Liu, J. Carbonell, and R. Jin. A pairwise ensemble approach for accurate genre classification. In ECML '03, 2003.
For each class Ci, Liu et al. build a structure like the one in the diagram (see the scoring sketch below) • Compute the corresponding score for each test example • Assign the example to the class with the highest score
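A rough, mixture-of-experts-style sketch of this scoring (sum over models of P(y | x, Mi)·P(Mi | x)); the gating models (`gates`), the per-class ensemble layout, and their training are assumptions for illustration, not Liu et al.'s exact construction.

```python
import numpy as np

def class_score(x, models, gates):
    # Score for one class: sum_i P(y=1 | x, M_i) * P(M_i | x)
    p_y = np.array([m.predict_proba([x])[0][1] for m in models])   # per-model posterior
    p_m = np.array([g.predict_proba([x])[0][1] for g in gates])    # gate: is model i appropriate for x?
    p_m = p_m / p_m.sum()                                          # normalize over models
    return float(p_y @ p_m)

def classify(x, per_class_ensembles):
    # per_class_ensembles: {class_label: (models, gates)}; highest score wins
    scores = {c: class_score(x, ms, gs) for c, (ms, gs) in per_class_ensembles.items()}
    return max(scores, key=scores.get)
```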
Intuition Behind Our Method • Multiple classes => single decision boundary is not powerful enough • Ensemble notion: • Partition the data into focused subsets • Learn a classification model on each subset • Combine the predictions of each model • What is the problem with ensemble techniques? • When the category space is large, time complexity to build models on subsets becomes intractable • Our method addresses this problem. But how?
Technical Details • Build one-vs-all classifiers iteratively • At each iteration choose which sub-classifiers to build based on an analysis of error distributions • Idea: Focus on the classes that are highly confusable • Similar to Boosting • Boosting modifies the weights of misclassified examples to penalize inaccurate models • In decision stage: • If a confusable class is chosen for prediction of a test example, predictions of the sub-classifiers for that class are also taken into account
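A small sketch of the decision stage, assuming scikit-learn-style `decision_function` scores for the one-vs-all models and a hypothetical `sub_classifiers` dict mapping each confusable class to its sub-classifier; the exact combination rule is a guess, not necessarily the one used in this work.

```python
def predict_with_subclassifiers(x, one_vs_all, sub_classifiers):
    # Top-level decision: pick the one-vs-all model with the highest score.
    scores = {c: clf.decision_function([x])[0] for c, clf in one_vs_all.items()}
    top = max(scores, key=scores.get)
    # If the winning class was flagged as confusable, let its sub-classifier refine the call.
    if top in sub_classifiers:
        return sub_classifiers[top].predict([x])[0]
    return top
```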
[Diagram] Train K one-vs-all models and compute a confusion matrix over the classes (A, B, …, D, F, H). Based on the confusions, build an A-vs-B classifier and a D-vs-{F and H} classifier, then recompute the confusion matrix. Note: continue to build sub-classifiers until either there is no need or you cannot divide any further!
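Read as pseudocode, one plausible version of the training loop in the diagram looks like the sketch below; `train_one_vs_all`, `train_subset`, `select_confusable`, and `predict_all` are hypothetical helpers standing in for details the slide does not spell out.

```python
from sklearn.metrics import confusion_matrix

def build_ensemble(X, y, classes, train_one_vs_all, train_subset,
                   select_confusable, predict_all, max_iters=3):
    # Step 1: train K one-vs-all models.
    one_vs_all = {c: train_one_vs_all(X, y, c) for c in classes}
    sub_classifiers = {}
    for _ in range(max_iters):
        # Step 2: confusion matrix of the current ensemble.
        cm = confusion_matrix(y, predict_all(X, one_vs_all, sub_classifiers), labels=classes)
        # Step 3: pick confusable groups, e.g. {A: [B], D: [F, H]} as in the diagram.
        groups = select_confusable(cm, classes)
        if not groups:
            break  # no need to divide any further
        # Step 4: train a focused sub-classifier for each confusable group.
        for c, partners in groups.items():
            keep = [i for i, label in enumerate(y) if label == c or label in partners]
            sub_classifiers[c] = train_subset([X[i] for i in keep], [y[i] for i in keep])
    return one_vs_all, sub_classifiers
```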
How to choose subclassifiers? • fi(λ) = λ*µi + (1 – λ)*σi² • gi(β) = µi + β*σi² where µi = average number of false positives for class i and σi² = variance of the false positives for class i • Focus on classes for which fi(λ) > T (T = predefined threshold) • For every i for which the above inequality is true: • Choose all classes j where C(i,j) > gi(β) • C(i,j) = entry in the confusion matrix where i is the predicted class and j is the true class • 3 parameters: λ, β, and T • Tuned on a held-out set
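The selection criteria translate fairly directly into code. The sketch below treats σi² as the variance of class i's false positives (an assumption following the slide's notation) and operates on a confusion matrix indexed as C[predicted, true]:

```python
import numpy as np

def select_subclassifiers(C, lam, beta, T):
    # C[i, j] = confusion-matrix count with predicted class i and true class j.
    chosen = {}
    K = C.shape[0]
    for i in range(K):
        fp = np.delete(C[i, :], i)           # false positives of classifier i
        mu, var = fp.mean(), fp.var()        # sigma_i^2 taken as the variance (assumption)
        if lam * mu + (1 - lam) * var > T:   # f_i(lambda) > T: class i is highly confusable
            g = mu + beta * var              # g_i(beta)
            partners = [j for j in range(K) if j != i and C[i, j] > g]
            if partners:
                chosen[i] = partners         # build an i-vs-{partners} sub-classifier
    return chosen
```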
Analysis of error distribution for some classifiers I • Analysis on the 20 Newsgroups dataset: • These errors are more uniformly distributed • The average number of false positives is not very high • The two criteria aren't met: • Skewed error distribution • Large number of errors
Analysis of error distribution for some classifiers II • Common in all three: • Skewed distribution of errors (false positives) • These peaks will form the sub-classifiers
Implications of our method • Objective: Obtain high accuracy by choosing a small set of sub-classifiers within a small number of iterations • Pros: • Strategically choosing sub-classifiers reduces training time compared to building all one-vs-one classifiers • O(n log n) classifiers on average • Sub-classifiers are trained on more focused sets, so they are likely to do a better job • Cons: • Sub-classifiers are chosen by focusing on the classes that are hardest to distinguish, so performance might be hurt as we increase the number of iterations
Evaluation • Dataset: 20 Newsgroups* • Evaluation on two versions: • Original 20 Newsgroups (19,997 documents evenly distributed across 20 classes) • Cleaned version (headers, stopwords, and words that occur only once removed) • Vocabulary size ~ 62,000 *J. Rennie and R. Rifkin, Improving Multiclass Text Classification with the Support Vector Machine. MIT AI Memo AIM-2001-026, 2001
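A hedged sketch of how one might approximate the cleaned setup today with scikit-learn (note that scikit-learn ships a slightly different, de-duplicated copy of the dataset rather than the original 19,997-document release, and `min_df=2` only approximates removing words that occur once):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Strip headers, drop English stopwords, and discard very rare words.
train = fetch_20newsgroups(subset="train", remove=("headers",))
vectorizer = CountVectorizer(stop_words="english", min_df=2)
X_train = vectorizer.fit_transform(train.data)
print(len(vectorizer.vocabulary_))   # vocabulary size after cleaning
```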
Comparison of Results I • Results are based on the evaluation of the cleaned version of the 20 Newsgroups dataset • Selective Concentration performed comparably to the Latent Variable Approach • Selective Concentration uses O(n log n) classifiers on average while the Latent Variable Approach uses O(n²) classifiers
Comparison of Results II • Results are based on the original version of the 20 Newsgroups data • The Selective Concentration method is significantly better than the baseline • The difference between the number of classifiers in the two methods is not very large
Conclusion and Future Work • We can achieve comparable accuracy with less training time by strategically selecting subclassifiers • O(n log n) vs O(n²) • Continued formalization of how different error distributions affect the advantage of this approach • Application to semantic role labeling