On feature distributional clustering for text categorization Bekkerman, El-Yaniv, Tishby and Winter The Technion. June 27, 2001
Plan of talk • A presentation of a new text categorization technique based on: • Distributional Clustering • Support Vector Machines (SVM) • A comparative evaluation of the new technique with respect to previous work (Dumais et al.) that used • Mutual Information (MI) feature selection
Main results • The evaluation is performed on two benchmark corpora: • Reuters • 20 Newsgroups (20NG) • The new technique works better than the known one on 20NG, • but it is not better on Reuters. • Possible reasons for this behavior will be discussed.
Text categorization • The fundamental problem of assigning the documents of a large text corpus to a number of predefined semantic categories. • We deal with its supervised version. • The problem has many real-world applications: • Search engines. • Helpdesks.
Text representation • A standard approach: Bag-Of-Words. • A document is represented as the list of words it contains. • A much more sophisticated method: distributional clusters. • A word is represented as a distribution over the categories. • The words are then clustered into k clusters. • Details come later in the talk.
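To make the two representations concrete, here is a minimal Python sketch; the corpus format (a list of (tokens, category) pairs) is an assumption made for illustration, not the paper's actual data format.

```python
from collections import Counter, defaultdict

def bag_of_words(tokens):
    """Standard BOW: a document becomes the multiset of its words."""
    return Counter(tokens)

def word_category_distributions(corpus):
    """Represent each word w as p(c|w): its empirical distribution
    over the categories of the documents it occurs in."""
    counts = defaultdict(Counter)            # word -> category -> count
    for tokens, category in corpus:
        for w in tokens:
            counts[w][category] += 1
    dists = {}
    for w, per_cat in counts.items():
        total = sum(per_cat.values())
        dists[w] = {c: n / total for c, n in per_cat.items()}
    return dists
```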
Support Vector Machines • A modern inductive classification method. • Proposed by Vapnik. • Usually shows an advantage over other learning schemes such as • k-Nearest Neighbors • Naïve Bayes
Corpora • A corpus is a large collection of documents. • We evaluated our algorithms on two well-known corpora: • Reuters (ModApte split): 7063 articles in the training set, 2742 articles in the test set, 118 categories. • 20 Newsgroups: 20000 articles, 20 categories.
Multi-labeling vs. uni-labeling • Multi-labeled corpus: many articles belong to more than one category. • Example: Reuters (15.5% of its documents are multi-labeled). • Uni-labeled corpus: each article belongs to only one category. • 20 Newsgroups has long been considered uni-labeled, but in fact it contains 4.5% multi-labeled documents.
Related results • Dumais et al. (1998): SVM with simple feature selection on Reuters. • Best known result: 92.0% break-even over the 10 largest categories. • Baker and McCallum (1998): distributional clustering + Naïve Bayes on 20NG. • 85.7% accuracy (uni-labeled scheme).
Related results (contd.) • Joachims (1996): the Rocchio algorithm and Naïve Bayes on 20NG. • Best known result on 20NG (uni-labeled approach): 90.3% accuracy. • Slonim and Tishby (2000): the Information Bottleneck method. • Used in our work.
Related results (contd.) • Zhang and Oles (2001): a comparative study of linear classification techniques for text categorization over different corpora. • SVM consistently performed best.
The case of our study • The pipeline under comparison: corpus → (MI feature selection | Distributional Clustering) → Support Vector Machine → result
Feature selection via Mutual Information • On the training set, choose the N words that contribute most to separating the categories. • The contribution is measured, for each word w and each category c, by the mutual information between the binary events "document contains w" and "document belongs to c": $I(w;c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_w, e_c) \log \frac{P(e_w, e_c)}{P(e_w)\,P(e_c)}$
Feature selection via MI (contd.) • For each category we build a list of the N most contributing words. • For example (on 20 Newsgroups): • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, circuits, … • rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
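A hedged sketch of the scoring step, using the standard mutual information over the 2x2 word/category contingency table; the paper's exact estimator may differ in details.

```python
import math

def mi_score(n_w, n_c, n_wc, n):
    """I(w; c) over the 2x2 contingency table.
    n_w: docs containing w; n_c: docs in category c;
    n_wc: docs with both; n: total docs."""
    mi = 0.0
    for e_w, e_c in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        # joint count for this cell of the contingency table
        joint = (n_wc if (e_w and e_c) else
                 n_w - n_wc if e_w else
                 n_c - n_wc if e_c else
                 n - n_w - n_c + n_wc)
        p_w = n_w / n if e_w else 1 - n_w / n
        p_c = n_c / n if e_c else 1 - n_c / n
        p_joint = joint / n
        if p_joint > 0:
            mi += p_joint * math.log(p_joint / (p_w * p_c))
    return mi

# Per category, keep the N highest-scoring words, e.g.:
# best = sorted(vocab, key=lambda w: mi_score(...), reverse=True)[:N]
```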
Distributional Clustering • Proposed by Pereira, Tishby and Lee (1993). • Its generalization is called the Information Bottleneck (Tishby, Pereira and Bialek, 1999). • In our case, each word (in the training set) is represented as a distribution over the categories in which it appears. • Each word w is then clustered into a pseudo-word $\tilde w$.
Distributional Clustering (contd.) • The idea is to maximize the mutual information $I(\tilde W; C)$ with respect to the partition, under a constraint on $I(\tilde W; W)$. • The solution satisfies the following self-consistent equation: $P(\tilde w \mid w) = \frac{P(\tilde w)}{Z(w,\beta)} \exp\left(-\beta\, D_{KL}\left[P(c \mid w) \,\|\, P(c \mid \tilde w)\right]\right)$ • where Z is the normalization factor and β is an annealing parameter.
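A minimal numpy rendering of this equation (an illustrative sketch; the variable names and shapes are assumptions): p_cw is p(c|w) for a single word, centroids holds the pseudo-word distributions p(c|w̃), and priors holds P(w̃).

```python
import numpy as np

def soft_assign(p_cw, centroids, priors, beta, eps=1e-12):
    """P(w~|w) = P(w~) * exp(-beta * KL[p(c|w) || p(c|w~)]) / Z(w, beta)."""
    kl = np.array([np.sum((p_cw + eps) * np.log((p_cw + eps) / (m + eps)))
                   for m in centroids])
    scores = priors * np.exp(-beta * kl)
    return scores / scores.sum()              # the sum is Z(w, beta)
```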
Deterministic Annealing • A powerful clustering method, proposed by Rose et al. (1998). • The approach is "top-down": • Start with one cluster at low β ("high temperature"). • Split clusters while lowering the "temperature" until a stable stage is reached.
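A simplified sketch of that schedule. The real algorithm's split detection and convergence tests are more careful; here every centroid is blindly duplicated at each temperature step, so the cluster count overshoots the target.

```python
import numpy as np

def anneal(P, k_target=300, beta=0.5, rate=1.5, rounds=10, eps=1e-12):
    """P: (n_words, n_categories) array whose rows are p(c|w)."""
    centroids = P.mean(axis=0, keepdims=True)    # one cluster, "hot" start
    priors = np.ones(1)
    while len(centroids) < k_target:
        beta *= rate                             # lower the "temperature"
        # duplicate-and-perturb: copies drift apart only once beta is
        # high enough for the cluster to genuinely split
        noise = 1e-4 * np.random.randn(*centroids.shape)
        centroids = np.vstack([centroids, centroids + noise])
        priors = np.concatenate([priors, priors]) / 2
        for _ in range(rounds):                  # IB fixed-point iterations
            # KL[p(c|w) || p(c|w~)] for every word-cluster pair
            kl = np.einsum('wc,wkc->wk', P + eps,
                           np.log((P[:, None, :] + eps) / (centroids + eps)))
            logits = np.log(priors + eps) - beta * kl
            Q = np.exp(logits - logits.max(axis=1, keepdims=True))
            Q /= Q.sum(axis=1, keepdims=True)    # Q[w, k] = P(w~_k | w)
            priors = Q.mean(axis=0)              # new P(w~), uniform p(w)
            centroids = (Q.T @ P) / (Q.sum(axis=0)[:, None] + eps)
    return centroids, priors
```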
Vector space in our experiment • In the MI feature selection technique: • documents are projected onto the N most contributing words. • In the Information Bottleneck technique: • words are first grouped into clusters, • and documents are then projected onto the pseudo-words. • So documents are vectors whose elements are the numbers of occurrences of the "best" words (1) or pseudo-words (2).
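A sketch of the projection step; word_to_index is assumed to map each selected word to its own coordinate (case 1) or each clustered word to its pseudo-word's coordinate (case 2).

```python
import numpy as np

def project(tokens, word_to_index, k):
    """Document -> k-dimensional vector of occurrence counts."""
    v = np.zeros(k)
    for w in tokens:
        if w in word_to_index:                 # unmapped words are dropped
            v[word_to_index[w]] += 1
    return v
```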
Support Vector Machines • A modern classification technique. • The classification is based only on the border examples (the support vectors). • We used a linear SVM (the SVMlight package by Joachims).
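The experiments used Joachims' SVMlight; purely as an illustrative stand-in, here is a linear SVM via scikit-learn, where C corresponds to SVMlight's C and class_weight loosely plays the role of its J cost factor.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((200, 300))                # 200 documents, k = 300 features
y = rng.integers(0, 2, 200)               # 1 = in the category, 0 = not

clf = LinearSVC(C=1.0, class_weight={1: 2.0})   # J-like weight on positives
clf.fit(X, y)
margins = clf.decision_function(X)        # signed distance from hyperplane
```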
Multi-labeled setting • Apply MI feature selection (or distributional clustering) to the training and test sets. • For each category, train a binary classifier on the training set. • Run all the classifiers on each document in the test set. • The document is assigned to all the categories whose classifiers accepted it.
Uni-labeled setting • Steps 1–3 are the same as in the multi-labeled setting. • The document is assigned to the (one) category whose classifier accepted it with the maximal score. • Both decision rules are sketched below.
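Both decision rules, given a matrix of per-category margin scores from the binary classifiers (a hedged sketch; the shapes are assumptions).

```python
import numpy as np

def multi_label(scores):
    """scores: (n_docs, n_categories). All accepting categories per doc."""
    return [list(np.flatnonzero(row > 0)) for row in scores]

def uni_label(scores):
    """The single category with the maximal score per doc."""
    return scores.argmax(axis=1)
```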
Evaluating the results • Multi-labeled: each document's set of labels should be identical to the set of classification results. • Precision/Recall scheme. • Uni-labeled: the classification result should be included in the set of the document's labels. • Accuracy measure (number of hits).
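A sketch of the two measures. The break-even point itself is conventionally found by sweeping the acceptance threshold until precision equals recall; only the fixed-threshold quantities are shown here.

```python
def precision_recall(predicted, true):
    """predicted, true: one set of category labels per document."""
    tp = sum(len(p & t) for p, t in zip(predicted, true))
    return tp / sum(len(p) for p in predicted), tp / sum(len(t) for t in true)

def accuracy(predicted_one, true):
    """Uni-labeled: a hit if the predicted category is among the labels."""
    return sum(p in t for p, t in zip(predicted_one, true)) / len(true)
```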
The setup of our experiment • To reproduce the results achieved by Dumais et al., we choose k = 300 (the number of "best" words or of clusters). • Since we wanted to compare 20NG and Reuters (ModApte split: ¾ training set, ¼ test set), we used 4-fold cross-validation on 20NG.
Parameter tuning • We have 2 major parameters: • the number of clusters or "best" words (k), • the SVM parameters (C and J in SVMlight). • For each experiment, k is fixed. • To perform a "fair" experiment, we tune C and J on the training set, splitting it into train-train and train-validation sets. • We then run the experiment with the best parameters found at the previous stage.
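A sketch of this "fair" protocol with hypothetical parameter grids; C and J (via class_weight, as in the earlier stand-in) are chosen on a validation split carved out of the training set only.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def tune_and_train(X_train, y_train, Cs=(0.1, 1.0, 10.0), Js=(1.0, 2.0, 5.0)):
    # train-train / train-validation split: the test set is never touched
    Xtr, Xval, ytr, yval = train_test_split(X_train, y_train, test_size=0.25)
    score, C, J = max(
        (LinearSVC(C=C, class_weight={1: J}).fit(Xtr, ytr).score(Xval, yval),
         C, J)
        for C in Cs for J in Js)
    # retrain on the full training set with the chosen parameters
    return LinearSVC(C=C, class_weight={1: J}).fit(X_train, y_train)
```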
Unfair parameter tuning • Suppose we want to compare the results of two experiments A and B, and we see that the result of A is better than that of B. • We then rerun B with unfair parameter tuning: • the parameters are tuned directly on the test set. • If even then B cannot match A, we can be sure that the result of A is unachievable with the setting of B.
Our result on 20 Newsgroups • Multi-labeled setting (break-even point): • Clustering: 88.6±0.3% (k = 300) • MI feature selection: 77.7±0.5% (k = 300) • MI feature selection: 86.3±0.4% (k = 15000) • Uni-labeled setting (accuracy measure): • Clustering: 91.0±0.3% (k = 300) • MI feature selection: 85.1±0.5% (k = 300) • MI feature selection: 91.2±0.4% (k = 15000) • Parameter tuning of the MI-based experiments is unfair.
Our result on Reuters • It makes no sense to speak of a uni-labeled setting on Reuters, • because it is a multi-labeled corpus. • Multi-labeled setting (break-even point): • Clustering: 91.6% (k = 300, unfair tuning) • MI feature selection: 92.0% (k = 300) • The results are achieved on the 10 largest categories of Reuters.
Discussion of the results • We see that our technique (clustering) works better than MI on 20NG and almost the same (slightly worse) on Reuters. • What can explain this? • Reuters is manually labeled, while 20NG is "naturally" labeled. • Hypothesis: Reuters was labeled according to only a few keywords that appeared in the documents.
Confirmation of our hypothesis • We tried decreasing the number of features selected by the MI technique on both Reuters and 20NG. • We saw that • on 20NG the results dropped sharply, • while on Reuters they remained the same. • So just a few words are enough to categorize the documents of Reuters, while on 20NG many more words are needed.
Conclusion • There are corpora for which simple methods work well. • Reuters is one of them: selecting just a few features solves the text categorization problem. • For other corpora (such as 20NG), the sophisticated method of distributional clustering helps a lot. • Future work: evaluate our technique on other corpora.