Sparsity Analysis of Term Weighting Schemes and Application to Text Classification
Nataša Milić-Frayling,¹ Dunja Mladenić,² Janez Brank,² Marko Grobelnik²
¹ Microsoft Research, Cambridge, UK
² Jožef Stefan Institute, Ljubljana, Slovenia
Introduction
• Feature selection in the context of text categorization
• Comparing different feature ranking schemes
• Characterizing feature rankings based on their sparsity behavior
• Sparsity defined as the average number of different words in a document (after feature selection has removed some words)
Feature Weighting Schemes
• Odds ratio: OR(t) = log[ odds(t | c) / odds(t | ¬c) ]
• Information gain: IG(t; c) = entropy(c) – entropy(c | t)
• χ²-statistic: χ²(t) = N · [N(t,c)·N(¬t,¬c) – N(t,¬c)·N(¬t,c)]² / [N(c)·N(¬c)·N(t)·N(¬t)], where N = number of all documents, N(t,c) = number of documents from class c containing term t, etc. The numerator equals 0 if t and c are independent.
• Robertson-Sparck-Jones weighting: RSJ(t) = log[ (N(t,c)+0.5)·(N(¬t,¬c)+0.5) / ((N(¬t,c)+0.5)·(N(t,¬c)+0.5)) ] (very similar to odds ratio; see the sketch below)
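The count-based scores above can be computed directly from per-term document counts. Below is a minimal numpy sketch of the odds ratio, χ²-statistic, and Robertson-Sparck-Jones weight, assuming documents are given as a dense 0/1 term-document matrix; the helper names and the smoothing constant eps are our own choices, not taken from the slides.

```python
import numpy as np

def contingency_counts(X, y):
    """Per-term document counts N(t,c), N(t,~c), N(~t,c), N(~t,~c).

    X : (n_docs, n_terms) dense 0/1 array, X[d, t] = 1 if term t occurs in doc d
    y : (n_docs,) 0/1 array, y[d] = 1 if doc d belongs to category c
    """
    X = np.asarray(X, dtype=bool)
    y = np.asarray(y, dtype=bool)
    n_tc = X[y].sum(axis=0)                 # term present, doc in c
    n_t_notc = X[~y].sum(axis=0)            # term present, doc not in c
    n_nott_c = y.sum() - n_tc               # term absent, doc in c
    n_nott_notc = (~y).sum() - n_t_notc     # term absent, doc not in c
    return n_tc, n_t_notc, n_nott_c, n_nott_notc

def odds_ratio(n_tc, n_t_notc, n_nott_c, n_nott_notc, eps=1e-6):
    # OR(t) = log[ odds(t|c) / odds(t|~c) ]; eps only guards against log(0)
    return np.log((n_tc + eps) * (n_nott_notc + eps) /
                  ((n_nott_c + eps) * (n_t_notc + eps)))

def rsj(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    # Robertson-Sparck-Jones weight: the same ratio with 0.5 smoothing,
    # which is why it behaves very similarly to the odds ratio
    return odds_ratio(n_tc, n_t_notc, n_nott_c, n_nott_notc, eps=0.5)

def chi_square(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    # chi^2(t) = N * [N(t,c) N(~t,~c) - N(t,~c) N(~t,c)]^2 / [N(c) N(~c) N(t) N(~t)]
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    n_c, n_notc = n_tc + n_nott_c, n_t_notc + n_nott_notc
    n_t, n_nott = n_tc + n_t_notc, n_nott_c + n_nott_notc
    num = n * (n_tc * n_nott_notc - n_t_notc * n_nott_c) ** 2
    return num / np.maximum(n_c * n_notc * n_t * n_nott, 1)
```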
Feature Weighting Schemes
• Weights based on word frequency (a small sketch follows below):
DF = document frequency (number of documents containing the word; this ranking suggests using the most common words)
IDF = inverse document frequency (use the least common words)
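For completeness, a tiny sketch of the frequency-based rankings under the same 0/1 matrix representation (the function names are ours):

```python
import numpy as np

def df_ranking(X):
    """Rank terms by document frequency: most common words first."""
    df = (np.asarray(X) > 0).sum(axis=0)   # number of docs containing each term
    return np.argsort(-df)

def idf_ranking(X):
    """Rank terms by inverse document frequency: least common words first."""
    return df_ranking(X)[::-1]
```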
Feature Weighting Schemes
• Weights based on a linear classifier (w, b): prediction(d) = sgn[ b + Σᵢ wᵢ · TF(tᵢ, d) ]
• If a weight wᵢ is close to 0, the term tᵢ has little influence on the predictions.
• If it is not important for predictions, it is probably not important for learning either.
• Thus, use |wᵢ| as the score of the term tᵢ (see the sketch below).
• We use linear models trained using SVM and perceptron.
• It might be practical to train the model on only a subset of the full training set (e.g. ½ or ¼ of the full training set, etc.).
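One possible realization of this ranking, sketched with scikit-learn's LinearSVC; the slides use their own SVM and perceptron implementations, so the library, the subsample parameter, and the function name are our assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def linear_model_ranking(X_train, y_train, subsample=1.0, random_state=0):
    """Rank features by |w_i| from a linear model trained on (a subset of) the data."""
    n_docs = X_train.shape[0]
    if subsample < 1.0:
        # optionally train on e.g. 1/2 or 1/4 of the training set, as suggested above
        rng = np.random.RandomState(random_state)
        keep = rng.choice(n_docs, size=max(1, int(subsample * n_docs)), replace=False)
        X_train, y_train = X_train[keep], np.asarray(y_train)[keep]
    model = LinearSVC().fit(X_train, y_train)
    scores = np.abs(model.coef_.ravel())   # |w_i| for the binary problem
    return np.argsort(-scores)             # most influential terms first
```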
Characterization of Feature Rankings in terms of Sparsity
• We have a relatively good understanding of feature rankings based on odds ratio, information gain, etc., because they are based on explicit formulas for feature scores
• How to better understand the rankings based on linear classifiers?
• Let "sparsity" be the average number of different words per document, after some feature selection has been applied.
• Equivalently: the average number of nonzero components per vector representing a document.
• This has direct ties to memory consumption, as well as to CPU time consumption for computing norms, dot products, etc.
• We can plot the "sparsity curve" showing how sparsity grows as we add more and more features from a given ranking (see the sketch below).
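A minimal sketch of how such a sparsity curve can be computed from a sparse term-frequency matrix (scipy-based; the function name is ours):

```python
import numpy as np
import scipy.sparse as sp

def sparsity_curve(X, ranking):
    """sparsity_curve(X, ranking)[k-1] = average number of nonzero components
    per document when only the top-k features of `ranking` are kept.

    X       : (n_docs, n_terms) term-frequency matrix (dense or sparse)
    ranking : array of feature indices, best features first
    """
    X = sp.csc_matrix(X)
    n_docs = X.shape[0]
    nnz_per_feature = np.diff(X.indptr)    # nonzeros in each column (feature)
    # adding the k-th ranked feature contributes its nonzeros to the document totals
    return np.cumsum(nnz_per_feature[ranking]) / n_docs
```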
Sparsity as the independent variable
• When discussing and comparing feature rankings, we often use the number of features as the independent variable.
• "What is the performance when using the first 100 features?" etc.
• This is somewhat unfair towards rankings that (at least initially) prefer less frequent features, such as odds ratio.
• Sparsity is much more directly connected to memory and CPU time requirements.
• Thus, we propose using sparsity as the independent variable when comparing feature rankings (see the sketch below).
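Using the sparsity_curve sketch above, comparing rankings "at the same sparsity" rather than "at the same number of features" amounts to inverting the curve; a sketch under the same assumptions:

```python
import numpy as np

def features_for_sparsity(sparsity, target):
    """Smallest k for which the top-k features reach the target sparsity.

    sparsity : nondecreasing array returned by sparsity_curve()
    """
    k = int(np.searchsorted(sparsity, target, side='left')) + 1
    return min(k, len(sparsity))
```

Two rankings can then be compared at, say, an average of 50 nonzero components per document, regardless of how many features each of them needs to reach that point.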
Performance as a function of the number of features (Naïve Bayes, 16 categories of RCV2)
Sparsity as a cutoff criterion
• Each category is treated as a binary classification problem (does the document belong to category c or not?)
• Thus, a feature ranking method produces one ranking per category
• We must choose how many of the top-ranked features to use for learning and classification
• Alternatively, we can define the cutoff in terms of sparsity.
• The best number of features can vary greatly from one category to another
• Does the best sparsity vary less between categories?
• Suppose we want a constant number of features for each category. Is it better to use a constant sparsity for each category instead? (See the sketch below.)
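Building on the two sketches above, a constant-sparsity cutoff per category could look like the following; `rankings` maps each category to its feature ranking, and the target value is purely illustrative.

```python
def constant_sparsity_cutoffs(X, rankings, target_sparsity):
    """Per-category feature cutoffs that all reach the same target sparsity.

    rankings : dict mapping category -> feature ranking (best features first)
    Returns a dict mapping category -> number of top-ranked features to keep.
    """
    return {c: features_for_sparsity(sparsity_curve(X, r), target_sparsity)
            for c, r in rankings.items()}

# e.g. cutoffs = constant_sparsity_cutoffs(X_train, rankings, target_sparsity=50)
# The cutoffs generally differ across categories even though the resulting
# average document length after selection is the same for all of them.
```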
Conclusions
• Sparsity is an interesting and useful concept
• As a cutoff criterion, it is no worse, and is often a little better, than the number of features
• It offers more direct control over memory and CPU time consumption
• When comparing feature selection methods, it is not biased in favour of methods which prefer more common features
Future work
• Characterize feature ranking schemes in terms of other characteristics besides sparsity curves
• E.g. cumulative information gain: how the sum of IG(t; c) over the first k terms t of the feature ranking grows with k (see the sketch below)
• The goal: define a set of characteristic curves that would explain why some feature rankings (e.g. SVM-based) are better than others.
• If we know the characteristic curves of a good feature ranking, we can synthesize new rankings with approximately the same characteristic curves
• Would they also perform comparably well?
• With a good set of feature characteristics, we might be able to take the approximate characteristics of a good feature ranking and then synthesize comparably good rankings on other classes or datasets.
• (Otherwise it can be expensive to obtain a really good feature ranking, such as the SVM-based one.)
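As an illustration of one such characteristic curve, here is a sketch of the cumulative information gain mentioned above, reusing the contingency counts from the earlier sketch; the entropy estimator and the clipping constant are our own choices.

```python
import numpy as np

def information_gain(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """IG(t; c) = entropy(c) - entropy(c | t), estimated from document counts."""
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    def h(p):  # binary entropy, clipped to avoid log(0)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    p_c = (n_tc + n_nott_c) / n                           # P(c)
    p_t = (n_tc + n_t_notc) / n                           # P(t)
    p_c_given_t = n_tc / np.maximum(n_tc + n_t_notc, 1)
    p_c_given_nott = n_nott_c / np.maximum(n_nott_c + n_nott_notc, 1)
    return h(p_c) - (p_t * h(p_c_given_t) + (1 - p_t) * h(p_c_given_nott))

def cumulative_ig_curve(ig_scores, ranking):
    """How the sum of IG(t; c) over the first k ranked terms grows with k."""
    return np.cumsum(np.asarray(ig_scores)[ranking])
```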