Sparsity Analysis of Term Weighting Schemes and Application to Text Classification Nataša Milić-Frayling,¹ Dunja Mladenić,² Janez Brank,² Marko Grobelnik² — ¹ Microsoft Research, Cambridge, UK; ² Jožef Stefan Institute, Ljubljana, Slovenia
Introduction • Feature selection in the context of text categorization • Comparing different feature ranking schemes • Characterizing feature rankings by their sparsity behavior • Sparsity is defined as the average number of different words per document (after feature selection has removed some words)
Feature Weighting Schemes • Odds ratio: OR(t) = log[odds(t|c) / odds(t|¬c)] • Information gain: IG(t; c) = entropy(c) – entropy(c|t) • χ²-statistic: χ²(t) = N · (N_tc · N_¬t¬c – N_t¬c · N_¬tc)² / [N_c · N_¬c · N_t · N_¬t], where N = number of all documents, N_tc = number of documents from class c containing term t, etc. The numerator equals 0 if t and c are independent. • Robertson–Sparck-Jones weighting: RSJ(t) = log[(N_tc + 0.5)(N_¬t¬c + 0.5) / ((N_¬tc + 0.5)(N_t¬c + 0.5))] (very similar to odds ratio)
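To make the contingency-based scores above concrete, here is a minimal Python sketch (not from the slides) that computes OR, IG, χ², and RSJ for one term and one category from the four cell counts; the function name and the small smoothing constant `eps` are illustrative assumptions.

```python
import math

def term_scores(N, N_tc, N_t_nc, N_nt_c, N_nt_nc, eps=1e-10):
    """Score one term t for one category c from 2x2 contingency counts.

    N       : total number of documents
    N_tc    : documents in class c that contain term t
    N_t_nc  : documents not in c that contain t
    N_nt_c  : documents in c that do not contain t
    N_nt_nc : documents not in c that do not contain t
    """
    N_t, N_nt = N_tc + N_t_nc, N_nt_c + N_nt_nc   # docs with / without t
    N_c, N_nc = N_tc + N_nt_c, N_t_nc + N_nt_nc   # docs in / not in c

    # Odds ratio: log of odds(t|c) / odds(t|not c); eps avoids division by zero.
    odds_pos = (N_tc + eps) / (N_nt_c + eps)
    odds_neg = (N_t_nc + eps) / (N_nt_nc + eps)
    odds_ratio = math.log(odds_pos / odds_neg)

    # Chi-squared statistic: the numerator is zero when t and c are independent.
    chi2 = N * (N_tc * N_nt_nc - N_t_nc * N_nt_c) ** 2 / (N_c * N_nc * N_t * N_nt + eps)

    # Robertson / Sparck-Jones weight (0.5 smoothing in every cell).
    rsj = math.log(((N_tc + 0.5) * (N_nt_nc + 0.5)) /
                   ((N_nt_c + 0.5) * (N_t_nc + 0.5)))

    # Information gain: entropy(c) - entropy(c|t), from the same counts.
    def H(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    p_c = N_c / N
    p_t = N_t / N
    p_c_given_t = N_tc / max(N_t, 1)
    p_c_given_nt = N_nt_c / max(N_nt, 1)
    info_gain = H(p_c) - (p_t * H(p_c_given_t) + (1 - p_t) * H(p_c_given_nt))

    return {"OR": odds_ratio, "IG": info_gain, "chi2": chi2, "RSJ": rsj}
```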
Feature Weighting Schemes • Weights based on word frequency: DF = document frequency (number of documents containing the word; this ranking suggests using the most common words); IDF = inverse document frequency (use the least common words)
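A possible reading of the DF and IDF rankings as code (an illustrative sketch, not from the slides; documents are assumed to be tokenized already):

```python
from collections import Counter

def df_rankings(tokenized_docs):
    """Rank terms by document frequency (DF) and by its inverse (IDF).

    tokenized_docs : list of documents, each a list of word tokens
    Returns (df_ranking, idf_ranking): terms ordered from most to least preferred.
    """
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                            # count each word once per document
    df_ranking = [t for t, _ in df.most_common()]      # most common words first
    idf_ranking = list(reversed(df_ranking))           # least common words first
    return df_ranking, idf_ranking
```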
Feature Weighting Schemes • Weights based on a linear classifier (w, b): prediction(d) = sgn[b + Σ_i w_i · TF(t_i, d)], where the sum runs over the terms t_i of document d • If a weight w_i is close to 0, the term t_i has little influence on the predictions. • If it is not important for predictions, it is probably not important for learning either. • Thus, use |w_i| as the score of the term t_i. • We use linear models trained using SVM and the perceptron. • It can be practical to train this model on only a subset of the full training set (e.g. ½ or ¼ of it).
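As an illustration of the linear-classifier ranking, the sketch below trains a linear SVM with scikit-learn on raw TF vectors and ranks features by |w_i|. The function name, the use of CountVectorizer/LinearSVC, and the train_fraction parameter are assumptions for this example, not the exact setup from the slides.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def svm_feature_ranking(texts, labels, train_fraction=0.5):
    """Rank features by the absolute weight a linear SVM assigns to them.

    Optionally trains on only a fraction of the data, as suggested above.
    """
    vectorizer = CountVectorizer()                 # raw term frequencies (TF)
    X = vectorizer.fit_transform(texts)
    n_train = max(1, int(train_fraction * X.shape[0]))
    clf = LinearSVC()
    clf.fit(X[:n_train], labels[:n_train])
    weights = np.abs(clf.coef_.ravel())            # |w_i| as the feature score
    order = np.argsort(-weights)                   # highest scores first
    features = np.array(vectorizer.get_feature_names_out())
    return features[order], weights[order]
```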
Characterization of Feature Rankings in terms of Sparsity • We have a relatively good understanding of feature rankings based on odds ratio, information gain, etc., because they are based on explicit formulas for feature scores • How can we better understand the rankings based on linear classifiers? • Let "sparsity" be the average number of different words per document after some feature selection has been applied. • Equivalently: the average number of nonzero components per vector representing a document. • This is directly tied to memory consumption, as well as to CPU time for computing norms, dot products, etc. • We can plot the "sparsity curve" showing how sparsity grows as we add more and more features from a given ranking.
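A small sketch of how a sparsity curve could be computed from a sparse document-term matrix and a feature ranking (illustrative; the helper name sparsity_curve is our own):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsity_curve(X, ranking):
    """Sparsity (average number of nonzero components per document vector)
    as a function of how many top-ranked features are kept.

    X       : sparse document-term matrix (documents x terms)
    ranking : term indices ordered from best to worst by some feature ranking
    Returns an array whose entry k is the sparsity when the top k+1 features are kept.
    """
    n_docs = X.shape[0]
    # Number of documents containing each term = nonzeros per column.
    docs_per_term = np.diff(csr_matrix(X).tocsc().indptr)
    # Adding feature ranking[k] adds docs_per_term[ranking[k]] nonzero entries in total.
    cumulative_nonzeros = np.cumsum(docs_per_term[ranking])
    return cumulative_nonzeros / n_docs
```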
Sparsity as the independent variable • When discussing and comparing feature rankings, we often use the number of features as the independent variable. • "What is the performance when using the first 100 features?" etc. • This is somewhat unfair towards rankings that (at least initially) prefer less frequent features, such as odds ratio. • Sparsity is much more directly connected to memory and CPU time requirements. • Thus, we propose using sparsity as the independent variable when comparing feature rankings.
Performance as a function of the number of features (Naïve Bayes, 16 categories of RCV2)
Sparsity as a cutoff criterion • Each category is treated as a binary classification problem (does the document belong to category c or not?) • Thus, a feature ranking method produces one ranking per category • We must choose how many of the top ranked features to use for learning and classification • Alternatively, we can define the cutoff in terms of sparsity. • The best number of features can vary greatly from one category to another • Does the best sparsity vary less between categories? • Suppose we want a constant number of features for each category. Is it better to use a constant sparsity for each category?
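A sketch of how a sparsity-based cutoff could be applied per category, reusing the hypothetical sparsity_curve helper from above (the threshold semantics here, the smallest k whose sparsity reaches the target, are our assumption):

```python
import numpy as np

def features_for_target_sparsity(sparsity_values, target_sparsity):
    """Smallest number of top-ranked features whose sparsity reaches the target.

    sparsity_values : output of sparsity_curve() (sparsity after keeping k+1 features)
    target_sparsity : desired average number of nonzero components per document
    """
    sparsity_values = np.asarray(sparsity_values)
    reached = np.nonzero(sparsity_values >= target_sparsity)[0]
    # If even the full ranking stays below the target, keep all features.
    return int(reached[0]) + 1 if reached.size else len(sparsity_values)
```

Because the same target sparsity maps to a different number of features for each category (common terms reach it quickly, rare terms slowly), every category ends up with a cutoff of comparable memory cost.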
Conclusions • Sparsity is an interesting and useful concept • As a cutoff criterion, it is no worse than the number of features, and often slightly better • It offers more direct control over memory and CPU time consumption • When comparing feature selection methods, it is not biased in favour of methods that prefer more common features
Future work • Characterize feature ranking schemes in terms of other characteristics besides sparsity curves • E.g. cumulative information gain: how the sum of IG(t; c) over the first k terms t of the feature ranking grows with k. • The goal: define a set of characteristic curves that would explain why some feature rankings (e.g. SVM-based) are better than others. • If we know the characteristic curves of a good feature ranking, we can synthesize new rankings with approximately the same characteristic curves • Would they also perform comparably well? • With a good set of feature characteristics, we might be able to take the approximate characteristics of a good feature ranking and then synthesize comparably good rankings on other classes or datasets. • (Otherwise it can be expensive to get a really good feature ranking, such as the SVM-based one.)
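The cumulative information gain curve mentioned above could be computed along the following lines (an illustrative sketch; the per-term IG scores are assumed to be available, e.g. from the scoring function sketched earlier):

```python
import numpy as np

def cumulative_ig_curve(ranking, ig_scores):
    """Cumulative information gain over the first k terms of a feature ranking.

    ranking   : term indices in the order given by the ranking under study
                (e.g. the SVM-based one)
    ig_scores : IG(t; c) for every term, indexed by term id
    Returns an array whose entry k is the sum of IG over the first k+1 terms.
    """
    ig_scores = np.asarray(ig_scores)
    return np.cumsum(ig_scores[ranking])
```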