A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004
Synopsis • Purpose of this work • Experiment Design • Results and Discussions • Conclusions
Purpose of this work • Text categorization is the task of assigning unlabelled documents to predefined categories • Classifiers applied to it include kNN, Decision Tree, Neural Network, Naïve Bayes, Linear Regression, SVM, Perceptron, Rocchio, etc. • Ensemble methods: Classifier Committees, Bagging, Boosting • SVM has been shown to perform rather well
Purpose of this work (Cont.) • Does the difference in performance come from different text representations or from different SVM kernel functions? • [Leopold, 2002] points out that, in the text categorization domain, the text representation scheme rather than the SVM kernel function dominates performance.
Purpose of this work (Cont.) • Therefore, choosing an appropriate term weighting scheme is more important than choosing and tuning SVM kernel functions for text categorization. • However, previous work is not sufficient to draw a definite conclusion about which term weighting scheme is best for SVM.
Purpose of this work (Cont.) • Previous results are hard to compare because the studies differ in: • Data preparation: stemming, stop words, feature selection, term weighting schemes • Data collection: Reuters (whole, top 10, top 90, partial top 10) • Classifiers, with various parameters • Performance evaluation measures
Purpose of this work (Cont.) • Our study focuses on the various term weighting schemes for SVM. • Reasons for choosing the linear kernel function: • It is simple and fast • Based on our preliminary experiments and previous studies, linear models perform better than non-linear models, even on high-dimensional data • Our current work compares term weighting schemes rather than choosing and tuning kernel functions
Term Weighting Schemes • 10 different term weighting schemes, selected for their reported superior classification results or because they are typical representations used with SVM • They are: binary, tf, logtf, ITF, idf, tf.idf, logtf.idf, tf.idf-prob, tf.chi, tf.rf
Term Weighting Schemes • The following four depend on term frequency alone: • binary : 1 if the term is present in the document and 0 if it is absent • tf : # of times a term occurs in a document • logtf : 1 + log(tf), where the log mitigates the unfavorable linearity of raw tf • ITF : 1 - r/(r+tf), usually r=1 (inverse term frequency, proposed by Leopold)
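The four frequency-only weights above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the natural logarithm is assumed (the slide does not state the base), and an absent term is assumed to receive weight 0 under logtf.

```python
import math

def binary_weight(tf: int) -> float:
    """1 if the term occurs in the document, 0 otherwise."""
    return 1.0 if tf > 0 else 0.0

def tf_weight(tf: int) -> float:
    """Raw term frequency."""
    return float(tf)

def logtf_weight(tf: int) -> float:
    """1 + log(tf); assumed to be 0 for an absent term (log base is an assumption)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def itf_weight(tf: int, r: float = 1.0) -> float:
    """Inverse term frequency: 1 - r / (r + tf), with r = 1 by default."""
    return 1.0 - r / (r + tf)
```

With r = 1, ITF maps tf = 0, 1, 3 to 0, 0.5, 0.75, so it saturates toward 1 instead of growing linearly.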
Term Weighting Schemes • The following four involve the idf factor: • idf : log(N/ni), where N is the # of docs and ni is the # of docs which contain term ti • tf.idf : tf × idf, the widely used term representation • logtf.idf : (1 + log(tf)) × idf • tf.idf-prob : tf × idf-prob, where idf-prob = log((N-ni)/ni) approximates the term relevance weight and is also called probabilistic idf
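A minimal Python sketch of the idf-based weights, under the same caveats as before: the function names are hypothetical and the natural logarithm is an assumption. Note that idf-prob becomes negative once a term occurs in more than half of the documents and is undefined when it occurs in all of them.

```python
import math

def idf(n_docs: int, n_i: int) -> float:
    """idf = log(N / ni)."""
    return math.log(n_docs / n_i)

def idf_prob(n_docs: int, n_i: int) -> float:
    """Probabilistic idf = log((N - ni) / ni); requires 0 < ni < N."""
    return math.log((n_docs - n_i) / n_i)

def tf_idf(tf: int, n_docs: int, n_i: int) -> float:
    return tf * idf(n_docs, n_i)

def logtf_idf(tf: int, n_docs: int, n_i: int) -> float:
    """(1 + log(tf)) * idf; assumed to be 0 for an absent term."""
    return (1.0 + math.log(tf)) * idf(n_docs, n_i) if tf > 0 else 0.0

def tf_idf_prob(tf: int, n_docs: int, n_i: int) -> float:
    return tf * idf_prob(n_docs, n_i)
```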
Term Weighting Schemes • tf.chi : tf × chi^2, representing schemes that combine tf with a feature selection measure (chi^2, information gain, odds ratio, gain ratio, etc.) • tf.rf : newly proposed by us; relevance frequency rf = log(1 + ni/ni_), where ni is the # of docs which contain term ti and ni_ is the # of negative docs which contain term ti
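Analysis of Discriminating Power • Formulas for the different factors, written in terms of document counts for one term and one category: a = positive docs containing the term, b = positive docs not containing it, c = negative docs containing it, d = negative docs not containing it • idf = log(N/(a+c)) • chi^2 = N(ad-bc)^2 / ((a+c)(b+d)(a+b)(c+d)) • idf-prob = log((b+d)/(a+c)) • rf = log(2 + a/c); to avoid division by zero when c = 0, we set rf = log(2 + a/max(1,c)) • N = a+b+c+d, with d >> a, b, c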
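The count-based factors can be written directly from the four cells a, b, c, d. The sketch below is illustrative only: function names are hypothetical, the natural logarithm is assumed, and chi^2 is assumed to be 0 when a marginal count is zero (the slide does not address that case).

```python
import math

def chi_square(a: int, b: int, c: int, d: int) -> float:
    """chi^2 = N (ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:  # assumption: treat a degenerate table as carrying no signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def rf(a: int, c: int) -> float:
    """Relevance frequency rf = log(2 + a / max(1, c))."""
    return math.log(2 + a / max(1, c))
```

For c > 0, `rf(a, c)` equals log(1 + (a+c)/c), the form written in terms of ni = a+c and ni_ = c, since 1 + (a+c)/c = 2 + a/c.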
Analysis of Discriminating Power • Assume the six terms have the same tf value. The first three terms share one idf value, idf1, and the last three share another, idf2, where idf = log(N/(a+c)), N = a+b+c+d, and idf1 > idf2.
Analysis of Discriminating Power • Given idf1 > idf2, classical tf.idf gives more weight to the first three terms than to the last three. • But t1 has more discriminating power for the positive category than t2 and t3, which tf.idf cannot reflect because all three share the same idf. The tf.idf representation may therefore lose discriminating power. • We propose a new factor, relevance frequency: rf = log(1 + (a+c)/c), which equals log(2 + a/c).
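The argument can be checked numerically. The counts below are hypothetical (they are not taken from the slides): three terms occur in the same number of documents, so they share one idf value, but they split differently between positive and negative documents. The natural logarithm is assumed.

```python
import math

N = 1000  # hypothetical collection size

# Hypothetical (a, c) pairs: a = positive docs containing the term,
# c = negative docs containing it. All three have a + c = 100.
terms = {"t1": (90, 10), "t2": (50, 50), "t3": (10, 90)}

def idf(a: int, c: int, n_docs: int = N) -> float:
    return math.log(n_docs / (a + c))

def rf(a: int, c: int) -> float:
    return math.log(2 + a / max(1, c))

weights = {name: (idf(a, c), rf(a, c)) for name, (a, c) in terms.items()}

for name, (w_idf, w_rf) in weights.items():
    print(f"{name}: idf={w_idf:.3f}  rf={w_rf:.3f}")
```

Under these counts idf is identical for the three terms, while rf ranks t1 above t2 above t3, which is the ordering the slide argues for.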
Benchmark Data Collection 1 • Data Collection 1 – Reuters-21578 • Top 10 categories, 7193 training and 2787 test documents • Stop words (292), punctuation and numbers removed • Porter stemming performed • Minimum term length is 4 • Top p features per category selected using the chi-square metric, p = {50, 150, 300, 600, 900, 1200, 1800, 2400, All} • Null vectors are removed • 15959 terms
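Per-category feature selection of this kind might look like the sketch below. It assumes the chi-square score of every term has already been computed for every category, and it assumes the final vocabulary is the union of each category's top p terms; the slide does not say how the per-category lists are combined. The function name, the data layout and the example scores are hypothetical.

```python
from typing import Optional

def select_top_p(
    scores_by_category: dict[str, dict[str, float]],
    p: Optional[int],
) -> set[str]:
    """Union of the p highest-scoring terms of each category (p=None keeps all)."""
    selected: set[str] = set()
    for scores in scores_by_category.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:p])
    return selected

# Hypothetical chi-square scores for two categories.
example_scores = {
    "earn": {"profit": 9.0, "share": 5.0, "wheat": 0.1},
    "grain": {"wheat": 8.0, "crop": 4.0, "profit": 0.2},
}
vocabulary = select_top_p(example_scores, p=1)
```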
Benchmark Data Collection 2 • Data Collection 2 – 20 Newsgroups • 200 training and 100 test documents per category, 20 categories; 4000 training and 2000 test documents in total • Stop words, punctuation and numbers removed • Minimum term length is 4 • Top p features per category selected using the chi-square metric, p = {5, 25, 50, 75, 100, 150, 200, 250, 300, 400, 500} • Null vectors are removed • 50088 terms
Two Data Sets Comparison • Reuters : skewed category distribution; among the 7193 training documents, the most common category (earn) contains 2877 (40%), while 80% of the categories each contain less than 7.5% of the training samples. • 20 Newsgroups : uniform distribution; we selected the first 200 training and the first 100 test documents per category from the 20news-bydate partition, giving 200 positive and 3800 negative training samples for each category.
Performance Measure • Precision = true positives / (true positives + false positives) • Recall = true positives / (true positives + false negatives) • Precision/recall break-even point : tune the classifier parameter until precision and recall are equal, and report the value at that (hypothetical) point.
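One common way to approximate the break-even point, sketched below, is to rank test documents by classifier score and label as positive exactly as many documents as there are true positives in the test set; at that cutoff the number of false positives equals the number of false negatives, so precision equals recall. This is an illustrative procedure, not necessarily the one used in the study, and the function names are hypothetical. Ties in the scores are ignored.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def breakeven_point(scores: list[float], labels: list[int]) -> float:
    """Precision (= recall) when the top-k scored docs are predicted positive,
    with k equal to the number of true positives; labels are 0 or 1."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = sum(label for _, label in ranked[:n_pos])
    return tp / n_pos
```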
McNemar's Significance Test • Two classifiers f1 and f2 are based on two different term weighting schemes. • Contingency table: n01 and n10 count the test examples misclassified by exactly one of the two classifiers.
McNemar's Significance Test • Null hypothesis: the two classifiers have the same error rate, so n10 = n01 is expected; • The statistic chi = (|n10-n01|-1)^2 / (n01+n10) is approximately chi^2-distributed with 1 degree of freedom; • If the null hypothesis is correct, the probability that this quantity is greater than chi^2(1, 0.99) = 6.64 is less than 0.01 (the significance level alpha).
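The test can be sketched as follows. The critical value 6.64 is the one quoted on the slide. The function names are hypothetical, as is the convention that n01 counts examples only f1 gets wrong and n10 those only f2 gets wrong; the statistic is symmetric in the two counts, so the choice does not affect the result.

```python
def disagreement_counts(y_true, pred_1, pred_2):
    """Count examples misclassified by exactly one classifier.
    Assumed convention: n01 = only f1 wrong, n10 = only f2 wrong."""
    n01 = n10 = 0
    for y, p1, p2 in zip(y_true, pred_1, pred_2):
        if p1 != y and p2 == y:
            n01 += 1
        elif p1 == y and p2 != y:
            n10 += 1
    return n01, n10

def mcnemar_statistic(n01: int, n10: int) -> float:
    """(|n10 - n01| - 1)^2 / (n01 + n10), with continuity correction."""
    if n01 + n10 == 0:  # the classifiers never disagree
        return 0.0
    return (abs(n10 - n01) - 1) ** 2 / (n01 + n10)

def is_significant(n01: int, n10: int, critical: float = 6.64) -> bool:
    """True if the null hypothesis is rejected at level 0.01."""
    return mcnemar_statistic(n01, n10) > critical
```

For example, with 10 and 30 one-sided errors the statistic is 19^2 / 40 = 9.025, which exceeds 6.64, so the difference would be judged significant at the 0.01 level.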
Results on Reuters • Observation: the break-even point increases as the number of features grows. All schemes reach their maximum at the full vocabulary, and the best BEP, 0.9272, is achieved by the tf.rf scheme.
Significance Test Results on Reuters • '<' and '<<' denote "better than" at significance levels 0.01 and 0.001 respectively; '{}' denotes no significant difference
Results on 20 Newsgroups • Observation: the trends are not monotonically increasing. All schemes reach their maximum at a small vocabulary size, in the range 1000 to 3000. The best BEP, 0.6743, is achieved by the tf.rf scheme.
Discussion • To achieve a high break-even point, the two data sets require different vocabulary sizes. • Reuters : each category covers diverse subject matter with overlapping vocabularies, so a large vocabulary is required • 20 Newsgroups : each category covers a single narrow subject with a limited vocabulary, so 50-100 terms per category are sufficient.
Discussion • tf.rf performs significantly better than the other schemes on the two data sets. • The best break-even points on both are achieved by the tf.rf scheme, regardless of whether the category distribution is skewed or uniform. • The significance tests support this observation.
Discussion • We observe no evidence that the idf factor adds to a term's discriminating power for text categorization when combined with the tf factor. • Reuters : tf, logtf and ITF achieve higher break-even points than the schemes combined with idf (tf.idf, logtf.idf and tf.idf-prob). • 20 Newsgroups : the differences between tf alone, idf alone and both combined are not significant • Hence, the idf factor adds no discriminating power and may even decrease it.
Discussion • Binary and tf.chi perform consistently worse than the other schemes. • The binary scheme ignores frequency information, which is crucial for representing the content of a document • Feature selection metrics such as chi^2 involve the d value, where d >> a, b, c. The d value dominates the chi^2 value, which may therefore not appropriately express the term's discriminating power.
Discussion • Notably, the ITF scheme performs comparably well on both data sets but is still worse than the tf.rf scheme
Conclusions • Our newly proposed tf.rf performs significantly better than the other schemes on two widely used data sets with different category distributions • The schemes based on tf alone (tf, logtf, ITF) perform rather well but are still worse than the tf.rf scheme
Conclusions • The idf and chi^2 factors, which take the collection distribution into consideration, do not improve the term's discriminating power for categorization and may even decrease it. • Binary and tf.chi significantly underperform the other schemes.