Evaluation of Decision Forests on Text Categorization
Text Categorization • Text Collection • Feature Extraction • Classification • Evaluation
Text Collection • Reuters • Newswires from Reuters in 1987 • Training set: 9603 • Test set: 3299 • Categories: 95 • OHSUMED • Abstracts from medical journals • Training set: 12327 • Test set: 3616 • Categories: 75 (within Heart Disease subtree)
Feature Extraction • Stop Word Removal • 430 stop words • Stemming • Porter's stemmer • Term Selection by Document Frequency • Category-independent selection • Category-dependent selection • Feature Weighting • TF-IDF (sketch below)
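A minimal sketch of this pipeline in Python, assuming NLTK's Porter stemmer. The stop list is a placeholder for the 430-word list, and the document-frequency threshold is an illustrative value, not one taken from the experiment:

```python
import math
from collections import Counter

from nltk.stem.porter import PorterStemmer  # Porter's stemmer, as in the slides

# Placeholder; the experiment uses a 430-word stop list not reproduced here.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}
stemmer = PorterStemmer()

def preprocess(doc):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    return [stemmer.stem(t) for t in doc.lower().split() if t not in STOP_WORDS]

def select_by_df(tokenized_docs, min_df=3):
    """Category-independent term selection: keep terms appearing in at
    least min_df documents (the threshold is an assumption)."""
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks))
    return {t for t, n in df.items() if n >= min_df}

def tfidf_vectors(tokenized_docs, vocab):
    """Weight each selected term by term frequency * inverse document frequency."""
    n = len(tokenized_docs)
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks) & vocab)
    return [{t: cnt * math.log(n / df[t])
             for t, cnt in Counter(toks).items() if t in vocab}
            for toks in tokenized_docs]
```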
Classification • Method • Each document may belong to multiple categories • Treat each category as a separate binary classification problem (one-vs-rest; sketch below) • Classifiers • kNN (k-Nearest Neighbor) • C4.5 (Quinlan) • Decision Forest
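A hedged sketch of the one-vs-rest reduction; the classifier interface (`fit`/`predict_proba`) and helper names are illustrative, not from the paper:

```python
def train_one_vs_rest(X, label_sets, categories, make_classifier):
    """Reduce the multi-label task to one binary problem per category.

    X: feature vectors; label_sets: one set of categories per document;
    make_classifier: factory returning an object with fit/predict_proba.
    """
    models = {}
    for cat in categories:
        # Binary target: 1 if the document carries this category, else 0.
        y = [1 if cat in labels else 0 for labels in label_sets]
        clf = make_classifier()
        clf.fit(X, y)
        models[cat] = clf
    return models

def assign_categories(models, x, threshold=0.5):
    """A document receives every category whose binary classifier fires."""
    return {cat for cat, clf in models.items()
            if clf.predict_proba(x) >= threshold}
```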
C4.5 • A method for building decision trees • Training • Grow the tree by splitting the data set • Prune the tree back to prevent over-fitting • Testing • A test vector goes down the tree and arrives at a leaf • The probability that the vector belongs to each category is estimated at that leaf
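A minimal sketch of the testing step, assuming a node layout with a split feature, a threshold, and class-probability estimates stored at the leaves; the C4.5 growing and pruning phases are omitted:

```python
class Node:
    """One node of an already grown and pruned decision tree."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, class_probs=None):
        self.feature = feature          # splitting feature index (internal nodes)
        self.threshold = threshold      # split threshold
        self.left = left                # subtree for x[feature] <= threshold
        self.right = right              # subtree for x[feature] > threshold
        self.class_probs = class_probs  # {category: probability} at leaves

def classify(tree, x):
    """Send the test vector down the tree; the leaf it reaches supplies
    the estimated probability of each category."""
    node = tree
    while node.class_probs is None:  # descend until a leaf is reached
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.class_probs
```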
Decision Forest • Consists of many decision trees combined by averaging the class-probability estimates at their leaves. • Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space. • An oblique hyperplane is used as the discriminator at each internal node of the trees.
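A sketch of the random-subspace construction and the averaging rule. `build_tree` is a stand-in for the tree-growing step (which, per the slide, splits on oblique hyperplanes rather than single features), `classify` can be any per-tree routine like the one in the C4.5 sketch, and the tree count and subspace size are illustrative:

```python
import random

def build_forest(X, y, n_features, n_trees=20, subspace_dim=100, build_tree=None):
    """Grow each tree in a randomly chosen coordinate subspace of the
    feature space (tree count and subspace size are assumptions)."""
    forest = []
    for _ in range(n_trees):
        dims = sorted(random.sample(range(n_features), subspace_dim))
        projected = [[x[d] for d in dims] for x in X]   # project onto the subspace
        forest.append((dims, build_tree(projected, y)))
    return forest

def forest_probs(forest, x, classify):
    """Average the class-probability estimates from the leaves of all trees."""
    avg = {}
    for dims, tree in forest:
        for cat, p in classify(tree, [x[d] for d in dims]).items():
            avg[cat] = avg.get(cat, 0.0) + p / len(forest)
    return avg
```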
Why choose these 3 classifiers? • We do not have a parametric model for the problem (we cannot assume Gaussian distributions, etc.) • kNN and the decision tree (C4.5) are the most popular nonparametric classifiers; we use them as baselines for comparison • We expect the decision forest to do well, since previous studies show it performs well on high-dimensional problems like this one
Evaluation • Measurements (a = true positives, b = false positives, c = false negatives) • Precision p = a / (a+b) • Recall r = a / (a+c) • F1 value F1 = 2rp / (r+p) • Tradeoff between Precision and Recall • kNN tends to have higher precision than recall, especially as k grows.
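A quick executable check of these formulas; the counts below are made up for illustration:

```python
def f1_score(a, b, c):
    """a = true positives, b = false positives, c = false negatives."""
    p = a / (a + b)            # precision
    r = a / (a + c)            # recall
    return 2 * r * p / (r + p)

# Example: 80 correct assignments, 20 spurious, 40 missed
print(f1_score(80, 20, 40))   # p = 0.80, r ≈ 0.67, F1 ≈ 0.73
```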
Averaging scores • Macro-averaging • Calculate precision/recall for each category • Average all the precision/recall values • Assigns equal weight to each category • Micro-averaging • Sum up the classification decisions over all documents • Calculate precision/recall from the summed counts • Assigns equal weight to each document • Micro-averaging was used in the experiments because the number of documents per category varies considerably.
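The two averages differ only in when the counts are pooled, as this sketch shows (reusing `f1_score` from the previous snippet; `counts` holds one (a, b, c) triple per category):

```python
def macro_f1(counts):
    """Macro: score each category, then average
    -> equal weight per category."""
    return sum(f1_score(a, b, c) for a, b, c in counts) / len(counts)

def micro_f1(counts):
    """Micro: pool the raw counts over all categories first
    -> equal weight per classification decision."""
    a = sum(row[0] for row in counts)
    b = sum(row[1] for row in counts)
    c = sum(row[2] for row in counts)
    return f1_score(a, b, c)
```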
Comparison between Classifiers • Decision Forest beats both C4.5 and kNN • In the category-dependent case, C4.5 beats kNN • In the category-independent case, kNN beats C4.5
Category-Dependent vs. Category-Independent Selection • For Decision Forest and C4.5, category-dependent selection is better than independent. • But for kNN, category-independent selection is better than dependent. • No obvious explanation was found.
Reuters vs. OHSUMED • All classifiers degrade from Reuters to OHSUMED • kNN degrades more (26%) than C4.5 (12%) and DF (12%)
Reuters vs. OHSUMED • OHSUMED is a harder problem because documents are more evenly distributed across categories • This even distribution hurts kNN's recall more than the other classifiers', because more competing classes fall within the fixed-size neighborhood.
Conclusion • Decision Forest is substantially better than C4.5 and kNN for text categorization • Comparison with results for other classifiers outside this experiment is difficult, because • Different ways of splitting the training/test set • Different term selection methods