
Evaluation of Decision Forests on Text Categorization

This study evaluates the performance of Decision Forests on text categorization using two text collections, Reuters and OHSUMED. Both collections are preprocessed with stop word removal, stemming, and term selection before TF × IDF feature extraction. Three classifiers (kNN, C4.5, and Decision Forest) are compared on precision, recall, and F1 value. The results show that Decision Forest outperforms both C4.5 and kNN under both category-dependent and category-independent term selection. All classifiers degrade when moving from Reuters to OHSUMED, with kNN affected the most.


Presentation Transcript


  1. Evaluation of Decision Forests on Text Categorization

  2. Text Categorization • Text Collection • Feature Extraction • Classification • Evaluation

  3. Text Collection • Reuters • Newswires from Reuters in 1987 • Training set: 9603 • Test set: 3299 • Categories: 95 • OHSUMED • Abstracts from medical journals • Training set: 12327 • Test set: 3616 • Categories: 75 (within Heart Disease subtree)

  4. Feature Extraction • Stop Word Removal • 430 stop words • Stemming • Porter’s stemmer • Term Selection • by Document Frequency • Category independent selection • Category dependent selection • Feature Extraction • TF × IDF
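
The pipeline on this slide (stop word removal, Porter stemming, document-frequency term selection, then TF × IDF weighting) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the short stop list stands in for the 430-word list, and the min_df cutoff is an assumed value, since the transcript does not give the actual selection threshold.

    import math
    from collections import Counter
    from nltk.stem import PorterStemmer  # Porter's stemmer, as named on the slide

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # stand-in for the 430-word list
    stemmer = PorterStemmer()

    def preprocess(text):
        # Lowercase, drop stop words, then stem the surviving tokens.
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        return [stemmer.stem(t) for t in tokens]

    def tfidf_vectors(docs, min_df=2):
        # Category-independent term selection: keep terms whose document
        # frequency across the collection is at least min_df (assumed cutoff).
        tokenized = [preprocess(d) for d in docs]
        df = Counter(t for toks in tokenized for t in set(toks))
        vocab = {t for t, n in df.items() if n >= min_df}
        n = len(docs)
        vectors = []
        for toks in tokenized:
            tf = Counter(t for t in toks if t in vocab)
            vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})  # TF x IDF
        return vectors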

  5. Classification • Method • Each document may belong to multiple categories • Treating each category as a separate classification problem • Binary classification • Classifiers • kNN (k Nearest Neighbor) • C4.5 (Quinlan) • Decision Forest
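
Since each document may belong to several categories, the slide reduces the task to one binary classifier per category. Below is a minimal sketch of that reduction, assuming a scikit-learn-style fit/predict_proba interface; make_classifier is a hypothetical factory that could produce a kNN, C4.5-style tree, or forest model.

    def train_binary_classifiers(X, labels, categories, make_classifier):
        # labels[i] is the set of categories document i belongs to.
        models = {}
        for cat in categories:
            y = [int(cat in doc_labels) for doc_labels in labels]  # in-category vs. not
            models[cat] = make_classifier().fit(X, y)  # one binary problem per category
        return models

    def assign_categories(models, x, threshold=0.5):
        # A document receives every category whose binary classifier accepts it.
        return {cat for cat, model in models.items()
                if model.predict_proba([x])[0][1] >= threshold}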

  6. C4.5 • A method to build decision trees • Training • Grow the tree by splitting the data set • Prune the tree back to prevent over-fitting • Testing • Test vector goes down the tree and arrives at a leaf. • Probability that the vector belongs to each category is estimated.
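
The testing step described here, a vector walking down to a leaf that holds class-probability estimates, is easy to picture in code. The Node layout below is my own illustration, not C4.5's actual data structures:

    class Node:
        def __init__(self, feature=None, threshold=None, left=None, right=None,
                     class_probs=None):
            self.feature, self.threshold = feature, threshold
            self.left, self.right = left, right
            self.class_probs = class_probs  # set only at leaves

    def leaf_proba(node, x):
        # The test vector goes down the tree until it reaches a leaf;
        # the leaf holds the estimated probability of each category.
        while node.class_probs is None:
            node = node.left if x.get(node.feature, 0.0) <= node.threshold else node.right
        return node.class_probs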

  7. Decision Forest • Consisting of many decision trees combined by averaging the class probability estimates at the leaves. • Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space. • An oblique hyperplane is used as a discriminator at each internal node of the trees.
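
The two ingredients on this slide, a random coordinate subspace per tree and averaging of leaf probability estimates, can be sketched as follows. The oblique-hyperplane splits are left inside the hypothetical train_tree / tree_proba hooks, since the transcript gives no detail on how they are fit.

    import random

    def train_forest(X, y, n_features, train_tree, n_trees=20, subspace_size=100):
        # Grow each tree in a randomly chosen coordinate subspace of the feature space.
        forest = []
        for _ in range(n_trees):
            dims = random.sample(range(n_features), subspace_size)
            X_sub = [[row[j] for j in dims] for row in X]  # project onto the subspace
            forest.append((dims, train_tree(X_sub, y)))
        return forest

    def forest_proba(forest, x, tree_proba):
        # Combine the trees by averaging the class-probability
        # estimates produced at their leaves.
        per_tree = [tree_proba(tree, [x[j] for j in dims]) for dims, tree in forest]
        return [sum(p) / len(per_tree) for p in zip(*per_tree)]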

  8. Why choose these 3 classifiers? • We do not have a parametric model for the problem (we cannot assume Gaussian distributions, etc.). • kNN and the decision tree (C4.5) are the most popular nonparametric classifiers; we use them as baselines for comparison. • We expect Decision Forest to do well, since previous studies show it performs well on high-dimensional problems such as this one.

  9. Evaluation • Measurements (a = true positives, b = false positives, c = false negatives) • Precision p = a / (a + b) • Recall r = a / (a + c) • F1 value F1 = 2rp / (r + p) • Tradeoff between Precision and Recall • kNN tends to have higher precision than recall, especially as k becomes larger.
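
A direct transcription of the slide's formulas, where a, b, and c come from each category's contingency table:

    def f1_score(a, b, c):
        # a = true positives, b = false positives, c = false negatives
        p = a / (a + b) if a + b else 0.0  # precision
        r = a / (a + c) if a + c else 0.0  # recall
        return 2 * r * p / (r + p) if r + p else 0.0

    print(f1_score(8, 2, 4))  # p = 0.80, r = 0.67 -> F1 ~ 0.73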

  10. Averaging scores • Macro-averaging • Calculate precision/recall for each category • Average all the precision/recall values • Assigns equal weight to each category • Micro-averaging • Sum up the classification decisions over all documents • Calculate precision/recall from the summed counts • Assigns equal weight to each document • Micro-averaging was used in the experiment because the number of documents per category varies considerably.
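
The difference between the two schemes comes down to when the counts are summed. A small sketch, reusing the f1_score helper from the previous sketch:

    def macro_f1(counts):
        # counts: one (a, b, c) triple per category.
        # Average the per-category F1 values: every category weighs the same.
        return sum(f1_score(a, b, c) for a, b, c in counts) / len(counts)

    def micro_f1(counts):
        # Sum the counts over all categories first, then compute F1 once:
        # every classification decision weighs the same.
        A, B, C = (sum(col) for col in zip(*counts))
        return f1_score(A, B, C)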

  11. Performance in F1 Value

  12. Comparison between Classifiers • Decision Forest better than C4.5 and kNN • In the category-dependent case, C4.5 better than kNN • In the category-independent case, kNN better than C4.5

  13. Category-Dependent vs. Category-Independent Selection • For Decision Forest and C4.5, category-dependent selection is better than independent. • But for kNN, category-independent selection is better than dependent. • No obvious explanation found.

  14. Reuters vs. OHSUMED • All classifiers degrade from Reuters to OHSUMED • kNN degrades more (26%) than C4.5 (12%) and Decision Forest (12%)

  15. Reuters vs. OHSUMED • OHSUMED is a harder problem because: • Documents are more evenly distributed across categories • This even distribution hurts kNN's recall more than the other classifiers', because more competing categories fall within the fixed-size neighborhood.

  16. Conclusion • Decision Forest is substantially better than C4.5 and kNN for text categorization • It is difficult to compare with classifier results from outside this experiment, because of • Different ways of splitting the training/test set • Different term selection methods
