
Hierarchical Classification of Documents with Error Control


Presentation Transcript


  1. Hierarchical Classification of Documents with Error Control • Chun-Hung Cheng, Jian Tang, Ada Wai-chee Fu, Irwin King

  2. Overview • Abstract • Problem Description • Document Classification Model • Error Control Schemes • Recovery oriented scheme • Error masking scheme • Experiments • Conclusion

  3. Abstract • Traditional document classification (flat classification) involves only a single classifier • Single classifier takes care of everything • Slow and high overhead

  4. Abstract • Hierarchical document classification • Class hierarchy • Use one classifier at each internal node

  5. Abstract • Advantage • Better performance • Disadvantage • A misclassification at any node leads to a wrong final result

  6. Abstract • Introduce error control mechanism • Approach 1 (recovery oriented) • Detect and correct misclassification • Approach 2 (error masking) • Mask errors by using multiple versions of classifiers

  7. Problem Description (diagram: the Training System takes the Class Taxonomy, the Training Documents, and the Class-doc Relation, a table with columns class | doc_id, and produces Statistics and Feature Terms)

  8. Problem Description (diagram: the Classification System takes Statistics, Feature Terms, and Incoming Documents and outputs the Target Class)

  9. Problem Description • Objective: Achieve • Higher accuracy • Fast performance • Our proposed algorithms provide a good trade-off between accuracy and performance

  10. Document Classification Model (figure: node c with child classes c1, c2, …, cn) • Formally, we use a model from [Chakrabarti et al. 1997] • Based on naive Bayesian network • For simplicity, we study a single node classifier.

  11. The probability that an incoming document d belongs to c is computed from: • z_{i,d}: number of occurrences of term i in the incoming document d • P_{j,c}: probability that a word in class c is term j (estimated using the training data)
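The formula itself did not survive in this transcript. As a minimal sketch, assuming the standard multinomial naive Bayes form (class prior times the product of per-term probabilities raised to the term counts, normalized over the child classes), the computation could look like the following. The function name `class_posteriors`, the dictionary layout, and the use of explicit priors are illustrative assumptions, not taken from the slides.

```python
import math

def class_posteriors(term_counts, term_probs, priors):
    """Return an estimate of P(c_i | d) for each child class c_i.

    term_counts: {term: z}, occurrences of each term in document d
    term_probs:  {class: {term: P}}, per-class word probabilities,
                 assumed to be smoothed so no probability is zero
    priors:      {class: prior probability} (hypothetical input)
    """
    log_scores = {}
    for c, prior in priors.items():
        # Work in log space to avoid underflow on long documents.
        log_scores[c] = math.log(prior) + sum(
            z * math.log(term_probs[c][t]) for t, z in term_counts.items()
        )
    # Normalize so the scores sum to 1 over the child classes.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

For example, with equal priors, a document containing "laser" twice and "gene" once, and per-class probabilities of 0.8/0.2 for c1 and 0.2/0.8 for c2, the sketch assigns c1 a posterior of 0.8.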

  12. Feature Selection • Previous formula involves all the terms • Feature selection reduces cost by using only the terms with good discriminating power • Use the training sets to identify the feature terms

  13. Fisher’s Index • Fisher’s Index indicates the discriminating power of a term • Good discriminating power: large interclass distance, small intraclass distance (figure: distributions of w(t) for classes c1 and c2, with the interclass and intraclass distances marked)
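The slide does not give the formula. One common form, assumed here, divides the summed squared differences between class means (interclass distance) by the summed within-class variances (intraclass distance) of a term's frequency. The function name `fisher_index` and the input layout are hypothetical.

```python
from itertools import combinations

def fisher_index(freqs_by_class):
    """Score one term's discriminating power.

    freqs_by_class: {class: [frequency of the term in each training doc]}
    """
    means = {c: sum(xs) / len(xs) for c, xs in freqs_by_class.items()}
    # Interclass distance: squared gap between every pair of class means.
    inter = sum((means[a] - means[b]) ** 2 for a, b in combinations(means, 2))
    # Intraclass distance: average squared deviation inside each class.
    intra = sum(
        sum((x - means[c]) ** 2 for x in xs) / len(xs)
        for c, xs in freqs_by_class.items()
    )
    if intra == 0:
        return float("inf") if inter > 0 else 0.0
    return inter / intra
```

Terms would then be ranked by this score and only the top-scoring ones kept as feature terms. A term whose frequency differs sharply between classes but varies little within each class scores high; a term with identical class means scores zero.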

  14. Document Classification Model (figure: node c with child classes c1, c2, …, cn) • Consider only feature terms in the classification function p(ci|c,d) • Pick the largest probability among all ci • Use one classifier in each internal node
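The top-down procedure on this slide can be sketched as follows. The names `classify_top_down`, `children`, and `classify_node` are illustrative assumptions; `classify_node` stands in for whatever per-node classifier returns p(ci|c,d) for the children of node c.

```python
def classify_top_down(doc, root, children, classify_node):
    """Walk from the root to a leaf, choosing the most probable child each time.

    children:      {node: [child, ...]} for internal nodes only
    classify_node: function (node, doc) -> {child: probability}
    Returns the list of nodes visited, ending at the predicted class.
    """
    path = [root]
    while children.get(path[-1]):
        probs = classify_node(path[-1], doc)
        # Pick the child with the largest probability among all ci.
        path.append(max(probs, key=probs.get))
    return path
```

Because each step commits to a single child, one wrong choice high in the tree cannot be undone further down, which is the weakness the error control schemes address.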

  15. Recovery Oriented Scheme • Database system • Failure in DBMS • Restart from a consistent state • Document classification • Error detected • Restart from a correct class (High Confidence Ancestor, or HCA)

  16. Recovery Oriented Scheme • In practice, • Rollback is slow • Identify wrong paths and avoid them • To identify wrong paths, • Define closeness indicator (CI) • A path is considered wrong when CI falls below a threshold
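The slides do not define CI or how the HCA is chosen, so the following is only a rough sketch under stated assumptions: CI is taken to be the largest child probability at the current node, the HCA is taken to be a fixed number of levels above the current node, and the branch leaving the HCA is banned after a recovery. All names and default values (`classify_with_recovery`, `ci_threshold`, `hca_distance`, `max_recoveries`) are hypothetical.

```python
def classify_with_recovery(doc, root, children, classify_node,
                           ci_threshold=0.5, hca_distance=2, max_recoveries=3):
    """Top-down classification that backs up when confidence drops."""
    path = [root]
    banned = set()      # (parent, child) edges already judged wrong
    recoveries = 0
    while children.get(path[-1]):
        node = path[-1]
        probs = {c: p for c, p in classify_node(node, doc).items()
                 if (node, c) not in banned}
        if not probs:
            break       # every child was ruled out; stop at this node
        best = max(probs, key=probs.get)
        ci = probs[best]  # ASSUMPTION: closeness indicator = top probability
        if ci < ci_threshold and recoveries < max_recoveries and len(path) > 1:
            # Treat the current path as wrong: return to the assumed HCA
            # and avoid the branch that led here.
            hca_index = max(0, len(path) - 1 - hca_distance)
            banned.add((path[hca_index], path[hca_index + 1]))
            path = path[:hca_index + 1]
            recoveries += 1
            continue
        path.append(best)
    return path
```

The recovery budget keeps the search bounded: once it is used up, the sketch falls back to plain greedy descent.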

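  17. Recovery Oriented Scheme (figure: class tree with the HCA and a wrong path marked; the distance between the HCA and the current node is defined as 2)

  18. Recovery Oriented Scheme (figure: class tree with two HCA nodes and a wrong path marked; the distance between the HCA and the current node is defined as 2)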

  19. Error Masking Scheme • Software Fault Tolerance • Run multiple versions of software • Majority voting • Document Classification • Run classifiers of different designs • Majority voting

  20. O-Classifier • Traditional classifier

  21. N-classifier • Skip some intermediate levels

  22. Error Masking Scheme • Run three classifiers in parallel • O-classifier • N-classifier • O-classifier using new feature length • This selection minimizes the time wasted waiting for the slowest classifier
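The voting step can be sketched as below. The slides do not say what happens when all three classifiers disagree, so falling back to one designated classifier is an assumption here, as are the names `majority_vote` and `fallback_index`.

```python
from collections import Counter

def majority_vote(predictions, fallback_index=0):
    """Return the class predicted by more than half of the classifiers.

    predictions:    list of class labels, one per classifier
    fallback_index: which classifier to trust when there is no majority
                    (ASSUMPTION: not specified in the slides)
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes > len(predictions) / 2:
        return label
    return predictions[fallback_index]
```

With three classifiers, a single misclassification is masked whenever the other two agree on the correct class.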

  23. Experiments • Data Sets • US Patents • Preclassified • Rich text content • Highly hierarchical • 3 Sets Collected • 3 levels/large number of documents • 4 levels/large number of documents • 7 levels/small number of documents

  24. Experiments • Algorithms compared • Simple hierarchical • TAPER • Flat • Recovery oriented • Error masking • Generally, • flat is the slowest and the most accurate • simple hierarchical is the fastest and the least accurate

  25. Accuracy: 3 levels/large

  26. Accuracy: 4 levels/large

  27. Accuracy: 7 levels/small

  28. Performance: 3 levels/large

  29. Performance: 4 levels/large

  30. Performance: 7 levels/small

  31. Conclusion • Real-life application • Large taxonomy • Flat classification is too slow • Our algorithm is faster than flat classification with as few as 4 levels • Performance gain widens as the number of levels increases • A good trade-off between accuracy and performance for most applications

  32. Thank You The End
