
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rajeev Rastogi Kyuseok Shim. Presented by: Alon Keinan. Presentation layout. Introduction: Classification and Decision Trees Decision Tree Building Algorithms SPRINT & MDL PUBLIC Performance Comparison


Presentation Transcript


  1. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Rajeev Rastogi Kyuseok Shim Presented by: Alon Keinan

  2. Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT & MDL • PUBLIC • Performance Comparison • Conclusions

  3. Introduction: Classification • Classification in data mining: • Training sample set • Classifying future records • Techniques: Bayesian classification, neural networks, genetic algorithms, decision trees, …

  4. Introduction: Decision Trees training

  5. Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT & MDL • PUBLIC • Performance Comparison • Conclusions

  6. Decision Tree Building Algorithms • 2 phases: • The building phase • The pruning phase • The building phase constructs a “perfect” tree • The pruning phase prevents “overfitting”

  7. Building Phase Algorithms • Differ in the selection of the test criterion for partitioning • CLS • ID3 & C4.5 • CART, SLIQ & SPRINT • Differ in their ability to handle large training sets • All consider “guillotine-cut” only

  8. Pruning Phase Algorithms • MDL – Minimum Description Length • Cost-Complexity Pruning

  9. Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT & MDL • PUBLIC • Performance Comparison • Conclusions

  10. SPRINT • Initialize root node • Initialize queue Q to contain root node • While Q is not empty do • dequeue the first node N in Q • if N is not pure • for each attribute, evaluate splits • use the least-entropy split to split node N into N1 and N2 • append N1 and N2 to Q
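The queue-driven loop on this slide can be sketched in a few lines of Python. This is a minimal illustration only, assuming small in-memory data with numeric attributes and binary threshold splits; it leaves out SPRINT's attribute lists and its handling of large, disk-resident training sets. The names `entropy`, `best_split`, and `build_tree`, and the dict-based node layout, are hypothetical.

```python
from collections import Counter, deque
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted_entropy, attr, threshold) of the least-entropy
    binary split, or None if no attribute separates any records."""
    n, best = len(rows), None
    for attr in range(len(rows[0])):
        values = sorted({row[attr] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[attr] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[attr] > threshold]
            e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if best is None or e < best[0]:
                best = (e, attr, threshold)
    return best

def build_tree(rows, labels):
    """Breadth-first building phase: split every impure node in the queue."""
    root = {"rows": rows, "labels": labels}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(set(node["labels"])) <= 1:
            continue  # pure node: stays a leaf
        split = best_split(node["rows"], node["labels"])
        if split is None:
            continue  # identical records with mixed labels: stays a leaf
        _, attr, threshold = split
        node["attr"], node["threshold"] = attr, threshold
        left = {"rows": [], "labels": []}
        right = {"rows": [], "labels": []}
        for row, y in zip(node["rows"], node["labels"]):
            child = left if row[attr] <= threshold else right
            child["rows"].append(row)
            child["labels"].append(y)
        node["left"], node["right"] = left, right
        queue.extend([left, right])
    return root
```

For example, `build_tree([(1, 10), (2, 20), (3, 10), (4, 20)], ["a", "a", "b", "b"])` splits the root on attribute 0 at threshold 2.5 and leaves two pure leaves.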

  11. Entropy
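The formula on this slide did not survive in the transcript. As a reference point, the standard definition of entropy for a record set, and the weighted entropy of a binary split, are given below. The symbols ($S$, $n$, $n_j$, $k$, $S_1$, $S_2$) are assumptions chosen here, not necessarily the slide's own notation.

```latex
% Entropy of a set S with n records, n_j of which belong to class j (k classes):
E(S) = -\sum_{j=1}^{k} \frac{n_j}{n} \log_2 \frac{n_j}{n}

% Entropy of a split of S into S_1 (n_1 records) and S_2 (n_2 records), n = n_1 + n_2:
E(S_1, S_2) = \frac{n_1}{n} E(S_1) + \frac{n_2}{n} E(S_2)
```

A set with two records of one class and two of another has $E(S) = 1$ bit; a split that separates the two classes perfectly has $E(S_1, S_2) = 0$.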

  12. MDL • The best tree is the one that can be encoded using the fewest number of bits • Cost of encoding data records: • Cost of encoding tree: • The structure of the tree • The splits • The classes in the leaves

  13. Pruning algorithm • computeCost&Prune(Node N) • If N is a leaf return (C(S)+1) • minCostLeft:=computeCost&Prune(Nleft) • minCostRight:=computeCost&Prune(Nright) • minCost:=min{C(S)+1,Csplit(N)+1+minCostLeft+minCostRight} • If minCost=C(S)+1 • Prune child nodes Nleft and Nright • return minCost
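The recursive procedure on this slide translates almost directly into Python. The sketch below assumes the encoding costs C(S) and Csplit(N) have already been computed and stored on each node; the `Node` class and its field names are hypothetical, and the extra `+ 1` is the one bit that records whether a node is a leaf or an internal node.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    leaf_cost: float                 # C(S): bits to encode the records at this node
    split_cost: float = 0.0          # Csplit(N): bits to encode the split test
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def compute_cost_and_prune(node: Node) -> float:
    """Return the minimum MDL cost of the subtree rooted at node,
    pruning its children whenever a leaf is at least as cheap."""
    if node.left is None:            # leaf
        return node.leaf_cost + 1
    min_cost_left = compute_cost_and_prune(node.left)
    min_cost_right = compute_cost_and_prune(node.right)
    as_leaf = node.leaf_cost + 1
    as_split = node.split_cost + 1 + min_cost_left + min_cost_right
    if as_leaf <= as_split:
        node.left = node.right = None    # prune both children
        return as_leaf
    return as_split
```

Children are processed before their parent, so a subtree is always evaluated at its cheapest pruned form before the parent decides whether to keep it.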

  14. Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT & MDL • PUBLIC • Performance Comparison • Conclusions

  15. PUBLIC • PUBLIC = PrUning and BuiLding Integrated in Classification • Uses SPRINT for building • Prunes periodically during building! • Basically uses MDL for pruning • Distinguishes three types of leaves: • “not expandable” • “pruned” • “yet to be expanded” • Produces exactly the same tree as building followed by pruning

  16. Lower Bound Computation • PUBLIC(1) • Bound=1 • PUBLIC(S) • Incorporating split costs • PUBLIC(V) • Incorporating split values
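A minimal sketch of how pruning changes when it runs on a partially built tree: a “yet to be expanded” leaf contributes a lower bound on the cost of its future subtree (1 for PUBLIC(1)), while “pruned” and “not expandable” leaves contribute C(S)+1 as before. The `Node` class, the `state` strings, and the `lower_bound` callback are hypothetical names; a full implementation would also remove pruned nodes from the build queue, which is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    leaf_cost: float                     # C(S)
    split_cost: float = 0.0              # Csplit(N)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    state: str = "yet to be expanded"    # leaf state; ignored for internal nodes

def public_prune(node: Node,
                 lower_bound: Callable[[Node], float] = lambda n: 1.0) -> float:
    """Pruning pass over a partially built tree (PUBLIC(1) bound by default)."""
    if node.left is None:
        if node.state == "yet to be expanded":
            return lower_bound(node)     # optimistic cost of the unbuilt subtree
        return node.leaf_cost + 1        # "pruned" or "not expandable" leaf
    min_cost_left = public_prune(node.left, lower_bound)
    min_cost_right = public_prune(node.right, lower_bound)
    as_leaf = node.leaf_cost + 1
    as_split = node.split_cost + 1 + min_cost_left + min_cost_right
    if as_leaf <= as_split:
        node.left = node.right = None
        node.state = "pruned"            # never expand this node again
        return as_leaf
    return as_split
```

Because the bound never overestimates the true subtree cost, a node pruned here would also have been pruned after full construction, which is why the final tree is unchanged.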

  17. PUBLIC(S) • Calculates a lower bound for s = 0, …, k-1 • For s=0: C(S)+1 • For s>0: • Takes the minimum of the bounds • Computes by iterative addition • Complexity: O(k log k)
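The bound for s>0 is missing from the transcript. The sketch below assumes the form 2s + 1 + s·log₂(a) + Σ_{i=s+2..k} n_i, where n_1 ≥ … ≥ n_k are the class counts at the node in decreasing order and a is the number of attributes; this is my reading of the PUBLIC(S) bound and should be checked against the original paper. The function and parameter names are hypothetical.

```python
from math import log2

def public_s_lower_bound(class_counts, num_attrs, leaf_cost):
    """Lower bound on the cost of any subtree rooted at a node.

    class_counts: number of records per class at the node
    num_attrs:    a, the number of attributes
    leaf_cost:    C(S), the cost of encoding the node's records as a leaf
    """
    counts = sorted((c for c in class_counts if c > 0), reverse=True)
    k = len(counts)
    if k <= 1:
        return leaf_cost + 1                  # pure node: only s = 0 applies
    per_split = 2 + log2(num_attrs)           # assumed cost added by each extra split
    # s = 1: 2 + 1 + log2(a) + n_3 + ... + n_k  (counts[2:] in 0-based indexing)
    bound = per_split + 1 + sum(counts[2:])
    best = min(leaf_cost + 1, bound)          # s = 0 versus s = 1
    for s in range(2, k):
        # Moving from s-1 to s splits adds one split and drops n_{s+1}.
        bound += per_split - counts[s]
        best = min(best, bound)
    return best
```

Sorting dominates the running time, and each further value of s is obtained from the previous one by a single addition, which matches the O(k log k) figure on the slide.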

  18. PUBLIC(V) • PUBLIC(S) estimates each split as log(a) • PUBLIC(V) estimates each split as log(a), plus the encoding of the splitting value(s) • Complexity: O(k*(log k + a))

  19. Lower Bound Computation Summary

  20. Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT & MDL • PUBLIC • Performance Comparison • Conclusions

  21. Performance Comparisons • Algorithms: • SPRINT • PUBLIC(1) • PUBLIC(S) • PUBLIC(V) • Data sets: • Real-life • Synthetic

  22. Real-life Data Sets

  23. Synthetic Data Sets

  24. Noise

  25. Other Parameters • No. of Attributes • No. of Classes • Size of training set

  26. Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT & MDL • PUBLIC • Performance Comparison • Conclusions

  27. Conclusion • The pruning is integrated into the building phase • Computing lower bounds of the cost of “yet to be expanded” leaves • Improved performance • Open: • How often to invoke the pruning procedure? • Expanding other algorithms … • Developing a tighter lower bound…
