270 likes | 487 Views
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rajeev Rastogi Kyuseok Shim. Presented by: Alon Keinan. Presentation layout. Introduction: Classification and Decision Trees Decision Tree Building Algorithms SPRINT & MDL PUBLIC Performance Comparison
E N D
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning Rajeev Rastogi Kyuseok Shim Presented by: Alon Keinan
Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT &MDL • PUBLIC • Performance Comparison • Conclusions
Introduction: Classification • Classification in data mining: • Training sample set • Classifying future records • Techniques: Bayesian, NN, Genetic, decision trees …
Introduction: Decision Trees training
Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT &MDL • PUBLIC • Performance Comparison • Conclusions
Decision Tree Building Algorithms • 2 phases: • The building phase • The pruning phase • The building constructs a “perfect” tree • The pruning prevents “overfitting”
Building Phase Algorithms • Differ in the selection of the test criterion for partitioning • CLS • ID3 & C4.5 • CART, SLIQ & SPRINT • Differ in their ability to handle large training sets • All consider “guillotine-cut” only
Pruning Phase Algorithms • MDL – Minimum Description Length • Cost-Complexity Pruning
Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT &MDL • PUBLIC • Performance Comparison • Conclusions
SPRINT • Initialize root node • Initialize queue Q to contain root node • While Q is not empty do • dequeue the first node N in Q • if N is not pure • for each attribute evaluate splits • use least entropy splitto split node N into N1 and N2 • append N1 and N2 to Q
MDL • The best tree is the one that can be encoded using the fewest number of bits • Cost of encoding data records: • Cost of encoding tree: • The structure of the tree • The splits • The classes in the leaves
Pruning algorithm • computeCost&Prune(Node N) • If N is a leaf return (C(S)+1) • minCostLeft:=computeCost&Prune(Nleft) • minCostRight:=computeCost&Prune(Nright) • minCost:=min{C(S)+1,Csplit(N)+1+minCostLeft+minCostRight} • If minCost=C(S)+1 • Prune child nodes Nleft and Nright • return minCost
Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT &MDL • PUBLIC • Performance Comparison • Conclusions
PUBLIC • PUBLIC = PrUning and BuiLding Integrated in Classification • Uses SPRINT for building • Prune periodically !!! • Basically uses MDL for pruning • Distinguished three types of leaves: • “not expandable” • “pruned” • “yet to be expanded” • Exact same tree
Lower Bound Computation • PUBLIC(1) • Bound=1 • PUBLIC(S) • Incorporating split costs • PUBLIC(V) • Incorporating split values
PUBLIC(S) • Calculates a lower bound for s=0,..,k-1 • For s=0: C(S)+1 • For s>0: • Takes the minimum of the bounds • Computes by iterative addition • O(klogk)
PUBLIC(V) • PUBLIC(S) estimates each split as log(a) • PUBLIC(V) estimates each split as log(a), plus the encoding of the splitting value\s • Complexity: O(k*(logk+a))
Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT &MDL • PUBLIC • Performance Comparison • Conclusions
Performance Comparisons • Algorithms: • SPRINT • PUBLIC(1) • PUBLIC(S) • PUBLIC(V) • Data sets: • Real-life • Synthetic
Other Parameters • No. of Attributes • No. of Classes • Size of training set
Presentation layout • Introduction: Classification and Decision Trees • Decision Tree Building Algorithms • SPRINT &MDL • PUBLIC • Performance Comparison • Conclusions
Conclusion • The pruning is integrated into the building phase • Computing lower bounds of the cost of “yet to be expanded” leaves • Improved performance • Open: • How often to invoke the pruning procedure? • Expanding other algorithms … • Developing a tighter lower bound…