Classification with Multiple Decision Trees CV-2003 Eran Shimonovitz Ayelet Akselrod-Ballin
Plan • Basic framework: query selection, impurity, stopping … • Intermediate summary • Combining multiple trees • Y. Amit & D. Geman's approach • Randomization, Bagging, Boosting • Applications
Introduction A general classifier: Uses measurements made on the object to assign the object to a category
Some popular classification methods • Nearest Neighbor Rule • Bayesian Decision Theory • Fisher Linear Discriminant • SVM • Neural Network
Formulation • x: a measurement vector (x1, x2, …, xd) ∈ X, pre-computed for each data point • C = {1, …, J}: the set of J classes; the true class of x is labeled Y(x) • L = {(x1, y1), …, (xN, yN)}: the learning sample • Data patterns can be • Ordered: numerical, real-valued • Categorical: a nominal list of attributes A classification rule is a function Ŷ defined on X such that for every x, Ŷ(x) is equal to one of {1, …, J}.
The goal is to construct a classifier Ŷ such that the misclassification probability P(Ŷ(X) ≠ Y(X)) is as small as possible.
Basic framework CART – classification and regression trees (Breiman and colleagues, 1984). Trees are constructed by repeated splits of subsets of X into descendant subsets. (Figure: a tree with a root node, sub-trees and leaf nodes.)
Split Number: binary / multi-valued. Every tree can be represented using only binary decisions. (Duda, Hart & Stork 2001)
Query selection & Impurity • P(ωj), more precisely P(ωj | T): the fraction of patterns at node T in category ωj • An impurity Φ is a nonnegative function defined on the set of all J-tuples (p1, …, pJ) satisfying pj ≥ 0 and Σj pj = 1 (Figure: example distributions (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), (1/3, 1/3, 1/3, 0, 0, 0) and (0, 0, 0, 1/3, 1/3, 1/3).)
Impurity properties • Φ is maximal when all categories are equally represented: p1 = … = pJ = 1/J • Φ = 0 if all patterns that reach the node bear the same category • Φ is a symmetric function of pω1, …, pωJ Given Φ, define the impurity measure i(T) at any node T: i(T) = Φ(P(ω1 | T), …, P(ωJ | T))
Entropy impurity – example: the root node holds 16 points with class distribution (8/16, 8/16); the split X1<0.6 sends 10/16 of the points to one child with distribution (7/10, 3/10) and impurity i(T)=0.88, and 6/16 to the other with distribution (1/6, 5/6) and impurity i(T)=0.65.
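As a check on the figure's numbers, a short worked computation using the standard entropy impurity i(T) = −Σj P(ωj) log2 P(ωj):

    i(T_L) = -\tfrac{7}{10}\log_2\tfrac{7}{10} - \tfrac{3}{10}\log_2\tfrac{3}{10} \approx 0.88,
    \qquad
    i(T_R) = -\tfrac{1}{6}\log_2\tfrac{1}{6} - \tfrac{5}{6}\log_2\tfrac{5}{6} \approx 0.65.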
Entropy impurity – example tree (Figure: the full tree grown with entropy impurity, with internal splits X1<0.6, X2<0.32, X2<0.61, X1<0.35, X1<0.69 and leaves labeled w1 / w2.)
Other Impurity Functions • Variance impurity (two classes): i(T) = P(ω1)·P(ω2) • Gini impurity: i(T) = Σ_{i≠j} P(ωi)·P(ωj) = 1 − Σj P(ωj)² • Misclassification impurity: i(T) = 1 − maxj P(ωj) In practice the choice of impurity measure does not noticeably affect the overall performance.
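A minimal sketch of these impurity measures in Python (the function names are illustrative, not from the slides):

    import numpy as np

    def entropy_impurity(p):
        """Entropy impurity of a class-probability vector p (0*log 0 treated as 0)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def gini_impurity(p):
        """Gini impurity: 1 - sum_j P(w_j)^2."""
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def misclassification_impurity(p):
        """Misclassification impurity: 1 - max_j P(w_j)."""
        return 1.0 - float(np.max(np.asarray(p, dtype=float)))

For instance, entropy_impurity([7/10, 3/10]) returns roughly 0.88, matching the example above.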
Goodness of split • Defined as the decrease in impurity: Δi(s,T) = i(T) − PL·i(TL) − PR·i(TR), where PL and PR are the fractions of the patterns at T sent to the left descendant TL and the right descendant TR • With entropy impurity this is the reduction in the conditional entropy of the class given the split • Select the split s that maximizes Δi(s,T) • Greedy method: local optimization (Figure: a node t split into tL and tR with proportions PL and PR.)
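A sketch of Δi(s,T), reusing entropy_impurity from the previous sketch (class_probs is an illustrative helper name):

    import numpy as np

    def class_probs(labels):
        """Empirical class distribution of the labels at a node."""
        _, counts = np.unique(labels, return_counts=True)
        return counts / counts.sum()

    def goodness_of_split(parent_y, left_y, right_y, impurity=entropy_impurity):
        """Delta-i(s, T) = i(T) - P_L * i(T_L) - P_R * i(T_R)."""
        p_l = len(left_y) / len(parent_y)
        p_r = len(right_y) / len(parent_y)
        return (impurity(class_probs(parent_y))
                - p_l * impurity(class_probs(left_y))
                - p_r * impurity(class_probs(right_y)))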
Entropy formulation • The vector of predictors X = (X1, …, Xd) is assumed binary • For each predictor f, calculate the conditional entropy of the class given Xf
Stopping Criteria Trade-off: growing the tree fully until minimum impurity leads to overfitting; stopping the splitting too early leaves the error insufficiently low. • The best candidate split at a node reduces the impurity by less than a threshold • Lower bound on the number / percentage of points at a node • Validation & cross-validation • Statistical significance of the impurity reduction
Recognizing Overfitting (Figure: accuracy on the training data and on the test data as a function of tree size, in number of nodes; the growing gap between the two curves reveals overfitting.)
Assignments of leaf labels • When leaf nodes have positive impurity, each leaf is labeled by the category that has the most points.
Recursive partitioning scheme • Select the attribute A that maximizes the impurity reduction [by estimating P(j|N) and i(N) at the node N] • For each possible value of A add a new branch • Below each new branch grow a sub-tree recursively on the points that reach it • If the stopping criterion is met (Y), stop and set the node label to the most common class; otherwise (N) continue splitting. A rough sketch of this scheme follows below.
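A rough sketch of the recursive scheme for binary splits on numeric features, reusing goodness_of_split from above; min_points, min_gain and the dictionary layout are illustrative choices, not part of the original scheme (X and y are assumed to be NumPy arrays):

    from collections import Counter
    import numpy as np

    def grow_tree(X, y, min_points=5, min_gain=1e-3):
        """Recursively split the data, keeping the split with the largest
        impurity decrease; stop when the node is small or no split helps."""
        label = Counter(y.tolist()).most_common(1)[0][0]   # most common class at the node
        best = None
        if len(y) >= min_points:
            for f in range(X.shape[1]):
                for threshold in np.unique(X[:, f]):
                    mask = X[:, f] < threshold
                    if not mask.any() or mask.all():
                        continue
                    gain = goodness_of_split(y, y[mask], y[~mask])
                    if best is None or gain > best[0]:
                        best = (gain, f, threshold, mask)
        if best is None or best[0] < min_gain:             # stopping criterion met
            return {"leaf": True, "label": label}
        _, f, threshold, mask = best
        return {"leaf": False, "feature": f, "threshold": threshold,
                "left": grow_tree(X[mask], y[mask], min_points, min_gain),
                "right": grow_tree(X[~mask], y[~mask], min_points, min_gain)}

    def classify(tree, x):
        """Send a point down the tree and return the leaf label."""
        while not tree["leaf"]:
            tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
        return tree["label"]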
(Figure: tree produced by the recursive partitioning scheme, with splits X2<0.83, X1<0.27, X1<0.89, X2<0.34, X2<0.56, X1<0.09, X1<0.56 and leaves labeled w1 / w2.)
Preprocessing – PCA (Figure: after PCA preprocessing, a single linear split -0.8X1+0.6X2<0.3 separates w1 from w2.)
Popular tree algorithms • ID3 – the third "interactive dichotomizer" (Quinlan 1983) • C4.5 – descendant of ID3 (Quinlan 1993) • C5.0
Pros & Cons
Pros: • Interpretability, good insight into the data structure • Rapid classification • Naturally multi-class • Low space complexity • Can be refined further without reconstructing • Natural to incorporate prior expert knowledge • …
Cons: • Instability: sensitivity to the training points, a result of the greedy process • Training time • Overtraining sensitivity • Difficult to understand if large • …
Combining multiple classification trees Main problem: stability. Small changes in the training set cause large changes in the classifier. Solution: grow multiple trees instead of just one and then combine their information. The aggregation produces a significant improvement in accuracy.
Protocols for generating multiple classifiers • Randomization of the queries at each node • Boosting: sequential reweighting, AdaBoost • Bagging: bootstrap aggregation Multiple trees
Y. Amit & D. Geman’s Approach Shape quantization and recognition with randomized trees, Neural computation, 1997. Shape recognition based on shape features & tree classifiers. The Goal: to select the informative shape features and build tree classifiers.
Randomization At each node: • Choose a random sample of predictors from the whole candidate collection • Estimate the optimal predictor using a random sample of data points • The sizes of these two random samples are parameters. (A sketch of this node-splitting step follows below.) Multiple trees
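A sketch of the randomized node-splitting step under the assumption of binary predictors, reusing goodness_of_split from the basic-framework section; the parameter names n_queries and n_points are illustrative stand-ins for the two sample sizes:

    import numpy as np

    def randomized_split(X, y, n_queries=20, n_points=200, rng=None):
        """Score only a random subset of the candidate queries, on a random
        subsample of the node's data, and return the best query found."""
        if rng is None:
            rng = np.random.default_rng()
        rows = rng.choice(len(y), size=min(n_points, len(y)), replace=False)
        queries = rng.choice(X.shape[1], size=min(n_queries, X.shape[1]), replace=False)
        Xs, ys = X[rows], y[rows]
        best_gain, best_q = -np.inf, None
        for q in queries:
            mask = Xs[:, q] == 1                 # binary query: is feature q present?
            if not mask.any() or mask.all():
                continue
            gain = goodness_of_split(ys, ys[mask], ys[~mask])
            if gain > best_gain:
                best_gain, best_q = gain, q
        return best_q                            # None if no informative query was sampled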
Multiple classification trees • Different trees correspond to different aspects of the shapes; they characterize them from "different points of view". • The trees are statistically weakly dependent due to the randomization.
Aggregation After producing N trees T1, …, TN, classify a test point x by maximizing the average terminal distribution: Ŷ(x) = argmax_c (1/N) Σn P_Tn(c | x), where the terminal distribution at a leaf t is estimated as P(c | t) = |Lt(c)| / Σc' |Lt(c')|, and Lt(c) is the set of training points of class c at node t. (A sketch of this step follows below.) Multiple trees
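A sketch of the aggregation step, assuming each tree uses the earlier grow_tree-style layout but with leaves storing a class-distribution vector under "dist" instead of a single label (an illustrative layout, not the paper's data structure):

    import numpy as np

    def terminal_distribution(tree, x):
        """Drop x down one tree and return the terminal (leaf) class distribution."""
        while not tree["leaf"]:
            tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
        return tree["dist"]          # estimated as |L_t(c)| / sum_c' |L_t(c')|

    def aggregate_classify(trees, x):
        """Average the terminal distributions over all trees and take the argmax."""
        avg = np.mean([terminal_distribution(t, x) for t in trees], axis=0)
        return int(np.argmax(avg))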
(Figure: a test point of class ω is dropped down each of the trees T1, T2, …, Tn and the trees' outputs are aggregated.) Multiple trees
Data Classification examples: • Handwritten digits • LaTeX symbols • Binary images of 2D shapes • All images are registered to a fixed 32×32 grid • Considerable within-class variation Y. Amit & D. Geman
Handwritten digits – NIST (National Institute of Standards and Technology) • 223,000 binary images of isolated digits written by more than 2000 writers • 100,000 for training and 50,000 for testing Y. Amit & D. Geman
LaTeX Symbols • 32 samples per class for all 293 classes • Synthetic deformations Y. Amit & D. Geman
Shape features • Each query corresponds to a spatial arrangement of local codes ("tags") • Tags: a coarse description (5-bit codes) of the local topography of the intensity surface in the neighborhood of a pixel • Discriminating power comes from the relative angles and distances between tags Y. Amit & D. Geman
Tags • 4×4 sub-images are randomly extracted & recursively partitioned based on individual pixel values • A tag type is assigned to each node of the resulting tree • If 5 questions are asked: 2+4+8+16+32 = 62 tags Y. Amit & D. Geman
Tags (cont.) • Tag 16 is a depth-4 tag. The corresponding 4 questions in the accompanying sub-image are indicated by a mask, where • 0 – background • 1 – object • n – "not asked" • These neighborhoods are loosely described as "background to the lower left, object to the upper right". Y. Amit & D. Geman
Spatial arrangement of local features • The arrangement A is a labeled hyper-graph: vertex labels correspond to the tag types and edge labels to relations • Directional and distance constraints • Query: does such an arrangement exist anywhere in the image? Y. Amit & D. Geman
Example of node splitting A minimal extension of an arrangement A means the addition of one relation between existing tags, or the addition of exactly one tag and one relation binding the new tag to an existing one. Y. Amit & D. Geman
The trees are grown by the scheme described … Y. Amit & D. Geman
Importance of multiple randomized trees Graphs found in the terminal nodes of five different trees. Y. Amit & D. Geman
Experiment – NIST • Stopping: nodes are split as long as at least m points remain in the second-largest class • Q: the number of random queries per node • A random sample of 200 training points per node • 25–100 trees are produced • Depth 10 on average Y. Amit & D. Geman
Results • The best error rate with a single tree is 5% • The average classification rate per tree is about 91% • By aggregating trees the classification rate climbs above 99% • State-of-the-art error rates (Table: classification rates as a function of the number of trees #T and of the rejection rate.) Y. Amit & D. Geman
Conclusions • Stability & accuracy: combining multiple trees leads to a drastic decrease in error rates, relative to the best individual tree • Efficiency: fast training & testing • The trees' output lends itself to visual interpretation • Few parameters & insensitive to the parameter setting Y. Amit & D. Geman
Conclusions (cont.) • The approach is not model-based and does not involve advanced geometry or extracting boundary information • Missing aspect: features from more than one resolution • The most successful handwritten character recognition was reported by LeCun et al. 1998 (99.3%), using a multi-layer feed-forward network based on raw pixel intensities Y. Amit & D. Geman
Voting Tree Learning Algorithms A family of protocols for producing and aggregating multiple classifiers. • Improve predictive accuracy. • For unstable procedures. • Manipulate the training data in order to generate different classifiers. • Methods: Bagging, Boosting
Bagging The name derives from "bootstrap aggregation". A "bootstrap" data set: created by randomly selecting points from the training set, with replacement. Bootstrap estimation: the selection process is independently repeated, and the data sets are treated as independent. (Bagging Predictors, Leo Breiman, 1996)
Bagging – Algorithm • Select a bootstrap sample LB from L • Grow a decision tree from LB • Repeat, and estimate the class of each xn by plurality vote over the trees • The error estimate is #(estimated class ≠ true class), the number of points whose estimated class differs from the true class. A sketch of this loop follows below.
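A sketch of the bagging loop, with scikit-learn's DecisionTreeClassifier as a stand-in base learner (the slides do not prescribe a particular implementation) and assuming non-negative integer class labels:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier   # stand-in base learner

    def bagging(X, y, n_trees=50, seed=0):
        """Grow each tree on a bootstrap sample of L (drawn with replacement)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)           # bootstrap sample L_B
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def predict_vote(trees, X):
        """Plurality vote over the trees' predictions."""
        votes = np.stack([t.predict(X) for t in trees]).astype(int)
        return np.array([np.bincount(col).argmax() for col in votes.T])

The error estimate in the last step is then the number (or fraction) of test points for which predict_vote disagrees with the true class.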
Bagging – Databases UCI Machine Learning Repository
Bagging – Results Error rates are the averages over 100 iterations.
C4.5 vs. bagged C4.5 on the UCI repository of machine learning databases. (Figure from Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998.) Bagging