Learn how to build decision trees for classifying patterns using a supervised learning algorithm. Understand the importance of attribute selection and the expressive power of decision trees.
Induction of Decision Trees. Laurent Orseau (laurent.orseau@agroparistech.fr), AgroParisTech. Based on slides by Antoine Cornuéjols
Induction of decision trees • Task • Learning a discrimination function for patterns of several classes • Protocol • Supervised learning by greedy iterative approximation • Criterion of success • Classification error rate • Inputs • Attribute-value data (space with N dimensions) • Target functions • Decision trees
1- Decision trees: example • Decision trees are classifiers for attribute/value instances • A node of the tree tests an attribute • There is a branch for each value of the tested attribute • The leaves specify the categories (two or more)
[Figure: diagnostic decision tree. Root test: pain? (throat, chest, abdomen, none); inner tests: fever?, cough?; leaves: infarction, appendicitis, a cold, throat ache, chill, nothing.]
1- Decision trees: the problem • Each instance is described by an attribute/value vector • Input: a set of instances with their class (given by an expert) • The learning algorithm must build a decision tree • E.g. a decision tree for diagnosis (a common application in machine learning)
Instances:
        Cough  Fever  Weight  Pain
Marie   no     yes    normal  throat
Fred    no     yes    normal  abdomen
Julie   yes    yes    thin    none
Elvis   yes    no     obese   chest
Instances with their class:
        Cough  Fever  Weight  Pain     Diagnosis
Marie   no     yes    normal  throat   a cold
Fred    no     yes    normal  abdomen  appendicitis
.....
1- Decision trees: expressive power • The choice of the attributes is very important! • If a crucial attribute is not represented, it is not possible to induce a good decision tree • If two instances have the same representation but belong to two different classes, the language of the instances (attributes) is said to be inadequate
Example of an inadequate language:
        Cough  Fever  Weight  Pain     Diagnosis
Marie   no     yes    normal  abdomen  a cold
Polo    no     yes    normal  abdomen  appendicitis
.....
1- Decision trees: expressive power • Any boolean function can be represented with a decision tree • Note: with 6 boolean attributes, there are 2^(2^6) = 2^64, i.e. about 1.8*10^19 boolean functions… • Depending on the function to represent, the trees are more or less large • E.g. “parity” and “majority” functions: exponential growth • Sometimes a single node is enough • Limited to propositional logic (only attribute-value, no relation) • A tree can be represented by a disjunction of rules, e.g. for tree DT4:
(IF Feathers = no THEN Class = not-bird)
OR (IF Feathers = yes AND Color = brown THEN Class = not-bird)
OR (IF Feathers = yes AND Color = B&W THEN Class = bird)
OR (IF Feathers = yes AND Color = yellow THEN Class = bird)
2- Decision trees: choice of a tree
         Color   Wings  Feathers  Sonar  Concept
Falcon   yellow  yes    yes       no     bird
Pigeon   B&W     yes    yes       no     bird
Bat      brown   yes    no        yes    not bird
Four decision trees consistent with the data:
[Figure: trees DT1 to DT4, built from the tests Feathers?, Color? and Sonar?, with leaves bird / not bird.]
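As a quick check of what "consistent with the data" means, the sketch below encodes the rule form of DT4 (Feathers first, then Color) as a plain function and tests it against the three training examples. The function names and the dictionary encoding are illustrative assumptions, and `feathers_only` is a hypothetical one-test tree added only to show that several trees can fit the same data.

```python
# Rule form of DT4: test Feathers first, then Color.
def dt4(example):
    if example["Feathers"] == "no":
        return "not bird"
    if example["Color"] == "brown":
        return "not bird"
    return "bird"  # Feathers = yes and Color is B&W or yellow


# Hypothetical single-test tree (an assumption, used for comparison only).
def feathers_only(example):
    return "bird" if example["Feathers"] == "yes" else "not bird"


# The three training examples from the slide.
training = [
    ({"Color": "yellow", "Wings": "yes", "Feathers": "yes", "Sonar": "no"}, "bird"),      # Falcon
    ({"Color": "B&W", "Wings": "yes", "Feathers": "yes", "Sonar": "no"}, "bird"),         # Pigeon
    ({"Color": "brown", "Wings": "yes", "Feathers": "no", "Sonar": "yes"}, "not bird"),   # Bat
]


def is_consistent(tree, examples):
    """A tree is consistent if it reproduces the class of every training example."""
    return all(tree(x) == label for x, label in examples)


dt4_ok = is_consistent(dt4, training)
small_ok = is_consistent(feathers_only, training)
```

Both trees classify the training set perfectly, which is why a criterion beyond training accuracy is needed to choose between them.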
2- Decision trees: the choice of a tree • When the language is adequate, it is always possible to build a decision tree that correctly classifies all the training examples • There are often many correct decision trees • Enumerating all trees is not feasible (NP-completeness) for binary trees • Requires a constructive iterative method • How to give a value to a tree?
2- What model for generalization? • Among all possible coherent hypotheses, which one to choose for a good generalization? • Is the intuitive answer… • ... confirmed by theory? • Some learnability theory [Vapnik,82,89,95] • Consistency of empirical risk minimization (ERM) • Principle of structural risk minimization (SRM) • In short, trees must be short • How? • Methods of induction of decision trees
3- Induction of decision trees: Example [Quinlan,86]
Attributes  Possible values
Pif         sunny, cloudy, rain
Temp        hot, warm, cool
Humid       normal, high
Wind        true, false

N°  Pif     Temp  Humid   Wind   Golf (class)
1   sunny   hot   high    false  DontPlay
2   sunny   hot   high    true   DontPlay
3   cloudy  hot   high    false  Play
4   rain    warm  high    false  Play
5   rain    cool  normal  false  Play
6   rain    cool  normal  true   DontPlay
7   cloudy  cool  normal  true   Play
8   sunny   warm  high    false  DontPlay
9   sunny   cool  normal  false  Play
10  rain    warm  normal  false  Play
11  sunny   warm  normal  true   Play
12  cloudy  warm  high    true   Play
13  cloudy  hot   normal  false  Play
14  rain    warm  high    true   DontPlay
3- Induction of decision trees • Strategy: Top-down induction: TDIDT • Best-first search, no backtracking, with an evaluation function • Recursive choice of an attribute to test until a stopping criterion is met • Operation: • Choose the first attribute as the root of the tree: the most informative one • Then, iterate with the same operation on all sub-nodes • Recursive algorithm
3- Induction of decision trees: example • If we choose attribute Temp? ...
All examples: + J3,J4,J5,J7,J9,J10,J11,J12,J13   - J1,J2,J6,J8,J14
Temp?
  hot:   + J3,J13           - J1,J2
  warm:  + J4,J10,J11,J12   - J8,J14
  cool:  + J5,J7,J9         - J6
3- Induction of decision trees: TDIDT algorithm
PROCEDURE AAD(T, E)
  IF all examples of E are in the same class Ci
  THEN label the current node with Ci. END
  ELSE select an attribute A with values v1...vn
       partition E with v1...vn into E1, ..., En
       FOR j = 1 TO n: AAD(Tj, Ej)
[Figure: node T holding the examples E = E1 ∪ ... ∪ En and testing A = {v1...vn}; branch vj leads to subtree Tj built from Ej.]
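The recursive procedure can be sketched in Python as follows. This is a minimal sketch, not the slide author's implementation: the name `tdidt`, the tuple encoding of trees, and the `choose_attribute` callback are all assumptions, and the majority-label fallback for the case where no attribute is left is an addition that the pseudocode does not cover.

```python
from collections import Counter


def tdidt(examples, attributes, choose_attribute):
    """Top-down induction of a decision tree.

    examples: list of (attribute_dict, label) pairs.
    Returns a label (leaf) or a pair (attribute, {value: subtree}).
    """
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]  # all examples in the same class: leaf
    if not attributes:
        # Assumption (not in the pseudocode): fall back to the majority class.
        return Counter(labels).most_common(1)[0][0]
    a = choose_attribute(examples, attributes)
    remaining = [b for b in attributes if b != a]
    branches = {}
    for v in sorted({x[a] for x, _ in examples}):
        subset = [(x, c) for x, c in examples if x[a] == v]  # partition E into Ej
        branches[v] = tdidt(subset, remaining, choose_attribute)
    return (a, branches)


# The bird examples from the earlier slide.
birds = [
    ({"Color": "yellow", "Wings": "yes", "Feathers": "yes", "Sonar": "no"}, "bird"),
    ({"Color": "B&W", "Wings": "yes", "Feathers": "yes", "Sonar": "no"}, "bird"),
    ({"Color": "brown", "Wings": "yes", "Feathers": "no", "Sonar": "yes"}, "not bird"),
]

# Placeholder selection rule: take the first attribute in the list.
tree = tdidt(birds, ["Feathers", "Color", "Wings", "Sonar"],
             lambda examples, attrs: attrs[0])
```

The selection rule is passed in as a function so that any impurity-based criterion can be plugged in without touching the recursion.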
3- Induction of decision trees: selection of attribute
All examples: + J3,J4,J5,J7,J9,J10,J11,J12,J13   - J1,J2,J6,J8,J14
Wind?
  false:  + J3,J4,J5,J9,J10,J13   - J1,J8
  true:   + J7,J11,J12            - J2,J6,J14
Pif?
  sunny:   + J9,J11           - J1,J2,J8
  cloudy:  + J3,J7,J12,J13    (no negative example)
  rain:    + J4,J5,J10        - J6,J14
3- Selecting a good test attribute • How to build a “simple” tree? • Simple tree: minimizes the expected number of tests needed to classify a new object • How to translate this global criterion into a local choice procedure? • Criteria to choose a node • We don't know how to associate a local criterion with the global objective criterion • Use of heuristics • Notion of measure of “impurity” • Gini index • Entropic criterion (ID3, C4.5, C5.0) • ...
3- Measure of impurity: the Gini index • Ideally: • Zero if all populations are homogeneous • Maximal if the populations are maximally mixed • Gini index [Breiman et al.,84]
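The formula itself is not in the extracted slide text. The sketch below assumes the usual definition of the Gini index, 1 minus the sum of the squared class proportions, and checks the two properties listed above. The function name `gini` and the use of class counts as input are illustrative choices.

```python
def gini(counts):
    """Gini impurity from a list of class counts (assumed definition: 1 - sum(p_i^2))."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


pure = gini([4, 0])    # homogeneous population
mixed = gini([5, 5])   # two classes, maximally mixed
golf = gini([9, 5])    # 9 Play / 5 DontPlay, as in the golf example
```

With two classes the measure ranges from 0 (pure node) to 0.5 (equal split).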
3- The entropic criterion (1/3) • Boltzmann's entropy ... • ... and Shannon's entropy • Shannon, 1949, proposed a measure of entropy for discrete probability distributions • Expresses the quantity of information, i.e. the number of bits needed to specify the distribution • Information entropy: $I = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of class $C_i$
3- The entropic criterion (2/3) • Information entropy of S (in C classes): $I(S) = -\sum_{i=1}^{C} p(c_i) \log_2 p(c_i)$ • $p(c_i)$: probability of the class $c_i$ • Zero when there is only one class • The more equiprobable the classes are, the higher I(S) • $= \log_2(k)$ when the k classes are equiprobable • Unit: the bit of information
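The properties on this slide are easy to verify numerically. The following sketch computes the information entropy from class counts; the function name `entropy` and the count-based input are assumptions made for the example.

```python
from math import log2


def entropy(counts):
    """Information entropy, in bits, of a population given its class counts."""
    total = sum(counts)
    # Empty classes contribute nothing (p * log2(p) tends to 0 as p tends to 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


one_class = entropy([14, 0])          # only one class
four_equal = entropy([1, 1, 1, 1])    # k = 4 equiprobable classes
golf = entropy([9, 5])                # 9 Play / 5 DontPlay
```

Here `one_class` is zero and `four_equal` equals log2(4) = 2 bits, matching the bullet points.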
3- The entropic criterion (3/3): case of two classes • For C = 2: $I(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$ • By hypothesis, $p_+ = p/(p+n)$ and $p_- = n/(p+n)$ • Thus $I(S) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$ • and, with $P = p/(p+n)$, $I(S) = -P \log_2 P - (1-P) \log_2 (1-P)$
[Figure: I(S) as a function of P; the maximum is reached at equiprobability, P = p/(p+n) = n/(n+p) = 0.5.]
3- Entropic gain associated with an attribute • $Gain(S, A) = I(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} I(S_v)$ • $|S_v|$: size of the sub-population in the branch v of A • Measures how informative the knowledge of the value of attribute A is about the class of an example
3- Example (1/4) • Entropy of initial set of examples I(p,n) = - 9/14 log2(9/14) - 5/14 log2(5/14) • Entropy of subtrees associated with test on Pif? • p1 = 4 n1 = 0: I(p1,n1) = 0 • p2 = 2 n2 = 3: I(p2,n2) = 0.971 • p3 = 3 n3 = 2: I(p3,n3) = 0.971 • Entropy of subtrees associated with test on Temp? • p1 = 2 n1 = 2: I(p1,n1) = 1 • p2 = 4 n2 = 2: I(p2,n2) = 0.918 • p3 = 3 n3 = 1: I(p3,n3) = 0.811
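3- Example (2/4) • N objects, n + p = N, with entropy I(S) • Attribute A with values val1, val2, val3 splits them into N1 objects (n1 + p1 = N1), N2 objects (n2 + p2 = N2) and N3 objects (n3 + p3 = N3), with N1 + N2 + N3 = N • $E(N, A) = \frac{N_1}{N} I(p_1, n_1) + \frac{N_2}{N} I(p_2, n_2) + \frac{N_3}{N} I(p_3, n_3)$ • Information gain of A: $GAIN(A) = I(S) - E(N, A)$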
3- Example (3/4) • For the initial examples I(S) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits • Entropy of the tree associated with test on Pif? • E(Pif) = 4/14 I(p1,n1) + 5/14 I(p2,n2) + 5/14 I(p3,n3) = 0.694 bits • Gain(Pif) = 0.940 - 0.694 = 0.246 bits • Gain(Temp) = 0.029 bits • Gain(Humid) = 0.151 bits • Gain(Wind) = 0.048 bits • Choice of attribute Pif for the first test
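The four gains can be recomputed from the 14 golf examples. The sketch below is one possible way to do it; the names `ATTRS`, `DATA`, `entropy` and `gain` are assumptions for the example. The results agree with the slide values up to rounding in the third decimal (the slide subtracts already rounded entropies).

```python
from collections import Counter
from math import log2

ATTRS = ["Pif", "Temp", "Humid", "Wind"]
ROWS = """sunny hot high false DontPlay
sunny hot high true DontPlay
cloudy hot high false Play
rain warm high false Play
rain cool normal false Play
rain cool normal true DontPlay
cloudy cool normal true Play
sunny warm high false DontPlay
sunny cool normal false Play
rain warm normal false Play
sunny warm normal true Play
cloudy warm high true Play
cloudy hot normal false Play
rain warm high true DontPlay""".splitlines()
DATA = [dict(zip(ATTRS + ["Golf"], row.split())) for row in ROWS]


def entropy(rows):
    """Entropy I(S), in bits, of the class attribute Golf."""
    n = len(rows)
    counts = Counter(r["Golf"] for r in rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def gain(rows, attr):
    """Information gain: I(S) minus the weighted entropy E of the branches."""
    n = len(rows)
    weighted = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        weighted += len(subset) / n * entropy(subset)
    return entropy(rows) - weighted


gains = {a: gain(DATA, a) for a in ATTRS}
best = max(gains, key=gains.get)  # attribute chosen for the first test
```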
3- Example (4/4) • Final tree:
Pif?
  sunny:   Humid?  (normal: play, high: don't play)
  cloudy:  play
  rain:    Wind?   (no: play, yes: don't play)
3- Some TDIDT systems • CLS (Hunt, 1966) [data analysis] • ID3 (Quinlan 1979) • ACLS (Paterson & Niblett 1983) • ASSISTANT (Bratko 1984) • C4.5 (Quinlan 1993) • CART (Breiman, Friedman, Olshen, Stone, 1984) Input: vector of attribute-values associated with each example Output: decision tree
4- Potential problems • Continuous value attributes • Attributes with different branching factors • Missing values • Overfitting • Greedy search • Choice of attributes • Variance of results: • Different trees from similar data
4.1. Discretization of continuous attribute values
Temp.      6°C  8°C  14°C  18°C  20°C  28°C  32°C
Play golf  No   No   No    Yes   Yes   Yes   No
• Here, two possible thresholds: 16°C and 30°C • The attribute Temp > 16°C is the most informative, and is kept
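A possible way to find these thresholds automatically is to sort the values, place a candidate threshold halfway between each pair of neighbours whose classes differ, and keep the candidate with the highest information gain. The sketch below follows that idea. The pairing of temperatures with labels assumes the values are read in increasing order (the reading under which the slide's thresholds of 16°C and 30°C come out), and the function names are illustrative.

```python
from math import log2


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return sum(-(labels.count(c) / n) * log2(labels.count(c) / n)
               for c in set(labels))


def candidate_thresholds(values, labels):
    """Return (gain, threshold) pairs for midpoints where the class changes."""
    pairs = sorted(zip(values, labels))
    base = entropy([c for _, c in pairs])
    scored = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            t = (v1 + v2) / 2
            left = [c for v, c in pairs if v <= t]
            right = [c for v, c in pairs if v > t]
            after = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
            scored.append((base - after, t))
    return scored


temps = [6, 8, 14, 18, 20, 28, 32]
play = ["no", "no", "no", "yes", "yes", "yes", "no"]

scored = candidate_thresholds(temps, play)
thresholds = sorted(t for _, t in scored)
best_gain, best_threshold = max(scored)
```

Only the class boundaries are tested, which keeps the number of candidate binary attributes small.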
4.2. Different branching factors • Problem: The entropic gain criterion favors attributes with a higher branching factor • Two solutions: • Make all attributes binary • But loss of legibility of trees • Introduce a normalization factor: $Gain\_norm(S, A) = \frac{Gain(S, A)}{-\sum_{i=1}^{\text{nb values of } A} \frac{|S_i|}{|S|} \log \frac{|S_i|}{|S|}}$
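To see the effect of the normalization, the sketch below computes the normalized gain from the class counts in each branch, assuming base-2 logarithms and at least two non-empty branches (otherwise the denominator is zero). The function names and the hypothetical "one branch per example" attribute are assumptions for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy, in bits, from a list of counts."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)


def gain_and_norm(branches):
    """branches: per-branch class counts, e.g. [[2, 3], [4, 0], [3, 2]].

    Returns (gain, normalized gain). Assumes at least two non-empty branches.
    """
    sizes = [sum(b) for b in branches]
    n = sum(sizes)
    class_totals = [sum(col) for col in zip(*branches)]
    gain = entropy(class_totals) - sum(
        s / n * entropy(b) for s, b in zip(sizes, branches))
    split_info = entropy(sizes)  # denominator of the normalized gain
    return gain, gain / split_info


# Pif on the golf data: sunny (2+, 3-), cloudy (4+, 0-), rain (3+, 2-).
pif_gain, pif_norm = gain_and_norm([[2, 3], [4, 0], [3, 2]])

# Hypothetical identifier attribute: 14 branches holding one example each.
id_gain, id_norm = gain_and_norm([[1, 0]] * 9 + [[0, 1]] * 5)
```

On this small sample the normalization shrinks the advantage of the identifier attribute considerably, although it does not reverse the ranking by itself.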
4.3. Processing missing values • Consider an example (x, c(x)) whose value for attribute A is unknown • How to compute gain(S,A)? • Take the most frequent value in the entire set S • Take the most frequent value at this node • Split the example into fictitious examples with the different possible values of A, weighted by their respective frequencies • E.g. if 6 examples at this node take the value A=a1 and 4 the value A=a2, then A(x) = a1 with prob=0.6 and A(x) = a2 with prob=0.4 • For prediction, classify the example with the label of the most probable leaf
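The third option (fictitious weighted examples) can be sketched as below. This is an illustrative implementation under stated assumptions: examples carry an explicit weight, a missing value is encoded as `None`, and the function name `split_with_missing` is hypothetical.

```python
from collections import Counter


def split_with_missing(examples, attr):
    """Partition weighted examples on attr.

    examples: list of (attribute_dict, label, weight); a missing value is None.
    An example with a missing value is sent down every branch with its weight
    multiplied by the observed frequency of that branch's value.
    """
    known = [(x, c, w) for x, c, w in examples if x.get(attr) is not None]
    total = sum(w for _, _, w in known)
    freq = Counter()
    for x, _, w in known:
        freq[x[attr]] += w
    parts = {value: [] for value in freq}
    for x, c, w in examples:
        if x.get(attr) is not None:
            parts[x[attr]].append((x, c, w))
        else:
            for value, f in freq.items():
                parts[value].append((x, c, w * f / total))
    return parts


# 6 examples with A = a1, 4 with A = a2, and one example with A unknown.
examples = ([({"A": "a1"}, "+", 1.0)] * 6
            + [({"A": "a2"}, "-", 1.0)] * 4
            + [({"A": None}, "+", 1.0)])
parts = split_with_missing(examples, "A")
weight_a1 = parts["a1"][-1][2]  # weight of the fictitious example in branch a1
weight_a2 = parts["a2"][-1][2]  # weight of the fictitious example in branch a2
```

Entropy and gain would then be computed from summed weights instead of raw counts.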
5- The generalization problem • Did we learn a good decision tree? • Training set. Test set. • Learning curve • Methods to evaluate generalization • On a test set • Cross validation • “Leave-one-out”
5.1. Overfitting: Effect of noise on induction • Types of noise • Description errors • Classification errors • “clashes” • Missing values • Effects • Over-developed tree: too deep, too many leaves
5.1. Overfitting: The generalization problem • Low empirical risk. High real risk. • SRM (Structural Risk Minimization) • Justification [Vapnik,71,79,82,95] • Notion of “capacity” of the hypothesis space • Vapnik-Chervonenkis dimension • We must control the hypothesis space
5.1. Control of space H: motivations & strategies • Motivations: • Improve generalization performance (SRM) • Build a legible model of the data (for experts) • Strategies: 1. Direct control of the size of the induced tree: pruning 2. Modify the state space (trees) in which to search 3. Modify the search algorithm 4. Restrict the database 5. Translate built trees into another representation
5.2. Overfitting: Controlling the size with pre-pruning • Idea: modify the termination criterion • Depth threshold (e.g. [Holte,93]: threshold = 1 or 2) • Chi2 test • Laplacian error • Low information gain • Low number of examples • Population of examples not statistically significant • Comparison between “static error” and “dynamic error” • Problem: often too short-sighted
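5.2. Example: Chi2 test • Let A be a binary attribute with branches g (left) and d (right) • Class counts at the node: (n) = (n1, n2) • Observed counts in the branches: (ng1, ng2) on the left and (nd1, nd2) on the right, with n1 = ng1 + nd1 and n2 = ng2 + nd2 • Null hypothesis: a proportion P of each class goes left and (1-P) goes right • Expected counts under the null hypothesis: neg1 = P n1 ; ned1 = (1-P) n1 ; neg2 = P n2 ; ned2 = (1-P) n2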
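The slide gives the observed and expected counts but not the statistic itself. The sketch below assumes the standard Pearson statistic, the sum over the four cells of (observed - expected)^2 / expected, with P estimated as the fraction of examples going left. The function name is hypothetical, and the code assumes that both branches and both classes are non-empty.

```python
def chi2_split(left, right):
    """Pearson chi-square statistic for a binary split (assumed formula).

    left, right: observed class counts (ng1, ng2) and (nd1, nd2).
    """
    totals = [g + d for g, d in zip(left, right)]   # n1, n2
    p_left = sum(left) / sum(totals)                # P
    stat = 0.0
    for obs_g, obs_d, n in zip(left, right, totals):
        exp_g = p_left * n          # neg_k = P * n_k
        exp_d = (1 - p_left) * n    # ned_k = (1 - P) * n_k
        stat += (obs_g - exp_g) ** 2 / exp_g + (obs_d - exp_d) ** 2 / exp_d
    return stat


# Wind on the golf data: false -> (6 Play, 2 DontPlay), true -> (3 Play, 3 DontPlay).
wind_stat = chi2_split((6, 2), (3, 3))

# A split that separates the classes perfectly, for comparison.
perfect_stat = chi2_split((3, 0), (0, 4))

# 3.84 is the 5% critical value of the chi-square law with 1 degree of freedom.
wind_significant = wind_stat > 3.84
```

Under this test the split on Wind at the root would not be considered significant, so a pre-pruning rule based on it would stop there.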
5.3. Overfitting: Controlling the size with post-pruning • Idea: Prune after the construction of the whole tree, by replacing with a single node the subtrees that optimize a pruning criterion • Many methods. Still lots of research. • Minimal Cost-Complexity Pruning (MCCP) (Breiman et al.,84) • Reduced Error Pruning (REP) (Quinlan,87,93) • Minimum Error Pruning (MEP) (Niblett & Bratko,86) • Critical Value Pruning (CVP) (Mingers,87) • Pessimistic Error Pruning (PEP) (Quinlan,87) • Error-Based Pruning (EBP) (Quinlan,93) (used in C4.5) • ...
5.3- Cost-Complexity pruning • [Breiman et al.,84] • Cost-complexity for a tree:
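The formula is missing from the extracted text. The definition usually given for CART's cost-complexity measure is reproduced here as an assumption about what the slide showed: $R(T)$ is the error of tree $T$ on the training set, $|\tilde{T}|$ its number of leaves, and $\alpha \geq 0$ a parameter that sets the price of each leaf.

```latex
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|
```

With $\alpha = 0$ the full tree is preferred; as $\alpha$ grows, smaller subtrees minimize $R_\alpha$, which yields a nested sequence of pruned trees to choose from.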
6. Forward search • Instead of a greedy search, search n nodes ahead • If I choose this node and then this node and then … • But exponential growth of the number of computations
6. Modification of the search strategy • Idea: no more depth-first search • Methods that use a different measure: • Minimum Description Length principle • Measure of the complexity of the tree • Measure of the complexity of the examples not coded by the tree • Keep the tree that minimizes the sum of these measures • Measure of low learnability theory • Kolmogorov-Smirnov measure • Class separation measure • Mix of selection tests
7. Modification of the search space • Modification of the node tests • To solve the problems of an inadequate representation • Methods of constructive induction (e.g. multivariate tests) • E.g. oblique decision trees • Methods: • Numerical operators • Perceptron trees • Trees and Genetic Programming • Logical operators
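7. Oblique trees
[Figure: two trees over the attributes x1 and x2, with leaves c1 and c2. One uses the axis-parallel tests x1 < 0.70, x2 < 0.88, x2 < 0.30, x1 < 0.17 and x2 < 0.62; the other uses the single oblique test 1.1x1 + x2 < 0.2.]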
7. Induction of oblique trees • Another cause of leafy trees: an inadequate representation • Solutions: • Ask an expert (e.g. chess endgame [Quinlan,83]) • Do a PCA beforehand • Use another attribute selection method • Apply constructive induction • Induction of oblique trees
8. Translation into other representations • Idea: Translate a complex tree into a representation where the result is simpler • Translation into decision graphs • Translation into rule sets
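Translating a tree into a rule set amounts to emitting one rule per path from the root to a leaf. The sketch below does this for the final golf tree of the earlier example. The nested-tuple encoding of the tree, the function name `tree_to_rules`, and the use of the data's `true`/`false` values for Wind are assumptions made for illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, class) rule per root-to-leaf path.

    A tree is either a class label (leaf) or a pair (attribute, {value: subtree}).
    """
    if not isinstance(tree, tuple):
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules


# Final tree of the golf example (Wind values written as in the data table).
golf_tree = ("Pif", {
    "sunny": ("Humid", {"normal": "Play", "high": "DontPlay"}),
    "cloudy": "Play",
    "rain": ("Wind", {"false": "Play", "true": "DontPlay"}),
})

rules = tree_to_rules(golf_tree)
text = [
    "IF " + " AND ".join(f"{a} = {v}" for a, v in conds) + f" THEN Golf = {cls}"
    for conds, cls in rules
]
```

Each rule can then be simplified or pruned independently of the others, which is what makes the rule representation attractive after induction.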
9. Conclusions • Appropriate for: • Classification of attribute-value examples • Attributes with discrete values • Resistance to noise • Strategy: • Search by incremental construction of hypothesis • Local criterion (gradient) based on statistical criterion • Generates • Interpretable decision trees (e.g. production rules) • Requires a control of the size of the tree