Decision Trees
Radosław Wesołowski, Tomasz Pękalski, Michal Borkowicz, Maciej Kopaczyński
12-03-2008
What is it anyway? Decision tree T – a tree with a root (in the graph-theory sense), in which we assign the following meanings to its elements:
• inner nodes represent attributes,
• edges represent values of the attribute,
• leaves represent classification decisions.
Using a decision tree we can visualize a program built only of 'if-then' instructions.
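As a hedged illustration, a tree over weather data can be written as exactly such an if-then program; the attributes (outlook, humidity, wind) and their values below are assumptions made for this sketch, not taken from the slides:

```python
# A decision tree expressed as nested if-then instructions.
# Attribute names and values are illustrative assumptions.
def classify(outlook, humidity, wind):
    if outlook == "sunny":
        return "don't play" if humidity == "high" else "play"
    if outlook == "overcast":
        return "play"
    # remaining case: outlook == "rainy"
    return "don't play" if wind == "strong" else "play"

print(classify("sunny", "high", "weak"))  # -> don't play
```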
Testing functions. Let us consider an attribute A (e.g. temperature). Let V_A denote the set of all possible values of A (0 K up to infinity). Let R_t denote the set of all possible test results (hot, mild, cold). By a testing function we mean a map t: V_A → R_t. We distinguish two main types of testing functions, depending on the set V_A: discrete and continuous.
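A minimal sketch of both kinds of test, assuming temperature cut-points of 15 °C and 25 °C (these thresholds are illustrative, not given on the slide):

```python
# t: V_A -> R_t for a discrete and a continuous attribute.
def discrete_test(colour):
    # V_A is a finite set of symbols; the identity map is a valid test.
    return colour

def continuous_test(temperature_celsius):
    # V_A is continuous; the test discretizes it into R_t = {cold, mild, hot}.
    if temperature_celsius < 15:   # assumed threshold
        return "cold"
    if temperature_celsius < 25:   # assumed threshold
        return "mild"
    return "hot"

print(continuous_test(30))  # -> hot
```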
Quality of a decision tree (Occam's razor):
• we prefer small, simple trees,
• we want to gain maximum accuracy of classification (training set, test set).
For example:
Q(T) = α·size(T) + β·accuracy(T)
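One possible reading of this criterion, with hypothetical weights alpha and beta (the slide does not fix their values):

```python
# Q(T) = alpha * size(T) + beta * accuracy(T)
# A negative alpha penalizes large trees; beta rewards accuracy.
def quality(size, accuracy, alpha=-0.01, beta=1.0):  # weights are assumptions
    return alpha * size + beta * accuracy

print(quality(size=20, accuracy=0.93))  # -> 0.73
```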
Optimal tree – we are given:
• a training set S,
• a set of testing functions TEST,
• a quality criterion Q.
Target: a tree T optimising Q(T).
Fact: usually this is an NP-hard problem.
Conclusion: we have to use heuristics.
Building a decision tree:
• top-down method:
a. at the beginning the root contains all training examples,
b. we divide them recursively, choosing one attribute at a time (see the sketch below);
• bottom-up method: we remove subtrees or edges to gain precision when judging new cases.
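A minimal top-down sketch: the root starts with all training examples and they are divided recursively, one attribute at a time. The naive attribute choice and stopping rule here are simplifying assumptions, not the slides' exact method:

```python
from collections import Counter

def build(examples, attributes):
    """examples: list of (attribute_dict, label); attributes: list of attribute names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:       # stop criterion
        return Counter(labels).most_common(1)[0][0]   # leaf: majority decision
    attr = attributes[0]                              # naive choice; ID3 would use information gain
    tree = {attr: {}}
    for value in {ex[attr] for ex, _ in examples}:    # one edge per attribute value
        subset = [(ex, lab) for ex, lab in examples if ex[attr] == value]
        tree[attr][value] = build(subset, attributes[1:])
    return tree

data = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")]
print(build(data, ["outlook"]))  # -> {'outlook': {'sunny': 'no', 'rain': 'yes'}}
```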
Entropy – the average number of bits needed to represent a decision d for a randomly chosen object from a given set S. Why? Because an optimal binary code assigns –log2(p) bits to a decision whose probability is p. We have the formula:
entropy(p1, ..., pn) = – p1·log2(p1) – ... – pn·log2(pn)
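The formula translated directly into code (probabilities are assumed to be given and to sum to 1):

```python
from math import log2

def entropy(probabilities):
    # zero-probability terms contribute nothing, so they are skipped
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # -> 1.0 bit
print(entropy([0.9, 0.1]))   # -> ~0.469 bits
```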
Information gain: gain(A) = info before dividing – info after dividing, i.e. the entropy of the set minus the size-weighted average entropy of the subsets obtained by splitting on attribute A.
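A short sketch of this difference, assuming a split is described by the label lists of the resulting subsets (the helper names here are made up for illustration):

```python
from math import log2

def _entropy_of_labels(labels):
    # entropy of a list of decisions, as defined on the previous slide
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def information_gain(parent_labels, subsets):
    before = _entropy_of_labels(parent_labels)                  # info before dividing
    after = sum(len(s) / len(parent_labels) * _entropy_of_labels(s)
                for s in subsets)                               # weighted info after dividing
    return before - after

# A split into pure halves recovers the full 1 bit of information.
print(information_gain(["yes", "yes", "no", "no"],
                       [["yes", "yes"], ["no", "no"]]))  # -> 1.0
```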
Overtraining: We say that a model H overfits if there is a model H’ such that:
• training_error(H) < training_error(H’),
• testing_error(H) > testing_error(H’) (see the small check below).
Avoiding overtraining:
• adequate stopping criteria,
• post-pruning,
• pre-pruning.
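The definition can be checked mechanically; a tiny sketch, with error values invented purely for illustration:

```python
def overfits(train_err_h, test_err_h, train_err_h2, test_err_h2):
    # H overfits if some H' is worse on the training set but better on the test set.
    return train_err_h < train_err_h2 and test_err_h > test_err_h2

# hypothetical errors: H memorizes the training set but generalizes badly
print(overfits(train_err_h=0.02, test_err_h=0.30,
               train_err_h2=0.10, test_err_h2=0.15))  # -> True
```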
Some decision tree algorithms:
• 1R,
• ID3 (Iterative Dichotomiser 3),
• C4.5 (ID3 + discretization + pruning),
• CART (Classification and Regression Trees),
• CHAID (CHi-squared Automatic Interaction Detection).