
The joy of Entropy


Presentation Transcript


  1. The joy of Entropy

  2. Administrivia • Reminder: HW 1 due next week • No other news. No noose is good noose...

  3. Time wings on... • Last time: • Hypothesis spaces • Intro to decision trees • This time: • Learning bias • The getBestSplitFeature function • Entropy

  4. Back to decision trees... • Reminders: • Hypothesis space for DT: • Data struct view: All trees with a single test per internal node and constant leaf values • Geometric view: Sets of axis-orthogonal hyper-rectangles; piecewise-constant approximation • Open question: the getBestSplitFeature function

  5. Splitting criteria • What properties do we want our getBestSplitFeature() function to have? • Increase the purity of the data • After split, new sets should be closer to uniform labeling than before the split • Want the subsets to have roughly the same purity • Want the subsets to be as balanced as possible

  6. Bias • These choices are designed to produce small trees • May miss some other, better trees that are: • Larger • Require a non-greedy split at the root • Definition: Learning bias == tendency of an algorithm to find one class of solution out of H in preference to another

  7. Bias: the pretty picture • [Figure: the space of all functions, with the hypothesis space H shown as a subset]

  8. Bias: the algebra • Bias also seen as expected difference between true concept and induced concept: • Note: expectation taken over all possible data sets • Don’t actually know that distribution either :-P • Can (sometimes) make a prior assumption
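  The bias formula referenced on this slide did not survive the transcript. A hedged reconstruction, writing f for the true concept, h_D for the concept induced from training set D, and d(·,·) for whatever distance/loss the course uses (this notation is assumed, not copied from the slide), with the expectation taken over all possible data sets as the slide says:

      \mathrm{Bias} \;\approx\; \mathbb{E}_{D}\big[\, d(f,\, h_D) \,\big]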

  9. More on Bias • Bias can be a property of: • Risk/loss function • How you measure “distance” to the best solution • Search strategy • How you move through H to find a solution

  10. Back to splitting... • Consider a set of true/false labels • Want our measure to be small when the set is pure (all true or all false), and large when set is almost evenly divided between the classes • In general: we call such a function impurity, i(y) • We’ll use entropy • Expresses the amount of information in the set • (Later we’ll use the negative of this function, so it’ll be better if the set is almost pure)

  11. Entropy, cont’d • Define: class fractions (a.k.a., class prior probabilities)

  12. Entropy, cont’d • Define: class fractions (a.k.a., class prior probabilities) • Define: entropy of a set

  13. Entropy, cont’d • Define: class fractions (a.k.a., class prior probabilities) • Define: entropy of a set • In general, for k classes:
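  The formulas behind slides 11–13 are missing from the transcript. A standard reconstruction, with n_i the number of examples in class i out of n total (this notation is assumed, not copied from the slides):

      p_i = \frac{n_i}{n} \quad \text{(class fraction / prior)}

      H(y) = -\,p_{\text{true}} \log_2 p_{\text{true}} \;-\; p_{\text{false}} \log_2 p_{\text{false}} \quad \text{(two classes)}

      H(y) = -\sum_{i=1}^{k} p_i \log_2 p_i \quad \text{(k classes, with the convention } 0 \log_2 0 = 0\text{)}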

  14. The entropy curve

  15. Properties of entropy • Maximum when class fractions are equal • Minimum when data is pure • Smooth • Differentiable; continuous • Concave (its negative is convex) • Intuitively: the entropy of a distribution tells you how “predictable” that distribution is.
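  A quick numerical check of the first two properties, using the binary entropy formula above:

      H(0.5, 0.5) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \text{ bit (evenly mixed: maximum)}
      H(1, 0) = -1 \log_2 1 - 0 \log_2 0 = 0 \text{ bits (pure set: minimum)}
      H(0.9, 0.1) \approx 0.469 \text{ bits (mostly pure: low entropy)}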

  16. Entropy in a nutshell • From Andrew Moore’s tutorial on information gain: http://www.cs.cmu.edu/~awm/tutorials

  17. Entropy in a nutshell • Low-entropy distribution: data values (location of soup) sampled from a tight distribution (the bowl) -- highly predictable

  18. Entropy in a nutshell • High-entropy distribution: data values (location of soup) sampled from a loose distribution (spread uniformly around the dining room) -- highly unpredictable

  19. Entropy of a split • A split produces a number of sets (one for each branch) • Need a corresponding entropy of a split (i.e., entropy of a collection of sets) • Definition: entropy of a B-way split where:
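  The split-entropy formula is missing here; the standard definition, writing Y_b for the label subset sent down branch b, n_b = |Y_b|, and n = \sum_b n_b (a reconstruction, not a quote of the slide):

      H_{\text{split}}(Y_1, \dots, Y_B) \;=\; \sum_{b=1}^{B} \frac{n_b}{n}\, H(Y_b)

  i.e., each branch's entropy weighted by the fraction of the data that reaches that branch.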

  20. Information gain • The last, easy step: • Want to pick the attribute that decreases the information content of the data as much as possible • Q: Why decrease? • Define: gain of splitting data set [X,y] on attribute a:
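  The gain definition is likewise missing; the usual form, consistent with the baseInfo - splitEntropy line in the pseudocode on the next slide:

      \mathrm{Gain}([X, y], a) \;=\; H(y) \;-\; H_{\text{split}}(Y_1, \dots, Y_B)

  where Y_1, ..., Y_B are the label subsets produced by splitting on attribute a. We "decrease" because the attribute with the largest gain leaves the least entropy (impurity) behind after the split.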

  21. The splitting method

    // Pick the feature whose split yields the largest information gain.
    // Input: instance set X, label set Y
    Feature getBestSplitFeature(InstanceSet X, LabelSet Y) {
        double baseInfo = entropy(Y);               // entropy of the labels before splitting
        Feature best = null;
        double bestGain = Double.NEGATIVE_INFINITY;
        for (Feature a : X.getFeatureSet()) {
            // Split the data on feature a; keep the resulting label subsets Y0..Yk
            LabelSet[] Ys = a.splitData(X, Y);
            double gain = baseInfo - splitEntropy(Ys);   // information gain of this split
            if (gain > bestGain) { bestGain = gain; best = a; }   // running argmax
        }
        return best;
    }
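  For concreteness, here is a minimal, self-contained sketch (not from the slides) of the two helpers the pseudocode assumes, entropy() and splitEntropy(), written for plain boolean label arrays:

    import java.util.List;

    public class EntropyDemo {
        // Entropy of a boolean label set, in bits.
        static double entropy(boolean[] y) {
            if (y.length == 0) return 0.0;
            int pos = 0;
            for (boolean label : y) if (label) pos++;
            double pTrue = (double) pos / y.length;
            return h(pTrue) + h(1.0 - pTrue);
        }

        // One term of the entropy sum: -p * log2(p), with 0 log 0 = 0.
        static double h(double p) {
            return (p == 0.0) ? 0.0 : -p * (Math.log(p) / Math.log(2));
        }

        // Weighted entropy of a B-way split: sum_b (n_b / n) * H(Y_b).
        static double splitEntropy(List<boolean[]> subsets) {
            int n = 0;
            for (boolean[] yb : subsets) n += yb.length;
            double total = 0.0;
            for (boolean[] yb : subsets) total += ((double) yb.length / n) * entropy(yb);
            return total;
        }

        public static void main(String[] args) {
            boolean[] y = {true, true, true, false, false, false};
            List<boolean[]> split = List.of(
                new boolean[]{true, true, true},      // pure subset
                new boolean[]{false, false, false});  // pure subset
            System.out.println(entropy(y));                        // 1.0 bit (evenly mixed)
            System.out.println(splitEntropy(split));               // 0.0 (both branches pure)
            System.out.println(entropy(y) - splitEntropy(split));  // gain = 1.0
        }
    }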

  22. DTs in practice... • Growing to purity is bad (overfitting)

  23. DTs in practice... • Growing to purity is bad (overfitting) • [Figure: x1 = petal length vs. x2 = sepal width]

  24. DTs in practice... • Growing to purity is bad (overfitting) • [Figure: x1 = petal length vs. x2 = sepal width]

  25. DTs in practice... • Growing to purity is bad (overfitting) • Terminate growth early • Grow to purity, then prune back

  26. DTs in practice... • Growing to purity is bad (overfitting) • [Figure: x1 = petal length vs. x2 = sepal width; a leaf that is not statistically supportable has its split removed and its leaves merged]

  27. DTs in practice... • Multiway splits are a pain • Entropy is biased in favor of more splits • Correct w/ gain ratio (DH&S Ch. 8.3.2, Eqn 7)
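  The slide defers to DH&S for the exact equation; the usual gain-ratio correction (written here as a sketch in this lecture's notation, not a quote of DH&S Eqn 7) divides the gain by the entropy of the split proportions themselves:

      \mathrm{GainRatio}([X,y], a) \;=\; \frac{\mathrm{Gain}([X,y], a)}{-\sum_{b=1}^{B} \frac{n_b}{n} \log_2 \frac{n_b}{n}}

  A many-way split drives the denominator up, penalizing attributes that fragment the data into many small subsets.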

  28. DTs in practice... • Real-valued attributes • rules of form if (x1<3.4) { ... } • How to pick the “3.4”?
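  One common answer (standard practice, not stated on this slide): sort the observed values of x1 and score candidate thresholds at the midpoints between consecutive distinct values, keeping the one whose two-way split gives the highest information gain. A minimal sketch of the candidate-generation step:

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class ThresholdDemo {
        // Candidate thresholds: midpoints between consecutive distinct sorted values.
        // Each candidate t would then be scored by the gain of the split (x1 < t) vs. (x1 >= t).
        static double[] candidateThresholds(double[] values) {
            double[] v = values.clone();
            Arrays.sort(v);
            return IntStream.range(0, v.length - 1)
                    .filter(i -> v[i] != v[i + 1])
                    .mapToDouble(i -> (v[i] + v[i + 1]) / 2.0)
                    .toArray();
        }

        public static void main(String[] args) {
            double[] x1 = {1.4, 4.7, 1.3, 5.1, 4.5};   // e.g., petal lengths
            System.out.println(Arrays.toString(candidateThresholds(x1)));
            // -> [1.35, 2.95, 4.6, 4.9]
        }
    }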
