The joy of Entropy
Administrivia • Reminder: HW 1 due next week • No other news. No noose is good noose...
Time wings on... • Last time: • Hypothesis spaces • Intro to decision trees • This time: • Learning bias • The getBestSplitFeature function • Entropy
Back to decision trees... • Reminders: • Hypothesis space for DT: • Data struct view: All trees with a single test per internal node and a constant leaf value • Geometric view: Sets of axis-orthogonal hyper-rectangles; piecewise constant approximation • Open question: the getBestSplitFeature function
Splitting criteria • What properties do we want our getBestSplitFeature() function to have? • Increase the purity of the data • After split, new sets should be closer to uniform labeling than before the split • Want the subsets to have roughly the same purity • Want the subsets to be as balanced as possible
Bias • These choices are designed to produce small trees • May miss other, better trees that: • Are larger • Require a non-greedy split at the root • Definition: Learning bias == the tendency of an algorithm to prefer one class of solution out of H over another
Bias: the pretty picture • (Figure: the space of all functions on the input space, with the hypothesis space H drawn as a small subset of it)
Bias: the algebra • Bias also seen as expected difference between true concept and induced concept: • Note: expectation taken over all possible data sets • Don’t actually know that distribution either :-P • Can (sometimes) make a prior assumption
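Written out (a standard form; the symbols here are assumed rather than taken from the original slide), this reads:

  $\mathrm{Bias} = E_D\big[\, d(f,\ h_D) \,\big]$

where $f$ is the true concept, $h_D$ is the hypothesis the algorithm induces from data set $D$, $d(\cdot,\cdot)$ measures the difference between the two, and the expectation is taken over data sets $D$.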
More on Bias • Bias can be a property of: • Risk/loss function • How you measure “distance” to the best solution • Search strategy • How you move through H to find a hypothesis
Back to splitting... • Consider a set of true/false labels • Want our measure to be small when the set is pure (all true or all false), and large when the set is almost evenly divided between the classes • In general: we call such a function an impurity measure, i(y) • We’ll use entropy • Expresses the amount of information in the set • (Later we’ll use the negative of this function, so larger values will correspond to purer sets)
Entropy, cont’d • Define: class fractions (a.k.a. class prior probabilities) • Define: entropy of a set • In general, for K classes:
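Written out (a standard reconstruction; notation is assumed rather than copied from the original slide):

  Class fractions: for a set of $N$ labels $y$ and class $k$, $p_k = \frac{|\{\, i : y_i = k \,\}|}{N}$

  Entropy of a two-class (true/false) set: $H(y) = -\,p_T \log_2 p_T \;-\; p_F \log_2 p_F$

  In general, for $K$ classes: $H(y) = -\sum_{k=1}^{K} p_k \log_2 p_k$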
Properties of entropy • Maximum when the class fractions are equal • Minimum (zero) when the data is pure • Smooth • Differentiable; continuous • Concave • Intuitively: the entropy of a distribution tells you how “predictable” that distribution is.
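A quick worked example of the extremes (added for concreteness; not from the original slides): for two classes, the even split $(p_1, p_2) = (\tfrac{1}{2}, \tfrac{1}{2})$ gives $H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit; a pure set $(1, 0)$ gives $H = 0$ bits; a nearly pure set $(0.9, 0.1)$ gives $H \approx 0.47$ bits.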
Entropy in a nutshell From: Andrew Moore’s tutorial on information gain: http://www.cs.cmu.edu/~awm/tutorials
Entropy in a nutshell • Low-entropy distribution: data values (location of soup) sampled from a tight distribution (the bowl) -- highly predictable
Entropy in a nutshell • High-entropy distribution: data values (location of soup) sampled from a loose distribution (spread uniformly around the dining room) -- highly unpredictable
Entropy of a split • A split produces a number of sets (one for each branch) • Need a corresponding entropy of a split (i.e., entropy of a collection of sets) • Definition: entropy of a B-way split where:
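A standard weighted-average form of this definition (notation assumed, not copied from the original slide):

  $H_{split} = \sum_{b=1}^{B} \frac{N_b}{N}\, H(y_b)$

where $N_b$ is the number of examples routed to branch $b$, $N = \sum_b N_b$ is the total number of examples, and $H(y_b)$ is the entropy of the labels in branch $b$.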
Information gain • The last, easy step: • Want to pick the attribute that decreases the information content of the data as much as possible • Q: Why decrease? • Define: gain of splitting data set [X,y] on attribute a:
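The usual form of the gain (symbols assumed to match the definitions above):

  $gain(X, y; a) = H(y) - H_{split}(y; a)$

i.e., the entropy of the labels before the split minus the weighted entropy of the subsets produced by splitting on attribute $a$. We want to decrease information content because lower entropy means the labels within each subset are more predictable (purer).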
The splitting method
  Feature getBestSplitFeature(X, Y) {
    // Input: instance set X, label set Y
    // Entropy of the labels before any split
    double baseInfo = entropy(Y);
    double[] gain = new double[X.getFeatureSet().size()];
    for (Feature a : X.getFeatureSet()) {
      // Partition the data according to the values of attribute a
      [X0,...,Xk, Y0,...,Yk] = a.splitData(X, Y);
      // Gain = entropy before the split minus weighted entropy after
      gain[a] = baseInfo - splitEntropy(Y0,...,Yk);
    }
    // Return the feature with the largest information gain
    return argmax(gain);
  }
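For concreteness, here is a small, self-contained Java sketch of the entropy, split-entropy, and gain computations that getBestSplitFeature relies on. The class name, method names, String labels, and toy data are all assumptions made for illustration, not the course's actual code.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class InfoGainSketch {

      // Entropy (in bits) of a list of class labels.
      static double entropy(List<String> labels) {
          Map<String, Integer> counts = new HashMap<>();
          for (String y : labels) {
              counts.merge(y, 1, Integer::sum);
          }
          double h = 0.0;
          for (int c : counts.values()) {
              double p = (double) c / labels.size();
              h -= p * (Math.log(p) / Math.log(2));
          }
          return h;
      }

      // Weighted entropy of a B-way split: sum_b (N_b / N) * H(y_b).
      static double splitEntropy(List<List<String>> branches) {
          int n = branches.stream().mapToInt(List::size).sum();
          double h = 0.0;
          for (List<String> branch : branches) {
              if (!branch.isEmpty()) {
                  h += ((double) branch.size() / n) * entropy(branch);
              }
          }
          return h;
      }

      // Information gain of a candidate split.
      static double gain(List<String> labels, List<List<String>> branches) {
          return entropy(labels) - splitEntropy(branches);
      }

      public static void main(String[] args) {
          // Toy example: 4 true / 4 false labels, split into two branches.
          List<String> labels = List.of("T", "T", "T", "T", "F", "F", "F", "F");
          List<List<String>> branches = List.of(
                  List.of("T", "T", "T", "F"),   // branch 0
                  List.of("T", "F", "F", "F"));  // branch 1
          System.out.println("base entropy  = " + entropy(labels));      // 1.0 bit
          System.out.println("split entropy = " + splitEntropy(branches));
          System.out.println("gain          = " + gain(labels, branches));
      }
  }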
DTs in practice... • Growing to purity is bad (overfitting) • (Figure: scatter of the data with x1 = petal length on the horizontal axis and x2 = sepal width on the vertical axis, with the decision-tree partition drawn over it)
DTs in practice... • Growing to purity is bad (overfitting) • Terminate growth early • Grow to purity, then prune back
DTs in practice... • (Figure: the pruning step on the same petal-length / sepal-width plot -- a leaf that is not statistically supportable is pruned away: remove the split and merge its leaves)
DTs in practice... • Multiway splits are a pain • Entropy is biased in favor of more splits • Correct w/ gain ratio (DH&S Ch. 8.3.2, Eqn 7)
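A commonly used form of the gain ratio (following Quinlan's C4.5; the notation here is assumed and may differ from DH&S Eqn 7):

  $gainRatio(X, y; a) = \dfrac{gain(X, y; a)}{-\sum_{b=1}^{B} \frac{N_b}{N} \log_2 \frac{N_b}{N}}$

The denominator is the entropy of the split proportions themselves; it grows with the number of branches, offsetting plain gain's preference for many-way splits.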
DTs in practice... • Real-valued attributes • rules of form if (x1<3.4) { ... } • How to pick the “3.4”?
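The slide leaves this as an open question; one common approach (sketched here with assumed names, not presented as the course's prescribed method) is to sort the observed values of the attribute, take candidate thresholds at the midpoints between consecutive distinct values, and keep the threshold whose binary split has the highest information gain.

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  // Hypothetical sketch: enumerate candidate thresholds for a real-valued attribute.
  public class ThresholdSketch {

      // Midpoints between consecutive distinct sorted values.
      static double[] candidateThresholds(double[] values) {
          double[] sorted = values.clone();
          Arrays.sort(sorted);
          List<Double> mids = new ArrayList<>();
          for (int i = 1; i < sorted.length; i++) {
              if (sorted[i] != sorted[i - 1]) {
                  mids.add((sorted[i] + sorted[i - 1]) / 2.0);
              }
          }
          return mids.stream().mapToDouble(Double::doubleValue).toArray();
      }

      public static void main(String[] args) {
          double[] x1 = {1.4, 4.7, 1.3, 3.9, 5.1, 1.5};   // e.g., petal lengths
          // Each candidate t defines a binary split (x1 < t vs. x1 >= t);
          // score each split with the information gain defined above and keep the best t.
          System.out.println(Arrays.toString(candidateThresholds(x1)));
      }
  }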