Decision Trees and Numeric Attributes
• Generally, when a numeric attribute is used to split data, the split is binary, so the same numeric attribute may be tested several times along a path.
• The choice of attribute to split on is once again based on information gain: sort the attribute's values and find the breakpoint where the information gain is maximized.
• For example, when the values of a numeric attribute are sorted as follows, with the corresponding classes, consider the breakpoint between 62 and 65:
45 (C1), 53 (C2), 55 (C2), 62 (C2), 65 (C1), 71 (C1), 75 (C1), 80 (C2)
• The four values left of the breakpoint have class counts [1,3] (one C1, three C2) and the four to the right [3,1], so the information remaining after the split (logs base 2) is:
info([1,3], [3,1]) = 4/8 * info([1,3]) + 4/8 * info([3,1]) = info([1,3])
info([1,3]) = -1/4 * log(1/4) - 3/4 * log(3/4) = 0.811 bits
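This arithmetic can be checked with a minimal Python sketch (the helper names info and weighted_info are ours, not from the text):

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts, e.g. [1, 3]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def weighted_info(subsets):
    """Average entropy of the subsets produced by a split."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)

print(round(info([1, 3]), 3))                      # 0.811
print(round(weighted_info([[1, 3], [3, 1]]), 3))   # 0.811, as on the slide
```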
Decision Trees and Missing Values
• Treat "missing" as a separate category if it has some significance.
• Alternately, when the attribute is missing for a given instance, send the instance down the most popular branch at that split point.
• More sophisticated approach: notionally split the instance into pieces, weighted by the proportion of training instances that go down each branch; whenever a piece is split again at an intermediate node, its weight is split further; eventually all pieces reach leaf nodes, and the leaf decisions are weighted by those weights and summed up (a sketch follows).
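A hypothetical sketch of this weighted routing (the Node class and its field names are ours, chosen for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool = False
    klass: str = ""                    # class label, used at leaves
    attribute: str = ""                # attribute tested, used at internal nodes
    children: dict = field(default_factory=dict)          # branch value -> Node
    branch_fractions: dict = field(default_factory=dict)  # branch value -> fraction

def classify(node, instance, weight=1.0):
    """Return {class: weight} votes, splitting the weight across all
    branches whenever the tested attribute is missing."""
    if node.is_leaf:
        return {node.klass: weight}
    value = instance.get(node.attribute)          # None when missing
    branches = (node.branch_fractions.items() if value is None
                else [(value, 1.0)])
    votes = {}
    for branch, fraction in branches:
        for k, w in classify(node.children[branch], instance,
                             weight * fraction).items():
            votes[k] = votes.get(k, 0.0) + w
    return votes

# Toy tree: split on "outlook"; 60% of training data went down "sunny".
tree = Node(attribute="outlook",
            children={"sunny": Node(is_leaf=True, klass="yes"),
                      "rainy": Node(is_leaf=True, klass="no")},
            branch_fractions={"sunny": 0.6, "rainy": 0.4})
print(classify(tree, {}))   # {'yes': 0.6, 'no': 0.4}
```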
Decision Trees: Pruning
• Postpruning (or backward pruning) and prepruning (forward pruning).
• Most decision tree builders employ postpruning.
• Postpruning involves subtree replacement and subtree raising.
• Subtree replacement: select some subtrees and replace them with a single leaf node (see Figure 1.3a).
• Subtree raising: more complex and not always worthwhile; the C4.5 scheme uses it (see Fig. 6.1), generally restricted to the most popular branch.
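Subtree replacement itself is a small operation; a sketch, reusing the Node class from the earlier sketch and assuming each training instance is a dict with a "class" key (whether to replace is decided by the error estimates on the next slide):

```python
from collections import Counter

def subtree_replace(node, covered_instances):
    """Subtree replacement: collapse the subtree rooted at `node` into a
    single leaf labelled with the majority class of the training
    instances that reach it."""
    counts = Counter(x["class"] for x in covered_instances)
    node.klass = counts.most_common(1)[0][0]
    node.is_leaf = True
    node.children = {}
```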
Decision Trees: Estimating Error
• In deciding on subtree replacement or subtree raising, we need to know the resulting estimated error.
• Keep in mind that the training set is just a small subset of the entire universe of data, so the tree should not fit just the training data; the error estimate should take this into account.
• Method 1: reduced-error pruning. Hold back some of the training data and use it to estimate the error due to pruning. Not very good, as it reduces the training data.
• Method 2: an error estimate based on the entire training data (sketched below).
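Method 2 is usually realized, as in C4.5, by replacing the observed error rate at a node with an upper confidence bound on it; a sketch, assuming the common C4.5 default of 25% confidence (z ≈ 0.69):

```python
from math import sqrt

def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound on the true error rate, given an observed
    error rate f over n instances; z = 0.69 corresponds to the 25%
    confidence level that C4.5 uses by default."""
    return ((f + z * z / (2 * n)
             + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
            / (1 + z * z / n))

# A leaf that misclassifies 5 of the 14 training instances it covers:
print(round(pessimistic_error(5 / 14, 14), 2))   # 0.45, vs. the raw 0.36
```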
Classification Rules
• Simple separate-and-conquer technique.
• Problem: rules tend to overfit the training data and do not generalize well to independent sets, particularly on noisy data.
• Criteria for choosing tests (in a rule):
• Maximize correctness: p/t, where t is the total number of instances covered by the rule, of which p are positive.
• Based on information gain: p[log(p/t) - log(P/T)], where P is the total number of positive instances and T the total number of instances before the rule was applied.
• Test 1 places more importance on correctness than on coverage; Test 2 is also concerned with coverage (both are sketched below).
• Missing values: best to treat them as if they do not match the test; this way the instance may match on other attributes in other rules.
• Numeric attributes: sort the attribute values and use breakpoints to form rules.
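The two criteria side by side (logs base 2; the example counts are ours):

```python
from math import log2

def correctness(p, t):
    """Test 1: fraction of instances covered by the rule that are positive."""
    return p / t

def info_gain(p, t, P, T):
    """Test 2: p * [log(p/t) - log(P/T)]; multiplying by p rewards coverage."""
    return p * (log2(p / t) - log2(P / T))

# With P = 50 positives among T = 100 instances, compare a 2/2 rule
# against a 15/20 rule:
print(correctness(2, 2), correctness(15, 20))   # 1.0  0.75
print(round(info_gain(2, 2, 50, 100), 2))       # 2.0
print(round(info_gain(15, 20, 50, 100), 2))     # 8.77, coverage wins
```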
Classification Rules: Generating Good Rules
• Objective: instead of deriving rules that overfit the training data, it is best to generate sensible rules that stand a better chance of performing well on new test instances.
• Coverage versus accuracy: should we choose a rule that is right on 15/20 instances or one that is 2/2 (that is, 100% correct)?
• Split the training data into a growing set and a pruning set.
• Use the growing set to form rules.
• Then remove part of a rule and see its effect on the pruning set; if satisfied, remove that part of the test.
• Algorithm for forming rules by incremental reduced-error pruning.
• Worth of a rule based on the pruning set: suppose it gets p instances right out of the t instances it covers, and P is the total number of positive instances out of T. With N = T - P total negatives and n = t - p negatives covered, the rule correctly handles the p positives it covers plus the N - n negatives it leaves uncovered, so [p + (N - n)]/T is taken as the metric (sketched below).
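A direct transcription of the worth metric, applied to the 2/2 versus 15/20 comparison above:

```python
def rule_worth(p, t, P, T):
    """[p + (N - n)] / T: positives covered plus negatives left
    uncovered, as a fraction of all pruning-set instances."""
    N = T - P      # total negative instances
    n = t - p      # negatives (mistakenly) covered by the rule
    return (p + (N - n)) / T

# With P = 50, T = 100, this metric also prefers the high-coverage rule:
print(rule_worth(2, 2, 50, 100))     # 0.52
print(rule_worth(15, 20, 50, 100))   # 0.6
```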
Classification Rules: Global Optimization
• First generate rules using incremental reduced-error pruning techniques.
• Then perform a global optimization to increase the accuracy of the rule set, by revising or replacing individual rules.
• Postinduction optimization has been shown to improve both the size and performance of the rule set.
• But this process is often complex.
• RIPPER is a build-and-optimize algorithm.
Classification Rules: Using Partial Decision Trees
• Alternative approach to rule induction that avoids global optimization.
• Combines divide-and-conquer from decision tree learning (p. 62) with separate-and-conquer from rule learning (p. 112).
• Separate-and-conquer:
• Build a rule.
• Remove the instances it covers.
• Repeat these steps recursively for the remaining instances until none are left. (A skeleton of this loop is sketched below.)
• It differs from the standard approach in the following way: to make a single rule, a pruned decision tree is built for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded.
• A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees.
• entropy(p1, p2, p3, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
• info([a,b,c]) = entropy(a/(a+b+c), b/(a+b+c), c/(a+b+c))
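A skeleton of the separate-and-conquer loop; build_rule and covers are placeholders for a real rule learner and its coverage test, so the toy usage below just stands in for them:

```python
def separate_and_conquer(instances, build_rule, covers):
    """Learn a rule, remove the instances it covers, and repeat on the
    remainder until none are left."""
    rules = []
    while instances:
        rule = build_rule(instances)      # e.g. best leaf of a partial tree
        rules.append(rule)
        instances = [x for x in instances if not covers(rule, x)]
    return rules

# Toy usage: each "rule" is just the most common value in what remains.
data = [1, 1, 2, 2, 2, 3]
print(separate_and_conquer(
    data,
    build_rule=lambda xs: max(set(xs), key=xs.count),
    covers=lambda r, x: x == r))          # -> [2, 1, 3]
```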
Fig. 6.5: Algorithm
• Using the information-gain heuristic (as in decision trees), split the set of instances into subsets.
• Expand the subsets in increasing order of entropy: a subset with low average entropy is more likely to result in a small subtree and hence to produce a more general rule (see the snippet below).
• Repeat this step recursively until a subset is expanded into a leaf.
• Then continue further by backtracking.
• Once an internal node is reached whose children have all been expanded into leaves, check whether that node may be replaced by a single leaf.
• If, during backtracking, a node is reached whose children have not all been expanded into leaves, the algorithm stops.
• Each leaf corresponds to a single rule, and the best leaf (the one covering the most instances) is chosen.
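A tiny illustration of the expansion order only (info is repeated from the earlier sketch so this runs on its own):

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Candidate subsets, as class-count pairs; the purest is expanded first.
subsets = [[1, 3], [2, 2], [4, 0]]
print(sorted(subsets, key=info))   # [[4, 0], [1, 3], [2, 2]]
```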
Classification Rules: Rules with Exceptions
• First, choose a default class for the top-level rule, typically the one with the highest frequency. (See Fig. 6.7; 50/150 indicates that 50 instances out of the total 150 satisfy this rule.)
• Split the training data into those that satisfy the rule and those that don't.
• For those that don't, repeat the algorithm recursively.
• In Fig. 6.7, the horizontal dashed lines show exceptions and the vertical solid lines show alternatives.
• In the Iris data, each of the three classes (setosa, versicolor, virginica) has equal coverage of 50/150. Arbitrarily choose Iris setosa as the default class (50/150 satisfy it).
• For the 100/150 that don't satisfy the default, there are two alternatives.
• This step is repeated until no more exceptions occur.
• Advantage: most instances are covered by the high-level rules, and the low-level rules truly represent exceptions.
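A hypothetical sketch of how such a nested structure evaluates (the thresholds are illustrative, not taken from Fig. 6.7):

```python
def classify(instance, rule):
    """Evaluate one rule-with-exceptions node: `rule` is a triple
    (condition, klass, exceptions); a None condition always fires, and
    a firing exception overrides the rule's own class."""
    cond, klass, exceptions = rule
    if cond is not None and not cond(instance):
        return None                        # the rule does not apply
    for ex in exceptions:
        result = classify(instance, ex)
        if result is not None:
            return result
    return klass

# Toy Iris-like structure: default setosa, one exception chain.
rules = (None, "setosa",
         [(lambda x: x["petal-length"] >= 2.45, "versicolor",
           [(lambda x: x["petal-width"] > 1.75, "virginica", [])])])
print(classify({"petal-length": 5.0, "petal-width": 2.0}, rules))  # virginica
print(classify({"petal-length": 1.4, "petal-width": 0.2}, rules))  # setosa
```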
Extending Linear Models
• The idea is to transform the original attributes into a transformed space.
• For example, if a model with 2 attributes, say a1 and a2, were to include all degree-2 products, we would have a1^2, a1a2, and a2^2 as three synthetic attributes. The three new attributes may then be linearly combined using three weights.
• For each class, generate a linear model, and for any given test instance, choose the class that yields the highest value (that fits the best).
• Problem: computational complexity and overfitting. If there are n attributes and we want nonlinear terms of degree d, there would be n^d + (n-1)^d + (n-2)^d + ... + 1^d synthetic attributes; when n and d are large, this is an explosive number. For example, with n = 10 and d = 2, we get n(n+1)(2n+1)/6 = 10*11*21/6 = 385 synthetic attributes; with d = 3, it is n^2(n+1)^2/4 = 100*121/4 = 3025 synthetic attributes! (The counts are verified below.)
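The closed forms are just the sums of squares and cubes; a two-line check:

```python
def synthetic_attribute_count(n, d):
    """n^d + (n-1)^d + ... + 1^d, the count quoted on the slide."""
    return sum(k ** d for k in range(1, n + 1))

print(synthetic_attribute_count(10, 2))   # 385
print(synthetic_attribute_count(10, 3))   # 3025
```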
Support Vector Regression
• Basic regression: find a function that approximates the training data well by minimizing the prediction error (e.g., MSE).
• What is special about SVR: all deviations up to a user-specified parameter ε are simply discarded.
• Also, what is minimized is the absolute error rather than the MSE.
• The value of ε controls how closely the function fits the training data: too small an ε leads to overfitting; too large an ε leads to meaningless predictions.
• See Fig. 6.9.
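A minimal sketch of the ε effect using scikit-learn's SVR (assuming scikit-learn is available; the data here is synthetic, not from the text):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

# Small epsilon: the tube is narrow, so the fit chases the noise.
# Large epsilon: almost every point falls inside the tube and is
# ignored, so the prediction flattens out.
tight = SVR(kernel="rbf", epsilon=0.01).fit(X, y)
loose = SVR(kernel="rbf", epsilon=1.0).fit(X, y)
print(tight.predict([[2.5]]), loose.predict([[2.5]]))
```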
Instance-based Learning
• Basic scheme: use the nearest-neighbor technique.
• Tends to be slow for large training sets.
• Performs badly with noisy data: the class of an instance is based on its single nearest neighbor rather than on an average.
• No weights are associated with the different attributes, though generally some have a larger effect than others.
• Does not perform explicit generalization.
• Reducing the number of exemplars:
• Already-seen instances that are used for classification are referred to as exemplars.
• Classify each new example with the exemplars already seen and save only the ones that don't fit; expand the exemplar set only when necessary (sketched below).
• Problem: noisy examples are likely to be misclassified and therefore saved as new exemplars.
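A sketch of the "save only the ones that don't fit" idea (in the IB2 style; function names and the toy data are ours):

```python
def nearest(exemplars, x):
    """Return the stored (vector, klass) pair closest to x (1-NN)."""
    return min(exemplars,
               key=lambda e: sum((a - b) ** 2 for a, b in zip(e[0], x)))

def filter_exemplars(training):
    """Keep an instance only when the exemplars seen so far would
    misclassify it; correctly classified instances are discarded."""
    exemplars = [training[0]]
    for x, klass in training[1:]:
        if nearest(exemplars, x)[1] != klass:
            exemplars.append((x, klass))
    return exemplars

data = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((1.0, 1.0), "b")]
print(filter_exemplars(data))   # keeps the first "a" and the "b"
```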
Pruning noisy exemplars:
• For a given k, choose the k nearest neighbors and assign the majority class to the unknown instance.
• Alternately, monitor the performance of the stored exemplars: keep the ones that do well (match well) and discard the rest.
• IB3 (Instance-Based learner version 3) uses a 5% confidence level for acceptance and 1.25% for rejection. The criterion for acceptance is more stringent than that for rejection, making it more difficult for an instance to be accepted.
• Weighting attributes: use w1, w2, ..., wn as weights in computing the Euclidean distance metric for the n attributes (see page 238).
• All attribute weights are updated after each training instance is classified, and the most similar exemplar is used as the basis for updating.
• Suppose x is the training instance and y the most similar exemplar; then for each attribute i, |xi - yi| is a measure of that attribute's contribution to the decision: the smaller the difference, the larger the contribution.
• See page 238 for details of changing the attribute weights (a rough sketch follows).
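A sketch of the weighted distance plus a deliberately simplified update rule; the ±delta scheme and the 0.5 similarity threshold are our illustration (assuming attributes normalized to [0, 1]), not the book's exact update on p. 238:

```python
def weighted_distance(x, y, w):
    """Euclidean distance with per-attribute weights w1..wn."""
    return sum(wi * (xi - yi) ** 2
               for wi, xi, yi in zip(w, x, y)) ** 0.5

def update_weights(w, x, y, correct, delta=0.05):
    """Simplified illustration: after training instance x is classified
    against its most similar exemplar y, reward an attribute whose
    similarity (|xi - yi| < 0.5) agrees with the outcome, and penalize
    it otherwise."""
    return [max(0.0, wi + (delta if (abs(xi - yi) < 0.5) == correct
                           else -delta))
            for wi, xi, yi in zip(w, x, y)]
```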
Generalizing exemplars:
• Generalized exemplars are rectangular regions of instance space, called hyperrectangles.
• Now, when classifying new instances, it is necessary to calculate the distance from the instance to each hyperrectangle (see the sketch below).
• When a new exemplar is classified correctly, it is generalized simply by merging it with the nearest exemplar of the same class.
• If the nearest exemplar is a single instance, a new hyperrectangle is created that covers both exemplars.
• Otherwise, the existing hyperrectangle is enlarged to cover the new one.
• If the prediction is incorrect, the hyperrectangle's boundaries are shrunk so that it is separated from the instance that was misclassified.
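A minimal sketch of the two geometric operations (function names are ours):

```python
def distance_to_rectangle(x, lower, upper):
    """Distance from point x to an axis-aligned hyperrectangle given by
    per-attribute lower/upper bounds; zero when x lies inside."""
    clipped = [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]
    return sum((xi - ci) ** 2 for xi, ci in zip(x, clipped)) ** 0.5

def merge(x, lower, upper):
    """Generalization: grow the rectangle just enough to cover point x."""
    return ([min(xi, lo) for xi, lo in zip(x, lower)],
            [max(xi, hi) for xi, hi in zip(x, upper)])

lower, upper = [0.0, 0.0], [1.0, 1.0]
print(distance_to_rectangle([2.0, 0.5], lower, upper))   # 1.0
print(merge([2.0, 0.5], lower, upper))   # ([0.0, 0.0], [2.0, 1.0])
```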