Decision Trees and Numeric Attributes
• Generally, when a numeric attribute is used to split data, the split is binary, so the same numeric attribute may be tested several times along a path.
• The choice of attribute to split on is once again based on information gain: sort the attribute's values and find the breakpoint where the information gain is maximized.
• For example, when the values of a numeric attribute are sorted as follows, with the corresponding classes, consider the breakpoint between 62 and 65:
45 (C1), 53 (C2), 55 (C2), 62 (C2), 65 (C1), 71 (C1), 75 (C1), 80 (C2)
• The four values left of the breakpoint have class counts [1,3] (one C1, three C2) and the four to the right [3,1], so the information remaining after the split (logs base 2) is:
info([1,3], [3,1]) = 4/8 * info([1,3]) + 4/8 * info([3,1]) = info([1,3])
info([1,3]) = -1/4 * log(1/4) - 3/4 * log(3/4) = 0.811 bits
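This arithmetic can be checked with a minimal Python sketch (the helper names info and weighted_info are ours, not from the text):

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts, e.g. [1, 3]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def weighted_info(subsets):
    """Average entropy of the subsets produced by a split."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)

print(round(info([1, 3]), 3))                      # 0.811
print(round(weighted_info([[1, 3], [3, 1]]), 3))   # 0.811, as on the slide
```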
Decision Trees and Missing Values
• Treat "missing" as a separate category if it has some significance.
• Alternately, when the attribute is missing for a given instance, send the instance down the most popular branch at that split point.
• More sophisticated approach: notionally split the instance into pieces, weighted by the proportion of training instances that go down each branch; whenever a piece is split again at an intermediate node, its weight is split further; eventually all pieces reach leaf nodes, and the leaf decisions are weighted by those weights and summed up (a sketch follows).
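A hypothetical sketch of this weighted routing (the Node class and its field names are ours, chosen for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool = False
    klass: str = ""                    # class label, used at leaves
    attribute: str = ""                # attribute tested, used at internal nodes
    children: dict = field(default_factory=dict)          # branch value -> Node
    branch_fractions: dict = field(default_factory=dict)  # branch value -> fraction

def classify(node, instance, weight=1.0):
    """Return {class: weight} votes, splitting the weight across all
    branches whenever the tested attribute is missing."""
    if node.is_leaf:
        return {node.klass: weight}
    value = instance.get(node.attribute)          # None when missing
    branches = (node.branch_fractions.items() if value is None
                else [(value, 1.0)])
    votes = {}
    for branch, fraction in branches:
        for k, w in classify(node.children[branch], instance,
                             weight * fraction).items():
            votes[k] = votes.get(k, 0.0) + w
    return votes

# Toy tree: split on "outlook"; 60% of training data went down "sunny".
tree = Node(attribute="outlook",
            children={"sunny": Node(is_leaf=True, klass="yes"),
                      "rainy": Node(is_leaf=True, klass="no")},
            branch_fractions={"sunny": 0.6, "rainy": 0.4})
print(classify(tree, {}))   # {'yes': 0.6, 'no': 0.4}
```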
Decision Trees: Pruning
• Postpruning (or backward pruning) and prepruning (forward pruning).
• Most decision tree builders employ postpruning.
• Postpruning involves subtree replacement and subtree raising.
• Subtree replacement: select some subtrees and replace them with a single leaf node (see Figure 1.3a).
• Subtree raising: more complex and not always worthwhile; the C4.5 scheme uses it (see Fig. 6.1), generally restricted to the most popular branch.
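Subtree replacement itself is a small operation; a sketch, reusing the Node class from the earlier sketch and assuming each training instance is a dict with a "class" key (whether to replace is decided by the error estimates on the next slide):

```python
from collections import Counter

def subtree_replace(node, covered_instances):
    """Subtree replacement: collapse the subtree rooted at `node` into a
    single leaf labelled with the majority class of the training
    instances that reach it."""
    counts = Counter(x["class"] for x in covered_instances)
    node.klass = counts.most_common(1)[0][0]
    node.is_leaf = True
    node.children = {}
```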
Decision Trees: Estimating Error
• In deciding on subtree replacement or subtree raising, we need to know the resulting estimated error.
• Keep in mind that the training set is just a small subset of the entire universe of data, so the tree should not fit just the training data; the error estimate should take this into account.
• Method 1: reduced-error pruning. Hold back some of the training data and use it to estimate the error due to pruning. Not very good, as it reduces the training data.
• Method 2: an error estimate based on the entire training data (sketched below).
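Method 2 is usually realized, as in C4.5, by replacing the observed error rate at a node with an upper confidence bound on it; a sketch, assuming the common C4.5 default of 25% confidence (z ≈ 0.69):

```python
from math import sqrt

def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound on the true error rate, given an observed
    error rate f over n instances; z = 0.69 corresponds to the 25%
    confidence level that C4.5 uses by default."""
    return ((f + z * z / (2 * n)
             + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
            / (1 + z * z / n))

# A leaf that misclassifies 5 of the 14 training instances it covers:
print(round(pessimistic_error(5 / 14, 14), 2))   # 0.45, vs. the raw 0.36
```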
Classification Rules
• Simple separate-and-conquer technique.
• Problem: rules tend to overfit the training data and do not generalize well to independent sets, particularly on noisy data.
• Criteria for choosing tests (in a rule):
• Maximize correctness: p/t, where t is the total number of instances covered by the rule, of which p are positive.
• Based on information gain: p[log(p/t) - log(P/T)], where P is the total number of positive instances and T the total number of instances before the rule was applied.
• Test 1 places more importance on correctness than on coverage; Test 2 is also concerned with coverage (both are sketched below).
• Missing values: best to treat them as if they do not match the test; this way the instance may match on other attributes in other rules.
• Numeric attributes: sort the attribute values and use breakpoints to form rules.
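The two criteria side by side (logs base 2; the example counts are ours):

```python
from math import log2

def correctness(p, t):
    """Test 1: fraction of instances covered by the rule that are positive."""
    return p / t

def info_gain(p, t, P, T):
    """Test 2: p * [log(p/t) - log(P/T)]; multiplying by p rewards coverage."""
    return p * (log2(p / t) - log2(P / T))

# With P = 50 positives among T = 100 instances, compare a 2/2 rule
# against a 15/20 rule:
print(correctness(2, 2), correctness(15, 20))   # 1.0  0.75
print(round(info_gain(2, 2, 50, 100), 2))       # 2.0
print(round(info_gain(15, 20, 50, 100), 2))     # 8.77, coverage wins
```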
Classification Rules: Generating Good Rules
• Objective: instead of deriving rules that overfit the training data, it is best to generate sensible rules that stand a better chance of performing well on new test instances.
• Coverage versus accuracy: should we choose a rule that is right on 15/20 instances or one that is 2/2 (that is, 100% correct)?
• Split the training data into a growing set and a pruning set.
• Use the growing set to form rules.
• Then remove part of a rule and see its effect on the pruning set; if satisfied, remove that part of the test.
• Algorithm for forming rules by incremental reduced-error pruning.
• Worth of a rule based on the pruning set: suppose it gets p instances right out of the t instances it covers, and P is the total number of positive instances out of T. With N = T - P total negatives and n = t - p negatives covered, the rule correctly handles the p positives it covers plus the N - n negatives it leaves uncovered, so [p + (N - n)]/T is taken as the metric (sketched below).
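A direct transcription of the worth metric, applied to the 2/2 versus 15/20 comparison above:

```python
def rule_worth(p, t, P, T):
    """[p + (N - n)] / T: positives covered plus negatives left
    uncovered, as a fraction of all pruning-set instances."""
    N = T - P      # total negative instances
    n = t - p      # negatives (mistakenly) covered by the rule
    return (p + (N - n)) / T

# With P = 50, T = 100, this metric also prefers the high-coverage rule:
print(rule_worth(2, 2, 50, 100))     # 0.52
print(rule_worth(15, 20, 50, 100))   # 0.6
```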
Classification Rules: Global Optimization
• First generate rules using incremental reduced-error pruning techniques.
• Then perform a global optimization to increase the accuracy of the rule set, by revising or replacing individual rules.
• Postinduction optimization has been shown to improve both the size and performance of the rule set.
• But this process is often complex.
• RIPPER is a build-and-optimize algorithm.
Classification Rules: Using Partial Decision Trees
• Alternative approach to rule induction that avoids global optimization.
• Combines divide-and-conquer from decision tree learning (p. 62) with separate-and-conquer from rule learning (p. 112).
• Separate-and-conquer:
• Build a rule.
• Remove the instances it covers.
• Repeat these steps recursively for the remaining instances until none are left. (A skeleton of this loop is sketched below.)
• It differs from the standard approach in the following way: to make a single rule, a pruned decision tree is built for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded.
• A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees.
• entropy(p1, p2, p3, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
• info([a,b,c]) = entropy(a/(a+b+c), b/(a+b+c), c/(a+b+c))
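A skeleton of the separate-and-conquer loop; build_rule and covers are placeholders for a real rule learner and its coverage test, so the toy usage below just stands in for them:

```python
def separate_and_conquer(instances, build_rule, covers):
    """Learn a rule, remove the instances it covers, and repeat on the
    remainder until none are left."""
    rules = []
    while instances:
        rule = build_rule(instances)      # e.g. best leaf of a partial tree
        rules.append(rule)
        instances = [x for x in instances if not covers(rule, x)]
    return rules

# Toy usage: each "rule" is just the most common value in what remains.
data = [1, 1, 2, 2, 2, 3]
print(separate_and_conquer(
    data,
    build_rule=lambda xs: max(set(xs), key=xs.count),
    covers=lambda r, x: x == r))          # -> [2, 1, 3]
```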
Fig. 6.5: Algorithm
• Using the information-gain heuristic (as in decision trees), split the set of instances into subsets.
• Expand the subsets in increasing order of entropy: a subset with low average entropy is more likely to result in a small subtree and hence to produce a more general rule (see the snippet below).
• Repeat this step recursively until a subset is expanded into a leaf.
• Then continue further by backtracking.
• Once an internal node is reached whose children have all been expanded into leaves, check whether that node may be replaced by a single leaf.
• If, during backtracking, a node is reached whose children have not all been expanded into leaves, the algorithm stops.
• Each leaf corresponds to a single rule, and the best leaf (the one covering the most instances) is chosen.
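A tiny illustration of the expansion order only (info is repeated from the earlier sketch so this runs on its own):

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Candidate subsets, as class-count pairs; the purest is expanded first.
subsets = [[1, 3], [2, 2], [4, 0]]
print(sorted(subsets, key=info))   # [[4, 0], [1, 3], [2, 2]]
```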
Classification Rules: Rules with Exceptions
• First, choose a default class for the top-level rule, typically the one with the highest frequency. (See Fig. 6.7; 50/150 indicates that 50 instances out of the total 150 satisfy this rule.)
• Split the training data into those that satisfy the rule and those that don't.
• For those that don't, repeat the algorithm recursively.
• In Fig. 6.7, the horizontal dashed lines show exceptions and the vertical solid lines show alternatives.
• In the Iris data, each of the three classes (setosa, versicolor, virginica) has equal coverage of 50/150. Arbitrarily choose Iris setosa as the default class (50/150 satisfy it).
• For the 100/150 that don't satisfy the default, there are two alternatives.
• This step is repeated until no more exceptions occur.
• Advantage: most instances are covered by the high-level rules, and the low-level rules truly represent exceptions.
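A hypothetical sketch of how such a nested structure evaluates (the thresholds are illustrative, not taken from Fig. 6.7):

```python
def classify(instance, rule):
    """Evaluate one rule-with-exceptions node: `rule` is a triple
    (condition, klass, exceptions); a None condition always fires, and
    a firing exception overrides the rule's own class."""
    cond, klass, exceptions = rule
    if cond is not None and not cond(instance):
        return None                        # the rule does not apply
    for ex in exceptions:
        result = classify(instance, ex)
        if result is not None:
            return result
    return klass

# Toy Iris-like structure: default setosa, one exception chain.
rules = (None, "setosa",
         [(lambda x: x["petal-length"] >= 2.45, "versicolor",
           [(lambda x: x["petal-width"] > 1.75, "virginica", [])])])
print(classify({"petal-length": 5.0, "petal-width": 2.0}, rules))  # virginica
print(classify({"petal-length": 1.4, "petal-width": 0.2}, rules))  # setosa
```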
Extending Linear Models
• The idea is to transform the original attributes into a transformed space.
• For example, if a model with 2 attributes, say a1 and a2, were to include all degree-2 products, we would have a1^2, a1a2, and a2^2 as three synthetic attributes. The three new attributes may then be linearly combined using three weights.
• For each class, generate a linear model, and for any given test instance, choose the class that yields the highest value (that fits the best).
• Problem: computational complexity and overfitting. If there are n attributes and we want nonlinear terms of degree d, there would be n^d + (n-1)^d + (n-2)^d + ... + 1^d synthetic attributes; when n and d are large, this is an explosive number. For example, with n = 10 and d = 2, we get n(n+1)(2n+1)/6 = 10*11*21/6 = 385 synthetic attributes; with d = 3, it is n^2(n+1)^2/4 = 100*121/4 = 3025 synthetic attributes! (The counts are verified below.)
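The closed forms are just the sums of squares and cubes; a two-line check:

```python
def synthetic_attribute_count(n, d):
    """n^d + (n-1)^d + ... + 1^d, the count quoted on the slide."""
    return sum(k ** d for k in range(1, n + 1))

print(synthetic_attribute_count(10, 2))   # 385
print(synthetic_attribute_count(10, 3))   # 3025
```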
Support Vector Regression
• Basic regression: find a function that approximates the training data well by minimizing the prediction error (e.g., MSE).
• What is special about SVR: all deviations up to a user-specified parameter ε are simply discarded.
• Also, what is minimized is the absolute error rather than the MSE.
• The value of ε controls how closely the function fits the training data: too small an ε leads to overfitting; too large an ε leads to meaningless predictions.
• See Fig. 6.9.
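A minimal sketch of the ε effect using scikit-learn's SVR (assuming scikit-learn is available; the data here is synthetic, not from the text):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

# Small epsilon: the tube is narrow, so the fit chases the noise.
# Large epsilon: almost every point falls inside the tube and is
# ignored, so the prediction flattens out.
tight = SVR(kernel="rbf", epsilon=0.01).fit(X, y)
loose = SVR(kernel="rbf", epsilon=1.0).fit(X, y)
print(tight.predict([[2.5]]), loose.predict([[2.5]]))
```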
Instance-based Learning
• Basic scheme: use the nearest-neighbor technique.
• Tends to be slow for large training sets.
• Performs badly with noisy data: the class of an instance is based on its single nearest neighbor rather than on an average.
• No weights are associated with the different attributes, though generally some have a larger effect than others.
• Does not perform explicit generalization.
• Reducing the number of exemplars:
• Already-seen instances that are used for classification are referred to as exemplars.
• Classify each new example with the exemplars already seen and save only the ones that don't fit; expand the exemplar set only when necessary (sketched below).
• Problem: noisy examples are likely to be misclassified and therefore saved as new exemplars.
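A sketch of the "save only the ones that don't fit" idea (in the IB2 style; function names and the toy data are ours):

```python
def nearest(exemplars, x):
    """Return the stored (vector, klass) pair closest to x (1-NN)."""
    return min(exemplars,
               key=lambda e: sum((a - b) ** 2 for a, b in zip(e[0], x)))

def filter_exemplars(training):
    """Keep an instance only when the exemplars seen so far would
    misclassify it; correctly classified instances are discarded."""
    exemplars = [training[0]]
    for x, klass in training[1:]:
        if nearest(exemplars, x)[1] != klass:
            exemplars.append((x, klass))
    return exemplars

data = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((1.0, 1.0), "b")]
print(filter_exemplars(data))   # keeps the first "a" and the "b"
```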
Pruning noisy exemplars:
• For a given k, choose the k nearest neighbors and assign the majority class to the unknown instance.
• Alternately, monitor the performance of the stored exemplars: keep the ones that do well (match well) and discard the rest.
• IB3 (Instance-Based learner version 3) uses a 5% confidence level for acceptance and 1.25% for rejection. The criterion for acceptance is more stringent than that for rejection, making it more difficult for an instance to be accepted.
• Weighting attributes: use w1, w2, ..., wn as weights in computing the Euclidean distance metric for the n attributes (see page 238).
• All attribute weights are updated after each training instance is classified, and the most similar exemplar is used as the basis for updating.
• Suppose x is the training instance and y the most similar exemplar; then for each attribute i, |xi - yi| is a measure of that attribute's contribution to the decision: the smaller the difference, the larger the contribution.
• See page 238 for details of changing the attribute weights (a rough sketch follows).
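A sketch of the weighted distance plus a deliberately simplified update rule; the ±delta scheme and the 0.5 similarity threshold are our illustration (assuming attributes normalized to [0, 1]), not the book's exact update on p. 238:

```python
def weighted_distance(x, y, w):
    """Euclidean distance with per-attribute weights w1..wn."""
    return sum(wi * (xi - yi) ** 2
               for wi, xi, yi in zip(w, x, y)) ** 0.5

def update_weights(w, x, y, correct, delta=0.05):
    """Simplified illustration: after training instance x is classified
    against its most similar exemplar y, reward an attribute whose
    similarity (|xi - yi| < 0.5) agrees with the outcome, and penalize
    it otherwise."""
    return [max(0.0, wi + (delta if (abs(xi - yi) < 0.5) == correct
                           else -delta))
            for wi, xi, yi in zip(w, x, y)]
```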
Generalizing exemplars:
• Generalized exemplars are rectangular regions of instance space, called hyperrectangles.
• Now, when classifying new instances, it is necessary to calculate the distance from the instance to each hyperrectangle (see the sketch below).
• When a new exemplar is classified correctly, it is generalized simply by merging it with the nearest exemplar of the same class.
• If the nearest exemplar is a single instance, a new hyperrectangle is created that covers both exemplars.
• Otherwise, the existing hyperrectangle is enlarged to cover the new one.
• If the prediction is incorrect, the hyperrectangle's boundaries are shrunk so that it is separated from the instance that was misclassified.
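A minimal sketch of the two geometric operations (function names are ours):

```python
def distance_to_rectangle(x, lower, upper):
    """Distance from point x to an axis-aligned hyperrectangle given by
    per-attribute lower/upper bounds; zero when x lies inside."""
    clipped = [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]
    return sum((xi - ci) ** 2 for xi, ci in zip(x, clipped)) ** 0.5

def merge(x, lower, upper):
    """Generalization: grow the rectangle just enough to cover point x."""
    return ([min(xi, lo) for xi, lo in zip(x, lower)],
            [max(xi, hi) for xi, hi in zip(x, upper)])

lower, upper = [0.0, 0.0], [1.0, 1.0]
print(distance_to_rectangle([2.0, 0.5], lower, upper))   # 1.0
print(merge([2.0, 0.5], lower, upper))   # ([0.0, 0.0], [2.0, 1.0])
```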