270 likes | 287 Views
Learn techniques to deal with numeric attributes, missing values, and noisy data in decision trees. Understand methods for data pruning and converting trees to classification rules.
E N D
Why are simple methods not good enough? • Robustness: Numeric attributes, missing values, and noisy data
Decision Trees • Divide and conquer method • Earlier discussion of this method worked for nominal values. How to deal with numeric attributes? • How to calculate information? • How to prune the trees---so they are not overfitted? • Converting decision trees to classification rules
Decision Trees and Numeric Attributes • Generally, when a numeric attribute is used to split data, it is a binary split. So the same numeric attribute may be tested several times. • Selection of the attribute to select at first once again is based on the information gain: sort the attribute values and determine a breakpoint where the information gain is maximized. • For example, when the values of a numeric attribute are sorted as follows, with the corresponding classification, the information gain at the specified breakpoint (between 62 and 65) is computed as: <45 (C1), 53 (C2), 55 (C2), 62 (C2), 65 (C1), 71 (C1), 75 (C1), 80 (C2)> info([1,3],[3,1]) = 4/8*info([1,3])+4/8*info([3,1]) = info([1,3]) [because info([1,3])=info([3,1])] = -1/4*log(1/4) - 3/4*log(3/4) = 0.811
Decision Trees and Missing Values • Treat it as a separate category if it has some significance • Alternately, choose the most popular branch at a split point when that attribute is missing for a given instance • More sophisticated approach: Notionally split the instance into weights based on the portion of instances that go along that path; whenever it is further split at an intermediate node, the weights are further split; finally, all branches lead to the leaf nodes; the decision is also weighted based on the weights and summed up.
Decision Trees: Pruning • Postpruning (or backward pruning) and prepruning (forward pruning) • Most decision tree builders employ postpruning. • Postpruning involves subtree replacement and subtree raising • Subtree replacement: Select some subtrees and replace them with a single leaf node---see Figure 1.3a • Subtree raising: More complex and not always worthwhile. C4.5 scheme uses it. See Fig. 6.1---generally restricted to the most popular branch.
Postpruning • Subtree replacement and subtree raising • Subtree replacement: • Proceeds from leaves up to the root • Reduces accuracy on the training set but may increase the accuracy on an independently chosen set • Subtree raising: An entire subtree (a popular one) is raised toward the root. • The subtree that is raised is generally the most popular one (see Fig. 6.1) • The leaves of the raised subtree need to be reclassified
Decision Trees: Estimating Error • In making the decision of subtree pruning or subtree raising, we need to know the resulting estimated error. • Keep in mind that training set is just a small subset of an entire universe of data---so the tree should not be fitting just the training data---the error estimation also should take this into account • Method 1: Reduced error pruning: Hold back some of the training data and use it to estimate the error due to pruning---not very good as it reduces the training data • Method 2: Error estimate based on the entire training data
From trees to rules • Given a decision tree, the path from root to each leaf may be turned into a rule. • Using the pruning method, we can generalize the rules also. • Given a particular rule, consider each condition as a candidate for deletion • Determine which of the training examples are now covered by the new rule and compare with the original pessimistic estimate • Continue this process until no more improvement can be made
Classification Rules • Simple separate-and-conquer technique---separate instances that are covered by a rule; form other rules to classify other instances • Problem: Tend to overfit the training data and do not generalize well to independent sets, particularly on noisy data • Criteria for choosing tests (in a rule): • Maximize the correctness: p/t where t is total instances covered by the rule (i.e., satisfies all the LHS of the rule ) out of which p are positive (i.e., RHS is also true). • Based on information gain: p[log(p/t)-log(P/T)] where P is total +ve instances and T total instances before the test was added. • The trade-off is coverage versus accuracy
Classification Rules: Generating Good Rules • Objective: Instead of deriving rules that overfit to the training data, it is best to generate sensible rules that stand a better chance of performing well on new test instances. • Coverage versus accuracy: Should we choose a rule that is true over 15/20 instances or the one that is 2/2 (that is 100% correct)? • Split the training data set into: growing set and pruning set • Use the growing set to form rules using the basic covering algorithm • Then, remove part of a rule and see its effect on the pruning set; if satisfied, remove that part of the test • Postpruning is preferred to prepruning • Algorithm for forming rules by incremental reduced-error pruning • Worth of a rule based on the pruning set: If it gets p instances right out of the t instances it covers, and P is the total right instances out of T. If N= T-p and n= t-p, then (N-n) are the total negative ones it does not cover and p it covers p positive ones. So [p+(N-n)]/T is taken as a metric. • Example: Rule’s total coverage, t=3000; rule gets p=2000 right out of these; so it gets 1000 wrong. So metric=(1000+N)/T; Another rule that has t=1001 and p=1000 with n=1 has the metric value=(999+N)/T. However, 1st rule is less accurate than the 2nd. So the metric is criticized. • See Figure 6.3
Classification Rules: Global Optimization • First generate rules based on incremental reduced-error pruning techniques • Then a global optimization is performed to increase the accuracy of the rules---by revising or replacing individual rules • Postinduction optimization is shown to improve both the size and performance of the rule set • But this process in often complex • RIPPER is a build and optimize algorithm
Classification Rules: Using Partial Decision Trees • Alternative approach to rule induction that avoids global optimization • Combines divide-and-conquer of decision tree learning (p. 62) and separate-and-conquer for rule learning (p. 112) • Separate-and-conquer: It builds a rule, removes the instances it covers, and continues creating rules recursively for the remaining instances until none are left. • To make a single rule, a pruned decision tree is built for the currents set of instances, the leaf with the largest coverage is made into a rule, and the tree discarded • A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees. • Entropy(p1,p2,p3,…,pn)=-p1logp1-p2logp2-…-pnlogpn • Info([a,b,c])=entropy(a/(a+b+c), b/(a+b+c), c/(a+b+c)]) • See Fig. 6.6 for an illustration---at each stage of a-c, the node with the lowest-entropy sibling is chosen for expansion. At (d), pruning takes place replacing node 5 with two leaf nodes. Further, in (e), node 3 is also pruned. Then, node 4 is expanded. • Figure 6.5 is the algorithm
Once a partial tree has been built, a single rule is extracted from it. • Each leaf corresponds to a possible rule, and we seek the best leaf of those subtrees that have been expanded into leaves---choose the leaf that covers the greatest number of instances • If there is a missing attribute value, it is assigned to each of the branches with a weight proportional to the number of training instances going down that branch
Rules with exceptions • First, a class is selected for the top-level rule: select the class with the highest frequency. • Then, a rule is formed pertaining to any class other than the default class. • See Fig. 6.7 The denominator is the instances covered by the rule; and the numerator is the instances for which the class is valid. • Most of the exampels are covered by the high-level rules and the lower-level ones really do represent exceptions.
Extending Linear Models • Basic techniques: • Linear regression (for numeric prediction) • Logistic regression (for linear classification) --- linear model based on transformed target variables---replace Pr[1|a1,a2,…,an] with log(Pr[1|a1,a2,…,an]/(1-Pr[1|a1,a2,…,an]) • Perceptron (for linear classification)---data can be separated perfectly into two groups using a hyperplane---linearly separable--- Equation: w0*a0+w1*a1+…+wn*an=0; a0=1 and all others are attributes of the instance are a1,a2,…,an. If the sum is greater than 1, they predict class 1, otherwise class 2. See Fig. 4.10a to compute w1..wn • Winnow (for linear classification) –for data sets with binary attributes---see fig. 4.11 • Basic problem: The boundaries between classes are not necessarily linear • Support vector machines use linear models to implement nonlinear class boundaries • Transform the input using a nonlinear mapping; map given instance to a new instance • Use linear models in the new space • Transform the boundaries back to original space---they are now nonlinear • Ex: x = w1a13 + w2a12a2 + w3a1a22 + w4a23 • Here, a1and a2 are the attributes; w1, w2, w3, and w4 are to be learned; x is the outcome
SVM • Train one linear system for each class • Assign an unknown instance to the class that gives the greatest output x---like multiresponse linear regression • Problems: Higher computational complexity; danger of overfitting • SVM solves both problems: Use maximum margin hyperplane---hyperplane that gives the greatest separation between the classes---it comes no closer to either than it has to. • The maximum hyperplane is the perpendicular bisector of the shortest line connecting the hulls of the two classes (say yes and no) • The instances that are closest to the maximum margin hyperplane are called support vectors; there is always at least one (if not more) support vector for each class • Given the support vectors for the two classes (say yes and no), the maximum margin hyperplane can be easily constructed
Support Vector Regression • Basic regression---find a function that approximates the training data well by minimizing the prediction error (e.g., MSE) • What is special about SVR: all deviations up to a user specified parameter ε are simply discarded. • Also, what is minimized is absolute error (|a-x|) rather than MSE • The value of ε controls how closely the function fits the training data: too small an ε leads to overfitting; too large an ε leads to meaningless predictions. • See Fig. 6.9
Instance-based Learning • Basic scheme: Use nearest neighbor technique • Tends to be slow for large training sets • Performs badly with noisy data---the class of a single instance is based on its single nearest neighbor rather than on averaging • No weights associated to different attributes---generally some may have larger effect than others • Does not perform explicit generalization • Reducing the number of exemplars • Already seen instances that are used for classification are referred to as exemplars (or examples) • Classify each example with the examples already seen and save only the ones that don’t fit the current ones---expand examples only when necessary • Problem: Noisy examples are likely to be classified as new examples
Pruning noisy exemplars: • For a given k, choose the k nearest neighbors and assign the majority class to the unknown instance • Alternately, monitor the performance of the stored exemplars---keep the ones that do well (match well) and discard the rest • IB3---Instance-based learner version 3 --- uses 5% confidence level for acceptance and 12.5% for rejection. Criterion for acceptance is more stringent than for rejection making it more difficult for an instance to be accepted • Weighting attributes: Use w1, w2, …, wn as weights in computing the Euclidean distance metric for the n attributes. (see page 238) • All attribute weights are updated after each training instance is classified and the most similar exemplar is used as the basis for updating. • Suppose x is the training instance and y the most similar exemplar---then for each attribute I, |xi-yi| is a measure of the contribution of that attribute to the decision. If the difference is small, contribution is more. • See page 238 for details of changing the attribute weights
Generalizing exemplars: • These are rectangular regions of exemplars---called hyperrectangles • Now, when classifying new instances, it is necessary to calculate the distance based on its distance to the hyperrectangle • When a new exemplar is classified correctly, it is generalized simply by merging it with the nearest exemplar of the same class • If the nearest exemplar is a single instance, a new hyperrectangle is created that covers both exemplars. • Otherwise, the hyperrectangle is modified to cover the new one • If the prediction is incorrect, the hyperrectangle’s boundaries are shrunk so it is separated from the instance that was misclassified
Distance Functions for Generalized Exemplars • Generalized exemplars are no longer points; instead they are hyperrectangles. So the distance of an instance from an exemplar is computed as follows. • If the point lies within the hyperrectangle, the distance is zero. • Measure the distance from the outside point instance to the nearest point on the hyperrectangle boundary or measure the distance between the outside point and the nearest instance that is within the hyperrectangle. • In case hyperrectangles overlap, choose a hyperrectangle that is most specific or the one that covers the smallest instance space.
Numeric Prediction • Model tree: Used for numeric prediction. Predicts the class value of instances that reach a leaf • Regression tree: Here, the leaf node represents average value of all instances that reach that node---a special case of model trees • Model trees: • Each leaf will contain a linear model based on some of the attribute values, and is used to yield a raw predicted value for a test instance • The raw value can be smoothed out by producing linear models for each internal node as well as for the leaves, at the time the tree is built. Once a raw value has been obtained at the leaf node, it is filtered along the path back to the root • Smoothing occurs at each internal node by combining it with the value predicted by the linear model for that node. Function: p’ = (np+kq)/(n+k) where p’ is the new prediction, p the prediction from the node below, n is the number of training instances that reach the node below and k is a smoothing constant. • Alternately, the leaf node’s model can be modified to reflect the smoothing that takes place at the internal nodes.
Building the tree: Here, similar to information gain used with nominal attributes, expected error reduction is chosen as a metric to choose an attribute on which to split a tree. • SDR = Std. Dev. Reduction = sd(T) - ∑ |Ti|/|T|*sd(Ti) where T is the tree prior to splitting and Tis are the subtrees formed after splitting choosing a particular attribute to split. In other words, the idea is to choose an attribute that reduces the variance after a split. • Splitting process terminates when the class value std. deviation is small fraction of the std. dev of the original instance.
Pruning the tree: • First, a linear model is calculated for each node of the unpruned tree. • Only the attributes tested subtree below this node are used in the regression---assume that all attributes are numeric • Once a linear model is in place for each interior node, the tree is pruned back from the leaves as long as the expected estimated error decreases • In case there are nominal attributes, they are converted to binary variables that are treated as numeric. If a nominal value has k possible values, it is replaced by k-1 synthetic binary attributes
Clustering • Basic scheme: k-means clustering • Choosing k? • MDL: Minimum description length principle • Occam’s razor: Other things being equal, simple things are better than complex ones. • General theory + exceptions = what we learn • MDL principle: Best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory • The one that minimizes the #of bits required to communicate the generalization, along with the examples from which it is made (i.e., training set)
Bayesian Networks • Naïve Bayes classifier: For each class value, estimate the probability that a given instance belongs to that class. • More advanced: Bayesian networks---a network of nodes, one for each attribute, connected by directed edges; a directed acyclic graph • Fig 6.20 and Fig. 6.21