Non-Metric Methods Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
Non-Metric Descriptions • Nominal data • Discrete • Without a natural notion of similarity or even ordering • Property d-tuple • With lists of attributes • e.g., { red, shiny, sweet, small } • i.e., color = red, texture = shiny, taste = sweet, size = small
Non-Metric Descriptions • Strings of nominal attributes • e.g., base sequences in DNA segments, “AGCTTCAGATTCCA” • Might themselves be the output of other component classifiers • e.g., a Chinese character recognizer and a neural network for classifying component brush strokes
Non-Metric Methods • Learn categories from non-metric data • Represent structures in strings • Toward discrete problems addressed by • Rule based pattern recognition methods • Syntactic pattern recognition methods
Benefits of Decision Trees • Interpretability • Rapid classification • Through a sequence of simple queries • Natural way to incorporate prior knowledge from human experts
Interpretability • Conjunctions and disjunctions • For any particular test pattern • e.g., properties: {taste, color, shape, size} • x = { sweet, yellow, thin, medium } • (color = yellow) AND (shape = thin) • For category description • e.g., Apple = (green AND medium) OR (red AND medium) • Rule reduction • e.g., Apple = (medium AND NOT yellow)
Tree Construction • Given • Set D of labeled training data • Set of properties for discriminating patterns • Goal • Organize the tests into a tree
Tree Construction • Split samples progressively into smaller subsets • Pure subset • All samples have the same category label • Could terminate that portion of the tree • Subset with mixture of labels • Decide either to stop or select another property and grow the tree further
CART • Classification and regression trees • A general framework for decision trees • General questions in CART • Number of decision outcomes at a node • Property tested at a node • Declaration of a leaf • When and how to prune • Category assignment at an impure leaf node • Handling of missing data
Branching Factor and Binary Decisions • Branching factor (branching ratio) B • Number of links descending from a node • Binary decisions • Every decision can be represented using just binary decisions • e.g., a query on color (B = 3) becomes • color = green? color = yellow? • Universal expressive power
Fundamental Principle • Prefer decisions leading to a simple, compact tree with few nodes • A version of Occam’s razor • Seek a property query T at each node N • Make the data reaching the immediate descendent nodes as pure as possible • i.e., achieve lowest impurity • Impurity i(N) • Zero if all patterns bear the same label • Large if the categories are equally represented
Entropy Impurity (Information Impurity) • Most popular measure of impurity • i(N) = –Σj P(ωj) log2 P(ωj)
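A minimal Python sketch of this measure, computing the impurity from the class counts at a node (the helper name entropy_impurity is illustrative):

```python
import numpy as np

def entropy_impurity(counts):
    """Entropy impurity i(N) = -sum_j P(w_j) log2 P(w_j), from the class counts at a node."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                        # 0 log 0 is taken as 0
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_impurity([50, 0]))        # 0.0 -- a pure node has zero impurity
print(entropy_impurity([25, 25]))       # 1.0 -- equal proportions give the maximum (1 bit for 2 classes)
```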
Variance Impurity for Two-Category Case • Particularly useful in the two-category case • i(N) = P(ω1)P(ω2)
Gini Impurity • Generalization of variance impurity • Applicable to two or more categories • Expected error rate at node N • i(N) = Σi≠j P(ωi)P(ωj) = (1/2)[1 – Σj P(ωj)²]
Misclassification Impurity • Minimum probability that a training pattern would be misclassified at N • i(N) = 1 – maxj P(ωj) • Most strongly peaked at equal probabilities
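The remaining impurity measures translate the same way; a minimal sketch with illustrative helper names (the 1/2 in the Gini form follows the formula above; dropping it does not change which split is preferred):

```python
import numpy as np

def variance_impurity(counts):
    """Two-category variance impurity i(N) = P(w1) P(w2)."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(p[0] * p[1])

def gini_impurity(counts):
    """Gini impurity i(N) = (1/2)[1 - sum_j P(w_j)^2]."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(0.5 * (1.0 - np.sum(p ** 2)))

def misclassification_impurity(counts):
    """Misclassification impurity i(N) = 1 - max_j P(w_j)."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(1.0 - np.max(p))
```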
Impurity for Two-Category Case • (Figure: the impurity measures above plotted as a function of P(ω1); curves adjusted in scale and offset for comparison)
Heuristic to Choose Query • Choose the query that maximizes the impurity drop Δi(N) = i(N) – PL i(NL) – (1 – PL) i(NR), where PL is the fraction of patterns sent to the left descendent NL • If entropy impurity is used, the impurity reduction corresponds to an information gain • The reduction in entropy impurity due to a binary split cannot be greater than 1 bit
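A minimal sketch of the impurity drop for a binary split, reusing the entropy_impurity helper sketched earlier (the name delta_impurity is illustrative):

```python
def delta_impurity(parent_counts, left_counts, right_counts, impurity=entropy_impurity):
    """Impurity drop Delta i(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R) for a binary split."""
    n_left, n_right = float(sum(left_counts)), float(sum(right_counts))
    p_left = n_left / (n_left + n_right)
    return (impurity(parent_counts)
            - p_left * impurity(left_counts)
            - (1.0 - p_left) * impurity(right_counts))

# With entropy impurity this drop is the information gain of the query;
# for a binary split it can never exceed 1 bit.
print(delta_impurity([40, 40], [40, 0], [0, 40]))   # 1.0 -- the best possible binary split
```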
Finding Extrema • Nominal attributes • Perform extensive or exhaustive search over all possible subsets of the training set • Real-valued attributes • Use gradient descent algorithms to find a splitting hyperplane • As a one-dimensional optimization problem for binary trees
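For a single real-valued attribute and a binary split, the one-dimensional search can simply try every midpoint between consecutive sorted values; a minimal sketch, assuming the entropy_impurity and delta_impurity helpers sketched earlier:

```python
import numpy as np

def best_threshold(x, y, impurity=entropy_impurity):
    """Exhaustive 1-D search: try every midpoint between consecutive sorted values
    of the attribute x and keep the threshold with the largest impurity drop."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    classes = np.unique(y)
    parent = [np.sum(y == c) for c in classes]
    best_t, best_drop = None, -np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                                   # no split between identical values
        t = 0.5 * (x[i] + x[i - 1])                    # candidate threshold
        left = [np.sum(y[:i] == c) for c in classes]
        right = [np.sum(y[i:] == c) for c in classes]
        drop = delta_impurity(parent, left, right, impurity)
        if drop > best_drop:
            best_t, best_drop = t, drop
    return best_t, best_drop
```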
Tie Breaking • Nominal data • Choose randomly • Real-valued data • Assume the split threshold xs can lie anywhere with xl < xs < xu • Choose either the midpoint or the weighted average xs = (1 – P)xl + Pxu • P is the probability a pattern goes to the “left” under the decision • Computational simplicity may be the determining factor
Greedy Method • Get a local optimum at each node • No assurance that successive locally optimal decisions lead to the global optimum • No guarantee that we will have the smallest tree • For reasonable impurity measures and learning methods • We often continue to split further to get the lowest possible impurity at the leaves
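The weighted-average tie-break can be written directly; the numbers below are made up for illustration:

```python
x_l, x_u = 2.0, 3.0        # neighboring attribute values that bracket the split
P = 0.7                    # probability a pattern goes to the "left" under the decision
x_s = (1 - P) * x_l + P * x_u
print(x_s)                 # 2.7; with P = 0.5 this reduces to the midpoint 2.5
```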
Favoring Gini Impurity over Misclassification Impurity • Example: 90 patterns in ω1 and 10 in ω2 • Misclassification impurity: 0.1 • Suppose no split guarantees a ω2 majority in either of the two descendent nodes • Misclassification impurity remains at 0.1 for all splits • An attractive split: 70 ω1, 0 ω2 to the right and 20 ω1, 10 ω2 to the left • Gini impurity shows that this is a good split
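Plugging these numbers into the impurity-drop sketch from earlier (with the gini_impurity and misclassification_impurity helpers) makes the comparison concrete:

```python
parent = [90, 10]                      # 90 patterns in w1, 10 in w2
left, right = [20, 10], [70, 0]        # the "attractive" split from the slide

print(delta_impurity(parent, left, right, misclassification_impurity))
# ~0   -- misclassification impurity stays at 0.1, so this split looks useless
print(delta_impurity(parent, left, right, gini_impurity))
# ~0.023 -- Gini impurity drops from 0.09 to a weighted 0.067, so the split is rewarded
```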
Twoing Criterion • For multiclass binary tree creation • Find “supercategories” C1 and C2 • C1 = {ωi1, ωi2, …, ωik}, C2 = C – C1 • Compute Δi(s, C1) as though it corresponded to a standard two-class problem • Find the split s*(C1) that maximizes the change, and then the supercategory C1* that maximizes Δi(s*(C1), C1)
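A brute-force sketch of this search, assuming the best_threshold helper from earlier: each candidate supercategory C1 relabels the data as a two-class problem and the best (split, supercategory) pair is kept. This only illustrates the subset search, not a closed-form twoing formula.

```python
import numpy as np
from itertools import combinations

def best_supercategory(x, y, classes):
    """For each candidate supercategory C1, relabel the data as C1 vs. C2 = C - C1
    and reuse the two-class split search."""
    best_c1, best_t, best_drop = None, None, -np.inf
    for k in range(1, len(classes)):                  # proper, nonempty subsets
        for c1 in combinations(classes, k):
            y2 = np.isin(y, c1).astype(int)           # 1 if the label is in C1, else 0
            t, drop = best_threshold(x, y2)           # two-class 1-D search from earlier
            if t is not None and drop > best_drop:
                best_c1, best_t, best_drop = c1, t, drop
    return best_c1, best_t, best_drop
```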
Practical Considerations • Choice of impurity function rarely affects the final classifier and its accuracy • Stopping criterion and pruning methods are more important in determining final accuracy
Importance of Stopping Criteria • Fully grown trees are typically overfit • Extreme case: each leaf corresponds to a single training point • The full tree is then merely a look-up table • Will not generalize well in noisy problems • Early stopping • Error on training data may not be sufficiently low • Performance may suffer
Stopping by Checking Validation Error • Using a subset of the data (e.g., 90%) for training and the remaining (10%) as a validation set • Continue splitting until the error on the validation data is minimized
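One way this can be realized in practice (a sketch, not the slides' prescription) is to grow trees of increasing size and keep the one whose validation error is smallest, here using scikit-learn and a 90/10 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def grow_with_validation(X, y, leaf_range=range(2, 50)):
    """Hold out 10% of the data as a validation set and keep the tree size
    whose validation error is smallest."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
    best_tree, best_err = None, np.inf
    for m in leaf_range:
        tree = DecisionTreeClassifier(max_leaf_nodes=m).fit(X_tr, y_tr)
        err = 1.0 - tree.score(X_val, y_val)
        if err < best_err:
            best_tree, best_err = tree, err
    return best_tree, best_err
```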
Stopping by Setting a Threshold • Stop splitting at a node if maxs Δi(s) < β, a preset small threshold • Benefits • Uses all the training data • Leaves can lie at different levels • Fundamental drawback • Difficult to determine the threshold • An alternative simple method • Stop when a node represents fewer than a threshold number of points, or a fixed percentage of the total training set
Stopping by Checking a Global Criterion • Stop splitting when a global criterion reaches its minimum, e.g., α·size + Σleaf nodes i(N) • Related to minimum description length • The criterion measures both tree complexity and the uncertainty of the training data at the leaves
Horizon Effect • The determination of the optimal split at a node is not influenced by decisions at its descendent nodes • A stopping condition may therefore be met too early for overall optimal recognition accuracy • Stopped splitting biases the learning toward trees in which the greatest impurity reduction is near the root node
Pruning • Grow the tree fully first • All pairs of neighboring leaf nodes are considered for elimination • If the elimination yields only a satisfactory (small) increase in impurity, the common antecedent node is declared a leaf • This is called merging or joining
Rule Pruning • Each leaf has an associated rule • Some of the rules can be simplified if a series of decisions is redundant • Can improve generalization and interpretability • Allows us to distinguish between the contexts in which a node is used
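A minimal sketch of the merging step, assuming a simple node representation with stored class counts (the Node class and field names are illustrative, and any of the impurity helpers above can be passed in):

```python
class Node:
    """A minimal binary decision-tree node: class counts plus optional children."""
    def __init__(self, counts, left=None, right=None):
        self.counts = counts
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def prune(node, impurity, max_increase):
    """Bottom-up merging: if both children of a node are leaves and removing them
    raises the impurity at the node by no more than max_increase, the common
    antecedent is declared a leaf."""
    if node is None or node.is_leaf():
        return node
    prune(node.left, impurity, max_increase)
    prune(node.right, impurity, max_increase)
    if node.left.is_leaf() and node.right.is_leaf():
        n_l, n_r = sum(node.left.counts), sum(node.right.counts)
        child_imp = (n_l * impurity(node.left.counts)
                     + n_r * impurity(node.right.counts)) / (n_l + n_r)
        if impurity(node.counts) - child_imp <= max_increase:
            node.left = node.right = None            # merge the pair of leaves
    return node
```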
Computational Complexity • Training • Root node • Sorting: O(dn log n) • Entropy computation: O(n) + (n–1)O(d) • Total: O(dn log n) • Level-1 nodes • Average case: O(dn log(n/2)) • Total number of levels: O(log n) • Total average complexity: O(dn (log n)²) • Recall and classification • O(log n)
Priors and Costs • Priors • Weight samples to correct for the prior frequencies • Costs • Cost matrix λij • Incorporate costs into the impurity, e.g., the cost-weighted Gini impurity i(N) = Σij λij P(ωi)P(ωj)
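Assuming the cost-weighted Gini form above, a small sketch (the function name and the example cost matrix are illustrative):

```python
import numpy as np

def cost_weighted_gini(counts, cost):
    """i(N) = sum_ij lambda_ij P(w_i) P(w_j); lambda is the cost matrix, with the
    cost of a correct decision (the diagonal) normally zero."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    cost = np.asarray(cost, dtype=float)
    return float(p @ cost @ p)

# Illustrative example: calling a w2 pattern w1 is ten times as costly as the reverse.
print(cost_weighted_gini([90, 10], [[0.0, 1.0],
                                    [10.0, 0.0]]))
```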
Training and Classification with Deficient Patterns • Training • Proceed as usual • Calculate impurities at a node using only the attribute information present • Classification • Use traditional (“primary”) decision whenever possible • Use surrogate splits when test pattern is missing some features • Or use virtual values
Algorithm ID3 • Interactive dichotomizer • For use with nominal (unordered) inputs only • Real-valued variables are handled by bins • Gain ratio impurity is used • Continues until all nodes are pure or there are no more variables • Pruning can be incorporated
Algorithm C4.5 • Successor and refinement of ID3 • Real-valued variables are treated as in CART • Gain ratio impurity is used • Uses pruning based on the statistical significance of splits
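A minimal sketch of the gain ratio, reusing the entropy_impurity helper from earlier: the information gain of a split is divided by the entropy of the split fractions themselves, which penalizes splits into many small branches.

```python
import numpy as np

def gain_ratio(parent_counts, children_counts):
    """Information gain of a split divided by the split information
    -sum_k (n_k / n) log2 (n_k / n)."""
    n = float(sum(parent_counts))
    gain = entropy_impurity(parent_counts) - sum(
        (sum(c) / n) * entropy_impurity(c) for c in children_counts)
    fracs = np.array([sum(c) / n for c in children_counts])
    split_info = float(np.sum(fracs * np.log2(1.0 / fracs)))
    return gain / split_info if split_info > 0 else 0.0
```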