Data Mining Overview • Data Mining • Data warehouses and OLAP (On-Line Analytical Processing) • Association Rules Mining • Clustering: Hierarchical and Partitional approaches • Classification: Decision Trees and Bayesian classifiers
Setting • Given old data about customers and their payments, predict a new applicant's loan eligibility. (Figure: records of previous customers, with attributes Age, Salary, Profession, Location, and Customer type, are fed to a classifier; the classifier produces decision rules such as "Salary > 5 L and Prof. = Exec ⇒ good/bad", which are then applied to a new applicant's data.)
Decision trees • A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. (Figure: example tree with internal tests Salary < 1 M, Prof = teaching, and Age < 30, and leaves labelled Good or Bad.)
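As a rough illustration, such a tree can be read as nested if/else tests over a record's attributes. The sketch below uses the tests from the figure; the assignment of Good/Bad to the leaves and the record format are assumptions made for illustration only.

```python
# Minimal sketch: the example tree read as nested decision rules.
# The leaf labels and record keys are illustrative assumptions.
def classify(record):
    if record["salary"] < 1_000_000:            # Salary < 1 M
        if record["profession"] == "teaching":  # Prof = teaching
            return "Bad"
        return "Good"
    if record["age"] < 30:                      # Age < 30
        return "Bad"
    return "Good"

print(classify({"salary": 2_000_000, "profession": "exec", "age": 45}))  # Good
```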
SLIQ (Supervised Learning In Quest) • Decision-tree classifier for data mining • Design goals: • Able to handle large disk-resident training sets • No restrictions on training-set size
Building tree

GrowTree(TrainingData D)
    Partition(D);

Partition(Data D)
    if (all points in D belong to the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use the best split found to partition D into D1 and D2;
    Partition(D1);
    Partition(D2);
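A minimal, runnable Python sketch of this skeleton is shown below. Records are assumed to be dicts with a "class" key, and best_split is only a stub; SLIQ's actual Gini-based evaluation over pre-sorted attribute lists is described on the following slides.

```python
# A minimal sketch of the GrowTree/Partition skeleton above.
def partition(data, depth=0):
    classes = {r["class"] for r in data}
    if len(classes) <= 1:                        # all points in one class
        print("  " * depth, "leaf:", classes)
        return
    best = best_split(data)                      # (attribute, threshold)
    if best is None:
        print("  " * depth, "leaf:", classes)
        return
    attr, value = best
    d1 = [r for r in data if r[attr] < value]    # partition D into D1 and D2
    d2 = [r for r in data if r[attr] >= value]
    print("  " * depth, f"split on {attr} < {value}")
    partition(d1, depth + 1)
    partition(d2, depth + 1)

def best_split(data):
    # Placeholder: pick the midpoint of the first numeric attribute's values.
    # SLIQ would instead scan pre-sorted attribute lists and minimize Gini.
    for attr in data[0]:
        if attr != "class" and isinstance(data[0][attr], (int, float)):
            values = sorted({r[attr] for r in data})
            if len(values) > 1:
                return attr, values[len(values) // 2]
    return None

partition([{"age": 25, "class": "Bad"}, {"age": 40, "class": "Good"}])
```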
Data Setup • One list for each attribute • Entries in an Attribute List consist of: • attribute value • class list index • A list for the classes with pointers to the tree nodes • Lists for continuous attributes are in sorted order • Attribute lists may be disk-resident • Class List must be in main memory
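As a concrete illustration of this layout, the sketch below builds attribute lists and a class list for a tiny in-memory table. The attribute names and the node id N1 are example values; in SLIQ the attribute lists may actually be disk-resident.

```python
# A minimal sketch of the SLIQ data layout described above.
records = [
    {"age": 30, "salary": 65, "class": "Good"},
    {"age": 23, "salary": 15, "class": "Bad"},
    {"age": 40, "salary": 75, "class": "Good"},
]

# Class list: one entry per record, holding its class label and a pointer
# (here just an id) to the tree leaf the record currently belongs to.
class_list = [{"class": r["class"], "leaf": "N1"} for r in records]

# One attribute list per attribute: (attribute value, class-list index),
# kept in sorted order for continuous attributes.
attribute_lists = {
    attr: sorted(((r[attr], idx) for idx, r in enumerate(records)),
                 key=lambda pair: pair[0])
    for attr in ("age", "salary")
}

print(attribute_lists["age"])   # [(23, 1), (30, 0), (40, 2)]
```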
(Figure: example class list and per-attribute lists, with all class-list entries initially pointing to the root node N1.)
Evaluating Split Points
• Gini Index
• If data D contains examples from c classes:
  Gini(D) = 1 − Σj pj², where pj is the relative frequency of class j in D
• If D is split into D1 and D2 with n1 and n2 tuples respectively (n = n1 + n2):
  Ginisplit(D) = (n1/n) · Gini(D1) + (n2/n) · Gini(D2)
• Note: only the class frequencies are needed to compute the index
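These two quantities translate directly into code. The small helper below is a sketch, assuming the class labels of a node's records are available as plain lists.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini of a binary split, Ginisplit(D)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# A pure node has Gini 0; a 50/50 node has Gini 0.5.
print(gini(["Good", "Good"]), gini(["Good", "Bad"]))   # 0.0 0.5
print(gini_split(["Good", "Good"], ["Bad"]))            # 0.0
```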
Finding Split Points
• For each attribute A: evaluate splits on attribute A using its attribute list
• Key idea: evaluating splits on a numerical attribute normally requires sorting the data at each node; but if every attribute is pre-sorted once, no sorting is needed during the tree-construction phase
• Keep the split with the lowest Gini index
Finding Split Points: Continuous Attributes
• Consider splits of the form: value(A) < x
• Example: Age < 17
• Evaluate this split form for every value in the attribute list
• To evaluate splits on attribute A for a given tree node:

  initialize the class histograms of the left and right children;
  for each record in the attribute list do
      find the corresponding entry in the Class List (giving its class and leaf node);
      evaluate the splitting index for value(A) < record.value;
      update the class histogram in the leaf;
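A sketch of this scan for a single node and attribute is shown below. It assumes the attribute list and class list hold only the records of the node being split, with (value, class-list index) pairs in sorted order; ties between equal values are ignored for simplicity.

```python
from collections import Counter

def gini_from_counts(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attribute_list, class_list):
    right = Counter(e["class"] for e in class_list)   # all records start on the right
    left = Counter()
    best = (float("inf"), None)                       # (split Gini, threshold)
    for value, idx in attribute_list:                 # sorted scan, one pass
        label = class_list[idx]["class"]
        # Candidate split value(A) < value: records seen so far go left.
        n_left, n_right = sum(left.values()), sum(right.values())
        if n_left and n_right:
            g = (n_left * gini_from_counts(left) +
                 n_right * gini_from_counts(right)) / (n_left + n_right)
            if g < best[0]:
                best = (g, value)
        left[label] += 1                              # update the class histograms
        right[label] -= 1
    return best
```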
(Figure: worked example at node N1 showing the class histograms and Gini indices obtained while scanning the Age list, for candidate splits such as Age < 20, Age < 32, and Age < 43.)
Finding Split Points: Categorical Attributes
• Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}
• Example: CarType ∈ {family, sports}
• Evaluate this split form for subsets of domain(A)
• To evaluate splits on attribute A for a given tree node:

  initialize the class/value matrix of the node to zeroes;
  for each record in the attribute list do
      increment the appropriate count in the matrix;
  evaluate the splitting index for the various subsets using the constructed matrix;
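The sketch below illustrates this: one pass fills the class/value matrix, then candidate subsets of the domain are scored. The brute-force subset enumeration is only feasible for small domains and is an assumption of this sketch, not part of the original pseudocode.

```python
from collections import Counter
from itertools import combinations

def gini_from_counts(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_categorical_split(attribute_list, class_list):
    # class/value matrix: counts[value][class]
    counts = {}
    for value, idx in attribute_list:
        label = class_list[idx]["class"]
        counts.setdefault(value, Counter())[label] += 1

    values = list(counts)
    best = (float("inf"), None)                 # (split Gini, left-child subset)
    for size in range(1, len(values)):          # proper, non-empty subsets
        for subset in combinations(values, size):
            left = sum((counts[v] for v in subset), Counter())
            right = sum((counts[v] for v in values if v not in subset), Counter())
            n_left, n_right = sum(left.values()), sum(right.values())
            g = (n_left * gini_from_counts(left) +
                 n_right * gini_from_counts(right)) / (n_left + n_right)
            if g < best[0]:
                best = (g, set(subset))
    return best
```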
Example (class/value matrix for CarType): after the matrix is filled, the Gini indices for candidate left-child subsets are:
• CarType ∈ {family}: Gini = 0.444
• CarType ∈ {sports}: Gini = 0.333
• CarType ∈ {truck}: Gini = 0.267
Updating the Class List
• After a node is split, the Class List must be updated to point to the new child nodes
• Scan the attribute list of the splitting attribute and update the corresponding leaf entries in the Class List:

  for the attribute A used in the split do
      traverse the attribute list of A;
      for each value u in the attribute list:
          find the corresponding entry e in the Class List;
          determine the child node c to which u belongs under the split;
          update the node reference in e to c;
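A minimal sketch of this update is shown below, assuming a numeric split of the form value(A) < threshold and class-list entries shaped as in the earlier data-setup sketch; the child node ids N2 and N3 are hypothetical.

```python
def update_class_list(attribute_list, class_list, threshold,
                      left_node="N2", right_node="N3"):
    for value, idx in attribute_list:
        entry = class_list[idx]                     # entry e for value u
        # Point the entry at the child node the record now belongs to.
        entry["leaf"] = left_node if value < threshold else right_node
```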
Preventing overfitting
• A tree T overfits if there is another tree T' that has higher error on the training data yet lower error on unseen data.
• An overfitted tree does not generalize to unseen instances.
• Overfitting happens when the data contain noise or irrelevant attributes and the training set is small.
• Overfitting can reduce accuracy drastically: by 10-25%, as reported by Mingers (1989) in Machine Learning.
Approaches to prevent overfitting
• Two approaches:
  • Stop growing the tree beyond a certain point
  • First over-fit, then post-prune (more widely used)
• Tree building is divided into phases:
  • Growth phase
  • Prune phase
• It is hard to decide when to stop growing the tree, so the second approach is more widely used.
Criteria for finding the correct final tree size:
• Three criteria:
  • Cross-validation with separate test data
  • Use a criterion function to choose the best size
    • Example: the minimum description length (MDL) criterion
  • Statistical bounds: use all data for training but apply a statistical test to decide the right size.
The minimum description length (MDL) principle
• MDL: a paradigm for statistical estimation, particularly model selection
• Given data D and a class of models M, choose a model m in M such that the data and the model can be encoded using the smallest total length
  • L(D) = L(D|m) + L(m)
• How to find the encoding length?
  • The answer comes from information theory
  • Consider the problem of transmitting n messages, where pi is the probability of seeing message i
  • Shannon's theorem: the expected length is minimized by assigning -log pi bits to message i
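A small numeric illustration of Shannon's bound is sketched below: assigning -log2(pi) bits to message i gives the minimum expected code length, which is just the entropy of the message distribution.

```python
import math

def expected_code_length(probabilities):
    """Expected bits per message when message i gets -log2(p_i) bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Two equally likely messages need 1 bit on average; a skewed source needs less.
print(expected_code_length([0.5, 0.5]))    # 1.0
print(expected_code_length([0.9, 0.1]))    # ~0.469
```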
Encoding data
• Assume t records of training data D
• First send the tree m, using L(m|M) bits
• Assume the receiver knows everything about the training data except the class labels
• Goal: transmit the class labels using L(D|m) bits
  • If the tree correctly predicts an instance: 0 bits
  • Otherwise: log k bits, where k is the number of classes
• Thus, with e errors on the training data, the total cost is e log k + L(m|M) bits
• A more complex tree has a higher L(m|M) but a lower e
• Question: how to encode the tree?
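The cost comparison above can be sketched as follows. How to compute the tree-encoding length L(m|M) is the open question posed on this slide, so it is passed in as a given number here; the example figures are made up purely for illustration.

```python
import math

def mdl_cost(errors, num_classes, tree_encoding_bits):
    """Total description length: e * log2(k) + L(m|M) bits."""
    return errors * math.log2(num_classes) + tree_encoding_bits

# A complex tree with few errors vs. a simpler tree with more errors:
print(mdl_cost(errors=2, num_classes=2, tree_encoding_bits=40))   # 42.0
print(mdl_cost(errors=10, num_classes=2, tree_encoding_bits=12))  # 22.0
```

In this made-up example the simpler tree has the smaller total cost, so MDL-based pruning would prefer it.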
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example:

  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
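A sketch of this extraction is shown below: walk every root-to-leaf path and conjoin the attribute tests along the way. The nested-dict tree encoding is an assumption made for the example, not SLIQ's actual representation.

```python
# Example tree matching the rules above (leaves are the predicted labels).
tree = {
    "attr": "age", "branches": {
        "<=30": {"attr": "student", "branches": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40": {"attr": "credit_rating",
                "branches": {"excellent": "yes", "fair": "no"}},
    },
}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        print(f'IF {body} THEN buys_computer = "{node}"')
        return
    for value, child in node["branches"].items():  # one path per branch
        extract_rules(child, conditions + ((node["attr"], value),))

extract_rules(tree)
```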
SPRINT
• An improvement over SLIQ
• Does not need to keep the class list in main memory
• The parallel version is straightforward
• Attribute lists are extended with a class field, so no separate Class List is needed
• Uses hashing to assign records to classes and nodes
Pros and Cons of decision trees
• Cons
  • Cannot capture complicated relationships between features
  • Only simple decision boundaries
  • Problems when there is a lot of missing data
• Pros
  • Reasonable training time
  • Fast application
  • Easy to interpret
  • Easy to implement
  • Can handle a large number of features

More information: http://www.recursive-partitioning.com/