Classification
Classification task • Input: a training set of tuples, each labeled with one class label • Output: a model (classifier) that assigns a class label to each tuple based on the other attributes • The model can be used to predict the class of new tuples, for which the class label is missing or unknown
What is Classification • Data classification is a two-step process • first step: a model is built describing a predetermined set of data classes or concepts • second step: the model is used for classification • Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute • Data tuples are also referred to as samples, examples, or objects
Train and test • The tuples (examples, samples) are divided into training set + test set • Classification model is built in two steps: • training - build the model from the training set • test - check the accuracy of the model using test set
Train and test • Kinds of models: • if-then rules • logical formulae • decision trees • Accuracy of models: • the known class of test samples is matched against the class predicted by the model • accuracy rate = % of test set samples correctly classified by the model
Training step • (Figure: training data is fed to the classification algorithm, which produces the classifier (model), e.g. the rule: if age < 31 or Car Type = Sports then Risk = High.)
Test step • (Figure: test data is fed to the classifier (model) to check its accuracy.)
Classification (prediction) • (Figure: new data is fed to the classifier (model) to predict its class.)
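The three figures above can be summarized in a minimal Python sketch (an illustration, not from the slides): the rule from the training-step figure is hard-coded as the learned model, and the test tuples and attribute names are hypothetical.

# Hypothetical classifier corresponding to the rule from the training-step figure:
# if age < 31 or Car Type = Sports then Risk = High, otherwise Risk = Low.
def classify(record):
    if record["age"] < 31 or record["car_type"] == "sports":
        return "High"
    return "Low"

# Test step: hypothetical labeled test tuples; accuracy rate = % classified correctly.
test_set = [
    {"age": 25, "car_type": "sports", "risk": "High"},
    {"age": 45, "car_type": "family", "risk": "Low"},
    {"age": 38, "car_type": "truck", "risk": "High"},
]
correct = sum(1 for t in test_set if classify(t) == t["risk"])
print("accuracy rate:", correct / len(test_set))    # 2/3 on this made-up test set

# Classification step: predict the class of a new tuple whose label is unknown.
print("predicted risk:", classify({"age": 52, "car_type": "sports"}))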
Classification vs. Prediction • There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends: • classification: predicts categorical labels • prediction: models continuous-valued functions
Comparing Classification Methods (1) • Predictive accuracy: this refers to the ability of the model to correctly predict the class label of new or previously unseen data • Speed: this refers to the computation costs involved in generating and using the model • Robustness: this is the ability of the model to make correct predictions given noisy data or data with missing values
Comparing Classification Methods (2) • Scalability: this refers to the ability to construct the model efficiently given large amounts of data • Interpretability: this refers to the level of understanding and insight that is provided by the model • Simplicity: • decision tree size • rule compactness • Domain-dependent quality indicators
Problem formulation • Given records in the database with a class label, find a model for each class. • (Figure: example decision tree testing Age < 31 and Car Type is sports, with leaf classes High and Low.)
Classification techniques • Decision Tree Classification • Bayesian Classifiers • Neural Networks • Statistical Analysis • Genetic Algorithms • Rough Set Approach • k-nearest neighbor classifiers
Classification by Decision Tree Induction • A decision tree is a tree structure, where • each internal node denotes a test on an attribute, • each branch represents the outcome of the test, • leaf nodes represent classes or class distributions • (Figure: example tree with root test Age < 31 (branches Y/N), inner test Car Type is sports, and leaves High and Low.)
Decision Tree Induction (1) • A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class. • Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned
Decision Tree Induction (2) • Basic algorithm: a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner. • Many variants: • from machine learning (ID3, C4.5) • from statistics (CART) • from pattern recognition (CHAID) • Main difference: split criterion
Decision Tree Induction (3) • The algorithm consists of two phases: • Build an initial tree from the training data such that each leaf node is pure • Prune this tree to increase its accuracy on test data
Tree Building • In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small. • The form of the split used to partition the data depends on the type of the attribute used in the split: • for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A • for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊂ domain(A)
Tree Building Algorithm

MakeTree(Training Data T)
  Partition(T)

Partition(Data S)
  if (all points in S are in the same class) then return
  for each attribute A do
    evaluate splits on attribute A
  use best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
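A sketch of the same skeleton in Python (an illustration under assumptions, not the slides' implementation): records are assumed to be (attribute_dict, class_label) pairs, and find_best_split stands for the split-evaluation step whose criteria are discussed below.

from collections import Counter

# Generic top-down, divide-and-conquer tree building. find_best_split(records) is
# assumed to return {"description": ..., "test": predicate} or None if no split helps.
def build_tree(records, find_best_split):
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                        # partition is pure -> leaf node
        return {"leaf": labels[0]}
    split = find_best_split(records)                 # evaluate splits on every attribute
    if split is None:                                # no useful split -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    left = [r for r in records if split["test"](r[0])]
    right = [r for r in records if not split["test"](r[0])]
    if not left or not right:                        # degenerate split -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"split": split["description"],
            "yes": build_tree(left, find_best_split),
            "no": build_tree(right, find_best_split)}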
Tree Building Algorithm • While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf • To evaluate the goodness of the split some splitting indices have been proposed
Split Criteria • Gini index (CART, SPRINT) • select the attribute that minimizes the impurity of a split • Information gain (ID3, C4.5) • use entropy to measure the impurity of a split • select the attribute that maximizes the entropy reduction • χ² contingency-table statistic (CHAID) • measures the correlation between each attribute and the class label • select the attribute with maximal correlation
Gini index (1) • Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk. • (Figure: training set → classifier (model).) • The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories.
Gini index (2)
SPRINT algorithm:

Partition(Data S)
  if (all points in S are of the same class) then return
  for each attribute A do
    evaluate splits on attribute A
  use best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)

Initial call: Partition(Training Data)
Gini index (3) • Definition: gini(S) = 1 - Σj pj² where: • S is a data set containing examples from n classes • pj is the relative frequency of class j in S • E.g. for two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements: ppos = p/(p+n), pneg = n/(n+p), gini(S) = 1 - ppos² - pneg²
Gini index (4) • If dataset S is split into S1 and S2, then the splitting index is defined as follows: giniSPLIT(S) = (p1+n1)/(p+n) · gini(S1) + (p2+n2)/(p+n) · gini(S2) where p1 and n1 (respectively p2 and n2) denote the numbers of Pos-elements and Neg-elements in the dataset S1 (respectively S2). • With this definition the "best" split point is the one with the lowest value of the giniSPLIT index.
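As a sketch (not part of the slides), the two formulas above translate directly into Python, with the class frequencies taken from a list of class labels:

from collections import Counter

def gini(labels):
    # gini(S) = 1 - sum_j pj^2, where pj is the relative frequency of class j in S
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_split(labels_s1, labels_s2):
    # weighted Gini index of splitting S into S1 and S2
    n1, n2 = len(labels_s1), len(labels_s2)
    return (n1 / (n1 + n2)) * gini(labels_s1) + (n2 / (n1 + n2)) * gini(labels_s2)

# e.g. gini(["Pos", "Pos", "Neg"]) = 1 - (2/3)^2 - (1/3)^2 = 4/9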
Example (1) • Training set • (Figure: table of six car-insurance applicants with attributes Age and Car Type and class label Risk = High/Low.)
Example (1) • (Figure: attribute list for 'Age' and attribute list for 'Car Type' derived from the training set.)
Example (2) • Possible values of a split point for the Age attribute are: Age≤17, Age≤20, Age≤23, Age≤32, Age≤43, Age≤68 • G(Age≤17) = 1 - (1² + 0²) = 0 • G(Age>17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25 • GSPLIT = (1/6) · 0 + (5/6) · (12/25) = 2/5
Example (3) • G(Age≤20) = 1 - (1² + 0²) = 0 • G(Age>20) = 1 - ((1/2)² + (1/2)²) = 1/2 • GSPLIT = (2/6) · 0 + (4/6) · (1/2) = 1/3 • G(Age≤23) = 1 - (1² + 0²) = 0 • G(Age>23) = 1 - ((1/3)² + (2/3)²) = 1 - (1/9) - (4/9) = 4/9 • GSPLIT = (3/6) · 0 + (3/6) · (4/9) = 2/9
Example (4) • G(Age≤32) = 1 - ((3/4)² + (1/4)²) = 1 - (10/16) = 6/16 = 3/8 • G(Age>32) = 1 - ((1/2)² + (1/2)²) = 1/2 • GSPLIT = (4/6)·(3/8) + (2/6)·(1/2) = (1/4) + (1/6) = 5/12 • The lowest value of GSPLIT is for Age≤23, thus we have a split point at Age = (23+32)/2 = 27.5
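A quick check of these numbers (a sketch, not part of the slides), using only the (High, Low) counts on each side of every candidate split as given above:

def gini_counts(pos, neg):
    # Gini index of a partition containing pos High and neg Low records
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

def gsplit(p1, n1, p2, n2):
    # weighted Gini index of a binary split with counts (p1, n1) and (p2, n2)
    n = p1 + n1 + p2 + n2
    return ((p1 + n1) / n) * gini_counts(p1, n1) + ((p2 + n2) / n) * gini_counts(p2, n2)

print(gsplit(1, 0, 3, 2))   # Age <= 17: 0.4   (= 2/5)
print(gsplit(2, 0, 2, 2))   # Age <= 20: 0.333 (= 1/3)
print(gsplit(3, 0, 1, 2))   # Age <= 23: 0.222 (= 2/9), the lowest value
print(gsplit(3, 1, 1, 1))   # Age <= 32: 0.417 (= 5/12)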
Example (5) • Decision tree after the first split of the example set (figure): root test Age ≤ 27.5; the Age ≤ 27.5 branch is labeled Risk = High, the Age > 27.5 branch Risk = Low.
Example (6) • Attribute lists are divided at the split point (figure): attribute lists for Age ≤ 27.5 and attribute lists for Age > 27.5.
Example (7) • Evaluating splits for categorical attributes: we have to evaluate the splitting index for each of the 2^N combinations of values, where N is the cardinality of the categorical attribute. For the Age > 27.5 partition: • G(Car type ∈ {sport}) = 1 - 1² - 0² = 0 • G(Car type ∈ {family}) = 1 - 0² - 1² = 0 • G(Car type ∈ {truck}) = 1 - 0² - 1² = 0
Example (8) • G(Car type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2 • G(Car type ∈ {sport, truck}) = 1/2 • G(Car type ∈ {family, truck}) = 1 - 0² - 1² = 0 • GSPLIT(Car type ∈ {sport}) = (1/3)·0 + (2/3)·0 = 0 • GSPLIT(Car type ∈ {family}) = (1/3)·0 + (2/3)·(1/2) = 1/3 • GSPLIT(Car type ∈ {truck}) = (1/3)·0 + (2/3)·(1/2) = 1/3 • GSPLIT(Car type ∈ {sport, family}) = (2/3)·(1/2) + (1/3)·0 = 1/3 • GSPLIT(Car type ∈ {sport, truck}) = (2/3)·(1/2) + (1/3)·0 = 1/3 • GSPLIT(Car type ∈ {family, truck}) = (2/3)·0 + (1/3)·0 = 0
Example (9) • The lowest value of GSPLIT is for Car type ∈ {sport}, thus this is our split point. • Decision tree after the second split of the example set (figure): root test Age ≤ 27.5; the Age ≤ 27.5 branch is a leaf with Risk = High; the Age > 27.5 branch tests Car type: Car type ∈ {sport} → Risk = High, Car type ∈ {family, truck} → Risk = Low.
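The enumeration of candidate subsets can be sketched in Python (an illustrative assumption about how it could be coded, not the SPRINT implementation); the record layout and the small Age > 27.5 partition below are hypothetical stand-ins for the example data:

from itertools import combinations
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_categorical_split(records, attribute):
    # records: list of (attribute_dict, class_label) pairs; returns (giniSPLIT, subset)
    # for the value subset with the lowest weighted Gini index.
    values = sorted({attrs[attribute] for attrs, _ in records})
    best = None
    # every non-empty proper subset of values defines a candidate binary split
    # (each split is generated twice, once per side, which is harmless here)
    for k in range(1, len(values)):
        for subset in combinations(values, k):
            left = [label for attrs, label in records if attrs[attribute] in subset]
            right = [label for attrs, label in records if attrs[attribute] not in subset]
            score = len(left) / len(records) * gini(left) + len(right) / len(records) * gini(right)
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best

# hypothetical Age > 27.5 partition from the example
partition = [({"car_type": "sport"}, "High"),
             ({"car_type": "family"}, "Low"),
             ({"car_type": "truck"}, "Low")]
print(best_categorical_split(partition, "car_type"))   # (0.0, {'sport'})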
Information Gain (1) • The information gain measure is used to select the test attribute at each node in the tree • The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node • This attribute minimizes the information needed to classify the samples in the resulting partitions
Information Gain (2) • Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i = 1, ..., m) • Let si be the number of samples of S in class Ci • The expected information needed to classify a given sample is given by I(s1, s2, ..., sm) = - Σi pi log2(pi) where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s.
Information Gain (3) • Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A • If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S
Information Gain (4) • Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by: E(A) = Σj [(s1j + s2j + ... + smj)/s] · I(s1j, s2j, ..., smj) • The smaller the entropy value, the greater the purity of the subset partitions.
Information Gain (5) • The term (s1j + s2j + ... + smj)/s acts as the weight of the j-th subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj, I(s1j, s2j, ..., smj) = - Σi pij log2(pij) where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci.
Information Gain (6) The encoding information that would be gained by branching on A is Gain(A) = I(s1, s2, ..., sm) – E(A) Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
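The formulas I(s1, ..., sm), E(A) and Gain(A) can be written as a short Python sketch (an illustration, not from the slides); samples are assumed to be (attribute_dict, class_label) pairs:

import math
from collections import Counter, defaultdict

def expected_info(labels):
    # I(s1, ..., sm) = -sum_i pi log2(pi), with pi estimated as si/s
    s = len(labels)
    return -sum((si / s) * math.log2(si / s) for si in Counter(labels).values())

def entropy_of_attribute(samples, attribute):
    # E(A) = sum_j |Sj|/|S| * I(Sj), where Sj groups the samples by their value of A
    groups = defaultdict(list)
    for attrs, label in samples:
        groups[attrs[attribute]].append(label)
    s = len(samples)
    return sum(len(group) / s * expected_info(group) for group in groups.values())

def gain(samples, attribute):
    # Gain(A) = I(S) - E(A): expected reduction in entropy from branching on A
    return expected_info([label for _, label in samples]) - entropy_of_attribute(samples, attribute)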
Example (2) • Let us consider a training set of tuples taken from the customer database. • The class label attribute, buys_computer, has two distinct values (yes, no); therefore there are two classes (m = 2). C1 corresponds to yes: s1 = 9. C2 corresponds to no: s2 = 5. I(s1, s2) = I(9, 5) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940
Example (3) • Next, we need to compute the entropy of each attribute. Let us start with the attribute age: • for age = '<=30': s11 = 2, s21 = 3, I(s11, s21) = 0.971 • for age = '31..40': s12 = 4, s22 = 0, I(s12, s22) = 0 • for age = '>40': s13 = 2, s23 = 3, I(s13, s23) = 0.971
Example (4) • The entropy of age is: E(age) = (5/14)·I(s11, s21) + (4/14)·I(s12, s22) + (5/14)·I(s13, s23) = 0.694 • The gain in information from such a partitioning would be: Gain(age) = I(s1, s2) - E(age) = 0.246
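These numbers can be reproduced from the class counts alone (a sketch, not part of the slides):

import math

def info(counts):
    # I(s1, ..., sm) computed from per-class sample counts
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

i_s = info([9, 5])                                                              # I(9, 5) ≈ 0.940
e_age = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([2, 3])   # E(age) ≈ 0.694
gain_age = i_s - e_age                # ≈ 0.247, i.e. the slides' 0.246 up to rounding
print(i_s, e_age, gain_age)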
Example (5) • We can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. • Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
Example (6) • (Figure: node labeled age with branches <=30, 31..40 and >40; the <=30 and >40 partitions still contain both classes (buys_computer: yes, no), while the 31..40 partition is pure (buys_computer: yes).)
Example (7) • (Figure: final decision tree. The root tests age; the 31..40 branch is a leaf labeled yes; the <=30 branch tests student (no → no, yes → yes); the >40 branch tests credit_rating (excellent → no, fair → yes).)
Entropy vs. Gini index • Entropy tends to find groups of classes that together add up to 50% of the data • The Gini index tends to isolate the largest class from all other classes • (Figure: a dataset with class A 40, class B 30, class C 20, class D 10. The entropy-based split "if age < 65" separates {A, D} from {B, C}; the Gini-based split "if age < 40" separates {A} from {B, C, D}.)
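The behaviour shown in the figure can be verified numerically with a small sketch (not from the slides), using the class counts A = 40, B = 30, C = 20, D = 10:

import math

def entropy(counts):
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c > 0)

def gini(counts):
    s = sum(counts)
    return 1.0 - sum((c / s) ** 2 for c in counts)

def weighted(impurity, left, right):
    # weighted impurity of a binary split, given per-class counts on each side
    nl, nr = sum(left), sum(right)
    return nl / (nl + nr) * impurity(left) + nr / (nl + nr) * impurity(right)

split_50_50 = ([40, 10], [30, 20])       # "if age < 65": {A, D} vs {B, C}
split_isolate = ([40], [30, 20, 10])     # "if age < 40": {A} vs {B, C, D}

# entropy: 0.846 vs 0.875 -> prefers the 50/50 grouping {A, D} vs {B, C}
print(weighted(entropy, *split_50_50), weighted(entropy, *split_isolate))
# Gini:    0.400 vs 0.367 -> prefers isolating the largest class A
print(weighted(gini, *split_50_50), weighted(gini, *split_isolate))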
Tree pruning • When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. • Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data