Training Dataset Example
Output: A Decision Tree for "buys_computer" [Figure: decision tree with root test age?; branch <=30 leads to a student? test (no → no, yes → yes), branch 30..40 leads to the leaf yes, and branch >40 leads to a credit rating? test (excellent → no, fair → yes).]
Choice of attribute [Figure: a set of + / – labeled samples and candidate splits of differing purity.] We prefer splits that lead to "pure" partitions. Purity: the class labels within a partition are homogeneous.
Selecting the best split • The best split is selected based on the degree of impurity of the child nodes • Class distribution (0,1) has high purity • Class distribution (0.5,0.5) has the smallest purity (highest impurity) • Intuition: high purity → small value of the impurity measure → better split
Algorithm for Decision Tree Induction (pseudocode) Algorithm GenDecTree(Sample S, AttList A) • Create a node N • If all samples in S are of the same class C, then label N with C; terminate • If A is empty, then label N with the most common class C in S (majority voting); terminate • Select a ∈ A with the highest impurity reduction; label N with a • For each value v of a: • Grow a branch from N with condition a = v • Let Sv be the subset of samples in S with a = v • If Sv is empty, then attach a leaf labeled with the most common class in S • Else attach the node generated by GenDecTree(Sv, A − {a})
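A minimal Python sketch of this pseudocode, assuming categorical attributes; the argument names and the impurity_reduction scoring function are illustrative placeholders, not part of any particular library:

```python
from collections import Counter

def gen_dec_tree(samples, labels, attributes, domains, impurity_reduction):
    """Recursive sketch of GenDecTree.

    samples    : list of dicts (attribute name -> value)
    labels     : list of class labels, parallel to samples
    attributes : attribute names still available for splitting
    domains    : dict mapping each attribute to its set of possible values
    impurity_reduction : function(samples, labels, attr) -> float
    """
    if len(set(labels)) == 1:              # all samples in the same class C
        return labels[0]
    if not attributes:                     # A is empty: majority voting
        return Counter(labels).most_common(1)[0][0]

    # Select the attribute with the highest impurity reduction and label the node with it.
    best = max(attributes, key=lambda a: impurity_reduction(samples, labels, a))
    majority = Counter(labels).most_common(1)[0][0]
    node = {"attribute": best, "branches": {}}

    for value in domains[best]:            # grow one branch per value of the chosen attribute
        subset = [(s, y) for s, y in zip(samples, labels) if s[best] == value]
        if not subset:                     # Sv is empty: leaf with the most common class in S
            node["branches"][value] = majority
        else:
            sub_s, sub_y = map(list, zip(*subset))
            node["branches"][value] = gen_dec_tree(
                sub_s, sub_y, [a for a in attributes if a != best],
                domains, impurity_reduction)
    return node
```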
Attribute Selection Measure: Information Gain • Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D| - where Ci,D denotes the set of tuples in D that belong to class Ci • Expected information (entropy) needed to classify a tuple in D: Info(D) = −Σi=1..m pi log2(pi) - where m is the number of classes
Attribute Selection Measure: Information Gain • Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = Σj=1..v (|Dj| / |D|) × Info(Dj) • Information gained by branching on attribute A: Gain(A) = Info(D) − InfoA(D)
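A short Python sketch of these two formulas for a categorical attribute; values and labels are plain parallel lists, and the function names are just illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the m classes in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where `values` holds attribute A for each tuple."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a
```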
Attribute Selection: Information Gain • Class P: buys_computer = "yes" • Class N: buys_computer = "no" [Worked computation on the training dataset shown in the original slide.]
Splitting the samples using age [Figure: the training samples partitioned by the test age? into branches <=30, 30...40 (all labeled yes), and >40.]
Gini index • If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 − Σj=1..n pj² - where pj is the relative frequency of class j in D • If a data set D is split on A into two subsets D1 and D2, the gini index of the split, giniA(D), is defined as giniA(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Gini index • Reduction in Impurity: Δgini(A) = gini(D) − giniA(D) • The attribute that provides the smallest giniA(D) (or the largest reduction in impurity) is chosen to split the node
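A small Python sketch of the Gini computations, written for a general multiway split (CART itself uses binary splits); the function names are illustrative:

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the classes in `labels`."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(values, labels):
    """gini_A(D): Gini of each partition induced by attribute A, weighted by partition size."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(part) / total * gini(part) for part in partitions.values())

def gini_reduction(values, labels):
    """Reduction in impurity: gini(D) - gini_A(D)."""
    return gini(labels) - gini_split(values, labels)
```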
Comparing Attribute Selection Measures • The two measures generally return good results, but: • Both are biased towards multivalued attributes • Gini may have difficulty when the number of classes is large • Gini tends to favor tests that result in equal-sized partitions and purity in both partitions
Is minimizing impurity / maximizing Δ enough? • The Δ gain function favors attributes with a large number of values • A test condition with a large number of outcomes may not be desirable • The # of records in each partition may become too small to make reliable predictions
Gain ratio • A modification of the gain that reduces its bias towards high-branch attributes • Gain ratio should be • Large when data is evenly spread across many branches • Small when all data belongs to one branch • Takes the number and size of branches into account when choosing an attribute • It corrects Δ by taking the intrinsic information of a split into account (i.e., how much info we need to tell which branch an instance belongs to)
Gain ratio • Gain ratio = Δ / SplitInfo • SplitInfo = −Σi=1..k P(vi) log(P(vi)) • k: total number of branches (splits) • P(vi): probability that an instance belongs to branch i • If each branch of the split has the same number of records: P(vi) = 1/k and SplitInfo = log k • Large number of splits → large SplitInfo → small gain ratio
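A minimal Python sketch of SplitInfo and the gain ratio; it takes the information gain as an argument rather than recomputing it, and uses log base 2 (a common convention, though the slide leaves the base unspecified):

```python
import math

def split_info(values):
    """SplitInfo = -sum_i P(v_i) * log2(P(v_i)) over the k branches of the split."""
    total = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(gain, values):
    """Gain ratio = Delta / SplitInfo; returns 0 when the split is degenerate."""
    si = split_info(values)
    return gain / si if si > 0 else 0.0

# Sanity check: an even split into k = 3 branches gives SplitInfo = log2(3).
assert abs(split_info(["a", "b", "c", "a", "b", "c"]) - math.log2(3)) < 1e-9
```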
Decision boundary for decision trees • The border line between two neighboring regions of different classes is known as the decision boundary • The decision boundary of a decision tree is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees [Figure: a dataset whose two classes are separated by the diagonal line x + y = 1.] • Not all datasets can be partitioned optimally using test conditions on single attributes! • A test on multiple attributes is needed: if x + y < 1 then red class
Oblique Decision Trees • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x² + y²) ≤ 1 • Triangular points: sqrt(x² + y²) > 1 or sqrt(x² + y²) < 0.5
Overfitting due to noise • The decision boundary is distorted by a noise point
Overfitting due to insufficient samples [Figure legend: x = class 1, a second marker = class 2, o = test samples.] • Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region • The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Overfitting and Tree Pruning • Overfitting: An induced tree may overfit the training data • Too many branches, some of which may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples
Overfitting and Tree Pruning • Two approaches to avoid overfitting • Prepruning: Halt tree construction early • Do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: Remove branches from a "fully grown" tree (see the sketch below) • Prune a sub-tree if the classification error is smaller after pruning • This gives a sequence of progressively pruned trees
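A rough Python sketch of the postpruning idea, reusing the dict-based tree from the earlier induction sketch; the validation-error helper error(tree_or_leaf, samples, labels) is an assumed placeholder, not an existing library function:

```python
from collections import Counter

def post_prune(node, samples, labels, error):
    """Bottom-up postpruning sketch: replace a subtree by a majority-class leaf
    whenever that does not increase the classification error on a pruning set.
    `error(tree_or_leaf, samples, labels)` is an assumed helper returning the
    number of misclassified samples for either a subtree or a single leaf label."""
    if not isinstance(node, dict):                     # already a leaf
        return node
    # First prune the children, passing down the samples that reach each branch.
    for value in list(node["branches"]):
        branch_data = [(s, y) for s, y in zip(samples, labels)
                       if s[node["attribute"]] == value]
        sub_s = [s for s, _ in branch_data]
        sub_y = [y for _, y in branch_data]
        node["branches"][value] = post_prune(node["branches"][value], sub_s, sub_y, error)
    # Then try collapsing this whole subtree into a majority-class leaf.
    if labels:
        leaf = Counter(labels).most_common(1)[0][0]
        if error(leaf, samples, labels) <= error(node, samples, labels):
            return leaf
    return node
```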
Pros and Cons of decision trees • Cons • Cannot handle complicated relationships between features • Simple decision boundaries • Problems with lots of missing data • Constructing the optimal decision tree: NP-complete • Pros • Reasonable training time • Fast application • Easy to interpret • Easy to implement • Can handle large number of features
Some well-known decision tree learning implementations • CART: Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth • ID3: Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81–106 • C4.5: Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann • J48: Implementation of C4.5 in WEKA
Handling missing values • Imputation: An estimation of the missing value or of its distribution is used to generate predictions from a given model: • a missing value is replaced with an estimation of the value, or • the distribution of possible missing values is estimated and the corresponding model predictions are combined probabilistically.
Handling missing values • Remove attributes with missing values • Remove examples with missing values • Assume the most frequent value • Assume the most frequent value given a class • Learn the distribution of a given attribute • Induce relationships between the available attribute values and the missing feature
Imputing Missing Values • Expectation Maximization (EM): • Build a model of the data values (ignoring missing values) • Use the model to estimate the missing values • Build a new model of the data values (including the estimated values from the previous step) • Use the new model to re-estimate the missing values • Re-estimate the model • Repeat until convergence
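A simplified Python sketch of this fill / re-fit / re-fill loop for a numeric matrix with NaN as the missing marker; it re-estimates each incomplete column by linear regression on the other columns, which is only one of many possible models (a full EM implementation would also track distributional parameters):

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """EM-style imputation loop: fill missing entries, re-fit a per-column model
    on the completed data, re-estimate the missing entries, and repeat."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # Initial model: fill every missing entry with its column mean over observed values.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Re-fit a linear model for column j on the rows where it was observed ...
            others = np.delete(X, j, axis=1)
            A = np.c_[np.ones(len(X)), others]
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            # ... and use it to re-estimate the missing entries of column j.
            X[rows, j] = A[rows] @ coef
    return X
```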
Potential Problems • Imputed values may be inappropriate: • in medical databases, if missing values are not imputed separately for male and female patients, we may end up with male patients with 1.3 prior pregnancies and female patients suffering from prostate infection • many of these situations will not be so obvious • If some attributes are difficult to predict, the filled-in values may be random (or worse)
What is Bayesian Classification? • Bayesian classifiers are statistical classifiers • For each new sample they provide a probability that the sample belongs to a class (for all classes)
Bayes’ Theorem: Basics • Let X be a data sample (“evidence”): class label is unknown • Let H be a hypothesis that X belongs to class C • Classification is to determine P(H|X) • the probability that the hypothesis holds given the observed data sample X
Bayes’ Theorem: Basics • P(H) (prior probability): • the initial probability • E.g., any X will buy a computer, regardless of age, income, … • P(X): probability that sample data X is observed • P(X|H) (likelihood): • the probability of observing the sample X, given that the hypothesis holds • E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayes’ Theorem • Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X) • Informally, this can be written as posteriori = likelihood × prior / evidence • Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes • Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost
Towards Naïve Bayesian Classifiers • D: training set of tuples and their associated class labels • X = (x1, x2, …, xn): each tuple is represented by an n-dimensional attribute vector • Suppose there are m classes C1, C2, …, Cm • Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
Towards Naïve Bayesian Classifiers • Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X) • This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X) • Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Derivation of Naïve Bayesian Classifier • A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes): P(X|Ci) = Πk=1..n P(xk|Ci) • Each xk is the value of attribute Ak in X • This greatly reduces the computation cost: only the class distribution needs to be counted
Derivation of Naïve Bayesian Classifier • If Ak is categorical: P(xk|Ci) is the # of tuples in Ci having value xk for Ak, divided by |Ci,D| (the # of tuples of Ci in D) • If Ak is continuous-valued: P(xk|Ci) is computed based on a Gaussian distribution with mean μ and standard deviation σ: g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)) and P(xk|Ci) = g(xk, μCi, σCi) - where μCi and σCi are estimated from the tuples of class Ci in D
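For the continuous case, a one-function Python sketch of the Gaussian density (the function name is illustrative):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """g(x, mu, sigma): Gaussian density used as P(x_k | C_i) for a continuous attribute."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```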
Naive Bayesian Classifier Example [Table: the "play tennis?" training set with attributes outlook, temperature, humidity, windy and class labels P (play) / N (don't play).]
Naive Bayesian Classifier Example • Given the training set, we compute the conditional probabilities P(xk|Ci) • We also have the prior class probabilities • P(Ci = P) = 9/14 • P(Ci = N) = 5/14
Example • To classify a new sample X: < outlook = sunny, temperature = cool, humidity = high, windy = false > • Prob(Ci = P|X) = Prob(P) * Prob(sunny|P)*Prob(cool|P)* Prob(high|P)*Prob(false|P) = 9/14*2/9*3/9*3/9*6/9 = 0.01 • Prob(Ci = N|X) = Prob(N) * Prob(sunny|N)*Prob(cool|N)* Prob(high|N)*Prob(false|N) = 5/14*3/5*1/5*4/5*2/5 = 0.013 • Therefore X takes class label N
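A tiny Python check of this calculation; the prior and conditional probabilities are copied from the worked example above, and the variable names are illustrative:

```python
import math

# Prior and conditional probabilities taken from the worked play-tennis example above.
priors = {"P": 9 / 14, "N": 5 / 14}
cond = {
    "P": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "false": 6 / 9},
    "N": {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "false": 2 / 5},
}

# X = <outlook=sunny, temperature=cool, humidity=high, windy=false>
x = ["sunny", "cool", "high", "false"]

scores = {c: priors[c] * math.prod(cond[c][a] for a in x) for c in priors}
print(scores)                        # {'P': ~0.0106, 'N': ~0.0137}
print(max(scores, key=scores.get))   # 'N' -> X is assigned class N
```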
Avoiding the 0-Probability Problem • Naïve Bayesian prediction requires each conditional probability to be non-zero • Otherwise, the predicted probability will be zero: P(X|Ci) = Πk=1..n P(xk|Ci) = 0 whenever any single P(xk|Ci) = 0
Avoiding the 0-Probability Problem • Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), income = high (10) • Use the Laplacian correction (or Laplacian estimator): add 1 to each case Prob(income = low) = 1/1003 Prob(income = medium) = 991/1003 Prob(income = high) = 11/1003 • The "corrected" probability estimates are close to their "uncorrected" counterparts
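A two-line Python sketch of the correction, reproducing the numbers above (the helper name is illustrative):

```python
def laplace_estimate(count, total, n_values):
    """Laplacian correction: add 1 to each of the n_values cases before normalizing."""
    return (count + 1) / (total + n_values)

# Dataset from the example: 1000 tuples, income in {low, medium, high}.
for name, count in [("low", 0), ("medium", 990), ("high", 10)]:
    print(name, laplace_estimate(count, 1000, 3))   # 1/1003, 991/1003, 11/1003
```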
NBC: Comments • Advantages • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables • How to deal with these dependencies? • Bayesian Belief Networks
The perceptron • Input: each example xi has a set of attributes xi = (x1, x2, …, xm) and is of class yi • Estimated classification output: ui • Task: express the prediction for each sample xi as a weighted (linear) combination of its attributes, ui = sign(⟨w, xi⟩) • ⟨w, x⟩: the inner or dot product of w and x • How to learn the weights?
The perceptron (online) [Figure: + and – training examples processed one at a time.] • f(x) can also be written as a linear combination of all training examples
The perceptron [Figure: linearly separable + and – points with a separating hyperplane between them.] • The perceptron learning algorithm is guaranteed to find a separating hyperplane, if there is one. A sketch of the algorithm follows.
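A compact Python sketch of the online perceptron learning rule described above, assuming labels coded as +1 / −1; the learning rate and epoch cap are illustrative choices:

```python
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """Online perceptron: cycle through the examples, updating the weights
    on every misclassified point; if the data are linearly separable,
    the loop eventually makes a full pass with no updates and stops."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi               # move the hyperplane towards the correct side
                b += lr * yi
                updated = True
        if not updated:                         # a separating hyperplane has been found
            break
    return w, b

def perceptron_predict(X, w, b):
    return np.sign(X @ w + b)
```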