Classification Based in part on Chapter 10 of Hand, Mannila, & Smyth and Chapter 7 of Han and Kamber David Madigan
Predictive Modeling Goal: learn a mapping: y = f(x; θ) Need: 1. A model structure 2. A score function 3. An optimization strategy Categorical y ∈ {c1,…,cm}: classification Real-valued y: regression Note: usually assume {c1,…,cm} are mutually exclusive and exhaustive
Probabilistic Classification Let p(ck) = prob. that a randomly chosen object comes from ck Objects from ck have: p(x | ck, θk) (e.g., MVN) Then: p(ck | x) ∝ p(x | ck, θk) p(ck) Bayes Error Rate: pB* = 1 – ∫ maxk p(ck | x) p(x) dx • Lower bound on the best possible error rate
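As an illustration (toy numbers of my own, not from the slide), the Bayes error rate for two equally likely one-dimensional Gaussian classes can be approximated numerically as 1 – E[maxk p(ck | x)]:

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# two equally likely classes with N(0,1) and N(2,1) class-conditional densities
xs = np.linspace(-8.0, 10.0, 20001)
dx = xs[1] - xs[0]
p_x_c1, p_x_c2 = gauss_pdf(xs, 0.0, 1.0), gauss_pdf(xs, 2.0, 1.0)
p_x = 0.5 * p_x_c1 + 0.5 * p_x_c2
post_max = np.maximum(0.5 * p_x_c1, 0.5 * p_x_c2) / p_x   # max_k p(c_k | x)
bayes_error = np.sum((1.0 - post_max) * p_x) * dx          # 1 - E[max_k p(c_k | x)]
print(bayes_error)   # about 0.159 for these two densities; no classifier can do better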
Classifier Types Discrimination: direct mapping from x to {c1,…,cm} - e.g. perceptron, SVM, CART Regression: model p(ck | x) - e.g. logistic regression, CART Class-conditional: model p(x | ck, θk) - e.g. “Bayesian classifiers”, LDA
Simple Two-Class Perceptron Define: h(x) = Σj wj xj Classify as class 1 if h(x) > 0, class 2 otherwise Score function: # misclassification errors on training data For training, replace class 2 xj’s by -xj; now need h(x) > 0 for every training point Initialize weight vector w Repeat one or more times: For each training data point xi: If point correctly classified, do nothing Else set w ← w + xi Guaranteed to converge when there is perfect separation
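A minimal sketch of this training loop in Python/NumPy, assuming the class-2 sign trick above has already been applied so every row of X should end up with h(x) = w·x > 0; the toy data and the cap on passes are illustrative, not from the slides.

import numpy as np

def train_perceptron(X, max_passes=100):
    """X: n x p array; class-2 points are assumed already negated (sign trick)."""
    n, p = X.shape
    w = np.zeros(p)                      # initialize weight vector
    for _ in range(max_passes):          # repeat one or more times
        errors = 0
        for xi in X:                     # for each training data point
            if w @ xi <= 0:              # misclassified: need h(x) > 0
                w = w + xi               # else-branch: add the point to the weights
                errors += 1
        if errors == 0:                  # every point classified correctly
            break
    return w

# toy usage: two separable 2-D classes plus an intercept column
X1 = np.array([[1.0, 2.0, 1.0], [1.5, 1.8, 1.0]])       # class 1
X2 = np.array([[-1.0, -2.0, 1.0], [-1.2, -0.8, 1.0]])   # class 2
w = train_perceptron(np.vstack([X1, -X2]))              # negate the class-2 rows
print(w)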
Linear Discriminant Analysis K classes, X an n × p data matrix. p(ck | x) ∝ p(x | ck, θk) p(ck) Could model each class density as multivariate normal: p(x | ck, θk) = N(x; μk, Σk) LDA assumes Σk = Σ for all k. Then: log [ p(ck | x) / p(cl | x) ] = log(πk/πl) – ½ (μk + μl)ᵀ Σ⁻¹ (μk – μl) + xᵀ Σ⁻¹ (μk – μl) This is linear in x.
Linear Discriminant Analysis (cont.) It follows that the classifier should predict argmaxk δk(x), where δk(x) = xᵀ Σ⁻¹ μk – ½ μkᵀ Σ⁻¹ μk + log πk is the “linear discriminant function” If we don’t assume the Σk’s are identical, get Quadratic DA: δk(x) = –½ log |Σk| – ½ (x – μk)ᵀ Σk⁻¹ (x – μk) + log πk
Linear Discriminant Analysis (cont.) Can estimate the LDA parameters via maximum likelihood: π̂k = Nk / N, μ̂k = Σ{i: yi = k} xi / Nk, Σ̂ = Σk Σ{i: yi = k} (xi – μ̂k)(xi – μ̂k)ᵀ / (N – K)
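A short sketch of these plug-in estimates and the resulting discriminant rule, assuming NumPy and labels y coded as the class values themselves; the function names are mine.

import numpy as np

def fit_lda(X, y):
    """Maximum-likelihood (plug-in) LDA estimates: priors, class means, pooled covariance."""
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])           # pi_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])     # mu_k
    Sigma = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])   # pooled covariance
                for i, k in enumerate(classes)) / (N - K)
    return classes, priors, means, np.linalg.inv(Sigma)

def predict_lda(x, classes, priors, means, Sigma_inv):
    """Predict the class with the largest linear discriminant delta_k(x)."""
    deltas = [x @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m + np.log(pi)
              for m, pi in zip(means, priors)]
    return classes[int(np.argmax(deltas))]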
[Figure: example decision boundaries for LDA (linear) and QDA (quadratic)]
LDA (cont.) • Fisher’s rule is optimal if the classes are MVN with a common covariance matrix • Computational complexity O(mp²n)
Logistic Regression Note that LDA is linear in x: log [ p(ck | x) / p(cK | x) ] = αk0 + αkᵀx Linear logistic regression looks the same: log [ p(ck | x) / p(cK | x) ] = βk0 + βkᵀx But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood [y, X]; logistic regression maximizes the conditional likelihood [y | X]. Usually similar predictions.
Logistic Regression MLE For the two-class case, the log-likelihood is: ℓ(β) = Σi [ yi βᵀxi – log(1 + exp(βᵀxi)) ] To maximize it we need to solve the (non-linear) score equations: ∂ℓ/∂β = Σi xi ( yi – p(xi; β) ) = 0
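A hedged sketch of solving these score equations by Newton-Raphson (equivalently IRLS); plain NumPy, with the intercept assumed to be included as a column of ones in X.

import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Two-class logistic regression MLE via Newton-Raphson on the score equations."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))        # p(x_i; beta)
        score = X.T @ (y - prob)                      # gradient of the log-likelihood
        W = prob * (1.0 - prob)                       # weights p(1 - p)
        hessian = -(X * W[:, None]).T @ X             # second-derivative matrix
        step = np.linalg.solve(hessian, score)
        beta = beta - step                            # Newton update
        if np.max(np.abs(step)) < tol:                # stop when the update is tiny
            break
    return beta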
Logistic Regression Modeling South African Heart Disease Example (y = MI) [Table: fitted coefficients, standard errors, and Wald statistics]
Tree Models • Easy to understand • Can handle mixed data, missing values, etc. • Sequential fitting method can be sub-optimal • Usually grow a large tree and prune it back rather than attempt to optimally stop the growing process
Training Dataset This follows an example from Quinlan’s ID3 [Table: 14 training examples with attributes such as age, student, and credit_rating, and class label buys_computer]
Output: A Decision Tree for “buys_computer”
age?
  <=30 → student?  (no → no, yes → yes)
  30..40 → yes
  >40 → credit rating?  (excellent → no, fair → yes)
Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left
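A compact Python sketch of this greedy top-down procedure for categorical attributes, using the information-gain score defined on the following slides; the data representation (each example as a dict of attribute values) and function names are my own.

import math
from collections import Counter

def entropy(labels):
    """I(p, n) generalized to any number of classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning on attribute attr."""
    n = len(labels)
    expected = 0.0
    for v in set(r[attr] for r in rows):
        sub = [labels[i] for i, r in enumerate(rows) if r[attr] == v]
        expected += (len(sub) / n) * entropy(sub)
    return entropy(labels) - expected

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                         # all samples in one class
        return labels[0]
    if not attrs:                                     # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for v in set(r[best] for r in rows):              # partition on the selected attribute
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return tree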
Information Gain (ID3/C4.5) • Select the attribute with the highest information gain • Assume there are two classes, P and N • Let the set of examples S contain p elements of class P and n elements of class N • The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as I(p, n) = –(p/(p+n)) log2(p/(p+n)) – (n/(p+n)) log2(n/(p+n)) e.g. I(0.5,0.5)=1; I(0.9,0.1)=0.47; I(0.99,0.01)=0.08
Information Gain in Decision Tree Induction • Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv} • If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is E(A) = Σi ((pi + ni)/(p + n)) I(pi, ni) • The encoding information that would be gained by branching on A is Gain(A) = I(p, n) – E(A)
Attribute Selection by Information Gain Computation Class P: buys_computer = “yes” Class N: buys_computer = “no” I(p, n) = I(9, 5) = 0.940 Compute the entropy for age: E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 Hence Gain(age) = I(9, 5) – E(age) = 0.246 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is chosen as the splitting attribute
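As a numeric check of these values (assuming the standard counts of 2/3, 4/0, and 3/2 yes/no examples in the three age groups of the buys_computer data), a few lines of Python:

import math

def I(p, n):
    def term(x):
        return 0.0 if x == 0 else -(x / (p + n)) * math.log2(x / (p + n))
    return term(p) + term(n)

E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(f"{I(9, 5):.3f}")          # 0.940
print(f"{E_age:.3f}")            # 0.694
print(f"{I(9, 5) - E_age:.3f}")  # Gain(age) = 0.246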
Gini Index (IBM IntelligentMiner) • If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 – Σj pj², where pj is the relative frequency of class j in T • If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as ginisplit(T) = (N1/N) gini(T1) + (N2/N) gini(T2) • The attribute that provides the smallest ginisplit(T) is chosen to split the node
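A minimal sketch of these two formulas for a binary split; the labels in the usage example are made up.

from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini index of a two-way split T -> (T1, T2)."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return (n1 / n) * gini(left) + (n2 / n) * gini(right)

# toy usage: one candidate split of 14 labels into two subsets
left = ["yes"] * 6 + ["no"] * 1
right = ["yes"] * 3 + ["no"] * 4
print(gini_split(left, right))   # the split with the smallest value is chosen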
Avoid Overfitting in Classification • The generated tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • The result is poor accuracy for unseen samples • Two approaches to avoid overfitting • Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees • Use a set of data different from the training data to decide which is the “best pruned tree”
Approaches to Determine the Final Tree Size • Separate training (2/3) and testing (1/3) sets • Use cross-validation, e.g., 10-fold cross-validation • Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
Nearest Neighbor Methods • k-NN assigns an unknown object to the most common class of its k nearest neighbors • Choice of k? (bias-variance tradeoff again) • Choice of metric? • Need all the training data to be present to classify a new point (“lazy methods”) • Surprisingly strong asymptotic results (e.g. asymptotically the error rate of 1-NN is at most twice the Bayes error rate, so no decision rule can cut its error by more than half)
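A minimal k-NN sketch under the Euclidean metric; the choice k = 3 and the toy data are purely illustrative.

import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Assign x to the most common class among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy usage: note that all training points must be kept around ("lazy method")
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(np.array([0.2, 0.1]), X_train, y_train, k=3))   # "A"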
Naïve Bayes Classification Recall: p(ck | x) ∝ p(x | ck) p(ck) Now suppose the features are conditionally independent given the class: p(x | ck) = Πj p(xj | ck) Then: p(ck | x) ∝ p(ck) Πj p(xj | ck) Equivalently: log [ p(ck | x) / p(cl | x) ] = log [ p(ck) / p(cl) ] + Σj log [ p(xj | ck) / p(xj | cl) ], a sum of “weights of evidence” [Graphical model: class node C with arrows to x1, x2, …, xp]
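A minimal sketch for categorical features, assuming relative-frequency estimates with add-one (Laplace) smoothing, which is my addition rather than something on the slide.

import math
from collections import Counter, defaultdict

def fit_nb(rows, labels):
    """Estimate p(c_k) and p(x_j | c_k) by (smoothed) relative frequencies."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)               # (class, feature j) -> value counts
    values = [set() for _ in range(len(rows[0]))]    # observed values per feature
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond_counts[(c, j)][v] += 1
            values[j].add(v)
    return class_counts, cond_counts, values, len(labels)

def predict_nb(x, class_counts, cond_counts, values, n):
    """argmax_k of log p(c_k) + sum_j log p(x_j | c_k)."""
    best, best_score = None, -math.inf
    for c, nc in class_counts.items():
        score = math.log(nc / n)                     # log prior
        for j, v in enumerate(x):
            # add-one smoothed estimate of p(x_j = v | c_k)
            score += math.log((cond_counts[(c, j)][v] + 1) / (nc + len(values[j])))
        if score > best_score:
            best, best_score = c, score
    return best

# toy usage with two categorical features
rows = [("<=30", "yes"), ("<=30", "no"), (">40", "no"), (">40", "yes")]
labels = ["buy", "no_buy", "no_buy", "buy"]
print(predict_nb(("<=30", "yes"), *fit_nb(rows, labels)))   # "buy"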
Naïve Bayes (cont.) • Despite the crude conditional independence assumption, works well in practice (see Friedman, 1997 for a partial explanation) • Can be further enhanced with boosting, bagging, model averaging, etc. • Can relax the conditional independence assumptions in myriad ways (“Bayesian networks”)
Dietterich (1999) Analysis of 33 UCI datasets