Classification Methods: k-Nearest Neighbor and Naïve Bayes. Ram Akella, Lecture 4, February 9, 2011, UC Berkeley Silicon Valley Center/SC
Overview • Example • The Naïve rule • Two data-driven methods (no model) • K-nearest neighbors • Naïve Bayes
Example: Personal Loan Offer As part of its customer acquisition efforts, Universal Bank wants to run a campaign encouraging current customers to take out a personal loan. To improve target marketing, the bank wants to find the customers most likely to accept the offer. It uses data from a previous campaign on 5000 customers, 480 of whom accepted.
Personal Loan Data Description File: “UniversalBank KNN NBayes.xls”
The Naïve Rule • Classify a new observation as a member of the majority class • In the personal loan example, the majority of customers did not accept the loan
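As a quick sanity check on the naïve rule, the sketch below applies it to the campaign counts quoted earlier (5000 customers, 480 acceptors). The variable names are illustrative only; the point is that the rule predicts "no accept" for everyone and is therefore wrong exactly on the acceptors.

```python
# Naive rule: always predict the majority class.
n_customers = 5000   # customers in the previous campaign
n_accepted = 480     # customers who accepted the loan
n_rejected = n_customers - n_accepted

majority_class = "accept" if n_accepted > n_rejected else "no accept"

# Every member of the minority class is misclassified.
error_rate = min(n_accepted, n_rejected) / n_customers

print(majority_class)        # no accept
print(f"{error_rate:.1%}")   # 9.6%
```

Any data-driven classifier should be judged against this baseline error rate rather than against zero.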
K-Nearest Neighbor: Idea Find the k closest records to the one to be classified, and let them “vote”.
What does the algorithm do? • Computes the distance between the record to be classified and each of the records in the training set • Finds the k shortest distances • Computes the vote of these k neighbors • This is repeated for every record in the validation set
Experiment We have 100 training points: 60 pink and 40 blue. We also have 50 test points. • Each test point is classified by a vote of its 5 nearest neighbors. • How do we measure how well the classifier did? We compare the predicted class with the actual class for each of the 50 points in the validation/test set.
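The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the XLMiner implementation: the function names, the Euclidean distance choice, and the tiny pink/blue data set are all assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    # Step 1: distance from the query to every training record.
    scored = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k shortest distances.
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    # Step 3: let the k neighbors vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, val_X, val_y, k):
    """Step 4: repeat for every validation record and compare with the truth."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y))
    return hits / len(val_X)

# Hypothetical toy data: two well-separated groups.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["pink", "pink", "pink", "blue", "blue", "blue"]
val_X = [(2, 2), (6, 5)]
val_y = ["pink", "blue"]

print(knn_predict(train_X, train_y, (2, 2), k=3))   # pink
print(accuracy(train_X, train_y, val_X, val_y, k=3))  # 1.0
```

Sorting all distances is the simplest approach; for large training sets a partial sort or a spatial index would be the usual refinement.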
Distance between 2 observations • Single variable case: each item has 1 value. • Customer 1 has income = 49K • Multivariate case: each observation is a vector of values. • Customer1 = (Age=25, Exp=1, Income=49, …, CC=0) • Customer2 = (Age=49, Exp=19, Income=34, …, CC=0) • The distance between observations i and j is denoted $d_{ij}$. • Distance requirements: • Non-negativity ($d_{ij} \ge 0$) • $d_{ii} = 0$ • Symmetry ($d_{ij} = d_{ji}$) • Triangle inequality ($d_{ij} + d_{jk} \ge d_{ik}$)
Types of Distances Notation: Example: • Customer1 = (Age=25, Exp=1, Inc=49, Fam=4, CCAvg=1.6) • Customer2 = (Age=49, Exp=19, Inc=34, Fam=3, CCAvg=1.5)
Euclidean Distance • The Euclidean distance between the age of customer1 (25) and customer2 (49): $\sqrt{(25-49)^2} = 24$ • The Euclidean distance between the two on the 5 dimensions (Age, Exper, Income, Family, CCAvg): $\sqrt{(25-49)^2 + (1-19)^2 + (49-34)^2 + (4-3)^2 + (1.6-1.5)^2} = \sqrt{1126.01} \approx 33.56$
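The two distances on this slide can be recomputed directly from the customer vectors given earlier. The sketch below assumes the attribute order (Age, Exper, Income, Family, CCAvg); the helper name `euclidean` is illustrative.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (Age, Exper, Income, Family, CCAvg), values taken from the example.
customer1 = (25, 1, 49, 4, 1.6)
customer2 = (49, 19, 34, 3, 1.5)

d_age = euclidean(customer1[:1], customer2[:1])   # age only
d_all = euclidean(customer1, customer2)           # all five dimensions

print(d_age)
print(round(d_all, 2))
```

Note how the age and experience differences dominate the five-dimensional result simply because they are measured on a larger scale than CCAvg.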
Which pair is closest? • Carry & Sam • Sam & Miranda • Carry & Miranda Carry & Sam: $\sqrt{(31779-32739)^2 + (36-40)^2} \approx 960.01$
Now, income is in $000. Which pair is closest? • Carry & Sam • Sam & Miranda • Carry & Miranda Sam & Miranda: $\sqrt{(32.739-33.88)^2 + (40-38)^2} \approx 2.30$
Why do we need to standardize the variables? The distance measure is influenced by the units of the different variables, especially if there is a wide variation in units. Variables with “larger” units will influence the distances more than others. The solution: standardize each variable before measuring distances!
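The effect of units can be checked on the Carry, Sam, and Miranda figures from the two preceding slides (income and age). The sketch below is illustrative: the function names are made up, and z-scores with the sample standard deviation are one common standardization choice, assumed here.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# (income in $000, age), values taken from the example slides.
thousands = {"Carry": (31.779, 36), "Sam": (32.739, 40), "Miranda": (33.88, 38)}
# Same people with income expressed in dollars.
dollars = {name: (inc * 1000, age) for name, (inc, age) in thousands.items()}

def closest_pair(data):
    """Return the pair of names with the smallest Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(data[a], data[b])))
    return min(combinations(data, 2), key=lambda pair: dist(*pair))

def standardize(data):
    """Replace each variable by its z-score: (value - mean) / std deviation."""
    columns = list(zip(*data.values()))
    stats = [(mean(col), stdev(col)) for col in columns]
    return {
        name: tuple((v - m) / s for v, (m, s) in zip(values, stats))
        for name, values in data.items()
    }

print(closest_pair(dollars))                 # ('Carry', 'Sam')
print(closest_pair(thousands))               # ('Sam', 'Miranda')
print(closest_pair(standardize(dollars)))    # ('Sam', 'Miranda')
print(closest_pair(standardize(thousands)))  # ('Sam', 'Miranda')
```

The raw answer flips when income changes units, while the standardized answer does not depend on the units at all.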
Other distances • Squared Euclidean distance • Correlation-based distance: the correlation $r_{ij}$ between two vectors of (standardized) items/observations measures their similarity. We can define a distance measure as $d_{ij} = 1 - r_{ij}^2$ • Statistical distance (no need to standardize). The only measure that accounts for covariance! • Manhattan distance (“city-block”) Note: some software uses “similarities” instead of “distances”.
Distances for Binary Data • Obtained from the 2x2 table of counts:
        0    1
   0    a    b
   1    c    d
Choosing the number of neighbors (k) • Too small: under-smoothing • Too large: over-smoothing • Typically k < 20 • k should be odd (to avoid ties) Solution: use the validation set to find the “best” k
Output We’re using the validation data here to choose the best k
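Choosing k on the validation set can be sketched as a simple loop over odd candidate values. Everything here is an assumption for illustration: the one-dimensional toy data (with one deliberately mislabeled training point at x = 3), the candidate list, and the function names.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D data)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

# Hypothetical training set: class 0 for x <= 5, class 1 for x > 5,
# except x = 3, which is noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
validation = [(2.6, 0), (3.4, 0), (7.5, 1), (5.4, 0), (5.6, 1)]

candidates = [1, 3, 5]   # odd values only, to avoid tied votes
scores = {k: validation_accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])

print(scores)   # {1: 0.6, 3: 1.0, 5: 0.8}
print(best_k)   # 3
```

In this toy case k = 1 chases the noisy point (under-smoothing) and k = 5 blurs the class boundary (over-smoothing). Because the validation set was used to pick k, a separate test set is still needed to estimate the final error.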
Advantages and Disadvantages of K nearest neighbors • The good • Very flexible, data-driven • Simple • With a large amount of data, where predictor levels are well represented, has good performance • Can also be used for continuous y: instead of voting, take the average of the neighbors (XLMiner: Prediction > K-NN) • The bad • No insight about the importance/role of each predictor • Beware of over-fitting! Need a test set • Can be computationally intensive for large k • Needs LOTS of data (exponential in the number of predictors)
Conditional Probability - reminder • A = the event “customer accepts loan” • B = the event “customer has credit card” • $P(A \mid B)$ denotes the probability of A given B (the conditional probability that A occurs given that B occurred): $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, if $P(B) > 0$
Naïve Bayes • Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. • It calculates the probability that a point E belongs to a certain class $C_i$ based on its attributes $(x_1, x_2, \ldots, x_n)$ • It assumes that the attributes are conditionally independent given the class $C_i$ [Figure: class node C with arrows to the attribute nodes x1, x2, …, xn]
Illustrative Example • The example E is represented by a set of attribute values $(x_1, x_2, \ldots, x_n)$, where $x_i$ is the value of attribute $X_i$. Let C represent the classification variable, and let c be the value of C. • In this example we assume that there are only two classes: + (the positive class) or − (the negative class). • A classifier is a function that assigns a class label to an example. From the probability perspective, according to Bayes' rule, the probability of an example $E = (x_1, x_2, \ldots, x_n)$ being of class c is $p(c \mid E) = \frac{p(E \mid c)\, p(c)}{p(E)}$
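Naïve Bayes Classifier E is classified as the class C = + if and only if: $f_b(E) = \frac{p(C = + \mid E)}{p(C = - \mid E)} \ge 1$, where $f_b(E)$ is called a Bayesian classifier. Assume that all attributes are independent given the value of the class variable, that is: $p(E \mid c) = p(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} p(x_i \mid c)$. The resulting classifier is: $f_{nb}(E) = \frac{p(C = +)}{p(C = -)} \prod_{i=1}^{n} \frac{p(x_i \mid C = +)}{p(x_i \mid C = -)}$. The function $f_{nb}(E)$ is called a naive Bayesian classifier, or simply naive Bayes (NB).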
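A naive Bayes classifier for categorical attributes can be sketched by estimating every probability as a relative frequency. The sketch assumes the ratio form f_nb(E) = [p(+)/p(−)] · ∏ p(x_i|+)/p(x_i|−) with decision threshold 1; the ten (CC, Online) records below are hypothetical, and no smoothing is applied, so every attribute value must occur in both classes.

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for record, c in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(c, i)][value] += 1
    return class_counts, value_counts

def f_nb(example, class_counts, value_counts, pos="+", neg="-"):
    """Prior ratio times the product of per-attribute likelihood ratios."""
    ratio = class_counts[pos] / class_counts[neg]
    for i, value in enumerate(example):
        p_pos = value_counts[(pos, i)][value] / class_counts[pos]
        p_neg = value_counts[(neg, i)][value] / class_counts[neg]
        ratio *= p_pos / p_neg   # assumes p_neg > 0 (no smoothing in this sketch)
    return ratio

# Hypothetical records: (CC, Online) with class + = accept, - = no accept.
records = [(1, 1), (1, 1), (1, 0), (0, 1),
           (0, 0), (0, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
labels = ["+", "+", "+", "+", "-", "-", "-", "-", "-", "-"]

class_counts, value_counts = train_nb(records, labels)
for example in [(1, 1), (0, 0)]:
    score = f_nb(example, class_counts, value_counts)
    print(example, round(score, 3), "+" if score >= 1 else "-")
```

Only per-attribute counts are needed, so the classifier can score a combination of values even if that exact combination is rare in the training data.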
Augmented Naïve Bayes • Naive Bayes is the simplest form of Bayesian network, in which all attributes are independent given the value of the class variable. • This conditional independence assumption is rarely true in most real-world applications. • A straightforward approach to overcome the limitation of naive Bayes is to extend its structure to represent explicitly the dependencies among attributes.
Augmented Naïve Bayes An augmented naive Bayes (ANB) is an extended classifier in which the class node directly points to all attribute nodes, and there exist links among attribute nodes. An ANB G represents the joint probability distribution: $p_G(x_1, \ldots, x_n, c) = p(c) \prod_{i=1}^{n} p(x_i \mid pa(x_i), c)$, where $pa(x_i)$ denotes an assignment to values of the parents of $X_i$. [Figure: class node C with arrows to the attribute nodes x1, x2, …, xn-1, xn, plus links among attribute nodes]
Why does this classifier work? • The basic idea comes from the following observation: in a given dataset, two attributes may depend on each other, but the dependence may distribute evenly in each class. • Clearly, in this case, the conditional independence assumption is violated, but naive Bayes is still the optimal classifier. • What eventually affects the classification is the combination of dependencies among all attributes. • If we just look at two attributes, there may exist strong dependence between them that affects the classification. • When the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification.
Why does this classifier work? Definition 1: Given an example E, two classifiers $f_1$ and $f_2$ are said to be equal under zero-one loss on E if $f_1(E) \ge 1$ if and only if $f_2(E) \ge 1$, denoted by $f_1(E) \doteq f_2(E)$. If $f_1(E) \doteq f_2(E)$ for every example E in the example space, $f_1$ and $f_2$ are said to be equal under zero-one loss.
Local Dependence Distribution Definition 2: For a node X on an ANB G, the local dependence derivatives of X in classes + and − are defined as: $dd^+_G(x \mid pa(x)) = \frac{p(x \mid pa(x), +)}{p(x \mid +)}$ and $dd^-_G(x \mid pa(x)) = \frac{p(x \mid pa(x), -)}{p(x \mid -)}$ • $dd^+_G(x \mid pa(x))$ reflects the strength of the local dependence of node X in class +; it measures the influence of X’s local dependence on the classification in class +. • $dd^-_G(x \mid pa(x))$ is similar for the negative class.
Local Dependence Distribution • When X has no parent, $dd^+_G(x \mid pa(x)) = dd^-_G(x \mid pa(x)) = 1$. • When $dd^+_G(x \mid pa(x)) \ge 1$, X’s local dependence in class + supports the classification of C = +. Otherwise, it supports the classification of C = −. • When $dd^-_G(x \mid pa(x)) \ge 1$, X’s local dependence in class − supports the classification of C = −. Otherwise, it supports the classification of C = +.
Local Dependence Distribution • When the local dependence derivatives in the two classes support different classifications, the local dependencies in the two classes partially cancel each other out. The final classification that the local dependence supports is the class with the greater local dependence derivative. • When the local dependence derivatives in the two classes support the same classification, the local dependencies in the two classes work together to support that classification.
Local Dependence Derivative Ratio Definition 3: For a node X on an ANB G, the local dependence derivative ratio at node X, denoted by $ddr_G(x)$, is defined by: $ddr_G(x) = \frac{dd^+_G(x \mid pa(x))}{dd^-_G(x \mid pa(x))}$. $ddr_G(x)$ quantifies the influence of X’s local dependence on the classification.
Local Dependence Derivative Ratio We have: • If X has no parents, $ddr_G(x) = 1$. • If $dd^+_G(x \mid pa(x)) = dd^-_G(x \mid pa(x))$, then $ddr_G(x) = 1$. This means that X’s local dependence distributes evenly in class + and class −. Thus, the dependence does not affect the classification, no matter how strong it is. • If $ddr_G(x) > 1$, X’s local dependence in class + is stronger than that in class −. $ddr_G(x) < 1$ means the opposite.
Global Dependence Distribution Let us explore under what condition an ANB works exactly the same as its corresponding naive Bayes. Theorem 1: Given an ANB G and its corresponding naive Bayes $G_{nb}$ (i.e., G with all the arcs among attribute nodes removed) on attributes $X_1, X_2, \ldots, X_n$, let $f_b$ and $f_{nb}$ be the classifiers corresponding to G and $G_{nb}$, respectively. For a given example $E = (x_1, x_2, \ldots, x_n)$, the following equation holds: $f_b(x_1, \ldots, x_n) = f_{nb}(x_1, \ldots, x_n) \prod_{i=1}^{n} ddr_G(x_i)$, where $\prod_{i=1}^{n} ddr_G(x_i)$ is called the dependence distribution factor at example E, denoted by $DF_G(E)$.
Global Dependence Distribution • Proof:
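The slide's proof body did not survive extraction. One way to see the result is sketched below, under these assumptions: the ANB factorizes as $p_G(x_1,\ldots,x_n,c) = p(c)\prod_i p(x_i \mid pa(x_i), c)$, the local dependence derivatives are $dd^{\pm}_G(x \mid pa(x)) = p(x \mid pa(x), \pm)/p(x \mid \pm)$, and $ddr_G(x)$ is their ratio. This is a reconstruction from those definitions, not necessarily the argument shown in the lecture.

```latex
\begin{align*}
f_b(E) &= \frac{p_G(x_1,\ldots,x_n,+)}{p_G(x_1,\ldots,x_n,-)}
        = \frac{p(+)}{p(-)} \prod_{i=1}^{n}
          \frac{p(x_i \mid pa(x_i), +)}{p(x_i \mid pa(x_i), -)} \\
       &= \frac{p(+)}{p(-)} \prod_{i=1}^{n}
          \frac{p(x_i \mid +)\, dd^+_G(x_i \mid pa(x_i))}
               {p(x_i \mid -)\, dd^-_G(x_i \mid pa(x_i))} \\
       &= \left[ \frac{p(+)}{p(-)} \prod_{i=1}^{n}
          \frac{p(x_i \mid +)}{p(x_i \mid -)} \right]
          \prod_{i=1}^{n} ddr_G(x_i)
        = f_{nb}(E)\, DF_G(E).
\end{align*}
```

The second line uses only the identity $p(x \mid pa(x), \pm) = p(x \mid \pm)\, dd^{\pm}_G(x \mid pa(x))$, which is the definition of the local dependence derivative rearranged.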
Global Dependence Distribution Theorem 2: Given an example $E = (x_1, x_2, \ldots, x_n)$, an ANB G is equal to its corresponding naive Bayes $G_{nb}$ under zero-one loss if and only if either $f_b(E) \ge 1$ and $DF_G(E) \le f_b(E)$, or $f_b(E) < 1$ and $DF_G(E) > f_b(E)$.
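Both theorems can be checked numerically on a small ANB. The sketch below assumes a hypothetical network with two binary attributes, where X1 is a parent of X2 and the class is a parent of both; all probabilities are made up for illustration. It also assumes the definitions dd± = p(x|pa(x),±)/p(x|±) and DF_G(E) as the product of the ratios dd+/dd−.

```python
import math
from itertools import product

P_POS = 0.4                                   # p(C = +)
P_X1 = {"+": 0.7, "-": 0.2}                   # p(x1 = 1 | c)
P_X2 = {"+": {1: 0.9, 0: 0.3},                # p(x2 = 1 | x1, c)
        "-": {1: 0.6, 0: 0.1}}

def bern(p_one, value):
    """Probability of a binary value given p(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def p_x2_given_class(x2, c):
    """Marginalize the parent x1 out of p(x2 | x1, c)."""
    return sum(bern(P_X2[c][x1], x2) * bern(P_X1[c], x1) for x1 in (0, 1))

def classifiers(x1, x2):
    """Return (f_b, f_nb, DF) for the example E = (x1, x2)."""
    prior = P_POS / (1.0 - P_POS)
    r1 = bern(P_X1["+"], x1) / bern(P_X1["-"], x1)
    f_b = prior * r1 * bern(P_X2["+"][x1], x2) / bern(P_X2["-"][x1], x2)
    f_nb = prior * r1 * p_x2_given_class(x2, "+") / p_x2_given_class(x2, "-")
    dd_pos = bern(P_X2["+"][x1], x2) / p_x2_given_class(x2, "+")
    dd_neg = bern(P_X2["-"][x1], x2) / p_x2_given_class(x2, "-")
    df = dd_pos / dd_neg                      # ddr(x1) = 1: X1 has no attribute parent
    return f_b, f_nb, df

results = {e: classifiers(*e) for e in product((0, 1), repeat=2)}
for example, (f_b, f_nb, df) in results.items():
    same_class = (f_b >= 1) == (f_nb >= 1)
    condition = (f_b >= 1 and df <= f_b) or (f_b < 1 and df > f_b)
    print(example, round(f_b, 3), round(f_nb, 3), round(df, 3), same_class, condition)
```

In this particular network the dependence between X1 and X2 is strong (DF differs from 1 for every example), yet naive Bayes assigns the same class as the full ANB everywhere.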
Global Dependence Distribution Applying Theorem 2, we have the following results: 1. When $DF_G(E) = 1$, the dependencies in ANB G have no influence on the classification. • The classification of G is exactly the same as that of its corresponding naive Bayes $G_{nb}$. • There are three cases in which $DF_G(E) = 1$: • no dependence exists among attributes. • for each attribute X on G, $ddr_G(x) = 1$; that is, the local dependence of each node distributes evenly in both classes. • the influence of the local dependencies that support classifying E into C = + is canceled out by the influence of the local dependencies that support classifying E into C = −.
Global Dependence Distribution 2. $f_b(E) \doteq f_{nb}(E)$ (equality under zero-one loss) does not require that $DF_G(E) = 1$. The precise condition is given by Theorem 2. That explains why naive Bayes still produces accurate classification even in datasets with strong dependencies among attributes (Domingos & Pazzani 1997). 3. The dependencies in an ANB flip (change) the classification of its corresponding naive Bayes only if the condition given by Theorem 2 is no longer true.
Conditions for the optimality of the Naïve Bayes The naive Bayes classifier is optimal if the dependencies among attributes cancel each other out. • The classifier is still optimal even though the dependencies do exist
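Optimality of the Naïve Bayes Example: We have two attributes $X_1$ and $X_2$, and assume that the class density is a multivariate Gaussian in both the positive and negative classes. That is: $p(x \mid +) = \frac{1}{2\pi |\Sigma^+|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu^+)^T (\Sigma^+)^{-1} (x-\mu^+)\right)$ and $p(x \mid -) = \frac{1}{2\pi |\Sigma^-|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu^-)^T (\Sigma^-)^{-1} (x-\mu^-)\right)$, where • $x = (x_1, x_2)$ • $\Sigma^+$ and $\Sigma^-$ are the covariance matrices in the positive and negative classes respectively, • $|\Sigma^+|$ and $|\Sigma^-|$ are the determinants of $\Sigma^+$ and $\Sigma^-$, • $(\Sigma^+)^{-1}$ and $(\Sigma^-)^{-1}$ are the inverses of $\Sigma^+$ and $\Sigma^-$, • $\mu^+ = (\mu^+_1, \mu^+_2)$ and $\mu^- = (\mu^-_1, \mu^-_2)$, • $\mu^+_i$ and $\mu^-_i$ are the means of attribute $X_i$ in the positive and negative classes respectively, • $(x-\mu^+)^T$ and $(x-\mu^-)^T$ are the transposes of $(x-\mu^+)$ and $(x-\mu^-)$.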
Optimality of the Naïve Bayes We assume: • The two classes have a common covariance matrix $\Sigma^+ = \Sigma^- = \Sigma$, • $X_1$ and $X_2$ have the same variance $\sigma$ in both classes. Then, applying a logarithm to the Bayesian classifier defined previously, we obtain the following classifier $f_b$:
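Optimality of the Naïve Bayes • Then, because of the conditional independence assumption, we have the corresponding naive Bayesian classifier $f_{nb}$ • Assume that $\Sigma = \begin{pmatrix} \sigma & \sigma_{12} \\ \sigma_{12} & \sigma \end{pmatrix}$ • $X_1$ and $X_2$ are independent if $\sigma_{12} = 0$. If $\sigma \ne \sigma_{12}$, we have: $\Sigma^{-1} = \frac{1}{\sigma^2 - \sigma_{12}^2} \begin{pmatrix} \sigma & -\sigma_{12} \\ -\sigma_{12} & \sigma \end{pmatrix}$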
Optimality of the Naïve Bayes An example E is classified into the positive class by $f_b$ if and only if $f_b \ge 0$; $f_{nb}$ is similar. When $f_b$ or $f_{nb}$ is divided by a positive constant, the resulting classifier makes the same classifications as $f_b$ or $f_{nb}$. Then
Optimality of the Naïve Bayes where $a = -\frac{1}{\sigma^2}(\mu^+ + \mu^-)\Sigma^{-1}(\mu^+ - \mu^-)$ is a constant independent of x. For any $x_1$ and $x_2$, naive Bayes has the same classification as the underlying classifier if:
Optimality of the Naïve Bayes • This is:
Optimality of the Naïve Bayes Assuming that: We can simplify the equation to: where
Optimality of the Naïve Bayes The shaded area of the figure shows the region in which the Naïve Bayes Classifier is optimal
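The idea of a region where naive Bayes is optimal can be explored numerically. The sketch below does not reproduce the slide's closed-form condition or figure. It assumes equal class priors and a common covariance matrix with equal variances `s` and covariance `s12`, computes the log ratio of the two Gaussian class densities directly, and compares the sign of the full classifier with the sign of the naive Bayes version (covariance set to 0) on a grid. Means, covariances, and grid are hypothetical.

```python
def inverse_cov(s, s12):
    """Inverse of [[s, s12], [s12, s]]; requires s*s != s12*s12."""
    det = s * s - s12 * s12
    return [[s / det, -s12 / det], [-s12 / det, s / det]]

def quad_form(x, mu, inv):
    """(x - mu)^T inv (x - mu) for 2-D vectors."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return d0 * (inv[0][0] * d0 + inv[0][1] * d1) + d1 * (inv[1][0] * d0 + inv[1][1] * d1)

def log_ratio(x, mu_pos, mu_neg, s, s12):
    """log p(x|+) - log p(x|-); normalizing constants cancel (common covariance)."""
    inv = inverse_cov(s, s12)
    return -0.5 * quad_form(x, mu_pos, inv) + 0.5 * quad_form(x, mu_neg, inv)

def agreement(mu_pos, mu_neg, s, s12, grid):
    """Fraction of grid points where naive Bayes and the full classifier agree."""
    same = 0
    for x in grid:
        f_b = log_ratio(x, mu_pos, mu_neg, s, s12)    # uses the true covariance
        f_nb = log_ratio(x, mu_pos, mu_neg, s, 0.0)   # naive Bayes: ignores s12
        same += (f_b >= 0) == (f_nb >= 0)
    return same / len(grid)

# Grid offset by 0.25 in x1 so that no point lies exactly on a decision boundary.
grid = [(i + 0.25, j) for i in range(-3, 3) for j in range(-3, 4)]

# Symmetric means: the dependence does not move the decision boundary.
print(agreement((1, 1), (-1, -1), s=1.0, s12=0.8, grid=grid))   # 1.0
# Means differing only in x1: the dependence tilts the boundary.
print(agreement((1, 0), (-1, 0), s=1.0, s12=0.8, grid=grid))
```

With the symmetric means, both classifiers split the plane along x1 + x2 = 0, so naive Bayes remains optimal despite a correlation of 0.8. In the second configuration the two boundaries differ and the classifiers disagree on part of the grid.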
Example with 2 predictors: CC, Online The numerator of P(accept=1 | CC=1, Online=1) is P(CC=1, Online=1 | accept=1) · P(accept=1) = (50/286) · (286/3000)
P(CC=1, Online=1 | accept=0) is approximately: • 50/286 • 1 − 50/286 • 461/3000 • 461/(3000−286) • 129/(3000−286)
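Example with 2 predictors: CC, Online $P(\text{accept}=1 \mid \text{CC}=1, \text{Online}=1) = \frac{P(\text{CC}=1, \text{Online}=1 \mid \text{accept}=1)\, P(\text{accept}=1)}{P(\text{CC}=1, \text{Online}=1 \mid \text{accept}=1)\, P(\text{accept}=1) + P(\text{CC}=1, \text{Online}=1 \mid \text{accept}=0)\, P(\text{accept}=0)}$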
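The exact Bayes calculation can be sketched from counts. The figures 3000, 286, and 50 appear in the example and are read here as the number of training records, the number of acceptors, and the number of acceptors with CC=1 and Online=1. The count of 461 non-acceptors with CC=1 and Online=1 is an assumption, borrowed from one of the quiz options, because the pivot table itself is not in the transcript.

```python
n_train = 3000             # training records
n_accept = 286             # acceptors among them
n_accept_cc_online = 50    # acceptors with CC=1 and Online=1
n_reject_cc_online = 461   # ASSUMPTION: non-acceptors with CC=1 and Online=1

n_reject = n_train - n_accept

p_accept = n_accept / n_train
p_reject = n_reject / n_train
p_evidence_given_accept = n_accept_cc_online / n_accept
p_evidence_given_reject = n_reject_cc_online / n_reject

# Bayes' rule with two classes.
numerator = p_evidence_given_accept * p_accept
denominator = numerator + p_evidence_given_reject * p_reject
posterior = numerator / denominator

print(round(posterior, 4))
```

Under these counts the class sizes cancel and the posterior reduces to 50 / (50 + 461), the share of acceptors among all training customers with CC=1 and Online=1.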
The practical difficulty • We need to have ALL the combinations of predictor categories • CC=1,Online=1 • CC=1, Online=0 • CC=0, Online=1 • CC=0, Online=0 • With many predictors, this is pretty unlikely