Classification

Classification CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu

Bayesian Classification: Why? • Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. • Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics • Let X be a data sample whose class label is unknown • Let H be a hypothesis that X belongs to class C • For classification problems, determine P(H/X): the probability that the hypothesis holds given the observed data sample X • P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data, reflects the background knowledge) • P(X): probability that sample data is observed • P(X|H) : probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem • Given training data X, posteriori probability of a hypothesis H, P(H|X) follows the Bayes theorem • Informally, this can be written as posterior =likelihood x prior / evidence • MAP (maximum posteriori) hypothesis • Practical difficulty: require initial knowledge of many probabilities, significant computational cost

Naïve Bayes Classifier • A simplified assumption: attributes are conditionally independent: • The product of occurrence of say 2 elements x1 and x2, given the current class is C, is the product of the probabilities of each element taken separately, given the same class P([y1,y2],C) = P(y1,C) * P(y2,C) • No dependence relation between attributes • Greatly reduces the computation cost, only count the class distribution. • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)*P(Ci)

Training dataset Class: C1:buys_computer= ‘yes’ C2:buys_computer= ‘no’ Data sample X =(age<=30, Income=medium, Student=yes Credit_rating= Fair)

Naïve Bayesian Classifier: Example • Compute P(X/Ci) for each class P(age=“<30” | buys_computer=“yes”) = 2/9=0.222 P(age=“<30” | buys_computer=“no”) = 3/5 =0.6 P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444 P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4 P(student=“yes” | buys_computer=“yes”)= 6/9 =0.667 P(student=“yes” | buys_computer=“no”)= 1/5=0.2 P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667 P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4 X=(age<=30 ,income =medium, student=yes,credit_rating=fair) P(X|Ci) : P(X|buys_computer=“yes”)= 0.222 x 0.444 x 0.667 x 0.667 =0.044 P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019 P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028 P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007 X belongs to class “buys_computer=yes”

Naïve Bayesian Classifier: Comments • Advantages : • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence , therefore loss of accuracy • Practically, dependencies exist among variables • E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc • Dependencies among these cannot be modeled by Naïve Bayesian Classifier • How to deal with these dependencies? • Bayesian Belief Networks

Y Z P Bayesian Networks • Bayesian belief network allows a subset of the variables conditionally independent • A graphical model of causal relationships • Represents dependency among the variables • Gives a specification of joint probability distribution • Nodes: random variables • Links: dependency • X,Y are the parents of Z, and Y is the parent of P • No dependency between Z and P • Has no loops or cycles X

Bayesian Belief Network: An Example Family History Smoker (FH, ~S) (~FH, S) (~FH, ~S) (FH, S) LC 0.7 0.8 0.5 0.1 LungCancer Emphysema ~LC 0.3 0.2 0.5 0.9 The conditional probability table for the variable LungCancer: Shows the conditional probability for each possible combination of its parents PositiveXRay Dyspnea Bayesian Belief Networks

Learning Bayesian Networks • Several cases • Given both the network structure and all variables observable: learn only the CPTs • Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning • Network structure unknown, all variables observable: search through the model space to reconstruct graph topology • Unknown structure, all hidden variables: no good algorithms known for this purpose • D. Heckerman, Bayesian networks for data mining

Customer buys both Customer buys diaper Customer buys beer Association Rules • Itemset X = {x1, …, xk} • Find all the rules X  Ywith minimum support and confidence • support, s, is the probability that a transaction contains X  Y • confidence,c, is the conditional probability that a transaction having X also contains Y • Let supmin = 50%, confmin = 50% • Association rules: • A  C (60%, 100%) • C  A (60%, 75%)

Classification based on Association • Classification rule mining versus Association rule mining • Aim • A small set of rules as classifier • All rules according to minsup and minconf • Syntax • X  y • X Y

Why & How to Integrate • Both classification rule mining and association rule mining are indispensable to practical applications. • The integration is done by focusing on a special subset of association rules whose right-hand-side are restricted to the classification class attribute. • CARs: class association rules

CBA: Three Steps • Discretize continuous attributes, if any • Generate all class association rules (CARs) • Build a classifier based on the generated CARs.

Our Objectives • To generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. • To build a classifier from the CARs.

Rule Generator: Basic Concepts • Ruleitem <condset, y> :condset is a set of items, y is a class label Each ruleitem represents a rule: condset->y • condsupCount • The number of cases in D that contain condset • rulesupCount • The number of cases in D that contain the condset and are labeled with class y • Support=(rulesupCount/|D|)*100% • Confidence=(rulesupCount/condsupCount)*100%

RG: Basic Concepts (Cont.) • Frequent ruleitems • A ruleitem is frequent if its support is above minsup • Accurate rule • A rule is accurate if its confidence is above minconf • Possible rule • For all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of this set of ruleitems. • The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate.

RG: An Example • A ruleitem:<{(A,1),(B,1)},(class,1)> • assume that • the support count of the condset (condsupCount) is 3, • the support of this ruleitem (rulesupCount) is 2, and • |D|=10 • then (A,1),(B,1) -> (class,1) • supt=20% (rulesupCount/|D|)*100% • confd=66.7% (rulesupCount/condsupCount)*100%

RG: The Algorithm 1 F 1 = {large 1-ruleitems}; 2 CAR 1 = genRules (F 1 ); 3 prCAR 1 = pruneRules (CAR 1 ); //count the item and class occurrences to determine the frequent 1-ruleitems and prune it 4 for (k = 2; F k-1Ø; k++) do • C k= candidateGen (F k-1 ); //generate the candidate ruleitems Ck using the frequent ruleitems Fk-1 6 for each data case d D do //scan the database • C d = ruleSubset (C k, d); //find all the ruleitems in Ck whose condsets are supported by d 8 for each candidate c C d do 9 c.condsupCount++; 10 if d.class = c.class then c.rulesupCount++; //update various support counts of the candidates in Ck 11 end 12 end

RG: The Algorithm(cont.) • F k = {c C k| c.rulesupCountminsup}; //select those new frequent ruleitems to form Fk 14 CAR k = genRules(F k); //select the ruleitems both accurate andfrequent 15 prCAR k= pruneRules(CAR k); 16 end 17 CARs = k CAR k; 18 prCARs = k prCAR k;

Small Margin Large Margin Support Vectors SVM – Support Vector Machines

SVM – Cont. • Linear Support Vector Machine Given a set of points with label The SVM finds a hyperplane defined by the pair (w,b) (where w is the normal to the plane and b is the distance from the origin) s.t. x – feature vector, b- bias, y- class label, 2/||w|| - margin

SVM – Cont.

(0,1) + + - + -1 0 +1 - + (1,0) (0,0) SVM – Cont. • What if the data is not linearly separable? • Project the data to high dimensional space where it is linearly separable and then we can use linear SVM – (Using Kernels)

Non-Linear SVM Classification using SVM (w,b) In non linear case we can see this as Kernel – Can be thought of as doing dot product in some high dimensional space

Example of Non-linear SVM

Results

SVM Related Links • http://svm.dcs.rhbnc.ac.uk/ • http://www.kernel-machines.org/ • C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. • SVMlight – Software (in C) http://ais.gmd.de/~thorsten/svm_light • BOOK: An Introduction to Support Vector MachinesN. Cristianini and J. Shawe-TaylorCambridge University Press

Classification

Classification

Presentation Transcript

Classification

Classification

Classification

Classification

Classification

Classification

Classification

Classification

CLASSIFICATION

Classification

Classification Techniques: Bayesian Classification

CLASSIFICATION

Classification

Classification

Classification

Classification

Classification