Overview: Bayesian Probability, Bayes' Rule, Naïve Bayesian Classification (Data Mining: Concepts and Techniques)
Probability • Let P(A) represent the probability that proposition A is true. • Example: Let Risky represent that a customer is a high credit risk. • P(Risky) = 0.519 means that there is a 51.9% chance that a given customer is a high credit risk. • Without any other information, this probability is called the prior or unconditional probability.
Random Variables • We can also consider a random variable X, which can take on one of many values in its domain <x1, x2, …, xn> • Example: Let Weather be a random variable with domain <sunny, rain, cloudy, snow>. • The probabilities of Weather taking on each of these values are P(Weather=sunny)=0.7, P(Weather=rain)=0.2, P(Weather=cloudy)=0.08, P(Weather=snow)=0.02
Conditional Probability • Probabilities of events change when we know something about the world • The notation P(A|B) is used to represent the conditional or posterior probability of A • Read "the probability of A given that all we know is B." • Example: P(Weather = snow | Temperature = below freezing) = 0.10
Axioms of Probability • All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1 • Necessarily true propositions have a probability of 1, necessarily false propositions a probability of 0: P(true) = 1, P(false) = 0 • The probability of a disjunction is given by P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Axioms of Probability • We can use logical connectives for probabilities, e.g. P(Weather = snow ∧ Temperature = below freezing) • Disjunction (or) and negation (not) can be used as well • The product rule: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes Theorem - 1 [Venn diagram: a rectangle of area 1 containing overlapping regions of area P(A) and P(B), whose intersection has area P(A ∧ B)] • Consider the Venn diagram at right. The area of the rectangle is 1, and the area of each region gives the probability of the event(s) associated with that region • P(A|B) means "the probability of observing event A given that event B has already been observed", i.e. how much of the time that we see B do we also see A? (the ratio of the intersection region to the P(B) region) • P(A|B) = P(A ∧ B)/P(B), and also P(B|A) = P(A ∧ B)/P(A), therefore P(A|B) = P(B|A)P(A)/P(B) (Bayes' formula for two events)
Bayes Theorem - 2 More formally, • Let X be the sample data (evidence) • Let H be a hypothesis that X belongs to class C • In classification problems we wish to determine the probability that H holds given the observed sample data X • i.e. we seek P(H|X), which is known as the posterior probability of H conditioned on X
Bayes Theorem - 3 • P(H) is the prior probability of H • Similarly, P(X|H) is the probability of observing X given that H holds (the likelihood of X conditioned on H) • Bayes' Theorem (from the earlier slide) is then P(H|X) = P(X|H) P(H) / P(X)
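The two-event form of Bayes' rule maps directly onto code. Below is a minimal Python sketch (not part of the original slides); only the prior P(Risky) = 0.519 comes from an earlier slide, while the LatePayment event and its probabilities are made-up values for illustration.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' rule for two events: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical evidence: P(LatePayment | Risky) = 0.8 and P(LatePayment) = 0.5
# are assumed values; P(Risky) = 0.519 is the prior from the earlier slide.
posterior = bayes(p_b_given_a=0.8, p_a=0.519, p_b=0.5)
print(f"P(Risky | LatePayment) = {posterior:.3f}")  # ≈ 0.830
```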
Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by backpropagation • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary
Bayesian Classification: Why? • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities • Foundation: Based on Bayes' Theorem. • Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem • Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D) • MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|D) = argmax_h P(D|h) P(h) • Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes • Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards Naïve Bayesian Classifier • Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn) • Suppose there are m classes C1, C2, …, Cm. • Classification derives the maximum posterior, i.e., the maximal P(Ci|X). This can be computed from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X) • Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) P(Ci)
Derivation of Naïve Bayes Classifier • A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes): P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) • This greatly reduces the computation cost: only the class distribution needs to be counted • If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (the # of tuples of Ci in D) • If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ: P(xk|Ci) = g(xk, μCi, σCi), where g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
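To make the continuous-attribute case concrete, here is a small Python sketch (ours, not from the slides) that estimates μ and σ for one attribute within a class and evaluates the Gaussian density; the income values are made-up illustrative numbers.

```python
import math

def gaussian(x: float, mu: float, sigma: float) -> float:
    """Gaussian density g(x, mu, sigma) used for continuous attributes."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical incomes (in $1000s) of the training tuples belonging to class Ci
incomes_in_ci = [38.0, 42.0, 45.0, 51.0, 54.0]
mu = sum(incomes_in_ci) / len(incomes_in_ci)
sigma = math.sqrt(sum((x - mu) ** 2 for x in incomes_in_ci) / len(incomes_in_ci))

# P(income = 40 | Ci) under the Gaussian assumption
print(gaussian(40.0, mu, sigma))
```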
Bayesian classification • The classification problem may be formalized using a posteriori probabilities: • P(C|X) = prob. that the sample tuple X = <x1, …, xk> is of class C. • E.g. P(class=N | outlook=sunny, windy=true, …) • Idea: assign to sample X the class label C such that P(C|X) is maximal
Play-tennis example: estimating P(xi|C)
Play-tennis example: classifying X • An unseen sample X = <rain, hot, high, false> • P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582 • P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286 • Sample X is classified in class n (don’t play)
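A quick way to verify this arithmetic is to multiply the estimates exactly; the following Python snippet (ours) reproduces the two scores from the slide.

```python
from fractions import Fraction as F

# Estimates for X = <rain, hot, high, false>, taken from the slide above
score_play      = F(3, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)   # P(X|p) * P(p)
score_dont_play = F(2, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)   # P(X|n) * P(n)

print(float(score_play), float(score_dont_play))   # ~0.010582 vs ~0.018286
print("class n (don't play)" if score_dont_play > score_play else "class p (play)")
```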
Training dataset • Class: C1: buys_computer = "yes", C2: buys_computer = "no" • Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classifier: Example • Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
• P(X|Ci): P(X|buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
• P(X|Ci) * P(Ci): P(X|buys_computer="yes") * P(buys_computer="yes") = 0.028
P(X|buys_computer="no") * P(buys_computer="no") = 0.007
• X therefore belongs to class buys_computer = "yes"
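As a quick Python check (ours, not the textbook's), the two scores can be recomputed from the estimates above, assuming the class priors P(yes) = 9/14 and P(no) = 5/14 implied by the 0.028 and 0.007 figures; normalizing the scores by their sum also recovers the actual posteriors, which the argmax rule itself never needs.

```python
# Conditional probability estimates read off the slide above
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # ≈ 0.044
p_x_given_no  = 0.6 * 0.4 * 0.2 * 0.4           # ≈ 0.019

# Assumed class priors: 9 "yes" and 5 "no" tuples out of 14
score_yes = p_x_given_yes * 9 / 14               # ≈ 0.028
score_no  = p_x_given_no * 5 / 14                # ≈ 0.007

print("buys_computer = yes" if score_yes > score_no else "buys_computer = no")
# Normalized posteriors: P(yes|X) ≈ 0.80, P(no|X) ≈ 0.20
print(score_yes / (score_yes + score_no), score_no / (score_yes + score_no))
```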
Example 3 • Take the following training data, from bank loan applicants:

ApplicantID  City    Children  Income  Status
1            Philly  Many      Medium  DEFAULTS
2            Philly  Many      Low     DEFAULTS
3            Philly  Few       Medium  PAYS
4            Philly  Few       High    PAYS

• As our attributes are all categorical in this case, we obtain our probabilities using simple counts and ratios:
• P[City=Philly | Status=DEFAULTS] = 2/2 = 1
• P[City=Philly | Status=PAYS] = 2/2 = 1
• P[Children=Many | Status=DEFAULTS] = 2/2 = 1
• P[Children=Few | Status=DEFAULTS] = 0/2 = 0
• etc.
Example 3 • Summarizing, we have the following conditional probabilities (each attribute value given the column's Status):

                 DEFAULTS     PAYS
City=Philly      2/2 = 1      2/2 = 1
Children=Many    2/2 = 1      0/2 = 0
Children=Few     0/2 = 0      2/2 = 1
Income=Low       1/2 = 0.5    0/2 = 0
Income=Medium    1/2 = 0.5    1/2 = 0.5
Income=High      0/2 = 0      1/2 = 0.5

and the class probabilities: P[Status = DEFAULTS] = 2/4 = 0.5, P[Status = PAYS] = 2/4 = 0.5
• For example, the probability of Income=Medium given that the applicant DEFAULTs = the number of applicants with Income=Medium who DEFAULT divided by the number of applicants who DEFAULT = 1/2 = 0.5
Example 3 • Now, assume a new example is presented where City=Philly, Children=Many, and Income=Medium. • First, we estimate the likelihood that the example is a defaulter, given its attribute values: P[H1|E] = P[E|H1] x P[H1] (denominator omitted*)
P[Status = DEFAULTS | Philly, Many, Medium] = P[Philly|DEFAULTS] x P[Many|DEFAULTS] x P[Medium|DEFAULTS] x P[DEFAULTS] = 1 x 1 x 0.5 x 0.5 = 0.25
• Then we estimate the likelihood that the example is a payer, given its attributes: P[H2|E] = P[E|H2] x P[H2] (denominator omitted*)
P[Status = PAYS | Philly, Many, Medium] = P[Philly|PAYS] x P[Many|PAYS] x P[Medium|PAYS] x P[PAYS] = 1 x 0 x 0.5 x 0.5 = 0
• As the conditional likelihood of being a defaulter is higher (0.25 > 0), we conclude that the new example is a defaulter.
*Note: We haven't divided by P[Philly, Many, Medium] in the calculations above because it is applied to both hypotheses, so it doesn't affect which of the two likelihoods is higher, and hence doesn't affect our result.
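The whole calculation can be automated with a few dictionaries of counts. The sketch below (ours, not from the slides) applies unsmoothed naïve Bayes to the four bank-loan records above and reproduces the 0.25 vs 0 comparison.

```python
from collections import Counter

# Training data from the bank-loan example: (City, Children, Income) -> Status
data = [
    (("Philly", "Many", "Medium"), "DEFAULTS"),
    (("Philly", "Many", "Low"),    "DEFAULTS"),
    (("Philly", "Few",  "Medium"), "PAYS"),
    (("Philly", "Few",  "High"),   "PAYS"),
]

class_counts = Counter(status for _, status in data)
value_counts = Counter()                       # (attribute index, value, status) -> count
for attrs, status in data:
    for i, value in enumerate(attrs):
        value_counts[(i, value, status)] += 1

def score(x, status):
    """P(x|status) * P(status), with the Bayes denominator omitted."""
    s = class_counts[status] / len(data)
    for i, value in enumerate(x):
        s *= value_counts[(i, value, status)] / class_counts[status]
    return s

new_applicant = ("Philly", "Many", "Medium")
for status in class_counts:
    print(status, score(new_applicant, status))    # DEFAULTS 0.25, PAYS 0.0
```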
Example 3 • Now, assume a new example is presented where City=Philly, Children=Many, and Income=High. • First, we estimate the likelihood that the example is a defaulter, given its attribute values:
P[Status = DEFAULTS | Philly, Many, High] = P[Philly|DEFAULTS] x P[Many|DEFAULTS] x P[High|DEFAULTS] x P[DEFAULTS] = 1 x 1 x 0 x 0.5 = 0
• Then we estimate the likelihood that the example is a payer, given its attributes:
P[Status = PAYS | Philly, Many, High] = P[Philly|PAYS] x P[Many|PAYS] x P[High|PAYS] x P[PAYS] = 1 x 0 x 0.5 x 0.5 = 0
• As the conditional likelihood of being a defaulter is the same as that of being a payer (both are 0), we can come to no conclusion for this example.
Example 4 • Take the following training data, for credit card authorizations (source: adapted from Dunham):

TransactionID  Income     Credit     Decision
1              Very High  Excellent  AUTHORIZE
2              High       Good       AUTHORIZE
3              Medium     Excellent  AUTHORIZE
4              High       Good       AUTHORIZE
5              Very High  Good       AUTHORIZE
6              Medium     Excellent  AUTHORIZE
7              High       Bad        REQUEST ID
8              Medium     Bad        REQUEST ID
9              High       Bad        REJECT
10             Low        Bad        CALL POLICE

• Assume we'd like to determine how to classify a new transaction, with Income = Medium and Credit = Good.
Example 4 • Our conditional probabilities (from simple counts in the table above) are:

P[Income | Decision]   AUTHORIZE  REQUEST ID  REJECT  CALL POLICE
Income=Very High       2/6        0/2         0/1     0/1
Income=High            2/6        1/2         1/1     0/1
Income=Medium          2/6        1/2         0/1     0/1
Income=Low             0/6        0/2         0/1     1/1

P[Credit | Decision]   AUTHORIZE  REQUEST ID  REJECT  CALL POLICE
Credit=Excellent       3/6        0/2         0/1     0/1
Credit=Good            3/6        0/2         0/1     0/1
Credit=Bad             0/6        2/2         1/1     1/1

• Our class probabilities are: P[Decision = AUTHORIZE] = 6/10, P[Decision = REQUEST ID] = 2/10, P[Decision = REJECT] = 1/10, P[Decision = CALL POLICE] = 1/10
Example 4 • Our goal is now to work out, for each class, the conditional probability of the new transaction (with Income=Medium and Credit=Good) being in that class. The class with the highest probability is the classification we choose. Our conditional probabilities (again, ignoring Bayes' denominator) are:
P[Decision = AUTHORIZE | Income=Medium & Credit=Good] = P[Income=Medium|Decision=AUTHORIZE] x P[Credit=Good|Decision=AUTHORIZE] x P[Decision=AUTHORIZE] = 2/6 x 3/6 x 6/10 = 36/360 = 0.1
P[Decision = REQUEST ID | Income=Medium & Credit=Good] = P[Income=Medium|Decision=REQUEST ID] x P[Credit=Good|Decision=REQUEST ID] x P[Decision=REQUEST ID] = 1/2 x 0/2 x 2/10 = 0
P[Decision = REJECT | Income=Medium & Credit=Good] = P[Income=Medium|Decision=REJECT] x P[Credit=Good|Decision=REJECT] x P[Decision=REJECT] = 0/1 x 0/1 x 1/10 = 0
P[Decision = CALL POLICE | Income=Medium & Credit=Good] = P[Income=Medium|Decision=CALL POLICE] x P[Credit=Good|Decision=CALL POLICE] x P[Decision=CALL POLICE] = 0/1 x 0/1 x 1/10 = 0
• The highest of these probabilities is the first, so we conclude that the decision for our new transaction should be AUTHORIZE.
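As before, these four scores are easy to check in Python (our snippet, not the textbook's), using the counts read off the training table above.

```python
from fractions import Fraction as F

# P[Income=Medium | class] * P[Credit=Good | class] * P[class], from the training table
scores = {
    "AUTHORIZE":   F(2, 6) * F(3, 6) * F(6, 10),
    "REQUEST ID":  F(1, 2) * F(0, 2) * F(2, 10),
    "REJECT":      F(0, 1) * F(0, 1) * F(1, 10),
    "CALL POLICE": F(0, 1) * F(0, 1) * F(1, 10),
}
for decision, score in scores.items():
    print(decision, float(score))              # AUTHORIZE 0.1, all others 0.0
print("chosen:", max(scores, key=scores.get))  # AUTHORIZE
```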
Example 5
Example 5 • Let D = <unknown, low, none, 15-35>, i.e. CH=unknown, Debt=low, Coll=none, Inc=15-35 • Which risk category is D in? • Three hypotheses: Risk=low, Risk=moderate, Risk=high • Because of the naïve independence assumption, we calculate the individual conditional probabilities and then multiply them together.
Example 5 • Conditional probability estimates:
P(CH=unknown | Risk=low) = 2/5    P(CH=unknown | Risk=moderate) = 1/3    P(CH=unknown | Risk=high) = 2/6
P(Debt=low | Risk=low) = 3/5      P(Debt=low | Risk=moderate) = 1/3      P(Debt=low | Risk=high) = 2/6
P(Coll=none | Risk=low) = 3/5     P(Coll=none | Risk=moderate) = 2/3     P(Coll=none | Risk=high) = 6/6
P(Inc=15-35 | Risk=low) = 0/5     P(Inc=15-35 | Risk=moderate) = 2/3     P(Inc=15-35 | Risk=high) = 2/6
• Priors: P(Risk=low) = 5/14, P(Risk=moderate) = 3/14, P(Risk=high) = 6/14
• Likelihoods:
P(D|Risk=low) = 2/5 * 3/5 * 3/5 * 0/5 = 0
P(D|Risk=moderate) = 1/3 * 1/3 * 2/3 * 2/3 = 4/81 ≈ 0.049
P(D|Risk=high) = 2/6 * 2/6 * 6/6 * 2/6 = 48/1296 ≈ 0.037
• Scores:
P(D|Risk=low) P(Risk=low) = 0 * 5/14 = 0
P(D|Risk=moderate) P(Risk=moderate) = 4/81 * 3/14 ≈ 0.0106
P(D|Risk=high) P(Risk=high) = 48/1296 * 6/14 ≈ 0.0159
• Since 0.0159 is the largest of the three scores, D is classified as Risk=high.
Avoiding the 0-Probability Problem • Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero • Ex. Suppose a dataset (for one class) with 1000 tuples: income = low (0 tuples), income = medium (990), and income = high (10) • Use the Laplacian correction (or Laplacian estimator): add 1 to each count
Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003
• The "corrected" probability estimates are close to their "uncorrected" counterparts
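A minimal Python sketch of the Laplacian correction (ours; the 0/990/10 counts are the ones from the slide): adding 1 to every value's count keeps each estimate non-zero without noticeably changing the others.

```python
def laplace_corrected(counts: dict, k: int = 1) -> dict:
    """Smoothed estimates: (count + k) / (total + k * number_of_values)."""
    total = sum(counts.values())
    denom = total + k * len(counts)
    return {value: (count + k) / denom for value, count in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}   # counts within one class, from the slide
print(laplace_corrected(income_counts))
# {'low': 0.000997..., 'medium': 0.988..., 'high': 0.01096...}  i.e. 1/1003, 991/1003, 11/1003
```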
Naïve Bayesian Classifier: Comments • Advantages: • Easy to implement • Good results obtained in most of the cases • Disadvantages: • Assumption of class-conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables • E.g., in hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.) • Dependencies among these cannot be modeled by a naïve Bayesian classifier • How to deal with these dependencies? Bayesian belief networks
The independence hypothesis… • … makes computation possible • … yields optimal classifiers when satisfied • … but is seldom satisfied in practice, as attributes (variables) are often correlated. • Attempts to overcome this limitation: • Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes • Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian Belief Networks • A Bayesian belief network allows a subset of the variables to be conditionally independent • A graphical model of causal relationships • Represents dependencies among the variables • Gives a specification of the joint probability distribution • Nodes: random variables • Links: dependencies • Example graph with nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P • The graph has no loops or cycles
Bayesian Belief Network: An Example
[Network diagram: nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC)]
The conditional probability table (CPT) for the variable LungCancer:

       (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC     0.8       0.5        0.7        0.1
~LC    0.2       0.5        0.3        0.9

• The CPT shows the conditional probability of LungCancer for each possible combination of values of its parents (each column sums to 1)
• Derivation of the probability of a particular combination of values x1, …, xn from the CPTs: P(x1, …, xn) = ∏i P(xi | Parents(xi))
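To make the chain-rule product concrete, here is a small Python sketch (our illustration, not from the slides) that evaluates P(x1, …, xn) for a three-node fragment of the example network (FamilyHistory, Smoker, LungCancer), using the LungCancer CPT above; the priors for FamilyHistory and Smoker are made-up values.

```python
# Assumed priors for the parentless nodes (illustrative values only)
p_fh = 0.3          # P(FamilyHistory = true)
p_s = 0.4           # P(Smoker = true)
p_lc_given = {      # P(LungCancer = true | FH, S), from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), the chain-rule product."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc = p_lc_given[(fh, s)]
    return p * (p_lc if lc else 1 - p_lc)

print(joint(True, True, True))     # 0.3 * 0.4 * 0.8 = 0.096
# Summing over all 8 assignments gives 1, as a joint distribution must:
print(sum(joint(a, b, c) for a in (True, False) for b in (True, False) for c in (True, False)))
```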
Training Bayesian Networks • Several scenarios: • Given both the network structure and all variables observable: learn only the CPTs • Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, analogous to neural network learning • Network structure unknown, all variables observable: search through the model space to reconstruct the network topology • Unknown structure, all variables hidden: no good algorithms known for this purpose • Ref. D. Heckerman: Bayesian Networks for Data Mining