Data Classification by Statistical Methods Wei-Min Shen Information Sciences Institute University of Southern California UCLA Data Mining Short Course (2)
Outline • Basics of Probability Theory • Naive Bayesian Classifier • Boosting Algorithms • Bayesian Networks
Bayesian Probability Theory • What is probability? • Subjective vs. objective • Simple and principled • P(A|C) + P(~A|C) = 1 • P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC) • Is there anything else? • P(A∨B|C) = 1 - P(~A~B|C) = …
Certainty and Impossibility • If A is certain given C, then P(A|C) = 1 • Assume B doesn't contradict C; then P(AB|C) = P(B|C) and P(A|BC) = P(A|C), so by law 2, P(B|C) = P(A|C)P(B|C), hence P(A|C) = 1 • If A is impossible given C, then P(A|C) = 0 • By law 1: ~A is certain given C, so P(~A|C) = 1 and P(A|C) = 0
The Product and Sum Laws • P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC) • If A, B independent, P(AB|C) = P(A|C)P(B|C) • If A⇒B, then P(B|AC) = 1 and P(A|C) ≤ P(B|C) • P(AB|C) = P(A|C)P(B|AC) = P(A|C)·1, so • P(AB|C) = min(P(A|C), P(B|C)), the "fuzzy" rule • P(A∨B|C) = P(A|C) + P(B|C) - P(AB|C) • If A⇒B, then P(AB|C) = P(A|C) and P(A|C) ≤ P(B|C) • P(A∨B|C) = P(A|C) + P(B|C) - P(A|C) = P(B|C) • P(A∨B|C) = max(P(A|C), P(B|C)), the other "fuzzy" rule
Generalization of Classical Logic • P(A|C)P(B|AC) = P(B|C)P(A|BC) • Quantitative deductive inference • If A⇒B and A, then B • If A⇒B and ~B, then ~A • If A⇒B and B, then "A becomes more plausible" (abduction) • Quantitative inductive inference • If A⇒B and ~A, then "B becomes less plausible" • If A⇒"B becomes more plausible" and B, then "A becomes more plausible"
Suitable for Mining • P(A|BC) = P(A|C)P(B|AC) / P(B|C) • P(A|BC): your new knowledge of the theory A after experiment B, given the background knowledge C • P(A|C): what you know about A without B • P(B|AC): the probability of B if your current understanding of A were correct • P(B|C): the probability of observing B anyway • This is recursive: your new understanding of A becomes part of C, which is used to form a newer understanding of A • Caution: the recipe for "Rabbit Stew"
Concept Learning with Probability Theory • Given • X: the current information about the learning task • Ai: "Ci is the target concept" • Dj: "instance j is in the target concept" • Here is how • P(Ai|DjX) = P(Ai|X)P(Dj|AiX) / P(Dj|X)
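To make the recursion concrete, here is a minimal sketch of the update P(Ai|DjX) = P(Ai|X)P(Dj|AiX) / P(Dj|X) over a few candidate concepts. The concepts, the uniform-sampling likelihood, and all numbers are illustrative assumptions, not part of the course material.

```python
# A minimal sketch of Bayesian concept learning: repeatedly apply
# P(Ai | Dj X) = P(Ai | X) P(Dj | Ai X) / P(Dj | X).
# The candidate concepts and the likelihood model are hypothetical.

candidates = {                      # Ai: "Ci is the target concept"
    "even":  {2, 4, 6, 8},
    "small": {1, 2, 3, 4},
    "all":   {1, 2, 3, 4, 5, 6, 7, 8},
}
prior = {name: 1 / len(candidates) for name in candidates}   # P(Ai | X)

def update(prior, instance):
    """Observe Dj: 'instance is in the target concept'; return P(Ai | Dj X)."""
    # P(Dj | Ai X): assume instances are drawn uniformly from the candidate concept.
    likelihood = {n: (1 / len(c) if instance in c else 0.0)
                  for n, c in candidates.items()}
    evidence = sum(prior[n] * likelihood[n] for n in candidates)   # P(Dj | X)
    return {n: prior[n] * likelihood[n] / evidence for n in candidates}

posterior = update(prior, 2)        # the posterior becomes the new prior
posterior = update(posterior, 4)    # (the recursion: new knowledge folds into C)
print(posterior)                    # "even" and "small" gain, "all" loses plausibility
```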
Naive Bayesian Classifier • Let A1, …, Ak be attributes, [a1, …, ak] an example, and C the class to be predicted; the optimal prediction is the class value c such that P(C=c | A1=a1 ∧ … ∧ Ak=ak) is maximal. • By Bayes' rule, this equals P(A1=a1 ∧ … ∧ Ak=ak | C=c) · P(C=c) / P(A1=a1 ∧ … ∧ Ak=ak)
Naïve Bayesian Classifier • P(C=c) is easy to estimate from training data • P(A1=a1 ∧ … ∧ Ak=ak) is irrelevant (the same for all c) • Compute P(A1=a1 ∧ … ∧ Ak=ak | C=c) • assume the attributes are independent given the class (hence "naïve") • P(A1=a1|C=c) P(A2=a2|C=c) … P(Ak=ak|C=c) • where each term can be estimated as • P(Aj=aj|C=c) = count(Aj=aj ∧ C=c) / count(C=c)
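A minimal sketch of the classifier just described, assuming categorical attributes and the counting estimates above (no smoothing); the function and variable names are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

# Naive Bayes by counting: estimate P(C=c) and P(Aj=aj | C=c) from the
# training data, then pick the class with the largest product.

def train(examples, labels):
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)            # (attr index, class) -> value counts
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            cond_counts[(j, c)][v] += 1
    return class_counts, cond_counts

def predict(x, class_counts, cond_counts):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                            # P(C=c)
        for j, v in enumerate(x):
            score *= cond_counts[(j, c)][v] / cc  # P(Aj=aj | C=c), 0 if unseen
        if score > best_score:
            best, best_score = c, score
    return best, best_score
```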
Tennis Example Revisited • [Day, Outlook, Temp, Humidity, Wind, PlayTennis]
• (D1 Sunny Hot High Weak No)
• (D2 Sunny Hot High Strong No)
• (D3 Overcast Hot High Weak Yes)
• (D4 Rain Mild High Weak Yes)
• (D5 Rain Cool Normal Weak Yes)
• (D6 Rain Cool Normal Strong No)
• (D7 Overcast Cool Normal Strong Yes)
• (D8 Sunny Mild High Weak No)
• (D9 Sunny Cool Normal Weak Yes)
• (D10 Rain Mild Normal Weak Yes)
• (D11 Sunny Mild Normal Strong Yes)
• (D12 Overcast Mild High Strong Yes)
• (D13 Overcast Hot Normal Weak Yes)
• (D14 Rain Mild High Strong No)
Estimated Probabilities • P(Play=Yes | [Sunny, Hot, High, Weak]) = ? • P(Play=No | [Sunny, Hot, High, Weak]) = ? (winner) • P(Play=Yes) = 9/14 • P(Play=No) = 5/14 • P(Outlook=Sunny | Play=Yes) = 2/9 • P(Outlook=Sunny | Play=No) = 3/5 • P(Temp=Hot | Play=Yes) = 2/9 • P(Temp=Hot | Play=No) = 2/5 • P(Humidity=High | Play=Yes) = 3/9 • P(Humidity=High | Play=No) = 4/5 • P(Wind=Weak | Play=Yes) = 6/9 • P(Wind=Weak | Play=No) = 2/5
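Plugging the estimates above into the naive Bayes product gives the two unnormalized scores; the short calculation below, using only the numbers listed on this slide, confirms the "winner" noted above.

```python
# Naive Bayes scores for the query [Sunny, Hot, High, Weak].
score_yes = 9/14 * 2/9 * 2/9 * 3/9 * 6/9   # P(Yes) * P(Sunny|Yes) * P(Hot|Yes) * ...
score_no  = 5/14 * 3/5 * 2/5 * 4/5 * 2/5   # P(No)  * P(Sunny|No)  * P(Hot|No)  * ...
print(round(score_yes, 4), round(score_no, 4))   # ~0.0071 vs ~0.0274: "No" wins
```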
Boosting • Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor • For t = 1 to T: • Learn the classifier Ht on the weighted training examples • Increase the weights of the training examples misclassified by Ht
The AdaBoost Algorithm (Freund and Schapire 1995) • Let xi ∈ X be an example and yi its class • Assign equal weight to all N examples: w(1)i = 1/N • For t = 1 through T do: • Given the weights w(t)i, obtain a hypothesis H(t): X → [0,1] • Let the error of H(t) be ε(t) = Σi w(t)i |yi - H(t)(xi)| • Let β(t) = ε(t) / (1 - ε(t)) and let w(t+1)i = w(t)i (β(t))^(1 - |yi - H(t)(xi)|) • Normalize the w(t+1)i to sum to 1.0 • The final combination hypothesis H is: • H(x) = 1 / [1 + Πt (β(t))^(2r(x)-1)], where • r(x) = [Σt (log 1/β(t)) H(t)(x)] / [Σt log(1/β(t))]
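A compact sketch of this loop, assuming a caller-supplied weak_learner that returns hypotheses h: X → [0,1] and that each round's error satisfies 0 < ε(t) < 1; it follows the error, β, reweighting, and combination formulas above.

```python
import math

# AdaBoost for [0,1]-valued hypotheses. `weak_learner(examples, labels, weights)`
# is a placeholder for any learner that respects example weights.

def adaboost(examples, labels, weak_learner, T):
    n = len(examples)
    w = [1.0 / n] * n                           # equal initial weights
    hyps, betas = [], []
    for _ in range(T):
        h = weak_learner(examples, labels, w)
        eps = sum(wi * abs(yi - h(xi))          # weighted error of h
                  for wi, xi, yi in zip(w, examples, labels))
        beta = eps / (1.0 - eps)                # assumes 0 < eps < 1
        w = [wi * beta ** (1.0 - abs(yi - h(xi)))   # shrink weights of easy examples
             for wi, xi, yi in zip(w, examples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]                # renormalize to sum to 1
        hyps.append(h)
        betas.append(beta)

    def H(x):                                   # combined hypothesis
        logs = [math.log(1.0 / b) for b in betas]
        r = sum(l * h(x) for l, h in zip(logs, hyps)) / sum(logs)
        prod = 1.0
        for b in betas:
            prod *= b ** (2.0 * r - 1.0)
        return 1.0 / (1.0 + prod)
    return H
```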
Boosted Naïve Bayesian Algorithm (Elkan 97) • Let w(t)i be the weights of the examples at time t, and H(t)(x) = H(t)([a1,…,ak]) = argmaxc {P(t)(A1=a1|C=c) … P(t)(Ak=ak|C=c) P(t)(C=c)}, where • P(t)(Aj=aj|C=c) = Σg w(t)g / Σh w(t)h, summing g over the examples with Aj=aj and C=c and h over the examples with C=c • P(t)(C=c) = Σl w(t)l, summing l over the examples with C=c
Tennis Example Revisited • [Day, Outlook, Temp, Humidity, Wind, PlayTennis], weight
• (D1 Sunny Hot High Weak No) 1/28
• (D2 Sunny Hot High Strong No) 3/28
• (D3 Overcast Hot High Weak Yes) 1/28
• (D4 Rain Mild High Weak Yes) 3/28
• (D5 Rain Cool Normal Weak Yes) 1/28
• (D6 Rain Cool Normal Strong No) 3/28
• (D7 Overcast Cool Normal Strong Yes) 1/28
• (D8 Sunny Mild High Weak No) 3/28
• (D9 Sunny Cool Normal Weak Yes) 1/28
• (D10 Rain Mild Normal Weak Yes) 3/28
• (D11 Sunny Mild Normal Strong Yes) 1/28
• (D12 Overcast Mild High Strong Yes) 3/28
• (D13 Overcast Hot Normal Weak Yes) 1/28
• (D14 Rain Mild High Strong No) 3/28
Estimate Probabilities Based on Weighted Examples • P(Play=Yes | [Sunny, Hot, High, Weak]) = ? • P(Play=No | [Sunny, Hot, High, Weak]) = ? • P(Play=Yes) = 15/28 • P(Play=No) = 13/28 • P(Outlook=Sunny | Play=Yes) = 2/15 • P(Outlook=Sunny | Play=No) = 7/13
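The weighted estimates can be checked directly from the weighted tennis table above: each example contributes its weight (1/28 or 3/28) instead of a raw count. A small sketch:

```python
from collections import defaultdict

# Weighted estimates: class priors and conditionals are weighted sums.
data = [  # (Outlook, Temp, Humidity, Wind, Play), weight
    (("Sunny","Hot","High","Weak","No"), 1/28), (("Sunny","Hot","High","Strong","No"), 3/28),
    (("Overcast","Hot","High","Weak","Yes"), 1/28), (("Rain","Mild","High","Weak","Yes"), 3/28),
    (("Rain","Cool","Normal","Weak","Yes"), 1/28), (("Rain","Cool","Normal","Strong","No"), 3/28),
    (("Overcast","Cool","Normal","Strong","Yes"), 1/28), (("Sunny","Mild","High","Weak","No"), 3/28),
    (("Sunny","Cool","Normal","Weak","Yes"), 1/28), (("Rain","Mild","Normal","Weak","Yes"), 3/28),
    (("Sunny","Mild","Normal","Strong","Yes"), 1/28), (("Overcast","Mild","High","Strong","Yes"), 3/28),
    (("Overcast","Hot","Normal","Weak","Yes"), 1/28), (("Rain","Mild","High","Strong","No"), 3/28),
]

w_class = defaultdict(float)
w_sunny = defaultdict(float)
for (outlook, *_rest, play), w in data:
    w_class[play] += w
    if outlook == "Sunny":
        w_sunny[play] += w

print(w_class["Yes"], w_class["No"])              # 15/28 and 13/28
print(w_sunny["Yes"] / w_class["Yes"],            # 2/15
      w_sunny["No"] / w_class["No"])              # 7/13
```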
How Good Is the Boosted Naive Bayesian Classifier? • Very efficient: • Suppose there are e examples over f attributes, each attribute with v values; then the algorithm's complexity is O(ef), independent of v. • Learning a decision tree without pruning requires O(ef²) time. • Accuracy: won first place in the 1997 KDD Cup • "Remarkably successful in practice, and no uniformly better learning method is known." (Elkan 97)
Bayesian Belief Networks • Concise representation of conditional probabilities used in Bayesian inference • Plays the role of the model in data mining • Can encode effects of actions • Patterns may be probabilistic • Can be learned • Powerful representation language • probabilistic propositional logic • no quantifiers • relations encoded by probabilistic correlation
Bayes Nets • The model supports several forms of inference • Diagnostic: from effects to causes • If meningitis and whiplash both cause a stiff neck and we observe a stiff neck, which cause is most likely? • P(Men|Stiff) vs. P(Whip|Stiff) • Useful in learning action models • Useful in identifying hidden state • Causal: from causes to effects • What are the likely effects of contracting meningitis? • P(Stiff|Men) • Useful in predicting effects of actions • Mixed: a combination of the above
More Probability Theory • Pr(Raining|C) • In logic, "Raining" is a Boolean variable • In probability, "Raining" is a special case of a random variable (a binomial/Bernoulli variable with probability p) • General case: random variables can take on a set of values, e.g. Pr(Precipitation=0.5in) • We associate a probability density function (pdf) with the random variable; this defines all probabilities for that set of values • f(Toothache) = { (True)-->0.05, (False)-->0.95 } • f(Age) = gaussian(45,20)
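As a small illustration of the two kinds of density functions above: a discrete pmf as a lookup table and a continuous pdf as a Gaussian. Treating the second argument of gaussian(45, 20) as a standard deviation is an assumption made only for this sketch.

```python
from scipy.stats import norm

# f(Toothache): a discrete pmf over {True, False}.
f_toothache = {True: 0.05, False: 0.95}

# f(Age) = gaussian(45, 20): mean 45; scale 20 assumed to be the std. deviation.
f_age = norm(loc=45, scale=20)

print(f_toothache[True])      # 0.05
print(f_age.pdf(45))          # density at the mean, ~0.0199
```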
Joint Distribution • probability density function of a set of variables • f(Toothache, Cavity):

              Toothache   Not Toothache
  Cavity         .04          .06         P(Cavity) = .1
  Not Cavity     .01          .89         P(~Cavity) = .9

• the joint distribution expresses the relationship between variables • e.g. shoe size vs. reading level • From the joint, we can derive all other probabilities • marginal probabilities: P(A), P(B) • conditional (from the product rule): P(B|A) = P(AB)/P(A)
Example • A joint distribution over Cavity, Ache, and ProbeCatch:

  Cavity  Ache  ProbeCatch  Probability
  T       T     T           0.03
  T       T     F           0.01
  T       F     T           0.045
  T       F     F           0.015
  F       T     T           0.0025
  F       T     F           0.0075
  F       F     T           0.2225
  F       F     F           0.6675

• Marginals from the table: P(Probe) = 0.3, P(Cavity) = 0.1, P(Ache) = 0.05 • P(Cavity | Probe ∧ Ache) = ?
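One way to answer the question is to read it straight off the joint table above:

```python
# P(Cavity | Probe ∧ Ache) = P(Cavity ∧ Probe ∧ Ache) / P(Probe ∧ Ache)
p_cavity_probe_ache = 0.03                    # row (T, T, T)
p_probe_ache = 0.03 + 0.0025                  # rows (T, T, T) and (F, T, T)
print(p_cavity_probe_ache / p_probe_ache)     # ~0.923
```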
Conditional Independence • The joint distribution can be big and cumbersome • n Boolean variables --> 2^n entries • Often we don't need the whole table! • Definitions: A and B are independent if: • Pr(AB) = P(A)P(B) • Pr(A|B) = P(A) --> follows from the product rule • A and B are conditionally independent given C if: • Pr(AB|C) = P(A|C)P(B|C), or Pr(A|BC) = Pr(A|C) • We can use this to simplify the joint distribution
Example • Toothache example • To do inference we might need all 8 entries of the joint distribution • But Probe and Ache are conditionally independent given Cavity: P(Probe | Cavity, Ache) = P(Probe | Cavity), P(Ache | Cavity, Probe) = P(Ache | Cavity), P(Ache, Probe | Cavity) = P(Ache | Cavity) P(Probe | Cavity) • So we can get away with fewer probabilities
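The claim can be checked numerically against the joint table from the previous example:

```python
# Verify P(Ache, Probe | Cavity) = P(Ache | Cavity) P(Probe | Cavity).
p_cavity = 0.03 + 0.01 + 0.045 + 0.015                 # 0.1
p_ache_and_probe_given_cavity = 0.03 / p_cavity        # 0.3
p_ache_given_cavity = (0.03 + 0.01) / p_cavity         # 0.4
p_probe_given_cavity = (0.03 + 0.045) / p_cavity       # 0.75
print(p_ache_and_probe_given_cavity,
      p_ache_given_cavity * p_probe_given_cavity)      # both 0.3 (up to rounding)
```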
Intuition • There is a relationship between conditional independence and causality • If A causes B and C • There is a correlation between B and C • This correlation is broken given A • B and C are conditionally independent given A • Shoe size and reading level are independent given Age since age is the "direct cause" of both • Important because • people find it hard to recognize conditional independence • but find it easy to state direct causes
Bayes Belief Networks • A simple representation that "encodes" many conditional independencies • Each node is a random variable • A has a directed link to B if A is a "direct cause" of B • In statistical terms: knowing all parents of B makes B conditionally independent of everything else (other than B's descendants) • Each node has a conditional probability table that quantifies the effects its parents have on the node • The graph has no directed cycles • The joint distribution is the product: • f(X1, …, Xn) = Πi P(Xi | Parents(Xi))
Example • Nodes: Fraud, Age, Sex, Buys Gas, Buys Jewelry (Fraud is a parent of Buys Gas; Fraud, Age, and Sex are parents of Buys Jewelry) • Conditional independencies: P(A|F) = P(A), P(S|FA) = P(S), P(G|FAS) = P(G|F), P(J|FASG) = P(J|FAS) • P(fasgj) = P(f)P(asgj|f) = P(f)P(a|f)P(sgj|af) = P(f)P(a|f)P(s|af)P(gj|saf) = P(f)P(a|f)P(s|af)P(g|saf)P(j|gsaf) ;; use the conditional independencies = P(f) P(a) P(s) P(g|f) P(j|saf)
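A sketch of the factored joint P(f,a,s,g,j) = P(f) P(a) P(s) P(g|f) P(j|s,a,f) for this network. Only the structure comes from the slide; every number in the CPTs below is made up purely for illustration.

```python
from itertools import product

P_f = {True: 0.001, False: 0.999}             # P(Fraud) -- hypothetical
P_a = {"young": 0.3, "old": 0.7}              # P(Age)   -- hypothetical
P_s = {"m": 0.5, "f": 0.5}                    # P(Sex)   -- hypothetical
P_g = {True: {True: 0.8, False: 0.2},         # P(BuysGas | Fraud) -- hypothetical
       False: {True: 0.1, False: 0.9}}

def P_j(j, s, a, f):                          # P(BuysJewelry | Sex, Age, Fraud)
    base = 0.6 if f else (0.05 if s == "f" else 0.02)   # hypothetical numbers
    return base if j else 1.0 - base

def joint(f, a, s, g, j):
    return P_f[f] * P_a[a] * P_s[s] * P_g[f][g] * P_j(j, s, a, f)

# Sanity check: the factored joint sums to 1 over all assignments.
total = sum(joint(f, a, s, g, j)
            for f, a, s, g, j in product([True, False], ["young", "old"],
                                         ["m", "f"], [True, False], [True, False]))
print(total)                                  # 1.0 (up to floating point)
```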
Inference • Inference in a Bayes network: • fix the values of a subset of variables (the evidence variables) • compute the posterior distribution of the remaining variables (the query variables) • this can be done for any arbitrary subset of variables • Several algorithms have been proposed • Easy if the variables are discrete and there are no undirected cycles • NP-hard if there are undirected cycles • in this case heuristic techniques are often effective
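For a concrete (if brute-force) illustration of fixing evidence and computing a posterior, the snippet below reuses the hypothetical fraud network sketched above and sums the joint over the unobserved variables; real Bayes-net algorithms avoid this exhaustive enumeration.

```python
# Diagnostic query on the hypothetical fraud network (uses joint() from the
# previous sketch): P(Fraud | BuysGas = True, BuysJewelry = True).

def posterior_fraud(g_obs=True, j_obs=True):
    scores = {}
    for f in (True, False):
        scores[f] = sum(joint(f, a, s, g_obs, j_obs)     # sum out Age and Sex
                        for a in ("young", "old") for s in ("m", "f"))
    z = scores[True] + scores[False]                     # P(evidence)
    return {f: p / z for f, p in scores.items()}

print(posterior_fraud())      # diagnostic inference: from observed effects to cause
```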