Learn the fundamentals of data mining with Bayesian networks, including unconditional and conditional probability, joint probability, conditional independence, and creating a Bayesian network. Explore examples and applications in various domains.
Data Mining with Bayesian Networks (I)
Instructor: Qiang Yang, Hong Kong University of Science and Technology (Qyang@cs.ust.hk)
Thanks: Dan Weld, Eibe Frank
Basics
• Unconditional or prior probability
• Pr(Play=yes) + Pr(Play=no) = 1
• Pr(Play=yes) is sometimes written as Pr(Play)
• The table has 9 yes and 5 no examples
• Pr(Play=yes) = 9/(9+5) = 9/14
• Thus, Pr(Play=no) = 5/14
• Joint probability of Play and Windy: Pr(Play=x, Windy=y), summed over all values x and y, must equal 1

              Windy=True   Windy=False
  Play=yes       3/14          6/14
  Play=no        3/14            ?
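The prior and joint probabilities above can be checked with a few lines of code. This is a sketch assuming the 14-day weather counts from the table, with the missing cell inferred as 2 so that the four joint probabilities sum to 1:

```python
# Sketch: prior and joint probabilities from the weather counts above.
from fractions import Fraction

# counts[(play, windy)] from the 2x2 table; the "?" cell must be 2
# so that the joint distribution sums to 1 (and 9 + 5 = 14 days).
counts = {
    ("yes", True): 3, ("yes", False): 6,
    ("no",  True): 3, ("no",  False): 2,
}
n = sum(counts.values())                      # 14 days in total

pr_play_yes = Fraction(counts[("yes", True)] + counts[("yes", False)], n)
pr_play_no  = Fraction(counts[("no", True)] + counts[("no", False)], n)
assert pr_play_yes + pr_play_no == 1          # priors sum to 1

joint = {k: Fraction(v, n) for k, v in counts.items()}
assert sum(joint.values()) == 1               # joint sums to 1
print(pr_play_yes)                            # → 9/14
```

Using exact fractions avoids any floating-point noise in the sanity checks.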
Probability Basics
• Conditional probability: Pr(A|B)
• #(Windy=False) = 8; within those 8, #(Play=yes) = 6
• Pr(Play=yes | Windy=False) = 6/8
• Pr(Windy=False) = 8/14
• Pr(Play=yes) = 9/14
• Applying Bayes' rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
• Pr(Windy=False | Play=yes) = (6/8 × 8/14) / (9/14) = 6/9
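The Bayes' rule step above can be reproduced exactly with rational arithmetic; a minimal sketch using the numbers from this slide:

```python
# Sketch: applying Bayes' rule to the weather numbers on this slide.
from fractions import Fraction

pr_windy_false = Fraction(8, 14)
pr_play_yes = Fraction(9, 14)
pr_play_yes_given_windy_false = Fraction(6, 8)

# Bayes' rule: Pr(Windy=False | Play=yes)
#   = Pr(Play=yes | Windy=False) * Pr(Windy=False) / Pr(Play=yes)
pr_windy_false_given_play_yes = (
    pr_play_yes_given_windy_false * pr_windy_false / pr_play_yes
)
print(pr_windy_false_given_play_yes)   # → 2/3, i.e. 6/9
```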
Conditional Independence
• "A and P are independent given C": Pr(A | P, C) = Pr(A | C)
  (A = Ache, C = Cavity, P = Probe Catches)

  C  A  P   Probability
  F  F  F   0.534
  F  F  T   0.356
  F  T  F   0.006
  F  T  T   0.004
  T  F  F   0.012
  T  F  T   0.048
  T  T  F   0.008
  T  T  T   0.032
Conditional Independence
• "A and P are independent given C": Pr(A | P, C) = Pr(A | C), and also Pr(P | A, C) = Pr(P | C)
• Suppose C = True:
  Pr(A | P, C) = 0.032 / (0.032 + 0.048) = 0.032 / 0.080 = 0.4
  Pr(A | C) = (0.032 + 0.008) / (0.048 + 0.012 + 0.032 + 0.008) = 0.04 / 0.1 = 0.4

  C  A  P   Probability
  F  F  F   0.534
  F  F  T   0.356
  F  T  F   0.006
  F  T  T   0.004
  T  F  F   0.012
  T  F  T   0.048
  T  T  F   0.008
  T  T  T   0.032
Conditional Independence
• A conditional probability table (CPT) at each node can encode the joint probability distribution in compact form
  (A = Ache, C = Cavity, P = Probe Catches)

  Pr(C) = 0.1

  C   Pr(A|C)        C   Pr(P|C)
  T   0.4            T   0.8
  F   0.02           F   0.4

• Compare with the full joint table:

  C  A  P   Probability
  F  F  F   0.534
  F  F  T   0.356
  F  T  F   0.006
  F  T  T   0.004
  T  F  F   0.012
  T  F  T   0.048
  T  T  F   0.008
  T  T  T   0.032
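To see how the CPTs compactly encode the joint, a small sketch can rebuild the joint via the factorization Pr(C, A, P) = Pr(C) Pr(A|C) Pr(P|C) and re-check the conditional independence. Note Pr(C) = 0.1 is used here because the C=True rows of the joint table sum to 0.1:

```python
# Sketch: rebuild the joint from the CPTs and verify Pr(A|P,C) = Pr(A|C).
from itertools import product

pr_c = 0.1                                 # C=True rows of the joint sum to 0.1
pr_a_given_c = {True: 0.4, False: 0.02}
pr_p_given_c = {True: 0.8, False: 0.4}

joint = {}
for c, a, p in product([False, True], repeat=3):
    pc = pr_c if c else 1 - pr_c
    pa = pr_a_given_c[c] if a else 1 - pr_a_given_c[c]
    pp = pr_p_given_c[c] if p else 1 - pr_p_given_c[c]
    joint[(c, a, p)] = pc * pa * pp        # Pr(C,A,P) = Pr(C) Pr(A|C) Pr(P|C)

# Pr(A=T | P=T, C=T) should equal Pr(A=T | C=T) = 0.4
pr_a_given_pc = joint[(True, True, True)] / (
    joint[(True, True, True)] + joint[(True, False, True)])
pr_a_given_c_only = sum(joint[(True, True, p)] for p in [False, True]) / pr_c
assert abs(pr_a_given_pc - 0.4) < 1e-9
assert abs(pr_a_given_c_only - 0.4) < 1e-9
```

The C=True rows come out exactly as in the joint table (e.g. Pr(T,T,T) = 0.1 × 0.4 × 0.8 = 0.032).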
Creating a Network
• View 1: a Bayes net is a representation of a joint probability distribution (JPD)
• View 2: a Bayes net is a set of conditional independence statements
• If we create a structure that correctly represents causality, we get a good network, i.e. one that is small (easy to compute with) and easy to fill in with numbers
Example
• My house alarm system just sounded (A)
• Both an earthquake (E) and a burglary (B) could set it off
• John will probably hear the alarm; if so he'll call (J). But sometimes John calls even when the alarm is silent
• Mary might hear the alarm and call too (M), but not as reliably
• We could be assured a complete and consistent model by fully specifying the joint distribution over all 2^5 = 32 assignments:
  • Pr(A, E, B, J, M)
  • Pr(A, E, B, J, ~M)
  • etc.
Structural Models (HK book 7.4.3)
• Instead of starting with numbers, we start with structural relationships among the variables:
  • There is a direct causal relationship from Earthquake to Alarm
  • There is a direct causal relationship from Burglar to Alarm
  • There is a direct causal relationship from Alarm to JohnCall
  • Earthquake and Burglar tend to occur independently
  • etc.
Possible Bayesian Network
• Nodes: Earthquake, Burglary, Alarm, JohnCalls, MaryCalls
• Edges: Earthquake → Alarm, Burglary → Alarm, Alarm → JohnCalls, Alarm → MaryCalls
Complete Bayesian Network
• Structure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls

  Pr(B) = .001      Pr(E) = .002

  B  E   Pr(A|B,E)
  T  T   .95
  T  F   .94
  F  T   .29
  F  F   .001

  A   Pr(J|A)        A   Pr(M|A)
  T   .90            T   .70
  F   .05            F   .01
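With the network complete, the probability of any full assignment follows from the chain rule over the CPTs. A sketch for one query; Pr(A|~B,~E) = 0.001 is the standard value for this textbook alarm example:

```python
# Sketch: Pr(J, M, A, ~B, ~E) by the chain rule over the alarm-network CPTs:
#   Pr(J|A) * Pr(M|A) * Pr(A|~B,~E) * Pr(~B) * Pr(~E)
p_b, p_e = 0.001, 0.002
p_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}
p_j_given_a = {True: 0.90, False: 0.05}
p_m_given_a = {True: 0.70, False: 0.01}

p = (p_j_given_a[True] * p_m_given_a[True]
     * p_a_given[(False, False)] * (1 - p_b) * (1 - p_e))
print(p)   # ≈ 0.000628
```

This is why the network is "easy to compute with": 10 CPT entries replace 31 independent joint-table entries.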
Microsoft Bayesian Belief Net
• http://research.microsoft.com/adapt/MSBNx/
• Can be used to construct and reason with Bayesian networks
• Consider the alarm example above
Mining for Structural Models
• Learning a Bayesian network from data is a difficult problem; several methods have been proposed, and it often requires a domain expert's knowledge
• Once set up, a Bayesian network can be used to answer probabilistic queries (e.g., in the Microsoft Bayesian Network software)
• Four learning problems:
  • Known structure, fully observable: the CPTs are to be learned
  • Unknown structure, fully observable: search over structures
  • Known structure, hidden variables: parameter learning, e.g. by gradient (hill-climbing) methods
  • Unknown structure, hidden variables: no good results
Hidden Variable (Han and Kamber's Data Mining book, pages 301-302)
• Assume that the Bayesian network structure is given, but some variables are hidden
• Our objective: find the CPT for all nodes
• Idea: use a gradient method
  • Let S be the set of training examples {X1, X2, ..., Xs}
  • Consider a variable Yi with parents Ui = {Parent1, Parent2, ...}
  • Question: what is Pr(Yi = yij | Ui = uik)?
  • Answer: learn this value from the data, iteratively
• Example: the tennis domain on the next slide
Learn CPT for a Hidden Variable
• Suppose we are in a tennis domain
• We wish to introduce a new variable not in our data set, called Field Temp, representing the temperature of the field
• Assume that we don't have a good way to measure it, but have to include it in our network
• Network: Windy → Field Temp ← Outlook
Learn the CPT
• Let wijk be the value of Pr(Yi = yij | Ui = uik), where Ui = {Parent1, Parent2, ...} are the parents of Yi
• Gradient step: compute a new wijk from the old one by climbing the gradient of the log-likelihood:
  wijk ← wijk + η × Σd Pr(Yi = yij, Ui = uik | Xd) / wijk
  where η is a learning rate and the sum runs over the training examples Xd
Example: Learn the CPT
• w = Pr(Field Temp=Hot | Windy=True, Outlook=Sunny)
• Let the old w be 0.5; compute a new w from the training data
• Normalize (so that each row of the CPT sums to 1), then iterate until the values are stable
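The update-and-normalize loop above can be sketched schematically. In this sketch, `posterior` stands in for real Bayesian-network inference of Pr(Yi = yij, Ui = uik | Xd); here it is a stub returning assumed toy values, and the variable names (FieldTemp, Windy, Outlook) are only illustrative:

```python
# Schematic sketch of one gradient step for CPT learning with hidden
# variables: each weight w[(u, y)] = Pr(FieldTemp=y | (Windy,Outlook)=u)
# is pushed by sum_d posterior(y, u, X_d) / w, then each CPT row (fixed u)
# is renormalized so its entries sum to 1.
LEARNING_RATE = 0.1

def posterior(y, u, example):
    # Stub: in a real system this is Pr(FieldTemp=y, (Windy,Outlook)=u | X_d)
    # computed by inference over the whole network; toy values here.
    return 0.3 if example.get(("Windy", "Outlook")) == u else 0.05

def gradient_step(w, data):
    new_w = {}
    for (u, y), wijk in w.items():
        grad = sum(posterior(y, u, d) / wijk for d in data)
        new_w[(u, y)] = wijk + LEARNING_RATE * grad
    # renormalize each parent configuration u
    for u in {u for (u, _) in new_w}:
        z = sum(v for (u2, _), v in new_w.items() if u2 == u)
        for (u2, y), v in list(new_w.items()):
            if u2 == u:
                new_w[(u2, y)] = v / z
    return new_w

# One step from the slide's starting point w = 0.5:
w0 = {(("T", "sunny"), "hot"): 0.5, (("T", "sunny"), "cold"): 0.5}
data = [{("Windy", "Outlook"): ("T", "sunny")}]
w1 = gradient_step(w0, data)
```

In practice the step would be repeated, recomputing the posteriors with the updated CPTs each time, until the weights stop changing.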