Explore practical problems with probabilistic models like the Monty Hall problem and joint distribution calculation. Understand Bayes' rule, chain rule, and conditional probability for advanced data analysis.
Directed Graphical Probabilistic Models
William W. Cohen
Machine Learning 10-601, Feb 2008
Some practical problems
• I have 3 standard d20 dice, 1 loaded die. (“Loaded” means P(19 or 20)=0.5.)
• Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = “the d20 picked is fair” and B = “roll 19 or 20 with that die”. What is P(B)?
  P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
• What is P(A=fair | B=criticalHit)?
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollaries:
• Chain rule: P(A ^ B) = P(A|B) P(B)
• Bayes rule: P(B|A) = P(A|B) P(B) / P(A)
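The open question from the d20 problem, P(A=fair | B=criticalHit), follows from the total-probability step plus Bayes rule. A minimal sketch using only the numbers stated in that problem; the variable names are illustrative, not part of the slides.

```python
from fractions import Fraction

# Numbers from the d20 problem: 3 fair dice, 1 loaded die.
p_fair = Fraction(3, 4)               # P(A)
p_crit_given_fair = Fraction(1, 10)   # P(B|A): 19 or 20 on a fair d20
p_crit_given_loaded = Fraction(1, 2)  # P(B|~A): "loaded" means P(19 or 20) = 0.5

# Total probability: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_crit = p_crit_given_fair * p_fair + p_crit_given_loaded * (1 - p_fair)

# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
p_fair_given_crit = p_crit_given_fair * p_fair / p_crit

print(p_crit)             # 1/5
print(p_fair_given_crit)  # 3/8
```

So a critical hit lowers the probability that the die is fair from 0.75 to 0.375.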
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1.
• The host, Monty Hall, opens one door, revealing… a goat!
• You now can either
  • stick with your guess
  • always change doors
  • flip a coin and pick a new door randomly according to the coin
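One way to compare the three strategies is to simulate the game. The sketch below is a rough Monte Carlo estimate, not an exact derivation. It assumes Monty always opens a goat door other than the contestant's pick, and it reads the coin strategy as a fair coin flip between the two doors that are still closed. The function names are hypothetical.

```python
import random

def play(strategy, rng):
    """Play one round; return True if the final pick wins the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    first = rng.choice(doors)
    # Monty opens a goat door that is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != first and d != prize])
    other = next(d for d in doors if d != first and d != opened)
    if strategy == "stick":
        final = first
    elif strategy == "switch":
        final = other
    else:  # "coin": fair coin between the two closed doors (an assumption)
        final = rng.choice([first, other])
    return final == prize

def win_rate(strategy, trials=100_000, seed=0):
    """Fraction of simulated rounds won under the given strategy."""
    rng = random.Random(seed)
    return sum(play(strategy, rng) for _ in range(trials)) / trials

for s in ("stick", "switch", "coin"):
    print(s, round(win_rate(s), 3))
```

The estimates should land near 1/3 for sticking, 2/3 for switching and 1/2 for the coin flip.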
Some practical problems
• I have 1 standard d6 die and 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50.
• Experiment: pick two d6 uniformly at random (A) and roll them. What is more likely: rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)
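The sum over the three combinations can be evaluated directly. This sketch needs one assumption the slide does not state: each loaded die spreads its remaining 0.5 of probability uniformly over its other five faces (0.1 each). With a different loading the numbers change.

```python
from fractions import Fraction
from itertools import combinations

faces = range(1, 7)
# Face distributions. ASSUMPTION: a loaded die spreads the remaining 0.5
# uniformly over its other five faces (the slide only fixes one face).
fair = {x: Fraction(1, 6) for x in faces}
high = {x: Fraction(1, 2) if x == 6 else Fraction(1, 10) for x in faces}
low = {x: Fraction(1, 2) if x == 1 else Fraction(1, 10) for x in faces}
dice = {"F": fair, "H": high, "L": low}

def p_event(d1, d2, event):
    """P(event | the two chosen dice), summing over all 36 outcomes."""
    return sum(d1[x] * d2[y] for x in faces for y in faces if event(x, y))

pairs = list(combinations(dice, 2))  # FH, FL, HL: each picked with prob 1/3
p_pair = Fraction(1, len(pairs))

# P(D) = sum over combinations of P(D | A=pair) * P(A=pair)
p_doubles = sum(p_event(dice[a], dice[b], lambda x, y: x == y) * p_pair
                for a, b in pairs)
p_seven = sum(p_event(dice[a], dice[b], lambda x, y: x + y == 7) * p_pair
              for a, b in pairs)

print(p_doubles, p_seven)  # 71/450 19/90
```

Under that assumption a seven (19/90, about 0.211) is more likely than doubles (71/450, about 0.158).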
A brute-force solution
• A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk.
• With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
  • P(doubles)
  • P(seven or eleven)
  • P(total is higher than 5)
  • …
[Table residue: a “Comment” column marks which rows are doubles and which are sevens.]
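The brute-force idea is short to write down: store one probability per row and sum the rows where the event holds. The sketch below assumes two fair, independent d6 so that every row has probability 1/36; that choice is only for illustration, and `prob` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

# Joint table over (X1, X2): one row per combination of values.
# ASSUMPTION: two fair, independent d6, so every row has probability 1/36.
joint = {(x1, x2): Fraction(1, 36) for x1, x2 in product(range(1, 7), repeat=2)}

def prob(event):
    """P(event): sum the rows of the joint table where the event holds."""
    return sum(p for row, p in joint.items() if event(*row))

print(prob(lambda a, b: a == b))            # doubles: 1/6
print(prob(lambda a, b: a + b in (7, 11)))  # seven or eleven: 2/9
print(prob(lambda a, b: a + b > 5))         # total higher than 5: 13/18
```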
The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
• Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
• For each combination of values, say how probable it is.
• If you subscribe to the axioms of probability, those numbers must sum to 1.
[Figure: diagram of A, B, C whose eight regions carry the probabilities 0.05, 0.10, 0.05, 0.10, 0.25, 0.05, 0.10, 0.30.]
Using the Joint
Once you have the joint distribution (JD), you can ask for the probability of any logical expression involving your attributes.
Using the Joint
P(Poor ^ Male) = 0.4654
Using the Joint P(Poor) = 0.7604
Inference with the Joint P(Male | Poor) = 0.4654 / 0.7604 = 0.612
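The same two-step pattern (sum matching rows, then divide) works for any joint table. The census table behind the Poor/Male numbers is not reproduced in this text, so this sketch uses the Boolean A, B, C example instead. The eight probabilities are the ones that appear with that example, but which row gets which value is an assumption made purely for illustration. `prob` and `cond_prob` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import product

# Truth table over Boolean A, B, C (2**3 = 8 rows).
# ASSUMPTION: the eight probabilities come from the slide, but the
# assignment of values to rows is a guess made only for illustration.
percent = [30, 5, 10, 5, 5, 10, 25, 10]
joint = {row: Fraction(p, 100)
         for row, p in zip(product([0, 1], repeat=3), percent)}
assert sum(joint.values()) == 1  # axioms of probability

def prob(event):
    """P(event): sum of the rows where the event holds."""
    return sum(p for (a, b, c), p in joint.items() if event(a, b, c))

def cond_prob(event, given):
    """P(event | given) = P(event ^ given) / P(given)."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

# P(A=1 | B=1) under the assumed row assignment
print(cond_prob(lambda a, b, c: a == 1, lambda a, b, c: b == 1))  # 7/10
```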
Inference is a big deal
• I’ve got this evidence. What’s the chance that this conclusion is true?
  • I’ve got a sore neck: how likely am I to have meningitis?
  • I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
  • …
• Lots and lots of important problems can be solved algorithmically using joint inference.
• How do you get from the statement of the problem to the joint probability table? (What is a “statement of the problem”?)
• The joint probability table is usually too big to represent explicitly, so how do you represent it compactly?
Problem 1: A description of the experiment
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = “the d20 picked is fair” and B = “roll 19 or 20 with that die”. What is P(B)?
[Figure: network A → B, annotated with the tables below]
P(A=fair)=0.75
P(A=loaded)=0.25
P(B=critical | A=fair)=0.1
P(B=noncritical | A=fair)=0.9
P(B=critical | A=loaded)=0.5
P(B=noncritical | A=loaded)=0.5
Problem 1: A description of the experiment
Analogy: problem with probabilities : Bayes net :: word problem : algebra
This is
• an “influence diagram” (informal: what influences what)
• a Bayes net / belief network / … (formal: more later!)
• a description of the experiment (i.e., how data could be generated)
• everything you need to reconstruct the joint distribution
Chain rule: P(A,B) = P(B|A) P(A)
[Figure: network A → B]
Problem 2: a description
• I have 1 standard d6 die and 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50.
• Experiment: pick two d6 without replacement (D1, D2) and roll them. What is more likely: rolling a seven or rolling doubles?
[Figure: network over D1, D2, R]
Problem 2: a description
Chain rule: P(A ^ B) = P(A|B) P(B)
P(R,D1,D2) = P(R|D1,D2) P(D1,D2)
P(R,D1,D2) = P(R|D1,D2) P(D2|D1) P(D1)
[Figure: network over D1, D2, R]
Problem 2: another description
[Figure: network over D1, D2, R1, R2, Is7, with R1=1,2,…,6; R2=1,2,…,6; Is7=yes,no]
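Because a network like this describes how data could be generated, one can sample from it node by node, parents first. The sketch below assumes the edges D1 → D2, D1 → R1, D2 → R2 and (R1, R2) → Is7, which the text does not spell out, and again assumes each loaded die puts 0.1 on each of its non-favoured faces. It gives a Monte Carlo estimate, not an exact answer.

```python
import random

FAIR = [1 / 6] * 6
HIGH = [0.1] * 5 + [0.5]   # ASSUMPTION: non-favoured faces share 0.5 equally
LOW = [0.5] + [0.1] * 5    # same assumption
DICE = {"F": FAIR, "H": HIGH, "L": LOW}

def sample(rng):
    """Draw one (D1, D2, R1, R2, Is7), sampling each node after its parents."""
    d1 = rng.choice("FHL")                               # P(D1)
    d2 = rng.choice([d for d in "FHL" if d != d1])       # P(D2 | D1): no replacement
    r1 = rng.choices(range(1, 7), weights=DICE[d1])[0]   # P(R1 | D1)
    r2 = rng.choices(range(1, 7), weights=DICE[d2])[0]   # P(R2 | D2)
    is7 = (r1 + r2 == 7)                                 # P(Is7 | R1, R2) is deterministic
    return d1, d2, r1, r2, is7

def estimate_p_seven(trials=100_000, seed=0):
    """Estimate P(Is7=yes) as the fraction of samples that total seven."""
    rng = random.Random(seed)
    return sum(sample(rng)[4] for _ in range(trials)) / trials

print(estimate_p_seven())
```

Under these assumptions the estimate should be close to 19/90, roughly 0.211.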
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1.
• The host, Monty Hall, opens one door, revealing… a goat!
• You now can either stick with your guess or change doors.
[Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick, or swap?), E (second guess)]
The (highly practical) Monty Hall problem
We could construct the joint and compute P(E=B | D=swap)
…again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
[Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)]
The (highly practical) Monty Hall problem
The joint table has…? 3*3*3*2*3 = 162 rows
The conditional probability tables (CPTs) shown have…? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows
Big questions:
• why are the CPTs smaller?
• how much smaller are the CPTs than the joint?
• can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?
[Figure: Monty Hall network over A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)]
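For a network this small the joint really can be built from the CPTs with the chain rule and then queried. The CPT entries below are assumptions filled in from the story: the first guess and the prize location are uniform, Monty picks uniformly among doors that are neither the guess nor the prize, and stick/swap is 50/50. None of these tables are given numerically in the text.

```python
from fractions import Fraction
from itertools import product

DOORS = (1, 2, 3)
MOVES = ("stick", "swap")

def uniform(value, options):
    """Probability of `value` under a uniform choice among `options`."""
    return Fraction(1, len(options)) if value in options else Fraction(0)

def p_a(a):                 # first guess (ASSUMED uniform)
    return Fraction(1, 3)

def p_b(b):                 # where the money is (ASSUMED uniform)
    return Fraction(1, 3)

def p_c(c, a, b):           # goat door Monty opens: neither the guess nor the money
    return uniform(c, [x for x in DOORS if x not in (a, b)])

def p_d(d):                 # ASSUMPTION: stick or swap with probability 1/2 each
    return Fraction(1, 2)

def p_e(e, a, c, d):        # second guess
    if d == "stick":
        return Fraction(int(e == a))
    return uniform(e, [x for x in DOORS if x not in (a, c)])

# Chain rule: P(A,B,C,D,E) = P(E|A,C,D) P(D) P(C|A,B) P(B) P(A)
joint = {(a, b, c, d, e): p_e(e, a, c, d) * p_d(d) * p_c(c, a, b) * p_b(b) * p_a(a)
         for a, b, c, d, e in product(DOORS, DOORS, DOORS, MOVES, DOORS)}

def prob(event):
    return sum(p for row, p in joint.items() if event(*row))

p_win_given_swap = (prob(lambda a, b, c, d, e: e == b and d == "swap")
                    / prob(lambda a, b, c, d, e: d == "swap"))
print(len(joint), sum(joint.values()), p_win_given_swap)  # 162 1 2/3
```

Many of the 162 rows have probability zero (for example, any row where Monty opens the door hiding the money), which hints at why the factored form is more compact.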
The (highly practical) Monty Hall problem
Why is the CPT representation smaller? Follow the money! (B)
E is conditionally independent of B given A, D, C.
[Figure: Monty Hall network over A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)]
Conditional Independence formalized
Definition: R and L are conditionally independent given M if for all x, y, z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: let S1, S2 and S3 be sets of variables. S1 and S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) = P(S1’s assignments | S3’s assignments)
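The definition can be checked mechanically on any joint table by comparing both sides for every x, y, z. The sketch below builds a small joint in which M is a common cause of R and L. All the probabilities in it are hypothetical, picked only so that the check has something to run on.

```python
from fractions import Fraction as F
from itertools import product

# HYPOTHETICAL numbers: M is a common cause of R and L.
p_m = {True: F(3, 10), False: F(7, 10)}
p_r_given_m = {True: F(8, 10), False: F(1, 10)}   # P(R=T | M)
p_l_given_m = {True: F(6, 10), False: F(2, 10)}   # P(L=T | M)

def bern(p_true, value):
    """Probability of a Boolean value given P(value=True)."""
    return p_true if value else 1 - p_true

joint = {(r, m, l): p_m[m] * bern(p_r_given_m[m], r) * bern(p_l_given_m[m], l)
         for r, m, l in product([True, False], repeat=3)}

def prob(event):
    return sum(p for row, p in joint.items() if event(*row))

def cond_indep_r_l_given_m():
    """Check P(R=x | M=y ^ L=z) == P(R=x | M=y) for all x, y, z."""
    for x, y, z in product([True, False], repeat=3):
        lhs = (prob(lambda r, m, l: (r, m, l) == (x, y, z))
               / prob(lambda r, m, l: (m, l) == (y, z)))
        rhs = (prob(lambda r, m, l: (r, m) == (x, y))
               / prob(lambda r, m, l: m == y))
        if lhs != rhs:
            return False
    return True

# R and L are independent given M, yet not independent on their own.
marginally_indep = (prob(lambda r, m, l: r and l)
                    == prob(lambda r, m, l: r) * prob(lambda r, m, l: l))
print(cond_indep_r_l_given_m(), marginally_indep)  # True False
```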
The (highly practical) Monty Hall problem
What are the conditional independencies?
• I<A, {B}, C> ?
• I<A, {C}, B> ?
• I<E, {A,C}, B> ?
• I<D, {E}, B> ?
• …
[Figure: Monty Hall network over A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)]
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
Building a Bayes Net
• Choose a set of relevant variables.
• Choose an ordering for them.
• Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
• For i = 1 to m:
  • Add the Xi node to the network
  • Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi)
  • Define the probability table of P(Xi=k | assignments of Parents(Xi)).
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
= :
= product over i=1..n of P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= product over i=1..n of P(Xi=xi | assignments of Parents(Xi))
So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
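In code, computing one entry of the joint is a single pass over the nodes, multiplying each node's CPT entry for its parents' values. The dictionary layout below is just one possible representation, not a standard API. The numbers are those of the d20 network A → B from Problem 1.

```python
# Minimal Bayes-net sketch: each node stores its parents and a CPT keyed by
# (tuple of parent values) -> {value: probability}. The layout is illustrative.
net = {
    "A": {"parents": (), "cpt": {(): {"fair": 0.75, "loaded": 0.25}}},
    "B": {"parents": ("A",),
          "cpt": {("fair",): {"critical": 0.1, "noncritical": 0.9},
                  ("loaded",): {"critical": 0.5, "noncritical": 0.5}}},
}

def joint_prob(net, assignment):
    """P(X1=x1 ^ ... ^ Xn=xn) = product over i of P(Xi=xi | Parents(Xi))."""
    p = 1.0
    for name, node in net.items():
        parent_values = tuple(assignment[q] for q in node["parents"])
        p *= node["cpt"][parent_values][assignment[name]]
    return p

print(joint_prob(net, {"A": "loaded", "B": "critical"}))  # 0.125
```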
Building a Bayes Net: Example 1
D1 = first die; D2 = second die; R = roll; Is7
• Choose a set of relevant variables.
• Choose an ordering for them: D1, D2, R, Is7.
• Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
• For i = 1 to m:
  • Add the Xi node to the network
  • Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi)
  • Define the probability table of P(Xi=k | values of Parents(Xi))
Result: P(D1,D2,R,Is7) = P(Is7|R) * P(R|D1,D2) * P(D2|D1) * P(D1)
[Figure: network over D1, D2, R, Is7]
Another construction
Pick the order D1, D2, R1, R2, Is7. The network will be:
[Figure: network over D1, D2, R1, R2, Is7, with R1=1,2,…,6; R2=1,2,…,6; Is7=yes,no]
• What if I pick other orders? What about R2, R1, D2, D1, Is7?
• What about the first network, A → B? Suppose I picked the order B, A?
Another simple example
Bigram language model:
• Pick word w0 from multinomial 1: P(w0)
• For t=1 to 100
  • Pick word wt from multinomial 2: P(wt|wt-1)
[Figure: chain w0 → w1 → w2 → … → w100]
A simple example
Bigram language model:
• Pick word w0 from multinomial 1: P(w0)
• For t=1 to 100
  • Pick word wt from multinomial 2: P(wt|wt-1)
Size of CPTs: |V| + |V|^2
Size of joint table: |V|^101
|V|=20 for amino acids (proteins)
[Figure: chain w0 → w1 → w2 → … → w100]
A simple example
Trigram language model:
• Pick word w0 from a multinomial M1: P(w0)
• Pick word w1 from a multinomial M2: P(w1|w0)
• For t=2 to 100
  • Pick word wt from multinomial M3: P(wt|wt-1,wt-2)
Size of CPTs: |V| + |V|^2 + |V|^3
Size of joint table: |V|^101
|V|=20 for amino acids (proteins)
[Figure: network over w0, w1, w2, …, w99, w100 in which each wt has parents wt-1 and wt-2]
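Plugging in |V| = 20 makes the gap concrete. This short sketch evaluates the size formulas above. It assumes, as those counts do, that the same table P(wt|wt-1) (or P(wt|wt-1,wt-2)) is shared by every position t.

```python
V = 20  # |V| = 20 for amino acids, as on the slide

bigram_cpt = V + V**2           # P(w0) and the shared P(wt | wt-1)
trigram_cpt = V + V**2 + V**3   # plus the shared P(wt | wt-1, wt-2)
joint_rows = V**101             # one row per sequence w0..w100

print(bigram_cpt)            # 420
print(trigram_cpt)           # 8420
print(len(str(joint_rows)))  # 132 (number of decimal digits in 20**101)
```

A few thousand CPT entries stand in for a joint table whose row count has 132 digits.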
What Independencies does a Bayes Net Model?
• In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.
• This implies P(X1, …, Xn) = product over i of P(Xi | Parents(Xi)).
• But what else does it imply?
What Independencies does a Bayes Net Model?
• Example:
[Figure: network over Z, Y, X in which Y is X’s only parent]
Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y, Z) equal to P(X|Y)?
Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z given Y. Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).
What Independencies does a Bayes Net Model?
[Figure: network over Y, X, Z]
• Let I<X,Y,Z> represent X and Z being conditionally independent given Y.
• I<X,Y,Z>? Yes, just as in the previous example: all of X’s parents are given, and Z is not a descendant.
What Independencies does a Bayes Net Model?
[Figure: network over Z, U, V, X]
• I<X,{U},Z>? No.
• I<X,{U,V},Z>? Yes.
• Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?
Things get a little more confusing
[Figure: network over X, Z, Y]
• X has no parents, so we know all its parents’ values trivially
• Z is not a descendant of X
• So, I<X,{},Z>, even though there’s an undirected path from X to Z through an unknown variable Y.
• What if we do know the value of Y, though? Or one of its descendants?
The “Burglar Alarm” example
[Figure: network over Burglar, Earthquake, Alarm, Phone Call]
• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• Earth arguably doesn’t care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!
Things get a lot more confusing
[Figure: network over Burglar, Earthquake, Alarm, Phone Call]
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake “explains away” the hypothetical burglar.
• But then it must not be the case that I<Burglar,{Phone Call}, Earthquake>, even though I<Burglar,{}, Earthquake>!
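Explaining away shows up numerically once the network has CPTs. The sketch below assumes the structure Burglar → Alarm ← Earthquake and Alarm → Phone Call, and every probability in it is hypothetical, chosen only to make the effect visible. `p_burglar_given` is an illustrative helper that sums rows of the joint.

```python
from itertools import product

# HYPOTHETICAL CPT numbers, chosen only to illustrate the effect.
P_BURGLAR = 0.01
P_QUAKE = 0.02
P_ALARM = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}  # P(Alarm=T | B, E)
P_CALL = {True: 0.90, False: 0.05}                       # P(Call=T | Alarm)

def bern(p_true, value):
    """Probability of a Boolean value given P(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c):
    """P(B,E,A,C) = P(C|A) P(A|B,E) P(E) P(B)."""
    return (bern(P_BURGLAR, b) * bern(P_QUAKE, e)
            * bern(P_ALARM[(b, e)], a) * bern(P_CALL[a], c))

def p_burglar_given(**evidence):
    """P(Burglar=T | evidence) by summing matching rows of the joint."""
    num = den = 0.0
    for b, e, a, c in product([True, False], repeat=4):
        row = {"burglar": b, "quake": e, "alarm": a, "call": c}
        if all(row[k] == v for k, v in evidence.items()):
            p = joint(b, e, a, c)
            den += p
            num += p if b else 0.0
    return num / den

print(p_burglar_given(call=True))              # call alone raises P(Burglar)
print(p_burglar_given(call=True, quake=True))  # earthquake explains it away
print(p_burglar_given(quake=True))             # no call: back to the prior
```

With these made-up numbers, the phone call raises the burglar probability well above its prior and the earthquake report pulls it most of the way back down, while the earthquake alone leaves it at the prior.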
d-separation to the rescue
• Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
• Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ...
• I.e., X and Z can be dependent only if there exists an unblocked path.
A path is “blocked” when...
• There exists a variable Y on the path such that
  • it is in the evidence set E
  • the arcs putting Y in the path are “tail-to-tail”
  (unknown “common causes” of X and Z impose dependency)
• Or, there exists a variable Y on the path such that
  • it is in the evidence set E
  • the arcs putting Y in the path are “tail-to-head”
  (unknown “causal chains” connecting X and Z impose dependency)
• Or, ...
A path is “blocked” when… (the funky case)
• … Or, there exists a variable Y on the path such that
  • it is NOT in the evidence set E
  • neither are any of its descendants
  • the arcs putting Y on the path are “head-to-head”
  (known “common symptoms” of X and Z impose dependencies… X may “explain away” Z)
d-separation to the rescue, cont’d
• Theorem [Verma & Pearl, 1988]:
  • If a set of evidence variables E d-separates X and Z in a Bayesian network’s graph, then I<X, E, Z>.
• d-separation can be computed in linear time using a depth-first-search-like algorithm.
• Great! We now have a fast algorithm for automatically inferring whether learning the value of one variable might give us any additional hints about some other variable, given what we already know.
  • “Might”: Variables may actually be independent when they’re not d-separated, depending on the actual probabilities involved.
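A d-separation test fits in a few dozen lines. The sketch below does not implement the linear-time depth-first algorithm mentioned above. It swaps in an equivalent and simpler criterion: restrict the graph to the ancestors of X, Z and E, "moralize" it (connect co-parents and drop edge directions), delete the evidence nodes, and check whether X can still reach Z. The function names and the `parents` dictionary layout are illustrative, and the example graph assumes Burglar → Alarm ← Earthquake and Alarm → PhoneCall.

```python
def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def d_separated(x, z, evidence, parents):
    """True if evidence d-separates x from z in the DAG given by `parents`.

    Assumes x and z are not themselves in the evidence set.
    """
    keep = ancestors_of({x, z} | set(evidence), parents)
    # Moralize the ancestral subgraph: link each node to its parents and
    # link co-parents of a common child, then ignore edge directions.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Search from x without passing through evidence nodes.
    blocked = set(evidence)
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n == z:
            return False
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return True

parents = {"Alarm": ["Burglar", "Earthquake"], "PhoneCall": ["Alarm"]}
print(d_separated("Burglar", "Earthquake", [], parents))             # True
print(d_separated("Burglar", "Earthquake", ["PhoneCall"], parents))  # False
print(d_separated("Burglar", "PhoneCall", ["Alarm"], parents))       # True
```

The second call reproduces the explaining-away case: observing the phone call, a descendant of the head-to-head node Alarm, unblocks the path between Burglar and Earthquake.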