Directed Graphical Probabilistic Models William W. Cohen Machine Learning 10-601 Feb 2008
Some practical problems
• I have 3 standard d20 dice and 1 loaded d20. “Loaded” means P(roll 19 or 20) = 0.5.
• Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die". What is P(B)?
P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
• What is P(A=fair | B=criticalHit)?
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollaries:
• Chain rule: P(A ^ B) = P(A|B) P(B)
• Bayes rule: P(B|A) = P(A|B) P(B) / P(A)
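As a quick numeric check, here is a minimal Python sketch (mine, not from the slides) that plugs the d20 numbers above into the law of total probability and Bayes' rule:

```python
# Law of total probability and Bayes' rule for the loaded-d20 example.
p_fair = 0.75                  # P(A = fair): 3 of the 4 dice are fair
p_crit_given_fair = 2 / 20     # P(roll 19 or 20 | fair die) = 0.1
p_crit_given_loaded = 0.5      # P(roll 19 or 20 | loaded die)

p_crit = p_crit_given_fair * p_fair + p_crit_given_loaded * (1 - p_fair)
p_fair_given_crit = p_crit_given_fair * p_fair / p_crit   # Bayes' rule

print("P(B) =", p_crit)                                   # 0.2
print("P(A=fair | B=criticalHit) =", p_fair_given_crit)   # 0.075 / 0.2 = 0.375
```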
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1.
• The host, Monty Hall, opens one door, revealing… a goat!
• You now can either
• stick with your guess
• always change doors
• flip a coin and pick a new door randomly according to the coin
Some practical problems
• I have 1 standard d6 die and 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50.
• Experiment: pick two of the d6 uniformly at random (call the pair A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL. Let D = "rolled doubles".
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)
A brute-force solution
• A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk.
• With this you can compute any P(A) where A is any Boolean combination of the primitive events (Xi=xi), e.g.
• P(doubles)
• P(seven or eleven)
• P(total is higher than 5)
• …
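A brute-force sketch of exactly this table for the d6 experiment above (my Python, not slide code). The slide does not say how the loaded dice distribute their remaining probability; the sketch assumes the other five faces share it evenly (0.1 each):

```python
# Brute-force joint over (which pair of dice, first roll, second roll).
from itertools import product

FAIR = {f: 1/6 for f in range(1, 7)}
HIGH = {f: (0.5 if f == 6 else 0.1) for f in range(1, 7)}   # loaded high
LOW  = {f: (0.5 if f == 1 else 0.1) for f in range(1, 7)}   # loaded low

pairs = {"HL": (HIGH, LOW), "HF": (HIGH, FAIR), "FL": (FAIR, LOW)}

joint = {}
for name, (d1, d2) in pairs.items():            # each pair is picked w.p. 1/3
    for r1, r2 in product(range(1, 7), repeat=2):
        joint[(name, r1, r2)] = (1/3) * d1[r1] * d2[r2]

p_doubles = sum(p for (_, r1, r2), p in joint.items() if r1 == r2)
p_seven   = sum(p for (_, r1, r2), p in joint.items() if r1 + r2 == 7)
print("P(doubles) =", round(p_doubles, 4))   # ~0.158
print("P(seven)   =", round(p_seven, 4))     # ~0.211 -> seven is more likely
```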
The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
• Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
• For each combination of values, say how probable it is.
• If you subscribe to the axioms of probability, those numbers must sum to 1.
[Example truth table over A, B, C with eight probability entries (0.05, 0.10, 0.05, 0.10, 0.25, 0.05, 0.10, 0.30) summing to 1.]
Using the Joint
Once you have the joint distribution you can ask for the probability of any logical expression E involving your attributes: P(E) is the sum of the probabilities of the rows in which E is true.
Using the Joint
P(Poor ^ Male) = 0.4654
P(Poor) = 0.7604
Inference with the Joint
P(Male | Poor) = P(Poor ^ Male) / P(Poor) = 0.4654 / 0.7604 = 0.612
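In code, these queries are just sums over rows of the joint. The actual table behind the numbers above is not shown in this deck, so the joint below is a made-up stand-in and its numbers are illustrative only:

```python
# Querying a joint distribution: marginals and conditionals are sums of rows.
# Hypothetical joint over two Boolean attributes, keyed by (male, poor).
joint = {
    (True,  True):  0.25, (True,  False): 0.20,
    (False, True):  0.35, (False, False): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

def prob(event):
    """P(event), where event is a predicate over a row of the table."""
    return sum(p for row, p in joint.items() if event(row))

p_poor_and_male = prob(lambda r: r[1] and r[0])
p_poor          = prob(lambda r: r[1])
print("P(Male | Poor) =", p_poor_and_male / p_poor)   # P(Poor ^ Male) / P(Poor)
```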
Inference is a big deal
• I’ve got this evidence. What’s the chance that this conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
• …
• Lots and lots of important problems can be solved algorithmically using joint inference.
• How do you get from the statement of the problem to the joint probability table? (What is a “statement of the problem”?)
• The joint probability table is usually too big to represent explicitly, so how do you represent it compactly?
Problem 1: A description of the experiment
• I have 3 standard d20 dice and 1 loaded d20.
• Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)?
Network: A → B
P(A=fair)=0.75, P(A=loaded)=0.25
P(B=critical | A=fair)=0.1, P(B=noncritical | A=fair)=0.9
P(B=critical | A=loaded)=0.5, P(B=noncritical | A=loaded)=0.5
Problem 1: A description of the experiment
• This is
• an “influence diagram” (informal: what influences what)
• a Bayes net / belief network / … (formal: more later!)
• a description of the experiment (i.e., how data could be generated)
• everything you need to reconstruct the joint distribution
Chain rule: P(A,B) = P(B|A) P(A)
Network: A → B
Analogy – problem with probabilities : Bayes net :: word problem : algebra
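A small sketch (my code) of the “everything you need to reconstruct the joint” claim: multiply the two CPTs above via the chain rule, then answer the queries from the earlier slide:

```python
# Rebuild the joint over (A, B) from the CPTs: P(A, B) = P(B|A) P(A).
p_A = {"fair": 0.75, "loaded": 0.25}
p_B_given_A = {
    "fair":   {"critical": 0.1, "noncritical": 0.9},
    "loaded": {"critical": 0.5, "noncritical": 0.5},
}

joint = {(a, b): p_A[a] * p_B_given_A[a][b]
         for a in p_A for b in ("critical", "noncritical")}
assert abs(sum(joint.values()) - 1.0) < 1e-9

p_crit = joint[("fair", "critical")] + joint[("loaded", "critical")]
print("P(B=critical) =", p_crit)                                         # 0.2
print("P(A=fair | B=critical) =", joint[("fair", "critical")] / p_crit)  # 0.375
```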
Problem 2: a description
• I have 1 standard d6 die and 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50.
• Experiment: pick two d6 without replacement (D1, D2) and roll them. What is more likely – rolling a seven or rolling doubles?
Variables: D1, D2 (the dice picked), R (the result of the roll).
Problem 2: a description
P(R,D1,D2) = P(R|D1,D2) P(D1,D2) = P(R|D1,D2) P(D2|D1) P(D1)
Chain rule: P(A ^ B) = P(A|B) P(B)
Network: D1 → D2, D1 → R, D2 → R
Problem 2: another description
Variables: D1, D2 (the dice picked), R1, R2 (the two rolls, each 1, 2, …, 6), Is7 (yes/no).
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1.
• The host, Monty Hall, opens one door, revealing… a goat!
• You now can either stick with your guess or change doors.
Variables: A = first guess, B = the money, C = the goat (door opened), D = stick or swap?, E = second guess.
The (highly practical) Monty Hall problem
We could construct the joint and compute P(E=B | D=swap) … again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
(A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess)
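Here is a sketch (my Python, not from the lecture) that builds that joint from the factorization above and answers P(E=B | D=swap). The CPTs use the standard Monty Hall assumptions – uniform first guess and prize door, the host opens a non-chosen non-prize door uniformly, and stick/swap is chosen with probability 0.5 – none of which are spelled out numerically on the slide:

```python
# Build the Monty Hall joint from P(E|A,C,D) P(D) P(C|A,B) P(B) P(A).
from itertools import product

DOORS = (1, 2, 3)

def p_C_given_AB(c, a, b):
    """Host opens door c: never the guess a, never the prize b."""
    options = [d for d in DOORS if d != a and d != b]
    return 1.0 / len(options) if c in options else 0.0

def p_E_given_ACD(e, a, c, d):
    """Second guess: keep a if sticking, otherwise take the remaining door."""
    if d == "stick":
        return 1.0 if e == a else 0.0
    remaining = [x for x in DOORS if x != a and x != c]
    return 1.0 if [e] == remaining else 0.0

joint = {}
for a, b, c, d, e in product(DOORS, DOORS, DOORS, ("stick", "swap"), DOORS):
    p = (p_E_given_ACD(e, a, c, d) * 0.5           # P(D): uniform stick/swap
         * p_C_given_AB(c, a, b) * (1/3) * (1/3))  # P(C|A,B) P(B) P(A)
    joint[(a, b, c, d, e)] = p

win_swap  = sum(p for (a, b, c, d, e), p in joint.items() if d == "swap"  and e == b)
p_swap    = sum(p for (a, b, c, d, e), p in joint.items() if d == "swap")
win_stick = sum(p for (a, b, c, d, e), p in joint.items() if d == "stick" and e == b)
p_stick   = sum(p for (a, b, c, d, e), p in joint.items() if d == "stick")
print("P(E=B | D=swap)  =", win_swap / p_swap)     # 2/3
print("P(E=B | D=stick) =", win_stick / p_stick)   # 1/3
```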
The (highly practical) Monty Hall problem
The joint table has…? 3*3*3*2*3 = 162 rows.
The conditional probability tables (CPTs) shown have…? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows.
Big questions:
• why are the CPTs smaller?
• how much smaller are the CPTs than the joint?
• can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?
The (highly practical) Monty Hall problem
Why is the CPT representation smaller? Follow the money! (B)
E is conditionally independent of B given A, D, C.
Conditional Independence formalized
Definition: R and L are conditionally independent given M if for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: Let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments ^ S3’s assignments) = P(S1’s assignments | S3’s assignments)
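A direct implementation of this definition (my sketch): check P(S1 | S2, S3) = P(S1 | S3) over every assignment. The joint used in the demo is built from hypothetical CPTs for a chain X → Y → Z, so the check should come out true:

```python
# Check conditional independence of two Boolean variables given a third,
# directly from a joint table keyed by full assignments.
from itertools import product

def cond_indep(joint, i1, i2, i3, tol=1e-9):
    """Check I<X_i1, {X_i3}, X_i2> for a joint over Boolean variables (by index)."""
    for x1, x2, x3 in product([True, False], repeat=3):
        p123 = sum(p for a, p in joint.items() if a[i1] == x1 and a[i2] == x2 and a[i3] == x3)
        p23  = sum(p for a, p in joint.items() if a[i2] == x2 and a[i3] == x3)
        p13  = sum(p for a, p in joint.items() if a[i1] == x1 and a[i3] == x3)
        p3   = sum(p for a, p in joint.items() if a[i3] == x3)
        if p23 > tol and p3 > tol and abs(p123 / p23 - p13 / p3) > tol:
            return False
    return True

# Hypothetical chain X -> Y -> Z, so X and Z should be independent given Y.
joint = {}
for x, y, z in product([True, False], repeat=3):
    px = 0.3 if x else 0.7
    py = (0.9 if y else 0.1) if x else (0.2 if y else 0.8)
    pz = (0.6 if z else 0.4) if y else (0.25 if z else 0.75)
    joint[(x, y, z)] = px * py * pz
print(cond_indep(joint, i1=0, i2=2, i3=1))   # True: I<X, {Y}, Z>
```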
The (highly practical) Monty Hall problem
What are the conditional independencies?
• I<A, {B}, C> ?
• I<A, {C}, B> ?
• I<E, {A,C}, B> ?
• I<D, {E}, B> ?
• …
Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V , E where: • V is a set of vertices. • E is a set of directed edges joining vertices. No loops of any length are allowed. Each vertex in V contains the following information: • The name of a random variable • A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
Building a Bayes Net
• Choose a set of relevant variables.
• Choose an ordering for them.
• Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
• For i = 1 to m:
• Add the Xi node to the network
• Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi)
• Define the probability table of P(Xi=k | Assignments of Parents(Xi)).
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn | X1=x1 ^ … ^ Xn-1=xn-1) * P(X1=x1 ^ … ^ Xn-1=xn-1)
= P(Xn=xn | X1=x1 ^ … ^ Xn-1=xn-1) * P(Xn-1=xn-1 | X1=x1 ^ … ^ Xn-2=xn-2) * P(X1=x1 ^ … ^ Xn-2=xn-2)
= …
= Πi P(Xi=xi | X1=x1 ^ … ^ Xi-1=xi-1)
So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
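In code, one joint entry is just a product of CPT lookups; once the network’s conditional independences are applied (as on the next slide), each factor conditions only on that variable’s parents. A generic sketch (my code), demonstrated on the two-node fair/loaded-d20 net from earlier:

```python
# Any entry of the joint = product of one CPT lookup per variable.
def joint_entry(assignment, parents, cpt):
    """assignment: {var: value}; parents: {var: [parent vars]};
    cpt[var] maps (value, tuple-of-parent-values) -> probability."""
    p = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][(value, parent_vals)]
    return p

parents = {"A": [], "B": ["A"]}
cpt = {
    "A": {("fair", ()): 0.75, ("loaded", ()): 0.25},
    "B": {("critical", ("fair",)): 0.1, ("noncritical", ("fair",)): 0.9,
          ("critical", ("loaded",)): 0.5, ("noncritical", ("loaded",)): 0.5},
}
print(joint_entry({"A": "loaded", "B": "critical"}, parents, cpt))  # 0.25 * 0.5 = 0.125
```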
Building a Bayes Net: Example 1
• Choose a set of relevant variables: D1 – first die; D2 – second die; R – roll; Is7.
• Choose an ordering for them: D1, D2, R, Is7.
• Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
• For i = 1 to m:
• Add the Xi node to the network
• Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi)
• Define the probability table of P(Xi=k | values of Parents(Xi))
Result: P(D1,D2,R,Is7) = P(Is7|R) * P(R|D1,D2) * P(D2|D1) * P(D1)
Another construction
Pick the order D1, D2, R1, R2, Is7. The network will be over D1, D2 (dice picked), R1, R2 = 1, 2, …, 6 (the rolls), and Is7 = yes/no.
• What if I pick other orders? What about R2, R1, D2, D1, Is7?
• What about the first network A → B? Suppose I picked the order B, A?
Another simple example
• Bigram language model:
• Pick word w0 from a multinomial M1, P(w0)
• For t = 1 to 100:
• Pick word wt from a multinomial M2, P(wt|wt-1)
Network: w0 → w1 → w2 → … → w100
A simple example
• Bigram language model:
• Pick word w0 from a multinomial M1, P(w0)
• For t = 1 to 100:
• Pick word wt from a multinomial M2, P(wt|wt-1)
Size of CPTs: |V| + |V|^2. Size of joint table: |V|^101.
(|V| = 20 for amino acids (proteins).)
Network: w0 → w1 → w2 → … → w100
A simple example
• Trigram language model:
• Pick word w0 from a multinomial M1, P(w0)
• Pick word w1 from a multinomial M2, P(w1|w0)
• For t = 2 to 100:
• Pick word wt from a multinomial M3, P(wt|wt-1,wt-2)
Size of CPTs: |V| + |V|^2 + |V|^3. Size of joint table: |V|^101.
(|V| = 20 for amino acids (proteins).)
Network: w0, w1, w2, …, w99, w100 (each wt has parents wt-1 and wt-2)
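A toy sketch of the bigram version (my code, with a small made-up alphabet standing in for the 20 amino acids): M1 holds |V| numbers and M2 holds |V|^2, yet together they define a distribution over all |V|^101 sequences:

```python
# Bigram model: one multinomial for w0, one per-predecessor multinomial for wt.
import math
import random

V = ["a", "c", "g", "t"]
random.seed(0)

def random_dist(keys):
    w = [random.random() for _ in keys]
    s = sum(w)
    return {k: x / s for k, x in zip(keys, w)}

M1 = random_dist(V)                        # P(w0):        |V| numbers
M2 = {prev: random_dist(V) for prev in V}  # P(wt | wt-1): |V|^2 numbers

def sample(length=10):
    words = [random.choices(V, weights=[M1[v] for v in V])[0]]
    while len(words) < length:
        prev = words[-1]
        words.append(random.choices(V, weights=[M2[prev][v] for v in V])[0])
    return words

def log_prob(words):
    lp = math.log(M1[words[0]])
    for prev, cur in zip(words, words[1:]):
        lp += math.log(M2[prev][cur])
    return lp

seq = sample(10)
print(seq, log_prob(seq))
```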
What Independencies does a Bayes Net Model?
• In order for a Bayesian network to model a probability distribution, the following must be true by definition: Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.
• This implies P(X1=x1 ^ … ^ Xn=xn) = Πi P(Xi=xi | Parents(Xi)).
• But what else does it imply?
What Independencies does a Bayes Net Model?
• Example: Z → Y → X.
Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X | Y, Z) equal to P(X | Y)?
Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, P(Z | Y, X) = P(Z | Y).
What Independencies does a Bayes Net Model?
• Example: X ← Y → Z (Y is a parent of both X and Z).
• Let I<X, Y, Z> represent X and Z being conditionally independent given Y.
• I<X, Y, Z>? Yes, just as in the previous example: all of X’s parents are given, and Z is not a descendant of X.
What Independencies does a Bayes Net Model?
• I<X, {U}, Z>? No.
• I<X, {U,V}, Z>? Yes.
• Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?
(Here Z reaches X along two paths, one through U and one through V.)
Things get a little more confusing
• Consider X → Y ← Z.
• X has no parents, so we know all its parents’ values trivially.
• Z is not a descendant of X.
• So I<X, {}, Z>, even though there’s an undirected path from X to Z through an unknown variable Y.
• What if we do know the value of Y, though? Or one of its descendants?
The “Burglar Alarm” example
Network: Burglar → Alarm ← Earthquake; Alarm → Phone Call.
• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• Earth arguably doesn’t care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!
Things get a lot more confusing
• But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake “explains away” the hypothetical burglar.
• But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
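The explaining-away effect can be checked numerically. The CPT values below are classic textbook-style numbers, not numbers from this lecture, so treat them as hypothetical:

```python
# Explaining away in Burglar -> Alarm <- Earthquake, Alarm -> Call.
from itertools import product

p_b = 0.01                      # P(Burglar)        -- hypothetical
p_e = 0.02                      # P(Earthquake)     -- hypothetical
p_a = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm | Burglar, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
p_c = {True: 0.90, False: 0.05}                     # P(Call | Alarm)

joint = {}
for b, e, a, c in product([True, False], repeat=4):
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pc = p_c[a] if c else 1 - p_c[a]
    joint[(b, e, a, c)] = (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e) * pa * pc

def prob(pred):
    return sum(p for row, p in joint.items() if pred(*row))

# Learning about the earthquake lowers the probability of a burglar.
p_b_call    = prob(lambda b, e, a, c: b and c) / prob(lambda b, e, a, c: c)
p_b_call_eq = prob(lambda b, e, a, c: b and c and e) / prob(lambda b, e, a, c: c and e)
print("P(Burglar | Call)            =", round(p_b_call, 3))
print("P(Burglar | Call, Earthquake) =", round(p_b_call_eq, 3))  # smaller
```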
d-separation to the rescue
• Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
• Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ...
• i.e., X and Z can be dependent only if there exists an unblocked path.
A path is “blocked” when...
• There exists a variable Y on the path such that
• it is in the evidence set E
• the arcs putting Y in the path are “tail-to-tail”
(unknown “common causes” of X and Z impose dependency)
• Or, there exists a variable Y on the path such that
• it is in the evidence set E
• the arcs putting Y in the path are “tail-to-head”
(unknown “causal chains” connecting X and Z impose dependency)
• Or, ...
A path is “blocked” when… (the funky case)
• … Or, there exists a variable Y on the path such that
• it is NOT in the evidence set E
• neither are any of its descendants
• the arcs putting Y on the path are “head-to-head”
(Known “common symptoms” of X and Z impose dependencies… X may “explain away” Z.)
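These three rules can be turned into a small checker. The sketch below (my code) simply enumerates undirected paths and applies the rules, so it is exponential in the worst case – not the linear-time algorithm mentioned on the next slide – but it reproduces the burglar-alarm behavior:

```python
# d-separation by path enumeration. The DAG is a child -> set-of-parents map.
def descendants(dag, node):
    out, frontier = set(), {node}
    while frontier:
        nxt = {c for c, ps in dag.items() if ps & frontier} - out
        out |= nxt
        frontier = nxt
    return out

def path_blocked(dag, path, evidence):
    """path is a list of nodes joined by edges in either direction."""
    for i in range(1, len(path) - 1):
        y, prev, nxt = path[i], path[i - 1], path[i + 1]
        head_to_head = prev in dag.get(y, set()) and nxt in dag.get(y, set())
        if head_to_head:
            if y not in evidence and not (descendants(dag, y) & evidence):
                return True        # funky case: collider with no evidence at or below it
        elif y in evidence:
            return True            # chain or common cause, observed
    return False

def d_separated(dag, x, z, evidence):
    neighbors = {n: set() for n in dag}
    for c, ps in dag.items():
        for p in ps:
            neighbors.setdefault(p, set()).add(c)
            neighbors.setdefault(c, set()).add(p)
    def paths(cur, target, seen):
        if cur == target:
            yield list(seen)
            return
        for n in neighbors.get(cur, set()) - set(seen):
            yield from paths(n, target, seen + [n])
    return all(path_blocked(dag, p, evidence) for p in paths(x, z, [x]))

# Burglar and Earthquake are d-separated a priori, but observing the
# phone call (a descendant of the collider Alarm) unblocks the path.
dag = {"Burglar": set(), "Earthquake": set(),
       "Alarm": {"Burglar", "Earthquake"}, "Call": {"Alarm"}}
print(d_separated(dag, "Burglar", "Earthquake", set()))       # True
print(d_separated(dag, "Burglar", "Earthquake", {"Call"}))    # False
```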
d-separation to the rescue, cont’d • Theorem [Verma & Pearl, 1998]: • If a set of evidence variables E d-separates X and Z in a Bayesian network’s graph, then I<X, E, Z>. • d-separation can be computed in linear time using a depth-first-search-like algorithm. • Great! We now have a fast algorithm for automatically inferring whether learning the value of one variable might give us any additional hints about some other variable, given what we already know. • “Might”: Variables may actually be independent when they’re not d-separated, depending on the actual probabilities involved