Bayesian Networks Lecture 8. Edited by Dan Geiger from Nir Friedman’s slides.
Conditional Independence • Two variables X and Y are conditionally independent given Z if P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z • That is, learning the value of Y does not change the prediction of X once we know the value of Z • Notation: Ind(X; Y | Z)
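This definition can be checked mechanically from a joint probability table. Below is a minimal Python sketch, not from the lecture; the three-variable joint and all its numbers are invented so that Ind(X; Y | Z) holds by construction.

```python
import numpy as np

# Toy joint P(X, Y, Z) built so that Ind(X; Y | Z) holds by construction:
# P(x, y, z) = P(z) P(x|z) P(y|z). All numbers are invented.
pz = np.array([0.3, 0.7])                      # P(Z)
px_z = np.array([[0.9, 0.1], [0.2, 0.8]])      # P(X|Z), rows indexed by z
py_z = np.array([[0.6, 0.4], [0.5, 0.5]])      # P(Y|Z), rows indexed by z
joint = np.einsum('z,zx,zy->xyz', pz, px_z, py_z)

def cond_independent(joint_xyz, tol=1e-12):
    """Check P(x|y,z) == P(x|z) for all x, y, z, i.e. Ind(X; Y | Z)."""
    p_yz = joint_xyz.sum(axis=0)               # P(y, z)
    p_z = joint_xyz.sum(axis=(0, 1))           # P(z)
    p_xz = joint_xyz.sum(axis=1)               # P(x, z)
    p_x_given_yz = joint_xyz / p_yz            # broadcasts over the x axis
    p_x_given_z = p_xz / p_z
    return np.allclose(p_x_given_yz, p_x_given_z[:, None, :], atol=tol)

print(cond_independent(joint))                 # True
```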
Bayesian Network • Bayesian network = Directed Acyclic Graph (DAG), annotated with conditional probability distributions. [Figure: DAG over the nodes V, S, T, L, B, A, X, D with local distributions p(v), p(s), p(t|v), p(l|s), p(b|s), p(a|t,l), p(x|a), p(d|a,b)]
Bayesian Network (cont.) • Each Directed Acyclic Graph defines a factorization of the form P(x1,…,xn) = ∏i P(xi | Pai), one term per node given its parents. For this DAG: P(v,s,t,l,b,a,x,d) = p(v) p(s) p(t|v) p(l|s) p(b|s) p(a|t,l) p(x|a) p(d|a,b)
The “Visit-to-Asia” Example [Figure: DAG with nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Abnormality in Chest, Bronchitis, X-Ray, Dyspnea]
Local distributions • Each node carries a Conditional Probability Table (CPT) given its parents, e.g. p(A|T,L) for Abnormality in Chest (Yes/No) given Tuberculosis (Yes/No) and Lung Cancer (Yes/No): p(A=y|L=n, T=n) = 0.02; p(A=y|L=n, T=y) = 0.60; p(A=y|L=y, T=n) = 0.99; p(A=y|L=y, T=y) = 0.99
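As an illustration of how local distributions combine under the factorization, here is a hedged Python sketch: the p(A|T,L) entries are the CPT above, while p(T) and p(L) are invented placeholders (in the full network T and L have parents V and S; this sketch evaluates the three-node fragment on its own).

```python
# CPT from the slide, stored as a dictionary keyed by (L, T); the other
# numbers are hypothetical placeholders treating T and L as roots.
p_t = {'y': 0.05, 'n': 0.95}                     # hypothetical p(T)
p_l = {'y': 0.10, 'n': 0.90}                     # hypothetical p(L)
p_a_yes = {('n', 'n'): 0.02, ('n', 'y'): 0.60,   # p(A=y | L, T) from the slide
           ('y', 'n'): 0.99, ('y', 'y'): 0.99}

def p_a(a, l, t):
    """p(A=a | L=l, T=t); the A=n row is the complement of the A=y row."""
    return p_a_yes[(l, t)] if a == 'y' else 1.0 - p_a_yes[(l, t)]

def joint(t, l, a):
    """Joint of the T, L, A fragment under its factorization p(t) p(l) p(a|t,l)."""
    return p_t[t] * p_l[l] * p_a(a, l, t)

print(joint('n', 'n', 'y'))   # 0.95 * 0.90 * 0.02 = 0.0171
```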
Probability Distributions as Tables • P(d|a,b) is a three-dimensional table of the form T_{D,A,B}. One entry in this table is T_{d,a,b} and there are |D|·|A|·|B| entries. • The product P(x,a)P(d|a,b) can be thought of as a regular product, but it also denotes a “table product”: the table product T_{X,A} · T_{D,A,B} is a new table T_{X,D,A,B} with |X|·|D|·|A|·|B| entries, defined by the regular products T_{x,d,a,b} = T_{x,a} · T_{d,a,b}. • The “table sum” ∑x T_{X,A} is a new table T_A with |A| entries, defined by the regular sums T_a = ∑x T_{x,a}.
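Both operations map directly onto numpy broadcasting and axis sums; a short sketch with hypothetical domain sizes and random entries:

```python
import numpy as np

# T_{X,A} and T_{D,A,B} with hypothetical sizes |X|=2, |A|=3, |D|=4, |B|=5
# (random entries; only the shapes matter here).
rng = np.random.default_rng(0)
T_xa = rng.random((2, 3))       # axes (x, a)
T_dab = rng.random((4, 3, 5))   # axes (d, a, b)

# Table product: T_{x,d,a,b} = T_{x,a} * T_{d,a,b}, via broadcasting.
T_xdab = T_xa[:, None, :, None] * T_dab[None, :, :, :]
assert T_xdab.shape == (2, 4, 3, 5)          # |X|*|D|*|A|*|B| entries

# Equivalently in one einsum call:
assert np.allclose(T_xdab, np.einsum('xa,dab->xdab', T_xa, T_dab))

# Table sum: T_a = sum_x T_{x,a}, a sum along the x axis.
T_a = T_xa.sum(axis=0)
assert T_a.shape == (3,)                     # |A| entries
```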
Independence in Bayesian networks • Each variable is independent of its earlier-ordered non-parents given its parents: Ind(Xi ; {X1,…,Xi-1} \ Pai | Pai)   (*) • This set of independence assertions is denoted Basis(G).
Example I [Figure: chain X1 → X2 → X3 → X4] • Ind(X3; X1 | X2) • Ind(X4; {X1, X2} | X3)
Example II • In the order V,S,T,L,B,A,X,D, we have: • Ind(S; V) • Ind(T; S | V) • Ind(L; {T,V} | S) • … • Ind(X; {V,S,T,L,B,D} | A) • Does Ind({X,D}; V | A) also hold? To answer this question one needs to analyze the types of paths that connect {X,D} and V.
Example III: Genetic Linkage Analysis [Figure: pedigree network over four loci, with Locus 2 the disease locus; maternal/paternal allele variables L (e.g. Li1m, Li1f), selector variables S, marker variables X, and phenotype variables Y] This model depicts the qualitative relations between the variables. We will now specify the joint distribution over these variables.
d-separation • Many other independence assertions are entailed by (*). Understanding which independence assertions are entailed is important for using Bayesian networks efficiently. • In the tutorial we will define a procedure d-sepG(X; Y | Z) such that: • Soundness: d-sepG(X; Y | Z) = yes implies Ind(X; Y | Z) follows from Basis(G). • Completeness: d-sepG(X; Y | Z) = no implies Ind(X; Y | Z) does not follow from Basis(G).
Queries • There are many types of queries. Most queries involve evidence. • An evidence e is an assignment of values to a set E of variables in the domain. • Examples: P(Dyspnea = Yes | Visit_to_Asia = Yes, Smoking = Yes); P(Smoking = Yes | Dyspnea = Yes)
Queries: A posteriori belief • The conditional probability of a variable given the evidence: P(x | e) = P(x, e) / P(e). This is the a posteriori belief in x, given evidence e. • Often we compute the term P(x, e), from which we can recover the a posteriori belief by normalizing: P(x | e) = P(x, e) / ∑x' P(x', e). • Examples given in previous slide.
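The normalization step is one line of code; a tiny sketch with made-up values of P(x, e):

```python
# Suppose inference returned P(x, e) for every value x of the query
# variable X (made-up numbers for illustration).
p_x_and_e = {'y': 0.03, 'n': 0.09}

p_e = sum(p_x_and_e.values())                          # P(e) = sum_x P(x, e)
posterior = {x: v / p_e for x, v in p_x_and_e.items()}
print(posterior)                                       # {'y': 0.25, 'n': 0.75}
```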
A posteriori belief • This query is useful in many cases: • Prediction: what is the probability of an outcome given the starting condition? The target is a descendant of the evidence (e.g., does a visit to Asia lead to Tuberculosis?). • Diagnosis: what is the probability of a disease/fault given symptoms? The target is an ancestor of the evidence (e.g., do the X-ray results indicate a higher probability of Tuberculosis?).
Example: Predictive + Diagnostic • P(T = Yes | Visit_to_Asia = Yes, Dyspnea = Yes) • Probabilistic inference can combine evidence from all parts of the network, diagnostic and predictive, regardless of the directions of edges in the model.
Queries: A posteriori joint • In this query, we are interested in the conditional probability of several variables, given the evidence: P(X, Y, … | e). • Note that the size of the answer to this query is exponential in the number of variables in the joint. • For example, the quantity we computed a while ago from the Markov chain model for PAM matrices was the joint probability table (no evidence): P(x1,…,xn) = P(x1) ∏i P(xi | xi-1). [Figure: chain X1 → X2 → … → Xn-1 → Xn]
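For concreteness, this sketch builds the full joint table of a short Markov chain with hypothetical transition numbers; the table already has 2^n entries for n binary variables, which is the exponential blow-up noted above:

```python
import numpy as np

# Joint table of a short Markov chain X1 -> X2 -> X3 over binary states:
# P(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2). Numbers are hypothetical.
p1 = np.array([0.5, 0.5])                    # p(x1)
trans = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(x_{i+1}|x_i), rows = x_i
joint = np.einsum('a,ab,bc->abc', p1, trans, trans)
print(joint.shape, joint.sum())              # (2, 2, 2) 1.0
# For n variables the table has 2**n entries: exponential in n.
```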
Queries: MAP • Find the maximum a posteriori assignment for some variables of interest (say H1,…,Hl). • That is, h1,…,hl maximize the conditional probability P(h1,…,hl | e). • Equivalent to maximizing the joint P(h1,…,hl, e). • Application in communication: the message sent is (s1,…,sm) but we receive (r1,…,rm); compute the most likely message sent. [Figure: chain of sent symbols S1 → S2 → … → Sm, each Si with a received symbol Ri]
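The equivalence holds because P(h1,…,hl | e) = P(h1,…,hl, e) / P(e), and P(e) does not depend on h1,…,hl. A brute-force sketch of exactly this idea (exponential, for intuition only); `joint` is a hypothetical function returning the joint probability of a complete assignment:

```python
from itertools import product

def map_assignment(joint, hidden_domains, evidence):
    """Brute-force MAP: maximize P(h1,...,hl, e) over all assignments of
    the hidden variables. `joint` is assumed to take a dict mapping every
    variable to a value and return the joint probability."""
    names = list(hidden_domains)
    best, best_score = None, -1.0
    for values in product(*(hidden_domains[n] for n in names)):
        assignment = dict(zip(names, values))
        score = joint({**assignment, **evidence})   # proportional to P(h | e)
        if score > best_score:
            best, best_score = assignment, score
    return best
```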
Queries: MAP (cont.) We can use MAP for: • Explanation: what is the most likely joint event, given the evidence (e.g., a set of likely diseases given the symptoms)? • What is the most likely scenario, given the evidence (e.g., a series of likely malfunctions that trigger a fault)? [Figure: diagnosis network with fault nodes D1–D4 (bad alternator, bad magneto, bad battery, dead battery, not charging) and symptom nodes S1–S4]
Computing A posteriori Belief in Bayesian Networks • Input: a Bayesian network, a set of nodes E with evidence E = e, an ordering x1,…,xm of all variables not in E. • Output: P(x1, e) for every value x1 of X1 {from which P(x1|e) is available}. • The query: Set the evidence in all local probability tables that are defined over some variables from E. Iteratively: move all irrelevant terms outside of the innermost sum; perform the innermost sum, getting a new term; insert the new term into the product.
Belief Update I • Suppose we get evidence V = v0, S = s0, D = d0. We wish to compute P(l, v0, s0, d0) for every value l of L. [Figure: DAG over V, S, T, L, G, A, X, D]
P(l, v0, s0, d0) = ∑t,g,x,a P(v0, s0, l, t, g, a, x, d0)
= p(v0) p(s0) p(l|s0) ∑t p(t|v0) ∑g p(g|s0) ∑a p(a|t,l) p(d0|a,g) ∑x p(x|a)
= p(v0) p(s0) p(l|s0) ∑t p(t|v0) ∑g p(g|s0) ∑a p(a|t,l) p(d0|a,g) b_x(a)
= p(v0) p(s0) p(l|s0) ∑t p(t|v0) ∑g p(g|s0) b_a(t,l,g)
= p(v0) p(s0) p(l|s0) ∑t p(t|v0) b_g(t,l)
= p(v0) p(s0) p(l|s0) b_t(l)
To obtain the posterior belief in L given the evidence we normalize the result to 1.
Belief Update II • Suppose we get evidence D = d0. We wish to compute P(l, d0) for every value l of L. [Figure: DAG over T, L, A, X, D]
Good summation order (variable A is summed last): P(l, d0) = ∑a,t,x P(a, t, x, l, d0) = p(l|s0) ∑a p(a) p(d0|a) ∑t p(t|a) ∑x p(x|a)
Bad summation order (variable A is summed first): P(l, d0) = ∑a,t,x P(a, t, x, l, d0) = p(l|s0) ∑x ∑t ∑a p(a) p(d0|a) p(t|a) p(x|a). Yields a three dimensional temporary table.
Next class we will see how to choose a reasonable order.
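The cost gap between the two orders can be demonstrated mechanically; a simplified numpy sketch with a hypothetical domain size (the L factor is dropped, so the bad order's temporary table here is two-dimensional rather than three):

```python
import numpy as np

n = 50                                   # hypothetical domain size
rng = np.random.default_rng(1)
p_a = rng.random(n)                      # p(a)
p_d0_a = rng.random(n)                   # p(d0|a), d0 fixed by the evidence
p_t_a = rng.random((n, n))               # p(t|a), axes (t, a)
p_x_a = rng.random((n, n))               # p(x|a), axes (x, a)

# Good order (A summed last): every intermediate is a vector over a.
good = np.sum(p_a * p_d0_a * p_t_a.sum(axis=0) * p_x_a.sum(axis=0))

# Bad order (A summed first): the intermediate table ranges over (t, x),
# n*n entries instead of n.
tmp_tx = np.einsum('a,a,ta,xa->tx', p_a, p_d0_a, p_t_a, p_x_a)
bad = tmp_tx.sum()

assert np.isclose(good, bad)             # same value, very different cost
```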
The algorithm to compute P(x1, e)
Initialization • Set the evidence in all (local probability) tables that are defined over some variables from E. Set an order to all variables not in E. • Partition all tables into buckets such that bucket_i contains all tables whose highest indexed variable is Xi.
For p = m downto 2 do: Suppose λ1,…,λj are the tables in bucket_p being processed and suppose S1,…,Sj are the respective sets of variables in these tables. • U_p ← the union of S1,…,Sj with Xp excluded • max ← the largest indexed variable in U_p • For every assignment U_p = u compute: λ_p(u) = ∑x_p ∏i=1..j λi(u^Si) {Def: u^Si is the value of the assignment (u, x_p) projected on Si.} • Add λ_p(u) into bucket_max
Return the vector P(x1, e) = ∏i λi(x1), the product of the tables that end up in bucket_1, one entry per value x1 of X1.
The MAP algorithm
Initialization • Set the evidence in all (local probability) tables that are defined over some variables from E. Set an order to all variables not in E. Partition all tables into buckets such that bucket_i contains all tables whose highest indexed variable is Xi.
For p = m downto 1 do: Suppose λ1,…,λj are the tables in bucket_p being processed and suppose S1,…,Sj are the respective sets of variables in these tables. • U_p ← the union of S1,…,Sj with Xp excluded • max ← the largest indexed variable in U_p • For every assignment U_p = u compute: λ_p(u) = max over x_p of ∏i=1..j λi(u^Si), recording the maximizing value x_p^opt(u) • Add λ_p(u) into bucket_max
Recover the MAP values of x1,…,xm recursively, using the recorded functions x_p^opt(u) in the order p = 1,…,m.
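Below is a compact Python sketch of the elimination loop shared by the last two slides. It is an illustrative reimplementation, not the exact pseudocode above: the bucket bookkeeping is simplified to "pick up every table mentioning the variable being eliminated", mode='sum' corresponds to the a posteriori algorithm and mode='max' to the MAP algorithm (the recovery of the maximizing values is omitted), and variable names must be single letters so they can double as einsum indices.

```python
import numpy as np

def eliminate(factors, order, mode='sum'):
    """factors: list of (vars, table) pairs, where vars is a tuple of
    single-letter variable names and table is a numpy array with one
    axis per variable. Eliminates the variables in `order` one at a
    time: multiply every table mentioning the variable (its bucket),
    then sum it out (posterior) or max it out (MAP)."""
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if not bucket:
            continue
        all_vars = sorted({v for vs, _ in bucket for v in vs})
        # Table product of the whole bucket, done in one einsum call.
        spec = ','.join(''.join(vs) for vs, _ in bucket) + '->' + ''.join(all_vars)
        prod = np.einsum(spec, *[t for _, t in bucket])
        axis = all_vars.index(var)
        new = prod.sum(axis=axis) if mode == 'sum' else prod.max(axis=axis)
        factors.append((tuple(v for v in all_vars if v != var), new))
    return factors    # tables over the variables that were not eliminated

# Tiny usage: P(l) in a two-node chain S -> L with hypothetical CPTs.
p_s = np.array([0.3, 0.7])                   # p(s)
p_l_s = np.array([[0.9, 0.1], [0.4, 0.6]])   # p(l|s), axes (s, l)
print(eliminate([(('s',), p_s), (('s', 'l'), p_l_s)], order='s'))
```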
d-separation Below are some slides from the tutorial.
Paths • Intuition: dependency must “flow” along paths in the graph • A path is a sequence of neighboring variables. Examples: • X – A – D – B • A – L – S – B [Figure: the Visit-to-Asia DAG over V, S, T, L, B, A, X, D]
Path Blockage • Every path is classified given the evidence: • active: creates a dependency between the end nodes • blocked: does not create a dependency between the end nodes • Evidence means the assignment of a value to a subset of nodes.
Path Blockage Three cases: • Common cause [Figure: S with children L and B, i.e. L ← S → B; the path is blocked when S is in the evidence, active otherwise]
Path Blockage (cont.) Three cases: • Common cause • Intermediate cause [Figure: chain S → L → A; the path is blocked when the intermediate node L is in the evidence, active otherwise]
Path Blockage (cont.) Three cases: • Common cause • Intermediate cause • Common effect [Figure: T → A ← L, with descendant X of A; the path is blocked unless A or one of its descendants is in the evidence]
Definition of Path Blockage • Definition: A path is active, given evidence Z, if • whenever we have the common-effect configuration T → A ← L on the path, either A or one of its descendants is in Z, and • no other node on the path is in Z. • Definition: A path is blocked, given evidence Z, if it is not active. • Definition: X is d-separated from Y, given Z, if all paths between a node in X and a node in Y are blocked, given Z.
Example [Figure: the Visit-to-Asia DAG over V, S, T, L, B, A, X, D] • d-sepG(T, S) = yes: every path between T and S passes through an unobserved common-effect node (A on T – A – L – S, D on T – A – D – B – S), so all paths are blocked. • d-sepG(T, S | D) = no: observing D, a descendant of A, activates the common-effect configuration T → A ← L and opens the path T – A – L – S. • d-sepG(T, S | {D, L, B}) = yes: the activated paths are blocked again at the observed nodes L and B.
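d-sepG can also be computed without enumerating paths, via the equivalent ancestral moral graph criterion: X and Y are d-separated by Z iff Z separates X from Y in the moralized graph of the ancestral set of X ∪ Y ∪ Z. A Python sketch of that criterion, checked against the three queries above:

```python
from itertools import combinations

# The Visit-to-Asia DAG, stored as child -> list of parents.
parents = {'V': [], 'S': [], 'T': ['V'], 'L': ['S'], 'B': ['S'],
           'A': ['T', 'L'], 'X': ['A'], 'D': ['A', 'B']}

def ancestors(nodes):
    """The ancestral closure of a set of nodes (including the nodes)."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(xs, ys, zs):
    """d-sep(X; Y | Z) via the ancestral moral graph criterion."""
    keep = ancestors(set(xs) | set(ys) | set(zs))
    # Moralize the ancestral subgraph: child-parent edges, plus edges
    # "marrying" every pair of parents of a common child; drop directions.
    adj = {n: set() for n in keep}
    for child in keep:
        for p in parents[child]:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(parents[child], 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete Z, then test whether X can still reach Y.
    targets, seen = set(ys), set()
    stack = [x for x in xs if x not in set(zs)]
    while stack:
        n = stack.pop()
        if n in targets:
            return False                    # connected => not d-separated
        if n not in seen:
            seen.add(n)
            stack.extend(m for m in adj[n] if m not in set(zs))
    return True

print(d_separated({'T'}, {'S'}, set()))              # True  (yes)
print(d_separated({'T'}, {'S'}, {'D'}))              # False (no)
print(d_separated({'T'}, {'S'}, {'D', 'L', 'B'}))    # True  (yes)
```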
d-Separation Main Theorem (informally) • Soundness: Any independence reported by d-separation follows from Basis(G) (and thus is satisfied by the underlying distribution). • Completeness: Any independence not reported by d-separation does not follow from Basis(G).
Revisiting the First Example • So does Ind({X,D}; V | A) hold? No: given A, the path V → T → A ← L ← S → B → D is active, since the common-effect node A is in the evidence and no other node on the path is. Hence d-sepG({X,D}; V | A) = no.