Knowledge Representation & Reasoning, Lecture #5 UIUC CS 498: Section EA. Professor: Eyal Amir. Fall Semester 2005 (Based on slides by Lise Getoor and Alvaro Cardenas (UMD), in turn based on slides by Nir Friedman (Hebrew U))
So Far and Today • Probabilistic graphical models • Bayes Networks (Directed GMs) • Markov Fields (Undirected GMs) • Treewidth methods: • Variable elimination • Clique tree algorithm • Applications du jour: Sensor Networks
Markov Assumption • We now make this independence assumption more precise for directed acyclic graphs (DAGs) • Each random variable X is independent of its non-descendants, given its parents Pa(X) • Formally, I(X, NonDesc(X) | Pa(X)) (Figure: a node X with its parent, an ancestor, a descendant, and non-descendants Y1, Y2)
Markov Assumption Example (Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) • In this example: • I(E, B) • I(B, {E, R}) • I(R, {A, B, C} | E) • I(A, R | B, E) • I(C, {B, E, R} | A)
I-Maps • A DAG G is an I-Map of a distribution P if all Markov assumptions implied by G are satisfied by P (assuming G and P both use the same set of random variables) • Examples (figures): the one-edge DAG X → Y, and the edgeless DAG over X and Y
Factorization Theorem • Thm: if G is an I-Map of P, then P(X1,…,Xp) = ∏i P(Xi | Pa(Xi)) • Proof: wlog. X1,…,Xp is an ordering consistent with G • By the chain rule, P(X1,…,Xp) = ∏i P(Xi | X1,…,Xi-1) • Since the ordering is consistent with G, {X1,…,Xi-1} ⊆ NonDesc(Xi) ∪ Pa(Xi) • Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi)) • We conclude P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
Factorization Example (Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) • By the chain rule alone: P(C,A,R,E,B) = P(B) P(E|B) P(R|E,B) P(A|R,B,E) P(C|A,R,B,E) • versus, using the Markov assumptions: P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
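To make the savings concrete, here is a minimal Python sketch (not from the slides; all numeric probabilities are invented for illustration) storing the burglary network as five local CPTs and computing the joint through the compact factorization:

```python
from itertools import product

# Local CPTs for P(B), P(E), P(R|E), P(A|B,E), P(C|A); 10 free parameters
# instead of the 2^5 - 1 = 31 of an explicit joint table.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_R = {e: {True: p, False: 1 - p}                 # P(R|E)
       for e, p in [(True, 0.9), (False, 0.05)]}
P_A = {(b, e): {True: p, False: 1 - p}            # P(A|B,E)
       for (b, e), p in [((True, True), 0.98), ((True, False), 0.94),
                         ((False, True), 0.3), ((False, False), 0.001)]}
P_C = {a: {True: p, False: 1 - p}                 # P(C|A)
       for a, p in [(True, 0.7), (False, 0.01)]}

def joint(b, e, r, a, c):
    """P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)."""
    return P_B[b] * P_E[e] * P_R[e][r] * P_A[(b, e)][a] * P_C[a][c]

# Sanity check: the factorized joint sums to 1 over all 32 assignments.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1) < 1e-9
```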
Consequences • We can write P in terms of “local” conditional probabilities • If G is sparse, that is, |Pa(Xi)| < k, each conditional probability can be specified compactly • e.g. for binary variables, these require O(2^k) params • ⇒ the representation of P is compact: linear in the number of variables
Summary • We defined the following concepts • The Markov independencies of a DAG G: I(Xi, NonDesc(Xi) | Pai) • G is an I-Map of a distribution P if P satisfies the Markov independencies implied by G • We proved the factorization theorem: if G is an I-Map of P, then P(X1,…,Xp) = ∏i P(Xi | Pa(Xi))
Conditional Independencies • Let Markov(G) be the set of Markov independencies implied by G • The factorization theorem shows: if G is an I-Map of P, then P factorizes as ∏i P(Xi | Pa(Xi)) • We can also show the converse: • Thm: if P(X1,…,Xp) = ∏i P(Xi | Pa(Xi)), then G is an I-Map of P
Proof (Outline) • Shown by example (Figure: a small DAG over X, Y, and Z)
Implied Independencies • Does a graph G imply additional independencies as a consequence of Markov(G)? • We can define a logic of independence statements • Some axioms: • Symmetry: I(X; Y | Z) ⇒ I(Y; X | Z) • Decomposition: I(X; Y1, Y2 | Z) ⇒ I(X; Y1 | Z)
d-separation • A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets of variables X, Y, and Z, returns either yes or no • Goal: d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)
Paths (Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) • Intuition: dependency must “flow” along paths in the graph • A path is a sequence of neighboring variables • Examples: R ← E → A ← B, and C ← A ← E → R
Paths • We want to know when a path is • active -- creates dependency between end nodes • blocked -- cannot create dependency between end nodes • We want to classify situations in which paths are active
Path Blockage • Three cases: • 1. Common cause (R ← E → A): blocked when the middle node E is in the evidence; active when it is not
Path Blockage • Three cases: • 1. Common cause • 2. Intermediate cause (E → A → C): blocked when the middle node A is in the evidence; active when it is not
Path Blockage • Three cases: • 1. Common cause • 2. Intermediate cause • 3. Common effect (E → A ← B, with descendant C): blocked when neither A nor any of its descendants is in the evidence; active when A or one of its descendants is
Path Blockage -- General Case • A path is active, given evidence Z, if • whenever we have a configuration A → B ← C on the path, B or one of its descendants is in Z, and • no other node on the path is in Z • A path is blocked, given evidence Z, if it is not active
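The three cases fit in a few lines of code. The following sketch (my own illustration, not from the slides; `descendants` is an assumed helper computed from the graph) decides whether a single path segment a - m - b is active given evidence Z:

```python
def segment_active(a, m, b, parents, descendants, Z):
    """True iff the path segment a - m - b is active given evidence set Z.
    `parents` maps node -> set of parents; `descendants(m)` returns the
    set of descendants of m (assumed helper)."""
    if a in parents[m] and b in parents[m]:
        # Common effect (v-structure a -> m <- b): active iff m or one
        # of its descendants is observed.
        return m in Z or bool(descendants(m) & Z)
    # Common cause (a <- m -> b) or intermediate cause (a -> m -> b or
    # its reverse): active iff the middle node itself is unobserved.
    return m not in Z

# Usage on the v-structure E -> A <- B with descendant C (from the slides):
parents = {'E': set(), 'B': set(), 'A': {'E', 'B'}, 'C': {'A'}}
desc = lambda m: {'C'} if m == 'A' else set()      # hypothetical helper
print(segment_active('E', 'A', 'B', parents, desc, set()))   # False: blocked
print(segment_active('E', 'A', 'B', parents, desc, {'C'}))   # True: active
```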
Example (Figure: B → A ← E, E → R, A → C) • d-sep(R, B)?
Example • d-sep(R, B) = yes • d-sep(R, B | A)?
Example • d-sep(R, B) = yes • d-sep(R, B | A) = no • d-sep(R, B | E, A)? (yes: conditioning on E blocks the common-cause segment R ← E → A)
d-Separation • X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z • Checking d-separation can be done efficiently (linear time in the number of edges) • Bottom-up phase: mark all nodes whose descendants are in Z • X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check whether they are blocked
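A sketch of the full linear-time check described above, in the standard two-phase reachability formulation (bottom-up ancestor marking, then a BFS over (node, direction) states); the function and variable names are my own:

```python
from collections import deque

def d_separated(x, y, Z, parents, children):
    """d-sep(x; y | Z, G): True iff every path from x to y is blocked by Z.
    `parents`/`children` map each node to a set of nodes."""
    Z = set(Z)
    # Phase 1 (bottom-up): mark Z and all of its ancestors.
    ancestors_of_Z, stack = set(), list(Z)
    while stack:
        n = stack.pop()
        if n not in ancestors_of_Z:
            ancestors_of_Z.add(n)
            stack.extend(parents[n])
    # Phase 2: BFS along active edges. 'up' = arrived from a child,
    # 'down' = arrived from a parent.
    visited, queue = set(), deque([(x, 'up')])
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y and node not in Z:
            return False                      # found an active path to y
        if direction == 'up' and node not in Z:
            for p in parents[node]:
                queue.append((p, 'up'))
            for c in children[node]:
                queue.append((c, 'down'))
        elif direction == 'down':
            if node not in Z:                 # chain / common cause stays active
                for c in children[node]:
                    queue.append((c, 'down'))
            if node in ancestors_of_Z:        # v-structure activated by Z
                for p in parents[node]:
                    queue.append((p, 'up'))
    return True

# The slides' example: B -> A <- E, E -> R, A -> C.
parents = {'B': set(), 'E': set(), 'R': {'E'}, 'A': {'B', 'E'}, 'C': {'A'}}
children = {'B': {'A'}, 'E': {'A', 'R'}, 'R': set(), 'A': {'C'}, 'C': set()}
assert d_separated('R', 'B', set(), parents, children)        # yes
assert not d_separated('R', 'B', {'A'}, parents, children)    # no
assert d_separated('R', 'B', {'E', 'A'}, parents, children)   # yes
```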
Soundness • Thm: if • G is an I-Map of P • d-sep(X; Y | Z, G) = yes • then • P satisfies I(X; Y | Z) • Informally: any independence reported by d-separation is satisfied by the underlying distribution
Completeness Thm: If d-sep( X; Y | Z, G ) = no • then there is a distribution P such that • G is an I-Map of P • P does not satisfy I( X; Y | Z ) Informally: Any independence not reported by d-separation might be violated by the underlying distribution • We cannot determine this by examining the graph structure alone
Summary: Structure • We explored DAGs as a representation of conditional independencies: • Markov independencies of a DAG • Tight correspondence between Markov(G) and the factorization defined by G • d-separation, a sound & complete procedure for computing the consequences of the independencies • Notion of minimal I-Map • P-Maps • This theory is the basis for defining Bayesian networks
Complexity of variable elimination • Suppose in one elimination step we compute m_X(y1,…,yk) = Σx ∏i=1..m f_i(x, Zi), where each Zi ⊆ {y1,…,yk} • This requires • m · |Val(X)| · ∏i |Val(Yi)| multiplications: for each value of x, y1, …, yk, we do m multiplications • |Val(X)| · ∏i |Val(Yi)| additions: for each value of y1, …, yk, we do |Val(X)| additions • Complexity is exponential in the number of variables in the intermediate factor
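A sketch of one elimination step that makes these counts explicit (illustrative code, not the course's implementation; factors are stored as plain dicts from assignment tuples to values):

```python
from itertools import product

def eliminate(x, factors, domains):
    """One variable-elimination step: multiply all factors that mention x,
    then sum x out. `factors` is a list of (scope_tuple, table_dict) pairs;
    `domains` maps each variable to its list of values."""
    relevant = [(scope, f) for scope, f in factors if x in scope]
    new_scope = tuple(sorted({v for scope, _ in relevant for v in scope} - {x}))
    mults = adds = 0
    result = {}
    for ys in product(*(domains[v] for v in new_scope)):
        assign = dict(zip(new_scope, ys))
        total = 0.0
        for xv in domains[x]:                 # |Val(X)| additions per ys
            assign[x] = xv
            prod = 1.0
            for scope, f in relevant:         # m multiplications per (x, ys)
                prod *= f[tuple(assign[v] for v in scope)]
                mults += 1
            total += prod
            adds += 1
        result[ys] = total
    # Totals: mults = m * |Val(X)| * prod_i |Val(Yi)|,
    #         adds  =     |Val(X)| * prod_i |Val(Yi)|.
    return (new_scope, result), mults, adds
```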
Undirected graph representation • At each stage of the procedure, we have an algebraic term that we need to evaluate • In general this term is of the form Σx1 … Σxk ∏i f_i(Zi), where the Zi are sets of variables • We now draw a graph with an undirected edge X--Y if X, Y are arguments of some factor • that is, if X, Y are in some Zi • Note: this is the Markov network that describes the probability distribution over the variables we have not yet eliminated
Chordal Graphs (Figure: a DAG over S, V, L, T, B, A, X, D and the corresponding undirected chordal graph) • elimination ordering ⇒ undirected chordal graph • Maximal cliques are factors in elimination • Factors in elimination are cliques in the graph • Complexity is exponential in the size of the largest clique in the graph
Induced Width • The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination • This quantity is called the induced width of a graph according to the specified ordering • Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph (this minimum over all orderings is the graph's treewidth)
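Computing the induced width of a given ordering is straightforward: simulate the elimination, adding fill-in edges, and track the largest neighborhood. A small sketch (names and representation are my own):

```python
def induced_width(neighbors, order):
    """Induced width of an undirected graph under elimination order `order`.
    `neighbors` maps node -> set of adjacent nodes (copied, not mutated)."""
    adj = {v: set(ns) for v, ns in neighbors.items()}
    width = 0
    for v in order:
        nbrs = adj[v]
        width = max(width, len(nbrs))   # clique size here is len(nbrs) + 1
        for a in nbrs:                  # fill in: connect remaining neighbors
            adj[a] |= nbrs - {a}
            adj[a].discard(v)
        del adj[v]
    return width

# e.g. a chain A--B--C has induced width 1 under the order A, B, C:
print(induced_width({'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}},
                    ['A', 'B', 'C']))   # -> 1
```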
PolyTrees (Figure: a polytree over A, B, C, D, E, F, G, H) • A polytree is a network where there is at most one path from one variable to another • Thm: inference in a polytree is linear in the representation size of the network • This assumes tabular CPT representation
Today • Probabilistic graphical models • Treewidth methods: • Variable elimination • Clique tree algorithm • Applications du jour: Sensor Networks
Junction Tree • Why junction tree? • More efficient for some tasks than variable elimination • We can avoid cycles if we turn highly-interconnected subsets of the nodes into “supernodes” (clusters) • Objective: compute P(V = v | E = e), where v is a value of a variable V and e is the evidence for a set of variables E
Properties of Junction Tree • An undirected tree • Each node is a cluster (nonempty set) of variables • Running intersection property: given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y • Separator sets (sepsets): the intersection of adjacent clusters (Figure: clusters ABD, ADE, DEF with sepsets AD and DE; e.g. cluster ABD, sepset DE)
Potentials • Potentials: denoted by φ • Marginalization: φX = Σ(S\X) φS, the marginalization of φS into X ⊆ S • Multiplication: φ(X∪Y) = φX · φY, the multiplication of φX and φY
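As a concrete reading of these definitions, here is a sketch of table potentials as Python dicts with the two operations; the representation is my own choice for illustration:

```python
from itertools import product

def marginalize(scope, phi, onto):
    """phi_onto = phi summed over the variables of `scope` not in `onto`.
    `phi` maps assignment tuples (aligned with `scope`) to values."""
    idx = [scope.index(v) for v in onto]
    out = {}
    for assign, val in phi.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return out

def multiply(scope_a, phi_a, scope_b, phi_b, domains):
    """Pointwise product phi_a * phi_b over the union of the two scopes."""
    scope = list(dict.fromkeys(list(scope_a) + list(scope_b)))  # ordered union
    ia = [scope.index(v) for v in scope_a]
    ib = [scope.index(v) for v in scope_b]
    out = {}
    for assign in product(*(domains[v] for v in scope)):
        out[assign] = (phi_a[tuple(assign[i] for i in ia)]
                       * phi_b[tuple(assign[i] for i in ib)])
    return tuple(scope), out

# e.g. a potential over (X, Y) marginalized into X:
phi = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.3}
print(marginalize(('X', 'Y'), phi, ('X',)))   # {(0,): 0.6..., (1,): 0.4}
```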
Properties of Junction Tree • Belief potentials: map each instantiation of clusters or sepsets into a real number • Constraints: • Consistency: for each cluster X and neighboring sepset S, Σ(X\S) φX = φS • The joint distribution: P(U) = ∏i φXi / ∏j φSj (product of cluster potentials divided by product of sepset potentials)
Properties of Junction Tree • If a junction tree satisfies these properties, it follows that: • For each cluster (or sepset) X, φX = P(X) • The probability distribution of any variable V can be computed using any cluster (or sepset) X that contains V: P(V) = Σ(X\{V}) φX
Building Junction Trees • DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree
Constructing the Moral Graph (Figure: a DAG over A, B, C, D, E, F, G, H)
Constructing the Moral Graph • Add undirected edges between all co-parents which are not currently joined -- marrying parents
Constructing the Moral Graph • Add undirected edges between all co-parents which are not currently joined -- marrying parents • Drop the directions of the arcs
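Both moralization steps fit in one short function. A sketch (my own code, assuming the DAG is given as a `parents` dict):

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a DAG: marry all co-parents, then drop directions.
    `parents` maps each node to its set of parents; returns an undirected
    adjacency dict."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                               # drop directions
            adj[p].add(child); adj[child].add(p)
        for a, b in combinations(sorted(ps), 2):   # marry co-parents
            adj[a].add(b); adj[b].add(a)
    return adj

# Hypothetical fragment: C and G are co-parents of E, so the edge C-G is added.
print(moralize({'E': {'C', 'G'}, 'C': {'A'}, 'G': set(), 'A': set()})['C'])
# -> {'A', 'E', 'G'}
```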
Triangulating • An undirected graph is triangulated iff every cycle of length > 3 contains an edge that connects two nonadjacent nodes of the cycle
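One standard way to triangulate, matching the variable-elimination view above, is elimination with fill-in edges: process nodes in some order and connect each node's remaining neighbors before removing it. A sketch under that assumption (the same fill-in loop as the induced-width sketch, here returning the added edges):

```python
def triangulate(neighbors, order):
    """Triangulate by elimination: for each node in `order`, connect its
    remaining neighbors (fill-in edges), then remove it. The original
    edges plus the returned fill-ins form a chordal graph."""
    adj = {v: set(ns) for v, ns in neighbors.items()}
    alive = set(adj)
    fill_in = []
    for v in order:
        nbrs = adj[v] & alive
        for a, b in ((a, b) for a in nbrs for b in nbrs if a < b):
            if b not in adj[a]:
                adj[a].add(b); adj[b].add(a)
                fill_in.append((a, b))
        alive.discard(v)
    return fill_in
```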
Identifying Cliques • A clique is a subgraph of an undirected graph that is complete and maximal (Figure: the triangulated graph over A-H with cliques ABD, ADE, ACE, CEG, DEF, EGH)
Junction Tree • A junction tree is a subgraph of the clique graph that • is a tree • contains all the cliques • satisfies the running intersection property (Figure: clique graph over ABD, ADE, ACE, CEG, DEF, EGH; junction tree with sepsets AD, AE, CE, DE, EG)
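Given the cliques, a junction tree satisfying the running intersection property can be built as a maximum-weight spanning tree of the clique graph, with edge weight equal to sepset size. A sketch (my own code; cliques as frozensets):

```python
def junction_tree(cliques):
    """Maximum-weight spanning tree of the clique graph, with edge weight
    = size of the pairwise intersection (the sepset)."""
    pairs = [(a, b) for i, a in enumerate(cliques)
             for b in cliques[i + 1:] if a & b]
    pairs.sort(key=lambda ab: len(ab[0] & ab[1]), reverse=True)
    comp = {c: c for c in cliques}             # tiny union-find
    def find(c):
        while comp[c] is not c:
            c = comp[c]
        return c
    tree = []
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra is not rb:                       # accept edge only if no cycle
            comp[ra] = rb
            tree.append((a, b, a & b))         # (cluster, cluster, sepset)
    return tree

# The cliques from the slides' example:
cliques = [frozenset(c) for c in ('ABD', 'ADE', 'ACE', 'CEG', 'DEF', 'EGH')]
for a, b, s in junction_tree(cliques):
    print(''.join(sorted(a)), '--', ''.join(sorted(b)),
          ' sepset:', ''.join(sorted(s)))
# Sepsets AD, AE, CE, DE, EG, matching the tree on this slide.
```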
Principle of Inference • DAG → Junction Tree → (initialization) → Inconsistent Junction Tree → (propagation) → Consistent Junction Tree → (marginalization) → P(V | e)
Example: Create Join Tree • HMM with 2 time steps: X1 → X2, with observations X1 → Y1 and X2 → Y2 • Junction tree: (X1,Y1) -[X1]- (X1,X2) -[X2]- (X2,Y2)
Example: Initialization • Assign each CPT to one cluster that contains its variables, and set all sepset potentials to 1: • φ(X1,Y1) = P(X1) P(Y1|X1) • φ(X1,X2) = P(X2|X1) • φ(X2,Y2) = P(Y2|X2) • φ(X1) = φ(X2) = 1
Example: Collect Evidence • Choose an arbitrary clique, e.g. (X1,X2), where all potential functions will be collected • Call neighboring cliques recursively for messages: • 1. Call (X1,Y1): • Projection: φ(X1) = ΣY1 φ(X1,Y1) • Absorption: φ(X1,X2) ← φ(X1,X2) · φ(X1)new / φ(X1)old
Example: Collect Evidence (cont.) • 2. Call (X2,Y2): • Projection: φ(X2) = ΣY2 φ(X2,Y2) • Absorption: φ(X1,X2) ← φ(X1,X2) · φ(X2)new / φ(X2)old
Example: Distribute Evidence • Pass messages recursively to neighboring nodes • Pass message from (X1,X2) to (X1,Y1): • Projection: φ(X1) = ΣX2 φ(X1,X2) • Absorption: φ(X1,Y1) ← φ(X1,Y1) · φ(X1)new / φ(X1)old • Similarly pass a message from (X1,X2) to (X2,Y2)
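Putting the whole example together: below is a self-contained sketch of initialization, collect, and distribute on the two-step HMM junction tree. All CPT numbers are invented, and the message-passing helper is my own simplification for clusters of exactly two binary variables:

```python
# CPTs (hypothetical values): P(X1), P(X2|X1), P(Y|X); states are 0/1.
pX1 = {0: 0.6, 1: 0.4}
pX2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}   # key: (x1, x2)
pY  = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}   # key: (x, y)

# Initialization: multiply each CPT into one covering cluster; sepsets = 1.
phi_x1y1 = {(x1, y1): pX1[x1] * pY[(x1, y1)]
            for x1 in (0, 1) for y1 in (0, 1)}
phi_x1x2 = dict(pX2)
phi_x2y2 = {(x2, y2): pY[(x2, y2)] for x2 in (0, 1) for y2 in (0, 1)}
sep_x1 = {0: 1.0, 1: 1.0}
sep_x2 = {0: 1.0, 1: 1.0}

def pass_message(phi_src, axis, sep, phi_dst, dst_axis):
    """Projection then absorption: sep_new = sum over src cluster,
    dst *= sep_new / sep_old (ratio taken before updating the sepset)."""
    new = {v: sum(p for k, p in phi_src.items() if k[axis] == v)
           for v in (0, 1)}
    for k in phi_dst:
        phi_dst[k] *= new[k[dst_axis]] / sep[k[dst_axis]]
    sep.update(new)

# Collect into (X1,X2): messages from (X1,Y1) over X1, from (X2,Y2) over X2.
pass_message(phi_x1y1, 0, sep_x1, phi_x1x2, 0)
pass_message(phi_x2y2, 0, sep_x2, phi_x1x2, 1)
# Distribute back out.
pass_message(phi_x1x2, 0, sep_x1, phi_x1y1, 0)
pass_message(phi_x1x2, 1, sep_x2, phi_x2y2, 0)

# After propagation each cluster holds a true marginal; e.g. marginalize
# (X1,X2) onto X2 to read off P(X2):
print({x2: sum(p for (x1, b), p in phi_x1x2.items() if b == x2)
       for x2 in (0, 1)})   # -> approximately {0: 0.5, 1: 0.5}
```

With no evidence entered, propagation simply turns each initial cluster potential into the corresponding marginal; entering evidence would amount to zeroing out the incompatible entries of one cluster before the collect pass.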