Knowledge Representation & Reasoning, Lecture #5 UIUC CS 498: Section EA. Professor: Eyal Amir. Fall Semester 2005 (Based on slides by Lise Getoor and Alvaro Cardenas (UMD), in turn based on slides by Nir Friedman (Hebrew U))
So Far and Today • Probabilistic graphical models • Bayes Networks (Directed GMs) • Markov Fields (Undirected GMs) • Treewidth methods: • Variable elimination • Clique tree algorithm • Applications du jour: Sensor Networks
Markov Assumption • We now make this independence assumption more precise for directed acyclic graphs (DAGs) • Each random variable X is independent of its non-descendants, given its parents Pa(X) • Formally, I(X, NonDesc(X) | Pa(X)) (Figure: a node X with its parent, an ancestor, a descendant, and non-descendants Y1, Y2)
Markov Assumption Example (Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) • In this example: • I(E, B) • I(B, {E, R}) • I(R, {A, B, C} | E) • I(A, R | B, E) • I(C, {B, E, R} | A)
I-Maps • A DAG G is an I-Map of a distribution P if all Markov assumptions implied by G are satisfied by P (assuming G and P both use the same set of random variables) • Examples (figures): the one-edge DAG X → Y, and the edgeless DAG over X and Y
Factorization Theorem • Thm: if G is an I-Map of P, then P(X1,…,Xp) = ∏i P(Xi | Pa(Xi)) • Proof: wlog. X1,…,Xp is an ordering consistent with G • By the chain rule, P(X1,…,Xp) = ∏i P(Xi | X1,…,Xi-1) • Since the ordering is consistent with G, {X1,…,Xi-1} ⊆ NonDesc(Xi) ∪ Pa(Xi) • Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi)) • We conclude P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
Factorization Example (Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) • By the chain rule alone: P(C,A,R,E,B) = P(B) P(E|B) P(R|E,B) P(A|R,B,E) P(C|A,R,B,E) • versus, using the Markov assumptions: P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
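To make the savings concrete, here is a minimal Python sketch (not from the slides; all numeric probabilities are invented for illustration) storing the burglary network as five local CPTs and computing the joint through the compact factorization:

```python
from itertools import product

# Local CPTs for P(B), P(E), P(R|E), P(A|B,E), P(C|A); 10 free parameters
# instead of the 2^5 - 1 = 31 of an explicit joint table.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_R = {e: {True: p, False: 1 - p}                 # P(R|E)
       for e, p in [(True, 0.9), (False, 0.05)]}
P_A = {(b, e): {True: p, False: 1 - p}            # P(A|B,E)
       for (b, e), p in [((True, True), 0.98), ((True, False), 0.94),
                         ((False, True), 0.3), ((False, False), 0.001)]}
P_C = {a: {True: p, False: 1 - p}                 # P(C|A)
       for a, p in [(True, 0.7), (False, 0.01)]}

def joint(b, e, r, a, c):
    """P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)."""
    return P_B[b] * P_E[e] * P_R[e][r] * P_A[(b, e)][a] * P_C[a][c]

# Sanity check: the factorized joint sums to 1 over all 32 assignments.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1) < 1e-9
```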
Consequences • We can write P in terms of “local” conditional probabilities • If G is sparse, that is, |Pa(Xi)| < k, each conditional probability can be specified compactly • e.g. for binary variables, these require O(2^k) params • ⇒ the representation of P is compact: linear in the number of variables
Summary • We defined the following concepts • The Markov independencies of a DAG G: I(Xi, NonDesc(Xi) | Pai) • G is an I-Map of a distribution P if P satisfies the Markov independencies implied by G • We proved the factorization theorem: if G is an I-Map of P, then P(X1,…,Xp) = ∏i P(Xi | Pa(Xi))
Conditional Independencies • Let Markov(G) be the set of Markov independencies implied by G • The factorization theorem shows: if G is an I-Map of P, then P factorizes as ∏i P(Xi | Pa(Xi)) • We can also show the converse: • Thm: if P(X1,…,Xp) = ∏i P(Xi | Pa(Xi)), then G is an I-Map of P
Proof (Outline) • Shown by example (Figure: a small DAG over X, Y, and Z)
Implied Independencies • Does a graph G imply additional independencies as a consequence of Markov(G)? • We can define a logic of independence statements • Some axioms: • Symmetry: I(X; Y | Z) ⇒ I(Y; X | Z) • Decomposition: I(X; Y1, Y2 | Z) ⇒ I(X; Y1 | Z)
d-separation • A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets of variables X, Y, and Z, returns either yes or no • Goal: d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)
Paths (Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call) • Intuition: dependency must “flow” along paths in the graph • A path is a sequence of neighboring variables • Examples: R ← E → A ← B, and C ← A ← E → R
Paths • We want to know when a path is • active -- creates dependency between end nodes • blocked -- cannot create dependency between end nodes • We want to classify situations in which paths are active
Path Blockage • Three cases: • 1. Common cause (R ← E → A): blocked when the middle node E is in the evidence; active when it is not
Path Blockage • Three cases: • 1. Common cause • 2. Intermediate cause (E → A → C): blocked when the middle node A is in the evidence; active when it is not
Path Blockage • Three cases: • 1. Common cause • 2. Intermediate cause • 3. Common effect (E → A ← B, with descendant C): blocked when neither A nor any of its descendants is in the evidence; active when A or one of its descendants is
Path Blockage -- General Case • A path is active, given evidence Z, if • whenever we have a configuration A → B ← C on the path, B or one of its descendants is in Z, and • no other node on the path is in Z • A path is blocked, given evidence Z, if it is not active
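The three cases fit in a few lines of code. The following sketch (my own illustration, not from the slides; `descendants` is an assumed helper computed from the graph) decides whether a single path segment a - m - b is active given evidence Z:

```python
def segment_active(a, m, b, parents, descendants, Z):
    """True iff the path segment a - m - b is active given evidence set Z.
    `parents` maps node -> set of parents; `descendants(m)` returns the
    set of descendants of m (assumed helper)."""
    if a in parents[m] and b in parents[m]:
        # Common effect (v-structure a -> m <- b): active iff m or one
        # of its descendants is observed.
        return m in Z or bool(descendants(m) & Z)
    # Common cause (a <- m -> b) or intermediate cause (a -> m -> b or
    # its reverse): active iff the middle node itself is unobserved.
    return m not in Z

# Usage on the v-structure E -> A <- B with descendant C (from the slides):
parents = {'E': set(), 'B': set(), 'A': {'E', 'B'}, 'C': {'A'}}
desc = lambda m: {'C'} if m == 'A' else set()      # hypothetical helper
print(segment_active('E', 'A', 'B', parents, desc, set()))   # False: blocked
print(segment_active('E', 'A', 'B', parents, desc, {'C'}))   # True: active
```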
Example (Figure: B → A ← E, E → R, A → C) • d-sep(R, B)?
Example • d-sep(R, B) = yes • d-sep(R, B | A)?
Example • d-sep(R, B) = yes • d-sep(R, B | A) = no • d-sep(R, B | E, A)? (yes: conditioning on E blocks the common-cause segment R ← E → A)
d-Separation • X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z • Checking d-separation can be done efficiently (linear time in the number of edges) • Bottom-up phase: mark all nodes whose descendants are in Z • X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check whether they are blocked
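A sketch of the full linear-time check described above, in the standard two-phase reachability formulation (bottom-up ancestor marking, then a BFS over (node, direction) states); the function and variable names are my own:

```python
from collections import deque

def d_separated(x, y, Z, parents, children):
    """d-sep(x; y | Z, G): True iff every path from x to y is blocked by Z.
    `parents`/`children` map each node to a set of nodes."""
    Z = set(Z)
    # Phase 1 (bottom-up): mark Z and all of its ancestors.
    ancestors_of_Z, stack = set(), list(Z)
    while stack:
        n = stack.pop()
        if n not in ancestors_of_Z:
            ancestors_of_Z.add(n)
            stack.extend(parents[n])
    # Phase 2: BFS along active edges. 'up' = arrived from a child,
    # 'down' = arrived from a parent.
    visited, queue = set(), deque([(x, 'up')])
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y and node not in Z:
            return False                      # found an active path to y
        if direction == 'up' and node not in Z:
            for p in parents[node]:
                queue.append((p, 'up'))
            for c in children[node]:
                queue.append((c, 'down'))
        elif direction == 'down':
            if node not in Z:                 # chain / common cause stays active
                for c in children[node]:
                    queue.append((c, 'down'))
            if node in ancestors_of_Z:        # v-structure activated by Z
                for p in parents[node]:
                    queue.append((p, 'up'))
    return True

# The slides' example: B -> A <- E, E -> R, A -> C.
parents = {'B': set(), 'E': set(), 'R': {'E'}, 'A': {'B', 'E'}, 'C': {'A'}}
children = {'B': {'A'}, 'E': {'A', 'R'}, 'R': set(), 'A': {'C'}, 'C': set()}
assert d_separated('R', 'B', set(), parents, children)        # yes
assert not d_separated('R', 'B', {'A'}, parents, children)    # no
assert d_separated('R', 'B', {'E', 'A'}, parents, children)   # yes
```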
Soundness • Thm: if • G is an I-Map of P • d-sep(X; Y | Z, G) = yes • then • P satisfies I(X; Y | Z) • Informally: any independence reported by d-separation is satisfied by the underlying distribution
Completeness Thm: If d-sep( X; Y | Z, G ) = no • then there is a distribution P such that • G is an I-Map of P • P does not satisfy I( X; Y | Z ) Informally: Any independence not reported by d-separation might be violated by the underlying distribution • We cannot determine this by examining the graph structure alone
Summary: Structure • We explored DAGs as a representation of conditional independencies: • Markov independencies of a DAG • Tight correspondence between Markov(G) and the factorization defined by G • d-separation, a sound & complete procedure for computing the consequences of the independencies • Notion of minimal I-Map • P-Maps • This theory is the basis for defining Bayesian networks
Complexity of variable elimination • Suppose in one elimination step we compute m_X(y1,…,yk) = Σx ∏i=1..m f_i(x, Zi), where each Zi ⊆ {y1,…,yk} • This requires • m · |Val(X)| · ∏i |Val(Yi)| multiplications: for each value of x, y1, …, yk, we do m multiplications • |Val(X)| · ∏i |Val(Yi)| additions: for each value of y1, …, yk, we do |Val(X)| additions • Complexity is exponential in the number of variables in the intermediate factor
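A sketch of one elimination step that makes these counts explicit (illustrative code, not the course's implementation; factors are stored as plain dicts from assignment tuples to values):

```python
from itertools import product

def eliminate(x, factors, domains):
    """One variable-elimination step: multiply all factors that mention x,
    then sum x out. `factors` is a list of (scope_tuple, table_dict) pairs;
    `domains` maps each variable to its list of values."""
    relevant = [(scope, f) for scope, f in factors if x in scope]
    new_scope = tuple(sorted({v for scope, _ in relevant for v in scope} - {x}))
    mults = adds = 0
    result = {}
    for ys in product(*(domains[v] for v in new_scope)):
        assign = dict(zip(new_scope, ys))
        total = 0.0
        for xv in domains[x]:                 # |Val(X)| additions per ys
            assign[x] = xv
            prod = 1.0
            for scope, f in relevant:         # m multiplications per (x, ys)
                prod *= f[tuple(assign[v] for v in scope)]
                mults += 1
            total += prod
            adds += 1
        result[ys] = total
    # Totals: mults = m * |Val(X)| * prod_i |Val(Yi)|,
    #         adds  =     |Val(X)| * prod_i |Val(Yi)|.
    return (new_scope, result), mults, adds
```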
Undirected graph representation • At each stage of the procedure, we have an algebraic term that we need to evaluate • In general this term is of the form Σx1 … Σxk ∏i f_i(Zi), where the Zi are sets of variables • We now draw a graph with an undirected edge X--Y if X, Y are arguments of some factor • that is, if X, Y are in some Zi • Note: this is the Markov network that describes the probability distribution over the variables we have not yet eliminated
Chordal Graphs (Figure: a DAG over S, V, L, T, B, A, X, D and the corresponding undirected chordal graph) • elimination ordering ⇒ undirected chordal graph • Maximal cliques are factors in elimination • Factors in elimination are cliques in the graph • Complexity is exponential in the size of the largest clique in the graph
Induced Width • The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination • This quantity is called the induced width of a graph according to the specified ordering • Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph (this minimum over all orderings is the graph's treewidth)
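Computing the induced width of a given ordering is straightforward: simulate the elimination, adding fill-in edges, and track the largest neighborhood. A small sketch (names and representation are my own):

```python
def induced_width(neighbors, order):
    """Induced width of an undirected graph under elimination order `order`.
    `neighbors` maps node -> set of adjacent nodes (copied, not mutated)."""
    adj = {v: set(ns) for v, ns in neighbors.items()}
    width = 0
    for v in order:
        nbrs = adj[v]
        width = max(width, len(nbrs))   # clique size here is len(nbrs) + 1
        for a in nbrs:                  # fill in: connect remaining neighbors
            adj[a] |= nbrs - {a}
            adj[a].discard(v)
        del adj[v]
    return width

# e.g. a chain A--B--C has induced width 1 under the order A, B, C:
print(induced_width({'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}},
                    ['A', 'B', 'C']))   # -> 1
```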
PolyTrees (Figure: a polytree over A, B, C, D, E, F, G, H) • A polytree is a network where there is at most one path from one variable to another • Thm: inference in a polytree is linear in the representation size of the network • This assumes tabular CPT representation
Today • Probabilistic graphical models • Treewidth methods: • Variable elimination • Clique tree algorithm • Applications du jour: Sensor Networks
Junction Tree • Why junction tree? • More efficient for some tasks than variable elimination • We can avoid cycles if we turn highly-interconnected subsets of the nodes into “supernodes” (clusters) • Objective: compute P(V = v | E = e), where v is a value of a variable V and e is the evidence for a set of variables E
Properties of Junction Tree • An undirected tree • Each node is a cluster (nonempty set) of variables • Running intersection property: given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y • Separator sets (sepsets): the intersection of adjacent clusters (Figure: clusters ABD, ADE, DEF with sepsets AD and DE; e.g. cluster ABD, sepset DE)
Potentials • Potentials: denoted by φ • Marginalization: φX = Σ(S\X) φS, the marginalization of φS into X ⊆ S • Multiplication: φ(X∪Y) = φX · φY, the multiplication of φX and φY
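As a concrete reading of these definitions, here is a sketch of table potentials as Python dicts with the two operations; the representation is my own choice for illustration:

```python
from itertools import product

def marginalize(scope, phi, onto):
    """phi_onto = phi summed over the variables of `scope` not in `onto`.
    `phi` maps assignment tuples (aligned with `scope`) to values."""
    idx = [scope.index(v) for v in onto]
    out = {}
    for assign, val in phi.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return out

def multiply(scope_a, phi_a, scope_b, phi_b, domains):
    """Pointwise product phi_a * phi_b over the union of the two scopes."""
    scope = list(dict.fromkeys(list(scope_a) + list(scope_b)))  # ordered union
    ia = [scope.index(v) for v in scope_a]
    ib = [scope.index(v) for v in scope_b]
    out = {}
    for assign in product(*(domains[v] for v in scope)):
        out[assign] = (phi_a[tuple(assign[i] for i in ia)]
                       * phi_b[tuple(assign[i] for i in ib)])
    return tuple(scope), out

# e.g. a potential over (X, Y) marginalized into X:
phi = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.3}
print(marginalize(('X', 'Y'), phi, ('X',)))   # {(0,): 0.6..., (1,): 0.4}
```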
Properties of Junction Tree • Belief potentials: map each instantiation of clusters or sepsets into a real number • Constraints: • Consistency: for each cluster X and neighboring sepset S, Σ(X\S) φX = φS • The joint distribution: P(U) = ∏i φXi / ∏j φSj (product of cluster potentials divided by product of sepset potentials)
Properties of Junction Tree • If a junction tree satisfies these properties, it follows that: • For each cluster (or sepset) X, φX = P(X) • The probability distribution of any variable V can be computed using any cluster (or sepset) X that contains V: P(V) = Σ(X\{V}) φX
Building Junction Trees • DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree
Constructing the Moral Graph (Figure: a DAG over A, B, C, D, E, F, G, H)
Constructing the Moral Graph • Add undirected edges between all co-parents which are not currently joined -- marrying parents
Constructing the Moral Graph • Add undirected edges between all co-parents which are not currently joined -- marrying parents • Drop the directions of the arcs
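Both moralization steps fit in one short function. A sketch (my own code, assuming the DAG is given as a `parents` dict):

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a DAG: marry all co-parents, then drop directions.
    `parents` maps each node to its set of parents; returns an undirected
    adjacency dict."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                               # drop directions
            adj[p].add(child); adj[child].add(p)
        for a, b in combinations(sorted(ps), 2):   # marry co-parents
            adj[a].add(b); adj[b].add(a)
    return adj

# Hypothetical fragment: C and G are co-parents of E, so the edge C-G is added.
print(moralize({'E': {'C', 'G'}, 'C': {'A'}, 'G': set(), 'A': set()})['C'])
# -> {'A', 'E', 'G'}
```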
Triangulating • An undirected graph is triangulated iff every cycle of length > 3 contains an edge that connects two nonadjacent nodes of the cycle
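One standard way to triangulate, matching the variable-elimination view above, is elimination with fill-in edges: process nodes in some order and connect each node's remaining neighbors before removing it. A sketch under that assumption (the same fill-in loop as the induced-width sketch, here returning the added edges):

```python
def triangulate(neighbors, order):
    """Triangulate by elimination: for each node in `order`, connect its
    remaining neighbors (fill-in edges), then remove it. The original
    edges plus the returned fill-ins form a chordal graph."""
    adj = {v: set(ns) for v, ns in neighbors.items()}
    alive = set(adj)
    fill_in = []
    for v in order:
        nbrs = adj[v] & alive
        for a, b in ((a, b) for a in nbrs for b in nbrs if a < b):
            if b not in adj[a]:
                adj[a].add(b); adj[b].add(a)
                fill_in.append((a, b))
        alive.discard(v)
    return fill_in
```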
Identifying Cliques • A clique is a subgraph of an undirected graph that is complete and maximal (Figure: the triangulated graph over A-H with cliques ABD, ADE, ACE, CEG, DEF, EGH)
Junction Tree • A junction tree is a subgraph of the clique graph that • is a tree • contains all the cliques • satisfies the running intersection property (Figure: clique graph over ABD, ADE, ACE, CEG, DEF, EGH; junction tree with sepsets AD, AE, CE, DE, EG)
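Given the cliques, a junction tree satisfying the running intersection property can be built as a maximum-weight spanning tree of the clique graph, with edge weight equal to sepset size. A sketch (my own code; cliques as frozensets):

```python
def junction_tree(cliques):
    """Maximum-weight spanning tree of the clique graph, with edge weight
    = size of the pairwise intersection (the sepset)."""
    pairs = [(a, b) for i, a in enumerate(cliques)
             for b in cliques[i + 1:] if a & b]
    pairs.sort(key=lambda ab: len(ab[0] & ab[1]), reverse=True)
    comp = {c: c for c in cliques}             # tiny union-find
    def find(c):
        while comp[c] is not c:
            c = comp[c]
        return c
    tree = []
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra is not rb:                       # accept edge only if no cycle
            comp[ra] = rb
            tree.append((a, b, a & b))         # (cluster, cluster, sepset)
    return tree

# The cliques from the slides' example:
cliques = [frozenset(c) for c in ('ABD', 'ADE', 'ACE', 'CEG', 'DEF', 'EGH')]
for a, b, s in junction_tree(cliques):
    print(''.join(sorted(a)), '--', ''.join(sorted(b)),
          ' sepset:', ''.join(sorted(s)))
# Sepsets AD, AE, CE, DE, EG, matching the tree on this slide.
```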
Principle of Inference • DAG → Junction Tree → (initialization) → Inconsistent Junction Tree → (propagation) → Consistent Junction Tree → (marginalization) → P(V | e)
Example: Create Join Tree • HMM with 2 time steps: X1 → X2, with observations X1 → Y1 and X2 → Y2 • Junction tree: (X1,Y1) -[X1]- (X1,X2) -[X2]- (X2,Y2)
Example: Initialization • Assign each CPT to one cluster that contains its variables, and set all sepset potentials to 1: • φ(X1,Y1) = P(X1) P(Y1|X1) • φ(X1,X2) = P(X2|X1) • φ(X2,Y2) = P(Y2|X2) • φ(X1) = φ(X2) = 1
Example: Collect Evidence • Choose an arbitrary clique, e.g. (X1,X2), where all potential functions will be collected • Call neighboring cliques recursively for messages: • 1. Call (X1,Y1): • Projection: φ(X1) = ΣY1 φ(X1,Y1) • Absorption: φ(X1,X2) ← φ(X1,X2) · φ(X1)new / φ(X1)old
Example: Collect Evidence (cont.) • 2. Call (X2,Y2): • Projection: φ(X2) = ΣY2 φ(X2,Y2) • Absorption: φ(X1,X2) ← φ(X1,X2) · φ(X2)new / φ(X2)old
Example: Distribute Evidence • Pass messages recursively to neighboring nodes • Pass message from (X1,X2) to (X1,Y1): • Projection: φ(X1) = ΣX2 φ(X1,X2) • Absorption: φ(X1,Y1) ← φ(X1,Y1) · φ(X1)new / φ(X1)old • Similarly pass a message from (X1,X2) to (X2,Y2)
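Putting the whole example together: below is a self-contained sketch of initialization, collect, and distribute on the two-step HMM junction tree. All CPT numbers are invented, and the message-passing helper is my own simplification for clusters of exactly two binary variables:

```python
# CPTs (hypothetical values): P(X1), P(X2|X1), P(Y|X); states are 0/1.
pX1 = {0: 0.6, 1: 0.4}
pX2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}   # key: (x1, x2)
pY  = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}   # key: (x, y)

# Initialization: multiply each CPT into one covering cluster; sepsets = 1.
phi_x1y1 = {(x1, y1): pX1[x1] * pY[(x1, y1)]
            for x1 in (0, 1) for y1 in (0, 1)}
phi_x1x2 = dict(pX2)
phi_x2y2 = {(x2, y2): pY[(x2, y2)] for x2 in (0, 1) for y2 in (0, 1)}
sep_x1 = {0: 1.0, 1: 1.0}
sep_x2 = {0: 1.0, 1: 1.0}

def pass_message(phi_src, axis, sep, phi_dst, dst_axis):
    """Projection then absorption: sep_new = sum over src cluster,
    dst *= sep_new / sep_old (ratio taken before updating the sepset)."""
    new = {v: sum(p for k, p in phi_src.items() if k[axis] == v)
           for v in (0, 1)}
    for k in phi_dst:
        phi_dst[k] *= new[k[dst_axis]] / sep[k[dst_axis]]
    sep.update(new)

# Collect into (X1,X2): messages from (X1,Y1) over X1, from (X2,Y2) over X2.
pass_message(phi_x1y1, 0, sep_x1, phi_x1x2, 0)
pass_message(phi_x2y2, 0, sep_x2, phi_x1x2, 1)
# Distribute back out.
pass_message(phi_x1x2, 0, sep_x1, phi_x1y1, 0)
pass_message(phi_x1x2, 1, sep_x2, phi_x2y2, 0)

# After propagation each cluster holds a true marginal; e.g. marginalize
# (X1,X2) onto X2 to read off P(X2):
print({x2: sum(p for (x1, b), p in phi_x1x2.items() if b == x2)
       for x2 in (0, 1)})   # -> approximately {0: 0.5, 1: 0.5}
```

With no evidence entered, propagation simply turns each initial cluster potential into the corresponding marginal; entering evidence would amount to zeroing out the incompatible entries of one cluster before the collect pass.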