LAC group, 16/06/2011 PGM ch 4.1-4.2 notes
So far... • Directed graphical models • Bayesian Networks • Useful because both the structure and the parameters provide a natural representation for many types of real-world domains.
This chapter... • Undirected graphical models • Useful in modelling phenomena where we cannot determine the directionality of the interaction between the variables. • Offer a different, simpler perspective on directed models (both independence structure & inference task)
This chapter... • Introduce a framework that allows both directed and undirected edges • Note: some of the results in this chapter require that we restrict attention to distributions over discrete state spaces. • Discrete vs. continuous: e.g. Boolean-valued vs. real-valued variables (see sec. 2.1.6)
The 4 students example (The misconception example, sec. 3.4.2, ex. 3.8) • 4 students get together in pairs to work on their homework for a class. The pairs that meet are shown via the edges (lines) of this undirected graph: • A : Alice • B : Bob • C : Charles • D : Debbie [Figure: undirected graph with nodes A, B, C, D and edges A–B, B–C, C–D, D–A]
The 4 students example We want to model the following distribution: • A is independent of C given B and D • B is independent of D given A and C
The 4 students example PROBLEM 1: If we try to model these independencies with a Bayesian network, we run into trouble: • Any Bayesian network I-map of such a distribution will have extraneous edges • At least one of the desired independence statements will not be captured (cont’d)
The 4 students example (cont’d) • Any Bayesian network will require us to specify the directionality of the influence Also: • The interactions look symmetric, and we would like to model this without representing a direction of influence.
The 4 students example SOLUTION 1: Undirected graph = (here) Markov network structure • Nodes (circles) represent variables • Edges (lines) represent a notion of direct probabilistic interaction between the neighbouring variables, not mediated by any other variable in the network.
The 4 students example PROBLEM 2: • How to parameterise this undirected graph? • A CPD (conditional probability distribution) is not useful, as the interaction is not directed • We would like to capture the affinities between the related variables, e.g. Alice and Bob are more likely to agree than disagree
The 4 students example SOLUTION 2: • Associate A and B with a general-purpose function, called a factor
The 4 students example • Here we focus only on non-negative factors. Factor: Let D be a set of random variables. We define a factor φ to be a function from Val(D) to R. A factor is non-negative if all its entries are non-negative. Scope: The set of variables D is called the scope of the factor and is denoted as Scope[φ].
The 4 students example • Let’s define a factor over A and B that captures the fact that Alice and Bob are more likely to agree than disagree: φ1(A,B) : Val(A,B) → R+ The value associated with a particular assignment (a, b) denotes the affinity between the two values: the higher the value of φ1(a, b), the more compatible the two values are
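(A minimal sketch of a factor as a lookup table, in Python. The specific numbers below are an assumption chosen to mirror the compatibility factor of fig. 4.1, not values given on this slide.)

```python
# A factor phi1 with scope {A, B}: a map from Val(A, B) to non-negative reals.
# Encoding: 0 = "right", 1 = "has the misconception".
# The numeric entries are assumed example values in the spirit of fig. 4.1.
phi1 = {
    (0, 0): 30,  # Alice right, Bob right  -> high affinity (they tend to agree)
    (0, 1): 5,   # Alice right, Bob wrong
    (1, 0): 1,   # Alice wrong, Bob right
    (1, 1): 10,  # Alice wrong, Bob wrong   -> also high affinity
}

scope = ("A", "B")                           # Scope[phi1]
assert all(v >= 0 for v in phi1.values())    # a non-negative factor
```

Note that, unlike a CPD, the entries need not sum to one; they only encode relative affinities.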
The 4 students example • Fig 4.1(a) shows one possible compatibility factor for A and B • It is not normalised (see the partition function later for how the global model is normalised) • Legend: 0 = right, 1 = wrong/has the misconception
The 4 students example • φ1(A,B) asserts that: • it is more likely that Alice and Bob agree: φ1(a0, b0) and φ1(a1, b1) are high - they are more likely to be either both right or both wrong • If they disagree, Alice is more likely to be right (φ1(a0, b1)) than Bob (φ1(a1, b0))
The 4 students example • φ3(C,D) asserts that: • Charles and Debbie argue all the time and will end up disagreeing anyway: φ3(c0, d1) and φ3(c1, d0) are high
The 4 students example So far: • defined the local interactions between variables (nodes) Next step: • Define a global model: combine these local interactions by multiplying them, as with a Bayesian network
The 4 students example A possible GLOBAL MODEL: P(a,b,c,d) = φ1(a, b) ∙ φ2(b, c) ∙ φ3(c, d) ∙ φ4(d, a) PROBLEM: Nothing guarantees that the result is a normalised distribution (see fig. 4.2 middle column)
The 4 students example SOLUTION Take the product of the local factors and normalise it: P(a,b,c,d) = 1/Z ∙ φ1(a, b) ∙ φ2(b, c) ∙ φ3(c, d) ∙ φ4(d, a) where Z = ∑a,b,c,d φ1(a, b) ∙ φ2(b, c) ∙ φ3(c, d) ∙ φ4(d, a) Z is a normalising constant known as the partition function: “partition” as in Markov random fields in statistical physics; “function” because Z is a function of the parameters [important for machine learning]
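(To make the normalisation concrete, here is a minimal Python sketch that multiplies the four local factors over all 2^4 joint assignments and computes the partition function Z. The factor tables phi1..phi4 are assumptions meant to mirror fig. 4.1; treat the numbers as illustrative.)

```python
from itertools import product

# Assumed factor tables, in the spirit of fig. 4.1 of the book
# (encoding: 0 = "right", 1 = "has the misconception").
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi4(D, A)

def unnormalised(a, b, c, d):
    """Product of the local factors for one joint assignment."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function: sum of the unnormalised measure over all 2**4 assignments.
Z = sum(unnormalised(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4))

# Normalised joint distribution P(a, b, c, d).
P = {(a, b, c, d): unnormalised(a, b, c, d) / Z
     for a, b, c, d in product((0, 1), repeat=4)}

print(Z)              # the partition function
print(P[1, 1, 0, 1])  # P(a1, b1, c0, d1)
```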
The 4 students example • See figure 4.2 for the calculation of the joint distribution • Exercise: compute P(a1, b1, c0, d1) by dividing the product of the factors for that assignment by the partition function Z
The 4 students example • We can use the normalised joint distribution to answer queries like: • How likely is Bob to have the misconception? • How likely is Bob to have the misconception, given that Charles doesn’t?
The 4 students example • How likely is Bob to have the misconception? P(b1) ≈ 0.732 P(b0) ≈ 0.268 Bob is almost three times as likely to have the misconception as not
The 4 students example • How likely is Bob to have the misconception, given that Charles doesn’t? P(b1|c0) ≈ 0.06
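(The two queries above amount to summing entries of the joint table. A minimal sketch, assuming the normalised joint dict P built in the previous sketch; the query helper itself is generic.)

```python
def prob(P, **evidence):
    """Sum P over all assignments consistent with the given evidence.

    P maps tuples (a, b, c, d) to probabilities; evidence uses the
    variable names 'a', 'b', 'c', 'd' with values 0 or 1.
    """
    names = ("a", "b", "c", "d")
    total = 0.0
    for assignment, p in P.items():
        values = dict(zip(names, assignment))
        if all(values[var] == val for var, val in evidence.items()):
            total += p
    return total

# Marginal query: how likely is Bob to have the misconception?
p_b1 = prob(P, b=1)

# Conditional query: P(b1 | c0) = P(b1, c0) / P(c0).
p_b1_given_c0 = prob(P, b=1, c=0) / prob(P, c=0)

print(p_b1)           # roughly 0.73 with the assumed factor values
print(p_b1_given_c0)  # roughly 0.06 with the assumed factor values
```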
The 4 students example Advantages of this approach: • Allows great flexibility in representing interactions between variables. • We can change the nature of the interaction between A and B simply by modifying the entries of the factor, without worrying about normalisation constraints or the interaction with other factors
The 4 students example • Tight connection between the factorisation of the distribution and its independence properties: • Factorisation: P(A,B,C,D) = 1/Z ∙ φ1(A, B) ∙ φ2(B, C) ∙ φ3(C, D) ∙ φ4(A, D)
The 4 students example • Using the factorisation above we can decompose the distribution in several ways e.g. P(A,B,C,D) = [1/Z ∙ φ1(A, B) ∙ φ2(B, C)] ∙ φ3(C, D) ∙ φ4(A, D) and infer that B is independent of D given A and C: the bracketed term mentions B but not D, while the remaining factors mention D but not B
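(The same connection can be checked numerically: (B ⊥ D | A, C) holds iff P(b, d | a, c) = P(b | a, c) ∙ P(d | a, c) for every assignment. A minimal sketch, reusing the joint dict P and the prob helper from the earlier sketches.)

```python
from itertools import product

# Check (B _|_ D | A, C) on the joint table P, using the prob() helper above.
for a, c in product((0, 1), repeat=2):
    p_ac = prob(P, a=a, c=c)
    for b, d in product((0, 1), repeat=2):
        joint = prob(P, a=a, b=b, c=c, d=d) / p_ac         # P(b, d | a, c)
        factored = (prob(P, a=a, b=b, c=c) / p_ac) * \
                   (prob(P, a=a, c=c, d=d) / p_ac)         # P(b | a, c) * P(d | a, c)
        assert abs(joint - factored) < 1e-9
```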