Ch 8. Graphical Models
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Summarized by B.-H. Kim, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Contents
• 8.3 Markov Random Fields
  • 8.3.1 Conditional independence properties
  • 8.3.2 Factorization properties
  • 8.3.3 Illustration: Image de-noising
  • 8.3.4 Relation to directed graphs
• 8.4 Inference in Graphical Models
  • 8.4.1 Inference on a chain
  • 8.4.2 Trees
  • 8.4.3 Factor graphs
  • 8.4.4 The sum-product algorithm
  • 8.4.5 The max-sum algorithm
  • 8.4.6 Exact inference in general graphs
  • 8.4.7 Loopy belief propagation
  • 8.4.8 Learning the graph structure
Directed graphs vs. undirected graphs
• Both are graphical models: each
  • specifies a factorization (how the joint distribution is expressed as a product of factors)
  • defines a set of conditional independence properties
• Directed graphs: the factors are local conditional distributions defined over parent-child links
• Undirected graphs: the factors are potential functions defined over maximal cliques
• Chain graphs: graphs that include both directed and undirected links
8.3.1 Conditional independence properties
(In the figures, a shaded circle denotes evidence, i.e. an observed variable.)
• In directed graphs
  • 'D-separation' test: check whether all paths connecting the two sets of nodes are 'blocked'
  • Subtle case: 'head-to-head' nodes
• In undirected graphs
  • Simple graph separation (simpler than in directed graphs)
  • Check all the paths between A and B: are they all blocked by C?
  • Equivalently, remove the nodes in C and check whether any path between A and B remains
• Markov blanket of a node in an undirected graph: simply its set of neighbours
8.3.2 Factorization properties
• Clique: a subset of the nodes such that there is a link between every pair of nodes in the subset; a maximal clique cannot be extended by any further node
• Functions of the maximal cliques (potential functions) become the factors in the decomposition of the joint distribution; the partition function Z is the normalization constant
• Potential functions are not restricted to marginal or conditional distributions
• Evaluating the normalization constant is a major limitation of undirected graphs,
• but it is not needed when working with local conditional distributions, since the constant cancels
8.3.2 Factorization properties
• Formal connection between conditional independence and factorization
  • Restriction: the potential functions must be strictly positive
  • Hammersley-Clifford theorem: the set of distributions consistent with the conditional independence properties read from the graph and the set consistent with its factorization are identical (viewing a graphical model as a filter)
• With strictly positive potentials it is convenient to express them in exponential form via an energy function; the resulting joint distribution is a Boltzmann distribution
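For reference, the factorization over maximal cliques, its normalization constant, and the exponential (Boltzmann) form referred to above can be written out as follows; the slides' own equation images were lost in extraction, so these are reconstructed in the standard PRML notation:

```latex
p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C),
\qquad
Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C),
\qquad
\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}
```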
8.3.3 Illustration: Image de-noising (1)
• Setting
  • Image as a set of binary pixel values in {-1, +1}
  • Observed noisy image: pixels y_i; unknown noise-free image: pixels x_i
  • Noise: the sign of each pixel is flipped independently with some small probability
• Goal: to recover the original noise-free image
(Figures: the original image and a noisy version with 10% of the pixels flipped.)
8.3.3 Illustration: Image de-noising (2)
• Prior knowledge (when the noise level is small)
  • Strong correlation between x_i and y_i
  • Strong correlation between neighbouring pixels x_i and x_j
• Corresponding Markov random field, with a simple energy term for each kind of clique
  • {x_i, y_i} cliques: −η x_i y_i
  • {x_i, x_j} cliques (neighbouring pixels): −β x_i x_j
  • Bias term (preference for one particular sign): h x_i
• The complete energy function for the model defines the joint distribution; this is an example of the Ising model from statistical physics
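Putting the three terms together, the complete energy function and the corresponding joint distribution take the following Ising-style form (reconstructed to match PRML Sec. 8.3.3, since the slide's equations were lost):

```latex
E(\mathbf{x}, \mathbf{y}) = h \sum_i x_i \;-\; \beta \sum_{\{i,j\}} x_i x_j \;-\; \eta \sum_i x_i y_i,
\qquad
p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \exp\{-E(\mathbf{x}, \mathbf{y})\}
```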
8.3.3 Illustration: Image de-noising (3)
• Image restoration results
• Iterated conditional modes (ICM): coordinate-wise gradient ascent (a sketch follows below)
  • Initialization: x_i = y_i for all i
  • Take one node x_j at a time, evaluate the total energy for both of its states, and keep the state with lower energy
  • Repeat until some stopping criterion is satisfied
• Graph-cut algorithm
  • Guaranteed to find the global maximum of the posterior (the exact MAP solution) for this class of models
(Figures: original image, 10% noise, restored by ICM, restored by graph cut.)
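A minimal NumPy sketch of the ICM update described above, assuming the noisy image y is a 2-D array of {-1, +1} values and a 4-neighbourhood; the function name and default parameter values are illustrative, not the ones used to produce the figures:

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, max_sweeps=10):
    """Iterated conditional modes for the Ising de-noising model.
    y: noisy image, a 2-D numpy array with entries in {-1, +1}."""
    x = y.copy()                                  # initialize x_i = y_i
    rows, cols = x.shape
    for _ in range(max_sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                # sum over the 4-neighbourhood (the {x_i, x_j} cliques)
                nb = 0.0
                if i > 0:        nb += x[i - 1, j]
                if i < rows - 1: nb += x[i + 1, j]
                if j > 0:        nb += x[i, j - 1]
                if j < cols - 1: nb += x[i, j + 1]
                # energy contribution of setting x_ij = s is
                #   h*s - beta*s*nb - eta*s*y_ij,  so compare s = +1 vs s = -1
                e_plus = h - beta * nb - eta * y[i, j]
                new_state = 1 if e_plus < -e_plus else -1
                if new_state != x[i, j]:
                    x[i, j] = new_state
                    changed = True
        if not changed:                           # a full sweep changed nothing
            break
    return x
```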
8.3.4 Relation to directed graphs (1)
• Converting a directed graph to an undirected graph
• Case 1: a directed chain of nodes
  • Take each clique potential to be the corresponding conditional distribution, absorbing p(x_1) into the first potential
  • The product of potentials is then already normalized, so the partition function Z = 1
8.3.4 Relation to directed graphs (2)
• Converting a directed graph to an undirected graph
• Case 2: the general case requires moralization, i.e. 'marrying the parents'
  • Add undirected links between all pairs of parents of each node
  • Drop the arrows; the result is the moral graph
• If the moral graph ends up fully connected, it exhibits no conditional independence properties, in contrast to the original directed graph
• We should therefore add the fewest extra links needed, so as to retain the maximum number of independence properties
• Usage example: exact inference algorithms, e.g. the junction tree algorithm (a moralization sketch follows below)
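A small sketch of the moralization step, assuming for illustration that the directed graph is supplied as a dictionary mapping each node to the list of its parents (the function name and input format are assumptions):

```python
def moralize(parents):
    """Return the moral graph of a directed graph as an undirected adjacency map.
    parents: dict mapping each node to a list of its parent nodes."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                      # drop arrow directions
            adj[child].add(p)
            adj[p].add(child)
        for a in ps:                      # 'marry the parents'
            for b in ps:
                if a != b:
                    adj[a].add(b)
                    adj[b].add(a)
    return adj

# e.g. a head-to-head node x4 with three parents becomes a fully connected graph
print(moralize({'x4': ['x1', 'x2', 'x3']}))
```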
8.3.4 Relation to directed graphs (3)
• Directed and undirected graphs can express different conditional independence properties
• A specific view: a graphical model as a filter (map)
  • D map (dependency map): every conditional independence statement satisfied by the distribution is reflected in the graph; e.g. a completely disconnected graph is a trivial D map for any distribution
  • I map (independence map): every conditional independence statement implied by the graph is satisfied by the distribution; e.g. a fully connected graph is a trivial I map for any distribution
  • Perfect map: both an I map and a D map
8.3.4 Relation to directed graphs (4)
• D: the set of distributions that can be represented as a perfect map using a directed graph
• U: the corresponding set for undirected graphs
• Neither set contains the other, and together they do not cover the set of all possible distributions
Introduction / Guidelines
• Inference in graphical models
  • Given evidence (some nodes are clamped to observed values),
  • we wish to compute the posterior distributions over other nodes
• Inference algorithms on graphical structures
  • Main idea: propagation of local messages
  • Exact inference (section 8.4): sum-product algorithm, max-product (max-sum) algorithm, junction tree algorithm
  • Approximate inference (chapters 10, 11): loopy belief propagation with a message-passing schedule (8.4.7), variational methods, sampling methods (Monte Carlo methods)
Graphical interpretation of Bayes' theorem
• Given structure: a two-node graph x → y, so the joint factorizes as p(x, y) = p(x) p(y|x)
• We observe the value of y
• Goal: infer the posterior distribution over x, i.e. p(x|y)
• The marginal distribution p(x) acts as a prior over the latent variable x
• We can evaluate the marginal distribution p(y) by summing over x
• By Bayes' theorem we can then calculate p(x|y)
(Figure panels (a)-(c) illustrate this: the original graph, the graph with y observed, and the graph re-expressed in terms of p(y) and p(x|y).)
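Written out, the steps above are simply Bayes' theorem applied to this two-node graph:

```latex
p(x, y) = p(x)\, p(y|x),
\qquad
p(y) = \sum_{x'} p(y|x')\, p(x'),
\qquad
p(x|y) = \frac{p(y|x)\, p(x)}{p(y)}
```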
8.4.1 Inference on a chain (1)
• Specific setting
  • N nodes; each discrete node has K states
  • Each potential function is therefore a K x K table, giving (N-1)K^2 parameters in total
• Problem: infer the marginal distribution p(x_n) of a particular node
• Naive implementation
  • First evaluate the joint distribution and then perform the summations explicitly: K^N values of x, so the cost grows exponentially with N
• Efficient algorithm: exploit the conditional independence structure
  • Rearrange the order of summations and multiplications; each summation effectively removes a variable from the distribution
8.4.1 Inference on a chain (2)
• The desired marginal can be expressed as shown below (Eq. 8.52)
• Key idea: multiplication is distributive over addition, so ab + ac = a(b + c) costs two operations instead of three
• The computational cost of the rearranged expression is linear in the length of the chain
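The rearranged marginal referred to above (PRML Eq. 8.52), reconstructed here because the slide's equation image was lost: the sums are pushed inside the products, giving one group of terms propagated forwards along the chain and one propagated backwards:

```latex
p(x_n) = \frac{1}{Z}\,
\underbrace{\Big[\sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1},x_n) \cdots \Big[\sum_{x_1} \psi_{1,2}(x_1,x_2)\Big]\cdots\Big]}_{\mu_\alpha(x_n)}\,
\underbrace{\Big[\sum_{x_{n+1}} \psi_{n,n+1}(x_n,x_{n+1}) \cdots \Big[\sum_{x_N} \psi_{N-1,N}(x_{N-1},x_N)\Big]\cdots\Big]}_{\mu_\beta(x_n)}
```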
8.4.1 Inference on a chain (3)
• Powerful interpretation of (8.52): the passing of local messages around on the graph
• The marginal decomposes as p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n)
  • μ_α(x_n): a message passed forwards along the chain
  • μ_β(x_n): a message passed backwards along the chain
• Both messages can be evaluated recursively
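The recursive evaluation of the two messages, in the standard PRML form:

```latex
\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}),
\qquad
\mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1})
```

The recursions are started at the two ends of the chain and run inwards towards node n.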
8.4.1 Inference on a chain (4)
• Evaluation of the marginals for every node in the chain
  • Computing them one by one separately is wasteful (much duplicated work)
  • Instead, store all of the intermediate messages μ_α and μ_β along the way and reuse them (a sketch follows below)
• If some of the nodes in the graph are observed
  • The corresponding variables are clamped to their observed values, so no summation over them is performed
  • Equivalently, the joint distribution is multiplied by an indicator function that equals 1 when the variable takes its observed value and 0 otherwise
• The joint distribution of two neighbouring nodes can be calculated from the same stored messages
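A minimal NumPy sketch of this forward/backward message passing, assuming the chain potentials are supplied as a list of K x K tables; the function name, list format, and the toy potential at the end are assumptions for illustration:

```python
import numpy as np

def chain_marginals(psi):
    """Compute p(x_n) for every node of a discrete chain MRF.
    psi: list of N-1 arrays; psi[n][i, j] is the (unnormalized) potential
    between node n in state i and node n+1 in state j."""
    N = len(psi) + 1
    K = psi[0].shape[0]
    alpha = [np.ones(K) for _ in range(N)]        # forward messages mu_alpha
    beta = [np.ones(K) for _ in range(N)]         # backward messages mu_beta
    for n in range(1, N):                         # pass messages forwards
        alpha[n] = psi[n - 1].T @ alpha[n - 1]
    for n in range(N - 2, -1, -1):                # pass messages backwards
        beta[n] = psi[n] @ beta[n + 1]
    marginals = []
    for n in range(N):
        m = alpha[n] * beta[n]
        marginals.append(m / m.sum())             # local normalization supplies 1/Z
    return marginals

# toy example: 4 binary nodes with a potential favouring equal neighbours
psi_pair = np.array([[2.0, 1.0], [1.0, 2.0]])
print(chain_marginals([psi_pair] * 3))
```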
8.4.2 Trees
• Efficient exact inference using local message passing
  • In the case of a chain: linear time in the number of nodes
  • More general case: trees, handled by the sum-product algorithm
• A tree in an undirected graph
  • There is one, and only one, path between any pair of nodes
• A tree in a directed graph
  • Root: the single node that has no parents; all other nodes have exactly one parent
  • Conversion to an undirected graph gives an undirected tree, with no extra links added during the moralization step
• Polytree
  • A directed graph in which nodes may have more than one parent, but there is still only one path (ignoring arrow directions) between any two nodes
8.4.3 Factor graphs (1)
• Factor graphs
  • Introduce additional nodes for the factors themselves, making the decomposition/factorization explicit
  • The joint distribution is written as a product of factors
• Factors in directed graphs are the local conditional distributions; in undirected graphs they are the clique potentials
• Factor graphs are bipartite: variable nodes and factor nodes, with links only between nodes of different types
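The general form of the factorization that a factor graph represents, together with one purely illustrative factorization over three variables (the particular factors f_a, ..., f_d are an assumed example):

```latex
p(\mathbf{x}) = \prod_{s} f_s(\mathbf{x}_s),
\qquad
\text{e.g.}\quad
p(x_1, x_2, x_3) = f_a(x_1, x_2)\, f_b(x_1, x_2)\, f_c(x_2, x_3)\, f_d(x_3)
```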
8.4.3 Factor graphs (2)
• Conversion
  • An undirected graph to a factor graph: introduce one factor node per maximal-clique potential
  • A directed graph to a factor graph: introduce one factor node per local conditional distribution
• There can be multiple different factor graphs that all correspond to the same undirected or directed graph
8.4.3 Factor graphs (3)
• Converting a directed or undirected tree to a factor graph
  • The result is again a tree (no loops; one and only one path connecting any two nodes)
• In the case of a directed polytree
  • Conversion to an undirected graph introduces loops, due to the moralization step
  • Conversion to a factor graph avoids these loops
8.4.3 Factor graphs (4)
• Local cycles in a directed graph can be removed on conversion to a factor graph
• Factor graphs are more specific about the precise form of the factorization: the same graph over the variables, with the same (possibly empty) set of conditional independence properties, can correspond to several different factor graphs
8.4.4 The sum-product algorithm (0)
• The sum-product algorithm
  • Takes a joint distribution p(x) expressed as a factor graph and efficiently finds the marginals over its component variables
  • An exact inference algorithm applicable to tree-structured graphs
• The max-sum algorithm
  • A related technique for finding the most probable joint configuration of the variables
8.4.4 The sum-product algorithm (1)
• Basic setting
  • Suppose all of the variables are discrete, so that marginalization corresponds to performing sums (the framework applies equally to linear-Gaussian models)
  • The original graph is an undirected tree, a directed tree, or a polytree, so the corresponding factor graph has a tree structure
• Goal: exact inference for finding marginals
8.4.4 The sum-product algorithm (2)
• To find a marginal p(x), view the variable node x as the root of the factor graph; the joint then factorizes into products of messages arriving at x
• Two distinct kinds of message
  • From factor nodes to variable nodes: μ_{f→x}(x)
  • From variable nodes to factor nodes: μ_{x→f}(x)
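The two message types and the resulting marginal, written in the standard PRML form (ne(·) denotes the neighbours of a node):

```latex
\mu_{f \to x}(x) = \sum_{x_1, \ldots, x_M} f(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f) \setminus x} \mu_{x_m \to f}(x_m),
\qquad
\mu_{x \to f}(x) = \prod_{l \in \mathrm{ne}(x) \setminus f} \mu_{f_l \to x}(x),
\qquad
p(x) \propto \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)
```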
8.4.4 The sum-product algorithm (3)
• Recursive computation of messages, started at the leaf nodes
  • A leaf variable node sends μ_{x→f}(x) = 1
  • A leaf factor node sends μ_{f→x}(x) = f(x)
• Each node can send a message towards the root once it has received messages from all of its other neighbours
• Once the root node has received messages from all of its neighbours, the required marginal can be evaluated
8.4.4 The sum-product algorithm (4)
• To find the marginals for every variable node in the graph
  • Running the algorithm separately for each node would be wasteful
  • Efficient procedure: 'overlay' multiple message-passing runs
    • Step 1: arbitrarily pick any node and designate it as the root
    • Step 2: propagate messages from the leaves to the root
    • Step 3: once the root has received messages from all of its neighbours, send messages back outwards from the root all the way to the leaves
  • By then a message has passed in both directions across every link, and every node has received a message from each of its neighbours
  • The marginal distribution for every variable in the graph can then be read off directly
8.4.4 The sum-product algorithm (5)
• Issue of normalization
  • If the factor graph was derived from a directed graph, the joint distribution is already correctly normalized
  • If it was derived from an undirected graph, there is an unknown normalization coefficient 1/Z
    • First run the sum-product algorithm to find the unnormalized marginals; 1/Z is then obtained by normalizing any one of them
8.4.4 The sum-product algorithm (6-1)
• A simple example to illustrate the operation of the sum-product algorithm
  • Four variables x1, ..., x4 with unnormalized joint f_a(x1, x2) f_b(x2, x3) f_c(x2, x4)
  • Designate node x3 as the root; the leaf nodes are then x1 and x4
8.4.4 The sum-product algorithm (6-2)
• A simple example to illustrate the operation of the sum-product algorithm (cont'd)
  • From the leaves to the root: x1 and x4 send unit messages; f_a and f_c sum out x1 and x4 respectively and pass the results to x2; x2 multiplies them and passes the product to f_b, which sums out x2 and delivers the final message to the root x3
  • From the root back to the leaves: the same pattern in reverse, starting with a unit message from x3
(A numerical sketch of these two sweeps follows below.)
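A small numerical sketch of the two sweeps on this example, assuming binary variables and arbitrary illustrative factor tables; only the return message needed for the marginal of x2 is computed, and a brute-force enumeration at the end checks the result:

```python
import numpy as np

# Illustrative factor tables over binary variables:
# fa(x1, x2), fb(x2, x3), fc(x2, x4), each indexed as [first arg, second arg].
fa = np.array([[1.0, 3.0], [2.0, 4.0]])
fb = np.array([[5.0, 1.0], [1.0, 5.0]])
fc = np.array([[2.0, 2.0], [1.0, 3.0]])

# Leaf-to-root pass (root = x3): leaf variable nodes send unit messages.
mu_x1_fa = np.ones(2)
mu_x4_fc = np.ones(2)
mu_fa_x2 = fa.T @ mu_x1_fa              # sum over x1
mu_fc_x2 = fc @ mu_x4_fc                # sum over x4
mu_x2_fb = mu_fa_x2 * mu_fc_x2          # product of the other incoming messages
mu_fb_x3 = fb.T @ mu_x2_fb              # sum over x2: message arriving at the root

# Root-to-leaf pass (only the message needed for p(x2) is shown).
mu_x3_fb = np.ones(2)
mu_fb_x2 = fb @ mu_x3_fb                # sum over x3

# Marginal of x2: product of all incoming factor-to-variable messages.
p_x2 = mu_fa_x2 * mu_fc_x2 * mu_fb_x2
p_x2 = p_x2 / p_x2.sum()                # normalization supplies 1/Z
print("p(x2) by sum-product:", p_x2)

# Sanity check by brute-force enumeration of the joint distribution.
joint = np.einsum('ab,bc,bd->abcd', fa, fb, fc)
print("p(x2) by brute force:", joint.sum(axis=(0, 2, 3)) / joint.sum())
```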
8.4.4 The sum-product algorithm (6-3)
• A simple example to illustrate the operation of the sum-product algorithm (cont'd)
• Applying the sum-product algorithm to a graph of linear-Gaussian variables leads to the linear dynamical systems (LDS) of chapter 13
8.4.5 The max-sum algorithm (1)
• Goals of the algorithm
  • To find a setting of the variables that has the largest probability
  • To find the value of that probability
• An application of dynamic programming in the context of graphical models
• Key observation: the max operator, like summation, distributes over products of non-negative factors, e.g. max(ab, ac) = a max(b, c) for a ≥ 0
  • Exchanging the max and product operators therefore gives a much more efficient computation
8.4.5 The max-sum algorithm (2)
• In practice we take logarithms, to prevent numerical underflow in products of many small probabilities
  • The logarithm is a monotonic function, so maximization is unaffected
  • The distributive property is preserved, now in the form max(a + b, a + c) = a + max(b, c)
  • Products of messages become sums, hence the name max-sum algorithm
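The resulting messages mirror those of the sum-product algorithm, with sums replaced by max and products by sums of logarithms (standard PRML form):

```latex
\mu_{f \to x}(x) = \max_{x_1, \ldots, x_M} \Big[ \ln f(x, x_1, \ldots, x_M) + \sum_{m \in \mathrm{ne}(f) \setminus x} \mu_{x_m \to f}(x_m) \Big],
\qquad
\mu_{x \to f}(x) = \sum_{l \in \mathrm{ne}(x) \setminus f} \mu_{f_l \to x}(x)
```

At the leaves the messages are initialized to μ_{x→f}(x) = 0 and μ_{f→x}(x) = ln f(x), and at the root the maximum of the summed incoming messages gives the log of the maximal joint probability.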
8.4.5 The max-sum algorithm (3)
• Finding the configuration of the variables for which the joint distribution attains its maximum value
  • Requires a rather different kind of message passing: keep track of which values of the variables gave rise to the maximum of each message
  • For each state of a given variable there is a unique state of the previous variable that maximizes the probability (indicated in the figure by the lines connecting the nodes)
  • Back-tracking along these stored choices builds a globally consistent maximizing configuration (a sketch follows below)
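A minimal sketch of max-sum with back-tracking on a chain of discrete variables (the Viterbi-style special case mentioned on the next slide), assuming the pairwise potentials are supplied as K x K tables in log space; the function name and the toy potential at the end are assumptions for illustration:

```python
import numpy as np

def max_sum_chain(log_psi):
    """Max-sum with back-tracking on a chain of discrete variables.
    log_psi: list of N-1 arrays; log_psi[n][i, j] = ln psi(x_n = i, x_{n+1} = j).
    Returns (maximum of the summed log-potentials, maximizing configuration)."""
    N = len(log_psi) + 1
    K = log_psi[0].shape[0]
    msg = np.zeros(K)                        # leaf initialization: message = 0
    backptr = []                             # backptr[n][j]: best state of x_n given x_{n+1} = j
    for n in range(N - 1):
        scores = msg[:, None] + log_psi[n]   # shape (K, K): all candidate transitions
        backptr.append(scores.argmax(axis=0))
        msg = scores.max(axis=0)             # message passed forwards to x_{n+1}
    best_val = float(msg.max())
    states = [int(msg.argmax())]             # maximizing state of the last node
    for n in range(N - 2, -1, -1):           # back-track towards the first node
        states.append(int(backptr[n][states[-1]]))
    return best_val, states[::-1]

# toy example: 4 binary nodes with potentials favouring equal neighbours
lp = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))
print(max_sum_chain([lp] * 3))
```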
8.4.5 The max-sum algorithm (4)
• The max-sum algorithm with back-tracking gives an exact maximizing configuration of the variables, provided the factor graph is a tree
• Important application: the Viterbi algorithm for hidden Markov models (chapter 13)
• For many practical applications we have to deal with graphs containing loops
  • Generalizing the message-passing framework to arbitrary graph topologies leads to the junction tree algorithm
8.4.6 Exact inference in general graphs
• Junction tree algorithm
  • See the textbook for the full construction
  • At its heart is the idea we have used already: exploit the factorization properties of the distribution so that sums and products can be interchanged
  • Partial summations can then be performed, avoiding having to work directly with the full joint distribution
8.4.7 Loopy belief propagation
• For many problems of practical interest we must resort to approximation methods
  • Variational methods (chapter 10)
  • Sampling methods, also called Monte Carlo methods (chapter 11)
• One simple approach to approximate inference in graphs with loops
  • Simply apply the sum-product algorithm anyway, even though there is no guarantee that it will yield good results: loopy belief propagation
  • A message-passing schedule must then be defined: flooding schedule, serial schedules, the notion of pending messages
8.4.8 Learning the graph structure
• Learning the graph structure itself from data requires
  • A space of possible structures
  • A measure that can be used to score each structure
• From a Bayesian viewpoint: compute a posterior over graphs, p(m | D) ∝ p(m) p(D | m), where the marginal likelihood p(D | m) provides the score for each model m
• Difficulties
  • Evaluating the marginal likelihood involves marginalizing over latent variables, a challenging computational problem
  • Exploring the space of structures is also problematic: the number of different graph structures grows exponentially with the number of nodes
  • In practice we usually resort to heuristics
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/