Ch 8. Graphical Models
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Summarized by B.-H. Kim, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Contents
• 8.3 Markov Random Fields
  • 8.3.1 Conditional independence properties
  • 8.3.2 Factorization properties
  • 8.3.3 Illustration: Image de-noising
  • 8.3.4 Relation to directed graphs
• 8.4 Inference in Graphical Models
  • 8.4.1 Inference on a chain
  • 8.4.2 Trees
  • 8.4.3 Factor graphs
  • 8.4.4 The sum-product algorithm
  • 8.4.5 The max-sum algorithm
  • 8.4.6 Exact inference in general graphs
  • 8.4.7 Loopy belief propagation
  • 8.4.8 Learning the graph structure
Directed graphs vs. undirected graphs
• Both are graphical models: each
  • specifies a factorization (how the joint distribution is expressed as a product of factors)
  • defines a set of conditional independence properties
• Directed graphs: the factors are local conditional distributions defined over parent-child links
• Undirected graphs: the factors are potential functions defined over maximal cliques
• Chain graphs: graphs that include both directed and undirected links
8.3.1 Conditional independence properties
(In the figures, a shaded circle denotes evidence, i.e. an observed variable.)
• In directed graphs
  • 'D-separation' test: check whether all paths connecting the two sets of nodes are 'blocked'
  • Subtle case: 'head-to-head' nodes
• In undirected graphs
  • Simple graph separation (simpler than in directed graphs)
  • Check all the paths between A and B: are they all blocked by C?
  • Equivalently, remove the nodes in C and check whether any path between A and B remains
• Markov blanket of a node in an undirected graph: simply its set of neighbours
8.3.2 Factorization properties
• Clique: a subset of the nodes such that there is a link between every pair of nodes in the subset; a maximal clique cannot be extended by any further node
• Functions of the maximal cliques (potential functions) become the factors in the decomposition of the joint distribution; the partition function Z is the normalization constant
• Potential functions are not restricted to marginal or conditional distributions
• Evaluating the normalization constant is a major limitation of undirected graphs,
• but it is not needed when working with local conditional distributions, since the constant cancels
8.3.2 Factorization properties
• Formal connection between conditional independence and factorization
  • Restriction: the potential functions must be strictly positive
  • Hammersley-Clifford theorem: the set of distributions consistent with the conditional independence properties read from the graph and the set consistent with its factorization are identical (viewing a graphical model as a filter)
• With strictly positive potentials it is convenient to express them in exponential form via an energy function; the resulting joint distribution is a Boltzmann distribution
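For reference, the factorization over maximal cliques, its normalization constant, and the exponential (Boltzmann) form referred to above can be written out as follows; the slides' own equation images were lost in extraction, so these are reconstructed in the standard PRML notation:

```latex
p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C),
\qquad
Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C),
\qquad
\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}
```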
8.3.3 Illustration: Image de-noising (1)
• Setting
  • Image as a set of binary pixel values in {-1, +1}
  • Observed noisy image: pixels y_i; unknown noise-free image: pixels x_i
  • Noise: the sign of each pixel is flipped independently with some small probability
• Goal: to recover the original noise-free image
(Figures: the original image and a noisy version with 10% of the pixels flipped.)
8.3.3 Illustration: Image de-noising (2)
• Prior knowledge (when the noise level is small)
  • Strong correlation between x_i and y_i
  • Strong correlation between neighbouring pixels x_i and x_j
• Corresponding Markov random field, with a simple energy term for each kind of clique
  • {x_i, y_i} cliques: −η x_i y_i
  • {x_i, x_j} cliques (neighbouring pixels): −β x_i x_j
  • Bias term (preference for one particular sign): h x_i
• The complete energy function for the model defines the joint distribution; this is an example of the Ising model from statistical physics
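Putting the three terms together, the complete energy function and the corresponding joint distribution take the following Ising-style form (reconstructed to match PRML Sec. 8.3.3, since the slide's equations were lost):

```latex
E(\mathbf{x}, \mathbf{y}) = h \sum_i x_i \;-\; \beta \sum_{\{i,j\}} x_i x_j \;-\; \eta \sum_i x_i y_i,
\qquad
p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \exp\{-E(\mathbf{x}, \mathbf{y})\}
```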
8.3.3 Illustration: Image de-noising (3)
• Image restoration results
• Iterated conditional modes (ICM): coordinate-wise gradient ascent (a sketch follows below)
  • Initialization: x_i = y_i for all i
  • Take one node x_j at a time, evaluate the total energy for both of its states, and keep the state with lower energy
  • Repeat until some stopping criterion is satisfied
• Graph-cut algorithm
  • Guaranteed to find the global maximum of the posterior (the exact MAP solution) for this class of models
(Figures: original image, 10% noise, restored by ICM, restored by graph cut.)
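A minimal NumPy sketch of the ICM update described above, assuming the noisy image y is a 2-D array of {-1, +1} values and a 4-neighbourhood; the function name and default parameter values are illustrative, not the ones used to produce the figures:

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, max_sweeps=10):
    """Iterated conditional modes for the Ising de-noising model.
    y: noisy image, a 2-D numpy array with entries in {-1, +1}."""
    x = y.copy()                                  # initialize x_i = y_i
    rows, cols = x.shape
    for _ in range(max_sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                # sum over the 4-neighbourhood (the {x_i, x_j} cliques)
                nb = 0.0
                if i > 0:        nb += x[i - 1, j]
                if i < rows - 1: nb += x[i + 1, j]
                if j > 0:        nb += x[i, j - 1]
                if j < cols - 1: nb += x[i, j + 1]
                # energy contribution of setting x_ij = s is
                #   h*s - beta*s*nb - eta*s*y_ij,  so compare s = +1 vs s = -1
                e_plus = h - beta * nb - eta * y[i, j]
                new_state = 1 if e_plus < -e_plus else -1
                if new_state != x[i, j]:
                    x[i, j] = new_state
                    changed = True
        if not changed:                           # a full sweep changed nothing
            break
    return x
```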
8.3.4 Relation to directed graphs (1)
• Converting a directed graph to an undirected graph
• Case 1: a directed chain of nodes
  • Take each clique potential to be the corresponding conditional distribution, absorbing p(x_1) into the first potential
  • The product of potentials is then already normalized, so the partition function Z = 1
8.3.4 Relation to directed graphs (2)
• Converting a directed graph to an undirected graph
• Case 2: the general case requires moralization, i.e. 'marrying the parents'
  • Add undirected links between all pairs of parents of each node
  • Drop the arrows; the result is the moral graph
• If the moral graph ends up fully connected, it exhibits no conditional independence properties, in contrast to the original directed graph
• We should therefore add the fewest extra links needed, so as to retain the maximum number of independence properties
• Usage example: exact inference algorithms, e.g. the junction tree algorithm (a moralization sketch follows below)
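A small sketch of the moralization step, assuming for illustration that the directed graph is supplied as a dictionary mapping each node to the list of its parents (the function name and input format are assumptions):

```python
def moralize(parents):
    """Return the moral graph of a directed graph as an undirected adjacency map.
    parents: dict mapping each node to a list of its parent nodes."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                      # drop arrow directions
            adj[child].add(p)
            adj[p].add(child)
        for a in ps:                      # 'marry the parents'
            for b in ps:
                if a != b:
                    adj[a].add(b)
                    adj[b].add(a)
    return adj

# e.g. a head-to-head node x4 with three parents becomes a fully connected graph
print(moralize({'x4': ['x1', 'x2', 'x3']}))
```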
8.3.4 Relation to directed graphs (3)
• Directed and undirected graphs can express different conditional independence properties
• A specific view: a graphical model as a filter (map)
  • D map (dependency map): every conditional independence statement satisfied by the distribution is reflected in the graph; e.g. a completely disconnected graph is a trivial D map for any distribution
  • I map (independence map): every conditional independence statement implied by the graph is satisfied by the distribution; e.g. a fully connected graph is a trivial I map for any distribution
  • Perfect map: both an I map and a D map
8.3.4 Relation to directed graphs (4)
• D: the set of distributions that can be represented as a perfect map using a directed graph
• U: the corresponding set for undirected graphs
• Neither set contains the other, and together they do not cover the set of all possible distributions
Introduction / Guidelines
• Inference in graphical models
  • Given evidence (some nodes are clamped to observed values),
  • we wish to compute the posterior distributions over other nodes
• Inference algorithms on graphical structures
  • Main idea: propagation of local messages
  • Exact inference (section 8.4): sum-product algorithm, max-product (max-sum) algorithm, junction tree algorithm
  • Approximate inference (chapters 10, 11): loopy belief propagation with a message-passing schedule (8.4.7), variational methods, sampling methods (Monte Carlo methods)
Graphical interpretation of Bayes' theorem
• Given structure: a two-node graph x → y, so the joint factorizes as p(x, y) = p(x) p(y|x)
• We observe the value of y
• Goal: infer the posterior distribution over x, i.e. p(x|y)
• The marginal distribution p(x) acts as a prior over the latent variable x
• We can evaluate the marginal distribution p(y) by summing over x
• By Bayes' theorem we can then calculate p(x|y)
(Figure panels (a)-(c) illustrate this: the original graph, the graph with y observed, and the graph re-expressed in terms of p(y) and p(x|y).)
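Written out, the steps above are simply Bayes' theorem applied to this two-node graph:

```latex
p(x, y) = p(x)\, p(y|x),
\qquad
p(y) = \sum_{x'} p(y|x')\, p(x'),
\qquad
p(x|y) = \frac{p(y|x)\, p(x)}{p(y)}
```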
8.4.1 Inference on a chain (1)
• Specific setting
  • N nodes; each discrete node has K states
  • Each potential function is therefore a K x K table, giving (N-1)K^2 parameters in total
• Problem: infer the marginal distribution p(x_n) of a particular node
• Naive implementation
  • First evaluate the joint distribution and then perform the summations explicitly: K^N values of x, so the cost grows exponentially with N
• Efficient algorithm: exploit the conditional independence structure
  • Rearrange the order of summations and multiplications; each summation effectively removes a variable from the distribution
8.4.1 Inference on a chain (2)
• The desired marginal can be expressed as shown below (Eq. 8.52)
• Key idea: multiplication is distributive over addition, so ab + ac = a(b + c) costs two operations instead of three
• The computational cost of the rearranged expression is linear in the length of the chain
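The rearranged marginal referred to above (PRML Eq. 8.52), reconstructed here because the slide's equation image was lost: the sums are pushed inside the products, giving one group of terms propagated forwards along the chain and one propagated backwards:

```latex
p(x_n) = \frac{1}{Z}\,
\underbrace{\Big[\sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1},x_n) \cdots \Big[\sum_{x_1} \psi_{1,2}(x_1,x_2)\Big]\cdots\Big]}_{\mu_\alpha(x_n)}\,
\underbrace{\Big[\sum_{x_{n+1}} \psi_{n,n+1}(x_n,x_{n+1}) \cdots \Big[\sum_{x_N} \psi_{N-1,N}(x_{N-1},x_N)\Big]\cdots\Big]}_{\mu_\beta(x_n)}
```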
8.4.1 Inference on a chain (3)
• Powerful interpretation of (8.52): the passing of local messages around on the graph
• The marginal decomposes as p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n)
  • μ_α(x_n): a message passed forwards along the chain
  • μ_β(x_n): a message passed backwards along the chain
• Both messages can be evaluated recursively
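The recursive evaluation of the two messages, in the standard PRML form:

```latex
\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}),
\qquad
\mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1})
```

The recursions are started at the two ends of the chain and run inwards towards node n.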
8.4.1 Inference on a chain (4)
• Evaluation of the marginals for every node in the chain
  • Computing them one by one separately is wasteful (much duplicated work)
  • Instead, store all of the intermediate messages μ_α and μ_β along the way and reuse them (a sketch follows below)
• If some of the nodes in the graph are observed
  • The corresponding variables are clamped to their observed values, so no summation over them is performed
  • Equivalently, the joint distribution is multiplied by an indicator function that equals 1 when the variable takes its observed value and 0 otherwise
• The joint distribution of two neighbouring nodes can be calculated from the same stored messages
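A minimal NumPy sketch of this forward/backward message passing, assuming the chain potentials are supplied as a list of K x K tables; the function name, list format, and the toy potential at the end are assumptions for illustration:

```python
import numpy as np

def chain_marginals(psi):
    """Compute p(x_n) for every node of a discrete chain MRF.
    psi: list of N-1 arrays; psi[n][i, j] is the (unnormalized) potential
    between node n in state i and node n+1 in state j."""
    N = len(psi) + 1
    K = psi[0].shape[0]
    alpha = [np.ones(K) for _ in range(N)]        # forward messages mu_alpha
    beta = [np.ones(K) for _ in range(N)]         # backward messages mu_beta
    for n in range(1, N):                         # pass messages forwards
        alpha[n] = psi[n - 1].T @ alpha[n - 1]
    for n in range(N - 2, -1, -1):                # pass messages backwards
        beta[n] = psi[n] @ beta[n + 1]
    marginals = []
    for n in range(N):
        m = alpha[n] * beta[n]
        marginals.append(m / m.sum())             # local normalization supplies 1/Z
    return marginals

# toy example: 4 binary nodes with a potential favouring equal neighbours
psi_pair = np.array([[2.0, 1.0], [1.0, 2.0]])
print(chain_marginals([psi_pair] * 3))
```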
8.4.2 Trees
• Efficient exact inference using local message passing
  • In the case of a chain: linear time in the number of nodes
  • More general case: trees, handled by the sum-product algorithm
• A tree in an undirected graph
  • There is one, and only one, path between any pair of nodes
• A tree in a directed graph
  • Root: the single node that has no parents; all other nodes have exactly one parent
  • Conversion to an undirected graph gives an undirected tree, with no extra links added during the moralization step
• Polytree
  • A directed graph in which nodes may have more than one parent, but there is still only one path (ignoring arrow directions) between any two nodes
8.4.3 Factor graphs (1)
• Factor graphs
  • Introduce additional nodes for the factors themselves, making the decomposition/factorization explicit
  • The joint distribution is written as a product of factors
• Factors in directed graphs are the local conditional distributions; in undirected graphs they are the clique potentials
• Factor graphs are bipartite: variable nodes and factor nodes, with links only between nodes of different types
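The general form of the factorization that a factor graph represents, together with one purely illustrative factorization over three variables (the particular factors f_a, ..., f_d are an assumed example):

```latex
p(\mathbf{x}) = \prod_{s} f_s(\mathbf{x}_s),
\qquad
\text{e.g.}\quad
p(x_1, x_2, x_3) = f_a(x_1, x_2)\, f_b(x_1, x_2)\, f_c(x_2, x_3)\, f_d(x_3)
```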
8.4.3 Factor graphs (2)
• Conversion
  • An undirected graph to a factor graph: introduce one factor node per maximal-clique potential
  • A directed graph to a factor graph: introduce one factor node per local conditional distribution
• There can be multiple different factor graphs that all correspond to the same undirected or directed graph
8.4.3 Factor graphs (3)
• Converting a directed or undirected tree to a factor graph
  • The result is again a tree (no loops; one and only one path connecting any two nodes)
• In the case of a directed polytree
  • Conversion to an undirected graph introduces loops, due to the moralization step
  • Conversion to a factor graph avoids these loops
8.4.3 Factor graphs (4)
• Local cycles in a directed graph can be removed on conversion to a factor graph
• Factor graphs are more specific about the precise form of the factorization: the same graph over the variables, with the same (possibly empty) set of conditional independence properties, can correspond to several different factor graphs
8.4.4 The sum-product algorithm (0)
• The sum-product algorithm
  • Takes a joint distribution p(x) expressed as a factor graph and efficiently finds the marginals over its component variables
  • An exact inference algorithm applicable to tree-structured graphs
• The max-sum algorithm
  • A related technique for finding the most probable joint configuration of the variables
8.4.4 The sum-product algorithm (1)
• Basic setting
  • Suppose all of the variables are discrete, so that marginalization corresponds to performing sums (the framework applies equally to linear-Gaussian models)
  • The original graph is an undirected tree, a directed tree, or a polytree, so the corresponding factor graph has a tree structure
• Goal: exact inference for finding marginals
8.4.4 The sum-product algorithm (2)
• To find a marginal p(x), view the variable node x as the root of the factor graph; the joint then factorizes into products of messages arriving at x
• Two distinct kinds of message
  • From factor nodes to variable nodes: μ_{f→x}(x)
  • From variable nodes to factor nodes: μ_{x→f}(x)
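The two message types and the resulting marginal, written in the standard PRML form (ne(·) denotes the neighbours of a node):

```latex
\mu_{f \to x}(x) = \sum_{x_1, \ldots, x_M} f(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f) \setminus x} \mu_{x_m \to f}(x_m),
\qquad
\mu_{x \to f}(x) = \prod_{l \in \mathrm{ne}(x) \setminus f} \mu_{f_l \to x}(x),
\qquad
p(x) \propto \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)
```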
8.4.4 The sum-product algorithm (3)
• Recursive computation of messages, started at the leaf nodes
  • A leaf variable node sends μ_{x→f}(x) = 1
  • A leaf factor node sends μ_{f→x}(x) = f(x)
• Each node can send a message towards the root once it has received messages from all of its other neighbours
• Once the root node has received messages from all of its neighbours, the required marginal can be evaluated
8.4.4 The sum-product algorithm (4)
• To find the marginals for every variable node in the graph
  • Running the algorithm separately for each node would be wasteful
  • Efficient procedure: 'overlay' multiple message-passing runs
    • Step 1: arbitrarily pick any node and designate it as the root
    • Step 2: propagate messages from the leaves to the root
    • Step 3: once the root has received messages from all of its neighbours, send messages back outwards from the root all the way to the leaves
  • By then a message has passed in both directions across every link, and every node has received a message from each of its neighbours
  • The marginal distribution for every variable in the graph can then be read off directly
8.4.4 The sum-product algorithm (5)
• Issue of normalization
  • If the factor graph was derived from a directed graph, the joint distribution is already correctly normalized
  • If it was derived from an undirected graph, there is an unknown normalization coefficient 1/Z
    • First run the sum-product algorithm to find the unnormalized marginals; 1/Z is then obtained by normalizing any one of them
8.4.4 The sum-product algorithm (6-1)
• A simple example to illustrate the operation of the sum-product algorithm
  • Four variables x1, ..., x4 with unnormalized joint f_a(x1, x2) f_b(x2, x3) f_c(x2, x4)
  • Designate node x3 as the root; the leaf nodes are then x1 and x4
8.4.4 The sum-product algorithm (6-2)
• A simple example to illustrate the operation of the sum-product algorithm (cont'd)
  • From the leaves to the root: x1 and x4 send unit messages; f_a and f_c sum out x1 and x4 respectively and pass the results to x2; x2 multiplies them and passes the product to f_b, which sums out x2 and delivers the final message to the root x3
  • From the root back to the leaves: the same pattern in reverse, starting with a unit message from x3
(A numerical sketch of these two sweeps follows below.)
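A small numerical sketch of the two sweeps on this example, assuming binary variables and arbitrary illustrative factor tables; only the return message needed for the marginal of x2 is computed, and a brute-force enumeration at the end checks the result:

```python
import numpy as np

# Illustrative factor tables over binary variables:
# fa(x1, x2), fb(x2, x3), fc(x2, x4), each indexed as [first arg, second arg].
fa = np.array([[1.0, 3.0], [2.0, 4.0]])
fb = np.array([[5.0, 1.0], [1.0, 5.0]])
fc = np.array([[2.0, 2.0], [1.0, 3.0]])

# Leaf-to-root pass (root = x3): leaf variable nodes send unit messages.
mu_x1_fa = np.ones(2)
mu_x4_fc = np.ones(2)
mu_fa_x2 = fa.T @ mu_x1_fa              # sum over x1
mu_fc_x2 = fc @ mu_x4_fc                # sum over x4
mu_x2_fb = mu_fa_x2 * mu_fc_x2          # product of the other incoming messages
mu_fb_x3 = fb.T @ mu_x2_fb              # sum over x2: message arriving at the root

# Root-to-leaf pass (only the message needed for p(x2) is shown).
mu_x3_fb = np.ones(2)
mu_fb_x2 = fb @ mu_x3_fb                # sum over x3

# Marginal of x2: product of all incoming factor-to-variable messages.
p_x2 = mu_fa_x2 * mu_fc_x2 * mu_fb_x2
p_x2 = p_x2 / p_x2.sum()                # normalization supplies 1/Z
print("p(x2) by sum-product:", p_x2)

# Sanity check by brute-force enumeration of the joint distribution.
joint = np.einsum('ab,bc,bd->abcd', fa, fb, fc)
print("p(x2) by brute force:", joint.sum(axis=(0, 2, 3)) / joint.sum())
```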
8.4.4 The sum-product algorithm (6-3)
• A simple example to illustrate the operation of the sum-product algorithm (cont'd)
• Applying the sum-product algorithm to a graph of linear-Gaussian variables leads to the linear dynamical systems (LDS) of chapter 13
8.4.5 The max-sum algorithm (1)
• Goals of the algorithm
  • To find a setting of the variables that has the largest probability
  • To find the value of that probability
• An application of dynamic programming in the context of graphical models
• Key observation: the max operator, like summation, distributes over products of non-negative factors, e.g. max(ab, ac) = a max(b, c) for a ≥ 0
  • Exchanging the max and product operators therefore gives a much more efficient computation
8.4.5 The max-sum algorithm (2)
• In practice we take logarithms, to prevent numerical underflow in products of many small probabilities
  • The logarithm is a monotonic function, so maximization is unaffected
  • The distributive property is preserved, now in the form max(a + b, a + c) = a + max(b, c)
  • Products of messages become sums, hence the name max-sum algorithm
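The resulting messages mirror those of the sum-product algorithm, with sums replaced by max and products by sums of logarithms (standard PRML form):

```latex
\mu_{f \to x}(x) = \max_{x_1, \ldots, x_M} \Big[ \ln f(x, x_1, \ldots, x_M) + \sum_{m \in \mathrm{ne}(f) \setminus x} \mu_{x_m \to f}(x_m) \Big],
\qquad
\mu_{x \to f}(x) = \sum_{l \in \mathrm{ne}(x) \setminus f} \mu_{f_l \to x}(x)
```

At the leaves the messages are initialized to μ_{x→f}(x) = 0 and μ_{f→x}(x) = ln f(x), and at the root the maximum of the summed incoming messages gives the log of the maximal joint probability.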
8.4.5 The max-sum algorithm (3)
• Finding the configuration of the variables for which the joint distribution attains its maximum value
  • Requires a rather different kind of message passing: keep track of which values of the variables gave rise to the maximum of each message
  • For each state of a given variable there is a unique state of the previous variable that maximizes the probability (indicated in the figure by the lines connecting the nodes)
  • Back-tracking along these stored choices builds a globally consistent maximizing configuration (a sketch follows below)
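A minimal sketch of max-sum with back-tracking on a chain of discrete variables (the Viterbi-style special case mentioned on the next slide), assuming the pairwise potentials are supplied as K x K tables in log space; the function name and the toy potential at the end are assumptions for illustration:

```python
import numpy as np

def max_sum_chain(log_psi):
    """Max-sum with back-tracking on a chain of discrete variables.
    log_psi: list of N-1 arrays; log_psi[n][i, j] = ln psi(x_n = i, x_{n+1} = j).
    Returns (maximum of the summed log-potentials, maximizing configuration)."""
    N = len(log_psi) + 1
    K = log_psi[0].shape[0]
    msg = np.zeros(K)                        # leaf initialization: message = 0
    backptr = []                             # backptr[n][j]: best state of x_n given x_{n+1} = j
    for n in range(N - 1):
        scores = msg[:, None] + log_psi[n]   # shape (K, K): all candidate transitions
        backptr.append(scores.argmax(axis=0))
        msg = scores.max(axis=0)             # message passed forwards to x_{n+1}
    best_val = float(msg.max())
    states = [int(msg.argmax())]             # maximizing state of the last node
    for n in range(N - 2, -1, -1):           # back-track towards the first node
        states.append(int(backptr[n][states[-1]]))
    return best_val, states[::-1]

# toy example: 4 binary nodes with potentials favouring equal neighbours
lp = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))
print(max_sum_chain([lp] * 3))
```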
8.4.5 The max-sum algorithm (4)
• The max-sum algorithm with back-tracking gives an exact maximizing configuration of the variables, provided the factor graph is a tree
• Important application: the Viterbi algorithm for hidden Markov models (chapter 13)
• For many practical applications we have to deal with graphs containing loops
  • Generalizing the message-passing framework to arbitrary graph topologies leads to the junction tree algorithm
8.4.6 Exact inference in general graphs
• Junction tree algorithm
  • See the textbook for the full construction
  • At its heart is the idea we have used already: exploit the factorization properties of the distribution so that sums and products can be interchanged
  • Partial summations can then be performed, avoiding having to work directly with the full joint distribution
8.4.7 Loopy belief propagation
• For many problems of practical interest we must resort to approximation methods
  • Variational methods (chapter 10)
  • Sampling methods, also called Monte Carlo methods (chapter 11)
• One simple approach to approximate inference in graphs with loops
  • Simply apply the sum-product algorithm anyway, even though there is no guarantee that it will yield good results: loopy belief propagation
  • A message-passing schedule must then be defined: flooding schedule, serial schedules, the notion of pending messages
8.4.8 Learning the graph structure
• Learning the graph structure itself from data requires
  • A space of possible structures
  • A measure that can be used to score each structure
• From a Bayesian viewpoint: compute a posterior over graphs, p(m | D) ∝ p(m) p(D | m), where the marginal likelihood p(D | m) provides the score for each model m
• Difficulties
  • Evaluating the marginal likelihood involves marginalizing over latent variables, a challenging computational problem
  • Exploring the space of structures is also problematic: the number of different graph structures grows exponentially with the number of nodes
  • In practice we usually resort to heuristics
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/