Graphical model software for machine learning

Graphical model software for machine learning. Kevin Murphy, University of British Columbia, December 2005

Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models

Supervised learning as Bayesian inference [Figure: training and testing panels; training nodes X1, …, Xn, …, XN with labels Y1, …, Yn, …, YN (plate over N); test nodes X*, Y*]

Supervised learning as optimization [Figure: training and testing panels; training nodes X1, …, Xn, …, XN with labels Y1, …, Yn, …, YN (plate over N); test nodes X*, Y*]

Example: logistic regression • Let yn ∈ {1,…,C} be given by a softmax • Maximize conditional log likelihood • “Max margin” solution
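A minimal sketch of the softmax model and its conditional log likelihood, assuming a linear score per class. The names `softmax`, `class_probs`, `cond_log_lik` and `gradient` are illustrative, and labels are 0-based here rather than the slide's {1,…,C}.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def class_probs(W, x):
    """P(y = c | x) for each class c; W holds one weight vector per class."""
    return softmax([sum(w * xj for w, xj in zip(wc, x)) for wc in W])

def cond_log_lik(W, X, Y):
    """Conditional log likelihood: sum over n of log P(y_n | x_n, W)."""
    return sum(math.log(class_probs(W, x)[y]) for x, y in zip(X, Y))

def gradient(W, X, Y):
    """Gradient of cond_log_lik: observed features minus expected features."""
    G = [[0.0] * len(W[0]) for _ in W]
    for x, y in zip(X, Y):
        p = class_probs(W, x)
        for c in range(len(W)):
            coeff = (1.0 if c == y else 0.0) - p[c]
            for j, xj in enumerate(x):
                G[c][j] += coeff * xj
    return G
```

Any first- or second-order optimizer can climb this objective using `gradient`; the objective is concave, so there are no local optima to worry about.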

1D chain CRFs for sequence labeling A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, yn ∈ {1,…,C}^m. [Figure: chain of label nodes Yn1, Yn2, …, Ynm conditioned on Xn, with edge potentials ψij and local evidence φi]

2D Lattice CRFs for pixel labeling A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψij are image dependent.

2D Lattice MRFs for pixel labeling A Markov random field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψij(yi,yj). We also have a per-pixel generative model of observations P(xi|yi). [Equation annotations: local evidence, potential function, partition function]
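A hedged reconstruction of the joint distribution this slide describes, assuming the standard pairwise form in which the edge set is the 4-connected lattice. The exact expression on the original slide may differ.

```latex
P(x, y) \;=\; \frac{1}{Z} \prod_{\langle i,j \rangle} \psi_{ij}(y_i, y_j) \; \prod_i P(x_i \mid y_i),
\qquad
Z \;=\; \sum_{y} \prod_{\langle i,j \rangle} \psi_{ij}(y_i, y_j)
```

Here the product over edges is the potential function term, P(xi|yi) is the local evidence, and Z is the partition function.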

Tree-structured CRFs • Used in parts-based object detection • Yi is location of part i in image [Figure: tree over the parts nose, eyeR, eyeL, mouth] • Fischler & Elschlager, “The representation and matching of pictorial structures”, IEEE Trans. Computers ’73 • Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition”, IJCV’05

General CRFs • In general, the graph may have arbitrary structure • eg for collective web page classification, nodes = URLs, edges = hyperlinks • The potentials are in general defined on cliques, not just edges

Factor graphs Square nodes = factors (potentials) Round nodes = random variables Graph structure = bipartite

Potential functions • For the local evidence, we can use a discriminative classifier (trained iid) • For the edge compatibilities, we can use a maxent/log-linear form, using pre-defined features

Restricted potential functions • For some applications (esp. in vision), we often use a Potts model • We can generalize this for ordered labels (eg discretization of continuous states)
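A small sketch of two restricted potential tables. The parameterizations are assumptions, since several equivalent forms of the Potts model are in use: `potts_potential` rewards equal labels by a factor exp(beta), and `truncated_linear_potential` is one common generalization to ordered labels that depends only on the label difference. Both function names are hypothetical.

```python
import math

def potts_potential(K, beta):
    """K x K table: exp(beta) when the two labels agree, 1 otherwise."""
    return [[math.exp(beta) if a == b else 1.0 for b in range(K)]
            for a in range(K)]

def truncated_linear_potential(K, scale, cap):
    """K x K table for ordered labels: depends only on |a - b|, capped at cap."""
    return [[math.exp(-scale * min(abs(a - b), cap)) for b in range(K)]
            for a in range(K)]
```

Both tables have a single free parameter (plus the cap), compared with K^2 entries for an unrestricted edge potential.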

Learning CRFs • The log likelihood sums over cliques, with parameters tied across cliques • Its gradient is the observed features minus the expected features
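A hedged reconstruction of the two expressions, assuming the standard log-linear CRF with clique features f_c and a parameter vector θ shared (tied) across cliques. The symbols f_c, θ and Z(x_n, θ) are notation chosen here, not taken from the slide.

```latex
\ell(\theta) \;=\; \sum_{n} \Big[ \sum_{c} \theta^{\top} f_c(y_{n,c}, x_n) \;-\; \log Z(x_n, \theta) \Big]

\nabla_{\theta}\, \ell \;=\; \sum_{n} \sum_{c} \Big[ f_c(y_{n,c}, x_n) \;-\; \mathbb{E}_{P(y \mid x_n, \theta)}\, f_c(y_c, x_n) \Big]
```

The second term of the gradient comes from differentiating log Z, which is why learning needs the clique marginals and hence inference.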

Learning CRFs • Given the gradient, one can find the global optimum using first or second order optimization methods, such as • Conjugate gradient • Limited memory BFGS • Stochastic meta descent (SMD)? • The bottleneck is computing the expected features needed for the gradient

Exact inference • For 1D chains, one can compute P(yi, yi+1|x) exactly in O(N K^2) time using belief propagation (BP = forwards backwards algorithm) • For restricted potentials (eg ψij depending only on the label difference), one can do this in O(NK) time using FFT-like tricks • This can be generalized to trees.
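A minimal sketch of the forwards backwards algorithm on a chain, assuming one K x K edge potential shared by all edges and strictly positive normalizers. The function name and argument layout are illustrative, not taken from any package mentioned in these slides.

```python
def forwards_backwards(local_ev, edge_pot):
    """Node and pairwise marginals of a chain.

    local_ev[i][k] : local evidence for node i in state k
    edge_pot[j][k] : potential for (y_i = j, y_{i+1} = k), shared by all edges
    """
    N, K = len(local_ev), len(local_ev[0])

    def normalize(v):
        s = sum(v)
        return [a / s for a in v]

    # Forward pass: alpha[i] is proportional to P(y_i, evidence up to i).
    alpha = [normalize(local_ev[0])]
    for i in range(1, N):
        alpha.append(normalize([
            local_ev[i][k] * sum(alpha[i - 1][j] * edge_pot[j][k] for j in range(K))
            for k in range(K)]))

    # Backward pass: beta[i] summarizes the evidence after node i.
    beta = [[1.0] * K for _ in range(N)]
    for i in range(N - 2, -1, -1):
        beta[i] = normalize([
            sum(edge_pot[j][k] * local_ev[i + 1][k] * beta[i + 1][k] for k in range(K))
            for j in range(K)])

    node_marg = [normalize([alpha[i][k] * beta[i][k] for k in range(K)])
                 for i in range(N)]

    # Pairwise marginals P(y_i, y_{i+1} | x), one K x K table per edge.
    pair_marg = []
    for i in range(N - 1):
        table = [[alpha[i][j] * edge_pot[j][k] * local_ev[i + 1][k] * beta[i + 1][k]
                  for k in range(K)] for j in range(K)]
        z = sum(sum(row) for row in table)
        pair_marg.append([[t / z for t in row] for row in table])
    return node_marg, pair_marg
```

Each pass visits N nodes and does K^2 work per node, which gives the O(N K^2) cost quoted above.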

Sum-product vs max-product • We use sum-product to compute marginal probabilities needed for learning • We use max-product to find the most probable assignment (Viterbi decoding) • We can also compute max-marginals
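A minimal sketch of max-product (Viterbi decoding) on a chain, under the same assumption of one shared K x K edge potential. The function name is illustrative. For long chains one would work with log potentials to avoid underflow; plain products are kept here for readability.

```python
def viterbi(local_ev, edge_pot):
    """Most probable joint labeling of a chain via max-product.

    local_ev[i][k] : local evidence for node i in state k
    edge_pot[j][k] : potential for (y_i = j, y_{i+1} = k), shared by all edges
    """
    N, K = len(local_ev), len(local_ev[0])
    delta = [list(local_ev[0])]  # delta[i][k]: best score of a prefix ending in k
    back = []                    # back[i-1][k]: best predecessor of state k at node i
    for i in range(1, N):
        row, ptr = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: delta[i - 1][j] * edge_pot[j][k])
            ptr.append(best_j)
            row.append(delta[i - 1][best_j] * edge_pot[best_j][k] * local_ev[i][k])
        delta.append(row)
        back.append(ptr)

    # Trace the back-pointers from the best final state.
    path = [max(range(K), key=lambda k: delta[-1][k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

The only change from sum-product is that the sum over the previous state becomes a max, with a back-pointer recorded so the argmax can be read off at the end.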

Complexity of exact inference In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).

Approximate sum-product

Approximate max-product

Learning intractable CRFs • We can use approximate inference and hope the gradient is “good enough”. • If we use max-product, we are doing “Viterbi training” (cf perceptron rule) • Or we can use other techniques, such as pseudo likelihood, which does not need inference.

Pseudo-likelihood
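A minimal sketch of the log pseudo-likelihood for a pairwise model: the joint likelihood is replaced by the product of each node's conditional given its neighbors, so each term needs only a local normalization over K states. It assumes a single symmetric edge potential shared by all edges; the function name and the `neighbors` adjacency-list argument are hypothetical.

```python
import math

def log_pseudo_likelihood(y, neighbors, local_ev, edge_pot):
    """Sum over nodes i of log P(y_i | labels of the neighbors of i).

    y[i]           : observed label of node i
    neighbors[i]   : list of node indices adjacent to i
    local_ev[i][k] : local evidence for node i in state k
    edge_pot[a][b] : symmetric potential shared by all edges
    """
    K = len(edge_pot)
    total = 0.0
    for i, yi in enumerate(y):
        # Unnormalized conditional of node i for each candidate state k.
        scores = []
        for k in range(K):
            s = local_ev[i][k]
            for j in neighbors[i]:
                s *= edge_pot[k][y[j]]
            scores.append(s)
        total += math.log(scores[yi] / sum(scores))
    return total
```

The cost is linear in the number of edges times K, with no global partition function involved.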

Software for inference and learning in 1D CRFs • Various packages • Mallet (McCallum et al) – Java • Crf.sourceforge.net (Sarawagi, Cohen) – Java • My code – matlab (just a toy, not integrated with BNT) • Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning). • Nothing standard, emphasis on NLP apps

Software for inference in general CRFs/MRFs • Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al • “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother • Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference) • Sum-product: various other ad hoc pieces • My matlab BP code (MRF2) • Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs) • Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)

Software for learning general MRFs/CRFs • Hardly any! • Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc) • My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)

Structure of ideal toolbox [Diagram labels: Generator/GUI/file, trainData, testData, train, learnEngine, infEngine, model, infer, queries, probDist, N-best list, decide, decisionEngine, decision, performance; utilities: visualize, summarize]

Structure of BNT [Diagram labels, same layout as the ideal toolbox, annotated with what BNT provides: Generator/GUI/file (LeRay, Shan); model as Graphs+CPDs; data as cell array; learnEngine: EM, StructuralEM; infEngine: BP, Jtree, VarElim, MCMC; queries as NodeIds; probDist as Array, Gaussian, samples; N-best list with N=1 (MAP); decisionEngine: LIMID (Jtree, VarElim), output policy; utilities: visualize, summarize]

Unsupervised learning: why? • Labeling data is time-consuming. • Often not clear what label to use. • Complex objects often not describable with a single discrete label. • Humans learn without labels. • Want to discover novel patterns/ structure.

Unsupervised learning: what? • Clusters (eg GMM) • Low dim manifolds (eg PCA) • Graph structure (eg biology, social networks) • “Features” (eg maxent models of language and texture) • “Objects” (eg sprite models in vision)

Unsupervised learning of objects from video Frey and Jojic; Williams and Titsias ; et al

Unsupervised learning: issues • Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression). • Local minima (non convex objective). • Uses inference as subroutine (can be slow – no worse than discriminative learning)

Unsupervised learning: how? • Construct a generative model (eg a Bayes net). • Perform inference. • May have to use approximations such as maximum likelihood and BP. • Cannot use max likelihood for model selection…

A comparison of BN software www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html

Popular BN software • BNT (matlab) • Intel’s PNL (C++) • Hugin (commercial) • Netica (commercial) • GMTk (free .exe from Jeff Bilmes)

Bayesian inference: why? • It is optimal. • It can easily incorporate prior knowledge (esp. useful for small n, large p problems). • It properly reports confidence in output (useful for combining estimates, and for risk-averse applications). • It separates models from algorithms.

Bayesian inference: how? • Since we want to integrate, we cannot use max-product. • Since the unknown parameters are continuous, we cannot use sum-product. • But we can use EP (expectation propagation), which is similar to BP. • We can also use variational inference. • Or MCMC (eg Gibbs sampling).

General purpose Bayesian software • BUGS (Gibbs sampling) • VIBES (variational message passing) • Minka and Winn’s toolbox (infer.net)

Structure of ideal Bayesian toolbox [Diagram labels: Generator/GUI/file, trainData, testData, train, learnEngine, infEngine, model, infer, queries, probDist, decide, decisionEngine, decision, performance; utilities: visualize, summarize]
