Graphical model software for machine learning Kevin Murphy University of British Columbia December, 2005
Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models
Supervised learning as Bayesian inference (figure: plate diagram with training pairs (X1,Y1), …, (Xn,Yn), …, (XN,YN) on a plate of size N, and a test pair (X*, Y*))
Supervised learning as optimization (figure: the same plate diagram of training pairs (Xn,Yn), n = 1..N, and a test pair (X*, Y*))
Example: logistic regression • Let yn ∈ {1,…,C} be given by a softmax • Maximize the conditional log likelihood • “Max margin” solution
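Below is a minimal numpy sketch of the softmax model and its conditional log likelihood gradient (the names softmax, W, X, y are illustrative, not taken from any package mentioned later); the gradient already has the "features minus expected features" form that reappears for CRFs.

import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability before exponentiating
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def cond_log_lik_and_grad(W, X, y, C):
    # W: (D, C) weights, X: (N, D) features, y: (N,) labels in {0,...,C-1}
    P = softmax(X @ W)                      # (N, C) class probabilities
    N = X.shape[0]
    ll = np.log(P[np.arange(N), y]).sum()   # conditional log likelihood
    Y = np.zeros((N, C)); Y[np.arange(N), y] = 1
    grad = X.T @ (Y - P)                    # features minus expected features
    return ll, grad

A simple trainer would then repeat W += step_size * grad until convergence.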
Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models
1D chain CRFs for sequence labeling A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, yn ∈ {1,…,C}^m: P(yn | xn) = (1/Z(xn)) ∏_i φi(yn,i, xn) ∏_i ψi,i+1(yn,i, yn,i+1), where the φi are the local evidence terms and the ψi,i+1 are the edge potentials. (figure: chain Yn1 – Yn2 – … – Ynm, all connected to the observation Xn)
2D Lattice CRFs for pixel labeling A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψij are image dependent.
2D Lattice MRFs for pixel labeling A Markov random field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using edge potentials ψij(yi, yj), and we also have a per-pixel generative model of observations P(xi | yi), giving the joint P(y, x) = (1/Z) ∏_ij ψij(yi, yj) ∏_i P(xi | yi), where Z is the partition function and the P(xi | yi) terms are the local evidence.
Tree-structured CRFs • Used in parts-based object detection • Yi is location of part i in image (figure: face model with parts nose, eyeL, eyeR, mouth) Fischler & Elschlager, “The representation and matching of pictorial structures”, PAMI’73 Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition,” IJCV’05
General CRFs • In general, the graph may have arbitrary structure • eg for collective web page classification, nodes = URLs, edges = hyperlinks • The potentials are in general defined on cliques, not just edges
Factor graphs Square nodes = factors (potentials) Round nodes = random variables Graph structure = bipartite
Potential functions • For the local evidence, we can use a discriminative classifier (trained iid) • For the edge compatibilities, we can use a maxent / log-linear form, using pre-defined features (see the sketch below)
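A minimal sketch of such a maxent / log-linear edge potential, assuming a user-supplied feature function feats (the names here are illustrative, not from any of the packages below):

import numpy as np

def loglinear_edge_potential(theta, x, i, j, K, feats):
    # psi_ij(yi, yj) = exp( sum_k theta_k * f_k(yi, yj, x, i, j) )
    # feats(yi, yj, x, i, j) returns the pre-defined feature vector for this edge
    psi = np.zeros((K, K))
    for yi in range(K):
        for yj in range(K):
            psi[yi, yj] = np.exp(theta @ feats(yi, yj, x, i, j))
    return psi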
Restricted potential functions • For some applications (esp in vision), we often use a Potts model of the form ψij(yi, yj) = 1 if yi = yj, exp(−λ) otherwise • We can generalize this for ordered labels (eg discretization of continuous states)
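In code, the Potts potential, and one common ordered-label generalization (a truncated linear penalty), might look like the following sketch; lam plays the role of λ, and the truncated-linear form is one possible generalization, not necessarily the one in the original talk:

import numpy as np

def potts_potential(K, lam):
    # psi(yi, yj) = 1 if yi == yj, exp(-lam) otherwise
    psi = np.full((K, K), np.exp(-lam))
    np.fill_diagonal(psi, 1.0)
    return psi

def truncated_linear_potential(K, lam, d_max):
    # for ordered labels: the penalty grows with |yi - yj|, capped at d_max
    d = np.abs(np.arange(K)[:, None] - np.arange(K)[None, :])
    return np.exp(-lam * np.minimum(d, d_max))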
Learning CRFs • If the log likelihood is ℓ(θ) = Σ_n [ Σ_c θ · f_c(xn, yn) − log Z(xn, θ) ], with the parameters θ tied across cliques c, • then the gradient is ∇ℓ(θ) = Σ_n Σ_c [ f_c(xn, yn) − E_{P(y | xn, θ)} f_c(xn, y) ] • Gradient = features – expected features
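For a 1D chain with tied parameters, the "features minus expected features" gradient can be assembled from node and pairwise marginals as in this sketch (the marginals would come from an inference routine such as the forwards-backwards code below; all names are illustrative):

import numpy as np

def chain_crf_gradient(x_feats, y, node_marg, edge_marg):
    # "gradient = features - expected features" for a 1D chain CRF with tied parameters
    # x_feats:   (T, D) observation features, one row per sequence position
    # y:         (T,)   observed label sequence, values in {0, ..., K-1}
    # node_marg: (T, K)     P(y_t = k | x), e.g. from forwards-backwards
    # edge_marg: (T-1, K, K) P(y_t = j, y_{t+1} = k | x)
    T, D = x_feats.shape
    K = node_marg.shape[1]
    g_node = np.zeros((D, K))   # gradient w.r.t. node weights (one column per label)
    g_edge = np.zeros((K, K))   # gradient w.r.t. the tied edge potential parameters
    for t in range(T):
        g_node[:, y[t]] += x_feats[t]                  # empirical node features
        g_node -= np.outer(x_feats[t], node_marg[t])   # expected node features
    for t in range(T - 1):
        g_edge[y[t], y[t + 1]] += 1.0                  # empirical edge (transition) counts
        g_edge -= edge_marg[t]                         # expected edge counts
    return g_node, g_edge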
Learning CRFs • Given the gradient ∇ℓ, one can find the global optimum using first or second order optimization methods, such as • Conjugate gradient • Limited memory BFGS • Stochastic meta descent (SMD)? • The bottleneck is computing the expected features needed for the gradient
Exact inference • For 1D chains, one can compute P(yi, yi+1 | x) exactly in O(N K^2) time using belief propagation (BP = forwards-backwards algorithm) • For restricted potentials (eg potentials that depend only on yi − yj, such as the Potts model), one can do this in O(N K) time using FFT-like tricks • This can be generalized to trees.
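A minimal numpy sketch of forwards-backwards (sum-product) on a chain, computing node and pairwise marginals in O(N K^2); it assumes a single tied edge potential and normalizes messages to avoid underflow:

import numpy as np

def forwards_backwards(local_evidence, edge_potential):
    # local_evidence: (T, K) phi_t(y_t), edge_potential: (K, K) psi(y_t, y_{t+1})
    # returns node marginals P(y_t | x) and pairwise marginals P(y_t, y_{t+1} | x)
    T, K = local_evidence.shape
    alpha = np.zeros((T, K)); beta = np.zeros((T, K))
    alpha[0] = local_evidence[0]
    alpha[0] /= alpha[0].sum()                       # normalize to avoid underflow
    for t in range(1, T):
        alpha[t] = local_evidence[t] * (alpha[t - 1] @ edge_potential)
        alpha[t] /= alpha[t].sum()
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = edge_potential @ (local_evidence[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    node_marg = alpha * beta
    node_marg /= node_marg.sum(axis=1, keepdims=True)
    edge_marg = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        m = np.outer(alpha[t], local_evidence[t + 1] * beta[t + 1]) * edge_potential
        edge_marg[t] = m / m.sum()
    return node_marg, edge_marg

The node and pairwise marginals returned here are exactly the expectations needed by the learning slide above.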
Sum-product vs max-product • We use sum-product to compute marginal probabilities needed for learning • We use max-product to find the most probable assignment (Viterbi decoding) • We can also compute max-marginals
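The max-product counterpart, Viterbi decoding, can be sketched the same way (log space for numerical stability):

import numpy as np

def viterbi(local_evidence, edge_potential):
    # max-product (Viterbi) decoding on a 1D chain
    log_phi = np.log(local_evidence)
    log_psi = np.log(edge_potential)
    T, K = log_phi.shape
    delta = np.zeros((T, K)); backptr = np.zeros((T, K), dtype=int)
    delta[0] = log_phi[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_psi     # (K, K): previous state -> current state
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_phi[t]
    y = np.zeros(T, dtype=int)
    y[T - 1] = delta[T - 1].argmax()
    for t in range(T - 2, -1, -1):                   # trace back the most probable path
        y[t] = backptr[t + 1, y[t + 1]]
    return y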
Complexity of exact inference In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).
Learning intractable CRFs • We can use approximate inference and hope the gradient is “good enough”. • If we use max-product, we are doing “Viterbi training” (cf perceptron rule) • Or we can use other techniques, such as pseudo likelihood, which does not need inference.
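To illustrate why pseudo likelihood needs no inference, here is a sketch of the objective for a pairwise model with a single tied, symmetric edge potential: each term conditions on the observed neighbours, so only a local sum over the K states of one node is required (the names and the tied-potential assumption are for illustration only):

import numpy as np

def pseudo_log_likelihood(y, local_evidence, edge_potential, neighbors):
    # sum_i log P(y_i | y_{N(i)}, x): each term needs only a local normalization
    # over the K states of node i, so no global inference is required
    ll = 0.0
    K = local_evidence.shape[1]
    for i, nbrs in enumerate(neighbors):
        scores = np.array([local_evidence[i, k] *
                           np.prod([edge_potential[k, y[j]] for j in nbrs])
                           for k in range(K)])
        ll += np.log(scores[y[i]] / scores.sum())
    return ll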
Software for inference and learning in 1D CRFs • Various packages • Mallet (McCallum et al) – Java • Crf.sourceforge.net (Sarawagi, Cohen) – Java • My code – matlab (just a toy, not integrated with BNT) • Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning). • Nothing standard, emphasis on NLP apps
Software for inference in general CRFs/MRFs • Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al • “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother • Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference) • Sum-product: various other ad hoc pieces • My matlab BP code (MRF2) • Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs) • Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)
Software for learning general MRFs/CRFs • Hardly any! • Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc) • My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)
Structure of ideal toolbox (block diagram: a Generator/GUI/file supplies trainData and testData; a learnEngine trains a model; an infEngine answers queries on test data, producing a probDist or N-best list; a decisionEngine turns these into decisions; utilities visualize and summarize performance)
Structure of BNT (the same block diagram instantiated with BNT components: models are graphs + CPDs, data are cell arrays, queries are node ids; learning engines: EM, StructuralEM, MCMC, structure learning (LeRay, Shan); inference engines: Jtree, VarElim, BP; distributions: arrays, Gaussians, samples, N=1 (MAP); decisions via LIMIDs producing a policy)
Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models
Unsupervised learning: why? • Labeling data is time-consuming. • Often not clear what label to use. • Complex objects often not describable with a single discrete label. • Humans learn without labels. • Want to discover novel patterns/ structure.
Unsupervised learning: what? • Clusters (eg GMM) • Low dim manifolds (eg PCA) • Graph structure (eg biology, social networks) • “Features” (eg maxent models of language and texture) • “Objects” (eg sprite models in vision)
Unsupervised learning of objects from video Frey and Jojic; Williams and Titsias; et al
Unsupervised learning: issues • Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression). • Local minima (non convex objective). • Uses inference as subroutine (can be slow – no worse than discriminative learning)
Unsupervised learning: how? • Construct a generative model (eg a Bayes net). • Perform inference. • May have to use approximations such as maximum likelihood and BP. • Cannot use max likelihood for model selection…
A comparison of BN software www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html
Popular BN software • BNT (matlab) • Intel’s PNL (C++) • Hugin (commercial) • Netica (commercial) • GMTk (free .exe from Jeff Bilmes)
Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models
Bayesian inference: why? • It is optimal. • It can easily incorporate prior knowledge (esp. useful for small n, large p problems). • It properly reports confidence in output (useful for combining estimates, and for risk-averse applications). • It separates models from algorithms.
Bayesian inference: how? • Since we want to integrate, we cannot use max-product. • Since the unknown parameters are continuous, we cannot use sum-product. • But we can use EP (expectation propagation), which is similar to BP. • We can also use variational inference. • Or MCMC (eg Gibbs sampling).
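As a toy illustration of the MCMC route, here is a Gibbs sampler for the mean and precision of a 1D Gaussian under semi-conjugate priors; it is meant only to show the flavour of sampling-based Bayesian inference, not the graphical-model machinery of BUGS or VIBES:

import numpy as np

def gibbs_normal(x, n_samples=1000, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, seed=0):
    # Gibbs sampling for the mean mu and precision tau of a 1D Gaussian,
    # with semi-conjugate priors mu ~ N(mu0, 1/kappa0) and tau ~ Gamma(a0, rate=b0)
    rng = np.random.default_rng(seed)
    n, xbar = len(x), np.mean(x)
    mu, tau = xbar, 1.0
    samples = []
    for _ in range(n_samples):
        # mu | tau, x  ~  Normal with precision kappa0 + n*tau
        prec = kappa0 + n * tau
        mean = (kappa0 * mu0 + n * tau * xbar) / prec
        mu = rng.normal(mean, 1.0 / np.sqrt(prec))
        # tau | mu, x  ~  Gamma(a0 + n/2, rate = b0 + 0.5 * sum((x - mu)^2))
        tau = rng.gamma(a0 + 0.5 * n, 1.0 / (b0 + 0.5 * np.sum((x - mu) ** 2)))
        samples.append((mu, tau))
    return np.array(samples)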
General purposeBayesian software • BUGS (Gibbs sampling) • VIBES (variational message passing) • Minka and Winn’s toolbox (infer.net)
Structure of ideal Bayesian toolbox (block diagram as for the ideal toolbox above: Generator/GUI/file, trainData/testData, learnEngine, model, infEngine, queries, probDist, decisionEngine, decision, performance, visualize/summarize utilities)