1 / 42

Graphical Model Software for Machine Learning - Kevin Murphy, University of British Columbia, December 2005

This outline covers discriminative models for iid data, conditional random fields for non-iid data, generative models, and Bayesian models in machine learning. Various types of graphical models such as 1D and 2D CRFs, MRFs, tree-structured CRFs, and general CRFs are discussed. The text also delves into factor graphs, learning CRFs, exact and approximate inference methods, complexity, learning intractable CRFs, and software options for inference and learning in CRFs and MRFs.

whistlerj
Download Presentation

Graphical Model Software for Machine Learning - Kevin Murphy, University of British Columbia, December 2005

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Graphical modelsoftware for machine learning Kevin Murphy University of British Columbia December, 2005

  2. Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models

  3. Supervised learning as Bayesian inference Training Testing   Y1 Yn YN Y* Y* X1 Xn XN X* X* N

  4. Supervised learning as optimization Training Testing   Y1 Yn YN Y* Y* X1 Xn XN X* X* N

  5. Example: logistic regression • Let yn2 {1,…,C} be given by a softmax • Maximize conditional log likelihood • “Max margin” solution

  6. Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models

  7. 1D chain CRFs for sequence labeling A 1D conditional random field (CRF) is an extension of logistic regressionto the case where the output labels are sequences, yn2 {1,…,C}m Edge potential Local evidence  ij Yn1 Yn2 Ynm i Xn

  8. 2D Lattice CRFs for pixel labeling A conditional randomfield (CRF) is a discriminative modelof P(y|x). The edge potentialsij are image dependent.

  9. 2D Lattice MRFs for pixel labeling Local evidence Potential function Partition function A Markov Random Field (MRF) is an undirectedgraphical model. Here we model correlation between pixel labels using ij(yi,yj). We also have a per-pixelgenerative model of observations P(xi|yi)

  10. Tree-structured CRFs • Used in parts-based object detection • Yi is location of part i in image nose eyeR eyeL mouth Fischler & Elschlager, "The representation and matching of pictorial structures”, PAMI’73 Felzenszwalb & Huttenlocher, "Pictorial Structures for Object Recognition," IJCV’05

  11. General CRFs • In general, the graph may have arbitrary structure • eg for collective web page classification,nodes=urls, edges=hyperlinks • The potentials are in general defined on cliques, not just edges

  12. Factor graphs Square nodes = factors (potentials) Round nodes = random variables Graph structure = bipartite

  13. Potential functions • For the local evidence, we can use a discriminative classifier (trained iid) • For the edge compatibilities, we can use a maxent/ loglinear form, using pre-defined features

  14. l Restricted potential functions • For some applications (esp in vision), we often use a Potts model of the form • We can generalize this for ordered labels (eg discretization of continuous states)

  15. Learning CRFs • If the log likelihood is • then the gradient is Tied params cliques Gradient = features – expected features

  16. Learning CRFs • Given the gradient rd, one can find the global optimum using first or second order optimization methods, such as • Conjugate gradient • Limited memory BFGS • Stochastic meta descent (SMD)? • The bottleneck is computing the expected features needed for the gradient

  17. Exact inference • For 1D chains, one can compute P(yi,i+1|x) exactly in O(N K2) time using belief propagation (BP = forwards backwards algorithm) • For restricted potentials (eg ij=( l)), one can do this in O(NK) time using FFT-like tricks • This can be generalized to trees.

  18. Sum-product vs max-product • We use sum-product to compute marginal probabilities needed for learning • We use max-product to find the most probable assignment (Viterbi decoding) • We can also compute max-marginals

  19. Complexity of exact inference In general, the running time is (N Kw), where w is the treewidthof the graph; this is the size of the maximal clique of the triangulatedgraph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n £ n lattices, w = O(n).

  20. Approximate sum-product

  21. Approximate max-product

  22. Learning intractable CRFs • We can use approximate inference and hope the gradient is “good enough”. • If we use max-product, we are doing “Viterbi training” (cf perceptron rule) • Or we can use other techniques, such as pseudo likelihood, which does not need inference.

  23. Pseudo-likelihood

  24. Software for inference and learning in 1D CRFs • Various packages • Mallet (McCallum et al) – Java • Crf.sourceforge.net (Sarawagi, Cohen) – Java • My code – matlab (just a toy, not integrated with BNT) • Ben Taskar says he will soon release his Max Margin Markov net code (which uses LP for inference and QP for learning). • Nothing standard, emphasis on NLP apps

  25. Software for inference in general CRFs/ MRFs • Max-product : C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al • “A comparative study of energy minimization methods for MRFs”, Rick Szeliksi, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marsall Tappen, Carsten Rother • Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference) • Sum-product: various other ad hoc pieces • My matlab BP code (MRF2) • Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs) • Metlzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)

  26. Software for learning general MRFs/CRFs • Hardly any! • Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc) • My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)

  27. learnEngine trainData infEngine infEngine queries model model Nbest list decide probDist Structure of ideal toolbox Generator/GUI/file train testData infer decisionEngine performance decision visualize summarize utilities

  28. learnEngine trainData infEngine infEngine queries model model Nbest list decide probDist Structure of BNT LeRay Shan Generator/GUI/file Graphs+CPDs Cell array BPJtree MCMC EM StructuralEM train testData NodeIds VarElim Graphs+CPDs Cell array infer JtreeVarElim decisionEngine policy Array, Gaussian, samples N=1 (MAP) visualize summarize LIMID

  29. Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models

  30. Unsupervised learning: why? • Labeling data is time-consuming. • Often not clear what label to use. • Complex objects often not describable with a single discrete label. • Humans learn without labels. • Want to discover novel patterns/ structure.

  31. Unsupervised learning: what? • Clusters (eg GMM) • Low dim manifolds (eg PCA) • Graph structure (eg biology, social networks) • “Features” (eg maxent models of language and texture) • “Objects” (eg sprite models in vision)

  32. Unsupervised learning of objects from video Frey and Jojic; Williams and Titsias ; et al

  33. Unsupervised learning: issues • Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression). • Local minima (non convex objective). • Uses inference as subroutine (can be slow – no worse than discriminative learning)

  34. Unsupervised learning: how? • Construct a generative model (eg a Bayes net). • Perform inference. • May have to use approximations such as maximum likelihood and BP. • Cannot use max likelihood for model selection…

  35. A comparison of BN software www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html

  36. Popular BN software • BNT (matlab) • Intel’s PNL (C++) • Hugin (commercial) • Netica (commercial) • GMTk (free .exe from Jeff Bilmes)

  37. Outline • Discriminative models for iid data • Beyond iid data: conditional random fields • Beyond supervised learning: generative models • Beyond optimization: Bayesian models

  38. Bayesian inference: why? • It is optimal. • It can easily incorporate prior knowledge (esp. useful for small n, large p problems). • It properly reports confidence in output (useful for combining estimates, and for risk-averse applications). • It separates models from algorithms.

  39. Bayesian inference: how? • Since we want to integrate, we cannot use max-product. • Since the unknown parameters are continuous, we cannot use sum-product. • But we can use EP (expectation propagation), which is similar to BP. • We can also use variational inference. • Or MCMC (eg Gibbs sampling).

  40. General purposeBayesian software • BUGS (Gibbs sampling) • VIBES (variational message passing) • Minka and Winn’s toolbox (infer.net)

  41. learnEngine trainData infEngine infEngine queries model model decide probDist Structure of ideal Bayesian toolbox Generator/ GUI/ file train testData infer decisionEngine performance decision visualize summarize utilities

More Related