
Crash Course on Machine Learning Part V


Presentation Transcript


  1. Crash Course on Machine Learning Part V • Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause

  2. Structured Prediction • Use local information • Exploit correlations [Figure: the handwritten letters b r a c e, a word-recognition example]

  3. Min-max Formulation LP duality

  4. Before QP duality: exponentially many constraints/variables

  5. After: by QP duality, the dual inherits structure from the problem-specific inference LP • Variables correspond to a decomposition of the variables of the flat case

  6. The Connection [Figure: candidate letter sequences such as "b r a c e", "b r o c e", "b r o r e", and "b c a r e" with associated scores, linking the flat-case variables to their factored decomposition]

  7. Duals and Kernels • Kernel trick works: • Factored dual • Local functions (log-potentials) can use kernels

  8. 3D Mapping • Data provided by: Michael Montemerlo & Sebastian Thrun • Laser Range Finder, GPS, IMU • Labels: ground, building, tree, shrub • Training: 30 thousand points • Testing: 3 million points

  9. Alternatives: Perceptron • Simple iterative method • Unstable for structured output: fewer instances, big updates • May not converge if non-separable • Noisy • Voted / averaged perceptron [Freund & Schapire 99, Collins 02] • Regularize / reduce variance by aggregating over iterations
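The averaged perceptron mentioned above can be sketched in a few lines; the feature map phi(x, y) and the argmax decoder are assumed to be supplied by the user (hypothetical placeholders, not part of the slides):

```python
import numpy as np

def averaged_structured_perceptron(data, phi, decode, n_feats, epochs=5):
    """Averaged structured perceptron, sketched.

    data   : list of (x, y_true) pairs; y values are comparable (e.g. label tuples)
    phi    : phi(x, y) -> feature vector of length n_feats   (assumed given)
    decode : decode(x, w) -> argmax_y  w . phi(x, y)          (assumed given)
    """
    w = np.zeros(n_feats)        # current weight vector
    w_sum = np.zeros(n_feats)    # running sum of weight vectors, for averaging
    t = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = decode(x, w)                      # predict with current weights
            if y_hat != y_true:                       # mistake-driven update
                w += phi(x, y_true) - phi(x, y_hat)
            w_sum += w                                # accumulate after every example
            t += 1
    return w_sum / t             # averaging over iterations reduces variance
```

Returning the running average rather than the final weight vector is what makes the voted/averaged variant more stable than the plain perceptron, matching the "aggregating over iterations" point above.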

  10. Alternatives: Constraint Generation • Add most violated constraint • Handles several more general loss functions • Need to re-solve QP many times • Theorem: Only polynomial # of constraints needed to achieve ε-error [Tsochantaridis et al, 04] • Worst case # of constraints larger than factored [Collins 02; Altun et al, 03]
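A structural sketch of the constraint-generation loop, with the QP solver, task loss, and loss-augmented decoder left as user-supplied callables (all hypothetical placeholders; the slides only describe the idea):

```python
import numpy as np

def constraint_generation(train, phi, loss, loss_aug_decode, solve_qp,
                          epsilon=1e-3, max_iters=50):
    """Sketch of constraint generation (cutting plane) for structured margin methods.

    phi(x, y)                     -> feature vector                       (assumed given)
    loss(y_true, y)               -> task loss, e.g. Hamming loss         (assumed given)
    loss_aug_decode(x, y_true, w) -> most violated output under current w (assumed given)
    solve_qp(constraints)         -> (w, xi) solving the QP restricted to
                                     the current working set              (assumed given)
    """
    constraints = []                      # working set, grown one constraint at a time
    w = np.zeros(phi(*train[0]).shape)    # start from the zero weight vector
    xi = {i: 0.0 for i in range(len(train))}
    for _ in range(max_iters):
        added = 0
        for i, (x, y_true) in enumerate(train):
            y_hat = loss_aug_decode(x, y_true, w)            # most violated constraint
            margin = w @ (phi(x, y_true) - phi(x, y_hat))
            if loss(y_true, y_hat) - margin > xi[i] + epsilon:
                constraints.append((i, x, y_true, y_hat))
                added += 1
        if added == 0:                    # every constraint satisfied up to epsilon
            break
        w, xi = solve_qp(constraints)     # re-solve the QP over the working set
    return w
```

The expensive step is the repeated QP re-solve over the growing working set, which is the trade-off the slide points out.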

  11. Integration • Feature Passing • Margin Based • Max margin Structure Learning • Probabilistic • Graphical Models

  12. Graphical Models • Joint distribution • Factoring using independence between variables • Representation • Inference • Learning

  13. Big Picture • Two problems with using full joint distribution tables as our probabilistic models: • Unless there are only a few variables, the joint is WAY too big to represent explicitly • Hard to learn (estimate) anything empirically about more than a few variables at a time • Describe complex joint distributions (models) using simple, local distributions • We describe how variables locally interact • Local interactions chain together to give global, indirect interactions

  14. Joint Distribution • For n variables with domain size d • the full joint distribution table has d^n - 1 free parameters • Size of the representation if we use the chain rule: counting free parameters, and using the fact that each distribution sums to one: (d-1) + d(d-1) + d^2(d-1) + ... + d^(n-1)(d-1) = ((d^n - 1)/(d-1)) (d-1) = d^n - 1
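A quick numerical check of the free-parameter identity above (a minimal sketch; the particular values of d and n are arbitrary):

```python
# Summing the chain-rule parameter counts (d-1)*d**k for k = 0..n-1
# should equal d**n - 1, the count for the flat joint table.
for d in (2, 3, 4):
    for n in (1, 2, 3, 5):
        chain_rule_count = sum((d - 1) * d**k for k in range(n))
        assert chain_rule_count == d**n - 1
print("chain-rule free-parameter count matches d^n - 1")
```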

  15. Conditional Independence • Two variables X and Y are conditionally independent given Z if P(X, Y | Z) = P(X | Z) P(Y | Z) • What about this domain? • Traffic • Umbrella • Raining
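A small numerical illustration for the Traffic/Umbrella/Rain domain, assuming (as a modeling choice, not something stated on the slide) that umbrella use and traffic each depend only on rain; all probabilities are made-up placeholders:

```python
from itertools import product

# Toy model: umbrella and traffic each depend only on rain.
P_rain = {True: 0.3, False: 0.7}
P_umbrella_given_rain = {True: 0.9, False: 0.1}   # P(umbrella | rain)
P_traffic_given_rain = {True: 0.8, False: 0.2}    # P(traffic  | rain)

def joint(rain, umbrella, traffic):
    pu = P_umbrella_given_rain[rain] if umbrella else 1 - P_umbrella_given_rain[rain]
    pt = P_traffic_given_rain[rain] if traffic else 1 - P_traffic_given_rain[rain]
    return P_rain[rain] * pu * pt

# Check P(U, T | R) = P(U | R) P(T | R) for every assignment.
for r, u, t in product([True, False], repeat=3):
    p_r = sum(joint(r, u2, t2) for u2, t2 in product([True, False], repeat=2))
    p_ut_r = joint(r, u, t) / p_r
    p_u_r = sum(joint(r, u, t2) for t2 in [True, False]) / p_r
    p_t_r = sum(joint(r, u2, t) for u2 in [True, False]) / p_r
    assert abs(p_ut_r - p_u_r * p_t_r) < 1e-12
print("Traffic and Umbrella are conditionally independent given Rain")
```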

  16. Representation • Explicitly model uncertainty and dependency structure • Directed, Undirected, Factor graph [Figure: the same variables a, b, c, d drawn as a directed graph, an undirected graph, and a factor graph] • Key concept: Markov blanket

  17. Bayes Net: Notation [Figure: example net with nodes Weather, Cavity, Toothache, Catch] • Nodes: variables • Can be assigned (observed) or unassigned (unobserved) • Arcs: interactions • Indicate “direct influence” between variables • Formally: encode conditional independence

  18. Example: Coin Flips • N independent coin flips X1, X2, ..., Xn • No interactions between variables • Absolute independence

  19. Example: Traffic • Variables: • Traffic • Rain • Model 1: absolute independence • Model 2: rain causes traffic [Figure: Rain → Traffic] • Which makes more sense?

  20. Semantics [Figure: node X with parents A1, A2, ..., An] • A set of nodes, one per variable X • A directed, acyclic graph • A conditional distribution for each node • A collection of distributions over X, one for each combination of the parents’ values • Conditional Probability Table (CPT) • A Bayes net = Topology (graph) + Local Conditional Probabilities

  21. Example: Alarm [Figure: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call] • Variables: • Alarm • Burglary • Earthquake • Radio • Call (John calls)

  22. Example: Alarm • Local CPTs: P(E), P(B), P(R|E), P(A|E,B), P(C|A) • Joint: P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
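To make the factorization concrete, here is a minimal sketch that evaluates the joint probability of one assignment by multiplying the local CPTs; all the CPT numbers are made-up placeholders, since the slide only specifies the structure:

```python
# P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A), with illustrative CPTs.
P_E = {True: 0.002, False: 0.998}
P_B = {True: 0.001, False: 0.999}
P_R_given_E = {True: {True: 0.9, False: 0.1},        # P(R=r | E=e), keyed [e][r]
               False: {True: 0.01, False: 0.99}}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,   # P(A=True | B=b, E=e)
                (False, True): 0.29, (False, False): 0.001}
P_C_given_A = {True: 0.9, False: 0.05}                      # P(C=True | A=a)

def joint(e, b, r, a, c):
    pa = P_A_given_BE[(b, e)] if a else 1 - P_A_given_BE[(b, e)]
    pc = P_C_given_A[a] if c else 1 - P_C_given_A[a]
    return P_E[e] * P_B[b] * P_R_given_E[e][r] * pa * pc

# e.g. no earthquake, a burglary, no radio report, alarm rings, John calls
print(joint(e=False, b=True, r=False, a=True, c=True))
```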

  23. Bayes Net Size • How big is a joint distribution over n Boolean variables? 2^n entries • How big is a CPT with k (Boolean) parents? 2^(k+1) entries • How big is a BN with n nodes if nodes have up to k parents? n · 2^(k+1) entries • For example, n = 30 and k = 3 gives 2^30 ≈ 10^9 joint entries versus 30 · 2^4 = 480 CPT entries • BNs: • Compact representation • Use local properties to define CPTs • Answer queries more easily

  24. Independence in BN • BNs present a compact representation for joint distributions • Take advantage of conditional independence • Given a BN, let’s answer independence questions: • Are two nodes independent given certain evidence? • What can we say about X, Z? (Example: Low pressure, Rain, Traffic) [Figure: X → Y → Z]

  25. Causal Chains [Figure: X → Y → Z] • Question: Is Z independent of X given Y? • X: low pressure • Y: Rain • Z: Traffic

  26. Common Cause [Figure: X ← Y → Z] • Are X, Z independent? • Y: low pressure • X: Rain • Z: Cold • Are X, Z independent given Y? • Observing Y blocks the influence between X, Z

  27. Common Effect [Figure: X → Y ← Z] • Are X, Z independent? • X: Rain • Y: Traffic • Z: Ball Game • Are X, Z independent given Y? • Observing Y activates influence between X, Z

  28. Independence in BNs [Figure: the Alarm net with nodes Burglary, Earthquake, Radio, Alarm, Call] • Any complex BN structure can be analyzed using these three cases

  29. Directed acyclic graph (Bayes net) [Figure: tree-shaped net a → b, b → c, b → d] • Can model causality • Parameter learning • Decomposes: learn each term separately (ML) • Inference • Simple exact inference if tree-shaped (belief propagation) • P(a,b,c,d) = P(c|b) P(d|b) P(b|a) P(a)

  30. Directed acyclic graph (Bayes net) [Figure: net a → b, b → c, a → d, b → d, which contains a loop in its undirected skeleton] • Can model causality • Parameter learning • Decomposes: learn each term separately (ML) • Inference • Simple exact inference if tree-shaped (belief propagation) • Loops require approximation • Loopy BP • Tree-reweighted BP • Sampling • P(a,b,c,d) = P(c|b) P(d|a,b) P(b|a) P(a)

  31. Directed graph • Example: places and scenes • Place: office, kitchen, street, etc. • Objects present: fire hydrant, car, person, toaster, microwave • P(place, car, person, toaster, micro, hydrant) = P(place) P(car | place) P(person | place) … P(hydrant | place)
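Since the factorization P(place) Π P(object | place) has a naive-Bayes shape, inferring the place from observed objects is a small computation; the probabilities below are illustrative placeholders, not values from the slides:

```python
# Posterior over the place given which objects were detected, under
# P(place, objects) = P(place) * prod_i P(object_i | place).
P_place = {"office": 0.4, "kitchen": 0.3, "street": 0.3}
P_obj_given_place = {                         # P(object present | place)
    "toaster": {"office": 0.02, "kitchen": 0.60, "street": 0.01},
    "car":     {"office": 0.01, "kitchen": 0.01, "street": 0.70},
    "person":  {"office": 0.80, "kitchen": 0.50, "street": 0.60},
}

def posterior(observed):                      # observed: dict object -> True/False
    scores = {}
    for place, prior in P_place.items():
        p = prior
        for obj, present in observed.items():
            q = P_obj_given_place[obj][place]
            p *= q if present else 1 - q
        scores[place] = p
    z = sum(scores.values())                  # normalize to get P(place | objects)
    return {place: p / z for place, p in scores.items()}

print(posterior({"toaster": True, "car": False, "person": True}))
```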

  32. Undirected graph (Markov Networks) [Figure: pairwise-connected nodes x1, x2, x3, x4] • Does not model causality • Often pairwise • Parameter learning difficult • Inference usually approximate

  33. Markov Networks • Example: “label smoothing” grid with binary nodes • Pairwise potential (0 when neighbouring labels agree, K when they disagree):
         0    1
     0   0    K
     1   K    0

  34. Image De-Noising [Figure: original image and noisy image]

  35. Image De-Noising

  36. Image De-Noising [Figure: noisy image and restored image (ICM)]
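A minimal ICM (iterated conditional modes) sketch for binary image de-noising, using a pairwise smoothing potential in the spirit of the “label smoothing” slide; the weights beta and eta and the toy image are illustrative choices:

```python
import numpy as np

def icm_denoise(noisy, beta=2.0, eta=1.0, sweeps=5):
    """ICM for binary (+1/-1) image denoising, sketched.

    Per-pixel energy: -eta * x_i * y_i - beta * sum over neighbours of x_i * x_j
    (data term plus pairwise smoothing term; beta and eta are illustrative).
    """
    x = noisy.copy()
    h, w = x.shape
    for _ in range(sweeps):
        for i in range(h):
            for j in range(w):
                # local field from the data term and the 4-neighbourhood
                field = eta * noisy[i, j]
                if i > 0:     field += beta * x[i - 1, j]
                if i < h - 1: field += beta * x[i + 1, j]
                if j > 0:     field += beta * x[i, j - 1]
                if j < w - 1: field += beta * x[i, j + 1]
                x[i, j] = 1 if field >= 0 else -1   # greedy local energy minimum
    return x

# toy usage: a clean half-and-half image with 10% of pixels flipped
clean = np.ones((20, 20), dtype=int); clean[:, :10] = -1
rng = np.random.default_rng(0)
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)
restored = icm_denoise(noisy)
print("pixels wrong before:", (noisy != clean).sum(), "after:", (restored != clean).sum())
```

ICM only finds a local minimum of the energy, which is why the slides list it alongside approximate inference methods rather than exact ones.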

  37. Factor graphs • A general representation [Figure: a Bayes net over a, b, c, d and the equivalent factor graph]

  38. Factor graphs • A general representation [Figure: a Markov net over a, b, c, d and the equivalent factor graph]

  39. Factor graphs Write as a factor graph

  40. Inference in Graphical Models • Joint • Marginal • Max • Exact inference is HARD

  41. Approximate Inference

  42. Approximation

  43. Sampling a Multinomial Distribution
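One standard way to draw a sample from a multinomial (discrete) distribution is to invert the cumulative distribution; a minimal sketch:

```python
import random

def sample_multinomial(probs):
    """Draw one index from a discrete distribution by inverting the CDF.
    probs: list of nonnegative numbers summing to (approximately) 1."""
    u = random.random()            # uniform draw in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1          # guard against floating-point round-off

# usage: empirical frequencies should approach the target distribution
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_multinomial([0.2, 0.5, 0.3])] += 1
print([c / 10000 for c in counts])
```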

  44. Sampling from a BN • Compute Marginals • Compute Conditionals
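Ancestral (forward) sampling illustrates both bullets: sample parents before children, then estimate marginals and conditionals by counting. The tiny Rain → Traffic net and its numbers are illustrative placeholders:

```python
import random

# Forward sampling from the Rain -> Traffic net of the earlier Traffic slide.
P_rain = 0.3
P_traffic_given_rain = {True: 0.8, False: 0.2}

def sample_once():
    rain = random.random() < P_rain                       # sample the parent first
    traffic = random.random() < P_traffic_given_rain[rain]  # then the child given it
    return rain, traffic

samples = [sample_once() for _ in range(100000)]

# marginal P(traffic): count how often traffic occurs
p_traffic = sum(t for _, t in samples) / len(samples)

# conditional P(rain | traffic): keep only samples consistent with the evidence
consistent = [r for r, t in samples if t]
p_rain_given_traffic = sum(consistent) / len(consistent)

print(p_traffic, p_rain_given_traffic)
```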

  45. Belief Propagation • Very general • Approximate, except for tree-shaped graphs • Generalized variants of BP can have better convergence for graphs with many loops or strong potentials • Standard packages available (BNT toolbox) • To learn more: • Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Understanding Belief Propagation and Its Generalizations”, Technical Report, 2001: http://www.merl.com/publications/TR2001-022/

  46. Belief Propagation [Figure: a variable node i and a factor node a exchanging “messages”; node marginals are the “beliefs”] • The “belief” is the BP approximation of the marginal probability.
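A minimal sum-product example on a three-node chain, where BP is exact because the graph is a tree; the potentials are illustrative values, not taken from the slides:

```python
import numpy as np

# Chain x1 - x2 - x3 over binary states, with pairwise potential psi and unary phi.
psi = np.array([[2.0, 1.0],      # psi(x_i, x_{i+1}); favours agreeing neighbours
                [1.0, 2.0]])
phi = [np.array([0.7, 0.3]),     # unary potentials phi_i(x_i)
       np.array([0.5, 0.5]),
       np.array([0.4, 0.6])]

# messages passed inward to the middle node x2
m_1_to_2 = psi.T @ phi[0]        # sum over x1 of phi_1(x1) psi(x1, x2)
m_3_to_2 = psi @ phi[2]          # sum over x3 of psi(x2, x3) phi_3(x3)

belief = phi[1] * m_1_to_2 * m_3_to_2
belief /= belief.sum()           # BP "belief": here the exact marginal of x2

# brute-force check over all 2^3 assignments
joint = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            joint[a, b, c] = phi[0][a] * phi[1][b] * phi[2][c] * psi[a, b] * psi[b, c]
marginal = joint.sum(axis=(0, 2))
marginal /= marginal.sum()

print(belief, marginal)          # the two should match on a tree
```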
