Crash Course on Machine Learning, Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause
Structured Prediction • Use local information • Exploit correlations • Running example: recognizing the handwritten word "b r a c e" one character at a time (see the sketch below)
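As a small illustration of combining local information with correlations, here is a Viterbi-style decoding sketch for a five-letter word; the letter set and the unary/pairwise scores are random placeholders, not the slides' actual model:

```python
import numpy as np

# Hypothetical per-position letter scores (local information) and
# letter-pair scores (correlations) for a 5-character word like "brace".
letters = ['a', 'b', 'c', 'e', 'o', 'r']
L = len(letters)
n = 5
rng = np.random.default_rng(0)
unary = rng.random((n, L))        # unary[i, k]: score of letter k at position i
pairwise = rng.random((L, L))     # pairwise[j, k]: score of letter j followed by k

# Viterbi: best-scoring letter sequence under unary + pairwise scores.
best = unary[0].copy()
backptr = np.zeros((n, L), dtype=int)
for i in range(1, n):
    scores = best[:, None] + pairwise + unary[i][None, :]   # (prev, cur)
    backptr[i] = scores.argmax(axis=0)
    best = scores.max(axis=0)

# Trace the best path back from the last position.
path = [int(best.argmax())]
for i in range(n - 1, 0, -1):
    path.append(int(backptr[i, path[-1]]))
word = ''.join(letters[k] for k in reversed(path))
print(word)
```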
Min-max Formulation LP duality
Before QP duality Exponentially many constraints/variables
After • By QP duality • Dual inherits structure from the problem-specific inference LP • Variables correspond to a decomposition of the variables of the flat case
The Connection • Figure: candidate labelings of the handwritten word ("b c a r e", "b r o r e", "b r o c e", "b r a c e") with their Hamming losses and scores, and the corresponding per-letter marginals
Duals and Kernels • Kernel trick works: • Factored dual • Local functions (log-potentials) can use kernels
3D Mapping • Data provided by Michael Montemerlo & Sebastian Thrun • Sensors: laser range finder, GPS, IMU • Labels: ground, building, tree, shrub • Training: 30 thousand points • Testing: 3 million points
Alternatives: Perceptron • Simple iterative method • Unstable for structured output: fewer instances, big updates • May not converge if non-separable • Noisy • Voted / averaged perceptron [Freund & Schapire 99, Collins 02] • Regularize / reduce variance by aggregating over iterations
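A minimal sketch of the averaged structured perceptron described above; the joint feature map `features` and the inference routine `predict` are left as problem-specific placeholders:

```python
import numpy as np

def averaged_structured_perceptron(data, features, predict, dim, epochs=5):
    """Sketch of the voted/averaged perceptron for structured outputs.

    data:     list of (x, y_true) pairs
    features: features(x, y) -> np.ndarray of length dim (joint feature map)
    predict:  predict(w, x) -> argmax_y of w . features(x, y) (problem-specific inference)
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)   # running sum of weights for averaging (variance reduction)
    t = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = predict(w, x)
            if y_hat != y_true:
                # Big structured update: move toward the gold output, away from the prediction.
                w += features(x, y_true) - features(x, y_hat)
            w_sum += w
            t += 1
    return w_sum / t        # averaged weights [Freund & Schapire 99; Collins 02]
```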
Alternatives: Constraint Generation • Add most violated constraint • Handles several more general loss functions • Need to re-solve QP many times • Theorem: Only polynomial # of constraints needed to achieve ε-error [Tsochantaridis et al, 04] • Worst case # of constraints larger than factored [Collins 02; Altun et al, 03]
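A schematic sketch of the constraint-generation (cutting-plane) loop; `find_most_violated` (loss-augmented inference) and `solve_qp` are hypothetical callables standing in for the problem-specific pieces:

```python
def constraint_generation(data, find_most_violated, solve_qp, eps=1e-3, max_iters=100):
    """Schematic cutting-plane loop for structured max-margin learning.

    find_most_violated(w, x, y): loss-augmented inference; returns the output whose
        margin constraint is most violated and the size of the violation.
    solve_qp(constraints): re-solves the QP over the current working set of constraints
        and returns updated weights w.
    """
    constraints = []
    w = solve_qp(constraints)                 # start from the unconstrained problem
    for _ in range(max_iters):
        added = 0
        for x, y in data:
            y_bad, violation = find_most_violated(w, x, y)
            if violation > eps:               # only add constraints violated by more than eps
                constraints.append((x, y, y_bad))
                added += 1
        if added == 0:                        # no constraint violated by more than eps: done
            break
        w = solve_qp(constraints)             # re-solve the QP with the enlarged working set
    return w
```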
Integration • Feature passing • Margin based: max-margin structure learning • Probabilistic: graphical models
Graphical Models • Joint distribution • Factoring using independences among variables • Representation • Inference • Learning
Big Picture • Two problems with using full joint distribution tables as our probabilistic models: • Unless there are only a few variables, the joint is WAY too big to represent explicitly • Hard to learn (estimate) anything empirically about more than a few variables at a time • Describe complex joint distributions (models) using simple, local distributions • We describe how variables locally interact • Local interactions chain together to give global, indirect interactions
Joint Distribution • For n variables with domain size d • Joint distribution table has d^n - 1 free parameters • Size of representation is the same if we use the chain rule: counting free parameters, and accounting for the fact that each distribution sums to one, (d-1) + d(d-1) + d^2(d-1) + ... + d^(n-1)(d-1) = ((d^n - 1)/(d-1)) (d-1) = d^n - 1
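As a quick sanity check of the counting argument above, a few lines of Python; the (d, n) pairs are arbitrary illustrative values:

```python
# Check that (d-1) + d(d-1) + d^2(d-1) + ... + d^(n-1)(d-1) equals d^n - 1
# for a few illustrative (d, n) values.
for d, n in [(2, 3), (3, 4), (5, 6)]:
    chain_rule_count = sum((d - 1) * d**i for i in range(n))
    assert chain_rule_count == d**n - 1
    print(d, n, chain_rule_count)
```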
Conditional Independence • Two variables X, Y are conditionally independent given Z if P(X, Y | Z) = P(X | Z) P(Y | Z), equivalently P(X | Y, Z) = P(X | Z) • What about this domain? • Traffic • Umbrella • Raining
Representation • Explicitly model uncertainty and dependency structure • Figure: the same model over variables a, b, c, d drawn three ways: directed graph, undirected graph, factor graph • Key concept: Markov blanket
Bayes Net: Notation • Nodes: variables • Can be assigned (observed) or unassigned (unobserved) • Arcs: interactions • Indicate "direct influence" between variables • Formally: encode conditional independence • Figure: example net over Weather, Cavity, Toothache, Catch
Example: Coin Flips • N independent coin flips • No interactions between variables • Absolute independence • Figure: nodes X1, X2, ..., Xn with no edges
Example: Traffic • Variables: • Traffic • Rain • Model 1: absolute independence • Model 2: rain causes traffic • Which makes more sense? Rain Traffic
Semantics • A set of nodes, one per variable X • A directed, acyclic graph • A conditional distribution for each node • A collection of distributions over X, one for each combination of parents' values • Conditional Probability Table (CPT) • Figure: parents A1, A2, ..., An pointing to X • A Bayes net = topology (graph) + local conditional probabilities
Example: Alarm • Variables: • Alarm • Burglary • Earthquake • Radio • Calls John • Figure: Burglary and Earthquake point to Alarm, Earthquake to Radio, Alarm to Call
Example: Alarm • CPTs: P(E), P(B), P(R|E), P(A|E,B), P(C|A) • Joint: P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
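To illustrate the factorization, here is a minimal sketch that evaluates P(E,B,R,A,C) as a product of CPT entries; the probability tables below are invented for the example, since the slides' actual numbers are not reproduced in the text:

```python
# Illustrative CPTs for the Alarm network (numbers are made up for this sketch).
P_E = {True: 0.002, False: 0.998}                       # P(E)
P_B = {True: 0.001, False: 0.999}                       # P(B)
P_R = {True: {True: 0.90, False: 0.10},                 # P(R | E): outer key is E
       False: {True: 0.01, False: 0.99}}
P_A = {(True, True): {True: 0.95, False: 0.05},         # P(A | B, E): key is (B, E)
       (True, False): {True: 0.94, False: 0.06},
       (False, True): {True: 0.29, False: 0.71},
       (False, False): {True: 0.001, False: 0.999}}
P_C = {True: {True: 0.90, False: 0.10},                 # P(C | A): outer key is A
       False: {True: 0.05, False: 0.95}}

def joint(e, b, r, a, c):
    # P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
    return P_E[e] * P_B[b] * P_R[e][r] * P_A[(b, e)][a] * P_C[a][c]

# Example query: probability of a burglary with the alarm sounding and John calling.
print(joint(e=False, b=True, r=False, a=True, c=True))
```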
Bayes Net Size • How big is a joint distribution over n Boolean variables? 2^n • How big is a CPT with k parents? 2^(k+1) • How big is a BN with n nodes if nodes have up to k parents? n * 2^(k+1) • BNs: • Compact representation • Use local properties to define CPTs • Answer queries more easily
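To make the comparison concrete, a quick computation for illustrative values of n and k (the choices n = 30 and k = 3 are not from the slides):

```python
# Full joint over n Boolean variables vs. a Bayes net where each of the n nodes
# has at most k parents, using the counts from the slide.
n, k = 30, 3
full_joint = 2**n              # 2^n entries in the full table
bn_size = n * 2**(k + 1)       # n CPTs, each with at most 2^(k+1) entries
print(full_joint, bn_size)     # 1073741824 vs 480
```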
Independence in BN • BNs present a compact representation for joint distributions • Take advantage of conditional independence • Given a BN, let's answer independence questions: • Are two nodes independent given certain evidence? • What can we say about X, Z? (Example: Low pressure, Rain, Traffic) • Figure: chain X, Y, Z
Causal Chains • Question: Is Z independent of X given Y? • X: low pressure • Y: Rain • Z: Traffic • Figure: chain X, Y, Z • Observing Y blocks the influence along the chain between X and Z
Common Cause • Are X, Z independent? • Y: low pressure • X: Rain • Z: Cold • Are X, Z independent given Y? • Figure: Y with arrows to X and Z • Observing Y blocks the influence between X, Z
Common Effect • Are X, Z independent? • X: Rain • Y: Traffic • Z: Ball Game • Are X, Z independent given Y? • Figure: X and Z with arrows into Y • Observing Y activates influence between X, Z
Independence in BNs • Any complex BN structure can be analyzed using these three cases • Figure: the Alarm network (Burglary, Earthquake, Radio, Alarm, Call)
Directed acyclic graph (Bayes net) • Can model causality • Parameter learning • Decomposes: learn each term separately (ML) • Inference • Simple exact inference if tree-shaped (belief propagation) • Tree-shaped example over a, b, c, d: P(a,b,c,d) = P(c|b) P(d|b) P(b|a) P(a)
Directed acyclic graph (Bayes net) • Can model causality • Parameter learning • Decomposes: learn each term separately (ML) • Inference • Simple exact inference if tree-shaped (belief propagation) • Loops require approximation • Loopy BP • Tree-reweighted BP • Sampling • Loopy example over a, b, c, d: P(a,b,c,d) = P(c|b) P(d|a,b) P(b|a) P(a)
Directed graph • Example: Places and scenes • Place: office, kitchen, street, etc. • Objects present: fire hydrant, car, person, toaster, microwave • P(place, car, person, toaster, micro, hydrant) = P(place) P(car | place) P(person | place) … P(hydrant | place)
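As a sketch of how this factorization can be used, the snippet below scores each place by P(place) times the product of P(object | place) terms for the observed detections and normalizes; all priors and conditional probabilities here are made-up illustrative values:

```python
# Sketch: P(place | objects) is proportional to P(place) * product of P(object | place),
# following the slide's factorization. All probabilities below are invented.
P_place = {'office': 0.3, 'kitchen': 0.3, 'street': 0.4}
P_obj_given_place = {                     # P(object present | place)
    'office':  {'car': 0.01, 'person': 0.6, 'toaster': 0.05, 'microwave': 0.20, 'hydrant': 0.001},
    'kitchen': {'car': 0.01, 'person': 0.5, 'toaster': 0.60, 'microwave': 0.70, 'hydrant': 0.001},
    'street':  {'car': 0.70, 'person': 0.5, 'toaster': 0.01, 'microwave': 0.01, 'hydrant': 0.300},
}

observed = {'car': True, 'person': True, 'toaster': False, 'microwave': False, 'hydrant': True}

scores = {}
for place, prior in P_place.items():
    s = prior
    for obj, present in observed.items():
        p = P_obj_given_place[place][obj]
        s *= p if present else (1 - p)
    scores[place] = s

Z = sum(scores.values())
posterior = {place: s / Z for place, s in scores.items()}
print(posterior)   # with these numbers, 'street' comes out most likely
```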
Undirected graph (Markov Networks) • Does not model causality • Often pairwise • Parameter learning difficult • Inference usually approximate • Figure: pairwise model over nodes x1, x2, x3, x4
Markov Networks • Example: "label smoothing" grid • Binary nodes • Pairwise potential penalizing disagreement: θ(0,0) = 0, θ(0,1) = K, θ(1,0) = K, θ(1,1) = 0
Image De-Noising Original Image Noisy Image
Image De-Noising Noisy Image Restored Image (ICM)
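The restored image on this slide is attributed to ICM (iterated conditional modes). Below is a minimal ICM sketch for binary de-noising using a pairwise smoothing potential like the one on the "label smoothing" slide; the observation weighting, smoothing strength, and toy image are assumptions made for the example:

```python
import numpy as np

def icm_denoise(noisy, beta=2.0, eta=1.0, n_sweeps=5):
    """Minimal ICM sketch for binary (+1/-1) image de-noising.

    Each pixel is set to the label that best agrees with its noisy observation
    (weight eta) and with its 4-neighbourhood (weight beta), sweep after sweep.
    """
    x = noisy.copy()
    H, W = x.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # Sum the current labels of the 4-neighbourhood of (i, j).
                nb = 0.0
                if i > 0:     nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                # Pick whichever label (+1 or -1) gives the higher local score.
                x[i, j] = 1 if eta * noisy[i, j] + beta * nb > 0 else -1
    return x

# Tiny usage example: a clean block image with roughly 10% of pixels flipped.
rng = np.random.default_rng(0)
clean = -np.ones((20, 20)); clean[5:15, 5:15] = 1
noise = rng.random((20, 20)) < 0.1
noisy = np.where(noise, -clean, clean)
restored = icm_denoise(noisy)
print((restored != clean).mean())   # fraction of pixels still wrong after ICM
```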
Factor graphs • A general representation • Figure: the Bayes net over a, b, c, d redrawn as a factor graph, with one factor node per conditional distribution
Factor graphs • A general representation • Figure: the Markov net over a, b, c, d redrawn as a factor graph, with one factor node per potential
Factor graphs Write as a factor graph
Inference in Graphical Models • Joint • Marginal • Max • Exact inference is HARD
Sampling from a BN • Compute Marginals • Compute Conditionals
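A minimal sketch of ancestral (forward) sampling on the earlier Rain/Traffic model: the samples give a marginal directly, and a conditional by keeping only samples consistent with the evidence. The CPT numbers are made up for the example:

```python
import random

# Tiny Bayes net Rain -> Traffic; probabilities are invented for illustration.
P_rain = 0.3
P_traffic_given_rain = {True: 0.8, False: 0.2}

def sample():
    rain = random.random() < P_rain
    traffic = random.random() < P_traffic_given_rain[rain]
    return rain, traffic

samples = [sample() for _ in range(100_000)]

# Marginal P(Traffic) estimated from the samples.
p_traffic = sum(t for _, t in samples) / len(samples)

# Conditional P(Rain | Traffic) by rejecting samples inconsistent with the evidence.
traffic_samples = [r for r, t in samples if t]
p_rain_given_traffic = sum(traffic_samples) / len(traffic_samples)

print(p_traffic, p_rain_given_traffic)
```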
Belief Propagation • Very general • Approximate, except for tree-shaped graphs • Generalized variants of BP can have better convergence for graphs with many loops or strong potentials • Standard packages available (BNT toolbox) • To learn more: • Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Understanding Belief Propagation and Its Generalizations", Technical Report, 2001: http://www.merl.com/publications/TR2001-022/
Belief Propagation • "Beliefs" and "messages" (figure: message passed between a factor a and a variable i) • The "belief" is the BP approximation of the marginal probability.
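A minimal sum-product message-passing sketch on a three-variable chain, where BP is exact because the graph is a tree; the unary and pairwise potentials are invented, and each belief is the normalized product of the local potential and the incoming messages:

```python
import numpy as np

# Sum-product BP on a binary chain x1 - x2 - x3. On a tree the beliefs
# are the exact marginals. The potentials below are invented for the sketch.
phi1 = np.array([0.6, 0.4])                  # unary potential on x1
phi2 = np.array([0.5, 0.5])                  # unary potential on x2
phi3 = np.array([0.3, 0.7])                  # unary potential on x3
psi12 = np.array([[0.9, 0.1], [0.1, 0.9]])   # pairwise potential over (x1, x2)
psi23 = np.array([[0.8, 0.2], [0.2, 0.8]])   # pairwise potential over (x2, x3)

# Forward messages (left to right) and backward messages (right to left).
m12 = psi12.T @ phi1                 # message from x1 to x2
m23 = psi23.T @ (phi2 * m12)         # message from x2 to x3
m32 = psi23 @ phi3                   # message from x3 to x2
m21 = psi12 @ (phi2 * m32)           # message from x2 to x1

def normalize(v):
    return v / v.sum()

# Beliefs: local potential times all incoming messages, normalized.
b1 = normalize(phi1 * m21)
b2 = normalize(phi2 * m12 * m32)
b3 = normalize(phi3 * m23)
print(b1, b2, b3)
```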