Learning Structured Prediction Models: A Large Margin Approach
Ben Taskar, U.C. Berkeley
Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein, Daphne Koller, Chris Manning
“Don’t worry, Howard. The big questions are multiple choice.”
Handwriting recognition: x = image of a handwritten word, y = "brace" (sequential structure)
Object segmentation: x = 3D scan of a scene, y = per-point segment labels (spatial structure)
Natural language parsing: x = sentence "The screen was a sea of red", y = parse tree (recursive structure)
Disulfide connectivity prediction: x = protein sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV, y = bonding pattern over its cysteines (combinatorial structure)
Outline
• Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
• Geometric View
  • Structured model polytopes
  • Linear programming inference
• Structured large margin estimation
  • Min-max formulation
  • Application: 3D object segmentation
  • Certificate formulation
  • Application: disulfide connectivity prediction
Structured models
Mild assumption: the scoring function is a linear combination of features, score(x, y) = wᵀf(x, y)
Prediction: maximize the score over the space of feasible outputs y
Chain Markov Net (aka CRF*)
Labels yi range over a-z at each position
P(y|x) ∝ ∏i φ(xi, yi) ∏i φ(yi, yi+1) = exp{wᵀf(x, y)}
Node potentials φ(xi, yi) = exp{wᵀf(xi, yi)}, with indicator features such as f(xi, yi) = I(pixel xp = 1, yi = 'z')
Edge potentials φ(yi, yi+1) = exp{wᵀf(yi, yi+1)}, with features such as f(yi, yi+1) = I(yi = 'z', yi+1 = 'a')
Stacked over positions: w = [..., w_node, ..., w_edge, ...] and f(x, y) = [..., #(xp = 1, yi = 'z'), ..., #(yi = 'z', yi+1 = 'a'), ...]
*Lafferty et al. '01
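A minimal decoding sketch (not from the slides) for such a chain model: Viterbi computes argmax_y wᵀf(x, y) by dynamic programming. The node_scores and edge_scores arrays are hypothetical stand-ins for the log-potentials wᵀf(xi, yi) and wᵀf(yi, yi+1).

```python
import numpy as np

def viterbi_decode(node_scores, edge_scores):
    """argmax_y sum_i node_scores[i, y_i] + sum_i edge_scores[y_i, y_{i+1}]
    for a chain; node_scores is (n, K), edge_scores is (K, K)."""
    n, K = node_scores.shape
    best = np.zeros((n, K))             # best score of any prefix ending in label k
    back = np.zeros((n, K), dtype=int)  # backpointers
    best[0] = node_scores[0]
    for i in range(1, n):
        # cand[k_prev, k] = best[i-1, k_prev] + edge(k_prev, k) + node(i, k)
        cand = best[i - 1][:, None] + edge_scores + node_scores[i][None, :]
        back[i] = cand.argmax(axis=0)
        best[i] = cand.max(axis=0)
    # follow backpointers from the best final label
    y = [int(best[-1].argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]

# toy example: 5 positions, 3 labels, random scores standing in for w.f
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```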
Associative Markov Nets
Point features: spin-images, point height
Edge features: length of edge, edge orientation
"Associative" restriction on the edge potentials φij(yi, yj): neighboring nodes are rewarded for taking the same label
PCFG
Features count the productions used in the parse tree: #(NP → DT NN), ..., #(PP → IN NP), ..., #(NN → 'sea')
Disulfide bonds: non-bipartite matching
The sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV has six cysteines (numbered 1-6); the bonding pattern is a perfect matching over them, here pairing 1-6, 2-5, 3-4
Fariselli & Casadio '01, Baldi et al. '04
Scoring function
Each candidate cysteine pair (i, j) in RSCCPCYWGGCPWGQNCYPEGCSGPKV is scored from string features around the two positions: residues, physical properties
The score of a matching is the sum of the scores of its pairs (a sketch of the matching inference follows below)
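A minimal inference sketch (not from the talk) for the matching step, using networkx's non-bipartite maximum-weight matching; the pair_scores values are hypothetical stand-ins for learned scores wᵀf(i, j).

```python
import networkx as nx

# hypothetical scores w.f(i, j) for candidate cysteine pairs (numbered 1-6)
pair_scores = {
    (1, 6): 2.1, (2, 5): 1.7, (3, 4): 1.5,   # the "true" bonds
    (1, 2): 0.3, (3, 6): 0.4, (4, 5): 0.2,   # some competing pairs
}

G = nx.Graph()
for (i, j), s in pair_scores.items():
    G.add_edge(i, j, weight=s)

# maximum-weight matching on a non-bipartite graph (perfect here)
matching = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in matching))   # [(1, 6), (2, 5), (3, 4)]
```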
Structured models
Mild assumption: linear scoring function, score(x, y) = wᵀf(x, y)
Another mild assumption: the maximization over the space of feasible outputs can be solved by linear programming
MAP inference as a linear program
• LP inference for:
  • Chains
  • Trees
  • Associative Markov Nets
  • Bipartite matchings
  • ...
Markov Net Inference LP
Maximize the linear score over marginal variables μ, subject to normalization and node-edge consistency constraints
• Has integral solutions y for chains, trees
• Gives an upper bound for general networks
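A minimal sketch (not from the slides) of what such an inference LP looks like for a chain, using scipy; node_scores and edge_scores are hypothetical stand-ins for the log-potentials. For chains the LP optimum is integral, so reading off the node marginals recovers the MAP labeling.

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(node_scores, edge_scores):
    """LP relaxation of MAP for a chain: variables are node marginals mu_i(k)
    and edge marginals mu_i(k, l), with normalization and consistency constraints."""
    n, K = node_scores.shape
    n_node = n * K                  # mu_i(k)
    n_edge = (n - 1) * K * K        # mu_i(k, l)

    def nid(i, k): return i * K + k
    def eid(i, k, l): return n_node + i * K * K + k * K + l

    c = np.zeros(n_node + n_edge)   # linprog minimizes, so negate the scores
    for i in range(n):
        for k in range(K):
            c[nid(i, k)] = -node_scores[i, k]
    for i in range(n - 1):
        for k in range(K):
            for l in range(K):
                c[eid(i, k, l)] = -edge_scores[k, l]

    A_eq, b_eq = [], []
    for i in range(n):                                  # sum_k mu_i(k) = 1
        row = np.zeros_like(c)
        row[[nid(i, k) for k in range(K)]] = 1
        A_eq.append(row); b_eq.append(1.0)
    for i in range(n - 1):                              # node-edge consistency
        for k in range(K):                              # sum_l mu_i(k,l) = mu_i(k)
            row = np.zeros_like(c)
            row[[eid(i, k, l) for l in range(K)]] = 1
            row[nid(i, k)] = -1
            A_eq.append(row); b_eq.append(0.0)
        for l in range(K):                              # sum_k mu_i(k,l) = mu_{i+1}(l)
            row = np.zeros_like(c)
            row[[eid(i, k, l) for k in range(K)]] = 1
            row[nid(i + 1, l)] = -1
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
    mu = res.x[:n_node].reshape(n, K)
    return mu.argmax(axis=1)        # integral for chains, so this is the MAP y

rng = np.random.default_rng(0)
print(chain_map_lp(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```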
Associative MN Inference LP (with the "associative" restriction)
• For K = 2, solutions are always integral (optimal)
• For K > 2, within a factor of 2 of optimal
• Constraint matrix A is linear in the number of nodes and edges, regardless of tree-width
Other Inference LPs
• Context-free parsing
• Dynamic programs
• Bipartite matching
• Network flow
• Many other combinatorial problems
Outline
• Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
• Geometric View
  • Structured model polytopes
  • Linear programming inference
• Structured large margin estimation
  • Min-max formulation
  • Application: 3D object segmentation
  • Certificate formulation
  • Application: disulfide connectivity prediction
Learning w
• Training example (x, y*)
• Probabilistic approach: maximize the conditional likelihood (written out below)
• Problem: computing Zw(x) is #P-complete
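Written out (my transcription, using the linear model above), the conditional likelihood objective and the partition function it requires are:

\max_w \sum_{(x, y^*)} \log P_w(y^* \mid x) = \sum_{(x, y^*)} \big[ w^\top f(x, y^*) - \log Z_w(x) \big], \qquad Z_w(x) = \sum_{y' \in \mathcal{Y}(x)} \exp\{ w^\top f(x, y') \}

The sum defining Zw(x) runs over all feasible outputs y', which is exponentially large; that is the #P-complete computation the slide refers to.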
Geometric Example
Training data: (figure)
Goal: learn w s.t. wᵀf(x, y*) points the "right" way
OCR Example
• We want: argmax_word wᵀf(x, word) = "brace"
• Equivalently:
  wᵀf(x, "brace") > wᵀf(x, "aaaaa")
  wᵀf(x, "brace") > wᵀf(x, "aaaab")
  ...
  wᵀf(x, "brace") > wᵀf(x, "zzzzz")
  ... a lot of constraints!
Large margin estimation*
• Given training example (x, y*), we want: wᵀf(x, y*) > wᵀf(x, y) for all y ≠ y*
• Maximize the margin γ: wᵀf(x, y*) ≥ wᵀf(x, y) + γ for all y ≠ y*
• Mistake-weighted margin: wᵀf(x, y*) ≥ wᵀf(x, y) + γ ℓ(y*, y), where ℓ(y*, y) = # of mistakes in y (the resulting QP is written out below)
*Taskar et al. '03
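Fixing the margin scale at 1, the mistake-weighted constraints give the standard quadratic program (my transcription; slack variables for the non-separable case are among the omissions listed later):

\min_w \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^\top f(x, y^*) \ge w^\top f(x, y) + \ell(y^*, y) \quad \forall\, y \in \mathcal{Y}(x)

The difficulty is the exponential number of constraints, one per feasible output y; the next slides replace this enumeration with the inference LP.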
Large margin estimation
• Brute force enumeration
• Min-max formulation: "plug in" the linear program for inference
Min-max formulation
Assume linear loss (Hamming): ℓ(y*, y) = Σi I(yi ≠ yi*), which is linear in the inference LP variables
The worst violator max_y [wᵀf(x, y) + ℓ(y*, y)] can then be computed with the same LP used for inference
Min-max formulation
By strong LP duality, the inner maximization is replaced by its dual minimization, so the constraint can be enforced by minimizing jointly over w and the dual variables z
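In symbols (my transcription; the notation F, c, A, b for the inference LP data is mine): the worst violator is found by maximizing over the inference polytope, strong duality swaps that maximization for a minimization over dual variables z, and the dual variables are then folded into one joint QP:

\max_{\mu \ge 0,\, A\mu \le b} (F^\top w + c)^\top \mu \;=\; \min_{z \ge 0,\, A^\top z \ge F^\top w + c} b^\top z

\min_{w, z} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^\top f(x, y^*) \ge b^\top z, \qquad A^\top z \ge F^\top w + c, \qquad z \ge 0

Here μ ranges over the inference polytope, f(x, y) = Fμ(y), and the Hamming loss is cᵀμ(y) up to a constant, so any feasible (w, z) certifies that y* beats every y by the required margin.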
Min-max formulation
Produces a compact QP for:
• Low-treewidth Markov networks
• Associative Markov networks
• Context-free grammars
• Bipartite matchings
• Any problem with compact LP inference
3D Mapping
Data provided by: Michael Montemerlo & Sebastian Thrun
Sensors: laser range finder, GPS, IMU
Labels: ground, building, tree, shrub
Training: 30 thousand points; Testing: 3 million points
Segmentation results
Hand-labeled 180K test points (result images)
Certificate formulation
• Non-bipartite matchings:
  • O(n³) combinatorial algorithm
  • No polynomial-size LP known
• Spanning trees:
  • No polynomial-size LP known
  • Simple certificate of optimality
• Intuition:
  • Verifying optimality is easier than optimizing
  • Compact optimality condition of y* with respect to the edge variables yij, ykl
Certificate for non-bipartite matching
• Alternating cycle: every other edge is in the matching
• Augmenting alternating cycle: total score of the edges not in the matching is greater than that of the edges in the matching
• Negate the score of edges not in the matching: augmenting alternating cycle = negative-length alternating cycle
• Matching is optimal ⇔ no negative-length alternating cycles (Edmonds '65)
Certificate for non-bipartite matching
Pick any node r as root; let dj = length of the shortest alternating path from r to j
Triangle inequality: dj ≤ dk + length(k, j) for each alternating step (k, j)
Theorem: no negative-length alternating cycle ⇔ such a distance function d exists
Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
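The generic shortest-path form of this certificate (my statement of the standard fact; the talk applies it to alternating paths with the negated lengths above) is:

\text{no negative-length cycle} \;\Longleftrightarrow\; \exists\, d \;\text{s.t.}\; d_j \le d_k + \mathrm{len}(k, j) \;\; \text{for every edge } (k, j)

Since the edge lengths are linear in w, these O(n²) inequalities in (w, d) are linear constraints that can be added directly to the large-margin program to force y* to be the optimal matching.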
Certificate formulation
Produces a compact QP for:
• Spanning trees
• Non-bipartite matchings
• Any problem with a compact optimality condition
Disulfide connectivity prediction
• Dataset:
  • Swiss-Prot protein database, release 39 (Fariselli & Casadio '01, Baldi et al. '04)
  • 446 sequences (4-50 cysteines)
  • Features: window profiles (size 9) around each pair
  • Two modes: bonded state known / unknown
• Comparison:
  • SVM-trained weights (ignoring constraints during learning)
  • DAG Recursive Neural Network [Baldi et al. '04]
• Our model:
  • Max-margin matching using RBF kernel
  • Training: off-the-shelf LP/QP solver CPLEX (~1 hour)
Known bonded state
Precision / Accuracy, 4-fold cross-validation (results chart)
Unknown bonded state
Precision / Recall / Accuracy, 4-fold cross-validation (results chart)
Formulation summary
• Brute force enumeration
• Min-max formulation: "plug in" a convex program for inference
• Certificate formulation: directly guarantee optimality of y*
Estimation
• Local estimation, P(z) = ∏i P(zi | z_pa(i)): HMMs, PCFGs (generative, P(x,y)); MEMMs (discriminative, P(y|x))
• Global estimation, P(z) = 1/Z ∏c φc(zc): MRFs (generative, P(x,y)); CRFs (discriminative, P(y|x))
• Margin-based estimation (discriminative): the approach in this talk
Omissions
• Formulation details
  • Kernels
  • Multiple examples
  • Slacks for the non-separable case
• Approximate learning of intractable models
  • General MRFs
  • Learning to cluster
• Structured generalization bounds
• Scalable algorithms (no QP solver needed)
  • Structured SMO (works for chains, trees)
  • Structured EG (works for chains, trees)
  • Structured PG (works for chains, matchings, AMNs, ...)
Current Work
• Learning approximate energy functions
  • Protein folding
  • Physical processes
• Semi-supervised learning
  • Hidden variables
  • Mixing labeled and unlabeled data
• Discriminative structure learning
  • Using sparsifying priors
Conclusion
• Two general techniques for structured large-margin estimation
• Exact, compact, convex formulations
  • Allow efficient use of kernels
  • Tractable when other estimation methods are not
• Structured generalization bounds
• Efficient learning algorithms
• Empirical success on many domains
• Papers at http://www.cs.berkeley.edu/~taskar
Duals and Kernels
• Kernel trick works!
• Scoring functions (log-potentials) can use kernels
• Same for certificate formulation
Handwriting Recognition
Setup: word length ~8 chars; letters 16x8 pixels; 10-fold train/test, 5000/50000 letters, 600/6000 words
Models: multiclass SVMs*, CRFs, M³ nets; features: raw pixels, quadratic kernel, cubic kernel
(Bar chart of test error, average per-character; lower is better)
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
*Crammer & Singer '01