Learning Structured Prediction Models: A Large Margin Approach
Ben Taskar, U.C. Berkeley
Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein, Daphne Koller, Chris Manning
“Don’t worry, Howard. The big questions are multiple choice.”
Handwriting recognition: x = image of a handwritten word, y = "brace" (sequential structure)
Object segmentation: x = 3D scan of a scene, y = per-point segment labels (spatial structure)
Natural language parsing: x = sentence "The screen was a sea of red", y = parse tree (recursive structure)
Disulfide connectivity prediction: x = protein sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV, y = bonding pattern over its cysteines (combinatorial structure)
Outline
• Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
• Geometric View
  • Structured model polytopes
  • Linear programming inference
• Structured large margin estimation
  • Min-max formulation
  • Application: 3D object segmentation
  • Certificate formulation
  • Application: disulfide connectivity prediction
Structured models
Mild assumption: the scoring function is a linear combination of features, score(x, y) = wᵀf(x, y)
Prediction: maximize the score over the space of feasible outputs y
Chain Markov Net (aka CRF*)
Labels yi range over a-z at each position
P(y|x) ∝ ∏i φ(xi, yi) ∏i φ(yi, yi+1) = exp{wᵀf(x, y)}
Node potentials φ(xi, yi) = exp{wᵀf(xi, yi)}, with indicator features such as f(xi, yi) = I(pixel xp = 1, yi = 'z')
Edge potentials φ(yi, yi+1) = exp{wᵀf(yi, yi+1)}, with features such as f(yi, yi+1) = I(yi = 'z', yi+1 = 'a')
Stacked over positions: w = [..., w_node, ..., w_edge, ...] and f(x, y) = [..., #(xp = 1, yi = 'z'), ..., #(yi = 'z', yi+1 = 'a'), ...]
*Lafferty et al. '01
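A minimal decoding sketch (not from the slides) for such a chain model: Viterbi computes argmax_y wᵀf(x, y) by dynamic programming. The node_scores and edge_scores arrays are hypothetical stand-ins for the log-potentials wᵀf(xi, yi) and wᵀf(yi, yi+1).

```python
import numpy as np

def viterbi_decode(node_scores, edge_scores):
    """argmax_y sum_i node_scores[i, y_i] + sum_i edge_scores[y_i, y_{i+1}]
    for a chain; node_scores is (n, K), edge_scores is (K, K)."""
    n, K = node_scores.shape
    best = np.zeros((n, K))             # best score of any prefix ending in label k
    back = np.zeros((n, K), dtype=int)  # backpointers
    best[0] = node_scores[0]
    for i in range(1, n):
        # cand[k_prev, k] = best[i-1, k_prev] + edge(k_prev, k) + node(i, k)
        cand = best[i - 1][:, None] + edge_scores + node_scores[i][None, :]
        back[i] = cand.argmax(axis=0)
        best[i] = cand.max(axis=0)
    # follow backpointers from the best final label
    y = [int(best[-1].argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]

# toy example: 5 positions, 3 labels, random scores standing in for w.f
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```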
Associative Markov Nets
Point features: spin-images, point height
Edge features: length of edge, edge orientation
"Associative" restriction on the edge potentials φij(yi, yj): neighboring nodes are rewarded for taking the same label
PCFG
Features count the productions used in the parse tree: #(NP → DT NN), ..., #(PP → IN NP), ..., #(NN → 'sea')
Disulfide bonds: non-bipartite matching
The sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV has six cysteines (numbered 1-6); the bonding pattern is a perfect matching over them, here pairing 1-6, 2-5, 3-4
Fariselli & Casadio '01, Baldi et al. '04
Scoring function
Each candidate cysteine pair (i, j) in RSCCPCYWGGCPWGQNCYPEGCSGPKV is scored from string features around the two positions: residues, physical properties
The score of a matching is the sum of the scores of its pairs (a sketch of the matching inference follows below)
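A minimal inference sketch (not from the talk) for the matching step, using networkx's non-bipartite maximum-weight matching; the pair_scores values are hypothetical stand-ins for learned scores wᵀf(i, j).

```python
import networkx as nx

# hypothetical scores w.f(i, j) for candidate cysteine pairs (numbered 1-6)
pair_scores = {
    (1, 6): 2.1, (2, 5): 1.7, (3, 4): 1.5,   # the "true" bonds
    (1, 2): 0.3, (3, 6): 0.4, (4, 5): 0.2,   # some competing pairs
}

G = nx.Graph()
for (i, j), s in pair_scores.items():
    G.add_edge(i, j, weight=s)

# maximum-weight matching on a non-bipartite graph (perfect here)
matching = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in matching))   # [(1, 6), (2, 5), (3, 4)]
```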
Structured models
Mild assumption: linear scoring function, score(x, y) = wᵀf(x, y)
Another mild assumption: the maximization over the space of feasible outputs can be solved by linear programming
MAP inference as a linear program
• LP inference for:
  • Chains
  • Trees
  • Associative Markov Nets
  • Bipartite matchings
  • ...
Markov Net Inference LP
Maximize the linear score over marginal variables μ, subject to normalization and node-edge consistency constraints
• Has integral solutions y for chains, trees
• Gives an upper bound for general networks
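A minimal sketch (not from the slides) of what such an inference LP looks like for a chain, using scipy; node_scores and edge_scores are hypothetical stand-ins for the log-potentials. For chains the LP optimum is integral, so reading off the node marginals recovers the MAP labeling.

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(node_scores, edge_scores):
    """LP relaxation of MAP for a chain: variables are node marginals mu_i(k)
    and edge marginals mu_i(k, l), with normalization and consistency constraints."""
    n, K = node_scores.shape
    n_node = n * K                  # mu_i(k)
    n_edge = (n - 1) * K * K        # mu_i(k, l)

    def nid(i, k): return i * K + k
    def eid(i, k, l): return n_node + i * K * K + k * K + l

    c = np.zeros(n_node + n_edge)   # linprog minimizes, so negate the scores
    for i in range(n):
        for k in range(K):
            c[nid(i, k)] = -node_scores[i, k]
    for i in range(n - 1):
        for k in range(K):
            for l in range(K):
                c[eid(i, k, l)] = -edge_scores[k, l]

    A_eq, b_eq = [], []
    for i in range(n):                                  # sum_k mu_i(k) = 1
        row = np.zeros_like(c)
        row[[nid(i, k) for k in range(K)]] = 1
        A_eq.append(row); b_eq.append(1.0)
    for i in range(n - 1):                              # node-edge consistency
        for k in range(K):                              # sum_l mu_i(k,l) = mu_i(k)
            row = np.zeros_like(c)
            row[[eid(i, k, l) for l in range(K)]] = 1
            row[nid(i, k)] = -1
            A_eq.append(row); b_eq.append(0.0)
        for l in range(K):                              # sum_k mu_i(k,l) = mu_{i+1}(l)
            row = np.zeros_like(c)
            row[[eid(i, k, l) for k in range(K)]] = 1
            row[nid(i + 1, l)] = -1
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
    mu = res.x[:n_node].reshape(n, K)
    return mu.argmax(axis=1)        # integral for chains, so this is the MAP y

rng = np.random.default_rng(0)
print(chain_map_lp(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```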
Associative MN Inference LP (with the "associative" restriction)
• For K = 2, solutions are always integral (optimal)
• For K > 2, within a factor of 2 of optimal
• Constraint matrix A is linear in the number of nodes and edges, regardless of tree-width
Other Inference LPs
• Context-free parsing
• Dynamic programs
• Bipartite matching
• Network flow
• Many other combinatorial problems
Outline
• Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
• Geometric View
  • Structured model polytopes
  • Linear programming inference
• Structured large margin estimation
  • Min-max formulation
  • Application: 3D object segmentation
  • Certificate formulation
  • Application: disulfide connectivity prediction
Learning w
• Training example (x, y*)
• Probabilistic approach: maximize the conditional likelihood (written out below)
• Problem: computing Zw(x) is #P-complete
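Written out (my transcription, using the linear model above), the conditional likelihood objective and the partition function it requires are:

\max_w \sum_{(x, y^*)} \log P_w(y^* \mid x) = \sum_{(x, y^*)} \big[ w^\top f(x, y^*) - \log Z_w(x) \big], \qquad Z_w(x) = \sum_{y' \in \mathcal{Y}(x)} \exp\{ w^\top f(x, y') \}

The sum defining Zw(x) runs over all feasible outputs y', which is exponentially large; that is the #P-complete computation the slide refers to.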
Geometric Example
Training data: (figure)
Goal: learn w s.t. wᵀf(x, y*) points the "right" way
OCR Example
• We want: argmax_word wᵀf(x, word) = "brace"
• Equivalently:
  wᵀf(x, "brace") > wᵀf(x, "aaaaa")
  wᵀf(x, "brace") > wᵀf(x, "aaaab")
  ...
  wᵀf(x, "brace") > wᵀf(x, "zzzzz")
  ... a lot of constraints!
Large margin estimation*
• Given training example (x, y*), we want: wᵀf(x, y*) > wᵀf(x, y) for all y ≠ y*
• Maximize the margin γ: wᵀf(x, y*) ≥ wᵀf(x, y) + γ for all y ≠ y*
• Mistake-weighted margin: wᵀf(x, y*) ≥ wᵀf(x, y) + γ ℓ(y*, y), where ℓ(y*, y) = # of mistakes in y (the resulting QP is written out below)
*Taskar et al. '03
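Fixing the margin scale at 1, the mistake-weighted constraints give the standard quadratic program (my transcription; slack variables for the non-separable case are among the omissions listed later):

\min_w \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^\top f(x, y^*) \ge w^\top f(x, y) + \ell(y^*, y) \quad \forall\, y \in \mathcal{Y}(x)

The difficulty is the exponential number of constraints, one per feasible output y; the next slides replace this enumeration with the inference LP.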
Large margin estimation
• Brute force enumeration
• Min-max formulation: "plug in" the linear program for inference
Min-max formulation
Assume linear loss (Hamming): ℓ(y*, y) = Σi I(yi ≠ yi*), which is linear in the inference LP variables
The worst violator max_y [wᵀf(x, y) + ℓ(y*, y)] can then be computed with the same LP used for inference
Min-max formulation
By strong LP duality, the inner maximization is replaced by its dual minimization, so the constraint can be enforced by minimizing jointly over w and the dual variables z
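In symbols (my transcription; the notation F, c, A, b for the inference LP data is mine): the worst violator is found by maximizing over the inference polytope, strong duality swaps that maximization for a minimization over dual variables z, and the dual variables are then folded into one joint QP:

\max_{\mu \ge 0,\, A\mu \le b} (F^\top w + c)^\top \mu \;=\; \min_{z \ge 0,\, A^\top z \ge F^\top w + c} b^\top z

\min_{w, z} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^\top f(x, y^*) \ge b^\top z, \qquad A^\top z \ge F^\top w + c, \qquad z \ge 0

Here μ ranges over the inference polytope, f(x, y) = Fμ(y), and the Hamming loss is cᵀμ(y) up to a constant, so any feasible (w, z) certifies that y* beats every y by the required margin.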
Min-max formulation
Produces a compact QP for:
• Low-treewidth Markov networks
• Associative Markov networks
• Context-free grammars
• Bipartite matchings
• Any problem with compact LP inference
3D Mapping
Data provided by: Michael Montemerlo & Sebastian Thrun
Sensors: laser range finder, GPS, IMU
Labels: ground, building, tree, shrub
Training: 30 thousand points; Testing: 3 million points
Segmentation results
Hand-labeled 180K test points (result images)
Certificate formulation
• Non-bipartite matchings:
  • O(n³) combinatorial algorithm
  • No polynomial-size LP known
• Spanning trees:
  • No polynomial-size LP known
  • Simple certificate of optimality
• Intuition:
  • Verifying optimality is easier than optimizing
  • Compact optimality condition of y* with respect to the edge variables yij, ykl
Certificate for non-bipartite matching
• Alternating cycle: every other edge is in the matching
• Augmenting alternating cycle: total score of the edges not in the matching is greater than that of the edges in the matching
• Negate the score of edges not in the matching: augmenting alternating cycle = negative-length alternating cycle
• Matching is optimal ⇔ no negative-length alternating cycles (Edmonds '65)
Certificate for non-bipartite matching
Pick any node r as root; let dj = length of the shortest alternating path from r to j
Triangle inequality: dj ≤ dk + length(k, j) for each alternating step (k, j)
Theorem: no negative-length alternating cycle ⇔ such a distance function d exists
Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
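The generic shortest-path form of this certificate (my statement of the standard fact; the talk applies it to alternating paths with the negated lengths above) is:

\text{no negative-length cycle} \;\Longleftrightarrow\; \exists\, d \;\text{s.t.}\; d_j \le d_k + \mathrm{len}(k, j) \;\; \text{for every edge } (k, j)

Since the edge lengths are linear in w, these O(n²) inequalities in (w, d) are linear constraints that can be added directly to the large-margin program to force y* to be the optimal matching.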
Certificate formulation
Produces a compact QP for:
• Spanning trees
• Non-bipartite matchings
• Any problem with a compact optimality condition
Disulfide connectivity prediction
• Dataset:
  • Swiss-Prot protein database, release 39 (Fariselli & Casadio '01, Baldi et al. '04)
  • 446 sequences (4-50 cysteines)
  • Features: window profiles (size 9) around each pair
  • Two modes: bonded state known / unknown
• Comparison:
  • SVM-trained weights (ignoring constraints during learning)
  • DAG Recursive Neural Network [Baldi et al. '04]
• Our model:
  • Max-margin matching using RBF kernel
  • Training: off-the-shelf LP/QP solver CPLEX (~1 hour)
Known bonded state
Precision / Accuracy, 4-fold cross-validation (results chart)
Unknown bonded state
Precision / Recall / Accuracy, 4-fold cross-validation (results chart)
Formulation summary
• Brute force enumeration
• Min-max formulation: "plug in" a convex program for inference
• Certificate formulation: directly guarantee optimality of y*
Estimation
• Local estimation, P(z) = ∏i P(zi | z_pa(i)): HMMs, PCFGs (generative, P(x,y)); MEMMs (discriminative, P(y|x))
• Global estimation, P(z) = 1/Z ∏c φc(zc): MRFs (generative, P(x,y)); CRFs (discriminative, P(y|x))
• Margin-based estimation (discriminative): the approach in this talk
Omissions
• Formulation details
  • Kernels
  • Multiple examples
  • Slacks for the non-separable case
• Approximate learning of intractable models
  • General MRFs
  • Learning to cluster
• Structured generalization bounds
• Scalable algorithms (no QP solver needed)
  • Structured SMO (works for chains, trees)
  • Structured EG (works for chains, trees)
  • Structured PG (works for chains, matchings, AMNs, ...)
Current Work
• Learning approximate energy functions
  • Protein folding
  • Physical processes
• Semi-supervised learning
  • Hidden variables
  • Mixing labeled and unlabeled data
• Discriminative structure learning
  • Using sparsifying priors
Conclusion
• Two general techniques for structured large-margin estimation
• Exact, compact, convex formulations
  • Allow efficient use of kernels
  • Tractable when other estimation methods are not
• Structured generalization bounds
• Efficient learning algorithms
• Empirical success on many domains
• Papers at http://www.cs.berkeley.edu/~taskar
Duals and Kernels
• Kernel trick works!
• Scoring functions (log-potentials) can use kernels
• Same for certificate formulation
Handwriting Recognition
Setup: word length ~8 chars; letters 16x8 pixels; 10-fold train/test, 5000/50000 letters, 600/6000 words
Models: multiclass SVMs*, CRFs, M³ nets; features: raw pixels, quadratic kernel, cubic kernel
(Bar chart of test error, average per-character; lower is better)
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
*Crammer & Singer '01