Structured Prediction: A Large Margin Approach
Ben Taskar, University of Pennsylvania
Joint work with: V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning
“Don’t worry, Howard. The big questions are multiple choice.”
Handwriting Recognition: input x is an image of a handwritten word, output y is its letter sequence (e.g., "brace"). Sequential structure.
Object Segmentation: input x is a 3D scan, output y is a label for each point. Spatial structure.
Natural Language Parsing: input x is a sentence ("The screen was a sea of red"), output y is its parse tree. Recursive structure.
Bilingual Word Alignment: input x is a sentence pair, output y is a word alignment. Combinatorial structure.
English: "What is the anticipated cost of collecting fees under the new proposal?"
French: "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?"
Protein Structure and Disulfide Bridges: input is the amino-acid sequence of protein 1IMT (AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK), output is the set of disulfide bridges between its cysteines.
Local Prediction
Classify each position using local information alone (e.g., predict each letter of "brace" independently). Ignores correlations & constraints!
Local Prediction
Label each 3D scan point (building, tree, shrub, ground) independently.
Structured Prediction
• Use local information
• Exploit correlations
(e.g., predict the letters of "brace" jointly)
Structured Prediction
Label the 3D scan points (building, tree, shrub, ground) jointly.
Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Structured large margin estimation • Margins and structure • Min-max formulation • Linear programming inference • Certificate formulation
Structured Models
Mild assumption: the scoring function is a linear combination of features, maximized over the space of feasible outputs:
h_w(x) = argmax_{y in Y(x)} w·f(x, y)
Chain Markov Net (aka CRF*)
Label sequence y (each y_j ranging over a-z) over input x; the score decomposes into node potentials over (x_j, y_j) and edge potentials over (y_j, y_{j+1}).
*Lafferty et al. 01
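Because the score decomposes along the chain, exact decoding is a dynamic program. A minimal Viterbi sketch; the node and edge score arrays are illustrative stand-ins for scores computed from w·f, not notation from the talk:

```python
import numpy as np

def viterbi(node, edge):
    """Max-scoring label sequence for a chain Markov net.

    node: (T, S) array, node[j, s] = score of label s at position j
    edge: (S, S) array, edge[s, t] = score of transition s -> t
    Returns the argmax label sequence as a list of ints.
    """
    T, S = node.shape
    delta = np.zeros((T, S))            # best score of a prefix ending in each label
    back = np.zeros((T, S), dtype=int)  # best predecessor label
    delta[0] = node[0]
    for j in range(1, T):
        cand = delta[j - 1][:, None] + edge + node[j][None, :]
        back[j] = cand.argmax(axis=0)
        delta[j] = cand.max(axis=0)
    # follow back-pointers from the best final label
    y = [int(delta[T - 1].argmax())]
    for j in range(T - 1, 0, -1):
        y.append(int(back[j, y[-1]]))
    return y[::-1]
```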
Associative Markov Nets
Point features: spin-images, point height. Edge features: length of edge, edge orientation.
"Associative" restriction: the edge potential between neighbors j and k only rewards agreement, i.e. it is nonnegative when y_j = y_k and zero otherwise.
CFG Parsing: the score is a weighted count of rule occurrences in the tree, e.g. #(NP → DT NN), …, #(PP → IN NP), …, #(NN → 'sea').
Bilingual Word Alignment
Score each candidate edge between an English word j ("What is the anticipated cost of collecting fees under the new proposal?") and a French word k ("En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?") using features:
• position
• orthography
• association
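Given learned edge scores s[j, k] = w·f(x, j, k), prediction is a maximum-weight bipartite matching, which off-the-shelf solvers handle. A minimal sketch with scipy; the score matrix below is a random placeholder:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# s[j, k] = learned score for aligning English word j to French word k
# (random placeholder scores; in the model these come from w . f(x, j, k))
rng = np.random.default_rng(0)
s = rng.normal(size=(5, 6))

# linear_sum_assignment minimizes cost, so negate to maximize score
rows, cols = linear_sum_assignment(-s)
alignment = list(zip(rows.tolist(), cols.tolist()))
print(alignment)
```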
Disulfide Bonds: Non-bipartite Matching
The six cysteines in RSCCPCYWGGCPWGQNCYPEGCSGPKV (numbered 1-6) pair up in a perfect matching; candidate matchings include {2-3, 1-5, 4-6} and {1-4, 2-6, 5-3}.
Fariselli & Casadio '01, Baldi et al. '04
Scoring Function
Each candidate cysteine pairing in RSCCPCYWGGCPWGQNCYPEGCSGPKV is scored with features of the paired sites:
• amino acid identities
• phys/chem properties
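Here prediction is a maximum-weight perfect matching on a general (non-bipartite) graph over the cysteines. A minimal sketch with networkx, using made-up pair scores:

```python
import networkx as nx

# made-up scores for cysteine pairs 1-6; in the model these are w . f(pair)
scores = {(1, 2): 0.1, (1, 3): 0.4, (1, 4): 0.9, (1, 5): 1.2,
          (1, 6): 0.3, (2, 3): 1.5, (2, 4): 0.2, (2, 5): 0.1,
          (2, 6): 1.1, (3, 4): 0.1, (3, 5): 0.3, (3, 6): 0.2,
          (4, 5): 0.1, (4, 6): 1.0, (5, 6): 0.2}

G = nx.Graph()
for (u, v), s in scores.items():
    G.add_edge(u, v, weight=s)

# max-weight matching on a general (non-bipartite) graph;
# maxcardinality=True asks for a perfect matching when one exists
matching = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in matching))  # [(1, 5), (2, 3), (4, 6)]
```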
Structured Models
Mild assumptions: the scoring function is a linear combination of features and a sum of part scores:
h_w(x) = argmax_{y in Y(x)} w·f(x, y), with w·f(x, y) = Σ_p w·f_p(x, y_p)
where Y(x) is the space of feasible outputs.
Supervised Structured Prediction
• Data: labeled examples (x, y)
• Learning: estimate w
• Prediction: argmax over feasible outputs; e.g. weighted matching, generally combinatorial optimization
• Estimation: local (ignores structure), margin, or likelihood (intractable)
Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Structured large margin estimation • Margins and structure • Min-max formulation • Linear programming inference • Certificate formulation
OCR Example
• We want: the correct word to win, score(x, "brace") > score(x, y') for every other word y'
• Equivalently: w·f(x, "brace") > w·f(x, "aaaaa"), w·f(x, "brace") > w·f(x, "aaaab"), …, w·f(x, "brace") > w·f(x, "zzzzz"): a lot of constraints!
Parsing Example
• We want: the correct parse of 'It was red' to score higher than every other parse
• Equivalently: one constraint per alternative parse tree: a lot of constraints!
Alignment Example
• We want: the correct alignment of 'What is the' to 'Quel est le' to score higher than every other alignment
• Equivalently: one constraint per alternative alignment: a lot of constraints!
Structured Loss
Count the mistakes in y relative to the truth, e.g. per-letter Hamming loss against "brace": bcare → 2, brore → 2, broce → 1, brace → 0. Analogously, alternative parses of 'It was red' incur losses 0, 1, 2, 2, and alternative alignments of 'What is the' / 'Quel est le' incur losses 0, 1, 2, 3.
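The per-position Hamming loss used here is just a disagreement count; a two-line sketch:

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where the two label sequences disagree."""
    assert len(y_true) == len(y_pred)
    return sum(a != b for a, b in zip(y_true, y_pred))

print(hamming_loss("brace", "broce"))  # 1
print(hamming_loss("brace", "brore"))  # 2
```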
Large margin estimation
• Given training examples (x_i, y_i), we want the truth to outscore every alternative: w·f(x_i, y_i) > w·f(x_i, y) for all y ≠ y_i
• Maximize the margin γ: w·f(x_i, y_i) ≥ w·f(x_i, y) + γ
• Mistake-weighted margin: w·f(x_i, y_i) ≥ w·f(x_i, y) + γ ℓ(y_i, y), where ℓ(y_i, y) = # of mistakes in y
*Collins 02, Altun et al 03, Taskar 03
Large margin estimation
• Eliminate γ: fix the scale and minimize the norm instead: min ½||w||² s.t. w·f(x_i, y_i) ≥ w·f(x_i, y) + ℓ(y_i, y) for all i, y
• Add slacks ξ_i for the inseparable case: min ½||w||² + C Σ_i ξ_i s.t. w·f(x_i, y_i) ≥ w·f(x_i, y) + ℓ(y_i, y) - ξ_i for all i, y
Large margin estimation
Two ways to handle the exponentially many constraints:
• Brute force enumeration
• Min-max formulation: 'plug in' a linear program for inference
Min-max formulation
Fold all constraints for example i into one, using the structured loss (Hamming) ℓ:
w·f(x_i, y_i) ≥ max_y [ w·f(x_i, y) + ℓ(y_i, y) ] - ξ_i
Key step: replace the discrete inner maximization (inference) with a continuous one, an inference LP over marginal variables z.
Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Structured large margin estimation • Margins and structure • Min-max formulation • Linear programming inference • Certificate formulation
Markov Net Inference LP
max_z Σ_j Σ_{y_j} θ_j(y_j) z_j(y_j) + Σ_{jk} Σ_{y_j, y_k} θ_jk(y_j, y_k) z_jk(y_j, y_k)
subject to normalization: Σ_{y_j} z_j(y_j) = 1, and agreement: Σ_{y_k} z_jk(y_j, y_k) = z_j(y_j), with z ≥ 0.
Has integral solutions z for chains, trees. Can be fractional for untriangulated networks.
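To make the LP concrete, here it is written out for a toy two-node, two-label chain and solved with scipy; the scores are made up, and the solver returns an integral vertex, i.e. the MAP labeling:

```python
import numpy as np
from scipy.optimize import linprog

# Variables: z = [z1(0), z1(1), z2(0), z2(1), z12(00), z12(01), z12(10), z12(11)]
theta = np.array([1.0, 0.2,              # node 1 scores
                  0.3, 0.7,              # node 2 scores
                  0.5, 0.0, 0.0, 0.5])   # edge scores favor agreement

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # normalization: z1(0) + z1(1) = 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # normalization: z2(0) + z2(1) = 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # agreement: sum_y2 z12(0, y2) = z1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # agreement: sum_y2 z12(1, y2) = z1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # agreement: sum_y1 z12(y1, 0) = z2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # agreement: sum_y1 z12(y1, 1) = z2(1)
])
b_eq = np.zeros(6); b_eq[:2] = 1

res = linprog(-theta, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(np.round(res.x, 3))  # [1. 0. 1. 0. 1. 0. 0. 0.]: integral, labeling (0, 0)
```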
Associative MN Inference LP
Same LP with the "associative" restriction on edge potentials:
• For K=2 labels, solutions are always integral (optimal)
• For K>2, within factor of 2 of optimal
CFG Chart
• CNF tree = set of two types of parts:
• Constituents (A, s, e)
• CF rules (A → B C, s, m, e)
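Since a CNF tree is just a set of constituent and rule parts, the max-score tree can be found by a CKY-style dynamic program over spans. A minimal sketch; rule_score and word_score are hypothetical stand-ins for the learned part scores w·f_p:

```python
def cky_max_score(n, symbols, rule_score, word_score):
    """Max total score of a CNF parse of words 0..n-1.

    rule_score(A, B, C, s, m, e): score of rule part (A -> B C, s, m, e)
    word_score(A, s): score of symbol A over the single word at position s
    """
    # best[(s, e, A)] = best score of an A spanning words s..e-1
    best = {}
    for s in range(n):
        for A in symbols:
            best[(s, s + 1, A)] = word_score(A, s)
    for width in range(2, n + 1):
        for s in range(n - width + 1):
            e = s + width
            for A in symbols:
                best[(s, e, A)] = max(
                    (best[(s, m, B)] + best[(m, e, C)]
                     + rule_score(A, B, C, s, m, e)
                     for m in range(s + 1, e)
                     for B in symbols for C in symbols),
                    default=float("-inf"))
    return best
```

The best parse score is then best[(0, n, 'S')] for the root symbol; keeping back-pointers alongside the max recovers the tree itself.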
CFG Inference LP
Variables z over constituent and rule parts, with root, inside, and outside consistency constraints. Has integral solutions z.
Matching Inference LP
max_z Σ_{jk} θ_jk z_jk subject to degree constraints: Σ_k z_jk ≤ 1 for each English word j, Σ_j z_jk ≤ 1 for each French word k, with 0 ≤ z_jk ≤ 1.
Has integral solutions z.
LP Duality
• Linear programming duality: variables ↔ constraints, constraints ↔ variables
• Optimal values are the same (when both feasible regions are bounded)
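Concretely, the pairing used on the next slide is the generic LP dual in standard form (a general statement, not specific to any one inference LP):

```latex
\max_{z}\; c^{\top} z \quad \text{s.t.}\quad Az \le b,\; z \ge 0
\qquad\longleftrightarrow\qquad
\min_{\lambda}\; b^{\top} \lambda \quad \text{s.t.}\quad A^{\top}\lambda \ge c,\; \lambda \ge 0
```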
Min-max Formulation
Schematically, LP duality turns the inner maximization max_z { (F_i w + ℓ_i)·z : A_i z ≤ b_i, z ≥ 0 } into min_{λ_i ≥ 0} { b_i·λ_i : A_i'λ_i ≥ F_i w + ℓ_i }; folding this dual into the outer minimization over w replaces the min-max problem with a single compact QP.
Min-max formulation summary • Formulation produces concise QP for • Low-treewidth Markov networks • Associative MNs (K=2) • Context free grammars • Bipartite matchings • Approximate for untriangulated MNs, AMNs with K>2 *Taskar et al 04
Unfactored Primal/Dual
The flat QP has one margin constraint per output y and, by QP duality, one dual variable α_i(y) per output: exponentially many constraints/variables.
Factored Primal/Dual
By QP duality, the dual inherits structure from the problem-specific inference LP: its variables correspond to a decomposition of the variables of the flat case.
The Connection
Flat dual: a distribution over whole outputs, e.g. α(bcare) = .2 (loss 2), α(brore) = .15 (loss 2), α(broce) = .25 (loss 1), α(brace) = .4 (loss 0).
Factored dual: its node marginals, e.g. position 2: μ(r) = .8, μ(c) = .2; position 3: μ(a) = .6, μ(o) = .4; position 4: μ(c) = .65, μ(r) = .35; positions 1 and 5: μ(b) = μ(e) = 1.
Duals and Kernels
• Kernel trick works in the factored dual
• Local functions (log-potentials) can use kernels
Alternatives: Perceptron • Simple iterative method • Unstable for structured output: fewer instances, big updates • May not converge if non-separable • Noisy • Voted / averaged perceptron [Freund & Schapire 99, Collins 02] • Regularize / reduce variance by aggregating over iterations
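For contrast, a minimal averaged structured-perceptron sketch (after Collins '02); decode stands in for whatever argmax inference the structure supports (Viterbi, CKY, matching) and feats for f(x, y), both hypothetical placeholders:

```python
import numpy as np

def structured_perceptron(data, feats, decode, dim, epochs=10):
    """data: list of (x, y) pairs; feats(x, y) -> np.ndarray of shape (dim,);
    decode(w, x) -> argmax_y w . feats(x, y) via problem-specific inference."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)   # running sum for the averaged perceptron
    t = 0
    for _ in range(epochs):
        for x, y in data:
            y_hat = decode(w, x)
            if y_hat != y:
                w += feats(x, y) - feats(x, y_hat)  # big, unstable update
            w_sum += w
            t += 1
    return w_sum / t        # averaging reduces variance (Freund & Schapire 99)
```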
Alternatives: Constraint Generation • Add most violated constraint • Handles more general loss functions • Only polynomial # of constraints needed • Need to re-solve QP many times • Worst case # of constraints larger than factored [Collins 02; Altun et al, 03; Tsochantaridis et al, 04]
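And a sketch of the constraint-generation (cutting-plane) loop; loss_augmented_decode finds the most violated constraint and solve_qp re-solves the QP over the working set, both placeholders for problem-specific routines:

```python
def constraint_generation(data, loss_augmented_decode, solve_qp,
                          feats, loss, eps=1e-3, max_iters=100):
    """Cutting-plane training: grow a working set of constraints until
    no constraint is violated by more than eps."""
    working_set = []        # constraints as (example index, output) pairs
    w, xi = None, None
    for _ in range(max_iters):
        w, xi = solve_qp(working_set, data, feats, loss)
        added = False
        for i, (x, y) in enumerate(data):
            # most violated constraint for example i under current w
            y_hat = loss_augmented_decode(w, x, y)
            margin = w @ feats(x, y) - w @ feats(x, y_hat)
            if margin < loss(y, y_hat) - xi[i] - eps:
                working_set.append((i, y_hat))
                added = True
        if not added:
            break           # all constraints satisfied to tolerance eps
    return w
```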
Handwriting Recognition
Data: words of ~8 chars; letters of 16x8 pixels; 10-fold train/test with 5000/50000 letters (600/6000 words).
Models: multiclass SVMs*, CRFs, M^3 nets; features from raw pixels, quadratic kernel, cubic kernel.
Results (test error, average per-character; lower is better): 45% error reduction over linear CRFs, 33% error reduction over multiclass SVMs.
*Crammer & Singer 01
Hypertext Classification
• WebKB dataset: four CS department websites, 1300 pages / 3500 links
• Classify each page: faculty, course, student, project, other
• Train on three universities, test on the fourth
Results (relaxed dual; lower is better): 53% error reduction over SVMs, 38% error reduction over RMNs* trained with loopy belief propagation.
*Taskar et al 02
3D Mapping
Data provided by: Michael Montemerlo & Sebastian Thrun (laser range finder, GPS, IMU).
Labels: ground, building, tree, shrub. Training: 30 thousand points. Testing: 3 million points.