1 / 75

Structured Prediction: A Large Margin Approach

Structured Prediction: A Large Margin Approach. Ben Taskar University of Pennsylvania Joint work with: V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning. “Don’t worry, Howard. The big questions are multiple choice.”.

tevy
Download Presentation

Structured Prediction: A Large Margin Approach

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Structured Prediction:A Large Margin Approach Ben Taskar University of Pennsylvania Joint work with: V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning

  2. “Don’t worry, Howard. The big questions are multiple choice.”

  3. Handwriting Recognition x y brace Sequential structure

  4. Object Segmentation x y Spatial structure

  5. Natural Language Parsing x y The screen was a sea of red Recursive structure

  6. Bilingual Word Alignment En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ? x y What is the anticipated cost of collecting fees under the new proposal ? What is the anticipated cost of collecting fees under the new proposal? En vertu des nouvelles propositions, quel est le coût prévu de perception des droits? Combinatorial structure

  7. Protein Structure and Disulfide Bridges AVITGACERDLQCG KGTCCAVSLWIKSV RVCTPVGTSGEDCH PASHKIPFSGQRMH HTCPCAPNLACVQT SPKKFKCLSK Protein: 1IMT

  8. Local Prediction Classify using local information  Ignores correlations & constraints! b r a c e

  9. building tree shrub ground Local Prediction

  10. Structured Prediction • Use local information • Exploit correlations b r a c e

  11. building tree shrub ground Structured Prediction

  12. Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Structured large margin estimation • Margins and structure • Min-max formulation • Linear programming inference • Certificate formulation

  13. Structured Models Mild assumption: linear combination scoring function space of feasible outputs

  14. a-z a-z a-z a-z a-z Chain Markov Net (aka CRF*) y x *Lafferty et al. 01

  15. a-z a-z a-z a-z a-z Chain Markov Net (aka CRF*) y x *Lafferty et al. 01

  16. Associative Markov Nets Edge features Point features spin-images, point height length of edge, edge orientation “associative” restriction j yj jk yk

  17. CFG Parsing #(NP  DT NN) … #(PP  IN NP) … #(NN  ‘sea’)

  18. Bilingual Word Alignment • position • orthography • association En vertu de les nouvelles propositions , quel est le coût prévu de perception de le droits ? What is the anticipated cost of collecting fees under the new proposal ? k j

  19. Disulfide Bonds: Non-bipartite Matching RSCCPCYWGGCPWGQNCYPEGCSGPKV 1 2 3 4 5 6 2 3 1 5 1 4 2 6 4 6 5 3 Fariselli & Casadio `01, Baldi et al. ‘04

  20. 2 3 1 4 6 5 Scoring Function RSCCPCYWGGCPWGQNCYPEGCSGPKV 1 2 3 4 5 6 RSCCPCYWGGCPWGQNCYPEGCSGPKV 1 2 3 4 5 6 • amino acid identities • phys/chem properties

  21. Structured Models Mild assumptions: linear combination sum of part scores scoring function space of feasible outputs

  22. Supervised Structured Prediction Model: Prediction Learning Data Estimatew Example: Weighted matching Generally: Combinatorialoptimization Local (ignores structure) Margin Likelihood (intractable)

  23. Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Structured large margin estimation • Margins and structure • Min-max formulation • Linear programming inference • Certificate formulation

  24. OCR Example • We want: • Equivalently: “brace” “brace” “aaaaa” “brace” “aaaab” a lot! … “brace” “zzzzz”

  25. S S A E B F G C H D S S S S A A A A B B B B S C C C C D D D D A B D F Parsing Example • We want: • Equivalently: ‘It was red’ ‘It was red’ ‘It was red’ ‘It was red’ ‘It was red’ a lot! … ‘It was red’ ‘It was red’

  26. 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Alignment Example • We want: • Equivalently: ‘What is the’ ‘Quel est le’ 1 2 3 1 2 3 ‘What is the’ ‘Quel est le’ ‘What is the’ ‘Quel est le’ ‘What is the’ ‘Quel est le’ ‘What is the’ ‘Quel est le’ a lot! … ‘What is the’ ‘Quel est le’ ‘What is the’ ‘Quel est le’

  27. S S B B D E A A C C 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 S A B S C D A E C D 1 2 3 1 2 3 Structured Loss b c a r e 2 b r o r e 2 b r o c e 1 b r a c e 0 0 1 2 2 0 1 2 3 ‘What is the’ ‘Quel est le’ ‘It was red’

  28. Large margin estimation • Given training examples , we want: • Maximize margin • Mistake weighted margin: # of mistakes in y *Collins 02, Altun et al 03, Taskar 03

  29. Large margin estimation • Eliminate • Add slacks for inseparable case

  30. Large margin estimation • Brute force enumeration • Min-max formulation • ‘Plug-in’ linear program for inference

  31. Min-max formulation Structured loss (Hamming): Inference LP Inference Key step: discrete optim. continuous optim.

  32. Outline • Structured prediction models • Sequences (CRFs) • Trees (CFGs) • Associative Markov networks (Special MRFs) • Matchings • Structured large margin estimation • Margins and structure • Min-max formulation • Linear programming inference • Certificate formulation

  33. y  z Map for Markov Nets

  34. Markov Net Inference LP normalization agreement Has integral solutions z for chains, trees Can be fractional for untriangulated networks

  35. Associative MN Inference LP “associative” restriction • For K=2, solutions are always integral (optimal) • For K>2, within factor of 2 of optimal

  36. CFG Chart • CNF tree = set of two types of parts: • Constituents (A, s, e) • CF-rules (A  B C, s, m, e)

  37. CFG Inference LP root inside outside Has integral solutions z

  38. Matching Inference LP En vertu de les nouvelles propositions , quel est le coût prévu de perception de le droits ? k What is the anticipated cost of collecting fees under the new proposal ? degree j Has integral solutions z

  39. LP Duality • Linear programming duality • Variables  constraints • Constraints  variables • Optimal values are the same • When both feasible regions are bounded

  40. Min-max Formulation LP duality

  41. Min-max formulation summary • Formulation produces concise QP for • Low-treewidth Markov networks • Associative MNs (K=2) • Context free grammars • Bipartite matchings • Approximate for untriangulated MNs, AMNs with K>2 *Taskar et al 04

  42. Unfactored Primal/Dual QP duality Exponentially many constraints/variables

  43. Factored Primal/Dual By QP duality Dual inherits structure from problem-specific inference LP Variables  correspond to a decomposition of  variables of the flat case

  44. The Connection b c a r e 2 .2 b r o r e 2 .15 b r o c e .25 1 b r a c e .4 0 r c a 1 1 .65 .8 .6 e b  c r o .4 .35 .2

  45. Duals and Kernels • Kernel trick works: • Factored dual • Local functions (log-potentials) can use kernels

  46. Alternatives: Perceptron • Simple iterative method • Unstable for structured output: fewer instances, big updates • May not converge if non-separable • Noisy • Voted / averaged perceptron [Freund & Schapire 99, Collins 02] • Regularize / reduce variance by aggregating over iterations

  47. Alternatives: Constraint Generation • Add most violated constraint • Handles more general loss functions • Only polynomial # of constraints needed • Need to re-solve QP many times • Worst case # of constraints larger than factored [Collins 02; Altun et al, 03; Tsochantaridis et al, 04]

  48. raw pixels quadratic kernel cubic kernel Handwriting Recognition Length: ~8 chars Letter: 16x8 pixels 10-fold Train/Test 5000/50000 letters 600/6000 words Models: Multiclass-SVMs* CRFs M3 nets 30 better 25 20 Test error (average per-character) 15 10 45% error reduction over linear CRFs 33% error reduction over multiclass SVMs 5 0 MC–SVMs M^3 nets CRFs *Crammer & Singer 01

  49. Hypertext Classification • WebKB dataset • Four CS department websites: 1300 pages/3500 links • Classify each page: faculty, course, student, project, other • Train on three universities/test on fourth better relaxed dual 53% errorreduction over SVMs 38% error reduction over RMNs loopy belief propagation *Taskar et al 02

  50. 3D Mapping Data provided by: Michael Montemerlo & Sebastian Thrun Laser Range Finder GPS IMU Label: ground, building, tree, shrub Training: 30 thousand points Testing: 3 million points

More Related