
Multiclass Classification in NLP


Presentation Transcript


  1. Multiclass Classification in NLP • Named Entity Recognition • Label people, locations, and organizations in a sentence • [PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress]. • Decompose into sub-problems • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1) • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → None (0) • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → LOC (2) • Many problems in NLP are decomposed this way • Disambiguation tasks • POS tagging • Word-sense disambiguation • Verb classification • Semantic-role labeling

  2. Outline • Multi-Categorical Classification Tasks • example: Semantic Role Labeling (SRL) • Decomposition Approaches • Constraint Classification • Unifies learning of multi-categorical classifiers • Structured-Output Learning • revisit SRL • Decomposition versus Constraint Classification • Goal: • Discuss multi-class and structured output from the same perspective. • Discuss similarities and differences

  3. Multi-Categorical Output Tasks • Multi-class Classification (y ∈ {1,...,K}) • character recognition ('6') • document classification ('homepage') • Multi-label Classification (y ⊆ {1,...,K}) • document classification ('(homepage, facultypage)') • Category Ranking (y is a ranking over {1,...,K}) • user preference ('love > like > hate') • document classification ('homepage > facultypage > sports') • Hierarchical Classification (y ⊆ {1,...,K}, coherent with a class hierarchy) • place a document into an index where 'soccer' is-a 'sport'

  4. (more) Multi-Categorical Output Tasks • Sequential Prediction (y ∈ {1,...,K}+) • e.g. POS tagging ('(N V N N A)') • "This is a sentence." → D V N D • e.g. phrase identification • Many labels: K^L for a length-L sentence • Structured Output Prediction (y ∈ C({1,...,K}+)) • e.g. parse tree, multi-level phrase identification • e.g. sequential prediction • Constrained by domain, problem, data, background knowledge, etc.

  5. Semantic Role Labeling: A Structured-Output Problem • Example: "I left my pearls to my daughter-in-law in my will." with roles A0: leaver, A1: thing left, A2: benefactor, AM-LOC • For each verb in a sentence • Identify all constituents that fill a semantic role • Determine their roles • Core arguments, e.g., Agent, Patient or Instrument • Their adjuncts, e.g., Locative, Temporal or Manner

  6. Semantic Role Labeling • "I left my pearls to my daughter-in-law in my will." labeled A0 - A1 - A2 - AM-LOC • Many possible valid outputs • Many possible invalid outputs

  7. Structured Output Problems • Multi-Class • View y=4 as (y1,...,yk) = ( 0 0 0 1 0 0 0 ) • The output is restricted by “Exactly one of yi=1” • Learn f1(x),..,fk(x) • Sequence Prediction • e.g. POS tagging: x = (My name is Dav) y = (Pr,N,V,N) • e.g. restriction: “Every sentence must have a verb” • Structured Output • Arbitrary global constraints • Local functions do not have access to global constraints! • Goal: • Discuss multi-class and structured output from the same perspective. • Discuss similarities and differences

  8. Transform the sub-problems • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1) • Transform each problem to a feature vector • Sam Houston, born in Virginia → (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...) → (0, 0, 1, 0, 1, 1, ...) • Transform each label to a class label • PER → 1 • LOC → 2 • ORG → 3 • ? → 0 • Input: {0,1}^d or R^d • Output: {0,1,2,3,...,k}
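A minimal sketch of this transformation in Python; the feature vocabulary, helper names, and the set of active features are illustrative placeholders, chosen to reproduce the vector and label shown on the slide.

```python
# Sketch: indicator features to a {0,1}^d vector plus an integer class label.
FEATURES = ["Bob-", "JOHN-", "SAM HOUSTON", "HAPPY", "-BORN", "--BORN"]
LABELS = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

def to_feature_vector(active):
    """Map the set of active indicator features to a {0,1}^d vector."""
    return [1 if f in active else 0 for f in FEATURES]

# "Sam Houston, born in Virginia ..." with candidate phrase "Sam Houston"
x = to_feature_vector({"SAM HOUSTON", "-BORN", "--BORN"})
y = LABELS["PER"]
print(x, y)  # [0, 0, 1, 0, 1, 1] 1
```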

  9. Solving multiclass with binary learning • Multiclass classifier • Function f : R^d → {1,2,3,...,k} • Decompose into binary problems • Not always possible to learn • No theoretical justification (unless the problem is easy)

  10. The Real Multiclass Problem • General framework • Extend binary algorithms • Theoretically justified • Provably correct • Generalizes well • Verified experimentally • Naturally extends binary classification algorithms to the multiclass setting • e.g. linear binary separation induces linear boundaries in the multiclass setting

  11. Multi Class over Linear Functions • One versus all (OvA) • All versus all (AvA) • Direct winner-take-all (D-WTA)

  12. WTA over linear functions • Assume examples are generated from a winner-take-all rule • y = argmax_i w_i · x + t_i • w_i, x ∈ R^n, t_i ∈ R • Note: Voronoi diagrams are WTA functions • argmin_i ||c_i − x|| = argmax_i (c_i · x − ||c_i||² / 2)
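A quick NumPy check of the Voronoi/WTA identity above; the prototype points are arbitrary random vectors.

```python
import numpy as np

# Since ||c_i - x||^2 = ||c_i||^2 - 2 c_i.x + ||x||^2, minimizing the distance
# to the prototypes c_i is the same as maximizing c_i.x - ||c_i||^2 / 2.
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))   # 5 prototype vectors c_i in R^3 (arbitrary)
x = rng.normal(size=3)

voronoi = np.argmin(np.linalg.norm(C - x, axis=1))
wta = np.argmax(C @ x - 0.5 * np.sum(C ** 2, axis=1))
assert voronoi == wta
print(voronoi, wta)
```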

  13. Learning via One-versus-All (OvA) Assumption • Find v_r, v_b, v_g, v_y ∈ R^n such that • v_r · x > 0 iff y = red • v_b · x > 0 iff y = blue • v_g · x > 0 iff y = green • v_y · x > 0 iff y = yellow • Classifier: f(x) = argmax_i v_i · x • H = R^{kn} • (figure: individual classifiers and decision regions)
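A minimal OvA training sketch, assuming a simple perceptron update for each binary sub-problem (the slide does not commit to a particular binary learner); the function names and the epoch count are illustrative.

```python
import numpy as np

def train_ova(X, y, k, epochs=10):
    """One-versus-all with a perceptron per class: v_i is trained to satisfy
    v_i.x > 0 iff the label is i."""
    V = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            for i in range(k):
                target = 1 if label == i else -1
                if target * (V[i] @ x) <= 0:   # mistake on the binary sub-problem
                    V[i] += target * x
    return V

def ova_predict(V, x):
    return int(np.argmax(V @ x))               # f(x) = argmax_i v_i . x
```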

  14. Learning via All-versus-All (AvA) Assumption • Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that • v_rb · x > 0 if y = red, < 0 if y = blue • v_rg · x > 0 if y = red, < 0 if y = green • ... (for all pairs) • H = R^{k(k-1)n/2} • How to classify? • (figure: individual classifiers and decision regions)

  15. Classifying with AvA • Tree • Majority vote (e.g. 1 red, 2 yellow, 2 green → ?) • Tournament • All are post-learning and might cause weird stuff
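A sketch of the majority-vote option, assuming the pairwise weight vectors from the previous slide are stored in a dict keyed by class pair; tree and tournament decoding would reuse the same pairwise decisions.

```python
import numpy as np

def ava_majority_vote(pairwise, x, k):
    """Majority-vote decoding for AvA. pairwise[(i, j)] is the weight vector of
    the i-vs-j classifier: v.x > 0 votes for class i, otherwise for class j."""
    votes = np.zeros(k, dtype=int)
    for (i, j), v in pairwise.items():
        votes[i if v @ x > 0 else j] += 1
    return int(np.argmax(votes))    # ties broken toward the lowest class index
```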

  16. Summary (1): Learning Binary Classifiers • On-Line: Perceptron, Winnow • Mistake bounded • Generalizes well (VC-Dim) • Works well in practice • SVM • Well motivated to maximize margin • Generalizes well • Works well in practice • Boosting, Neural Networks, etc...

  17. From Binary to Multi-categorical • Decompose multi-categorical problems • into multiple (independent) binary problems • Multi-class: OvA, AvA, ECOC, DT, etc. • Multi-label: reduce to multi-class • Category Ranking: reduce to multi-class, or use regression • Sequence Prediction: • reduce to multi-class • part/alphabet-based decompositions • Structured Output: • learn parts of the output based on local information!!!

  18. Problems with Decompositions • Learning optimizes over local metrics • Poor global performance • What is the metric? • We don't care about the performance of the local classifiers • Poor decomposition → poor performance • Difficult local problems • Irrelevant local problems • Not clear how to decompose all multi-categorical problems

  19. Multi-class OvA Decomposition: a Linear Representation • Hypothesis: h(x) = argmax_i v_i · x • Decomposition • Each class is represented by a linear function v_i · x • Learning: One-versus-All (OvA) • For each class i: v_i · x > 0 iff i = y • General case • Each class represented by a function f_i(x) > 0

  20. Learning via One-versus-All (OvA) Assumption • Classifier: f(x) = argmax_i v_i · x • OvA learning: find v_i · x > 0 iff y = i • OvA is fine only if the data is OvA separable! • A linear classifier can represent this function! • (Voronoi) argmin_i d(c_i, x) = (WTA) argmax_i c_i · x + d_i • (figure: individual classifiers)

  21. Other Issues we Mentioned • Error Correcting Output Codes • Another (class of) decomposition • Difficulty: how to make sure that the resulting problems are separable. • Commented on the advantage of All vs. All when working with the dual space (e.g., kernels)

  22. Example: SNoW Multi-class Classifier • (figure: network of targets (each an LTU), weighted edges (weight vectors), and features) • How do we train? How do we evaluate? • SNoW only represents the targets and weighted edges

  23. Winnow: Extensions • Winnow learns monotone Boolean functions • To learn non-monotone Boolean functions: • For each variable x, introduce x' = ¬x • Learn monotone functions over 2n variables • To learn functions with real-valued inputs: • "Balanced Winnow" • 2 weights per variable; the effective weight is the difference • Update rule (see the sketch below)
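A sketch of the standard balanced Winnow step (two multiplicative weight vectors whose difference is the effective weight); the promotion rate alpha and threshold theta are placeholder values, and this is the textbook rule rather than necessarily the exact one from the slide.

```python
import numpy as np

def balanced_winnow_step(w_pos, w_neg, x, y, alpha=1.5, theta=1.0):
    """One mistake-driven step. The effective weight is w_pos - w_neg; x is a
    non-negative feature vector and y is +1 or -1. Rates are illustrative."""
    y_hat = 1 if (w_pos - w_neg) @ x >= theta else -1
    if y_hat != y:                     # update only on mistakes
        w_pos *= alpha ** (y * x)      # promote (or demote) the positive half
        w_neg *= alpha ** (-y * x)     # move the negative half the opposite way
    return w_pos, w_neg
```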

  24. An Intuition: Balanced Winnow • In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples. • Typically, we train each node separately (my/not-my example). • Rather, given an example we could say: this is more of a + example than a − example. • We compare the activation of the different target nodes (classifiers) on a given example. (This example is more class + than class −.)

  25. Constraint Classification • Can be viewed as a generalization of the balanced Winnow to the multi-class case • Unifies multi-class, multi-label, category-ranking • Reduces learning to a single binary learning task • Captures theoretical properties of binary algorithm • Experimentally verified • Naturally extends Perceptron, SVM, etc... • Do all of this by representing labels as a set of constraints or preferences among output labels.

  26. Multi-category to Constraint Classification • Multiclass • (x, A) → (x, (A>B, A>C, A>D)) • Multilabel • (x, (A, B)) → (x, (A>C, A>D, B>C, B>D)) • Label Ranking • (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1)) • Examples (x, y), y ∈ S_k • S_k: partial order over class labels {1,...,k} • defines a "preference" relation (>) for class labeling • Constraint classifier h: X → S_k
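A small sketch of these transformations, writing each preference i > j as a pair (i, j); class names are replaced by integer indices and the function names are illustrative.

```python
def multiclass_constraints(y, k):
    """Correct class y is preferred over every other class."""
    return [(y, j) for j in range(k) if j != y]

def multilabel_constraints(labels, k):
    """Every correct label is preferred over every incorrect label."""
    return [(i, j) for i in labels for j in range(k) if j not in labels]

def ranking_constraints(order):
    """A total order like 5 > 4 > 3 > 2 > 1 becomes adjacent preference pairs."""
    return list(zip(order, order[1:]))

print(multiclass_constraints(0, 4))          # A over B, C, D (as 0 over 1, 2, 3)
print(multilabel_constraints([0, 1], 4))     # A and B over C and D
print(ranking_constraints([5, 4, 3, 2, 1]))  # [(5, 4), (4, 3), (3, 2), (2, 1)]
```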

  27. Learning Constraint Classification: Kesler Construction • (figure: constraints 2>1, 2>3, 2>4 before and after the transformation) • Transform examples: i > j ⇒ f_i(x) − f_j(x) > 0 ⇒ w_i · x − w_j · x > 0 ⇒ W · X_i − W · X_j > 0 ⇒ W · (X_i − X_j) > 0 ⇒ W · X_ij > 0 • X_i = (0, x, 0, 0) ∈ R^{kd}, X_j = (0, 0, 0, x) ∈ R^{kd} • X_ij = X_i − X_j = (0, x, 0, −x) • W = (w_1, w_2, w_3, w_4) ∈ R^{kd}
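A sketch of the expansion itself: given a constraint i > j, place x in block i and −x in block j of a kd-dimensional vector, so that a single weight vector W = (w_1, ..., w_k) satisfies W · X_ij > 0 exactly when w_i · x > w_j · x. The function name is illustrative.

```python
import numpy as np

def kesler_vector(x, i, j, k):
    """X_ij in R^{kd}: x in block i, -x in block j, zeros elsewhere."""
    d = len(x)
    X = np.zeros(k * d)
    X[i * d:(i + 1) * d] = x
    X[j * d:(j + 1) * d] = -x
    return X

print(kesler_vector(np.array([1.0, 2.0]), 0, 2, 4))
# [ 1.  2.  0.  0. -1. -2.  0.  0.]
```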

  28. Kesler's Construction (1) • y = argmax_{i ∈ {r,b,g,y}} v_i · x • v_i, x ∈ R^n • Find v_r, v_b, v_g, v_y ∈ R^n such that • v_r · x > v_b · x • v_r · x > v_g · x • v_r · x > v_y · x • H = R^{kn}

  29. Kesler's Construction (2) • Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn} • Let 0^n be the n-dimensional zero vector • v_r · x > v_b · x ⇔ v · (x, −x, 0^n, 0^n) > 0 ⇔ v · (−x, x, 0^n, 0^n) < 0 • v_r · x > v_g · x ⇔ v · (x, 0^n, −x, 0^n) > 0 ⇔ v · (−x, 0^n, x, 0^n) < 0 • v_r · x > v_y · x ⇔ v · (x, 0^n, 0^n, −x) > 0 ⇔ v · (−x, 0^n, 0^n, x) < 0

  30. Kesler's Construction (3) • Let v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn} • x_ij = (0^{(i-1)n}, x, 0^{(k-i)n}) − (0^{(j-1)n}, x, 0^{(k-j)n}) ∈ R^{kn} (x in block i, −x in block j) • Given (x, y) ∈ R^n × {1,...,k} • For all j ≠ y • Add (x_yj, 1) to P+(x,y) • Add (−x_yj, −1) to P−(x,y) • P+(x,y) has k−1 positive examples (in R^{kn}) • P−(x,y) has k−1 negative examples (in R^{kn})

  31. Learning via Kesler's Construction • Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k} • Create • P+ = ∪ P+(x_i, y_i) • P− = ∪ P−(x_i, y_i) • Find v = (v_1, ..., v_k) ∈ R^{kn} such that • v · x separates P+ from P− • Output • f(x) = argmax_i v_i · x
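A sketch that puts the construction to work with a plain perceptron in R^{kn}; since the negative examples are mirror images of the positives, updating on the positive constraint vectors alone is enough, and the update is written block-wise so the expanded vectors never need to be materialized. Function names and the epoch count are illustrative.

```python
import numpy as np

def train_kesler(X, y, k, epochs=10):
    """Binary perceptron over the Kesler expansion, kept block-wise: for each
    (x, label) and each j != label, the positive example x_{label,j} should
    satisfy v . x_{label,j} = v_label.x - v_j.x > 0."""
    V = np.zeros((k, X.shape[1]))            # v = (v_1, ..., v_k), one row per block
    for _ in range(epochs):
        for x, label in zip(X, y):
            for j in range(k):
                if j != label and V[label] @ x - V[j] @ x <= 0:
                    V[label] += x            # perceptron update on x_{label,j},
                    V[j] -= x                # applied block by block
    return V

def kesler_predict(V, x):
    return int(np.argmax(V @ x))             # f(x) = argmax_i v_i . x
```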

  32. Constraint Classification • Examples (x, y), y ∈ S_k • S_k: partial order over class labels {1,...,k} • defines a "preference" relation (<) for class labels • e.g. Multiclass: 2<1, 2<3, 2<4, 2<5 • e.g. Multilabel: 1<3, 1<4, 1<5, 2<3, 2<4, 4<5 • Constraint classifier • f: X → S_k • f(x) is a partial order • f(x) is consistent with y if (i<j) ∈ y ⇒ (i<j) ∈ f(x)

  33. Implementation • Examples (x, y), y ∈ S_k • S_k: partial order over class labels {1,...,k} • defines a "preference" relation (>) for class labels • e.g. Multiclass: 2>1, 2>3, 2>4, 2>5 • Given an example labeled 2, the activation of target 2 on it should be larger than the activations of the other targets. • SNoW implementation: conservative. • Only the target node that corresponds to the correct label and the one with the highest activation are compared. • If both are the same target node: no change. • Otherwise, promote one and demote the other.
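A sketch of that conservative step, assuming Winnow-style multiplicative promotion and demotion; the rates are placeholders and the function name is illustrative.

```python
import numpy as np

def conservative_update(W, x, y, alpha=1.5, beta=0.5):
    """W[i] is the weight vector of target i, x a non-negative feature vector,
    y the correct label. Only the correct target and the current winner change."""
    winner = int(np.argmax(W @ x))
    if winner != y:             # same node: no change
        W[y] *= alpha ** x      # promote the correct target
        W[winner] *= beta ** x  # demote the wrongly winning target
    return W
```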

  34. Properties of Construction • Can learn any argmax_i v_i · x function • Can use any algorithm to find the linear separation • Perceptron algorithm • ultraconservative online algorithm [Crammer, Singer 2001] • Winnow algorithm • multiclass Winnow [Mesterharm 2000] • Defines a multiclass margin • via the binary margin in R^{kd} • multiclass SVM [Crammer, Singer 2001]

  35. Margin Generalization Bounds • Linear hypothesis space: • h(x) = argsort_i v_i · x • v_i, x ∈ R^d • argsort returns a permutation of {1,...,k} • CC margin-based bound • γ = min_{(x,y) ∈ S} min_{(i<j) ∈ y} v_i · x − v_j · x • m - number of examples • R - max_x ||x|| • δ - confidence • C - average number of constraints

  36. VC-style Generalization Bounds • Linear hypothesis space: • h(x) = argsort_i v_i · x • v_i, x ∈ R^d • argsort returns a permutation of {1,...,k} • CC VC-based bound • m - number of examples • d - dimension of the input space • δ - confidence • k - number of classes

  37. Beyond Multiclass Classification • Ranking • category ranking (over classes) • ordinal regression (over examples) • Multilabel • x is both red and blue • Complex relationships • x is more red than blue, but not green • Millions of classes • sequence labeling (e.g. POS tagging) LATER • SNoW has an implementation of Constraint Classification for the Multi-Class case. Try to compare with 1-vs-all. • Experimental Issues: when is this version of multi-class better? • Several easy improvements are possible via modifying the loss function.

  38. Multi-class Experiments • (figure: experimental results) • The picture isn't so clear for very high-dimensional problems. Why?

  39. Summary

  40. Structured Output Learning • Abstract View: • Decomposition versus Constraint Classification • More details: Inference with Classifiers

  41. Structured Output Learning: Semantic Role Labeling • (figure: parse of "I left my pearls to my child" with roles A0: leaver, A1: thing left, A2: benefactor) • For each verb in a sentence • Identify all constituents that fill a semantic role • Determine their roles • Core arguments, e.g., Agent, Patient or Instrument • Their adjuncts, e.g., Locative, Temporal or Manner • Y: all possible ways to label the tree • C(Y): all valid ways to label the tree • argmax_{y ∈ C(Y)} g(x, y)

  42. Components of Structured Output Learning • (figure: output variables y1, y2, y3 over "I left my pearls to my child") • Input: X • Output: a collection of variables • Y = (y_1, ..., y_L) ∈ {1,...,K}^L • Length is example-dependent • Constraints on the output C(Y) • e.g. non-overlapping, no repeated values... • partition outputs into valid and invalid assignments • Representation • scoring function: g(x, y) • e.g. linear: g(x, y) = w · Φ(x, y) • Inference • h(x) = argmax_{valid y} g(x, y)
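A brute-force sketch of the inference step for a short sequence, assuming the score decomposes over positions and the constraints are given as a validity predicate; the scores, constraint, and sizes below are toy placeholders, and real systems replace the enumeration with dynamic programming or integer programming.

```python
import itertools
import numpy as np

def inference(score_local, is_valid, L, K):
    """h(x) = argmax over valid y of g(x, y), with g decomposed per position.
    Exponential in L -- only meant to make the definition concrete."""
    best, best_score = None, float("-inf")
    for y in itertools.product(range(K), repeat=L):
        if not is_valid(y):                       # enforce C(Y)
            continue
        s = sum(score_local(n, label) for n, label in enumerate(y))
        if s > best_score:
            best, best_score = y, s
    return best

# Toy example: 3 positions, 2 labels, constraint "no two adjacent labels repeat".
scores = np.array([[0.2, 1.0], [0.5, 0.4], [0.9, 0.1]])
y_hat = inference(lambda n, l: scores[n, l],
                  lambda y: all(a != b for a, b in zip(y, y[1:])), 3, 2)
print(y_hat)  # (1, 0, 1)
```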

  43. Decomposition-based Learning • Many choices for decomposition • Depends on the problem, learning model, computational resources, etc. • Value-based decomposition • A function for each output value • f_k(x, l), k ∈ {1,...,K} • e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ... • OvA learning • f_k(x, node) > 0 iff k = y

  44. Learning Discriminant Functions: The General Setting • g(x, y) > g(x, y') ∀ y' ∈ Y \ y • w · Φ(x, y) > w · Φ(x, y') ∀ y' ∈ Y \ y • w · Φ(x, y, y') = w · (Φ(x, y) − Φ(x, y')) > 0 • P(x, y) = {Φ(x, y, y')}_{y' ∈ Y \ y} • P(S) = {P(x, y)}_{(x, y) ∈ S} • Learn a unary (binary) classifier over P(S): (+P(S), −P(S)) • Used in many works [C02, WW00, CS01, CM03, TGK03]

  45. Structured Output Learning: Semantic Role Labeling • (figure: per-node scores such as score_NONE(3) and score_A2(13) over "I left my pearls to my child") • Learn a collection of "scoring" functions • w_A0 · Φ_A0(x, y, n), w_A1 · Φ_A1(x, y, n), ... • score_v(x, y, n) = w_v · Φ_v(x, y, n) • Global score • g(x, y) = Σ_n score_{y_n}(x, y, n) = Σ_n w_{y_n} · Φ_{y_n}(x, y, n) • Learn locally (LO, L+I) • for each label variable (node), e.g. n = A0 • g_A0(x, y, n) = w_A0 · Φ_A0(x, y, n) > 0 iff y_n = A0 • The discriminant model dictates: • g(x, y) > g(x, y'), ∀ y' ∈ C(Y) • argmax_{y ∈ C(Y)} g(x, y) • Learn globally (IBT) • g(x, y) = w · Φ(x, y)
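A sketch of the global score assembled from the per-label pieces; the local feature map phi and the toy weights are placeholders (any function returning a vector of the right dimension for each label works here).

```python
import numpy as np

def global_score(weights, phi, x, y):
    """g(x, y) = sum_n w_{y_n} . Phi_{y_n}(x, y, n): weights[v] is the weight
    vector for label v, phi(v, x, y, n) its local feature map at node n."""
    return sum(weights[y_n] @ phi(y_n, x, y, n) for n, y_n in enumerate(y))

# Toy check with 3 labels, 2-dimensional local features, 2 nodes.
weights = np.ones((3, 2))
phi = lambda v, x, y, n: np.array([1.0, float(n)])   # placeholder feature map
print(global_score(weights, phi, x=None, y=[0, 2]))  # 1.0 + 2.0 = 3.0

# LO / L+I train each weights[v] from local decisions (w_v . Phi_v > 0 iff y_n = v);
# IBT trains them jointly so that g(x, y) beats g(x, y') for all valid y'.
```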

  46. Summary: Multi-class and Structured Output
