Multiclass Classification in NLP • Named Entity Recognition • Label people, locations, and organizations in a sentence • [PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress]. • Decompose into sub-problems • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1) • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → None (0) • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → LOC (2) • Many problems in NLP are decomposed this way • Disambiguation tasks • POS tagging • Word-sense disambiguation • Verb classification • Semantic-role labeling
Outline • Multi-Categorical Classification Tasks • example: Semantic Role Labeling (SRL) • Decomposition Approaches • Constraint Classification • Unifies learning of multi-categorical classifiers • Structured-Output Learning • revisit SRL • Decomposition versus Constraint Classification • Goal: • Discuss multi-class and structured output from the same perspective. • Discuss similarities and differences
Multi-Categorical Output Tasks • Multi-class Classification (y ∈ {1,...,K}) • character recognition (‘6’) • document classification (‘homepage’) • Multi-label Classification (y ⊆ {1,...,K}) • document classification (‘homepage, facultypage’) • Category Ranking (y is a ranking over {1,...,K}) • user preference (‘love > like > hate’) • document classification (‘homepage > facultypage > sports’) • Hierarchical Classification (y ⊆ {1,...,K}, must cohere with the class hierarchy) • place a document into an index where ‘soccer’ is-a ‘sport’
(more) Multi-Categorical Output Tasks • Sequential Prediction (y ∈ {1,...,K}^+) • e.g. POS tagging (‘N V N N A’) • “This is a sentence.” → D V D N • e.g. phrase identification • Many labels: K^L for a sentence of length L • Structured Output Prediction (y ∈ C({1,...,K}^+)) • e.g. parse tree, multi-level phrase identification • e.g. sequential prediction • Constrained by domain, problem, data, background knowledge, etc...
Semantic Role Labeling: A Structured-Output Problem • For each verb in a sentence • Identify all constituents that fill a semantic role • Determine their roles • Core arguments, e.g., Agent, Patient, or Instrument • Their adjuncts, e.g., Locative, Temporal, or Manner • Example: “I left my pearls to my daughter-in-law in my will.” (A0: leaver, A1: thing left, A2: benefactor, AM-LOC)
Semantic Role Labeling • I left my pearls to my daughter-in-law in my will. • Many possible valid outputs (e.g., A0 A1 A2 AM-LOC) • Many possible invalid outputs
Structured Output Problems • Multi-Class • View y=4 as (y1,...,yk) = ( 0 0 0 1 0 0 0 ) • The output is restricted by “Exactly one of yi=1” • Learn f1(x),..,fk(x) • Sequence Prediction • e.g. POS tagging: x = (My name is Dav) y = (Pr,N,V,N) • e.g. restriction: “Every sentence must have a verb” • Structured Output • Arbitrary global constraints • Local functions do not have access to global constraints! • Goal: • Discuss multi-class and structured output from the same perspective. • Discuss similarities and differences
Transform the sub-problems • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1) • Transform each problem to a feature vector • Sam Houston, born in Virginia → (BOB-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...) → (0, 0, 1, 0, 1, 1, ...) • Transform each label to a class label • PER → 1 • LOC → 2 • ORG → 3 • ? → 0 • Input: {0,1}^d or R^d • Output: {0, 1, 2, 3, ..., k}
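A minimal sketch of this transformation (the feature names, extractor, and label map below are hypothetical, chosen only to mirror the example above):

```python
# Hypothetical feature inventory and label map for the NER sub-problem above.
FEATURES = ["BOB-", "JOHN-", "SAM-HOUSTON", "HAPPY", "-BORN", "--BORN"]
LABELS = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

def extract_features(active):
    """Map the set of active features to a binary vector in {0,1}^d."""
    return [1 if f in active else 0 for f in FEATURES]

x = extract_features({"SAM-HOUSTON", "-BORN", "--BORN"})  # -> [0, 0, 1, 0, 1, 1]
y = LABELS["PER"]                                         # -> 1
```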
Solving multiclass with binary learning • Multiclass classifier • Function f : R^d → {1, 2, 3, ..., k} • Decompose into binary problems • Not always possible to learn • No theoretical justification (unless the problem is easy)
The Real Multiclass Problem • General framework • Extend binary algorithms • Theoretically justified • Provably correct • Generalizes well • Verified experimentally • Naturally extends binary classification algorithms to the multiclass setting • e.g. linear binary separation induces linear boundaries in the multiclass setting
Multi Class over Linear Functions • One versus all (OvA) • All versus all (AvA) • Direct winner-take-all (D-WTA)
WTA over linear functions • Assume examples are generated from a winner-take-all hypothesis • y = argmax_i w_i·x + t_i • w_i, x ∈ R^n, t_i ∈ R • Note: Voronoi diagrams are WTA functions • argmin_i ||c_i – x|| = argmax_i c_i·x – ||c_i||² / 2
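Expanding the squared distance makes the Voronoi-to-WTA identity explicit: the ||x||² term is shared by every prototype c_i and therefore drops out of the arg min.

```latex
\arg\min_i \|c_i - x\|^2
  = \arg\min_i \left( \|c_i\|^2 - 2\, c_i \cdot x + \|x\|^2 \right)
  = \arg\max_i \left( c_i \cdot x - \tfrac{1}{2}\|c_i\|^2 \right)
```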
Learning via One-Versus-All (OvA) Assumption • Find v_r, v_b, v_g, v_y ∈ R^n such that • v_r·x > 0 iff y = red • v_b·x > 0 iff y = blue • v_g·x > 0 iff y = green • v_y·x > 0 iff y = yellow • Classifier: f(x) = argmax_i v_i·x • H = R^{kn} • (figure: individual classifiers vs. decision regions)
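A minimal sketch of OvA over linear functions (illustrative only, not the original course code; each binary sub-problem is trained with a simple perceptron-style rule and the final decision is the argmax):

```python
import numpy as np

def train_ova(X, y, k, epochs=10):
    """X: N x d data matrix, y: labels in {0,...,k-1}. Returns one weight vector per class."""
    V = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for x_i, label in zip(X, y):
            for c in range(k):
                target = 1 if c == label else -1   # class c versus the rest
                if target * (V[c] @ x_i) <= 0:     # mistake on the binary sub-problem
                    V[c] += target * x_i
    return V

def predict_ova(V, x):
    return int(np.argmax(V @ x))                   # f(x) = argmax_i v_i . x
```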
Learning via All-versus-All (AvA) Assumption • Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^n such that • v_rb·x > 0 if y = red, < 0 if y = blue • v_rg·x > 0 if y = red, < 0 if y = green • ... (for all pairs) • How to classify? • H = R^{k(k–1)n/2} (one weight vector per pair) • (figure: individual classifiers vs. decision regions)
Classifying with AvA • Tournament (tree of pairwise matches) • Majority vote (e.g., 1 red, 2 yellow, 2 green → ?) • All of these are post-learning heuristics and might cause weird behavior
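A small sketch of the majority-vote option (illustrative; it assumes one trained weight vector per unordered class pair, stored as V_pairs[(i, j)] separating class i (+) from class j (–)):

```python
import numpy as np
from itertools import combinations

def predict_ava(V_pairs, x, k):
    votes = np.zeros(k, dtype=int)
    for (i, j), v in V_pairs.items():
        votes[i if v @ x > 0 else j] += 1   # each pairwise classifier casts one vote
    return int(np.argmax(votes))            # majority vote; ties go to the lowest index

# Degenerate usage example with untrained (zero) classifiers: every vote goes to j.
d, k = 5, 4
V_pairs = {pair: np.zeros(d) for pair in combinations(range(k), 2)}
print(predict_ava(V_pairs, np.ones(d), k))  # prints 3 in this degenerate setup
```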
Summary (1): Learning Binary Classifiers • On-Line: Perceptron, Winnow • Mistake bounded • Generalizes well (VC-Dim) • Works well in practice • SVM • Well motivated to maximize margin • Generalizes well • Works well in practice • Boosting, Neural Networks, etc...
From Binary to Multi-categorical • Decompose multi-categorical problems • into multiple (independent) binary problems • Multi-class: OvA, AvA, ECOC, DT, etc... • Multi-label: reduce to multi-class • Category Ranking: reduce, or treat as regression • Sequence Prediction: • reduce to multi-class • part/alphabet based decompositions • Structured Output: • learn parts of the output based on local information!!!
Problems with Decompositions • Learning optimizes over local metrics • Poor global performance • What is the metric? • We don’t care about the performance of the local classifiers • Poor decomposition → poor performance • Difficult local problems • Irrelevant local problems • Not clear how to decompose all multi-category problems
Multi-class OvA Decomposition: a Linear Representation • Hypothesis: h(x) = argmax_i v_i·x • Decomposition • Each class represented by a linear function v_i·x • Learning: One-versus-all (OvA) • For each class i: v_i·x > 0 iff i = y • General case • Each class represented by a function f_i(x) > 0
Learning via One-Versus-All (OvA) Assumption • Classifier: f(x) = argmax_i v_i·x • OvA learning: find v_i such that v_i·x > 0 iff y = i • OvA is fine only if the data is OvA separable! • A linear classifier can represent this function! • (Voronoi) argmin_i d(c_i, x) ≡ (WTA) argmax_i c_i·x + d_i
Other Issues we Mentioned • Error Correcting Output Codes • Another (class of) decomposition • Difficulty: how to make sure that the resulting problems are separable. • Commented on the advantage of All vs. All when working with the dual space (e.g., kernels)
Example: SNoW Multi-class Classifier • How do we train? How do we evaluate? • (network figure: targets, each an LTU; weighted edges, i.e. weight vectors; features) • SNoW only represents the targets and weighted edges
Winnow: Extensions • Winnow learns monotone Boolean functions • To learn non-monotone Boolean functions: • For each variable x, introduce x’ = ¬x • Learn monotone functions over 2n variables • To learn functions with real-valued inputs: • “Balanced Winnow” • 2 weights per variable; the effective weight is their difference • Update rule: (sketched below)
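The update rule itself is not spelled out here; the following is a minimal sketch of one standard Balanced Winnow step (the promotion factor alpha and threshold theta are assumed values, not taken from the slides):

```python
import numpy as np

def balanced_winnow_update(w_plus, w_minus, x, y, theta=1.0, alpha=1.5):
    """One mistake-driven step. x: non-negative features, y in {-1, +1}.
    The effective weight vector is w_plus - w_minus."""
    y_hat = 1 if (w_plus - w_minus) @ x >= theta else -1
    if y_hat != y:
        w_plus = w_plus * alpha ** (y * x)      # multiplicative promotion/demotion
        w_minus = w_minus * alpha ** (-y * x)   # ...and the opposite on the negative vector
    return w_plus, w_minus
```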
An Intuition: Balanced Winnow • In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples. • Typically, we train each node separately (my / not-my example). • Rather, given an example we could say: this is more of a + example than a – example. • We compare the activation of the different target nodes (classifiers) on a given example. (This example is more class + than class –.)
Constraint Classification • Can be viewed as a generalization of the balanced Winnow to the multi-class case • Unifies multi-class, multi-label, category-ranking • Reduces learning to a single binary learning task • Captures theoretical properties of binary algorithm • Experimentally verified • Naturally extends Perceptron, SVM, etc... • Do all of this by representing labels as a set of constraints or preferences among output labels.
Multi-category to Constraint Classification • Multiclass • (x, A) → (x, (A>B, A>C, A>D)) • Multilabel • (x, (A, B)) → (x, (A>C, A>D, B>C, B>D)) • Label Ranking • (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1)) • Examples (x, y), y ∈ S_k • S_k: partial order over class labels {1,...,k} • defines a “preference” relation ( > ) for class labeling • Constraint Classifier h: X → S_k
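A small sketch of these transformations (classes encoded as integers 0..k–1; a pair (i, j) stands for the preference i > j):

```python
def multiclass_constraints(y, k):
    """Label y -> prefer y over every other class."""
    return {(y, j) for j in range(k) if j != y}

def multilabel_constraints(labels, k):
    """Every relevant label is preferred over every irrelevant one."""
    relevant = set(labels)
    return {(i, j) for i in relevant for j in range(k) if j not in relevant}

multiclass_constraints(0, 4)       # {(0,1), (0,2), (0,3)}  ~  A>B, A>C, A>D
multilabel_constraints({0, 1}, 4)  # {(0,2), (0,3), (1,2), (1,3)}
```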
Learning Constraint Classification: Kesler Construction • (figure: constraints 2>1, 2>3, 2>4 for an example labeled 2) • Transform examples: i>j ⇒ f_i(x) – f_j(x) > 0 ⇒ w_i·x – w_j·x > 0 ⇒ W·X_i – W·X_j > 0 ⇒ W·(X_i – X_j) > 0 ⇒ W·X_ij > 0 • X_i = (0, x, 0, 0) ∈ R^{kd} • X_j = (0, 0, 0, x) ∈ R^{kd} • X_ij = X_i – X_j = (0, x, 0, –x) ∈ R^{kd} • W = (w_1, w_2, w_3, w_4) ∈ R^{kd}
Kesler’s Construction (1) • y = argmax_{i ∈ {r,b,g,y}} v_i·x • v_i, x ∈ R^n • Find v_r, v_b, v_g, v_y ∈ R^n such that • v_r·x > v_b·x • v_r·x > v_g·x • v_r·x > v_y·x • H = R^{kn}
Kesler’s Construction (2) • Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn} • Let 0^n be the n-dim zero vector • v_r·x > v_b·x ⇔ v·(x, –x, 0^n, 0^n) > 0 ⇔ v·(–x, x, 0^n, 0^n) < 0 • v_r·x > v_g·x ⇔ v·(x, 0^n, –x, 0^n) > 0 ⇔ v·(–x, 0^n, x, 0^n) < 0 • v_r·x > v_y·x ⇔ v·(x, 0^n, 0^n, –x) > 0 ⇔ v·(–x, 0^n, 0^n, x) < 0
Kesler’s Construction (3) • Let v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn} • x_ij = (0^{(i–1)n}, x, 0^{(k–i)n}) – (0^{(j–1)n}, x, 0^{(k–j)n}) ∈ R^{kn} (i.e., +x in block i, –x in block j) • Given (x, y) ∈ R^n × {1,...,k} • For all j ≠ y • add (x_yj, +1) to P+(x,y) • add (–x_yj, –1) to P–(x,y) • P+(x,y) has k–1 positive examples (in R^{kn}) • P–(x,y) has k–1 negative examples (in R^{kn})
Learning via Kesler’s Construction • Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k} • Create • P+ = ∪_i P+(x_i, y_i) • P– = ∪_i P–(x_i, y_i) • Find v = (v_1, ..., v_k) ∈ R^{kn} such that • v·x separates P+ from P– • Output • f(x) = argmax_i v_i·x
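A sketch of the whole pipeline under these definitions (illustrative: a plain binary perceptron run in the expanded R^{kn} space; only the positive constraint vectors are used, since the negative ones are their mirror images):

```python
import numpy as np

def kesler_expand(x, i, j, k):
    """Build x_ij in R^{kn}: +x in block i, -x in block j."""
    n = len(x)
    z = np.zeros(k * n)
    z[i*n:(i+1)*n] = x
    z[j*n:(j+1)*n] = -x
    return z

def train_kesler_perceptron(X, y, k, epochs=10):
    n = X.shape[1]
    v = np.zeros(k * n)                      # v = (v_1, ..., v_k)
    for _ in range(epochs):
        for x_i, label in zip(X, y):
            for j in range(k):
                if j == label:
                    continue
                z = kesler_expand(x_i, label, j, k)
                if v @ z <= 0:               # violated constraint v_y.x > v_j.x
                    v += z                   # one binary perceptron update
    return v.reshape(k, n)                   # rows are v_1, ..., v_k

def predict(V, x):
    return int(np.argmax(V @ x))             # f(x) = argmax_i v_i . x
```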
Constraint Classification • Examples (x, y) • y ∈ S_k • S_k: partial order over class labels {1,...,k} • defines a “preference” relation ( > ) for class labels • e.g. Multiclass: 2>1, 2>3, 2>4, 2>5 • e.g. Multilabel: 1>3, 1>4, 1>5, 2>3, 2>4, 2>5 • Constraint Classifier • f: X → S_k • f(x) is a partial order • f(x) is consistent with y if (i>j) ∈ y ⇒ (i>j) ∈ f(x)
Implementation • Examples (x, y) • y ∈ S_k • S_k: partial order over class labels {1,...,k} • defines a “preference” relation ( > ) for class labels • e.g. Multiclass: 2>1, 2>3, 2>4, 2>5 • Given an example labeled 2, the activation of target 2 should be larger than the activation of the other targets. • SNoW implementation: conservative. • Only the target node that corresponds to the correct label and the one with the highest activation are compared. • If both are the same target node – no change. • Otherwise, promote one and demote the other (sketched below).
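A sketch of this conservative policy (an interpretation only, written with additive perceptron-style steps; SNoW's actual targets use Winnow-type multiplicative updates):

```python
import numpy as np

def conservative_update(V, x, y, eta=1.0):
    """V: k x d matrix of target weight vectors, y: index of the correct target."""
    top = int(np.argmax(V @ x))   # target node with the highest activation
    if top != y:                  # if it is already the correct target, change nothing
        V[y] += eta * x           # promote the correct target
        V[top] -= eta * x         # demote the current winner
    return V
```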
Properties of Construction • Can learn any argmax_i v_i·x function • Can use any algorithm to find a linear separation • Perceptron algorithm • ultraconservative online algorithm [Crammer, Singer 2001] • Winnow algorithm • multiclass Winnow [Mesterharm 2000] • Defines a multiclass margin • via the binary margin in R^{kd} • multiclass SVM [Crammer, Singer 2001]
Margin Generalization Bounds • Linear hypothesis space: • h(x) = argsort_i v_i·x • v_i, x ∈ R^d • argsort returns a permutation of {1,...,k} • CC margin-based bound • margin: γ = min_{(x,y)∈S} min_{(i>j)∈y} v_i·x – v_j·x • m – number of examples • R – max_x ||x|| • δ – confidence • C – average # constraints
VC-style Generalization Bounds • Linear hypothesis space: • h(x) = argsort_i v_i·x • v_i, x ∈ R^d • argsort returns a permutation of {1,...,k} • CC VC-based bound • m – number of examples • d – dimension of the input space • δ – confidence • k – number of classes
Beyond Multiclass Classification • Ranking • category ranking (over classes) • ordinal regression (over examples) • Multilabel • x is both red and blue • Complex relationships • x is more red than blue, but not green • Millions of classes • sequence labeling (e.g. POS tagging) LATER • SNoW has an implementation of Constraint Classification for the Multi-Class case. Try to compare with 1-vs-all. • Experimental Issues: when is this version of multi-class better? • Several easy improvements are possible via modifying the loss function.
Multi-class Experiments • (results figure) • The picture isn’t so clear for very high-dimensional problems. Why?
Structured Output Learning • Abstract View: • Decomposition versus Constraint Classification • More details: Inference with Classifiers
Structured Output Learning: Semantic Role Labeling • For each verb in a sentence • Identify all constituents that fill a semantic role • Determine their roles • Core arguments, e.g., Agent, Patient, or Instrument • Their adjuncts, e.g., Locative, Temporal, or Manner • Y: all possible ways to label the tree • C(Y): all valid ways to label the tree • argmax_{y ∈ C(Y)} g(x, y) • (parse-tree figure for “I left my pearls to my child”; roles A0: leaver, A1: thing left, A2: benefactor)
Components of Structured Output Learning • Input: X • Output: a collection of variables • Y = (y_1,...,y_L) ∈ {1,...,K}^L • Length is example dependent • Constraints on the output, C(Y) • e.g. non-overlapping, no repeated values... • partition outputs into valid and invalid assignments • Representation • scoring function g(x, y) • e.g. linear: g(x, y) = w·Φ(x, y) • Inference • h(x) = argmax_{valid y} g(x, y) • (figure: output variables y1, y2, y3 over “I left my pearls to my child”)
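A minimal sketch of these components (the feature map phi and the validity test are hypothetical; brute-force enumeration stands in for real inference, which would use dynamic programming or an ILP):

```python
import numpy as np
from itertools import product

def inference(x, w, phi, K, L, is_valid):
    """h(x) = argmax over valid y of g(x, y) = w . phi(x, y)."""
    best_y, best_score = None, -np.inf
    for y in product(range(K), repeat=L):    # all K^L assignments of (y_1,...,y_L)
        if not is_valid(y):                  # the constraints C(Y) filter invalid outputs
            continue
        score = w @ phi(x, y)                # linear scoring function
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```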
Decomposition-based Learning • Many choices for decomposition • Depends on the problem, learning model, computational resources, etc... • Value-based decomposition • A function for each output value • f_k(x, l), k ∈ {1,...,K} • e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ... • OvA learning • f_k(x, node) > 0 iff k = y
Learning Discriminant Functions: The General Setting • g(x, y) > g(x, y')  ∀ y' ∈ Y \ y • w·Φ(x, y) > w·Φ(x, y')  ∀ y' ∈ Y \ y • w·Φ(x, y, y') = w·(Φ(x, y) – Φ(x, y')) > 0 • P(x, y) = {Φ(x, y, y')}, y' ∈ Y \ y • P(S) = {P(x, y)}, (x, y) ∈ S • Learn a unary (binary) classifier over P(S): (+P(S), –P(S)) • Used in many works [C02, WW00, CS01, CM03, TGK03]
Structured Output Learning: Semantic Role Labeling • Learn a collection of “scoring” functions • w_A0·Φ_A0(x, y, n), w_A1·Φ_A1(x, y, n), ... • score_v(x, y, n) = w_v·Φ_v(x, y, n) • Global score • g(x, y) = Σ_n score_{y_n}(x, y, n) = Σ_n w_{y_n}·Φ_{y_n}(x, y, n) • Learn locally (LO, L+I) • for each label variable (node), e.g. n = A0 • g_A0(x, y, n) = w_A0·Φ_A0(x, y, n) > 0 iff y_n = A0 • Discriminant model dictates: • g(x, y) > g(x, y'), ∀ y' ∈ C(Y) • argmax_{y ∈ C(Y)} g(x, y) • Learn globally (IBT) • g(x, y) = w·Φ(x, y) • (figure: node scores score_NONE(3), score_A2(13) over “I left my pearls to my child”)
Summary • Multi-class • Structured Output