CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are used in this lecture.
Outline
• Multi-Class classification
• Structured Prediction
• Models for Structured Prediction and Classification
• Example: POS tagging
Multiclass problems
• Most of the machinery we discussed before was focused on binary classification problems, e.g., the SVMs we covered so far
• However, most problems we encounter in NLP are either:
• MultiClass: e.g., text categorization
• Structured Prediction: e.g., predicting the syntactic structure of a sentence
• How do we deal with them?
Structured Perceptron
• Joint feature representation:
• Algorithm:
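A minimal sketch of the structured perceptron in Python; the names feature_fn, argmax_fn, and n_feats are assumptions, not from the slides. The learner needs a joint feature map f(x, y) and a decoder (e.g., Viterbi for sequences) that returns the highest-scoring structure under the current weights.

```python
import numpy as np

def structured_perceptron(train, feature_fn, argmax_fn, n_feats, epochs=10):
    """Structured perceptron (sketch).

    train      : list of (x, y_gold) pairs, y_gold the gold structure
    feature_fn : joint feature map f(x, y) -> np.ndarray of length n_feats
    argmax_fn  : decoder returning argmax_y  w . f(x, y), e.g. Viterbi
    """
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y_gold in train:
            y_hat = argmax_fn(w, x)          # best structure under current w
            if y_hat != y_gold:              # mistake-driven update
                w += feature_fn(x, y_gold) - feature_fn(x, y_hat)
    return w
```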
Max margin = Min Norm
• As before, these are equivalent formulations:
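The formulations on the slide were images; in the standard notation for the binary case discussed earlier, the equivalence is:

```latex
\max_{\|w\|=1}\ \gamma \ \ \text{s.t.}\ \ y_i\,(w \cdot x_i) \ge \gamma\ \ \forall i
\qquad\Longleftrightarrow\qquad
\min_{w}\ \tfrac{1}{2}\|w\|^2 \ \ \text{s.t.}\ \ y_i\,(w \cdot x_i) \ge 1\ \ \forall i
```

Maximizing the geometric margin of a unit-norm weight vector is the same problem as fixing the functional margin to 1 and minimizing the norm.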
Problems:
• Requires separability
• What if we have noise in the data?
• What if we have only a simple, limited feature space?
Multiclass -> Structured
• So far, we considered multiclass classification with 0-1 losses l(y, y')
• What if what we want to predict is:
• sequences of POS tags
• syntactic trees
• translations
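For structures, the 0-1 loss is too coarse: an output wrong at one position and an output wrong everywhere both count as a single error. A per-position (Hamming) loss distinguishes them; a small illustration (function names are mine, not from the slides):

```python
def zero_one_loss(y, y_prime):
    # 1 if the structures differ anywhere, else 0
    return int(y != y_prime)

def hamming_loss(y, y_prime):
    # number of positions where the two tag sequences disagree
    return sum(a != b for a, b in zip(y, y_prime))

gold    = ["DT", "NN", "VBZ"]
one_off = ["DT", "NN", "NN"]   # one wrong tag
all_off = ["NN", "VB", "DT"]   # every tag wrong

print(zero_one_loss(gold, one_off), zero_one_loss(gold, all_off))  # 1 1
print(hamming_loss(gold, one_off), hamming_loss(gold, all_off))    # 1 3
```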
Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; a similar formulation: Tsochantaridis et al., 2004
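The objective on the slide was an image; below is a sketch of the standard margin-rescaled form behind both papers, in which the required margin between the gold structure y_i and any competitor y grows with the structured loss l(y_i, y):

```latex
\min_{w,\ \xi \ge 0}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ \ge\ \ell(y_i, y) - \xi_i
\qquad \forall i,\ \forall y
```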
Solving MultiClass with binary learning
• MultiClass classifier: a function f : R^d -> {1, 2, 3, ..., k}
• Decompose the real problem into binary problems
• Not always possible to learn
• The binary scores may be on different scales
• No theoretical justification
Learning via One-Versus-All (OvA) Assumption
• Find v_r, v_b, v_g, v_y in R^n such that:
• v_r . x > 0 iff y = red
• v_b . x > 0 iff y = blue
• v_g . x > 0 iff y = green
• v_y . x > 0 iff y = yellow
• Classifier: f(x) = argmax_i v_i . x
• H = R^{kn}
(figures: individual classifiers; decision regions)
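A minimal OvA sketch in Python (NumPy); train_binary is an assumed helper that fits any binary linear classifier (e.g., a perceptron or linear SVM) and returns its weight vector:

```python
import numpy as np

def train_ova(X, y, classes, train_binary):
    # one weight vector per class: class k vs. the rest
    return {k: train_binary(X, np.where(y == k, 1, -1)) for k in classes}

def predict_ova(models, x):
    # f(x) = argmax_k  v_k . x
    return max(models, key=lambda k: models[k] @ x)
```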
Learning via All-Versus-All (AvA) Assumption
• Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy in R^d such that:
• v_rb . x > 0 if y = red, < 0 if y = blue
• v_rg . x > 0 if y = red, < 0 if y = green
• ... (for all pairs)
• H = R^{k(k-1)n/2}, one classifier per pair of classes
• How to classify?
(figures: individual classifiers; decision regions)
Classifying with AvA
• Tree
• Majority vote: e.g., 1 red, 2 yellow, 2 green -> ?
• Tournament
• All are applied post-learning and can produce inconsistent decisions
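A matching AvA sketch with the majority-vote decision rule (again, train_binary is an assumed binary trainer returning a weight vector):

```python
from collections import Counter
from itertools import combinations
import numpy as np

def train_ava(X, y, classes, train_binary):
    # one classifier per unordered pair (i, j): positive = class i, negative = class j
    models = {}
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = train_binary(X[mask], np.where(y[mask] == i, 1, -1))
    return models

def predict_ava(models, x):
    # each pairwise classifier votes for one of its two classes
    votes = Counter()
    for (i, j), v in models.items():
        votes[i if v @ x > 0 else j] += 1
    return votes.most_common(1)[0][0]
```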
POS Tagging
(table: the English POS tag set)
POS Tagging: examples from the WSJ (from McCallum)
POS Tagging
• Ambiguity makes it a non-trivial task: e.g., "back" can be a noun, verb, adjective, or adverb
• Useful: important features for other steps are based on POS, e.g., POS tags are used as input to a parser
But why is it still so popular?
• Historically the first statistical NLP problem
• Easy to apply arbitrary classifiers: both sequence models and independent per-word classifiers
• Can be regarded as a finite-state problem
• Easy to evaluate
• Annotation is cheaper to obtain than treebanks (relevant for other languages)
SVMs for tagging
• We can use SVMs in a similar way as MaxEnt (or other classifiers)
• We can use a window around the word
• 97.16% accuracy on WSJ
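A sketch of the window representation: each token is classified from features of itself and its neighbors. The templates below are illustrative, not the exact feature set of the SVM tagger; the resulting dictionaries could be fed to a linear SVM (e.g., scikit-learn's DictVectorizer plus LinearSVC):

```python
def window_features(words, t, width=2):
    """Features for tagging words[t] from a +/-width window (illustrative templates)."""
    feats = {"bias": 1.0}
    for d in range(-width, width + 1):
        i = t + d
        w = words[i].lower() if 0 <= i < len(words) else "<PAD>"
        feats[f"w[{d}]={w}"] = 1.0
    feats[f"suffix3={words[t][-3:].lower()}"] = 1.0   # crude morphology
    feats[f"cap={words[t][0].isupper()}"] = 1.0       # capitalization
    return feats

print(window_features(["The", "dog", "barks"], 1))
```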
SVMs for tagging (from Giménez & Màrquez)
Compare
• HMMs
• MEMMs: note that after each step t, the remaining probability mass cannot be reduced; it can only be redistributed among the possible outgoing state transitions
• CRFs: no local normalization
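In formulas (standard forms, not reproduced from the slide): the HMM is a generative joint model, the MEMM normalizes each transition locally, and the CRF normalizes once over whole tag sequences:

```latex
\text{HMM:}\;\; p(y, x) = \prod_t p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
\\[4pt]
\text{MEMM:}\;\; p(y \mid x) = \prod_t
\frac{\exp\big(w \cdot f(y_t, y_{t-1}, x, t)\big)}
     {\sum_{y'} \exp\big(w \cdot f(y', y_{t-1}, x, t)\big)}
\\[4pt]
\text{CRF:}\;\; p(y \mid x) = \frac{1}{Z(x)}
\exp\Big(\sum_t w \cdot f(y_t, y_{t-1}, x, t)\Big)
```

The MEMM's per-step denominator is what forces the mass leaving each state to sum to one, and that is the source of the label bias discussed next.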
Label Bias
(based on a slide from Joe Drish)
Label Bias
• Recall transition-based parsing: Nivre's algorithm (with beam search)
• At each step we can observe only local features (limited look-ahead)
• If we later see that the following word is impossible, we can only distribute probability uniformly across all the (im-)possible decisions
• If there is only a small number of such decisions, we cannot decrease the probability dramatically
• So, label bias is likely to be a serious problem if:
• there are non-local dependencies
• states have a small number of possible outgoing transitions
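A tiny numeric illustration of the second point (the scores are made up for the example): with local normalization, a state with a single outgoing transition passes all of its probability mass forward no matter how poorly the observation fits.

```python
import math

def local_softmax(scores):
    # locally normalized transition distribution out of one state
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# A state with two outgoing transitions can shift mass between them ...
print(local_softmax({"x": 2.0, "y": 0.5}))   # mass split between x and y
# ... but a state with a single transition gives it probability 1.0,
# even when its score says the observation is a terrible fit.
print(local_softmax({"z": -100.0}))          # {'z': 1.0}
```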
POS Tagging Experiments
• "+" denotes an extended feature set (hard to integrate into a generative model)
• oov = out-of-vocabulary
Supervision
• So far we considered the supervised case: the training set is labeled
• However, we can try to induce word classes without supervision: unsupervised tagging
• We will discuss the EM algorithm later
• It can also be done in a partly supervised way:
• seed tags
• a small labeled dataset
• a parallel corpus
• ...