
CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems


Presentation Transcript


  1. CS546: Machine Learning and Natural Language. Multi-Class and Structured Prediction Problems. Slides from Taskar and Klein are used in this lecture.

  2. Outline • Multi-Class classification: • Structured Prediction • Models for Structured Prediction and Classification • Example of POS tagging

  3. Multiclass problems • Most of the machinery we talked about before was focused on binary classification problems • e.g., the SVMs we discussed so far • However, most problems we encounter in NLP are either: • Multiclass: e.g., text categorization • Structured Prediction: e.g., predicting the syntactic structure of a sentence • How to deal with them?

  4. Binary linear classification

  5. Multiclass classification
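For reference, the standard linear decision rules behind slides 4-5 (the notation here — w, b, per-class weight vectors w_y, and k classes — is assumed; the slides' own notation may differ):

```latex
% Binary linear classification: predict by the sign of a single score
\[ \hat{y} \;=\; \operatorname{sign}(\mathbf{w}\cdot\mathbf{x} + b), \qquad \hat{y} \in \{-1, +1\} \]
% Multiclass linear classification: one weight vector per class,
% predict the highest-scoring class
\[ \hat{y} \;=\; \arg\max_{y \in \{1,\dots,k\}} \mathbf{w}_y \cdot \mathbf{x} \]
```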

  6. Perceptron

  7. Structured Perceptron • Joint feature representation: • Algorithm:
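A minimal sketch of the structured perceptron training loop in Python (Collins-style). The joint feature map `phi(x, y)` and the argmax decoder `decode(x, w)` are placeholders assumed here, not given on the slides; for sequences the decoder would typically be Viterbi.

```python
import numpy as np

def structured_perceptron(train, phi, decode, n_feats, epochs=5):
    """Structured perceptron sketch.

    train  : list of (x, y_gold) pairs; y structures are tuples so they compare with !=
    phi    : joint feature map phi(x, y) -> np.ndarray of length n_feats
    decode : decode(x, w) -> argmax_y w . phi(x, y)   (e.g. Viterbi for tag sequences)
    """
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y_gold in train:
            y_pred = decode(x, w)                 # best structure under current weights
            if y_pred != y_gold:                  # mistake-driven update
                w += phi(x, y_gold) - phi(x, y_pred)
    return w
```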

  8. Perceptron

  9. Binary Classification Margin

  10. Generalize to MultiClass
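Standard definitions of the margins being generalized here (notation assumed, not copied from the slides):

```latex
% Binary margin of a unit-norm w on {(x_i, y_i)}, y_i in {-1, +1}:
\[ \gamma \;=\; \min_i \; y_i\,(\mathbf{w}\cdot\mathbf{x}_i) \]
% Multiclass margin: the correct class must outscore the best wrong class
\[ \gamma \;=\; \min_i \Big( \mathbf{w}_{y_i}\cdot\mathbf{x}_i \;-\; \max_{y \neq y_i} \mathbf{w}_{y}\cdot\mathbf{x}_i \Big) \]
```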

  11. Converting to MultiClass SVM

  12. Max margin = Min Norm • As before, these are equivalent formulations:
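A hedged reconstruction of the two equivalent formulations in the separable multiclass case (maximize the margin at fixed norm, or fix the margin to 1 and minimize the norm):

```latex
% Max-margin formulation:
\[ \max_{\|\mathbf{w}\| = 1} \; \gamma \quad \text{s.t.} \quad
   \mathbf{w}_{y_i}\cdot\mathbf{x}_i - \mathbf{w}_{y}\cdot\mathbf{x}_i \;\ge\; \gamma
   \qquad \forall i,\; \forall y \neq y_i \]
% Equivalent min-norm formulation:
\[ \min_{\mathbf{w}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad
   \mathbf{w}_{y_i}\cdot\mathbf{x}_i - \mathbf{w}_{y}\cdot\mathbf{x}_i \;\ge\; 1
   \qquad \forall i,\; \forall y \neq y_i \]
```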

  13. Problems: • Requires separability • What if we have noise in the data? • What if we only have a simple, limited feature space?

  14. Non-separable case

  15. Non-separable case
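In the non-separable case the standard remedy is to introduce slack variables (soft margin); a sketch of the resulting objective, with C trading off margin violations against the norm:

```latex
\[ \min_{\mathbf{w},\,\boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i
   \quad \text{s.t.} \quad
   \mathbf{w}_{y_i}\cdot\mathbf{x}_i - \mathbf{w}_{y}\cdot\mathbf{x}_i \;\ge\; 1 - \xi_i,
   \quad \xi_i \ge 0 \qquad \forall i,\; \forall y \neq y_i \]
```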

  16. Compare with MaxEnt

  17. Loss Comparison
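The comparison here is between the per-example losses the two models minimize: the (multiclass) hinge loss for the SVM and the log loss for MaxEnt. Standard forms:

```latex
% Multiclass hinge loss (SVM):
\[ \ell_{\text{hinge}}(\mathbf{w}; \mathbf{x}, y) \;=\;
   \max\Big(0,\; 1 + \max_{y' \neq y} \mathbf{w}_{y'}\cdot\mathbf{x} \;-\; \mathbf{w}_{y}\cdot\mathbf{x}\Big) \]
% Log loss (MaxEnt):
\[ \ell_{\text{log}}(\mathbf{w}; \mathbf{x}, y) \;=\;
   -\log \frac{\exp(\mathbf{w}_{y}\cdot\mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'}\cdot\mathbf{x})} \]
```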

  18. Multiclass -> Structured • So far, we considered multiclass classification • 0-1 loss l(y, y') • What if we want to predict: • sequences of POS tags • syntactic trees • translations
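For structured outputs, the 0-1 loss is usually replaced by a loss that decomposes over the structure, e.g. Hamming loss over a tag sequence (a standard choice, not necessarily the exact one used later in the lecture):

```latex
% 0-1 loss (multiclass): every mistake costs the same
\[ \ell(y, y') \;=\; \mathbf{1}[\,y \neq y'\,] \]
% Hamming loss for sequences: count the mislabeled positions
\[ \ell(\mathbf{y}, \mathbf{y}') \;=\; \sum_{t=1}^{T} \mathbf{1}[\,y_t \neq y'_t\,] \]
```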

  19. Predicting word alignments

  20. Predicting Syntactic Trees

  21. Structured Models

  22. Parsing

  23. Max Margin Markov Networks (M3Ns) Taskar et al., 2003; a similar formulation in Tsochantaridis et al., 2004

  24. Max Margin Markov Networks (M3Ns)
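A hedged reconstruction of the kind of objective behind these slides (margin rescaling, in the spirit of Taskar et al., 2003 and Tsochantaridis et al., 2004): each wrong structure y must be outscored by a margin that grows with its loss. Here f(x, y) is the joint feature map and l(y_i, y) the structured loss; the symbols are assumed, not copied from the slides.

```latex
\[ \min_{\mathbf{w},\,\boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i
   \quad \text{s.t.} \quad
   \mathbf{w}\cdot\big(\mathbf{f}(x_i, y_i) - \mathbf{f}(x_i, y)\big) \;\ge\; \ell(y_i, y) - \xi_i
   \qquad \forall i,\; \forall y \neq y_i \]
```

M3Ns additionally exploit the Markov structure of y so that the exponentially many constraints can be handled efficiently.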

  25. Solving MultiClass with binary learning • MultiClass classifier • Function f : R^d -> {1, 2, 3, ..., k} • Decompose into binary problems • Not always possible to learn • Different scale • No theoretical justification

  26. Learning via One-Versus-All (OvA) Assumption • Find vr, vb, vg, vy ∈ R^n such that • vr.x > 0 iff y = red • vb.x > 0 iff y = blue • vg.x > 0 iff y = green • vy.x > 0 iff y = yellow • Classifier f(x) = argmax_i vi.x • Hypothesis space: H = R^kn (Figures: individual classifiers; decision regions)
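A minimal Python sketch of OvA training and of the classification rule f(x) = argmax vi.x from this slide. The toy binary perceptron stands in for whatever binary learner is used (an SVM would play the same role); data layout and function names are assumptions.

```python
import numpy as np

def train_binary_perceptron(X, z, epochs=10):
    """Toy binary learner for illustration; labels z are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, z_i in zip(X, z):
            if z_i * (w @ x_i) <= 0:          # misclassified (or on the boundary)
                w += z_i * x_i
    return w

def train_ova(X, y, classes):
    """One binary problem per class: 'this class' vs. all the rest."""
    return {c: train_binary_perceptron(X, np.where(y == c, 1, -1)) for c in classes}

def predict_ova(models, x):
    """Classifier f(x) = argmax_c v_c . x."""
    return max(models, key=lambda c: models[c] @ x)
```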

  27. Learning via All-Versus-All (AvA) Assumption • Find vrb, vrg, vry, vbg, vby, vgy ∈ R^d such that • vrb.x > 0 if y = red, < 0 if y = blue • vrg.x > 0 if y = red, < 0 if y = green • ... (for all pairs) • Hypothesis space: H = R^kkn • How to classify? (Figures: individual classifiers; decision regions)

  28. Classifying with AvA • Tree • Majority Vote (e.g., 1 red, 2 yellow, 2 green -> ?) • Tournament • All are post-learning and might cause weird stuff
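A sketch of AvA training together with the majority-vote strategy from this slide (tree and tournament would replace the vote-counting step). `train_binary` is any binary learner, e.g. the toy perceptron from the OvA sketch; the setup is the standard pairwise scheme, not necessarily the slide's exact figure.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def train_ava(X, y, classes, train_binary):
    """One binary classifier per pair of classes, trained only on their examples."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        z = np.where(y[mask] == a, 1, -1)      # class a -> +1, class b -> -1
        models[(a, b)] = train_binary(X[mask], z)
    return models

def predict_ava_majority(models, x):
    """Majority vote over all pairwise classifiers."""
    votes = Counter()
    for (a, b), w in models.items():
        votes[a if w @ x > 0 else b] += 1
    return votes.most_common(1)[0][0]
```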

  29. POS Tagging • English tags

  30. POS Tagging, examples from WSJ From McCallum

  31. POS Tagging • Ambiguity: not a trivial task • Useful for later tasks: • POS tags are important features for subsequent steps • E.g., use POS as input to a parser

  32. But still, why so popular? • Historically the first statistical NLP problem • Easy to apply arbitrary classifiers: • both for sequence models and for independent per-word classifiers • Can be regarded as a finite-state problem • Easy to evaluate • Annotation is cheaper to obtain than treebanks (especially for other languages)

  33. HMM (reminder)

  34. HMM (reminder) - transitions
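As a reminder, the standard first-order (bigram) HMM factorization over a tag sequence t_1..t_T and word sequence w_1..w_T (notation assumed):

```latex
\[ P(\mathbf{w}, \mathbf{t}) \;=\; \prod_{i=1}^{T} P(t_i \mid t_{i-1})\; P(w_i \mid t_i) \]
```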

  35. Transition Estimates

  36. Emission Estimates
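Both sets of parameters are typically estimated from counts in the labeled data (maximum likelihood, smoothed in practice); standard forms:

```latex
% Transition estimates from tag-bigram counts:
\[ \hat{P}(t_i \mid t_{i-1}) \;=\; \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \]
% Emission estimates from tag-word counts:
\[ \hat{P}(w_i \mid t_i) \;=\; \frac{C(t_i, w_i)}{C(t_i)} \]
```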

  37. MaxEnt (reminder)

  38. Decoding: HMM vs MaxEnt
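For the HMM side, decoding means finding the highest-probability tag sequence with the Viterbi algorithm; a minimal sketch in log space (the dense matrix layout is an assumption of this sketch, not the slide's notation):

```python
import numpy as np

def viterbi(log_trans, log_emit, log_start):
    """Viterbi decoding for a first-order HMM, in log space.

    log_trans : (K, K) array, log P(tag_j | tag_i)
    log_emit  : (T, K) array, log P(word_t | tag_j) for the observed sentence
    log_start : (K,)   array, log P(tag_j | start)
    Returns the highest-scoring tag sequence as a list of tag indices.
    """
    T, K = log_emit.shape
    delta = log_start + log_emit[0]              # best score ending in each tag at t = 0
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # scores[prev, cur]
        back[t] = scores.argmax(axis=0)          # best previous tag for each current tag
        delta = scores.max(axis=0) + log_emit[t]
    tags = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # follow the backpointers
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]
```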

  39. Accuracies overview

  40. Accuracies overview

  41. SVMs for tagging • We can use SVMs in a similar way to MaxEnt (or other classifiers) • We can use a window around the word • 97.16% on WSJ
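A hedged sketch of what "a window around the word" can look like as features for a per-token classifier; the feature names are illustrative, not the exact feature set from the slides:

```python
def window_features(words, i, width=2):
    """Features for tagging words[i] from a +/- `width` word window."""
    feats = {"bias": 1.0, "w0=" + words[i].lower(): 1.0}
    for off in range(-width, width + 1):
        if off != 0 and 0 <= i + off < len(words):
            feats[f"w{off:+d}=" + words[i + off].lower()] = 1.0
    feats["suffix3=" + words[i][-3:].lower()] = 1.0   # simple orthographic cue
    return feats

# Example: features for tagging "saw" in "I saw the dog"
# window_features("I saw the dog".split(), 1)
```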

  42. SVMs for tagging from Giménez & Màrquez

  43. No sequence modeling

  44. CRFs and other global models

  45. CRFs and other global models

  46. Compare • HMMs • MEMMs: note that after each step t the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions • CRFs: no local normalization
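The contrast is in where the normalization happens; standard formulations (x = word sequence, t = tag sequence, lambda = weights, f = local features — symbols assumed):

```latex
% HMM: generative, locally normalized transition and emission distributions
\[ P(\mathbf{x}, \mathbf{t}) \;=\; \prod_i P(t_i \mid t_{i-1})\, P(x_i \mid t_i) \]
% MEMM: discriminative, but normalized locally at each step
% (this per-step normalization is what gives rise to label bias)
\[ P(\mathbf{t} \mid \mathbf{x}) \;=\; \prod_i
   \frac{\exp\big(\boldsymbol{\lambda}\cdot\mathbf{f}(t_{i-1}, t_i, \mathbf{x}, i)\big)}
        {Z(t_{i-1}, \mathbf{x}, i)} \]
% CRF: one global normalization over all tag sequences
\[ P(\mathbf{t} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \prod_i
   \exp\big(\boldsymbol{\lambda}\cdot\mathbf{f}(t_{i-1}, t_i, \mathbf{x}, i)\big) \]
```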

  47. Label Bias based on a slide from Joe Drish

  48. Label Bias • Recall transition-based parsing: Nivre's algorithm (with beam search) • At each step we can observe only local features (limited look-ahead) • If we later see that the following word is impossible, we can only distribute the probability uniformly across all the (im)possible decisions • If there are only a few such decisions, we cannot decrease the probability dramatically • So, label bias is likely to be a serious problem if: • there are non-local dependencies • states have a small number of possible outgoing transitions

  49. POS Tagging Experiments • “+” is an extended feature set (hard to integrate in a generative model) • oov = out-of-vocabulary

  50. Supervision • So far we considered the supervised case: • the training set is labeled • However, we can try to induce word classes without supervision • Unsupervised tagging • We will discuss the EM algorithm later • It can also be done in a partly supervised way: • Seed tags • Small labeled dataset • Parallel corpus • ....
