CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are used in this lecture.
Outline
• Multi-Class classification
• Structured Prediction
• Models for Structured Prediction and Classification
• Example: POS tagging
Multiclass problems
• Most of the machinery we discussed before was focused on binary classification problems, e.g., the SVMs we covered so far
• However, most problems we encounter in NLP are either:
• MultiClass: e.g., text categorization
• Structured Prediction: e.g., predicting the syntactic structure of a sentence
• How do we deal with them?
Structured Perceptron
• Joint feature representation:
• Algorithm:
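A minimal sketch of the structured perceptron in Python; the names feature_fn, argmax_fn, and n_feats are assumptions, not from the slides. The learner needs a joint feature map f(x, y) and a decoder (e.g., Viterbi for sequences) that returns the highest-scoring structure under the current weights.

```python
import numpy as np

def structured_perceptron(train, feature_fn, argmax_fn, n_feats, epochs=10):
    """Structured perceptron (sketch).

    train      : list of (x, y_gold) pairs, y_gold the gold structure
    feature_fn : joint feature map f(x, y) -> np.ndarray of length n_feats
    argmax_fn  : decoder returning argmax_y  w . f(x, y), e.g. Viterbi
    """
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y_gold in train:
            y_hat = argmax_fn(w, x)          # best structure under current w
            if y_hat != y_gold:              # mistake-driven update
                w += feature_fn(x, y_gold) - feature_fn(x, y_hat)
    return w
```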
Max margin = Min Norm
• As before, these are equivalent formulations:
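The formulations on the slide were images; in the standard notation for the binary case discussed earlier, the equivalence is:

```latex
\max_{\|w\|=1}\ \gamma \ \ \text{s.t.}\ \ y_i\,(w \cdot x_i) \ge \gamma\ \ \forall i
\qquad\Longleftrightarrow\qquad
\min_{w}\ \tfrac{1}{2}\|w\|^2 \ \ \text{s.t.}\ \ y_i\,(w \cdot x_i) \ge 1\ \ \forall i
```

Maximizing the geometric margin of a unit-norm weight vector is the same problem as fixing the functional margin to 1 and minimizing the norm.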
Problems:
• Requires separability
• What if we have noise in the data?
• What if we have only a simple, limited feature space?
Multiclass -> Structured
• So far, we considered multiclass classification with 0-1 losses l(y, y')
• What if what we want to predict is:
• sequences of POS tags
• syntactic trees
• translations
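For structures, the 0-1 loss is too coarse: an output wrong at one position and an output wrong everywhere both count as a single error. A per-position (Hamming) loss distinguishes them; a small illustration (function names are mine, not from the slides):

```python
def zero_one_loss(y, y_prime):
    # 1 if the structures differ anywhere, else 0
    return int(y != y_prime)

def hamming_loss(y, y_prime):
    # number of positions where the two tag sequences disagree
    return sum(a != b for a, b in zip(y, y_prime))

gold    = ["DT", "NN", "VBZ"]
one_off = ["DT", "NN", "NN"]   # one wrong tag
all_off = ["NN", "VB", "DT"]   # every tag wrong

print(zero_one_loss(gold, one_off), zero_one_loss(gold, all_off))  # 1 1
print(hamming_loss(gold, one_off), hamming_loss(gold, all_off))    # 1 3
```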
Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; a similar formulation: Tsochantaridis et al., 2004
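The objective on the slide was an image; below is a sketch of the standard margin-rescaled form behind both papers, in which the required margin between the gold structure y_i and any competitor y grows with the structured loss l(y_i, y):

```latex
\min_{w,\ \xi \ge 0}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ \ge\ \ell(y_i, y) - \xi_i
\qquad \forall i,\ \forall y
```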
Solving MultiClass with binary learning
• MultiClass classifier: a function f : R^d -> {1, 2, 3, ..., k}
• Decompose the real problem into binary problems
• Not always possible to learn
• The binary scores may be on different scales
• No theoretical justification
Learning via One-Versus-All (OvA) Assumption
• Find v_r, v_b, v_g, v_y in R^n such that:
• v_r . x > 0 iff y = red
• v_b . x > 0 iff y = blue
• v_g . x > 0 iff y = green
• v_y . x > 0 iff y = yellow
• Classifier: f(x) = argmax_i v_i . x
• H = R^{kn}
(figures: individual classifiers; decision regions)
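A minimal OvA sketch in Python (NumPy); train_binary is an assumed helper that fits any binary linear classifier (e.g., a perceptron or linear SVM) and returns its weight vector:

```python
import numpy as np

def train_ova(X, y, classes, train_binary):
    # one weight vector per class: class k vs. the rest
    return {k: train_binary(X, np.where(y == k, 1, -1)) for k in classes}

def predict_ova(models, x):
    # f(x) = argmax_k  v_k . x
    return max(models, key=lambda k: models[k] @ x)
```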
Learning via All-Versus-All (AvA) Assumption
• Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy in R^d such that:
• v_rb . x > 0 if y = red, < 0 if y = blue
• v_rg . x > 0 if y = red, < 0 if y = green
• ... (for all pairs)
• H = R^{k(k-1)n/2}, one classifier per pair of classes
• How to classify?
(figures: individual classifiers; decision regions)
Classifying with AvA
• Tree
• Majority vote: e.g., 1 red, 2 yellow, 2 green -> ?
• Tournament
• All are applied post-learning and can produce inconsistent decisions
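A matching AvA sketch with the majority-vote decision rule (again, train_binary is an assumed binary trainer returning a weight vector):

```python
from collections import Counter
from itertools import combinations
import numpy as np

def train_ava(X, y, classes, train_binary):
    # one classifier per unordered pair (i, j): positive = class i, negative = class j
    models = {}
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = train_binary(X[mask], np.where(y[mask] == i, 1, -1))
    return models

def predict_ava(models, x):
    # each pairwise classifier votes for one of its two classes
    votes = Counter()
    for (i, j), v in models.items():
        votes[i if v @ x > 0 else j] += 1
    return votes.most_common(1)[0][0]
```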
POS Tagging
(table: the English POS tag set)
POS Tagging: examples from the WSJ (from McCallum)
POS Tagging
• Ambiguity makes it a non-trivial task: e.g., "back" can be a noun, verb, adjective, or adverb
• Useful: important features for other steps are based on POS, e.g., POS tags are used as input to a parser
But why is it still so popular?
• Historically the first statistical NLP problem
• Easy to apply arbitrary classifiers: both sequence models and independent per-word classifiers
• Can be regarded as a finite-state problem
• Easy to evaluate
• Annotation is cheaper to obtain than treebanks (relevant for other languages)
SVMs for tagging
• We can use SVMs in a similar way as MaxEnt (or other classifiers)
• We can use a window around the word
• 97.16% accuracy on WSJ
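A sketch of the window representation: each token is classified from features of itself and its neighbors. The templates below are illustrative, not the exact feature set of the SVM tagger; the resulting dictionaries could be fed to a linear SVM (e.g., scikit-learn's DictVectorizer plus LinearSVC):

```python
def window_features(words, t, width=2):
    """Features for tagging words[t] from a +/-width window (illustrative templates)."""
    feats = {"bias": 1.0}
    for d in range(-width, width + 1):
        i = t + d
        w = words[i].lower() if 0 <= i < len(words) else "<PAD>"
        feats[f"w[{d}]={w}"] = 1.0
    feats[f"suffix3={words[t][-3:].lower()}"] = 1.0   # crude morphology
    feats[f"cap={words[t][0].isupper()}"] = 1.0       # capitalization
    return feats

print(window_features(["The", "dog", "barks"], 1))
```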
SVMs for tagging (from Giménez & Màrquez)
Compare
• HMMs
• MEMMs: note that after each step t, the remaining probability mass cannot be reduced; it can only be redistributed among the possible outgoing state transitions
• CRFs: no local normalization
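In formulas (standard forms, not reproduced from the slide): the HMM is a generative joint model, the MEMM normalizes each transition locally, and the CRF normalizes once over whole tag sequences:

```latex
\text{HMM:}\;\; p(y, x) = \prod_t p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
\\[4pt]
\text{MEMM:}\;\; p(y \mid x) = \prod_t
\frac{\exp\big(w \cdot f(y_t, y_{t-1}, x, t)\big)}
     {\sum_{y'} \exp\big(w \cdot f(y', y_{t-1}, x, t)\big)}
\\[4pt]
\text{CRF:}\;\; p(y \mid x) = \frac{1}{Z(x)}
\exp\Big(\sum_t w \cdot f(y_t, y_{t-1}, x, t)\Big)
```

The MEMM's per-step denominator is what forces the mass leaving each state to sum to one, and that is the source of the label bias discussed next.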
Label Bias
(based on a slide from Joe Drish)
Label Bias
• Recall transition-based parsing: Nivre's algorithm (with beam search)
• At each step we can observe only local features (limited look-ahead)
• If we later see that the following word is impossible, we can only distribute probability uniformly across all the (im-)possible decisions
• If there is only a small number of such decisions, we cannot decrease the probability dramatically
• So, label bias is likely to be a serious problem if:
• there are non-local dependencies
• states have a small number of possible outgoing transitions
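A tiny numeric illustration of the second point (the scores are made up for the example): with local normalization, a state with a single outgoing transition passes all of its probability mass forward no matter how poorly the observation fits.

```python
import math

def local_softmax(scores):
    # locally normalized transition distribution out of one state
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# A state with two outgoing transitions can shift mass between them ...
print(local_softmax({"x": 2.0, "y": 0.5}))   # mass split between x and y
# ... but a state with a single transition gives it probability 1.0,
# even when its score says the observation is a terrible fit.
print(local_softmax({"z": -100.0}))          # {'z': 1.0}
```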
POS Tagging Experiments
• "+" denotes an extended feature set (hard to integrate into a generative model)
• oov = out-of-vocabulary
Supervision
• So far we considered the supervised case: the training set is labeled
• However, we can try to induce word classes without supervision: unsupervised tagging
• We will discuss the EM algorithm later
• It can also be done in a partly supervised way:
• seed tags
• a small labeled dataset
• a parallel corpus
• ...