Different Models for POS tagging
• HMM
• Maximum Entropy Markov Models
• Conditional Random Fields
POS tagging: A Sequence Labeling Problem
• Input and output
• Input sequence x = x₁x₂…xₙ
• Output sequence y = y₁y₂…yₘ
• Labels of the input sequence
• Semantic representation of the input
• Other applications
• Automatic speech recognition
• Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.
Hidden Markov Models
• Doubly stochastic models
• Efficient dynamic programming algorithms exist for:
• Finding Pr(S)
• Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi; see the sketch below)
• Training the model (Baum-Welch algorithm)
[Figure: a four-state HMM (S1–S4) with transition probabilities between states and emission distributions over the symbols A and C]
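As a concrete illustration of the Viterbi step, here is a minimal sketch of the dynamic program for the highest-probability state path. The two-state model and all probability tables are toy values assumed for illustration, not the parameters from the figure.

```python
# Minimal Viterbi sketch over a toy two-state HMM (illustrative parameters).
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[s] = probability of the best path ending in state s after the
    # current observation; backpointers record the argmax predecessors.
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev = delta
        delta, back = {}, {}
        for s in states:
            best = max(prev, key=lambda r: prev[r] * trans_p[r][s])
            delta[s] = prev[best] * trans_p[best][s] * emit_p[s][o]
            back[s] = best
        backpointers.append(back)
    # Trace the best path backwards from the best final state.
    last = max(delta, key=delta.get)
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return path[::-1], delta[last]

states = ["S1", "S2"]
start_p = {"S1": 0.5, "S2": 0.5}
trans_p = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.2, "S2": 0.8}}
emit_p = {"S1": {"A": 0.6, "C": 0.4}, "S2": {"A": 0.1, "C": 0.9}}
print(viterbi("ACC", states, start_p, trans_p, emit_p))
```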
Hidden Markov Model (HMM): Generative Modeling
• Source model P(Y), e.g., a 1st-order Markov chain
• Noisy channel P(X|Y): y is generated by the source model, then passed through the channel to produce x
• Parameter estimation: maximize the joint likelihood of training examples
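Spelled out, the generative factorization referred to above takes the standard 1st-order HMM form (with y₀ a designated start state):

```latex
P(\mathbf{x}, \mathbf{y}) \;=\; P(\mathbf{y})\,P(\mathbf{x}\mid\mathbf{y})
\;=\; \prod_{k=1}^{n} P(y_k \mid y_{k-1}) \;\prod_{k=1}^{n} P(x_k \mid y_k)
```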
Different Models for POS tagging
• HMM
• Maximum Entropy Markov Models
• Conditional Random Fields
Disadvantage of HMMs (1)
• No rich feature information
• Rich features are required:
• When xₖ is complex
• When the data for xₖ is sparse
• Example: POS tagging
• How to evaluate P(wₖ|tₖ) for unknown words wₖ?
• Useful features:
• Suffix, e.g., -ed, -tion, -ing, etc.
• Capitalization
Disadvantage of HMMs (2)
• Generative model
• Parameter estimation: maximize the joint likelihood of training examples
• Better approach:
• A discriminative model that models P(y|x) directly
• Maximize the conditional likelihood of training examples
Maximum Entropy Markov Model
• Discriminative sub-models
• Unify the two parameter sets of the generative model into one conditional model
• Two parameter sets in the generative model: the source model P(yₖ|yₖ₋₁) and the noisy channel P(xₖ|yₖ)
• Unified conditional model: P(yₖ|yₖ₋₁, xₖ)
• Employ the maximum entropy principle
• The result is the Maximum Entropy Markov Model (MEMM)
General Maximum Entropy Model
• Model
• Model the distribution P(Y|X) with a set of features {f₁, f₂, …, f_l} defined on X and Y
• Idea
• Collect information about the features from the training data
• Assume nothing about the distribution P(Y|X) other than the collected information
• Maximize the entropy as a criterion
Features
• 0-1 indicator functions
• 1 if (x, y) satisfies a predefined condition
• 0 if not
• Example: POS tagging (see the sketch below)
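A minimal sketch of what such 0-1 indicator features could look like for POS tagging. The specific conditions and tag names here are illustrative assumptions, not the slide's original examples.

```python
# Illustrative 0-1 indicator features for POS tagging (assumed examples).
def f_suffix_ing_vbg(x, y):
    # Fires when the current word ends in "-ing" and is tagged VBG.
    return 1 if x.endswith("ing") and y == "VBG" else 0

def f_capitalized_nnp(x, y):
    # Fires when the current word is capitalized and is tagged NNP.
    return 1 if x[:1].isupper() and y == "NNP" else 0

print(f_suffix_ing_vbg("running", "VBG"))  # 1
print(f_capitalized_nnp("London", "NNP"))  # 1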
Constraints
• Empirical information
• Statistics collected from the training data T
• Expected value
• Computed from the distribution P(Y|X) we want to model
• Constraint: the two expectations must agree for every feature (see below)
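Written out, these are the standard maximum entropy constraints (as in Berger et al. 1996): the model expectation of each feature must match its empirical expectation on the training data T.

```latex
\tilde{E}[f_i] = \frac{1}{|T|}\sum_{(x,y)\in T} f_i(x,y),
\qquad
E_P[f_i] = \sum_{x}\tilde{p}(x)\sum_{y} P(y\mid x)\,f_i(x,y),
\qquad
E_P[f_i] = \tilde{E}[f_i] \quad (i = 1,\dots,l).
```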
Maximum Entropy: Objective
• Entropy of the conditional model
• Maximization problem: maximize the entropy subject to the feature constraints (see below)
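The objective being maximized is the standard conditional entropy, constrained by the feature expectations above:

```latex
\max_{P}\; H(P) = -\sum_{x}\tilde{p}(x)\sum_{y} P(y\mid x)\log P(y\mid x)
\quad \text{s.t.} \quad
E_P[f_i]=\tilde{E}[f_i] \;\; (i=1,\dots,l),
\qquad \sum_{y} P(y\mid x)=1 .
```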
Dual Problem
• Dual problem
• Conditional model
• Maximum likelihood of conditional data
• Solution algorithms:
• Improved iterative scaling (IIS) (Berger et al. 1996)
• Generalized iterative scaling (GIS) (McCallum et al. 2000)
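Solving the dual yields the familiar exponential (log-linear) form of the conditional model; the weights λ are exactly what IIS and GIS estimate:

```latex
P_{\lambda}(y\mid x) = \frac{1}{Z_{\lambda}(x)}
\exp\!\Big(\sum_{i=1}^{l}\lambda_i\,f_i(x,y)\Big),
\qquad
Z_{\lambda}(x) = \sum_{y'}\exp\!\Big(\sum_{i=1}^{l}\lambda_i\,f_i(x,y')\Big).
```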
Maximum Entropy Markov Model
• Uses the maximum entropy approach to model the 1st-order conditional distribution (formula below)
• Features
• Basic features (analogous to the parameters of an HMM):
• Bigram (1st order) or trigram (2nd order) features from the source model
• State-output pair features (Xₖ = xₖ, Yₖ = yₖ)
• Advantage: can incorporate other advanced features on (xₖ, yₖ)
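Concretely, a 1st-order MEMM replaces the HMM's two distributions with one locally normalized maxent model per position (the form used in McCallum et al. 2000):

```latex
P(y_k \mid y_{k-1}, x_k) = \frac{1}{Z(x_k, y_{k-1})}
\exp\!\Big(\sum_i \lambda_i\,f_i(x_k, y_{k-1}, y_k)\Big).
```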
HMM vs. MEMM (1st order)
[Figure: graphical structures of the 1st-order HMM (arrows from each state to the next state and to its observation) and the MEMM (arrows from the previous state and the current observation into the current state)]
Performance in POS Tagging
• Data set: WSJ
• Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
• Results (Lafferty et al. 2001):
• 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
• 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
Different Models for POS tagging
• HMM
• Maximum Entropy Markov Models
• Conditional Random Fields
Disadvantage of MEMMs (1)
• Complex algorithms for the maximum entropy solution
• Both IIS and GIS are difficult to implement
• Require many tricks in implementation
• Slow training
• Time-consuming when the data set is large, especially for MEMMs
Disadvantage of MEMMs (2)
• Maximum entropy model as a sub-model
• Entropy is optimized on the sub-models, not on the global model
• Label bias problem
• Conditional models with per-state normalization
• The effect of the observations is weakened for states with fewer outgoing transitions
Label Bias Problem
• Training data (X:Y): rib:123, rib:123, rib:123, rob:456, rob:456
• New input: rob
[Figure: a six-state model in which the path 1→2→3 reads r, i, b and the path 4→5→6 reads r, o, b]
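A minimal numeric sketch of why per-state normalization causes the problem, using transition probabilities estimated from the five training pairs above. The numbers and helper function are an assumed illustration, not taken from the slides.

```python
# Toy illustration of the label bias problem (rib->123 x3, rob->456 x2).
#
# With per-state normalization, the first transition learns
# P(state 1 | start, 'r') = 3/5 and P(state 4 | start, 'r') = 2/5.
# States 1, 2, 4, and 5 each have exactly ONE outgoing transition, so
# normalization forces its conditional probability to 1 whatever character
# is observed -- the 'i' vs. 'o' evidence is ignored.
def memm_path_prob(first_state_prob, n_inner_transitions):
    p = first_state_prob
    for _ in range(n_inner_transitions):
        p *= 1.0  # single successor => probability 1 regardless of input
    return p

# Decoding the new input "rob":
print("P(path 1-2-3 | rob) =", memm_path_prob(3 / 5, 2))  # 0.6 -- wins
print("P(path 4-5-6 | rob) =", memm_path_prob(2 / 5, 2))  # 0.4
# The model labels "rob" as 123, even though "rob" only ever mapped to 456.
```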
Solution
• Global optimization
• Optimize the parameters of a global model simultaneously, not in separate sub-models
• Alternatives:
• Conditional random fields
• Application of the perceptron algorithm
Conditional Random Field (CRF) (1)
• Let G = (V, E) be a graph such that Y = (Y_v)_{v∈V} is indexed by the vertices of G
• Then (X, Y) is a conditional random field if, conditioned globally on X, the variables Y_v obey the Markov property with respect to G:
• P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v), where w ~ v means w is a neighbor of v in G
Conditional Random Field (CRF) (2)
• Exponential model
• G is a tree (or, more specifically, a chain) whose cliques are its edges and vertices
• P(y|x) ∝ exp( Σ_e λ_k f_k(e, y_e, x) + Σ_v μ_k g_k(v, y_v, x) ), where the edge features f_k are determined by state transitions and the vertex features g_k by the state at a single position
• Parameter estimation
• Maximize the conditional likelihood of training examples
• IIS or GIS
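A minimal sketch of the chain-CRF computation, emphasizing the single global normalization over whole label sequences (contrast with the MEMM's per-state normalization). All labels, weights, and words are assumed toy values; Z(x) is computed by brute-force enumeration, which is only feasible at this toy scale.

```python
import math
from itertools import product

# Toy linear-chain CRF (illustrative weights, not learned values).
LABELS = ["N", "V"]
W_TRANS = {("N", "N"): 0.5, ("N", "V"): 1.0, ("V", "N"): 0.8, ("V", "V"): 0.1}
W_EMIT = {("dog", "N"): 2.0, ("dog", "V"): 0.2,
          ("barks", "N"): 0.3, ("barks", "V"): 1.5}

def seq_score(x, y):
    # Sum weighted vertex (state-observation) and edge (transition) features
    # over the whole chain -- no normalization happens here.
    s = sum(W_EMIT[(xi, yi)] for xi, yi in zip(x, y))
    s += sum(W_TRANS[(a, b)] for a, b in zip(y, y[1:]))
    return s

def crf_prob(x, y):
    # Global partition function Z(x): one sum over ALL label sequences.
    z = sum(math.exp(seq_score(x, y2))
            for y2 in product(LABELS, repeat=len(x)))
    return math.exp(seq_score(x, y)) / z

print(crf_prob(["dog", "barks"], ("N", "V")))
```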
MEMM vs. CRF
• Similarities
• Both employ the maximum entropy principle
• Both incorporate rich feature information
• Differences
• Conditional random fields are always globally conditioned on X, resulting in a globally optimized model
Performance in POS Tagging
• Data set: WSJ
• Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
• Results (Lafferty et al. 2001):
• 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
• Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
Comparison of the Three Approaches to POS Tagging
• Results (Lafferty et al. 2001):
• 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
• 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
• Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
References
• A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
• J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML 2001, 282-289.
• A. McCallum, D. Freitag, and F. Pereira (2000). Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proc. ICML 2000, 591-598.