A Practical Part-of-Speech Tagger
by Doug Cutting et al., in 3rd Conference on Applied NLP, 1992
Insu Kang, KLE Lab., CSE, POSTECH
1. Abstract
• Implementation of a POS tagger based on a hidden Markov model (HMM)
• Resources: a lexicon + unlabeled training text
• Accuracy: 96%
• Implementation strategies and optimizations
• Three applications of tagging:
  • Phrase recognition
  • Word sense disambiguation (WSD)
  • Grammatical function assignment
2. Introduction
• Many words are ambiguous in their part of speech
• Ambiguity is reduced using the context of surrounding words
• Automatic text tagging: the first step toward linguistic analysis
• Requirements for a tagger:
  • Robust: handles ungrammatical constructions, isolated phrases, and non-linguistic data
  • Efficient: runs in time linear in the number of words tagged
  • Accurate: assigns the correct part-of-speech tag
  • Tunable: can accept different linguistic hints for different corpora
  • Reusable: can be retargeted to new corpora
3. Methodology
• Rule-based: [Greene and Rubin 71], [Koskenniemi 90]
• Statistical: [DeRose 88], [Garside 87]
  • Based on Markov assumptions (sketched below):
    p(t_i | w_1 t_1 w_2 t_2 ... w_{i-1} t_{i-1}) = p(t_i | t_{i-2} t_{i-1})
    p(w_i | w_1 t_1 w_2 t_2 ... w_{i-1} t_{i-1} t_i) = p(w_i | t_i)
• Parameter estimation:
  • From a tagged corpus
  • From an untagged corpus (HMM): [Jelinek 85]
    • Baum-Welch algorithm [Baum 72] (the forward-backward algorithm)
  • Parameter smoothing
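As a concrete illustration of the Markov assumptions above, here is a minimal sketch that scores a tagged sequence under them; the table names `trans` and `emit` and their keying are assumptions for illustration, not the paper's implementation.

```python
# A sketch (not the paper's code) of scoring a tagged sequence under the
# Markov assumptions above: the transition conditions on the two previous
# tags, the emission on the current tag only. `trans` and `emit` are
# assumed dictionaries of trained probabilities.
def tagged_sequence_prob(words, tags, trans, emit):
    p = 1.0
    prev2, prev1 = None, None                    # start-of-sentence padding
    for w, t in zip(words, tags):
        p *= trans.get((prev2, prev1, t), 0.0)   # p(t_i | t_{i-2} t_{i-1})
        p *= emit.get((t, w), 0.0)               # p(w_i | t_i)
        prev2, prev1 = prev1, t
    return p
```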
4. This Paper's Approach
• HMM: advantages of unsupervised learning
  • No need for an annotated training corpus
  • Alternate sets of POS categories are usable for training
  • Special POS categories can be added for specialized domains
  • The model can be applied to other languages
• Ambiguity classes: 4,000 words -> 129 ambiguity classes (see sketch below)
  • Provide a vocabulary-independent model
  • Reduce the number of parameters required in the model
  • "play", "touch" -> the noun-or-verb class
  • "clay", "zinc" -> the noun class
• First-order model
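The grouping into ambiguity classes can be sketched as follows; the lexicon entries are the slide's own examples, while the class representation (a frozenset of possible tags) is an assumed implementation choice.

```python
from collections import defaultdict

# Illustrative lexicon using the slide's own examples; the class
# representation (a frozenset of possible tags) is an assumed choice.
lexicon = {
    "play": {"noun", "verb"},
    "touch": {"noun", "verb"},
    "clay": {"noun"},
    "zinc": {"noun"},
}

def ambiguity_class(word):
    # Words with the same set of possible tags share one class, so
    # emission parameters are tied per class rather than per word.
    return frozenset(lexicon.get(word, {"unknown"}))

classes = defaultdict(list)
for w in lexicon:
    classes[ambiguity_class(w)].append(w)
# "play" and "touch" fall into the noun-or-verb class;
# "clay" and "zinc" into the noun class.
```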
5. HMM Formalism
• Constructs:
  • S: a finite set of states, N = |S|
  • V: a signal alphabet, M = |V|
  • A: a state transition matrix, a_ij
  • B: a signal (emission) matrix, b_j(k)
  • π: an initial state vector, π_i
  • α: the forward variable
  • β: the backward variable
  • γ: the forward-backward variable, used for parameter estimation
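The recurrences behind α and β can be rendered as a short sketch, assuming dense NumPy arrays A (N×N), B (N×M), π (N) and an integer-coded observation sequence; this is the textbook formulation, not the paper's code.

```python
import numpy as np

# Textbook forward/backward recurrences over dense arrays A (N x N),
# B (N x M), pi (N), and an integer observation sequence `obs`.
def forward(A, B, pi, obs):
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_1(i)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum_i alpha a_ij b_j
    return alpha

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# gamma_t(i) = alpha_t(i) * beta_t(i) / p(O) then drives the Baum-Welch
# re-estimation of A, B, and pi.
```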
6. Numerical Stability
• Scale factor (see sketch below)
  • Products of probabilities between 0 and 1 easily underflow
  • Normalize each time step's forward values by their sum; e.g. at time t_i:
    0.5 -> 0.5/0.95, 0.2 -> 0.2/0.95, 0.25 -> 0.25/0.95 (sum = 0.95)
• Viterbi algorithm: work on a logarithmic scale [Levinson 83]
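The scaling trick can be folded into the forward pass as below (a sketch, with the same assumed array shapes as the previous one): divide each step's values by their sum c_t, and recover the log-likelihood as the sum of log c_t.

```python
import numpy as np

# Scaled forward pass: normalizing by c_t keeps values well away from
# zero; log p(O) = sum_t log c_t. Not the paper's code.
def forward_scaled(A, B, pi, obs):
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    log_likelihood = 0.0
    v = pi * B[:, obs[0]]
    for t in range(T):
        if t > 0:
            v = (alpha[t - 1] @ A) * B[:, obs[t]]
        c = v.sum()               # e.g. 0.5 + 0.2 + 0.25 = 0.95 on the slide
        alpha[t] = v / c          # 0.5 -> 0.5/0.95, 0.2 -> 0.2/0.95, ...
        log_likelihood += np.log(c)
    return alpha, log_likelihood
```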
7. Reducing Time Complexity
• Baum-Welch, Viterbi: O(TN^2)
• The signal matrix B is sparsely populated
  • If the average number of non-zero entries per row of B is K, the cost
    drops to O(KTN): each step from t_i to t_{i+1} involves only the K
    states compatible with the token (see sketch below)
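One way to exploit that sparsity is sketched below; the data structures (a dict-of-dicts A, an `emit_states` index listing only the states that can emit each symbol) are assumptions, not the paper's representation.

```python
# Sparse forward pass: alpha is a dict over only the states a token's
# ambiguity class allows, so each step touches ~K states rather than
# all N. A is assumed to be a dict of dicts, A[i][j] = p(j | i).
def forward_sparse(A, pi, emit_states, emit_probs, obs):
    alpha = {j: pi[j] * emit_probs[(j, obs[0])] for j in emit_states[obs[0]]}
    for k in obs[1:]:
        alpha = {
            j: sum(a * A[i].get(j, 0.0) for i, a in alpha.items())
               * emit_probs[(j, k)]
            for j in emit_states[k]
        }
    return alpha
```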
8. Reducing Space Complexity
• A, B, π: N^2 + NM + N = N(N + M + 1) parameters
• Baum-Welch requires a copy of A, B, π:
  • 2N(N + M + 1): the copies of A, B, π
  • 2NT: the α, β probabilities
  • T: storage for the output sequence
• N and M are fixed, so reduce T by cutting the stream at:
  • Unambiguous tokens
  • Sentence-ending markers
  • Paragraph markers
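A minimal sketch of that cutting step, assuming an `is_unambiguous` predicate over the lexicon and illustrative boundary tokens: once the stream is split, the O(NT) α/β storage grows with the segment length rather than the corpus length.

```python
# Cut the token stream wherever the state is pinned down or a natural
# boundary occurs; each yielded segment is trained on independently.
# `is_unambiguous` and the boundary tokens are assumed for illustration.
def split_for_training(tokens, is_unambiguous):
    segment = []
    for tok in tokens:
        segment.append(tok)
        if is_unambiguous(tok) or tok in {".", "<paragraph>"}:
            yield segment
            segment = []
    if segment:
        yield segment
```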
9. Model Tuning
• Determinants of the initial model:
  • The choice of tagset and lexicon
  • Biasing of starting values with empirical and a priori information
• Favored tags (see sketch below):
  • p(w_i = "to" | C_i = "to-inf") = 1.0
  • p(w_i = "to" | C_i = "preposition") = 0.086
  • p(w_i = "unknown-word" | C_i = "noun") = 1.0
  • p(w_i = "unknown-word" | C_i = "open-class tags") = 0.001
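The slide's favored-tag biases expressed as starting emission values (a sketch; the tag names and the `<unknown>` token are illustrative): Baum-Welch then re-estimates from this informed starting point rather than a uniform one.

```python
# Biased starting emission values, keyed (tag, word), using the slide's
# probabilities. Tag names and "<unknown>" are illustrative.
initial_emissions = {
    ("to-inf", "to"): 1.0,
    ("preposition", "to"): 0.086,
    ("noun", "<unknown>"): 1.0,
}
for tag in ("adjective", "adverb", "verb"):   # remaining open-class tags
    initial_emissions[(tag, "<unknown>")] = 0.001
```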
10. Applications of Tagging
• Phrase recognition
  • Use a simple grammar for NP, VP, PP over contiguous sequences of tags (see sketch below)
• WSD: homograph disambiguation
  • Words take different meanings according to their part of speech
• Grammatical function assignment
  • Builds on phrase recognition
  • Uses a set of rules
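Phrase recognition over tag sequences can be sketched as pattern matching on an encoded tag string; the grammar (optional determiner, adjectives, one or more nouns) and the tag names are illustrative, not the paper's.

```python
import re

# Encode each tag as one character and match a simple NP pattern over
# the resulting string; spans map back to token indices.
TAG_CODES = {"det": "D", "adj": "J", "noun": "N", "verb": "V", "prep": "P"}
NP_PATTERN = re.compile(r"D?J*N+")

def find_noun_phrases(tags):
    coded = "".join(TAG_CODES.get(t, "O") for t in tags)  # "O" = other
    return [m.span() for m in NP_PATTERN.finditer(coded)]

# e.g. find_noun_phrases(["det", "adj", "noun", "verb"]) -> [(0, 3)]:
# "the quick fox" is recognized as one NP.
```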