CS460/IT632: Natural Language Processing / Language Technology for the Web
Lecture 3 (10/01/06)
Prof. Pushpak Bhattacharyya, IIT Bombay
Statistical Formulation of the Part of Speech (PoS) Tagging Problem
Techniques for PoS Tagging
• Statistical – use probabilistic methods
• Rule-Based – use linguistic or machine-learnt rules for tagging
Uses of PoS Tagging
• Parsing
• Machine Translation
• Question Answering
• Text-to-Speech systems – resolving homography: same orthography (spelling) but different pronunciation, e.g., 'lead' as a verb vs. as a noun
Noisy Channel Based Modeling

[Figure: noisy channel – the tag sequence C is sent through a noisy channel and comes out as the word sequence W.]

C* = best tag sequence = argmax_C P(C|W)
Applying Bayes' Theorem

C* = argmax_C P(C|W)
   = argmax_C P(C) . P(W|C)

where P(C) is the prior and P(W|C) is the likelihood.
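The step above silently drops the denominator from Bayes' theorem; written out in full (a standard derivation, not shown on the original slide):

```latex
\begin{align*}
C^* &= \arg\max_{C} P(C \mid W) \\
    &= \arg\max_{C} \frac{P(C)\, P(W \mid C)}{P(W)} && \text{(Bayes' theorem)} \\
    &= \arg\max_{C} P(C)\, P(W \mid C) && \text{($P(W)$ is the same for every candidate $C$)}
\end{align*}
```

Since P(W) does not depend on the candidate tag sequence C, it cannot change which C maximizes the expression.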
Prior – Bigram Probability

P(C) = P(C1|C0) . P(C2|C1 C0) . P(C3|C2 C1 C0) ... P(Cn|Cn-1 Cn-2 ... C0)

k-gram approximation (Markov's assumption); with k = 2, the bigram assumption:

P(C) = ∏_{i=1..n} P(Ci|Ci-1)
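As a concrete illustration, a minimal Python sketch of the bigram prior; the transition table `trans` and its values are invented placeholders, not corpus estimates from the lecture:

```python
# Minimal sketch of the bigram prior P(C) = prod_i P(Ci | Ci-1).
# "^" stands for the start tag C0; all probabilities are made up.
trans = {
    ("^", "NNS"): 0.00083,
    ("NNS", "VBP"): 0.25,
    ("VBP", "JJ"): 0.05,
}

def bigram_prior(tags):
    p, prev = 1.0, "^"
    for t in tags:
        p *= trans.get((prev, t), 0.0)  # P(Ci | Ci-1), 0 if unseen
        prev = t
    return p

print(bigram_prior(["NNS", "VBP", "JJ"]))  # 0.00083 * 0.25 * 0.05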
Likelihood – Lexical Generation Probability

P(W|C) = P(W1|C1 C2 ... Cn) . P(W2|W1 C1 C2 ... Cn) ... P(Wn|Wn-1 Wn-2 ... W1 C1 C2 ... Cn)

Approximation: Wi depends only on Ci, so

P(Wi|Wi-1 Wi-2 ... W1 C1 C2 ... Cn) = P(Wi|Ci)

Hence P(W|C) = ∏_{i=1..n} P(Wi|Ci), and

C* = argmax_C ∏_{i=1..n} P(Ci|Ci-1) . P(Wi|Ci)
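Combining the two factors, a self-contained sketch of the score of one candidate tag sequence; both tables hold invented illustrative values:

```python
# Sketch of score(C) = prod_i P(Ci|Ci-1) * P(Wi|Ci) for one candidate
# tag sequence. "^" stands for the start tag C0; all numbers are invented.
trans = {("^", "NNS"): 0.00083, ("NNS", "VBP"): 0.25, ("VBP", "JJ"): 0.05}
emit = {("Humans", "NNS"): 0.0000093, ("are", "VBP"): 0.02, ("fond", "JJ"): 0.001}

def sequence_score(words, tags):
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

print(sequence_score(["Humans", "are", "fond"], ["NNS", "VBP", "JJ"]))
```

The tagger's job is then to find the tag sequence that maximizes this score.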
Tagging Situation

Input: "Humans are fond of animals and birds. They keep pets at home."

Output: Humans_NNS are_VBP fond_JJ of_IN animals_NNS and_CC birds_NNS ._. They_PRNS keep_VB pets_NNS at_IN home_NNP ._.

Note: the tags are Penn Treebank tags.
Formulating the Problem

[Figure: a tag lattice over the sentence "Humans are fond of animals" – above each word stands a column of its candidate tags C'k1 ... C'k10.]

Let C'ki be the possible tags for the corresponding words.
Formulating the Problem (contd.)

Let the word "Humans" have two candidate tags, NNS and JJ. The probabilities on the two arcs leaving the start state C0 are:

P(NNS|C0) = 0.00083     P(Humans|NNS) = 0.0000093
P(JJ|C0) = 0.000074     P(Humans|JJ) = 0.0000001

[Figure: from C0, one arc to Humans:NNS weighted P(NNS|C0).P(Humans|NNS), and one arc to Humans:JJ weighted P(JJ|C0).P(Humans|JJ).]

Should we choose the maximum product path?
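Plugging in the slide's numbers (a quick check, not part of the original deck):

```python
# The two arcs out of C0, using the probabilities given on the slide:
p_nns = 0.00083 * 0.0000093   # P(NNS|C0) * P(Humans|NNS) ≈ 7.7e-09
p_jj = 0.000074 * 0.0000001   # P(JJ|C0) * P(Humans|JJ) = 7.4e-12
print(p_nns, p_jj, p_nns / p_jj)  # the NNS arc scores ~1000x higher
```

Locally the NNS arc wins by roughly three orders of magnitude; note, though, that a locally best arc need not lie on the globally best path through the lattice, which is why the product is taken over the entire sequence.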
Calculating Probabilities

We calculate the probabilities by counting:

P(NNS|C0) = #(C0 followed by NNS) / #C0

P(Humans|NNS) = #(Humans tagged NNS) / #NNS
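A minimal sketch of this counting over a made-up toy corpus (real estimates require a large annotated treebank):

```python
from collections import Counter

# Each sentence is a list of (word, tag) pairs; ("^", "^") marks the
# start position C0. The corpus below is a toy example for illustration.
corpus = [
    [("^", "^"), ("Humans", "NNS"), ("are", "VBP"), ("fond", "JJ")],
    [("^", "^"), ("They", "PRP"), ("keep", "VB"), ("pets", "NNS")],
]

tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    for (_, t_prev), (_, t) in zip(sent, sent[1:]):
        trans_count[(t_prev, t)] += 1
    for w, t in sent:
        tag_count[t] += 1
        emit_count[(w, t)] += 1

def p_trans(t_prev, t):   # P(t | t_prev) = #(t_prev followed by t) / #t_prev
    return trans_count[(t_prev, t)] / tag_count[t_prev]

def p_emit(w, t):         # P(w | t) = #(w tagged t) / #t
    return emit_count[(w, t)] / tag_count[t]

print(p_trans("^", "NNS"), p_emit("Humans", "NNS"))
```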
Languages – Rich and Poor
• Resource-rich languages have annotated corpora, tools, language knowledge bases, etc.
• Resource-poor languages lack these resources.
Theoretical Foundations
• Hidden Markov Model (HMM) – a non-deterministic finite state machine with a probability associated with each arc
• Viterbi Algorithm – will be covered in the coming lectures

[Figure: a two-state machine (S0, S1) whose arcs are labelled with symbol:probability pairs such as a:0.1, a:0.2, a:0.4, b:0.2, b:0.3, b:0.5.]
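One way to encode such a machine in code; the original figure's arc-to-state assignments did not survive extraction, so the probabilities below are placeholders chosen only so that each state's outgoing arcs sum to 1:

```python
# Two-state probabilistic FSM in the slide's spirit: each arc carries
# (symbol, probability, next state). Numbers are illustrative placeholders.
arcs = {
    "S0": [("a", 0.1, "S0"), ("b", 0.2, "S0"), ("a", 0.4, "S1"), ("b", 0.3, "S1")],
    "S1": [("a", 0.2, "S0"), ("b", 0.3, "S0"), ("a", 0.2, "S1"), ("b", 0.3, "S1")],
}
```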
What is 'Hidden' in an HMM

Given an output sequence, we do not know which states the machine has transited through. Let the output sequence be 'aaba':

[Figure: a branching tree of possible state paths – from S0, the first 'a' may lead to S0 or S1; from each of those, the next 'a' again branches to S0 or S1, and so forth.]
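The branching can be made explicit by brute force: enumerate every state path that could emit 'aaba', using the `arcs` table from the sketch above (with its placeholder probabilities):

```python
from itertools import product

def paths_for(output, start="S0"):
    """Enumerate every state path that could emit `output`, with its probability."""
    results = []
    for path in product(["S0", "S1"], repeat=len(output)):
        p, state = 1.0, start
        for sym, nxt in zip(output, path):
            # probability mass on arcs from `state` to `nxt` emitting `sym`
            p *= sum(pr for s, pr, n in arcs[state] if s == sym and n == nxt)
            state = nxt
        if p > 0:
            results.append((path, p))
    return results

for path, p in paths_for("aaba"):
    print(path, p)
```

The observer sees only 'aaba'; any of the printed state paths could have produced it, which is exactly what is "hidden".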
HMM and PoS Tagging

In PoS tagging,
• alphabet symbols correspond to words
• states correspond to tags

After seeing the symbol sequence (Humans are fond of animals), find the state sequence that generated it – the PoS tag sequence.
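To make the mapping concrete, here is a brute-force decoder over a toy tag set; all tables and probabilities are invented for illustration. The Viterbi algorithm (coming lectures) computes the same argmax efficiently instead of enumerating every sequence:

```python
from itertools import product

# Toy end-to-end illustration: states = tags, output symbols = words.
# All probabilities are invented placeholders.
TAGS = ["NNS", "JJ", "VBP", "IN"]
trans = {("^", "NNS"): 0.4, ("^", "JJ"): 0.2, ("NNS", "VBP"): 0.5,
         ("JJ", "NNS"): 0.3, ("VBP", "JJ"): 0.2, ("JJ", "IN"): 0.3}
emit = {("Humans", "NNS"): 0.01, ("Humans", "JJ"): 0.001,
        ("are", "VBP"): 0.1, ("fond", "JJ"): 0.01}

def best_tags(words):
    best, best_p = None, 0.0
    for tags in product(TAGS, repeat=len(words)):
        p, prev = 1.0, "^"
        for w, t in zip(words, tags):
            p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
            prev = t
        if p > best_p:
            best, best_p = tags, p
    return best, best_p

print(best_tags(["Humans", "are", "fond"]))  # -> (('NNS', 'VBP', 'JJ'), 4e-07)
```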