CS460/IT632: Natural Language Processing / Language Technology for the Web
Lecture 3 (10/01/06)
Prof. Pushpak Bhattacharyya, IIT Bombay
Statistical Formulation of the Part of Speech (PoS) Tagging Problem
Techniques for PoS Tagging
• Statistical – use probabilistic methods
• Rule-Based – use linguistic or machine-learnt rules for tagging
Uses of PoS Tagging
• Parsing
• Machine Translation
• Question Answering
• Text-to-Speech systems – resolving homography: same orthography (spelling) but different pronunciation, e.g., 'lead' as a verb vs. as a noun
Noisy Channel Based Modeling

[Figure: noisy channel – the tag sequence C is sent through a noisy channel and comes out as the word sequence W.]

C* = best tag sequence = argmax_C P(C|W)
Applying Bayes' Theorem

C* = argmax_C P(C|W)
   = argmax_C P(C) . P(W|C)

where P(C) is the prior and P(W|C) is the likelihood.
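The step above silently drops the denominator from Bayes' theorem; written out in full (a standard derivation, not shown on the original slide):

```latex
\begin{align*}
C^* &= \arg\max_{C} P(C \mid W) \\
    &= \arg\max_{C} \frac{P(C)\, P(W \mid C)}{P(W)} && \text{(Bayes' theorem)} \\
    &= \arg\max_{C} P(C)\, P(W \mid C) && \text{($P(W)$ is the same for every candidate $C$)}
\end{align*}
```

Since P(W) does not depend on the candidate tag sequence C, it cannot change which C maximizes the expression.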
Prior – Bigram Probability

P(C) = P(C1|C0) . P(C2|C1 C0) . P(C3|C2 C1 C0) ... P(Cn|Cn-1 Cn-2 ... C0)

k-gram approximation (Markov's assumption); with k = 2, the bigram assumption:

P(C) = ∏_{i=1..n} P(Ci|Ci-1)
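As a concrete illustration, a minimal Python sketch of the bigram prior; the transition table `trans` and its values are invented placeholders, not corpus estimates from the lecture:

```python
# Minimal sketch of the bigram prior P(C) = prod_i P(Ci | Ci-1).
# "^" stands for the start tag C0; all probabilities are made up.
trans = {
    ("^", "NNS"): 0.00083,
    ("NNS", "VBP"): 0.25,
    ("VBP", "JJ"): 0.05,
}

def bigram_prior(tags):
    p, prev = 1.0, "^"
    for t in tags:
        p *= trans.get((prev, t), 0.0)  # P(Ci | Ci-1), 0 if unseen
        prev = t
    return p

print(bigram_prior(["NNS", "VBP", "JJ"]))  # 0.00083 * 0.25 * 0.05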
Likelihood – Lexical Generation Probability

P(W|C) = P(W1|C1 C2 ... Cn) . P(W2|W1 C1 C2 ... Cn) ... P(Wn|Wn-1 Wn-2 ... W1 C1 C2 ... Cn)

Approximation: Wi depends only on Ci, so

P(Wi|Wi-1 Wi-2 ... W1 C1 C2 ... Cn) = P(Wi|Ci)

Hence P(W|C) = ∏_{i=1..n} P(Wi|Ci), and

C* = argmax_C ∏_{i=1..n} P(Ci|Ci-1) . P(Wi|Ci)
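Combining the two factors, a self-contained sketch of the score of one candidate tag sequence; both tables hold invented illustrative values:

```python
# Sketch of score(C) = prod_i P(Ci|Ci-1) * P(Wi|Ci) for one candidate
# tag sequence. "^" stands for the start tag C0; all numbers are invented.
trans = {("^", "NNS"): 0.00083, ("NNS", "VBP"): 0.25, ("VBP", "JJ"): 0.05}
emit = {("Humans", "NNS"): 0.0000093, ("are", "VBP"): 0.02, ("fond", "JJ"): 0.001}

def sequence_score(words, tags):
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

print(sequence_score(["Humans", "are", "fond"], ["NNS", "VBP", "JJ"]))
```

The tagger's job is then to find the tag sequence that maximizes this score.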
Tagging Situation

Input: "Humans are fond of animals and birds. They keep pets at home."

Output: Humans_NNS are_VBP fond_JJ of_IN animals_NNS and_CC birds_NNS ._. They_PRNS keep_VB pets_NNS at_IN home_NNP ._.

Note: the tags are Penn Treebank tags.
Formulating the Problem

[Figure: a tag lattice over the sentence "Humans are fond of animals" – above each word stands a column of its candidate tags C'k1 ... C'k10.]

Let C'ki be the possible tags for the corresponding words.
Formulating the Problem (contd.)

Let the word "Humans" have two candidate tags, NNS and JJ. The probabilities on the two arcs leaving the start state C0 are:

P(NNS|C0) = 0.00083     P(Humans|NNS) = 0.0000093
P(JJ|C0) = 0.000074     P(Humans|JJ) = 0.0000001

[Figure: from C0, one arc to Humans:NNS weighted P(NNS|C0).P(Humans|NNS), and one arc to Humans:JJ weighted P(JJ|C0).P(Humans|JJ).]

Should we choose the maximum product path?
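Plugging in the slide's numbers (a quick check, not part of the original deck):

```python
# The two arcs out of C0, using the probabilities given on the slide:
p_nns = 0.00083 * 0.0000093   # P(NNS|C0) * P(Humans|NNS) ≈ 7.7e-09
p_jj = 0.000074 * 0.0000001   # P(JJ|C0) * P(Humans|JJ) = 7.4e-12
print(p_nns, p_jj, p_nns / p_jj)  # the NNS arc scores ~1000x higher
```

Locally the NNS arc wins by roughly three orders of magnitude; note, though, that a locally best arc need not lie on the globally best path through the lattice, which is why the product is taken over the entire sequence.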
Calculating Probabilities

We calculate the probabilities by counting:

P(NNS|C0) = #(C0 followed by NNS) / #C0

P(Humans|NNS) = #(Humans tagged NNS) / #NNS
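A minimal sketch of this counting over a made-up toy corpus (real estimates require a large annotated treebank):

```python
from collections import Counter

# Each sentence is a list of (word, tag) pairs; ("^", "^") marks the
# start position C0. The corpus below is a toy example for illustration.
corpus = [
    [("^", "^"), ("Humans", "NNS"), ("are", "VBP"), ("fond", "JJ")],
    [("^", "^"), ("They", "PRP"), ("keep", "VB"), ("pets", "NNS")],
]

tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    for (_, t_prev), (_, t) in zip(sent, sent[1:]):
        trans_count[(t_prev, t)] += 1
    for w, t in sent:
        tag_count[t] += 1
        emit_count[(w, t)] += 1

def p_trans(t_prev, t):   # P(t | t_prev) = #(t_prev followed by t) / #t_prev
    return trans_count[(t_prev, t)] / tag_count[t_prev]

def p_emit(w, t):         # P(w | t) = #(w tagged t) / #t
    return emit_count[(w, t)] / tag_count[t]

print(p_trans("^", "NNS"), p_emit("Humans", "NNS"))
```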
Languages – Rich and Poor
• Resource-rich languages have annotated corpora, tools, language knowledge bases, etc.
• Resource-poor languages lack these resources.
Theoretical Foundations
• Hidden Markov Model (HMM) – a non-deterministic finite state machine with a probability associated with each arc
• Viterbi Algorithm – will be covered in the coming lectures

[Figure: a two-state machine (S0, S1) whose arcs are labelled with symbol:probability pairs such as a:0.1, a:0.2, a:0.4, b:0.2, b:0.3, b:0.5.]
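One way to encode such a machine in code; the original figure's arc-to-state assignments did not survive extraction, so the probabilities below are placeholders chosen only so that each state's outgoing arcs sum to 1:

```python
# Two-state probabilistic FSM in the slide's spirit: each arc carries
# (symbol, probability, next state). Numbers are illustrative placeholders.
arcs = {
    "S0": [("a", 0.1, "S0"), ("b", 0.2, "S0"), ("a", 0.4, "S1"), ("b", 0.3, "S1")],
    "S1": [("a", 0.2, "S0"), ("b", 0.3, "S0"), ("a", 0.2, "S1"), ("b", 0.3, "S1")],
}
```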
What is 'Hidden' in an HMM

Given an output sequence, we do not know which states the machine has transited through. Let the output sequence be 'aaba':

[Figure: a branching tree of possible state paths – from S0, the first 'a' may lead to S0 or S1; from each of those, the next 'a' again branches to S0 or S1, and so forth.]
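The branching can be made explicit by brute force: enumerate every state path that could emit 'aaba', using the `arcs` table from the sketch above (with its placeholder probabilities):

```python
from itertools import product

def paths_for(output, start="S0"):
    """Enumerate every state path that could emit `output`, with its probability."""
    results = []
    for path in product(["S0", "S1"], repeat=len(output)):
        p, state = 1.0, start
        for sym, nxt in zip(output, path):
            # probability mass on arcs from `state` to `nxt` emitting `sym`
            p *= sum(pr for s, pr, n in arcs[state] if s == sym and n == nxt)
            state = nxt
        if p > 0:
            results.append((path, p))
    return results

for path, p in paths_for("aaba"):
    print(path, p)
```

The observer sees only 'aaba'; any of the printed state paths could have produced it, which is exactly what is "hidden".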
HMM and PoS Tagging

In PoS tagging,
• alphabet symbols correspond to words
• states correspond to tags

After seeing the symbol sequence (Humans are fond of animals), find the state sequence that generated it – the PoS tag sequence.
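To make the mapping concrete, here is a brute-force decoder over a toy tag set; all tables and probabilities are invented for illustration. The Viterbi algorithm (coming lectures) computes the same argmax efficiently instead of enumerating every sequence:

```python
from itertools import product

# Toy end-to-end illustration: states = tags, output symbols = words.
# All probabilities are invented placeholders.
TAGS = ["NNS", "JJ", "VBP", "IN"]
trans = {("^", "NNS"): 0.4, ("^", "JJ"): 0.2, ("NNS", "VBP"): 0.5,
         ("JJ", "NNS"): 0.3, ("VBP", "JJ"): 0.2, ("JJ", "IN"): 0.3}
emit = {("Humans", "NNS"): 0.01, ("Humans", "JJ"): 0.001,
        ("are", "VBP"): 0.1, ("fond", "JJ"): 0.01}

def best_tags(words):
    best, best_p = None, 0.0
    for tags in product(TAGS, repeat=len(words)):
        p, prev = 1.0, "^"
        for w, t in zip(words, tags):
            p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
            prev = t
        if p > best_p:
            best, best_p = tags, p
    return best, best_p

print(best_tags(["Humans", "are", "fond"]))  # -> (('NNS', 'VBP', 'JJ'), 4e-07)
```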