
Techniques for PoS Tagging


Presentation Transcript


  1. CS460/IT632 Natural Language Processing / Language Technology for the Web
     Lecture 3 (10/01/06)
     Prof. Pushpak Bhattacharyya, IIT Bombay
     Statistical Formulation of the Part of Speech (PoS) Tagging Problem

  2. Techniques for PoS Tagging
     • Statistical – use probabilistic methods
     • Rule-Based – use linguistic / machine-learnt rules for tagging

  3. Uses of PoS Tagging
     • Parsing
     • Machine Translation
     • Question Answering
     • Text-to-Speech systems
     • Homograph disambiguation – same orthography (spelling) but different pronunciation, e.g. "lead" as a verb vs. as a noun

  4. Noisy Channel Based Modeling
     [Figure: the word sequence W and the tag sequence C are linked through a noisy channel]
     C* = best tag sequence = argmax_C P(C|W)

  5. Applying Bayes Theorem
     C* = argmax_C P(C|W)
        = argmax_C P(C) · P(W|C)
     where P(C) is the prior and P(W|C) is the likelihood
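One intermediate step is implicit on the slide and is spelled out here for clarity: Bayes' theorem introduces a denominator P(W), which can be dropped inside the argmax because it does not depend on the tag sequence C.

     C* = argmax_C P(C|W)
        = argmax_C [ P(C) · P(W|C) / P(W) ]
        = argmax_C P(C) · P(W|C)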

  6. Prior – Bigram Probability
     P(C) = P(C_1|C_0) · P(C_2|C_1 C_0) · P(C_3|C_2 C_1 C_0) · ... · P(C_n|C_{n-1} C_{n-2} ... C_0)
     k-gram approximation (Markov assumption): each tag depends only on the previous k-1 tags
     For k = 2 (bigram assumption):
     P(C) = ∏_{i=1..n} P(C_i|C_{i-1})

  7. Likelihood – Lexical Generation Probability
     P(W|C) = P(W_1|C_1 C_2 ... C_n) · P(W_2|W_1 C_1 C_2 ... C_n) · ... · P(W_n|W_{n-1} ... W_1 C_1 C_2 ... C_n)
     Approximation: W_i depends only on C_i, so
     P(W_i|W_{i-1} ... W_1 C_1 C_2 ... C_n) = P(W_i|C_i)
     Hence P(W|C) = ∏_{i=1..n} P(W_i|C_i)
     Combining prior and likelihood:
     C* = argmax_C ∏_{i=1..n} P(C_i|C_{i-1}) · P(W_i|C_i)
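To make the scoring concrete, here is a minimal Python sketch (not from the lecture) that evaluates ∏ P(C_i|C_{i-1}) · P(W_i|C_i) for one candidate tag sequence. The probability tables are illustrative toy values (two of them are taken from slide 10), not real corpus estimates.

from math import prod  # not strictly needed; shown for clarity of intent

# P(tag_i | tag_{i-1}); "^" stands for the sentence-initial pseudo-tag C0
trans_prob = {
    ("^", "NNS"): 0.00083, ("^", "JJ"): 0.000074,
    ("NNS", "VBP"): 0.15,  ("VBP", "JJ"): 0.05,
}

# P(word_i | tag_i), the lexical generation probabilities
lex_prob = {
    ("Humans", "NNS"): 0.0000093, ("Humans", "JJ"): 0.0000001,
    ("are", "VBP"): 0.3,          ("fond", "JJ"): 0.002,
}

def score(words, tags):
    """Return prod_i P(C_i|C_{i-1}) * P(W_i|C_i) for one candidate tagging."""
    p = 1.0
    prev = "^"                      # C0, the start-of-sentence tag
    for w, t in zip(words, tags):
        p *= trans_prob.get((prev, t), 0.0) * lex_prob.get((w, t), 0.0)
        prev = t
    return p

print(score(["Humans", "are", "fond"], ["NNS", "VBP", "JJ"]))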

  8. Tagging Situation
     Input – "Humans are fond of animals and birds. They keep pets at home."
     Output – Humans_NNS are_VBP fond_JJ of_IN animals_NNS and_CC birds_NNS ._. They_PRNS keep_VB pets_NNS at_IN home_NNP ._.
     Note: the tags are Penn (Penn Treebank) tags.

  9. Formulating the Problem
     [Figure: the words "Humans are fond of animals" with candidate tags C'_k1 ... C'_k10 attached to them]
     Let C'_ki be the possible tags for the corresponding words.

  10. Formulating the Problem (Contd.)
     Let the word "Humans" have two possible tags – NNS and JJ.
     [Figure: from the start state C0, one arc goes to Humans:NNS with weight P(NNS|C0)·P(Humans|NNS), another to Humans:JJ with weight P(JJ|C0)·P(Humans|JJ)]
     The probabilities involved are:
     P(NNS|C0) = 0.00083
     P(JJ|C0) = 0.000074
     P(Humans|NNS) = 0.0000093
     P(Humans|JJ) = 0.0000001
     Should we choose the maximum product path?
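Working out the two products with the numbers above (a simple check, not part of the original slide):

# Compare the two one-step path scores for the word "Humans"
p_nns = 0.00083 * 0.0000093    # P(NNS|C0) * P(Humans|NNS) ≈ 7.7e-09
p_jj  = 0.000074 * 0.0000001   # P(JJ|C0)  * P(Humans|JJ)  ≈ 7.4e-12
print("NNS wins" if p_nns > p_jj else "JJ wins")  # NNS wins by roughly three orders of magnitude

Note that greedily taking the highest-scoring arc at each word is not guaranteed to maximise the product over the whole sentence, which is presumably why the slide poses this as a question; the Viterbi algorithm mentioned on slide 13 addresses exactly this.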

  11. Calculating Probabilities
     We calculate the probabilities by counting in an annotated corpus:
     P(NNS|C0) = #(C0 followed by NNS) / #C0
     P(Humans|NNS) = #("Humans" tagged NNS) / #NNS
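A minimal sketch of this counting, assuming the training data is a list of (word, tag) sentences; the tiny corpus below is invented purely for illustration.

from collections import Counter

# Invented toy corpus of (word, tag) sentences, just to show the counting.
corpus = [
    [("Humans", "NNS"), ("are", "VBP"), ("fond", "JJ"), ("of", "IN"), ("animals", "NNS")],
    [("They", "PRP"), ("keep", "VB"), ("pets", "NNS")],
]

tag_count = Counter()        # #C              e.g. #NNS
bigram_count = Counter()     # #(C_{i-1} C_i)  e.g. #(C0 NNS)
word_tag_count = Counter()   # #(W tagged C)   e.g. #("Humans" tagged NNS)

for sent in corpus:
    prev = "C0"              # sentence-initial pseudo-tag
    tag_count[prev] += 1
    for word, tag in sent:
        bigram_count[(prev, tag)] += 1
        word_tag_count[(word, tag)] += 1
        tag_count[tag] += 1
        prev = tag

# Maximum-likelihood estimates by relative frequency
p_nns_given_c0 = bigram_count[("C0", "NNS")] / tag_count["C0"]
p_humans_given_nns = word_tag_count[("Humans", "NNS")] / tag_count["NNS"]
print(p_nns_given_c0, p_humans_given_nns)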

  12. Languages – Rich and Poor
     • Resource-rich languages have annotated corpora, tools, language knowledge bases, etc.
     • Resource-poor languages lack these resources.

  13. Theoretical Foundations
     • Hidden Markov Model (HMM) – a non-deterministic finite state machine with a probability associated with each arc
     • Viterbi Algorithm – will be covered in the coming lectures
     [Figure: a two-state machine (S0, S1) whose arcs are labelled with output symbols and probabilities: a: 0.1, a: 0.2, a: 0.4, a: 0.2; b: 0.1, b: 0.5, b: 0.3, b: 0.2]
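As an illustration of the data the figure encodes, here is one possible way to write such a two-state HMM down in Python. The exact assignment of the slide's arc probabilities to specific transitions is an assumption (the figure itself is not fully recoverable); it is chosen so that the arcs leaving each state form a probability distribution.

# One possible reading of the two-state HMM in the figure. An arc probability is
# P(next_state, output_symbol | current_state); the pairing below is an assumed
# reconstruction, not taken verbatim from the slide.
hmm_arcs = {
    # (from_state, symbol, to_state): probability
    ("S0", "a", "S0"): 0.1, ("S0", "a", "S1"): 0.4,
    ("S0", "b", "S0"): 0.3, ("S0", "b", "S1"): 0.2,
    ("S1", "a", "S0"): 0.2, ("S1", "a", "S1"): 0.2,
    ("S1", "b", "S0"): 0.1, ("S1", "b", "S1"): 0.5,
}

# Sanity check: the arcs leaving each state sum to 1.
for state in ("S0", "S1"):
    total = sum(p for (s, _, _), p in hmm_arcs.items() if s == state)
    assert abs(total - 1.0) < 1e-9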

  14. What is 'Hidden' in HMM
     Given an output sequence, we do not know which states the machine has transited through.
     Let the sequence of alphabet symbols be 'aaba'.
     [Figure: a tree of possible state sequences starting from S0 – after 'a' the machine may be in S0 or S1, after 'aa' in S0 or S1 again, and so forth]
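A small sketch that makes the 'hidden' part concrete: it enumerates every state path through a two-state machine that could have produced 'aaba', together with its probability. It reuses the same assumed arc table as the previous sketch, so the specific numbers rest on that assumption.

from itertools import product

# Same assumed arc table as in the previous sketch:
# (from_state, symbol, to_state) -> probability
hmm_arcs = {
    ("S0", "a", "S0"): 0.1, ("S0", "a", "S1"): 0.4,
    ("S0", "b", "S0"): 0.3, ("S0", "b", "S1"): 0.2,
    ("S1", "a", "S0"): 0.2, ("S1", "a", "S1"): 0.2,
    ("S1", "b", "S0"): 0.1, ("S1", "b", "S1"): 0.5,
}

def paths_for(output, arcs, start="S0", states=("S0", "S1")):
    """Enumerate every state path that could emit `output`, with its probability."""
    results = []
    for path in product(states, repeat=len(output)):
        prob, prev = 1.0, start
        for symbol, state in zip(output, path):
            prob *= arcs.get((prev, symbol, state), 0.0)
            prev = state
        if prob > 0:
            results.append((path, prob))
    return results

for path, prob in paths_for("aaba", hmm_arcs):
    print(path, prob)   # 2**4 = 16 candidate paths; which one was actually taken is hidden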

  15. HMM and PoS Tagging
     In PoS tagging:
     • Alphabet symbols (outputs) correspond to words
     • States correspond to tags
     After seeing the alphabet (word) sequence, e.g. "Humans are fond of animals", find the state sequence that generated it, i.e. the PoS tag sequence.
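Putting the pieces together, decoding means searching over tag (state) sequences for the one with the highest product of transition and lexical probabilities. Below is a brute-force sketch under the same toy-probability assumptions as the earlier scoring example; the Viterbi algorithm mentioned on slide 13 computes the same argmax without enumerating every sequence.

from itertools import product

# Toy probability tables (illustrative assumptions, as in the earlier sketch).
trans_prob = {("^", "NNS"): 0.00083, ("^", "JJ"): 0.000074,
              ("NNS", "VBP"): 0.15, ("JJ", "VBP"): 0.02, ("VBP", "JJ"): 0.05}
lex_prob = {("Humans", "NNS"): 0.0000093, ("Humans", "JJ"): 0.0000001,
            ("are", "VBP"): 0.3, ("fond", "JJ"): 0.002}
tagset = ("NNS", "JJ", "VBP")

def best_tagging(words):
    """Brute-force argmax over all tag sequences (Viterbi does this efficiently)."""
    best, best_p = None, 0.0
    for tags in product(tagset, repeat=len(words)):
        p, prev = 1.0, "^"                 # "^" is the start pseudo-tag C0
        for w, t in zip(words, tags):
            p *= trans_prob.get((prev, t), 0.0) * lex_prob.get((w, t), 0.0)
            prev = t
        if p > best_p:
            best, best_p = tags, p
    return best, best_p

print(best_tagging(["Humans", "are", "fond"]))  # expected: ('NNS', 'VBP', 'JJ') with a tiny probability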
