POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad
Contents • NLP : Introduction • Language Analysis - Representation • Part-of-speech tags in Indian Languages (Ex. Hindi) • Corpus based methods: An introduction • POS tagging using HMMs • Introduction to TnT • Chunking for Indian languages – Few experiments • Shared task - Introduction
Language • A unique ability of humans • Animals have signs – Sign for danger • But cannot combine the signs • Higher animals – Apes • Can combine symbols (noun & verb) • But can talk only about here and now
Language : Means of Communication • [Diagram: a CONCEPT is coded into language by the speaker and decoded back into a CONCEPT by the listener] • The concept gets transferred through language
Language : Means of thinking What should I wear today? * Can we think without language ?
What is NLP ? • The process of computer analysis of input provided in a human language is known as Natural Language Processing. • [Diagram: Concept → Language → Intermediate representation, used for processing by computer]
Applications • Machine translation • Document Clustering • Information Extraction / Retrieval • Text classification
MT system : Shakti • Machine translation system being developed at IIIT – Hyderabad. • A hybrid translation system which uses the combined strengths of Linguistic, Statistical and Machine learning techniques. • Integrates the best available NLP technologies.
Shakti architecture • English sentence → English sentence analysis ( Morphology → POS tagging → Chunking → Parsing ) → Transfer from English to Hindi ( Word reordering → Hindi word substitution ) → Hindi sentence generation ( Agreement → Word-generation ) → Hindi sentence
Contents • NLP : Introduction • Language Analysis - Representation • Part-of-speech tags in Indian Languages (Ex. Hindi) • Corpus based methods: An introduction • POS tagging using HMMs • Introduction to TnT • Chunking for Indian languages – Few experiments • Shared task - Introduction
Levels of Language Analysis • Morphological analysis • Lexical Analysis ( POS tagging ) • Syntactic Analysis ( Chunking, Parsing ) • Semantic Analysis ( Word sense disambiguation ) • Discourse processing ( Anaphora resolution ) Let’s take an example sentence “Children are watching some programmes on television in the house”
Chunking • What are chunks ? • [[ Children ]] (( are watching )) [[ some programmes ]] [[ on television ]] [[ in the house ]] • Chunks • Noun chunks (NP, PP) in square brackets • Verb chunks (VG) in parentheses • Chunks represent objects • Noun chunks represent objects/concepts • Verb chunks represent actions
Chunking • Representation in SSF
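The SSF table itself did not survive this transcript. As a rough, hedged sketch of the Shakti Standard Format idea for the example sentence (addresses, tokens, tags and chunk brackets below are illustrative, not the slide's actual table):

```
1	((	NP
1.1	Children	NNS
	))
2	((	VG
2.1	are	VBP
2.2	watching	VBG
	))
3	((	NP
3.1	some	DT
3.2	programmes	NNS
	))
```

Each chunk is a (( … )) group with a name (NP, VG), and each token carries its own address and POS tag.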
Morphological analysis • Deals with the word form and its analysis. • Analysis consists of characteristic properties like • Root/Stem • Lexical category • Gender, number, person, etc. • Ex: watching • Root = watch • Lexical category = verb • etc.
Contents • NLP : Introduction • Language Analysis - Representation • Part-of-speech tags in Indian Languages (Ex. Hindi) • Corpus based methods: An introduction • POS tagging using HMMs • Introduction to TnT • Chunking for Indian languages – Few experiments • Shared task - Introduction
POS Tags in Hindi • Broad categories are noun, verb, adjective & adverb. • Words are classified depending on their role, both individually as well as in the sentence. • Example: • vaha aama khaa rahaa hei • pron noun verb verb verb
POS Tagging • Simplest method of POS tagging • Looking in the dictionary khaanaa Dictionary lookup verb
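The dictionary-lookup method above can be sketched in a few lines; the lexicon entries here are toy assumptions, not a real dictionary:

```python
# A minimal sketch of dictionary-lookup POS tagging (toy lexicon; illustrative entries).
LEXICON = {
    "khaanaa": "verb",    # note: ambiguous in practice -- see the next slides
    "jilebii": "noun",
    "tum": "pron",
}

def lookup_tag(word):
    """Return the dictionary tag for a word, or 'unknown' if it is absent."""
    return LEXICON.get(word, "unknown")

print(lookup_tag("khaanaa"))  # verb
print(lookup_tag("seba"))     # unknown
```

This already shows the two problems discussed next: coverage is limited to the dictionary, and one fixed tag per word cannot handle ambiguity.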
Problems with POS Tagging • Size of the dictionary limits the scope of POS-tagger. • Ambiguity • The same word can be used both as a noun as well as a verb. khaanaa noun verb
Problems with POS Tagging • Ambiguity • Sentences in which the word “khaanaa” occurs • tum bahuta achhaa khaanaa banatii ho. • mein jilebii khaanaa chaahataa hun. • Hence, the complete sentence has to be looked at before determining the word’s role and thus its POS tag.
Problems with POS Tagging • Many applications need more specific POS tags than the broad categories. • [The slide shows an example of such finer tags] • Hence, the need for defining a tagset.
Defining the tagset for Hindi (IIIT Tagset) • Issues ! • Fineness V/s Coarseness in linguistic analysis • Syntactic Function V/s lexical category • New tags V/s tags from a standard tagger
Fineness V/s Coarseness • A decision has to be taken whether tags will account for finer distinctions of various features of the parts of speech. • Need to strike a balance • Not too fine, to not hamper machine learning • Not too coarse, to not lose information
Fineness V/s Coarseness • Nouns • Plurality information not taken into account • (noun singular and noun plural are marked with the same tags). • Case information not marked • (noun direct and noun oblique are marked with the same tags). • Adjectives and Adverbs • No distinction between comparative and superlative forms • Verbs • Finer distinctions are made (eg., VJJ, VRB, VNN) • Helps us understand the arguments that a verb form can take.
Fineness in Verb tags • Useful for tasks like dependency parsing as we have better information about the arguments of a verb form. • Non-finite forms of verbs which are used as nouns or adjectives or adverbs still retain their verbal property. • (VNN -> noun formed from a verb) • Example: aasamaana/NN mein/PREP udhane/VNN vaalaa/PREP ghodhaa/NN niiche/NLOC utara/VFM aayaa/VAUX • Gloss: “sky” “in” “flying” (vaalaa) “horse” “down” “climb” “came”
Syntactic V/S Lexical • Whether to tag the word based on lexical or syntactic category. • Should “uttar” in “uttar bhaarata” be tagged as noun or adjective ? • Lexical category is given more importance than syntactic category while marking text manually. • Leads to consistency in tagging.
New tags v/s tags from standard tagset • Entirely new tagset for Indian languages not desirable as people are familiar with standard tagsets like Penn tags. • Penn tagset has been used as benchmark while deciding tags for Hindi. • Wherever Penn tagset has been found inadequate, new tags introduced. • NVB New tag for kriyamuls or Light verbs • QW Modified tag for question words
IIIT Tagset • Tags are grouped into three types. • Group1 : Adopted from the Penn tagset with minor changes. • Group2 : Modification over Penn tagset. • Group3 : Tags not present in Penn tagset. • Examples of tags in Group3 • INTF ( Intensifier ) : Words like ‘baHuta’, ‘kama’ etc. • NVB, JVB, RBVB : Light verbs. • Detailed guidelines would be put online.
Contents • NLP : Introduction • Language Analysis - Representation • Part-of-speech tags in Indian Languages (Ex. Hindi) • Corpus based methods: An introduction • POS tagging using HMMs • Introduction to TnT • Chunking for Indian languages – Few experiments • Shared task - Introduction
Corpus – based approach • Training: POS tagged corpus → Learn → POS tagger • Tagging: Untagged new corpus → POS tagger → Tagged new corpus
POS tagging : A simple method • Pick the most likely tag for each word • Probabilities can be estimated from a tagged corpus. • Assumes independence between tags. • Accuracy < 90%
POS tagging : A simple method • Example • Brown corpus, 182159 tagged words (training section), 26 tags • Example sentence : mujhe xo kitabein xijiye • Word xo occurs 267 times, • 227 times tagged as QFN • 29 times as VAUX • P(QFN | W=xo) = 227/267 = 0.8502 • P(VAUX | W=xo) = 29/267 = 0.1086
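The "pick the most likely tag per word" method can be sketched as follows; the toy corpus and counts below are illustrative (chosen so that P(QFN | xo) = 3/4), not the slide's corpus:

```python
from collections import Counter, defaultdict

# Toy tagged corpus: a list of (word, tag) pairs (illustrative data).
tagged_corpus = [
    ("xo", "QFN"), ("xo", "QFN"), ("xo", "QFN"), ("xo", "VAUX"),
    ("kitabein", "NN"), ("mujhe", "PRP"),
]

# Count how often each word occurs with each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def most_likely_tag(word):
    """Return the tag the word was seen with most often in training."""
    return tag_counts[word].most_common(1)[0][0]

# P(QFN | W = xo) = 3 / 4 = 0.75 on this toy corpus
print(most_likely_tag("xo"))  # QFN
```

Because each word is tagged in isolation, context (the neighbouring tags) is ignored, which is exactly what the HMM approach next fixes.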
Contents • NLP : Introduction • Language Analysis - Representation • Part-of-speech tags in Indian Languages (Ex. Hindi) • Corpus based methods: An introduction • POS tagging using HMMs • Introduction to TnT • Chunking for Indian languages – Few experiments • Shared task - Introduction
POS tagging using HMMs Let W be a sequence of words W = w1 , w2 … wn Let T be the corresponding tag sequence T = t1 , t2 … tn Task : Find T which maximizes P ( T | W ) T’ = argmaxT P ( T | W )
POS tagging using HMM By Bayes Rule, P ( T | W ) = P ( W | T ) * P ( T ) / P ( W ) T’ = argmaxT P ( W | T ) * P ( T ) P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t1 t2 ) …… * P ( tn | t1 … tn-1 ) Applying Bi-gram approximation, P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t2 ) …… * P ( tn | tn-1 )
POS tagging using HMM P ( W | T ) = P ( w1 | T ) * P ( w2 | w1 T ) * P ( w3 | w1 w2 T ) * ……… P ( wn | w1 … wn-1 , T ) = Πi = 1 to n P ( wi | w1…wi-1 T ) Assume, P ( wi | w1…wi-1 T ) = P ( wi | ti ) Now, T’ is the one which maximizes, P ( t1 ) * P ( t2 | t1 ) * …… * P ( tn | tn-1 ) * P ( w1 | t1 ) * P ( w2 | t2 ) * …… * P ( wn | tn )
POS tagging using HMM • If we use Tri-gram model instead for the tag sequence, • P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t1 t2 ) …… * P ( tn | tn-2 tn-1 ) • Which model to choose ? • Depends on the amount of data available ! • Richer models ( Tri-grams, 4-grams ) require lots of data.
Chain rule with approximations • P ( W = “vaha ladakaa gayaa” , T = “det noun verb” ) = P ( det ) * P ( vaha | det ) * P ( noun | det ) * P ( ladakaa | noun ) * P ( verb | noun ) * P ( gayaa | verb ) • [Diagram: hidden states det → noun → verb emitting vaha, ladakaa, gayaa]
Chain rule with approximations: Example • P ( vaha | det ) = ( Number of times ‘vaha’ appeared as ‘det’ in the corpus ) / ( Total number of occurrences of ‘det’ in the corpus ) • P ( verb | noun ) = ( Number of times ‘verb’ followed ‘noun’ in the corpus ) / ( Total number of occurrences of ‘noun’ in the corpus ) • Suppose we obtained the following estimates from the corpus: P ( det ) = 0.5 , P ( vaha | det ) = 0.4 , P ( noun | det ) = 0.99 , P ( ladakaa | noun ) = 0.5 , P ( verb | noun ) = 0.4 , P ( gayaa | verb ) = 0.02 • Then P ( W , T ) = 0.5 * 0.4 * 0.99 * 0.5 * 0.4 * 0.02 = 0.000792
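Plugging the example estimates into the bigram approximation, the joint probability can be computed mechanically; a minimal sketch (the dictionaries hold only the entries needed for this one sentence):

```python
# Computing P(W, T) under the bigram approximation with the slide's example estimates.
p_start = {"det": 0.5}
p_trans = {("det", "noun"): 0.99, ("noun", "verb"): 0.4}
p_emit = {("det", "vaha"): 0.4, ("noun", "ladakaa"): 0.5, ("verb", "gayaa"): 0.02}

words = ["vaha", "ladakaa", "gayaa"]
tags = ["det", "noun", "verb"]

# P(t1) * P(w1|t1), then P(ti|ti-1) * P(wi|ti) for each following position.
prob = p_start[tags[0]] * p_emit[(tags[0], words[0])]
for i in range(1, len(words)):
    prob *= p_trans[(tags[i - 1], tags[i])] * p_emit[(tags[i], words[i])]

print(prob)  # 0.000792 (up to floating-point rounding)
```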
POS tagging using HMM We need to estimate three types of parameters from the corpus Pstart(ti) = (no. of sentences which begin with ti ) / ( no. of sentences ) P ( ti | ti-1 ) = count ( ti-1 ti ) / count ( ti-1 ) P ( wi | ti ) = count ( wi with ti ) / count ( ti ) These parameters can be directly represented using the Hidden Markov Models (HMMs) and the best tag sequence can be computed by applying Viterbi algorithm on the HMMs.
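The three parameter types above can be estimated by simple counting; a sketch on a two-sentence toy corpus (the sentences and tags are illustrative):

```python
from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("vaha", "det"), ("ladakaa", "noun"), ("gayaa", "verb")],
    [("vaha", "pron"), ("hansaa", "verb")],
]

start, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
for sent in corpus:
    start[sent[0][1]] += 1                    # tag that begins the sentence
    for i, (word, tag) in enumerate(sent):
        tag_count[tag] += 1
        emit[(word, tag)] += 1                # word seen with this tag
        if i > 0:
            trans[(sent[i - 1][1], tag)] += 1  # tag bigram

# Relative-frequency (maximum-likelihood) estimates of the three parameter types.
p_start = {t: c / len(corpus) for t, c in start.items()}
p_trans = {(t1, t2): c / tag_count[t1] for (t1, t2), c in trans.items()}
p_emit = {(w, t): c / tag_count[t] for (w, t), c in emit.items()}

print(p_start["det"])             # 0.5
print(p_trans[("noun", "verb")])  # 1.0
```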
Markov models • Markov Chain • An event is dependent on the previous events. • Consider the word sequence: usane kahaa ki • Here, each word is dependent on the previous word. Hence, it is said to form a Markov chain of order 1.
Hidden Markov models • [Diagram: observation sequence O = o1 o2 o3 o4, hidden state sequence X = x1 x2 x3 x4, index of sequence t = 1 2 3 4] • The hidden states follow the Markov property. Hence, this model is known as a Hidden Markov Model.
Hidden Markov models • Representation of parameters in HMMs • Define O ( t ) = tth observation • Define X ( t ) = hidden state value at the tth position • Transition matrix A : aab = P ( X ( t+1 ) = Xb | X ( t ) = Xa ) • Emission matrix B : bak = P ( O ( t ) = Ok | X ( t ) = Xa ) • PI matrix : pia = probability of starting with hidden state Xa • The model is μ = { A , PI , B }
HMM for POS tagging Observation sequence === Word sequence Hidden state sequence === Tag sequence Model A = P ( current tag | previous tag ) B = P ( current word | current tag ) PI = Pstart ( tag ) Tag sequences are mapped to Hidden state sequences because they are not observable in the natural language text.
Example • [The slide shows example values for the A , PI and B matrices of a toy model]
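In lieu of the slide's matrices, a hypothetical toy model μ = { A , PI , B } might look like this (all numbers are invented for illustration; the key constraint is that PI and each row of A and B sum to 1):

```python
# Hypothetical toy HMM for tags {det, noun, verb} and words {vaha, ladakaa, gayaa}.
tags = ["det", "noun", "verb"]

PI = {"det": 0.5, "noun": 0.3, "verb": 0.2}  # PI[a] = P(start in tag a)

A = {  # A[a][b] = P(next tag = b | current tag = a)
    "det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
    "noun": {"det": 0.10, "noun": 0.50, "verb": 0.40},
    "verb": {"det": 0.30, "noun": 0.60, "verb": 0.10},
}

B = {  # B[a][k] = P(word = k | tag = a)
    "det":  {"vaha": 0.40, "ladakaa": 0.35, "gayaa": 0.25},
    "noun": {"vaha": 0.10, "ladakaa": 0.50, "gayaa": 0.40},
    "verb": {"vaha": 0.08, "ladakaa": 0.02, "gayaa": 0.90},
}

# Each row of A and B, and PI itself, must be a probability distribution.
for t in tags:
    assert abs(sum(A[t].values()) - 1.0) < 1e-9
    assert abs(sum(B[t].values()) - 1.0) < 1e-9
assert abs(sum(PI.values()) - 1.0) < 1e-9
```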
POS tagging using HMM The problem can be formulated as, Given the observation sequence O and the model μ = (A, B, PI), how to choose the best state sequence X which explains the observations ? • Consider all the possible tag sequences and choose the tag sequence having the maximum joint probability with the observation sequence: X_max = argmax ( P ( O , X ) ) • The complexity of this is high: with N tags and T words there are N^T candidate sequences. • The Viterbi algorithm is used for computational efficiency.
POS tagging using HMM • [Trellis: observations O = vaha, ladakaa, hansaa at t = 1, 2, 3 ; candidate hidden states det, noun, verb at each position] • 3^3 = 27 tag sequences possible ! = 27 paths
Viterbi algorithm • [Same trellis: O = vaha, ladakaa, hansaa ; states det, noun, verb at t = 1, 2, 3] • Let αnoun(ladakaa) represent the probability of reaching the state ‘noun’ taking the best possible path and generating the observation ‘ladakaa’
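The α recursion above can be written compactly; this is a toy sketch, not TnT's implementation (it works on raw probabilities rather than logs, which is fine for short inputs, and missing transitions/emissions default to 0). The probabilities reuse the earlier chain-rule example:

```python
# A compact sketch of the Viterbi algorithm for HMM POS tagging.
def viterbi(words, tags, pi, trans, emit):
    """Return the most probable tag sequence for `words` and its joint probability."""
    # alpha[t] = probability of the best path ending in tag t at the current position
    alpha = {t: pi.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
    backptrs = []  # backptrs[i][t] = best predecessor of tag t at position i+1
    for w in words[1:]:
        prev = alpha
        alpha, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda s: prev[s] * trans.get((s, t), 0.0))
            alpha[t] = prev[best] * trans.get((best, t), 0.0) * emit.get((t, w), 0.0)
            ptr[t] = best
        backptrs.append(ptr)
    # Follow back-pointers from the best final tag to recover the path.
    last = max(tags, key=lambda t: alpha[t])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1], alpha[last]

pi = {"det": 0.5, "noun": 0.3, "verb": 0.2}
trans = {("det", "noun"): 0.99, ("noun", "verb"): 0.4, ("noun", "noun"): 0.1}
emit = {("det", "vaha"): 0.4, ("noun", "ladakaa"): 0.5, ("verb", "gayaa"): 0.02}

path, prob = viterbi(["vaha", "ladakaa", "gayaa"], ["det", "noun", "verb"], pi, trans, emit)
print(path, prob)  # ['det', 'noun', 'verb'] 0.000792 (up to rounding)
```

At each position only N values of α are kept, so the cost is O(T * N^2) instead of enumerating all N^T paths.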