Morphology 2: A case study of developing a Bengali morph analyzer and generator
Sudeshna Sarkar, IIT Kharagpur
Two level morphology
• PC-KIMMO is a morphological parser based on Kimmo Koskenniemi's model of two-level morphology (Koskenniemi 1983).
• Koskenniemi's model of two-level morphology was based on the traditional distinction that linguists make between
  • morphotactics, which enumerates the inventory of morphemes and specifies in what order they can occur, and
  • morphophonemics, which accounts for alternate forms or "spellings" of morphemes according to the phonological context in which they occur.
For example, the word chased is analyzed morphotactically as the stem chase followed by the suffix -ed.
• However, the addition of the suffix -ed apparently causes the loss of the final e of chase; thus chase and chas are allomorphs, or alternate forms, of the same morpheme.
• Koskenniemi's model is "two-level" in the sense that a word is represented as a direct, letter-for-letter correspondence between its lexical or underlying form and its surface form. For example, the word chased is given this two-level representation (where + is a morpheme boundary symbol and 0 is a null character):
Lexical form: c h a s e + e d
Surface form: c h a s 0 0 e d
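The letter-for-letter correspondence can be sketched in a few lines of Python. This is only an illustration of the representation on the slide, not PC-KIMMO code; the function names `align` and `surface_string` are made up for this example.

```python
NULL = "0"  # null character, as used in the two-level representation above

def align(lexical, surface):
    """Pair each lexical symbol with the surface symbol at the same position."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must have the same length")
    return list(zip(lexical, surface))

def surface_string(pairs):
    """Read off the surface word by dropping null characters."""
    return "".join(s for _, s in pairs if s != NULL)

pairs = align("chase+ed", "chas00ed")
# Positions where the two levels differ: the final e of chase and the boundary +
differences = [(lex, surf) for lex, surf in pairs if lex != surf]
```

Here `differences` holds exactly the two pairs e:0 and +:0, and `surface_string(pairs)` gives back the word chased.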
Main components of Karttunen's KIMMO parser
• the rules component: two-level rules that account for regular phonological or orthographic alternations, such as chase versus chas.
• the lexical component: lists all morphemes (stems and affixes) in their lexical form and specifies morphotactic constraints.
Englex: a two-level description of English morphology
• Englex consists of a set of orthographic rules, a 20,000-entry lexicon of roots and affixes, and a word grammar.
• With Englex and PC-KIMMO, you can morphologically parse English words and text.
Generative rules and 2-level rules
• Two-level rules are similar to the rules of standard generative phonology, but differ in several crucial ways.
Rule R1 is an example of a generative rule.
R1   t ---> c / ___ i
Rule R2 is the analogous two-level rule.
R2   t:c => ___ i
Generative rules
• Transformational rules
• Sequential application
• Unidirectional
Two-level rules
• Declarative: they talk about correspondences
• They apply in parallel
• Bidirectional
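The declarative reading of R2 can be sketched as a check on an already aligned lexical/surface pair: it constrains where the correspondence t:c may occur rather than rewriting anything. The sketch assumes the context "i" means an i:i pair and that `=>` means "only in this context, but not necessarily always"; the function name `obeys_r2` is hypothetical.

```python
def obeys_r2(lexical, surface):
    """Return True if every t:c pair is immediately followed by an i:i pair."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must have the same length")
    pairs = list(zip(lexical, surface))
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            # t may surface as c only when the next correspondence is i:i
            if k + 1 >= len(pairs) or pairs[k + 1] != ("i", "i"):
                return False
    return True
```

Because the function only tests a correspondence, the same check serves both directions: it can filter candidate surface forms during generation or candidate lexical forms during recognition.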
Hindi noun analysis
A. Noun analysis
Nouns are categorised into 20 different paradigms based on the following criteria:
1. Vowel ending.
2. Valid suffix of a word.
3. Gender, Number, Person and Case information.
A snapshot of the analysis is shown in Table 2.1. There are 20,000 nouns classified into 20 such paradigms.
Hindi verb analysis
B. Verb Analysis
The Verb Group represents the following grammatical properties:
1. Tense: Present, Past and Future.
2. Aspect: Durative, Stative, Infinitive, Habitual, Perfective, etc.
3. Modal: Abilitive, Deontic, Probabilitative, etc.
4. Gender: Male, Female, Dual.
5. Person: 1st, 2nd and 3rd.
These values formed the basis for listing Verb Groups (VGs) according to their TAM-GNP values. A TAM-GNP matrix containing all possible VGs was developed.
IITB morph analyzer: presently there are 622 unique paradigms in the TAM-GNP matrix.
Morphology: Verb
Attribute 1: Root
  Val 0: root word of the given surface form of the word
Attribute 2: Category
  Val 0: verb (v)
Attribute 3: Person
  Val 0: first, Val 1: second normal, Val 2: second familiar, Val 3: third normal, Val 4: formal (second/third)
Attribute 4: Tense
  Val 0: Present, Val 1: Past, Val 2: Future
Attribute 5: Aspect
  Val 0: simple, Val 1: continuous, Val 2: perfect
Attribute 6: Modality
Attribute 8: Specificity
  Val 0: non-specific, Val 1: specific
Attribute 9: Emphasizer
  Val 0: none, Val 1: only, Val 2: also
Attribute 10: Polarity
  Val 0: positive, Val 1: negative
Attributes & Values (Verb): Person
• First Person (1): Ami
• Second Formal (2): Apani
• Second Normal (3): tumi
• Second Familiar (4): tui
• Third Normal (5): se
• Third Formal (6): tini
• Unspecified
Attributes & Values (Verb): Tense
• Present (1): kari
• Past (2): karalAma
• Future (3): karaba
• Overall (4)
Attributes & Values (Verb): Aspect
• Simple (1): karalAma
• Habitual (2): karatAma
• Continuous (3): karachhe
• Perfect (4): karechhi
• Indefinite (5): kari
Attributes & Values (Verb): Modality
• Indicative (1): kara
• Imperative (2): kar
• Subjunctive (3): karale
Attributes & Values (Verb): Polarity
• Positive (1): kari
• Negative (2): karini
INFORMATION: VERBS
• Total number of categories (based on syllabic structure): 20
• Rules: 214 per category
• Total number of rules: 214 × 20 = 4280 (approx.)
Classification: Nouns
Morphological classification based on different types of nouns:
1. Animate (example: mAnuSha)
2. Inanimate (example: mATi)
3. Abstract/Qualitative (example: daYA)
4. Verbal (example: bhojana)
5. Collective (example: pAla)
6. The Singular (example: chandra)
7. Compounded (example: riksAoYAlA)
Sub Classification: Nouns
Sub classification based on "root endings":
1. a-ending root (animate "mAnusha")
2. A-ending root (animate "bAlikA")
3. i-ending root (animate "pAkhi")
4. I-ending root (animate "khukI")
5. e-ending root (animate "chhele")
6. o-ending root (animate "myA;o")
7. u-ending root (animate "shishu")
8. U-ending root (animate "badhU")
Classification: Pronouns
Morphological analysis based on different natures of pronouns:
1. Personal (Ami, Apani, -)
2. Inclusive (saba, sakala, ubhaYa, -)
3. Relative (ye, yAhA, -)
4. Interrogative (ke, ki, -)
5. Denoting Others (anya, para, -)
6. Near Demonstrative (e, ihA, -)
7. Far Demonstrative (o, uhA, -)
8. Reflexive (nija, nijenije, -)
9. Indefinite (keu, kichhu, -)
Morphology: Pronoun
Attributes:
• Number
  • Val 0: singular, Val 1: plural, Val 2: honorary plural
• Form
  • Val 0: direct, Val 1: oblique
• Specificity
  • Val 0: non-specific, Val 1: specific
• Case
  • Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative
• Emphatic Marker
  • Val 0: none, Val 1: only, Val 2: also
• Ellipses
  • Val 0: false, Val 1: true
• Nature
  • Types
Bengali POS Categories (Noun)
The Bengali noun has the following attributes: Number, Specificity, Ellipses, Form, Case and Emphasizer
• Number has 2 values (Singular and Plural)
• Specificity has 2 values (Specific and non_specific)
• Ellipses has 2 values (Elliptic and non_elliptic)
• Form has 2 values (Direct and Oblique)
• Case has 5 values (Nominative, Accusative, Genitive, Locative, Instrumental)
• Emphasizer has 3 values (None, Only, Also)
Adjective Morphology
Root
• Val 0: root word of the given surface form of the word
Specificity
• Val 0: non-specific, Val 1: specific
Emphasizer
• Val 0: none, Val 1: only, Val 2: also
Degree
• Val 0: normal, Val 1: superlative, Val 2: comparative
Gender
• Val 0: masculine, Val 1: feminine, Val 2: neuter
Adverb Morphology
Root
• Val 0: root word of the given surface form of the word
Emphasizer
• Val 0: none, Val 1: only, Val 2: also
Degree
• Val 0: normal, Val 1: superlative, Val 2: comparative
Postposition Morphology
Root
• Val 0: root word of the given surface form of the word
Emphasizer
• Val 0: none, Val 1: only, Val 2: also
Morphological Generator Developed at IIT Kharagpur
Introduction
The Morphological Generator uses a set of linguistic resources to generate the surface form from a given input. The following linguistic resources are required:
• Root Dictionary
• Morphological Rules
  • Rule/Attribute Type Declaration (RATD)
  • Morphotactics
  • Paradigm Tables
  • Orthographic Rewrite Rules
• Exception List
Format of the root dictionary
<root_word>:<category, paradigm_no;>+
• root_word: The root word in UTF-8
• category: Part-of-speech category
• paradigm_no: A non-negative number referring to the paradigm table to be used for generating the surface form of root_word when it is used as that particular POS category.
• +: denotes one or more occurrences of <category, paradigm_no;>
Example for Hindi:
• कर: NN,0; VM,1;
• आम: NN,1; JJ, 0;
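A minimal sketch of reading one root dictionary line in the format above. The function name `parse_root_entry` and the returned tuple layout are assumptions made for illustration; the slides do not describe the generator's actual reader.

```python
def parse_root_entry(line):
    """Split '<root_word>:<category, paradigm_no;>+' into (root, [(category, paradigm_no), ...])."""
    root, _, rest = line.partition(":")
    entries = []
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        category, paradigm_no = (part.strip() for part in item.split(","))
        entries.append((category, int(paradigm_no)))
    return root.strip(), entries

root, entries = parse_root_entry("आम: NN,1; JJ, 0;")
```

For the second Hindi example this yields the root आम with one paradigm number per POS category, so the same root can be inflected differently as a noun and as an adjective.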
RATD
The first line of the RATD is
<#categories> <cat_tag>+
• #categories: The total number of distinct categories for which morphological generation is required.
• cat_tag: The category tag, as used in the root dictionary, for which generation is required.
Example: 3 NN QC VM
RATD
This is followed by the declarations for each of the #categories categories. The declaration for each category consists of a meta declaration line followed by #morphotactics lines specifying the morphotactic rules. The meta declaration for a category is as follows:
<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values_for_attribute>+
• cat_tag: As defined above
• file_name: The name of the file that contains the morphotactics, paradigm tables and rewrite rules of the particular category.
• #paradigms: Total number of paradigms for the category
• #morphotactics: Total number of linear morphotactic rules for the category
• #attributes: Total number of attributes that govern the morphology
• #values_for_attribute: The number of values for each of the attributes.
Example: NN nn.txt 5 1 2 2 2
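The two RATD line formats described above could be read as follows. This is a sketch under the single-file format of this slide (one file_name per category); the function names and dictionary keys are hypothetical.

```python
def parse_ratd_header(line):
    """Parse '<#categories> <cat_tag>+' into a list of category tags."""
    fields = line.split()
    count, tags = int(fields[0]), fields[1:]
    if len(tags) != count:
        raise ValueError("number of tags does not match #categories")
    return tags

def parse_meta_declaration(line):
    """Parse '<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values>+'."""
    fields = line.split()
    n_paradigms, n_morphotactics, n_attributes = (int(x) for x in fields[2:5])
    value_counts = [int(x) for x in fields[5:]]
    if len(value_counts) != n_attributes:
        raise ValueError("expected one value count per attribute")
    return {
        "cat_tag": fields[0],
        "file_name": fields[1],
        "paradigms": n_paradigms,
        "morphotactics": n_morphotactics,
        "value_counts": value_counts,  # one entry per attribute
    }

meta = parse_meta_declaration("NN nn.txt 5 1 2 2 2")
```

Read this way, the example line declares 5 paradigms, 1 morphotactic rule, and 2 attributes with 2 values each for the category NN.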
Morphotactics
The morphotactics are specified linearly in the following format:
{ '(' { attribute_id, }+ ')' }+
• For example, the morphotactic rule (0, 2)(3)(1, 4) means that the suffix marking features 0 and 2 is followed by the suffix marking feature 3 and then by the suffix marking features 1 and 4.
• We assume a linear morphology.
• We assume that inflections are in the form of suffixes only (i.e. no prefix or infix).
• In the above example, it is not possible to split the suffixes marking features 0 and 2, or 1 and 4. In other words, the suffixes are fusional as far as the (0,2) or (1,4) feature combinations are concerned, but the morphology is agglutinative in general.
• There can be more than one morphotactic rule for a category in a language. In that case, the first rule is taken as the default one, whereas the other rules are triggered only under special circumstances, which are specified within the rule by assigning a specific value to a feature. For example, (0, 2=5)(3)(1, 4) means that the rule is triggered only when Attribute 2 has a value of 5.
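A morphotactic rule in this notation can be turned into a simple data structure, sketched here with a regular expression. The representation (a list of groups, each a list of (attribute_id, required_value) pairs with None for "no condition") and the function name are assumptions for illustration.

```python
import re

def parse_morphotactic_rule(rule):
    """Parse e.g. '(0, 2=5)(3)(1, 4)' into groups of (attribute_id, required_value)."""
    groups = []
    for body in re.findall(r"\(([^()]*)\)", rule):
        group = []
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            attr, _, required = item.partition("=")
            # required_value is None unless the rule pins the attribute to a value
            group.append((int(attr), int(required) if required else None))
        groups.append(group)
    return groups

parsed = parse_morphotactic_rule("(0, 2=5)(3)(1, 4)")
```

Each inner list corresponds to one suffix slot, in the order the suffixes are attached, and a non-None second element records a trigger condition such as "Attribute 2 has value 5".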
Morphotactics example
Bengali noun morphology
• Attribute 0: Number. Val 0: singular, Val 1: plural
• Attribute 1: Obliqueness. Val 0: direct, Val 1: oblique
• Attribute 2: Specificity. Val 0: non-specific, Val 1: specific
• Attribute 3: Case. Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative
• Attribute 4: Emphasizer. Val 0: none, Val 1: only, Val 2: also
• Attribute 5: Ellipses. Val 0: false, Val 1: true
Bengali nouns follow one of the following two morphotactics:
• (0,1,2)(3)(4)
• (0,1,2)(5=1)(0,1,2)(3)(4)
The second rule is triggered only in the case of ellipses.
Paradigm Table
• The category-specific files (e.g. nn.txt in the earlier example) store the paradigm tables and orthographic rewrite rules.
• For every paradigm number there is a paradigm table for each feature or feature combination in the morphotactics. Thus, if there are #paradigms paradigms for Bengali nouns, then there are 4 × #paradigms paradigm tables. The 4 tables per paradigm correspond to (0,1,2), (3), (4), and (5).
• However, several paradigms might share some of the tables. Therefore, in the declaration, a particular table can stand for more than one paradigm.
A paradigm table contains the list of suffixes for a particular combination of attributes.
<ParadigmTable <Attributes a1, a2> <ParadigmNumber x1, x2, x3> <Suffixes s11, s12, s13,…, s21, s22, s23,…>
The number of suffixes in a table is equal to the product of the numbers of values of the attributes in that combination.
Example: If the combination is (0,1), the 1st attribute has 10 values and the 2nd attribute has 3 values, then the table for the combination (0,1) will contain 10 × 3 = 30 suffixes (some of which may be NULL).
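The size constraint gives a cheap consistency check when loading a table. A small sketch, with hypothetical function names, assuming the value counts come from the RATD declaration:

```python
from math import prod

def expected_table_size(value_counts):
    """Number of suffix entries for a combination of attributes."""
    return prod(value_counts)

def check_table(suffixes, value_counts):
    """Raise if a paradigm table does not have one entry per value combination."""
    expected = expected_table_size(value_counts)
    if len(suffixes) != expected:
        raise ValueError(f"expected {expected} suffixes, found {len(suffixes)}")
    return True
```

For the example above, `expected_table_size([10, 3])` is 30; NULL entries still count, because every value combination needs a slot.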
Orthographic Rules
Orthographic rules are specified as rewrite rules of the following form:
input → output / left_context, right_context
We also have provisions to specify two-layer rules, where the top layer specifies the rule on strings and the bottom layer indicates the features. Thus, a rule of the type
input → output / left_context, right_context
[att1]           [root],       [att2]
means that when the suffix corresponding to the attribute att1 has the pattern input, and it is immediately preceded by the pattern left_context, which belongs to the root, and followed by the pattern right_context, which belongs to another suffix corresponding to some attribute att2, then input should be replaced by the pattern output.
RATD for Bengali
• 11 NN QC VM PN AV AJ PS OT UT QF QO
• NN nn.txt nn_rule.txt mean_noun.txt 1 1 6 2 2 2 2 5 3
• QC qc.txt qc_rule.txt mean_card.txt 1 1 4 4 2 2 3
• VM vm.txt vm_rule.txt mean_verb.txt 1 2 5 6 10 3 2 2
• PN pn.txt pn_rule.txt mean_pron.txt 1 2 7 2 2 2 2 2 5 3
• AV av.txt av_rule.txt mean_adv.txt 1 1 2 3 3
• AJ aj.txt aj_rule.txt mean_adj.txt 1 1 2 3 3
• PS ps.txt ps_rule.txt mean_psp.txt 1 1 1 3
• OT ot.txt ot_rule.txt mean_oth.txt 1 1 1 1
• UT ut.txt ut_rule.txt mean_quot.txt 1 1 1 3
• QF qf.txt qf_rule.txt mean_quan.txt 1 1 2 2 3
• QO qo.txt qo_rule.txt mean_ord.txt 1 1 1 3
• symbols: aAbcdDeghiIjklmn.;NoprsStTuUyY
Orthographic Rules
The format is similar to two-level morphological rules. Each rule has 4 parts:
input:output/left_context,right_context
Here input is changed to output provided that input is preceded by left_context and followed by right_context. The suffix is terminated by #, and ^ marks the boundary between root and suffix.
Example: "give^ing# = giving" can be written as the rule
Rule 1: e^:NULL/giv,ing#
If we say that all "e-ending" words are inflected like "give", then we can write the rule
Rule 2: e^:NULL/*,ing#
If we say that all "a-ending" and "o-ending" words are simply concatenated when "ing#" is added, we can write
Rule 3: ^:NULL/*~,ing#
(where the ~ symbol means either 'a' or 'o')
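One way to experiment with rules in this format is to translate each rule into a regular expression. This is a sketch, not the generator's implementation, and it makes several assumptions: `*` stands for any sequence of characters, the left context must match everything before input, the right context must match everything after it, class symbols such as ~ are supplied in a dictionary, and the first matching rule wins. The names `compile_rule` and `apply_rules` are hypothetical.

```python
import re

def compile_rule(rule, classes=None):
    """Turn 'input:output/left,right' into (compiled regex, replacement string)."""
    classes = classes or {}
    change, context = rule.split("/")
    source, target = change.split(":")
    left, right = context.split(",")

    def to_regex(pattern):
        parts = []
        for ch in pattern:
            if ch == "*":
                parts.append(".*")  # any sequence of characters
            elif ch in classes:
                parts.append("[" + re.escape(classes[ch]) + "]")  # one character of a class
            else:
                parts.append(re.escape(ch))
        return "".join(parts)

    regex = re.compile(
        "^(" + to_regex(left) + ")" + re.escape(source) + "(" + to_regex(right) + ")$"
    )
    return regex, ("" if target == "NULL" else target)

def apply_rules(form, rules, classes=None):
    """Apply the first triggered rule to 'root^suffix#', then drop the markers."""
    for rule in rules:
        regex, target = compile_rule(rule, classes)
        match = regex.match(form)
        if match:
            form = match.group(1) + target + match.group(2)
            break
    # If no rule is triggered, this amounts to simple concatenation.
    return form.replace("^", "").replace("#", "")

RULES = ["e^:NULL/*,ing#", "^:NULL/*~,ing#"]
CLASSES = {"~": "ao"}  # ~ means either 'a' or 'o'
```

With these definitions, `apply_rules("give^ing#", RULES, CLASSES)` drops the final e via Rule 2, while a form such as "go^ing#" is handled by Rule 3 and simply concatenated.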
Orthographic Rules Contd.
Orthographic rules are best implemented as deterministic finite state machines (FSMs).
• The FSM decides whether the rule is satisfied by the input word: if the input word drives the FSM to its final state, we say the rule is triggered.
• Once a rule is triggered, finding the portion to be replaced is not difficult.
• If no orthographic rule is triggered, the suffix is simply concatenated.
Building FSM
Example FSM for Rule 2: e^:NULL/*,ing#
[Figure: state diagram with states S, A, B, C, D, E, F, G, H; transitions labelled e, ^, i, n, g, #, * and the complement labels *-e, *-e-^, *-i, *-n, *-g, *-#]
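A deterministic FSM for Rule 2 can be sketched directly in code. The states here are numbered 0 to 6 (the count of pattern symbols matched so far) rather than using the state letters of the diagram, whose exact layout is not recoverable; the fallback transition relies on all symbols of "e^ing#" being distinct.

```python
PATTERN = "e^ing#"          # input followed by the right context of Rule 2
FINAL = len(PATTERN)        # state reached once the whole pattern has been read

def next_state(state, ch):
    """Deterministic transition: advance on the expected symbol, otherwise fall back."""
    if state < FINAL and ch == PATTERN[state]:
        return state + 1
    # Pattern symbols are all distinct, so after a mismatch only a fresh 'e'
    # can start a new partial match.
    return 1 if ch == PATTERN[0] else 0

def rule_triggered(form):
    """True if reading the whole form leaves the FSM in its final state."""
    state = 0
    for ch in form:
        state = next_state(state, ch)
    return state == FINAL
```

The left context `*` of the rule corresponds to state 0 looping on every symbol other than e, so any root is accepted as long as the form ends in e^ing#.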
Orthographic Rules for Bengali Verb
• oYA^L:o/*A,*
• no^e:Ya/*A,K
• oYA^e:Ya/*A,K
• AoYA^:eY/X,echh*
• yAoYA^:giY/*,echh*
• AoYA^:e/X,M*
• oYA^:NULL/*A,b*
• AoYA^:e/y,t*
• yAoYA^:ge/*,l*
• eoYA^:iY/*,echh*
• eoYA^:i;/*,iK
• eoYA^:ich/*,chh*
• eoYA^:i/*,P*
• oYA^:NULL/*e,Q*
• eoYA^u:i/*,*
• eoYA^:NULL/*,R*
• eoYA^:A/*,o*
• eoYA^a:Ao/*,ni
• eoYA^e:eYa/*,K
• oYA^:uY/*$,echh*
• oYA^:uch/*$,chh*
• oYA^:u/*$,V*
• oYA^i:u/*$,sa*
• YA^:;/*$o,o#
• YA^a:;o/*$o,ni*
• YA^e:NULL/*$o,naK
• YA^u:NULL/*$o,*
• A^e:a/*$oY,K
• ^y:;i/*,#
• ^a:;o/*,#
• AWA^aie:eWe/*,#
• yAoYA^aie:giYe/*,*
• AoYA^aie:eYe/X,#
• eoYA^aie:iYe/*,#
• Ano^aie:iYe/*,#
• oYA^aie:uYe/*$,#
• ^y:;i/*,#
• ^a:;o/*,#
• AWA^aie:eWe/*,#
• A^aie:e/*,#
• yAoYA^aie:giYe/*,*
• AoYA^aie:eYe/X,#
• eoYA^aie:iYe/*,#
• Ano^aie:iYe/*,#
• oYA^aie:uYe/*$,#
• AWA^aie:eWe/*,#
• A^:NULL/B,~*
• A^:a/B,$*
• no^:ch/*A,chh*
• oYA^:ch/*A,chh*
• Ano^:iY/*,echh*
• no^:NULL/*A,E*
• no^F:NULL/*A,G*
• oYA^F:NULL/*A,G*
• no^:NULL/*A,iK
• oYA^:NULL/*A,iK
• no^L:o/*A,*
Input Format
The input to the Morphological Generator starts with the root of the word, followed by the POS category and the attribute names with their values.
Example: karA VM Person 3 Tense 2 Emp 2
In Bengali, Person and Tense combine to give a single suffix, which is added first; the Emphasizer gives another suffix, which is added next. See the morphotactics for the Bengali verb.
Input Format Contd.
In Bengali, Person can have 6 values and Tense (which is actually TAM) can have 10 values. The suffixes in the paradigm table are arranged in the following way:
• First entry is Person 0 Tense 0
• Second entry is Person 0 Tense 1
• Third entry is Person 0 Tense 2
• …
• 10th entry is Person 0 Tense 9
• 11th entry is Person 1 Tense 0
So Person 3 Tense 2 will be entry number
(Person input) × (number of TAM values) + (TAM input) + 1 = 3 × 10 + 2 + 1 = 33
Get the 33rd entry from the paradigm table for (0,1) and use the orthographic rules to get the correct word.
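The entry-number arithmetic generalises to any attribute combination if the last attribute is taken to vary fastest, as in the ordering listed above. A sketch with a hypothetical function name:

```python
def entry_number(values, value_counts):
    """1-based position in a paradigm table; the last attribute varies fastest."""
    index = 0
    for value, count in zip(values, value_counts):
        if not 0 <= value < count:
            raise ValueError(f"value {value} out of range for an attribute with {count} values")
        index = index * count + value
    return index + 1

# Person 3, Tense (TAM) 2, with 6 Person values and 10 TAM values
position = entry_number([3, 2], [6, 10])
```

For two attributes this reduces to the formula on the slide, 3 × 10 + 2 + 1.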
Bengali Verb Paradigms and Morphotactics
<ParadigmTable <Attributes 1 2 > /* 1 indicates Person and 2 indicates TAM */
<suffixes
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
>>
<ParadigmTable <Attributes 3 > /* Case */
<suffixes NULL i o >>
Morphotactic rules
• (0,1)(2)(3)
• (3=2)(2)
Bengali Noun Paradigms and Morphotactics
<ParadigmTable <Attributes 0 1 2 > /* Number, Specificity, Ellipses: 2 × 2 × 2 = 8 entries */
<suffixes NULL eraTA TA NULL gulo guloraTA NULL NULL >>
<ParadigmTable <Attributes 3 4 > /* Form, Case: 2 × 5 = 10 entries */
<suffixes NULL ke NULL ete ete NULL NULL era NULL NULL >>
<ParadigmTable <Attributes 5 > /* Emphasizer: 3 entries */
<suffixes NULL i o >>
Morphotactic rule
• (0,1,2)(3,4)(5)
Example (Bengali Verb)
Example: the input is balA Verb Person 1 TAM 1 Case 0
• The first morphotactic rule is triggered.
• Person can have 6 values and TAM can have 10 values, so the suffix number extracted from the paradigm table (1,2) is
10 × (Person value) + (TAM value) + 1 = 10 × 1 + 1 + 1 = 12
i.e., chhisa is to be added first.
• From the paradigm table (3) the extracted suffix is NULL, i.e., NULL is to be added next.
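The lookup in this example can be reproduced with the Person/TAM suffix table of the Bengali verb paradigm. The sketch below copies that table into a Python list; the variable and function names are hypothetical, and treating NULL as an empty suffix is an assumption.

```python
# Person x TAM suffix table for the Bengali verb (6 x 10 entries, row = Person)
VERB_SUFFIXES = """
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
""".split()

CASE_SUFFIXES = ["NULL", "i", "o"]
TAM_VALUES = 10

def lookup(person, tam, case):
    """Return the two suffixes selected by the first morphotactic rule."""
    first = VERB_SUFFIXES[TAM_VALUES * person + tam]  # 0-based; entry number is this + 1
    second = CASE_SUFFIXES[case]
    return first, second

first, second = lookup(person=1, tam=1, case=0)
# Intermediate form handed to the orthographic rules: root ^ suffixes #
form = "balA^" + "".join(s for s in (first, second) if s != "NULL") + "#"
```

Here `first` is the 12th table entry and `form` is the string on which the orthographic rules are then tried.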
Example Contd.
Now balA^chhisa# is the input, for which a suitable orthographic rule is searched.
Suppose there is an orthographic rule
A^:a/B,$*
where B: *-Y and $: consonant.
Then the FSM for this rule will bring the input to the final state, i.e., the rule is triggered.
Now "A^" is replaced by "a" and the output is "balachhisa".
Exception List
Words whose orthographic changes do not follow the pattern of other words, or which change completely when inflected, are treated as exceptions. Covering these words with orthographic rules would require a large number of highly complex rules. We therefore handle them in a separate file, which lists each exception word along with all of its inflections.