A knowledge rich morph analyzer for Marathi derived forms

Ashwini Vaidya, IIIT Hyderabad

Need for Morphological analysis
• Morph analysis provides basic information about a word's category, gender, number, etc.
• Required for Machine Translation tasks
• Necessary for building part-of-speech taggers
• Accurate tools are especially required for morphologically rich languages

Inflectional and Derivational forms
• Morph analysis usually concentrates on inflectional forms to begin with.
• Inflection is more regular and productive. E.g. a plural affix attaches to almost all nouns, but a derivational affix like -ness attaches only to a few.
• The criteria for attachment are more difficult to determine for a derivational affix.

Computational analysis of derived forms
• Previous approaches have used strategies such as:
  • Creation of a suffix table (Hoeppner, 1982)
  • Identifying morphologically 'active' bases (Byrd, 1986)
  • Using an extensive semantic ontology (Woods, 2000)
• Statistical approaches have focused on automatic acquisition of morphology (e.g. Sharma et al. for Assamese)

Productivity of Derivational suffixes
• A survey of some noun-forming affixes in the CIIL Marathi corpus showed that some occur more frequently than others
• Analysis of such suffixes would capture some linguistic knowledge
• -pəɳa, -ɪkə, -t̪a, -iː attach more freely
• Suffixes like -ɪkərəɳə, -gɪri, -əɳə are less frequent

Marathi morph analysis
• Existing morph analyzer by Akshar Bharati
• 114 paradigms for nouns, verbs, pronouns, adjectives
• Derivational and inflectional processes operate together, hence both kinds of knowledge are needed
• The open source tool Lttoolbox allows for easy conversion/creation of new paradigms

Building a morphological dictionary
• The Lttoolbox tool requires the creation of a set of correspondences between Surface Forms and Lexical Forms
• Surface forms (SF): forms that have undergone some morphological process
• Lexical forms (LF): base forms of the words, entered in the dictionary
• Regularities in these correspondences form paradigms
• Morph analysis takes an SF as input and returns the LF as output
• Generation, i.e. the reverse direction (LF to SF), is also possible
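The SF/LF correspondence described above can be pictured as a two-way lookup table. The following is only a minimal Python sketch of that idea, not how Lttoolbox is implemented (it compiles dictionaries into finite-state transducers). The two pairs are derived from the kacarA example on the sample-paradigm slide; the function names `build_tables`, `analyse` and `generate` are hypothetical.

```python
# Toy SF -> LF correspondences; the pairs follow the kacarA example from
# the slides, the helper names are illustrative assumptions.
CORRESPONDENCES = [
    ("kacarA", "kacarA<nm><sg><parsarg:0>"),
    ("kacaryAlA", "kacarA<nm><sg><parsarg:lA>"),
]


def build_tables(pairs):
    """Index the same pairs in both directions."""
    analysis, generation = {}, {}
    for surface, lexical in pairs:
        analysis.setdefault(surface, []).append(lexical)
        generation.setdefault(lexical, []).append(surface)
    return analysis, generation


ANALYSIS, GENERATION = build_tables(CORRESPONDENCES)


def analyse(surface):
    """Analysis: surface form in, lexical form(s) out."""
    return ANALYSIS.get(surface, [])


def generate(lexical):
    """Generation: lexical form in, surface form(s) out."""
    return GENERATION.get(lexical, [])


print(analyse("kacaryAlA"))
print(generate("kacarA<nm><sg><parsarg:0>"))
```

Because both tables are built from the same list of pairs, analysis and generation stay consistent with each other by construction.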

Sample paradigm
<pardef n="rasw/A__n">
  <e> <l>A</l> <r>A<s n="nm"/><s n="sg"/><s n="parsarg:0"/></r> </e>
  <e> <l>yAlA</l> <r>A<s n="nm"/><s n="sg"/><s n="parsarg:lA"/></r> </e>
</pardef>
Dictionary entry:
<e lm="kacarA">kacar<par n="rasw/A__n"/></e>
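To make the role of the paradigm concrete, here is a small Python sketch of how a stem from a dictionary entry combines with each entry of its paradigm. It is an illustration under assumptions, not Lttoolbox code: the paradigm is transcribed by hand as (surface suffix, lexical suffix, tags) triples, and `expand` is a hypothetical helper.

```python
# The "rasw/A__n" paradigm from the slide, transcribed as
# (surface suffix, lexical suffix, tags) triples.
PARADIGMS = {
    "rasw/A__n": [
        ("A", "A", ["nm", "sg", "parsarg:0"]),
        ("yAlA", "A", ["nm", "sg", "parsarg:lA"]),
    ],
}

# Dictionary entries as (stem, paradigm name); kacar is the slide's example.
DICTIONARY = [("kacar", "rasw/A__n")]


def expand(stem, paradigm_name):
    """Yield (surface form, lexical form) pairs for one dictionary entry."""
    for left, right, tags in PARADIGMS[paradigm_name]:
        tag_string = "".join(f"<{tag}>" for tag in tags)
        yield stem + left, stem + right + tag_string


for stem, paradigm_name in DICTIONARY:
    for surface, lexical in expand(stem, paradigm_name):
        print(f"{surface}/{lexical}")
```

Any other noun stem that inflects like kacarA would only need one more `(stem, "rasw/A__n")` line in `DICTIONARY`, which is the economy that paradigms provide.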

Adding knowledge about derivational suffixes
• The sample paradigm given below is used to call another paradigm containing information about the derivational suffix
[lahAna = ləhanə, small, adj]
<pardef n="lahAna/__a">
  <e> <l></l> <r><s n="adj"/></r> </e>
  <e> <l></l> <r></r> <par n="D__paNA"/> </e>
</pardef>

Nested paradigm
• The paNA paradigm is 'called' from the previous one:
<pardef n="D__paNA">
  <e> <l>paNA</l> <r>paNA<s n="nm"/><s n="number:eka"/><s n="rcat:n"/><s n="suff:paNA"/><s n="parsarg:0"/></r> </e>
  <e> <l>paNAne</l> <r>paNA<s n="nm"/><s n="number:eka"/><s n="rcat:n"/><s n="suff:paNA"/><s n="parsarg:ne"/></r> </e>
</pardef>

Sample Output
• lahAna/lahAna<adj>
• lahAnapaNA/lahAnapaNA<n><m><rcat:adj><suff:paNA><parsarg:0>
• lahAnapaNAne/lahAnapaNA<n><m><rcat:adj><suff:paNA><parsarg:ne>
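The way one paradigm 'calls' another can be sketched as a recursive expansion. The Python below is a rough illustration under assumptions, not the Lttoolbox algorithm: each paradigm entry is written as a hypothetical 4-tuple (surface suffix, lexical suffix, tags, nested paradigm or None), and the tag names are copied from the nested-paradigm slide, so the printed strings differ from the abbreviated tags on the Sample Output slide.

```python
# Entries are (surface suffix, lexical suffix, tags, nested paradigm name).
# Tag names are copied from the nested-paradigm slide; the tuple layout
# and the helper name `expand` are assumptions made for illustration.
PARADIGMS = {
    "lahAna/__a": [
        ("", "", ["adj"], None),   # the bare adjective
        ("", "", [], "D__paNA"),   # hand over to the -paNA paradigm
    ],
    "D__paNA": [
        ("paNA", "paNA",
         ["nm", "number:eka", "rcat:n", "suff:paNA", "parsarg:0"], None),
        ("paNAne", "paNA",
         ["nm", "number:eka", "rcat:n", "suff:paNA", "parsarg:ne"], None),
    ],
}


def expand(surface, lexical, paradigm_name, tags=()):
    """Recursively expand a paradigm, following calls to nested paradigms."""
    for left, right, entry_tags, nested in PARADIGMS[paradigm_name]:
        new_surface = surface + left
        new_lexical = lexical + right
        new_tags = list(tags) + entry_tags
        if nested is None:
            tag_string = "".join(f"<{tag}>" for tag in new_tags)
            yield new_surface, new_lexical + tag_string
        else:
            yield from expand(new_surface, new_lexical, nested, new_tags)


for surface_form, lexical_form in expand("lahAna", "lahAna", "lahAna/__a"):
    print(f"{surface_form}/{lexical_form}")
```

Letting a base call a second derivational paradigm would amount to adding one more entry with a different nested name to `"lahAna/__a"`.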

More features
• It is possible to call more than one paradigm at a time.
• For example, lahAna can take -paNA or -paNa:
<pardef n="lahAna/__a">
  <e> <l></l> <r><s n="adj"/></r> </e>
  <e> <l></l> <r></r> <par n="D__paNA"/> </e>
  <e> <l></l> <r></r> <par n="D__paNa"/> </e>
</pardef>

Present Work
• The morphological dictionary covers 10 derivational suffixes in Marathi
• 38 derivational paradigms
• Total number of forms generated: 450,000
• Preliminary evaluation over a set of 200 derived forms taken from a corpus shows 32% coverage

Problems
• Coverage can be improved if the following issues are handled:
• Prefixes: need further processing
• Cases of 'Vriddhi' cannot be handled well using paradigms. Example: pəʋit̪rə + yə = paʋit̪ryə (pure + suffix = purity)
• Emphatic particles like -hI and -ca
• Some noun-forming suffixes like -Ne or -ArI are highly regular, hence better handled using an inflectional paradigm
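The Vriddhi case is awkward for paradigms because the change happens at the start of the stem, while a paradigm only describes what happens at the end. The Python sketch below illustrates this using only the single example on the slide. The vowel inventory, the ə to a mapping, the dropping of the stem-final ə, and the function name `derive_with_vriddhi` are all assumptions inferred from that one example, not a full account of Vriddhi.

```python
# Vowel strengthening attested on the slide: only ə -> a. A real rule set
# would need more alternations; this table is deliberately incomplete.
VRIDDHI = {"ə": "a"}
VOWELS = set("əaiueo")  # assumed, simplified vowel inventory


def derive_with_vriddhi(stem, suffix="yə"):
    """Strengthen the first vowel, drop a stem-final ə, attach the suffix."""
    for index, char in enumerate(stem):
        if char in VOWELS:
            strengthened = VRIDDHI.get(char, char)
            stem = stem[:index] + strengthened + stem[index + 1:]
            break
    if stem.endswith("ə"):
        stem = stem[:-1]
    return stem + suffix


base = "pəʋit̪rə"
derived = derive_with_vriddhi(base)
print(derived)

# The derived form no longer begins with the base stem, so a paradigm that
# only attaches a suffix to a fixed stem cannot relate the two forms.
print(derived.startswith(base[:-1]))
```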

Future work
• Aim at increasing coverage by adding more suffixes
• Test the possibility of using 'Metadix' for handling cases of vowel lengthening

Download and documentation for Lttoolbox:
• <http://wiki.apertium.org/wiki/Main_Page>
• SourceForge
