A knowledge-rich morph analyzer for Marathi derived forms
Ashwini Vaidya, IIIT Hyderabad

Need for morphological analysis
• Morph analysis provides basic information about a word's category, gender, number, etc.
• Required for machine translation tasks
• Necessary for building part-of-speech taggers
• Accurate tools are especially needed for morphologically rich languages

Inflectional and derivational forms
• Morph analysis usually starts with inflectional forms.
• Inflection is more regular and productive. E.g. a plural affix attaches to almost all nouns, but a derivational affix like -ness attaches only to a few words.
• The criteria for attachment are harder to determine for a derivational affix.

Computational analysis of derived forms
• Previous approaches have used strategies such as:
  • Creation of a suffix table (Hoeppner, 1982)
  • Identifying morphologically 'active' bases (Byrd, 1986)
  • Using an extensive semantic ontology (Woods, 2000)
• Statistical approaches have focused on automatic acquisition of morphology (e.g. Sharma et al. for Assamese)

Productivity of derivational suffixes
• A survey of some noun-forming affixes in the CIIL Marathi corpus showed that some occur more frequently than others
• Analysis of such suffixes would capture some linguistic knowledge
• -pəɳa, -ɪkə, -t̪a, -iː attach more freely
• Suffixes like -ɪkərəɳə, -gɪri, -əɳə are less frequent

Marathi morph analysis
• Existing morph analyzer by Akshar Bharati
• 114 paradigms for nouns, verbs, pronouns, adjectives
• Derivational and inflectional processes operate together, hence both kinds of knowledge are needed
• The open-source tool Lttoolbox allows easy conversion and creation of new paradigms

Building a morphological dictionary
• Lttoolbox requires a set of correspondences between surface forms and lexical forms
• Surface forms (SF): forms that have undergone some morphological process
• Lexical forms (LF): base forms of the words, as entered in the dictionary
• Regularities in these correspondences form paradigms
• Morph analysis takes an SF as input and returns the LF as output
• Generation, i.e. the reverse direction (LF to SF), is also possible

Sample paradigm
    <pardef n="rasw/A__n">
      <e><p><l>A</l><r>A<s n="nm"/><s n="sg"/><s n="parsarg:0"/></r></p></e>
      <e><p><l>yAlA</l><r>A<s n="nm"/><s n="sg"/><s n="parsarg:lA"/></r></p></e>
    </pardef>
Dictionary entry:
    <e lm="kacarA"><i>kacar</i><par n="rasw/A__n"/></e>

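A rough way to read the sample paradigm is as a table of suffix correspondences attached to the stem `kacar`. The Python sketch below mimics that reading for both analysis (SF to LF) and generation (LF to SF). The names `PARADIGM`, `STEMS`, `analyse` and `generate` are hypothetical and exist only for this illustration; Lttoolbox itself compiles the dictionary into a finite-state transducer, which this toy lookup does not attempt to reproduce.

```python
# Toy model of the "rasw/A__n" paradigm from the slide.
# Each entry: (surface suffix, lexical suffix, tags). Hypothetical structure.
PARADIGM = [
    ("A", "A", ["nm", "sg", "parsarg:0"]),
    ("yAlA", "A", ["nm", "sg", "parsarg:lA"]),
]

# Stems that use this paradigm (from the dictionary entry for kacarA).
STEMS = {"kacar"}


def tag_string(tags):
    """Render tags in the <tag> style used in the sample output."""
    return "".join(f"<{t}>" for t in tags)


def analyse(surface):
    """Return every lexical form whose surface form equals `surface`."""
    results = []
    for sf_suffix, lf_suffix, tags in PARADIGM:
        if surface.endswith(sf_suffix):
            stem = surface[: len(surface) - len(sf_suffix)]
            if stem in STEMS:
                results.append(stem + lf_suffix + tag_string(tags))
    return results


def generate(lexical):
    """Return every surface form whose lexical form equals `lexical`."""
    results = []
    for stem in sorted(STEMS):
        for sf_suffix, lf_suffix, tags in PARADIGM:
            if stem + lf_suffix + tag_string(tags) == lexical:
                results.append(stem + sf_suffix)
    return results


print(analyse("kacaryAlA"))
print(generate("kacarA<nm><sg><parsarg:lA>"))
```

The same table serves both directions, which is the property the previous slide refers to when it says generation is also possible.
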
Adding knowledge about derivational suffixes
• The sample paradigm given below is used to call another paradigm containing information about the derivational suffix
• [lahAna = ləhanə, small, adj]
    <pardef n="lahAna/__a">
      <e><p><l></l><r><s n="adj"/></r></p></e>
      <e><p><l></l><r></r></p><par n="D__paNA"/></e>
    </pardef>

Nested paradigm
• The paNA paradigm is 'called' from the previous one:
    <pardef n="D__paNA">
      <e><p><l>paNA</l><r>paNA<s n="nm"/><s n="number:eka"/><s n="rcat:n"/><s n="suff:paNA"/><s n="parsarg:0"/></r></p></e>
      <e><p><l>paNAne</l><r>paNA<s n="nm"/><s n="number:eka"/><s n="rcat:n"/><s n="suff:paNA"/><s n="parsarg:ne"/></r></p></e>
    </pardef>

Sample output
• lahAna/lahAna<adj>
• lahAnapaNA/lahAnapaNA<n><m><rcat:adj><suff:paNA><parsarg:0>
• lahAnapaNAne/lahAnapaNA<n><m><rcat:adj><suff:paNA><parsarg:ne>

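The way a base paradigm 'calls' a derivational paradigm can be sketched as a recursive expansion: an entry either ends the word or hands over to another paradigm, and the surface strings, lexical strings and tags are concatenated along the way. The following Python is a minimal illustration under that assumption. `PARADIGMS`, `expand` and `forms` are hypothetical names, and the tags are copied from the nested-paradigm slide, so the printed tag sets follow that slide rather than the sample output.

```python
# Hypothetical in-memory version of the two paradigms on the slides.
# Each entry: (surface, lexical, tags, name of called paradigm or None).
PARADIGMS = {
    "lahAna/__a": [
        ("", "", ["adj"], None),    # bare adjective
        ("", "", [], "D__paNA"),    # hand over to the -paNA paradigm
    ],
    "D__paNA": [
        ("paNA", "paNA",
         ["nm", "number:eka", "rcat:n", "suff:paNA", "parsarg:0"], None),
        ("paNAne", "paNA",
         ["nm", "number:eka", "rcat:n", "suff:paNA", "parsarg:ne"], None),
    ],
}


def expand(name):
    """Yield (surface, lexical, tags) for every path through paradigm `name`."""
    for surface, lexical, tags, called in PARADIGMS[name]:
        if called is None:
            yield surface, lexical, tags
        else:
            # Concatenate this entry with every expansion of the called paradigm.
            for sub_surface, sub_lexical, sub_tags in expand(called):
                yield surface + sub_surface, lexical + sub_lexical, tags + sub_tags


def forms(stem, paradigm):
    """List 'surface/lexical<tags>' strings for a stem and its paradigm."""
    return [
        f"{stem}{s}/{stem}{l}" + "".join(f"<{t}>" for t in tags)
        for s, l, tags in expand(paradigm)
    ]


for line in forms("lahAna", "lahAna/__a"):
    print(line)
```

Calling a second derivational paradigm from the same base amounts to adding one more entry with a different called name to the base paradigm's list.
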
More features
• It is possible to call more than one paradigm at a time.
• Example: lahAna can take -paNA or -paNa
    <pardef n="lahAna/__a">
      <e><p><l></l><r><s n="adj"/></r></p></e>
      <e><p><l></l><r></r></p><par n="D__paNA"/></e>
      <e><p><l></l><r></r></p><par n="D__paNa"/></e>
    </pardef>

Present work
• The morphological dictionary covers 10 derivational suffixes in Marathi
• 38 derivational paradigms
• Total number of forms generated: 450,000
• Preliminary evaluation over a set of 200 derived forms taken from a corpus shows 32% coverage

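The slides do not define coverage. A common reading, assumed here, is the fraction of test forms for which the analyzer returns at least one analysis. Under that assumption, 32% of 200 forms corresponds to 64 analysed forms. The sketch below shows the computation; `coverage` and the toy analyzer are hypothetical and are not the evaluation script used for the reported figure.

```python
def coverage(test_forms, analyse):
    """Fraction of forms that receive at least one analysis.

    `analyse` is any callable returning a (possibly empty) list of analyses.
    """
    if not test_forms:
        raise ValueError("test set is empty")
    analysed = sum(1 for form in test_forms if analyse(form))
    return analysed / len(test_forms)


# Toy stand-in for an analyzer: it only knows one derived form.
known = {"lahAnapaNA"}
toy_analyse = lambda form: ["analysis"] if form in known else []

print(coverage(["lahAnapaNA", "w2", "w3", "w4"], toy_analyse))

# Under the assumed definition, 32% of 200 forms is 64 forms.
print(round(0.32 * 200))
```
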
Problems
• Coverage can be improved if the following issues are handled:
• Prefixes: need further processing
• Cases of 'Vriddhi' cannot be handled well using paradigms. Example: pəʋit̪rə + yə = paʋit̪ryə (pure + suffix = purity)
• Emphatic particles like -hI and -ca
• Some noun-forming suffixes like -Ne or -ArI are highly regular, hence better handled using an inflectional paradigm

Future work
• Increase coverage by adding more suffixes
• Test the possibility of using 'Metadix' for handling cases of vowel lengthening

Download and documentation for Lttoolbox:
• <http://wiki.apertium.org/wiki/Main_Page>
• SourceForge