
A knowledge rich morph analyzer for Marathi derived forms

Ashwini Vaidya, IIIT Hyderabad


  1. A knowledge rich morph analyzer for Marathi derived forms
     Ashwini Vaidya, IIIT Hyderabad

  2. Need for Morphological analysis
     • Basic information about a word's category, gender, number etc. is provided by morph analysis
     • Required for Machine Translation tasks
     • Necessary for building part-of-speech taggers
     • Accurate tools are especially required for languages that are morphologically rich

  3. Inflectional and Derivational forms
     • To begin with, morph analysis concentrates on inflectional forms
     • Inflection is more regular and productive: e.g. a plural affix attaches to almost all nouns, but a derivational affix like -ness attaches only to a few
     • The criteria of attachment are more difficult to determine for a derivational affix

  4. Computational analysis of derived forms
     • Previous approaches have used strategies such as:
       • Creation of a suffix table (Hoeppner, 1982)
       • Identifying morphologically 'active' bases (Byrd, 1986)
       • Using an extensive semantic ontology (Woods, 2000)
     • Statistical approaches have focused on automatic acquisition of morphology (e.g. Sharma et al. for Assamese)

  5. Productivity of Derivational suffixes
     • A survey of some noun-forming affixes in the CIIL Marathi corpus showed that some occur more frequently than others
     • Analysis of such suffixes would capture some linguistic knowledge
     • -pəɳa, -ɪkə, -t̪a, -iː attach more freely
     • Suffixes like -ɪkərəɳə, -gɪri, -əɳə are less frequent

  6. Marathi morph analysis
     • Existing morph analyzer by Akshar Bharati
     • 114 paradigms for nouns, verbs, pronouns, adjectives
     • Derivational and inflectional processes operate together, hence both kinds of knowledge are needed
     • The open source tool Lttoolbox allows for easy conversion/creation of new paradigms

  7. Building a morphological dictionary
     • The Lttoolbox tool requires the creation of a set of correspondences between Surface Forms and Lexical Forms
     • Surface forms (SF): forms that have undergone some morphological process
     • Lexical forms (LF): base forms of the words, entered in the dictionary
     • Regularities in these correspondences form paradigms
     • Morph analysis takes an SF as input and returns the LF as output
     • Generation, i.e. the reverse direction (LF to SF), is also possible
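The SF/LF correspondence can be pictured as a lookup table read in two directions. The sketch below is plain Python, not Lttoolbox itself. Its two entries are derived from the kacarA paradigm on the next slide, and the names `CORRESPONDENCES`, `analyse` and `generate` are hypothetical, chosen only for illustration.

```python
# Toy table of surface form (SF) -> lexical forms (LF).
# Entries are illustrative, derived from the kacarA sample paradigm.
CORRESPONDENCES = {
    "kacarA": ["kacarA<nm><sg><parsarg:0>"],
    "kacaryAlA": ["kacarA<nm><sg><parsarg:lA>"],
}


def analyse(surface):
    """Analysis: SF in, list of LFs out (empty list if the form is unknown)."""
    return CORRESPONDENCES.get(surface, [])


def generate(lexical):
    """Generation: LF in, list of SFs out (the same table read in reverse)."""
    return [sf for sf, lfs in CORRESPONDENCES.items() if lexical in lfs]


print(analyse("kacaryAlA"))
print(generate("kacarA<nm><sg><parsarg:0>"))
```

A real dictionary would not list every pair by hand; paradigms exist precisely to factor out the shared endings.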

  8. Sample paradigm
     <pardef n="rasw/A__n">
       <e><p><l>A</l><r>A<s n="nm"/><s n="sg"/><s n="parsarg:0"/></r></p></e>
       <e><p><l>yAlA</l><r>A<s n="nm"/><s n="sg"/><s n="parsarg:lA"/></r></p></e>
     </pardef>
     Dictionary entry:
     <e lm="kacarA"><i>kacar</i><par n="rasw/A__n"/></e>
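To see what this paradigm and dictionary entry stand for, the following Python sketch expands them by hand. It is a simplified model, not Lttoolbox code: each paradigm entry is reduced to a (surface ending, lexical ending, tags) triple, and the names `PARADIGMS` and `expand` are hypothetical.

```python
# Each entry pairs a surface ending (<l>) with a lexical ending and tags (<r>).
PARADIGMS = {
    "rasw/A__n": [
        ("A", "A", ["nm", "sg", "parsarg:0"]),
        ("yAlA", "A", ["nm", "sg", "parsarg:lA"]),
    ],
}


def expand(stem, paradigm_name):
    """Return (surface form, lexical form) pairs for one dictionary entry."""
    pairs = []
    for left, right, tags in PARADIGMS[paradigm_name]:
        tag_string = "".join(f"<{tag}>" for tag in tags)
        pairs.append((stem + left, stem + right + tag_string))
    return pairs


# The dictionary entry supplies the stem "kacar" and the paradigm name.
for surface, lexical in expand("kacar", "rasw/A__n"):
    print(f"{surface}/{lexical}")
# kacarA/kacarA<nm><sg><parsarg:0>
# kacaryAlA/kacarA<nm><sg><parsarg:lA>
```

Any other noun that inflects the same way needs only one more dictionary entry with its own stem.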

  9. Adding knowledge about derivational suffixes
     • The sample paradigm given below is used to call another paradigm containing information about the derivational suffix
     (lahAna = ləhanə, 'small', adj)
     <pardef n="lahAna/__a">
       <e><p><l></l><r><s n="adj"/></r></p></e>
       <e><p><l></l><r></r></p><par n="D__paNA"/></e>
     </pardef>

  10. Nested paradigm
     • The paNA paradigm is 'called' from the previous one:
     <pardef n="D__paNA">
       <e><p><l>paNA</l><r>paNA<s n="nm"/><s n="number:eka"/><s n="rcat:n"/><s n="suff:paNA"/><s n="parsarg:0"/></r></p></e>
       <e><p><l>paNAne</l><r>paNA<s n="nm"/><s n="number:eka"/><s n="rcat:n"/><s n="suff:paNA"/><s n="parsarg:ne"/></r></p></e>
     </pardef>
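The way one paradigm calls another can be modelled as a recursive expansion. The Python sketch below is a simplified illustration, not Lttoolbox code. It assumes each entry is a pair optionally followed by one paradigm call, and it uses the name `D__paNA` as it appears in the call from the lahAna paradigm. The dict layout and the functions `expand` and `forms` are hypothetical, and the tags are copied from the two paradigm definitions.

```python
# Each entry: surface part "l", lexical part "r", tags, and an optional
# nested paradigm "par" that is expanded after the entry's own pair.
PANA_TAGS = ["nm", "number:eka", "rcat:n", "suff:paNA"]

PARADIGMS = {
    "lahAna/__a": [
        {"l": "", "r": "", "tags": ["adj"], "par": None},
        {"l": "", "r": "", "tags": [], "par": "D__paNA"},
    ],
    "D__paNA": [
        {"l": "paNA", "r": "paNA", "tags": PANA_TAGS + ["parsarg:0"], "par": None},
        {"l": "paNAne", "r": "paNA", "tags": PANA_TAGS + ["parsarg:ne"], "par": None},
    ],
}


def expand(name):
    """Return (surface ending, lexical ending) pairs, following nested calls."""
    results = []
    for entry in PARADIGMS[name]:
        tag_string = "".join(f"<{tag}>" for tag in entry["tags"])
        left, right = entry["l"], entry["r"] + tag_string
        if entry["par"] is None:
            results.append((left, right))
        else:
            # Concatenate this entry's pair with every pair of the called paradigm.
            for sub_left, sub_right in expand(entry["par"]):
                results.append((left + sub_left, right + sub_right))
    return results


def forms(stem, name):
    """Attach a stem to every expanded ending of the named paradigm."""
    return [(stem + left, stem + right) for left, right in expand(name)]


for surface, lexical in forms("lahAna", "lahAna/__a"):
    print(f"{surface}/{lexical}")
```

Because the derivational paradigm is stored once and only referenced by name, every adjective paradigm that takes the same suffix can reuse it.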

  11. Sample Output
     • lahAna/lahAna<adj>
     • lahAnapaNA/lahAnapaNA<n><m><rcat:adj><suff:paNA><parsarg:0>
     • lahAnapaNAne/lahAnapaNA<n><m><rcat:adj><suff:paNA><parsarg:ne>

  12. More features
     • It is possible to call more than one paradigm at a time
     • For example, lahAna can take -paNA or -paNa
     <pardef n="lahAna/__a">
       <e><p><l></l><r><s n="adj"/></r></p></e>
       <e><p><l></l><r></r></p><par n="D__paNA"/></e>
       <e><p><l></l><r></r></p><par n="D__paNa"/></e>
     </pardef>

  13. Present Work
     • The morphological dictionary consists of 10 derivational suffixes in Marathi
     • 38 derivational paradigms
     • Total number of forms generated: 450,000
     • Preliminary evaluation over a set of 200 derived forms taken from a corpus shows 32% coverage
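The slide does not define coverage. One plausible reading, assumed here, is the share of test forms that receive at least one analysis, in which case 32% of 200 forms is about 64 analysed forms. The Python sketch below shows that calculation on made-up data. `coverage`, `toy_analyse`, `KNOWN` and the sample forms are all hypothetical and are not the evaluation set used in the work.

```python
def coverage(test_forms, analyse):
    """Fraction of test forms for which analyse() returns at least one analysis."""
    if not test_forms:
        return 0.0
    covered = sum(1 for form in test_forms if analyse(form))
    return covered / len(test_forms)


# Hypothetical stand-in for the real analyzer: a tiny set of known forms.
KNOWN = {"lahAnapaNA", "lahAnapaNAne"}


def toy_analyse(form):
    """Return a one-element list for known forms, an empty list otherwise."""
    return [form] if form in KNOWN else []


sample = ["lahAnapaNA", "lahAnapaNAne", "unknownForm1", "unknownForm2"]
print(coverage(sample, toy_analyse))  # 0.5
```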

  14. Problems
     • Coverage can be improved if the following issues can be handled:
       • Prefixes: these need further processing
       • Cases of 'Vriddhi' cannot be handled well using paradigms. Example: pəʋit̪rə + yə = paʋit̪ryə (pure + suffix = purity)
       • Emphatic particles like -hI and -ca
       • Some noun-forming suffixes like -Ne or -ArI are highly regular, hence better handled using an inflectional paradigm

  15. Future work
     • Aim at increasing coverage by addition of more suffixes
     • Test the possibility of using 'Metadix' for handling cases of vowel lengthening

  16. Download and documentation for Lttoolbox
     • <http://wiki.apertium.org/wiki/Main_Page>
     • SourceForge
