
Intuitive Coding of the Arabic Lexicon

Intuitive Coding of the Arabic Lexicon. Ali Farghaly & Jean Senellart SYSTRAN Software Corporation San Diego, CA & Soisy, France. Purpose. To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System


Presentation Transcript


  1. Intuitive Coding of the Arabic Lexicon • Ali Farghaly & Jean Senellart • SYSTRAN Software Corporation • San Diego, CA & Soisy, France

  2. Purpose • To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System • To describe the methodology and implementation adopted for dictionary building and morphological analysis

  3. Overview • SYSTRAN’s Arabic-English MT System • SYSTRAN’s Intuitive Coding Technology • Intuitive Coding of the Arabic Lexicon • Stem-based • Statistical Arabic stem Generation • Internal morphology • External morphology

  4. SYSTRAN’s Arabic-English MT System • An end-to-end MT system • Development started July 2002 • Using SYSTRAN’s NG technology • Declarative modules • State-of-the-art Arabic linguistic knowledge • Transfer approach • Hybrid approach combining statistical techniques and linguistic knowledge

  5. SYSTRAN’s Intuitive Coding Technology • Customizing MT systems to improve translation quality • Building user-specific dictionaries: by the developers, by the user, or in collaboration • SYSTRAN’s decision: let the user do the customization

  6. Intuitive Coding • (Senellart et al., 2003) • Dictionary representation should be simple • Automatic processing of user information • Interactive processing • Multilevel coding algorithm • Complete integration • Easy-to-use graphical interface

  7. Stem-Based Arabic Lexicon • Following the spirit of Senellart et al. (2003), we opted for intuitive coding of the Arabic lexicon • What are the building blocks of the Arabic dictionary? • A - roots • B - stems

  8. Why Stems? • Stems are more intuitive than roots • Stems eliminate the need for morphological patterns “الميزان الصرفي” • Stems eliminate overgeneralization of Arabic stems • Subcategorization frames and syntactic and semantic information are stem-specific, not root-specific

  9. Sample Entry • 1016 إِنْتَصَرَ verb plain "[perfect=إنْتَصَرَ],[imperfect=ينْتَصِرُ],[passper=إنْتُصِر],[imperative=إنْتَصِر],[passimp=ينْتَصَر]" [+AINT+GPP+HUSUBJ]
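
The sample entry above follows a regular layout: a numeric identifier, the headword, a part of speech, a subtype, a quoted list of `[name=form]` pairs, and a bracketed list of `+`-prefixed feature codes. As a minimal sketch of how such a line could be read, assuming that layout holds for every entry, the Python below parses it into a record. The field names (`entry_id`, `lemma`, `subtype`) and the `parse_entry` helper are hypothetical and are not SYSTRAN's actual format definition or tooling.

```python
import re
from dataclasses import dataclass

# Assumed layout: id, headword, POS, subtype, "forms", [features].
ENTRY_RE = re.compile(
    r'^(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+"([^"]*)"\s+\[([^\]]*)\]$'
)
# One [name=form] pair inside the quoted forms field.
FORM_RE = re.compile(r'\[(\w+)=([^\]]*)\]')


@dataclass
class Entry:
    entry_id: int
    lemma: str
    pos: str
    subtype: str      # hypothetical name for the "plain" field
    forms: dict       # e.g. {"perfect": ..., "imperfect": ...}
    features: list    # e.g. ["AINT", "GPP", "HUSUBJ"]


def parse_entry(line: str) -> Entry:
    """Parse one dictionary line; raise ValueError if the layout differs."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError("line does not follow the assumed entry layout")
    entry_id, lemma, pos, subtype, forms, features = match.groups()
    return Entry(
        entry_id=int(entry_id),
        lemma=lemma,
        pos=pos,
        subtype=subtype,
        forms=dict(FORM_RE.findall(forms)),
        features=[code for code in features.split("+") if code],
    )


# The entry from slide 9, copied verbatim.
SAMPLE = (
    '1016 إِنْتَصَرَ verb plain '
    '"[perfect=إنْتَصَرَ],[imperfect=ينْتَصِرُ],[passper=إنْتُصِر],'
    '[imperative=إنْتَصِر],[passimp=ينْتَصَر]" [+AINT+GPP+HUSUBJ]'
)
entry = parse_entry(SAMPLE)
print(entry.pos, entry.features, list(entry.forms))
```

Keeping the principal parts in a dictionary keyed by form name means that a later module can look up, say, the imperfect stem without knowing the order in which the lexicographer typed the forms.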

  10. Statistical Arabic Stem Generator • Reduces the amount of typing • Speeds up entry creation • 60% increase in lexicographer productivity • Uses the most productive morphological rules

  11. Generator Output • [perfect=قال],[imperfect=يَقال],[imperative=إقال],[passperf=قال],[passimperf=يقال] • [perfect=كَتَبَ],[imperfect=يَكْتُب],[imperative=أُكْتُب],[passperf=كُتِبَ],[passimperf=يُكْتَب]

  12. Arabic Morphology • SYSTRAN has two different modules: • 1. Internal Morphology • 2. External Morphology • Two separate modules in a feeding order

  13. Internal Morphology Module • Generates all different inflected forms of a given stem and adds morphological information to be used in syntactic processing

  14. The Input to Internal Morphology Module • Input: two files • 1. Stem files • 2. Morphological rules file • Output: inflected dictionary file

  15. Sample of output • كتبن verb plain كتب+past+fem+3P+plural
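
Slides 13 to 15 describe a generator that takes stems and morphological rules and writes an inflected dictionary. The sketch below illustrates that idea for a handful of perfect-tense suffixes of a sound verb, written without short vowels. The rule table, the `inflect` function, and the `masc` and `singular` labels are assumptions made for illustration; only the output line for كتبن is taken from the slide, and the real rule file format is not shown in the presentation.

```python
# Hypothetical rule table: (suffix, morphological features).
# Unvocalized perfect-tense endings for a sound verb, third person only.
PERFECT_RULES = [
    ("", "past+masc+3P+singular"),
    ("ت", "past+fem+3P+singular"),
    ("وا", "past+masc+3P+plural"),
    ("ن", "past+fem+3P+plural"),
]


def inflect(stem, pos, subtype, rules):
    """Return inflected-dictionary lines in the layout shown on slide 15:
    <inflected form> <pos> <subtype> <stem>+<features>
    """
    lines = []
    for suffix, features in rules:
        form = stem + suffix
        lines.append(f"{form} {pos} {subtype} {stem}+{features}")
    return lines


STEM = "كتب"
lines = inflect(STEM, "verb", "plain", PERFECT_RULES)
for line in lines:
    print(line)
```

Because each generated line carries its own feature string, the syntactic component can read person, gender, and number directly from the dictionary instead of re-analyzing the word form.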

  16. Syntagmatic and Paradigmatic Morphology (Halliday 1972) • [Diagram contrasting Internal and External morphology; Arabic forms shown in the figure: همشاهد، و، ني، يشاهد، ف، ها، سيشاهد، ه، يشاهدون، ل، هن، نشاهد]

  17. External Morphology Module • Decomposes a token into different part-of-speech units • Follows the morphosyntactic rules of the language • It is the syntax of morphemes • It has a morphophonemic component

  18. Sample of External Morphology Rules • WAFA:= <وَ.CONJ|فَ.CONJ> • KABILI:= <كَ.PREP|بِ.PREP|لِ.PREP> • LI:= <لِ.PREP> • {WAFA}?_{AL}_<NOUN:-PROPERNOUN|ADJ|DET:QUANTIFIER|NUMERIC:CARDINAL> • {WAFA}?_{NOUNADJ}_<PRON:PERSPOSS> • {WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS>
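
The three patterns on slide 18 read as: an optional conjunction, then either the article plus a nominal, or an optional preposition plus a nominal with a possessive pronoun. A rough Python rendering of that decomposition is sketched below. It assumes unvocalized input, a toy lexicon, a partial list of possessive suffixes, and a `DET` tag for the article, all of which are hypothetical; it also ignores the morphophonemic component mentioned on slide 17 (for example the merger of لِ with the article), so it is an illustration of the rule shapes and not the SYSTRAN rule engine.

```python
WAFA = ("و", "ف")                       # conjunctions
KABILI = ("ك", "ب", "ل")                # prepositions
AL = "ال"                               # definite article
POSS = ("ه", "ها", "هم", "هن", "نا")    # partial, hypothetical suffix list


def segment(token, lexicon):
    """Return every segmentation of `token` licensed by the three patterns.

    Each segmentation is a list of (surface, tag) pairs. A candidate is
    kept only if its stem is found in `lexicon`.
    """
    results = []
    # {WAFA}? : try the token with and without a leading conjunction.
    conj_options = [("", token)]
    conj_options += [(c, token[len(c):]) for c in WAFA if token.startswith(c)]
    for conj, rest in conj_options:
        head = [(conj, "CONJ")] if conj else []
        # Pattern 1: {WAFA}? {AL} nominal
        if rest.startswith(AL) and rest[len(AL):] in lexicon:
            results.append(head + [(AL, "DET"), (rest[len(AL):], "NOUN")])
        # Patterns 2 and 3: {WAFA}? {KABILI}? nominal possessive-pronoun
        prep_options = [("", rest)]
        prep_options += [(p, rest[len(p):]) for p in KABILI if rest.startswith(p)]
        for prep, body in prep_options:
            middle = [(prep, "PREP")] if prep else []
            for pron in POSS:
                stem = body[: -len(pron)]
                if body.endswith(pron) and stem in lexicon:
                    results.append(
                        head + middle + [(stem, "NOUN"), (pron, "PRON")]
                    )
    return results


LEXICON = {"كتاب"}                      # toy lexicon: "book"
with_article = segment("و" + "ال" + "كتاب", LEXICON)
with_clitics = segment("و" + "ب" + "كتاب" + "هم", LEXICON)
print(with_article)
print(with_clitics)
```

Checking each candidate stem against the lexicon is what keeps the decomposition from stripping a letter that merely looks like a clitic; a word that happens to begin with و but has no analyzable remainder yields no segmentation here.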

  19. Order of Application • External morphology has to apply before internal morphology and before the lookup in the monolingual inflected dictionary • Thus the output of the external morphology module feeds the internal morphology module

  20. Conclusion • SYSTRAN’s monolingual dictionary has about 30,000 entries • Coverage of newspaper discourse is over 90% • The approach outlined in this paper has greatly accelerated development • Analysis, homograph resolution, and transfer rules are being added and implemented
