Intuitive Coding of the Arabic Lexicon • Ali Farghaly & Jean Senellart • SYSTRAN Software Corporation, San Diego, CA & Soisy, France
Purpose • To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System • To describe the methodology and implementation adopted for dictionary building and morphological analysis
Overview • SYSTRAN’s Arabic-English MT System • SYSTRAN’s Intuitive Coding Technology • Intuitive Coding of the Arabic Lexicon • Stem-based • Statistical Arabic stem Generation • Internal morphology • External morphology
SYSTRAN’s Arabic-English MT System • An end-to-end MT system • Development started July 2002 • Uses SYSTRAN’s NG technology • Declarative modules • State-of-the-art Arabic linguistic knowledge • Transfer approach • Hybrid approach combining statistical techniques and linguistic knowledge
SYSTRAN’s Intuitive Coding Technology • Customizing MT systems to improve translation quality • Building user-specific dictionaries: by the developers, by the user, or in collaboration • SYSTRAN’s decision: let the user do the customization
Intuitive Coding • (Senellart et al., 2003) • Dictionary representation should be simple • Automatic processing of user information • Interactive processing • Multi-level coding algorithm • Complete integration • Easy-to-use graphical interface
Stem Based Arabic lexicon • Following the spirit of Senellart (2003), we opted for intuitive coding of the Arabic lexicon: • What are the building blocks of the Arabic dictionary? • A – roots • B - stems
Why Stems? • Stems are more intuitive than roots • Eliminates the need for morphological patterns “الميزان الصرفي” • Eliminates overgeneralization of Arabic stems • Subcategorization frames, syntactic and semantic information are stem-specific and not root-specific
Sample Entry • 1016 إِنْتَصَرَ verb plain "[perfect=إنْتَصَرَ],[imperfect=ينْتَصِرُ],[passper=إنْتُصِر],[imperative=إنْتَصِر],[passimp=ينْتَصَر]" [+AINT+GPP+HUSUBJ]
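The sample entry above can be read as six fields: an entry number, the headword, a part of speech, an entry type, a quoted list of `[name=form]` pairs, and a bracketed set of `+`-separated features. The following is a minimal sketch of a parser for that layout, inferred only from this one sample. The function name `parse_entry`, the field names in the returned dictionary, and the regular expressions are assumptions for illustration, not SYSTRAN's actual file format specification. The meaning of the features (AINT, GPP, HUSUBJ) is not given in the slides and is not interpreted here.

```python
import re

# Layout inferred from the single sample entry on the slide (an assumption).
ENTRY_RE = re.compile(
    r'^(?P<id>\d+)\s+(?P<lemma>\S+)\s+(?P<pos>\S+)\s+(?P<kind>\S+)\s+'
    r'"(?P<forms>[^"]*)"\s+\[(?P<feats>[^\]]*)\]$'
)
FORM_RE = re.compile(r'\[(\w+)=([^\]]*)\]')

SAMPLE = ('1016 إِنْتَصَرَ verb plain '
          '"[perfect=إنْتَصَرَ],[imperfect=ينْتَصِرُ],[passper=إنْتُصِر],'
          '[imperative=إنْتَصِر],[passimp=ينْتَصَر]" [+AINT+GPP+HUSUBJ]')


def parse_entry(line):
    """Split one dictionary line into its fields (hypothetical helper)."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError("line does not match the assumed entry layout")
    # Each [name=form] pair becomes one key in the forms mapping.
    forms = {name: form.strip() for name, form in FORM_RE.findall(match["forms"])}
    # Features are '+'-prefixed codes; drop the empty piece before the first '+'.
    features = [f for f in match["feats"].split("+") if f]
    return {
        "id": int(match["id"]),
        "lemma": match["lemma"],
        "pos": match["pos"],
        "kind": match["kind"],
        "forms": forms,
        "features": features,
    }


entry = parse_entry(SAMPLE)
```

Keeping the forms in a mapping keyed by name makes it easy to check that a lexicographer has supplied every expected form before the entry is accepted.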
Statistical Arabic Stem Generator • Reduces the amount of typing • Speeds up entry creation • 60% increase in lexicographer productivity • Uses the most productive morphological rules
Generator Output • [perfect=قال],[imperfect=يَقال],[imperative=إقال],[passperf=قال],[passimperf=يقال] • [perfect=كَتَبَ],[imperfect=يَكْتُب],[imperative=أُكْتُب],[passperf=كُتِبَ],[passimperf=يُكْتَب]
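To illustrate the idea of proposing forms that a lexicographer then corrects, here is a toy sketch that applies fixed default rules to a perfect stem. It is not the statistical generator described in the slides: the rules (prefix YEH for the imperfect and passive imperfect, prefix ALEF WITH HAMZA BELOW for the imperative, no vowel marks) are assumptions chosen only because they give the same prefix-plus-stem shape as the قال row above. The names `generate_forms` and `format_forms` are hypothetical.

```python
import re

# Arabic short-vowel and related marks, U+064B (fathatan) to U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")

YEH = "\u064A"               # ي
ALEF_HAMZA_BELOW = "\u0625"  # إ


def strip_diacritics(word):
    """Remove vowel marks so the rules work on bare consonantal stems."""
    return DIACRITICS.sub("", word)


def generate_forms(perfect):
    """Propose unvocalized forms from a perfect stem using naive default rules.

    These are illustrative defaults only; real output would need
    vocalization and verb-class-specific rules.
    """
    bare = strip_diacritics(perfect.strip())
    return {
        "perfect": bare,
        "imperfect": YEH + bare,
        "imperative": ALEF_HAMZA_BELOW + bare,
        "passperf": bare,
        "passimperf": YEH + bare,
    }


def format_forms(forms):
    """Serialize the proposals in the bracketed [name=form] list layout."""
    return ",".join(f"[{name}={form}]" for name, form in forms.items())


proposal = format_forms(generate_forms("قال"))
```

The proposals are deliberately crude; the productivity gain reported above would come from the lexicographer editing a pre-filled entry rather than typing every form from scratch.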
Arabic Morphology • SYSTRAN has two different modules: • 1. Internal Morphology • 2. External Morphology • Two separate modules in a feeding order
Internal Morphology Module • Generates all different inflected forms of a given stem and adds morphological information to be used in syntactic processing
Input and Output of the Internal Morphology Module • Input: two files • 1. Stem file • 2. Morphological rules file • Output: inflected dictionary file
Sample of output • كتبن verb plain كتب+past+fem+3P+plural
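The step from stems plus rules to an inflected dictionary can be sketched as below. The rule table is a small hypothetical fragment covering four third-person perfect forms without vowel marks; only the `past+fem+3P+plural` label comes from the sample output, and the other feature labels (`masc`, `singular`) are assumed by analogy. The function names and the in-memory dictionary are illustrative and do not reflect SYSTRAN's rule file syntax.

```python
# Hypothetical suffix rules for the perfect (past) tense, unvocalized.
# Each rule is (suffix to append, feature string to record).
PERFECT_RULES = [
    ("", "past+masc+3P+singular"),
    ("ت", "past+fem+3P+singular"),
    ("وا", "past+masc+3P+plural"),
    ("ن", "past+fem+3P+plural"),
]


def inflect(stem, pos="verb", kind="plain", rules=PERFECT_RULES):
    """Yield one inflected-dictionary line per rule for the given stem."""
    for suffix, features in rules:
        yield f"{stem + suffix} {pos} {kind} {stem}+{features}"


def build_inflected_dictionary(stems):
    """Map each surface form to the list of lines that analyze it."""
    table = {}
    for stem in stems:
        for line in inflect(stem):
            surface = line.split(" ", 1)[0]
            table.setdefault(surface, []).append(line)
    return table


inflected = build_inflected_dictionary(["كتب"])
```

Storing a list per surface form leaves room for homographs, where two different stems or feature sets produce the same written form.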
Syntagmatic and Paradigmatic Morphology (Halliday 1972) • [Figure: diagram with regions labeled Internal and External; Arabic labels recovered from the figure: همشاهد, و, ني, يشاهد, ف, ها, سيشاهد, ه, يشاهدون, ل, هن, نشاهد]
External Morphology Module • Decomposes a token into different part-of-speech units • Follows morphosyntactic rules of the language • It is the syntax of morphemes • It has morphophonemic component
Sample of External Morphology Rules • WAFA:= <وَ.CONJ|فَ.CONJ> • KABILI:= <كَ.PREP|بِ.PREP|لِ.PREP> • LI:= <لِ.PREP> • {WAFA}?_{AL}_<NOUN:-PROPERNOUN|ADJ|DET:QUANTIFIER|NUMERIC:CARDINAL> • {WAFA}?_{NOUNADJ}_<PRON:PERSPOSS> • {WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS>
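A rough sketch of how rules of this shape could drive token decomposition follows. It merges the last two rules into one pattern with an optional conjunction, an optional preposition, a noun or adjective stem, and an optional pronoun suffix; the definite article rule is left out. The pronoun suffix list, the lexicon check, and the function name `segment` are assumptions for illustration, and the text is handled without vowel marks.

```python
# Clitic classes taken from the WAFA and KABILI definitions (unvocalized).
WAFA = {"و": "CONJ", "ف": "CONJ"}
KABILI = {"ك": "PREP", "ب": "PREP", "ل": "PREP"}
# Hypothetical, partial list of possessive pronoun suffixes.
PRON_SUFFIXES = ["هما", "هم", "هن", "ها", "ه", "ني", "ي", "ك", "نا"]


def segment(token, lexicon):
    """Return all decompositions CONJ? + PREP? + stem + PRON? whose stem is known."""
    analyses = []
    conj_options = [("", None)] + [(c, t) for c, t in WAFA.items() if token.startswith(c)]
    for conj, conj_tag in conj_options:
        rest = token[len(conj):]
        prep_options = [("", None)] + [(p, t) for p, t in KABILI.items() if rest.startswith(p)]
        for prep, prep_tag in prep_options:
            core = rest[len(prep):]
            suffix_options = [""] + [s for s in PRON_SUFFIXES if core.endswith(s)]
            for suffix in suffix_options:
                stem = core[:len(core) - len(suffix)]
                # Only keep a split when the remaining stem is in the lexicon.
                if stem in lexicon:
                    parts = [(conj, conj_tag), (prep, prep_tag),
                             (stem, "NOUNADJ"), (suffix, "PRON")]
                    analyses.append([p for p in parts if p[0]])
    return analyses


result = segment("وبكتابه", {"كتاب"})
```

The lexicon check is what stops a stem-initial letter from being stripped as a clitic: كتاب begins with ك, but the remainder تاب is not a known stem in this toy lexicon, so only the undivided reading survives.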
Order of Application • External morphology must apply before internal morphology and before lookup in the monolingual inflected dictionary • The output of the external morphology module therefore feeds the internal morphology module
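The feeding order can be pictured with a short sketch: clitics are split off first, and only the remaining unit is looked up in the inflected dictionary. The one-entry dictionary reuses the sample output line shown earlier in the slides; the proclitic table, the function name `analyze`, and the restriction to a single conjunction are simplifying assumptions.

```python
# One-entry inflected dictionary, taken from the sample output in the slides.
INFLECTED = {"كتبن": "كتب+past+fem+3P+plural"}
# Hypothetical proclitic table (conjunctions only, unvocalized).
PROCLITICS = {"و": "CONJ", "ف": "CONJ"}


def analyze(token, inflected=INFLECTED):
    """Apply external morphology first, then look up what remains."""
    results = []
    # Reading with no clitic: the whole token is an inflected form.
    if token in inflected:
        results.append([(token, inflected[token])])
    # Readings with one conjunction split off before the lookup.
    for clitic, tag in PROCLITICS.items():
        rest = token[len(clitic):]
        if token.startswith(clitic) and rest in inflected:
            results.append([(clitic, tag), (rest, inflected[rest])])
    return results


analysis = analyze("وكتبن")
```

If the lookup ran first, the token وكتبن would simply be missing from the inflected dictionary, which is why decomposition has to come before it.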
Conclusion • SYSTRAN’s monolingual dictionary has about 30,000 entries • Coverage of newspaper text is over 90% • The approach outlined in this paper has greatly accelerated development • Analysis, homograph resolution, and transfer rules are being added and implemented