Towards Developing a Multi-Dialect Morphological Analyser for Arabic. 4th International Conference on Arabic Language Processing, May 2–3, 2012, Rabat, Morocco. Khalid Almeman and Mark Lee, The University of Birmingham. www.almeman.com
Outline • Introduction • Multi-dialect morphological analyser • Adapt an MSA morphological analyser • Segment unknown words • Check against a web corpus • Conclusions and future work
Introduction • Usage of MSA vs. dialects
Introduction • Dialectal Morphology & Variation • MSA has a rich morphology in two main respects: • Affixes and stems (word level) • Syntax (context level) • The dialects inherit the complexity of MSA and also differ substantially from MSA at both the word and syntax levels
Introduction • Dialectal Morphology & Variation (the changes) • Some sounds are transformed, e.g. s to h (N Africa), q to a (LEV), s to H (EGY) • New sounds appear, e.g. k to ts or ch (Gulf), j to g (EGY) • Syntax differs between MSA and the dialects • No standardised spelling, e.g. the loanword ‘sandwich’ can be written in many forms
Introduction • Dialectal Morphology & Variation (the changes) • Phonetic changes in the Arabic dialects compared with MSA, e.g.
Introduction • What is the problem? • The rich morphology of Arabic • The variation between MSA and the dialects • The variation among the dialects themselves • No standardisation in the Arabic dialects • State of the art: MAGEAD • Restricted to verbs • Levantine only; rules must be defined for each new dialect • Hence the need for a dialect morphological analyser
Multi-dialect morphological analyser • Three methods are applied in sequence: 1. Modify the MSA analyser 2. Segment the remaining words 3. Check the frequency in the web corpus
Multi-dialect morphological analyser • Baseline experiment • We extracted 2229 dialect words from the web and checked them with an MSA morphological analyser (Al Khalil, 2011). The result:
Multi-dialect morphological analyser • The first method: adapt the MSA analyser • According to Haack (1996), the stem patterns of Arabic dialects are identical to those of MSA in many cases • So the suggestion is to add NEW dialect affixes to the MSA morphological analyser
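The idea of adding dialect affixes on top of an MSA analyser can be sketched as follows. This is a minimal illustration, not the authors' implementation: the affix lists, the function name `analyse_with_dialect_affixes` and the `msa_accepts` callback (standing in for a real MSA analyser) are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical dialect affixes, for illustration only. The slides do not
# list the affixes that were actually added to the MSA analyser.
DIALECT_PREFIXES = ("ب", "ح")
DIALECT_SUFFIXES = ("ش",)

Analysis = Tuple[str, str, str]  # (prefix, stem, suffix)


def analyse_with_dialect_affixes(
    word: str, msa_accepts: Callable[[str], bool]
) -> Optional[Analysis]:
    """Try the MSA analyser first, then retry with dialect affixes removed."""
    if msa_accepts(word):
        return ("", word, "")
    for prefix in ("",) + DIALECT_PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + DIALECT_SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)] if suffix else rest
            # Require at least one affix to be stripped and a non-empty stem.
            if (prefix or suffix) and stem and msa_accepts(stem):
                return (prefix, stem, suffix)
    return None


# Toy stand-in for the MSA analyser: a tiny set of known stems (assumption).
toy_msa_lexicon = {"يلعب", "يصطاد"}
result = analyse_with_dialect_affixes("بيلعب", toy_msa_lexicon.__contains__)
```

Words for which this returns `None` are the ones that would be passed on to the next layer.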
Results after the first layer • An example of the output after the first layer
Multi-dialect morphological analyser • The second method: the segmenter • Segments the remaining words by extracting four candidate shapes of each word • At this stage we do not yet know which shape is the correct one
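A segmenter of this kind might look like the sketch below. The slides do not define the four shapes, so the example assumes they are: the full word, the word without a prefix, the word without a suffix, and the word without both. The function name `four_shapes` and the fixed one-letter affix lengths are likewise hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[str, str, str]  # (prefix, stem, suffix)


def four_shapes(word: str, prefix_len: int = 1, suffix_len: int = 1) -> List[Shape]:
    """Return four candidate splits of a word (assumed interpretation)."""
    # Words too short to strip both affixes are returned unsegmented.
    if len(word) <= prefix_len + suffix_len:
        return [("", word, "")]
    end = len(word) - suffix_len
    prefix, suffix = word[:prefix_len], word[end:]
    return [
        ("", word, ""),                         # full word
        (prefix, word[prefix_len:], ""),        # prefix removed
        ("", word[:end], suffix),               # suffix removed
        (prefix, word[prefix_len:end], suffix)  # both removed
    ]


shapes = four_shapes("بيصطاد")
```

The segmenter only proposes candidates; choosing among them is left to a later step.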
Multi-dialect morphological analyser • The third method: use the web corpus • FULL WORD usage: DISAGREED between Arab countries in many cases, e.g. الواد حيلعب الكورة / وولدي ما بيلعب كمان / ولدي ماهيلعب • However, STEM usage: AGREED between Arab countries in many cases, e.g. ابني بيحب يلعب الكورة / ولدي مابيحبش يلعب الكورة / وولدي مابيحب يلعب الكورة كمان
Multi-dialect morphological analyser • The third method (cont.) • Following this hypothesis, we check the frequency of each candidate in the web corpus:
• Full word: بيصطاد (16500) | Prefix: ب | Suffix: (none) | Stem: يصطاد (800000)
• Full word: بيتارجح (2850) | Prefix: ب | Suffix: (none) | Stem: يتارجح (212000)
• Full word: بيتهجأ (5) | Prefix: ب | Suffix: (none) | Stem: يتهجأ (10100)
• Full word: بيركع (13100) | Prefix: ب | Suffix: (none) | Stem: يركع (568000)
• Then we choose the candidate with the greatest frequency, provided it is >= 10000
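The selection rule can be sketched as below. The counts are the ones quoted on the slide; the function name `choose_by_frequency` and the use of a plain dictionary in place of a live web-corpus query are assumptions made for illustration. The slide does not say how ties are handled, so that is left unspecified here.

```python
from typing import Dict, Optional

THRESHOLD = 10000  # minimum frequency stated on the slide


def choose_by_frequency(
    frequencies: Dict[str, int], threshold: int = THRESHOLD
) -> Optional[str]:
    """Return the most frequent candidate if it reaches the threshold."""
    if not frequencies:
        return None
    best = max(frequencies, key=frequencies.get)
    return best if frequencies[best] >= threshold else None


# Frequencies quoted on the slide (full word vs. stem).
chosen_hunt = choose_by_frequency({"بيصطاد": 16500, "يصطاد": 800000})
chosen_spell = choose_by_frequency({"بيتهجأ": 5, "يتهجأ": 10100})
```

With the slide's figures, the stem is chosen in both cases because it is far more frequent than the full dialect word and clears the threshold.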
Final results • An example of the output after the last layer
Conclusions and future work • Work on a larger corpus • Deal with diacritisation • Add more linguistic rules to both the adapted MSA morphological analyser and the web search to improve accuracy
Any questions? Thank you