
Linguistically Informed and Corpus Informed Morphological Analysis of Arabic

Majdi Sawalha & Eric Atwell, School of Computing, Faculty of Engineering, University of Leeds, Leeds, LS2 9JT, UK. sawalha@comp.leeds.ac.uk, eric@comp.leeds.ac.uk


Presentation Transcript


  1. Linguistically Informed and Corpus Informed Morphological Analysis of Arabic. Majdi Sawalha & Eric Atwell, School of Computing, Faculty of Engineering, University of Leeds, Leeds, LS2 9JT, UK. sawalha@comp.leeds.ac.uk, eric@comp.leeds.ac.uk. CL 2009

  2. Outline
     • Introduction
     • Arabic Morphological Analyzers
     • Arabic Corpora & Lexicons
     • Analytical Study of Tri-literal Roots of Arabic
     • Specifications of the Morphological Analyzer
     • Morphological Features of Arabic Words and Tag Set
     • Evaluation and Results
       • Gold Standard for Evaluation
       • Morphochallenge 2009 Qur’an Gold Standard

  3. Introduction
     Methodologies for developing a robust Arabic morphological analyzer:
     • Syllable-based Morphology (SBM)
     • Root-Pattern Methodology
     • Lexeme-based Morphology
     • Stem-based Arabic lexicon with grammar and lexis specifications
     • Using tagged corpora and computer algorithms to build a morphological database of the tagged words
     Roots, stems, patterns and affixes are pre-stored. Grammar and linguistic information are encoded with the analyzers.

  4. Arabic Morphological Analyzers
     • Buckwalter Morphological Analyzer: uses pre-stored dictionaries of words, stems and affixes constructed manually.
     • Khoja’s Stemmer: removes the longest prefix and suffix of the word, then matches the processed word against lists of noun and verb patterns to extract the correct root of the word.
     • Al-Shalabi et al.: depends on mathematical calculations of weights assigned to the letters of the word; the algorithm selects the letters with lower weights as root letters.

  5. Comparative Evaluation of Arabic Morphological Analyzers
     • Studying freely available morphological analyzers and stemmers.
     • Developing a gold standard for evaluation.
     • Results: more work is needed on the development of morphological analysis of Arabic. We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.

  6. Arabic Corpora
     • The Qur’an: 78,000 tokens, 19,000 vowelized word types, 15,000 non-vowelized word types.
     • The Corpus of Contemporary Arabic (CCA): a Modern Standard Arabic text corpus of 1 million words.
     • The Penn Arabic Treebank: 734 files, 166,000 words of written Modern Standard Arabic.
     • The text of 15 traditional Arabic lexicons used as a corpus: about 11 million words and 2 million word types of both modern and classical Arabic text.

  7. Arabic Lexicons
     Methodologies of ordering lexical entries in the Arabic lexicons:
     • Al-Khalil methodology: lists the lexical entries based on the pronunciation of the letters, starting from the farthest in the mouth to the nearest.
     • Abi Obaid methodology: lists the lexical entries based on similarity in meaning.
     • Al-Jawhari methodology: lists the lexical entries based on the last letter of the word.
     • Al Barmaki methodology: lists the lexical entries alphabetically.

  8. Arabic Lexicons
     A sample of an Arabic lexicon:
     كتب: الكِتابُ: معروف، والجمع كُتُبٌ وكُتْبٌ. كَتَبَ الشيءَ يَكْتُبه كَتْباً وكِتاباً وكِتابةً، وكَتَّبَه: خَطَّه؛ قال أَبو النجم: أَقْبَلْتُ من عِنْدِ زيادٍ كالخَرِفْ، تَخُطُّ رِجْلايَ بخَطٍّ مُخْتَلِفْ، تُكَتِّبانِ في الطَّريقِ لامَ أَلِفْ قال: ورأَيت في بعض النسخِ تِكِتِّبانِ، بكسر التاء، وهي لغة بَهْرَاءَ، يَكْسِرون التاء، فيقولون: تِعْلَمُونَ ...
     k t b: [Alkitab], the book, is well known. The plural forms are [kutubun] and [kutbun]. [kataba Alshay’]: he wrote something; [yaktubuhu]: the action of writing something. [katban], [kitaban] and [kitabatan] mean the art of writing. [kattabahu], writing it, means drawing it up. Abu Al-Najim said: I returned from Ziyad’s place [after meeting him] like a senile man, my legs drawing different lines (i.e. walking in an odd way), writing [tukattibani] the letters Lam Alif on the road (describing how erratically he was walking). He said: I saw in another version the word “they wrote” as [tikittibani], with the short vowel kasrah on the first letter [taa], as used in the dialect of Bahraa’ [an Arab tribe]. They say: [ti’lamuwn] (you know).
     A sample of the Arabic-English Dictionary by Edward Lane

  9. Analytical Study of Tri-literal Roots of Arabic
     • Tri-literal roots were classified into 3 main groups and 22 detailed groups.
     • Experiment 1: Qur’an words derived from tri-literal roots were analyzed (45,534 words and 1,610 tri-literal roots).
     [Charts: Qur’an tokens; Tri-literal roots of Qur’an]

  10. Analytical Study of Tri-literal Roots of Arabic
      • Experiment 2: word types of the broad lexical resource constructed by analyzing 15 Arabic lexicons, which contains 376,167 word types.
      [Charts: Word types of broad-lexical resource; Roots of broad-lexical resource]

  11. Specifications of the Morphological Analyzers - Inputs
      • Input: single words or text (fully vowelized, partially vowelized, or non-vowelized).
      • Tokenization: each token is an Arabic word, number, currency or punctuation mark.
      • Processing Arabic words:
        • Resolving doubled letters marked with Shaddah: وَصَّى → وَصْصَى (waS~aY → waSoSaY)
        • Resolving the Extension (maddah): آمَنُوا → ءامَنُوا (|manuwA → 'AmanuwA)
      • At most one short vowel may appear on any letter of an Arabic word.
      Letter and short-vowel positions after processing (position 1 is the first letter; "-" marks a letter with no short vowel):
      Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
      وَصْصَى waSoSaY | w | a | S | o | S | a | Y | - | | | |
      ءامَنُوا 'AmanuwA | ' | - | A | - | m | a | n | u | w | - | A | -
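The two normalization steps on this slide can be sketched as follows. This is only an illustration written against Buckwalter transliteration rather than Arabic script, not the authors' implementation. It assumes that `~` (shaddah) directly follows the consonant it doubles, that `o` marks sukun, that `|` is alif with maddah, and that `'` is hamza; the function names are hypothetical.

```python
SHADDAH = "~"      # Buckwalter symbol for shaddah (gemination mark)
SUKUN = "o"        # Buckwalter symbol for sukun (no vowel)
ALIF_MADDAH = "|"  # Buckwalter symbol for alif with maddah


def resolve_shaddah(word: str) -> str:
    """Expand C~ into C + sukun + C so each letter carries at most one mark."""
    out = []
    for ch in word:
        if ch == SHADDAH and out:
            doubled = out[-1]  # assumption: shaddah directly follows its consonant
            out.extend([SUKUN, doubled])
        else:
            out.append(ch)
    return "".join(out)


def resolve_maddah(word: str) -> str:
    """Rewrite alif-with-maddah as hamza followed by a plain alif."""
    return word.replace(ALIF_MADDAH, "'A")


def normalize(word: str) -> str:
    """Apply both steps (hypothetical helper, not named in the slides)."""
    return resolve_maddah(resolve_shaddah(word))


print(normalize("waS~aY"))   # waSoSaY
print(normalize("|manuwA"))  # 'AmanuwA
```

Text in which the vowel is written before the shaddah would need the two marks reordered first; that case is not handled here.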

  12. Stop Words (Unambiguous Words)
      • A stop word has only one morphological analysis wherever it appears in the text.
      • About 40% of the tokens in any text are stop words.
      • The system contains a list of 1,368 stop words.
      • Personal pronouns: أنا “>nA” I, هي “hy” she
      • Relative pronouns: الذي “Al*y” who (sm), التي “Alty” who (sf)
      • Demonstrative pronouns: هذا “h*A” this (sm), هذه “h*h” this (sf)
      • Prepositions: في “fy” in, على “ElY” on, إلى “<lY” to
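Because a stop word has a single analysis, it can be resolved by a direct table lookup before any segmentation is attempted. The sketch below assumes a dictionary keyed by the Buckwalter form; the entries are a small hypothetical excerpt built from the examples on this slide, and the glosses stand in for the full morphological tags that the real 1,368-entry list would carry.

```python
from typing import Optional

# Hypothetical excerpt of the stop-word list: Buckwalter form -> gloss.
STOP_WORDS = {
    ">nA": "personal pronoun (I)",
    "hy": "personal pronoun (she)",
    "Al*y": "relative pronoun (who, sm)",
    "h*A": "demonstrative pronoun (this, sm)",
    "fy": "preposition (in)",
    "ElY": "preposition (on)",
}


def lookup_stop_word(token: str) -> Optional[str]:
    """Return the single stored analysis, or None if the token is not a stop word."""
    return STOP_WORDS.get(token)


print(lookup_stop_word("fy"))      # stored analysis
print(lookup_stop_word("yEmlwn"))  # None: falls through to full analysis
```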

  13. Clitics, Prefixes and Suffixes
      • Proclitics, prefixes, suffixes and enclitics were collected from traditional Arabic grammar books.
      • The clitic and affix lists were checked using four Arabic corpora:
        • The Qur’an
        • The Corpus of Contemporary Arabic (CCA)
        • The Penn Arabic Treebank
        • The text of the 15 traditional Arabic lexicons as a corpus

  14. Clitics, Prefixes and Suffixes
      Prefix | Example | P1 Tag | P2 Tag | P3 Tag
      فست fst | فستـذكرون fst*krwn | ف f: p--t--------------- | س s: p--i--------------- | ت t: r---s-nus----------
      وال wAl | والـسماء wAlsmA' | و w: p--t--------------- | ال Al: r---d-----d-------- |
      Suffix | Example | P1 Tag | P2 Tag | P3 Tag
      تموهما tmwhA | أورثـتموها >wrvtmwhA | تم tm: r---&-mps??----h--- | و w: r---l-mp-n?----?--- | هما hmA: r---&-ndt??----h---
      يون ywn | الحواريون AlHwArywn | ي y: r---j-------------- | ون wn: r---l-mp-n?----?--- |
      • 215 proclitics & prefixes
      • 127 suffixes & enclitics

  15. Clitics, Prefixes and Suffixes
      Analyzed word: يَعْمَلُونَ yaEomaluwna
      Segmentation (first / second / third part) | Prefixes & suffixes analysis
      يعملون yEmlwn | Candidate analysis
      ي y + عمل Eml + ون wn | Candidate analysis
      يعملو yEmlw + ن n | Not accepted
      يع yE + م m + لون lwn | Not accepted
      • Words are divided into three parts of different sizes.
      • The first part is searched for in the proclitics & prefixes list.
      • The third part is searched for in the suffixes & enclitics list.
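The three-part division above can be sketched as an exhaustive enumeration of split points, keeping only the splits whose first and third parts are licensed by the affix lists. This is a minimal sketch under stated assumptions: an empty first or third part is treated as "no prefix" or "no suffix", the second part must be non-empty, and `PREFIXES` and `SUFFIXES` are toy stand-ins for the real lists. The function name is hypothetical.

```python
from typing import List, Set, Tuple

Split = Tuple[str, str, str]


def candidate_splits(word: str, prefixes: Set[str], suffixes: Set[str]) -> List[Split]:
    """Return every (first, second, third) split whose affix parts are licensed."""
    results = []
    n = len(word)
    for i in range(n):                 # first part is word[:i]
        for j in range(i + 1, n + 1):  # second part is word[i:j], never empty
            first, second, third = word[:i], word[i:j], word[j:]
            if first and first not in prefixes:
                continue               # non-empty first part must be a known prefix
            if third and third not in suffixes:
                continue               # non-empty third part must be a known suffix
            results.append((first, second, third))
    return results


# Toy affix lists (assumptions, not the authors' data).
PREFIXES = {"y"}
SUFFIXES = {"wn"}

for split in candidate_splits("yEmlwn", PREFIXES, SUFFIXES):
    print(split)
```

With these toy lists the word yEmlwn yields four candidates, including ('y', 'Eml', 'wn'), while a split such as yE + m + lwn is rejected because neither affix is listed.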

  16. Root or Stem
      Analyzed word: يَعْمَلُونَ yaEomaluwna
      Segmentation (first / second / third part) | Affixes analysis | Affixes and root analysis
      يعملون yEmlwn | Candidate analysis | Not accepted analysis
      يعمل yEml + ون wn | Candidate analysis | Not accepted analysis
      ي y + عملون Emlwn | Candidate analysis | Not accepted analysis
      ي y + عمل Eml + ون wn | Candidate analysis | Accepted analysis
      • The system uses a list of about 12,000 roots extracted by analyzing 15 traditional Arabic language lexicons.
      • The second part of the word is searched for in the root list.
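The root check can be sketched as a filter over the affix-level candidates: a candidate survives only if its second part is found in the root list. The candidate tuples below are written out by hand from the table on this slide, assuming that a part standing alone or before a suffix is the second part; `ROOTS` is a toy stand-in for the list of about 12,000 roots, and the function name is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Split = Tuple[str, str, str]


def accept_by_root(splits: Iterable[Split], roots: Set[str]) -> List[Split]:
    """Keep only the splits whose second part is a known root."""
    return [s for s in splits if s[1] in roots]


# Affix-level candidates for yEmlwn as (first, second, third) parts.
CANDIDATES = [
    ("", "yEmlwn", ""),
    ("", "yEml", "wn"),
    ("y", "Emlwn", ""),
    ("y", "Eml", "wn"),
]
ROOTS = {"Eml", "ktb"}  # toy root list (assumption)

print(accept_by_root(CANDIDATES, ROOTS))
```

Only ('y', 'Eml', 'wn') passes, which corresponds to the accepted analysis in the table.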

  17. Word Pattern
      Verb Patterns | POS Tag | Noun Patterns | POS Tag
      فَعَلْتُ faEalotu | v-p---nsf---an?-st?- | أُفْعُلاوَى >ufoEulAwaY | nw----??-??----?qt-?
      فَعَلْنَا faEalonaA | v-p---npf---an?-st?- | اِفْعِيلال AifoEiylAl | nw----??-??----?qt-?
      فَعَلْتَ faEalota | v-p---mss---an?-st?- | فاعُولاء fAEuwlA' | nw----??-??----?qt-?
      • Different words are derived from their roots using certain patterns.
      • Derived words inherit the morphological features of their derivation patterns.
      • The system has a list of patterns extracted from traditional Arabic language grammar books:
        • 2730 verb patterns
        • 985 noun patterns
      • Morphological-feature POS tags are assigned to each pattern in the list.
      • Patterns are fully vowelized.

  18. Pattern Matching Algorithms
      • First algorithm: depends on the word and its root as inputs.
        • The root letters of the word are replaced by the letters (Fa’, Ain, Lam, [Lam]) (ف ، ع ، ل ، [ل]). Replacement of root letters is not an easy task!
      • Second algorithm: depends on a pre-stored list of patterns.
        • Searches the pattern list for patterns of the same size as the analyzed word, after removing its affixes.
        • Replaces the letters of the word corresponding to the letters (Fa’, Ain, Lam, [Lam]) (ف ، ع ، ل ، [ل]) of the pattern.
        • E.g. the word كتب ktb matches the following patterns:
          فَعْل فَعَل فَعُل فَعِل فُعْل فُعَل فُعُل فُعِل فِعْل فِعِل
          faEol faEal faEul faEil fuEol fuEal fuEul fuEil fiEol fiEil
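The second algorithm can be sketched as follows, in Buckwalter transliteration. This is an illustration under several assumptions rather than the authors' code: the analyzed word is non-vowelized with its affixes either removed or present as fixed letters of the pattern, every `f`, `E` and `l` in a pattern is a root slot, and the pattern's diacritics are stripped before comparing sizes. All names are hypothetical.

```python
from typing import List, Optional

DIACRITICS = set("aiuo~FNK")  # Buckwalter short vowels, sukun, shaddah, tanween
ROOT_SLOTS = set("fEl")       # Fa', Ain, Lam placeholders in a pattern


def strip_diacritics(text: str) -> str:
    """Drop vowel marks so a vowelized pattern can be compared to a bare word."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def match_pattern(word: str, pattern: str) -> Optional[str]:
    """Return the letters filling the root slots if the word fits, else None."""
    skeleton = strip_diacritics(pattern)
    if len(skeleton) != len(word):
        return None                # pattern must have the same size as the word
    root = []
    for w, p in zip(word, skeleton):
        if p in ROOT_SLOTS:
            root.append(w)         # slot position: any letter may fill it
        elif p != w:
            return None            # fixed pattern letter must match exactly
    return "".join(root)


def matching_patterns(word: str, patterns: List[str]) -> List[str]:
    """Return the patterns from the list that the word fits."""
    return [p for p in patterns if match_pattern(word, p) is not None]


PATTERNS = ["faEol", "faEal", "fAEil", "yafoEaluwna"]  # toy pattern list
print(matching_patterns("ktb", PATTERNS))
print(match_pattern("yEmlwn", "yafoEaluwna"))
```

Here ktb fits faEol and faEal but not fAEil or yafoEaluwna, and yEmlwn fits yafoEaluwna with Eml in the root slots. A pattern containing a literal Fa’, Ain or Lam that is not a root letter would need an explicit slot marking, which this sketch does not model.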

  19. Word Pattern: The second algorithm (Example)
      Analyzed word: يَعْمَلُونَ yaEomaluwna
      Matched Patterns | Tag
      يَفْعُلُونَ yafoEuluwna | v-c---mpt--ian?-st?
      يَفْعِلُونَ yafoEiluwna | v-c---mpt--ian?-st?
      يَفْعَلُونَ yafoEaluwna | v-c---mpt--ian?-st?
      يُفْعِلُونَ yufoEiluwna | v-c---mpt--ipn?-at?
      يُفْعَلُونَ yufoEaluwna | v-c---mpt--ipn?-tt?

  20. Vowelization
      Analyzed word: كتب ktb
      Pattern | Vowelization
      فَعْل faEol | كَتْب katob
      فَعَل faEal | كَتَب katab
      فَعُل faEul | كَتُب katub
      فَعِل faEil | كَتِب katib
      فُعْل fuEol | كُتْب kutob
      فُعَل fuEal | كُتَب kutab
      فُعُل fuEul | كُتُب kutub
      فُعِل fuEil | كُتِب kutib
      فِعْل fiEol | كِتْب kitob
      فِعِل fiEil | كِتِب kitib
      • Helps in determining some morphological features of the words.
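Vowelization by pattern can be sketched as slot substitution: the root letters replace Fa’, Ain and Lam in a fully vowelized pattern, and everything else in the pattern is copied through. The sketch assumes a tri-literal root in Buckwalter transliteration and that every `f`, `E` and `l` in the pattern is a root slot; `apply_pattern` is a hypothetical name.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the f/E/l slots of a vowelized pattern with the root letters."""
    if len(root) != 3:
        raise ValueError("this sketch only handles tri-literal roots")
    slots = {"f": root[0], "E": root[1], "l": root[2]}
    return "".join(slots.get(ch, ch) for ch in pattern)


PATTERNS = ["faEol", "faEal", "faEul", "faEil", "fuEol",
            "fuEal", "fuEul", "fuEil", "fiEol", "fiEil"]

for pattern in PATTERNS:
    print(pattern, "->", apply_pattern("ktb", pattern))
```

The same substitution applied to the root Eml and the pattern yafoEaluwna gives yaEomaluwna, the vowelized form of the word analyzed in the earlier slides.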

  21. Morphological Features of Arabic Words and Tag Set
      http://www.comp.leeds.ac.uk/sawalha/tagset
      • The part-of-speech tag set is designed following the traditional grammar classifications.
      • The tag set covers 22 morphological features of Arabic words.
      • Each tag consists of 22 characters. E.g. v at the first position indicates a verb, and n at the second position indicates a proper name. At the seventh position, m indicates masculine and f indicates feminine.
      • “-” is used if a feature is not applicable to the tagged word.
      • “?” is used if a feature applies to the word but its value is not available at the moment or the automatic tagger could not guess it.

  22. Morphological Features of Arabic Words and Tag Set
      http://www.comp.leeds.ac.uk/sawalha/tagset
      P | Morphological Features Categories
      1 | Main POS أَقسام الكلام الرئيسيَّة
      2 | POS of Noun أقسام فرعيَّة (الاسم)
      3 | POS of Verb أقسام فرعيَّة (الفعل)
      4 | POS of Particle أقسام فرعيَّة (الحرف)
      5 | Residuals أقسام فرعيَّة (أخرى)
      6 | Punctuations علامات الترقيم
      7 | Gender الجنس
      8 | Number العدد
      9 | Person الشخص
      10 | Morphology الصَّرف
      11 | Case and Mood الحالة الإعرابية للاسم أو الفعل
      12 | Case and Mood marks علامة الإعراب أو البناء
      13 | Definiteness المَعْرِفَة والنَّكِرَة
      14 | Voice المَبْني لِلمَعْلُوم و المَبْني لِلمَجْهُول
      15 | Emphasize المُؤكَّد وغيرُ المُؤكَّد
      16 | Transitivity اللازم والمتعدي
      17 | Humanness العاقل وغير العاقل
      18 | Variability & Conjugation التَّصريف
      19 | Augmented & Unaugmented المجرَّد والمزيد
      20 | Root letters عَدَد أحْرُف الجَذْر
      21 | Verb Internal Structure بُنية الفعل
      22 | Noun finals أقسام الأسم تبعاً للفظ آخره
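A positional tag of this kind can be decoded by pairing each character with the feature name at the same position. The sketch below assumes one character per feature in the order of the 22 features listed on this slide, skips "-" (not applicable) and reports "?" as unknown. It does not expand value codes into names, since the slides give only a few of them. The example tag is a hypothetical one built for illustration, not taken from the authors' data.

```python
from typing import Dict

FEATURES = [
    "Main POS", "POS of Noun", "POS of Verb", "POS of Particle", "Residuals",
    "Punctuations", "Gender", "Number", "Person", "Morphology",
    "Case and Mood", "Case and Mood marks", "Definiteness", "Voice",
    "Emphasize", "Transitivity", "Humanness", "Variability & Conjugation",
    "Augmented & Unaugmented", "Root letters", "Verb Internal Structure",
    "Noun finals",
]


def decode_tag(tag: str) -> Dict[str, str]:
    """Map each applicable position of a 22-character tag to its feature name."""
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} characters, got {len(tag)}")
    decoded = {}
    for name, code in zip(FEATURES, tag):
        if code == "-":
            continue  # feature not applicable to this word
        decoded[name] = "unknown" if code == "?" else code
    return decoded


# Hypothetical tag: verb (position 1), masculine (position 7), number unknown.
example = "v" + "-" * 5 + "m" + "?" + "-" * 14
print(decode_tag(example))
```

The length check matters in practice: several sample tags shown in the slides are shorter than 22 characters, and this sketch rejects them instead of guessing which positions are missing.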

  23. Morphological Features of Arabic Words and Tag Set
      Sample of a tagged document using the morphological feature tag set:
      وَوَصَّيْنَا الْإِنسَانَ بِوَالِدَيْهِ حُسْنًا
      “We have recommended that a person must take good care of their parents.”

  24. Evaluation and Results: Gold Standard for Evaluation
      Gold standards are used to evaluate and measure the actual accuracy of automatic systems. To construct a gold standard for evaluation, we need to determine:
      • The problem domain: evaluating morphological analyzers and part-of-speech taggers.
      • The corpora: corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text.
        • Two versions of the Qur’an text: vowelized and non-vowelized.
        • The Corpus of Contemporary Arabic (Al-Sulaiti & Atwell, 2006).

  25. Gold Standard for Evaluation
      • Gold Standard Format
        • Each word of the gold standard occupies one line, with its morphological and part-of-speech information separated by tabs.
        • Contains the root and the pattern information of the words.
        • The gold standard will be stored in flat text files using Unicode UTF-8 encoding, or in XML.
      • Gold Standard Size
        • It must be relatively large.
        • It should cover most cases that morphological analyzers have to handle.
        • It is measured by the number of words it contains.

  26. Morphochallenge 2009 Gold Standard
      http://www.cis.hut.fi/morphochallenge2009/
      • MorphoChallenge aims to develop unsupervised morphological analyzers that can be used for different languages, including Arabic.
      • A gold standard of the Qur’an has been constructed to evaluate morphological analyzers in the Morphochallenge 2009 competition.
      • Its size is 78,004 words.
      • It contains the full morphological analysis of each word, following the morphological analysis of the Qur’an in the tagged database developed at the University of Haifa (Dror et al, 2004).

  27. Morphochallenge 2009 Qur’an Gold Standard
      Each sample line holds the word, its root, its pattern and its morphological analysis.
      Non-vowelized, Buckwalter transliteration:
      bsm | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
      Allh | None | None | llAh+Noun+ProperName+Gen+Def
      AlrHm_n | rHm | fElAn | rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      AlrHym | rHm | fEyl | rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      Vowelized, Arabic script:
      بِسْمِ | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
      اللّهِ | None | None | للَاه+Noun+ProperName+Gen+Def
      الرَّحْمـَنِ | رحم | فَعلَان | رَحمَان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      الرَّحِيمِ | رحم | فَعِيل | رَحِيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      Non-vowelized, Arabic script:
      بسم | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
      الله | None | None | للاه+Noun+ProperName+Gen+Def
      الرحمـن | رحم | فعلان | رحمان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      الرحيم | رحم | فعيل | رحيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      Vowelized, Buckwalter transliteration:
      bisomi | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
      All~hi | None | None | llaah+Noun+ProperName+Gen+Def
      Alr~aHom_ani | rHm | faElaAn | raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
      Alr~aHiymi | rHm | faEiyl | raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
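Reading such a gold-standard file can be sketched as below. The column order (word, root, pattern, analysis) and the tab separator are assumptions inferred from the sample lines and from the format described on the earlier gold-standard slide. `GoldEntry` and `parse_gold_line` are hypothetical names, and a literal `None` field is taken to mean that no root or pattern is recorded.

```python
from typing import List, NamedTuple, Optional


class GoldEntry(NamedTuple):
    word: str
    root: Optional[str]
    pattern: Optional[str]
    morphemes: List[str]


def parse_gold_line(line: str) -> GoldEntry:
    """Parse one tab-separated gold-standard line into its four fields."""
    word, root, pattern, analysis = line.rstrip("\n").split("\t")
    # Morphemes are comma-separated; a trailing comma leaves an empty item to drop.
    morphemes = [m.strip() for m in analysis.split(",") if m.strip()]
    return GoldEntry(
        word=word,
        root=None if root == "None" else root,
        pattern=None if pattern == "None" else pattern,
        morphemes=morphemes,
    )


line = "bsm\tsm\tNone\tb+Prep , sm+Noun+Triptotic+Sg+Masc+Gen ,"
entry = parse_gold_line(line)
print(entry.word, entry.root, entry.pattern)
print(entry.morphemes)
```

Each morpheme string can then be split on "+" to separate the surface form from its feature labels, for example when comparing an analyzer's output to the gold analysis.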

  28. Thank you! Questions?
