CITALA'09 Rule-based approach in Arabic NLP: Tools, Systems and Resources Dr Khaled Shaalan Professor, Faculty of Computers & Information, Cairo University On Secondment to BUiD, UAE Khaled.shaalan@{buid.ac.ae, gmail.com} CITALA2009 - Morocco
Agenda • Objective • Language Tasks • NLP Approaches • Rule-based Arabic Analysis and generation tools • Rule-based Arabic NLP applications • Some Arabic NLP Free Resources • Major and Arabic mailing lists • Conclusion
Objective • To show how the rule-based approach has been successfully used to develop Arabic natural language processing tools and applications.
Separating Language Tasks • English vs. French vs. Arabic vs. . . . • Spoken language (dialogue) vs. written text vs. handwritten script • Genuine script vs. transliterated (Romanized) script • Vocalized (vowelized) vs. non-vocalized • Understanding vs. generation • First language learner vs. second language learner • Classical or Qur’anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects) • Stem-based vs. root-based
Rules • Situation/Action • If match(stem.prefix, def_article) then remove(stem.prefix, Stem_FS) • If match(stem.definiteness, indefinite) then morph_gen(stem.definiteness, Stem_FS)
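A minimal Python sketch of the situation/action idea in the first rule above. It assumes the feature structure Stem_FS can be stood in for by a plain dictionary; the names `Rule`, `apply_rules`, and the dictionary keys are illustrative and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

FeatureStructure = Dict[str, str]  # hypothetical stand-in for Stem_FS


@dataclass
class Rule:
    situation: Callable[[FeatureStructure], bool]  # the "if" part
    action: Callable[[FeatureStructure], None]     # the "then" part


def remove_definite_article(fs: FeatureStructure) -> None:
    # Strip the article "ال" from the surface form and record the feature.
    fs["surface"] = fs["surface"][len("ال"):]
    fs["prefix"] = ""
    fs["definiteness"] = "definite"


RULES: List[Rule] = [
    Rule(lambda fs: fs.get("prefix") == "ال", remove_definite_article),
]


def apply_rules(fs: FeatureStructure, rules: List[Rule]) -> FeatureStructure:
    # Each rule is tried once, in order; there is no recognize-act loop.
    for rule in rules:
        if rule.situation(fs):
            rule.action(fs)
    return fs


word = {"surface": "الكتاب", "prefix": "ال"}
result = apply_rules(word, RULES)
```

Here `result["surface"]` holds the noun without its article and `result["definiteness"]` records that the article was seen.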
Common Mistake • The rule-based approach is not a rule-based expert system! • Both consist of rules. • A rule-based expert system solves the problem with a Recognize-Act Cycle • Loop • Conflict resolution strategy
Recognize-Act Cycle [Figure: Domain Knowledge feeds the Rule Base (rules 1..n) and the Fact Base (Working Memory); the cycle Match → Conflict Resolution → Execute produces a New Fact or a New Rule] loop • Match: Rules are compared to working memory to determine matches; if no rule matches, then stop • Conflict Resolution: Select or enable a single rule for execution • Execute: Fire the selected rule • Add a new fact, or • Learn a new rule end loop
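The loop above can be sketched in a few lines of Python. This is only an illustration of the match / conflict-resolution / execute steps: the rule format, the "most premises wins" strategy, and the toy facts are assumptions, and the "learn a new rule" branch is left out.

```python
def recognize_act(rules, facts, max_cycles=100):
    """rules: list of (name, premises, conclusion); facts: iterable of strings."""
    facts = set(facts)
    for _ in range(max_cycles):
        # Match: rules whose premises all hold and whose conclusion is new.
        conflict_set = [r for r in rules
                        if set(r[1]) <= facts and r[2] not in facts]
        if not conflict_set:
            break  # no rule matches: stop
        # Conflict resolution: prefer the rule with the most premises
        # (a simple specificity strategy, chosen only for illustration).
        name, premises, conclusion = max(conflict_set, key=lambda r: len(r[1]))
        # Execute: fire the selected rule, adding a new fact to working memory.
        facts.add(conclusion)
    return facts


rules = [
    ("r1", ["starts_with_al"], "definite"),
    ("r2", ["definite", "is_noun"], "definite_noun"),
]
derived = recognize_act(rules, {"starts_with_al", "is_noun"})
```

Rule `r2` can only fire on the second cycle, after `r1` has added `definite` to working memory; that dependence on a loop is what separates an expert system from the single-pass rules used in the rule-based NLP approach.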
NLP Approaches • Rule-based • Statistical-based
NLP Approaches (1) • Rule-based: • Relies on hand-constructed rules that are to be acquired from language specialists • Requires only a small amount of training data • Development could be very time consuming • Statistical-based: • Developers do not need language specialists' expertise • Requires a large amount of annotated training data (very large corpora) • Automated
NLP Approaches (2) • Rule-based: • Some changes may be hard to accommodate • Not easy to obtain high coverage of the linguistic knowledge • Useful for limited domains • Can be used with both well-formed and ill-formed input • High quality, based on solid linguistic knowledge • Statistical-based: • Some changes may require re-annotation of the entire training corpus • Coverage depends on the training data • Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable • Lower quality: does not explicitly deal with syntax
Rule-based Arabic NLP tools • Morphological Analyzers • Morphological Generators • Syntactic Analyzers • Syntactic Generators
Morphological Analysis • Break down the inflected Arabic word into a root/stem, affixes, and features. • Example: sa- ‘uEty- kumA (سأعطيكما) - ‘will I give you…’
Rules - Augmented Transition Network (ATN) technique • Rules associated with arcs represent the context-sensitive knowledge about the relation between a root and inflections. • More than one rule may be associated with one arc. • Conditions associated with the arcs are placed in such a way that the arc to be traversed first is the one that leads to the most probable solution.
Types of Rules • Remove Prefix or Suffix • Remove doubled letter • Add/change Hamza, Weak letter,… • …
Analysis of the verb "شاهدتك" (I saw you): Remove suffixes [Figure: ATN with states S0, S1, S2, S3, S10; شاهدتك → (last1 = “ك”) → شاهدت → (last2 = “ت”) → شاهد] • stem: "شاهد" (saw) • perfect • 1st person sg pronoun: "ت" • 2nd person sg pronoun: "ك"
Analysis of the verb “يلعبون” (they are playing): Remove prefix & suffix [Figure: ATN with states S0, S1, S2, S3, S10; يلعبون → (Begin2 = “ي”) → لعبون → (last2 = “ون”) → لعب] • stem: “لعب” (played) • imperfect • Plural subject
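The two analyses above can be imitated with a small affix-stripping routine. This is a simplified sketch, not the ATN described in the talk: the affix tables and the two-word lexicon are toy assumptions that cover only these examples.

```python
# Toy affix tables and lexicon; all entries are illustrative assumptions.
SUFFIXES = {
    "ك": "2nd person sg pronoun",
    "ت": "1st person sg pronoun",
    "ون": "plural subject",
}
PREFIXES = {"ي": "imperfect"}
STEMS = {"شاهد", "لعب"}


def analyze(word):
    """Return (stem, features), or None if no known stem is reached."""
    # Candidate 1: no prefix, so the verb is taken to be perfect.
    candidates = [(word, ["perfect"])]
    # Further candidates: strip a known prefix and record its feature.
    for prefix, tense in PREFIXES.items():
        if word.startswith(prefix):
            candidates.append((word[len(prefix):], [tense]))
    for form, features in candidates:
        # Strip suffixes one at a time until a known stem is reached.
        while form not in STEMS:
            for suffix, feature in SUFFIXES.items():
                if form.endswith(suffix):
                    form = form[:-len(suffix)]
                    features.append(feature)
                    break
            else:
                break  # no suffix applies; abandon this candidate
        if form in STEMS:
            return form, features
    return None


saw_you = analyze("شاهدتك")
they_play = analyze("يلعبون")
```

A real analyzer would also have to rank competing analyses; here the first candidate that reaches a stem in the lexicon wins.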
Issues in the morphological analysis • Overgeneration (too many outputs) • Ambiguity • Reconstruction of vowels • Multiword/compound expressions • Out-of-Vocabulary (OOV) • Handling ill-formed input • Detection (spell checking) • Correction by relaxation, e.g. accepting “ه” written instead of “ة” • Preventing ill-formed output • Check the compatibility of affixes (e.g. the prefix “ف” cannot come after the prefix “ب” or “ك”).
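Two of the issues above, relaxation of “ه” for “ة” and the prefix compatibility check, are easy to illustrate. The sketch below assumes a one-word lexicon and a hand-written table of forbidden prefix orders; both are hypothetical.

```python
LEXICON = {"مدرسة"}  # toy lexicon (assumption): "school"


def relax_ta_marbuta(word):
    """Accept a final "ه" written in place of "ة" if that yields a known word."""
    if word in LEXICON:
        return word
    if word.endswith("ه"):
        candidate = word[:-1] + "ة"
        if candidate in LEXICON:
            return candidate
    return None


# Pairs (earlier prefix, later prefix) that must not occur in that order.
INCOMPATIBLE = {("ب", "ف"), ("ك", "ف")}


def prefixes_compatible(prefixes):
    """Check every adjacent pair of prefixes against the forbidden orders."""
    return all((a, b) not in INCOMPATIBLE
               for a, b in zip(prefixes, prefixes[1:]))


corrected = relax_ta_marbuta("مدرسه")
ok = prefixes_compatible(["ف", "ب"])
bad = prefixes_compatible(["ب", "ف"])
```

The same compatibility table can serve both directions: rejecting ill-formed input during analysis and blocking ill-formed output during generation.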
Morphological generation • Synthesis of an inflected Arabic word from a given root/stem according to a combination of morphological properties that include: • definiteness (definite article “ال”), • gender (masculine, feminine), • number (singular, dual, plural), • case (nominative, genitive, accusative,…), • person (first, second, third) • …
Types of Rules • synthesis of inflected • Noun • Verb • particle
Synthesis of inflected Nouns • definite noun • feminine noun • pluralize noun • dual noun • attach a prefix preposition • attach a suffix pronoun • end case • ….
Synthesis of feminine noun • If noun.gender = masculine then attach suffix feminine letter • Example: • “زوج” (husband) → “زوجة” (wife)
Synthesis of suffix pronoun • If pronoun.person = first and pronoun.number = singular then attach first person singular suffix pronoun • Example: • “زوجة” (wife) → “زوجتي” (my wife)
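The two noun-synthesis rules above can be chained, as in this hedged sketch. The function names and the one-entry pronoun table are assumptions; the step that opens “ة” to “ت” before a suffix is what the example “زوجتي” requires.

```python
def feminine(noun):
    # Simplified rule: attach the feminine marker "ة" to a masculine noun.
    return noun + "ة"


# Suffix pronouns keyed by (person, number); only one entry is filled in.
SUFFIX_PRONOUNS = {("first", "singular"): "ي"}


def attach_pronoun(noun, person, number):
    # A final "ة" is written as "ت" once a suffix follows it.
    if noun.endswith("ة"):
        noun = noun[:-1] + "ت"
    return noun + SUFFIX_PRONOUNS[(person, number)]


wife = feminine("زوج")
my_wife = attach_pronoun(wife, "first", "singular")
```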
Synthesis of inflected Verbs (very complex: rich in form and meaning) • conjugate a verb with tense • conjugate a verb with number • conjugate a verb with prefix pronoun • conjugate a verb with suffix pronoun • ….
Rule: synthesize first person plural of assimilated verbs • Input: first person singular past verb • Output: inflected verb • Example: نصل - سنصل - وصلنا • If verb.tense = future then remove_first_weak(verb.stem) & attach_prefix(verb.stem, "سن") • else if verb.tense = present then remove_first_weak(verb.stem) & attach_prefix(verb.stem, "ن") • else attach_suffix(verb.stem, "نا")
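A direct Python rendering of the rule above, assuming the verb is passed as its bare past stem (here “وصل”) and that “و” and “ي” are the weak letters of interest; the function name is illustrative.

```python
WEAK_INITIALS = ("و", "ي")  # weak letters that can open an assimilated verb


def first_person_plural(stem, tense):
    """Inflect an assimilated verb stem for first person plural."""
    if tense in ("future", "present"):
        # Remove the initial weak letter, then attach the tense prefix.
        base = stem[1:] if stem.startswith(WEAK_INITIALS) else stem
        prefix = "سن" if tense == "future" else "ن"
        return prefix + base
    # Past tense: keep the stem intact and attach the suffix "نا".
    return stem + "نا"


future = first_person_plural("وصل", "future")
present = first_person_plural("وصل", "present")
past = first_person_plural("وصل", "past")
```

The three results correspond to the three example forms on the slide.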
Issues in the morphological generation • Multiword/compound expressions • Out-of-Vocabulary (OOV) • Some forms need special handling: • Substitution: This man – هذا الرجل • Literal numbers (complex nouns) • Arabic script • ‘ل’ + ‘ال’ → ‘للـ’ • “زملاء” + “ي” → ‘زملاءي’ → ‘زملائي’ • “غرفة” → “غرفتان”
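The Arabic-script adjustments listed above can be written as small post-processing rules applied when affixes are attached. The sketch below covers only the three cases on the slide; the function names are assumptions, and other spelling interactions are ignored.

```python
def attach_li(noun):
    # "ل" + "ال..." is written "لل...": the alef of the article is dropped.
    if noun.startswith("ال"):
        return "لل" + noun[2:]
    return "ل" + noun


def attach_first_person_suffix(noun):
    # A word-final hamza "ء" is seated on "ئ" before the suffix "ي".
    if noun.endswith("ء"):
        noun = noun[:-1] + "ئ"
    return noun + "ي"


def dual(noun):
    # A final "ة" opens to "ت" before the dual ending "ان".
    if noun.endswith("ة"):
        noun = noun[:-1] + "ت"
    return noun + "ان"


for_the_man = attach_li("الرجل")
my_colleagues = attach_first_person_suffix("زملاء")
two_rooms = dual("غرفة")
```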
Types of Rules • Grammatical rules: • Describe sentence and phrase structures, and ensure the agreement relations between various elements in the sentence. • Parsing • Accepts the input and generates the sentence structure (parse tree)
Parsing of the sentence “الطالبة مجتهدة”: The student (sg, f) is diligent (sg, f) [Figure: parse tree; nominal sentence → inchoative (definite, fem, sg) + enunciative (indefinite, fem, sg); inchoative → الطالبة noun (definite, fem, sg); enunciative → مجتهدة noun (indefinite, fem, sg)] • Agreement: • Number • Gender • Rule: Nominal sentence -> definite_inchoative(Number, Gender) indefinite_enunciative(Number, Gender)
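The nominal-sentence rule above amounts to a definiteness check on each word plus number and gender agreement between them. A minimal sketch, assuming a hand-built lexicon of feature dictionaries (the masculine entry is added only to show a failing case):

```python
# Toy lexicon; the feature names and entries are illustrative assumptions.
LEXICON = {
    "الطالبة": {"definiteness": "definite", "gender": "fem", "number": "sg"},
    "مجتهدة": {"definiteness": "indefinite", "gender": "fem", "number": "sg"},
    "مجتهد": {"definiteness": "indefinite", "gender": "masc", "number": "sg"},
}


def parse_nominal_sentence(sentence):
    """Return a parse tree for 'inchoative enunciative', or None on failure."""
    words = sentence.split()
    if len(words) != 2 or any(w not in LEXICON for w in words):
        return None
    inchoative, enunciative = (LEXICON[w] for w in words)
    # The inchoative must be definite and the enunciative indefinite.
    if inchoative["definiteness"] != "definite":
        return None
    if enunciative["definiteness"] != "indefinite":
        return None
    # Agreement: both parts must share number and gender.
    if any(inchoative[f] != enunciative[f] for f in ("number", "gender")):
        return None
    return ("nominal_sentence",
            ("inchoative", words[0]),
            ("enunciative", words[1]))


tree = parse_nominal_sentence("الطالبة مجتهدة")
rejected = parse_nominal_sentence("الطالبة مجتهد")
```

Returning `None` on a gender mismatch is the simplest behaviour; a grammar checker would instead report which agreement relation failed.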
Issues in the syntactic analysis • Ambiguity (more than one parse tree) • Disambiguation techniques • Handling ill-formed input • Detection (grammar checking) • Recovery (partial parsing: the partial parses are chunks to be related)
Types of Rules • Determine phrase structures • Determine syntactic structure • Ensure the agreement relations between various elements in the sentence.
Rule: verb-subject agreement • Input: verb and inflected subject (a pre-verbal NP) • Output: inflected verb agreed with its inflected subject • synthesize_verb(Subject.number, verb.stem) • synthesize_verb(Subject.gender, verb.stem)
An agreement example: الأولاد زاروا خمس متاحف قديمة • Gloss: the-boys visited-they five museums old • Translation: The boys visited five old museums • Agreement relations: verb-subject (N, G), counted-number (G), adjective-noun (G)
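The verb-subject relation in this example can be generated from a suffix table indexed by the subject's number and gender. This is a sketch for third person past forms only; the table and the subject representation are assumptions, and the feminine plural is omitted because a hollow verb such as “زار” changes its stem there.

```python
# Third person past suffixes keyed by (number, gender); illustrative table.
PAST_3RD_SUFFIX = {
    ("sg", "masc"): "",
    ("sg", "fem"): "ت",
    ("dual", "masc"): "ا",
    ("dual", "fem"): "تا",
    ("pl", "masc"): "وا",
}


def synthesize_verb(stem, subject):
    """Inflect a past verb stem to agree with a pre-verbal subject NP."""
    return stem + PAST_3RD_SUFFIX[(subject["number"], subject["gender"])]


boys = {"word": "الأولاد", "number": "pl", "gender": "masc"}
visited = synthesize_verb("زار", boys)
```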
Issues in the syntactic generation • Word order (VSO, SVO, etc.) • Agreement (full/partial) • Dropping the subject pronoun (pro-drop), i.e., having a null subject when the inflected verb includes subject affixes • Syntax that captures the source/intended meaning • My son is 8 = ابني عمره ثماني سنوات • I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة
Rule-based Arabic NLP applications • Named Entity Recognition • Machine translation • Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
What is entity recognition? • Identifying, extracting, and normalizing entities from documents such as names of people, locations, or companies. • Makes unstructured data more structured
Example text: Politics of Ukraine • In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair elections. Kuchma was reelected in November 1999 to another five-year term, with 56 percent of the vote. International observers criticized aspects of the election, especially slanted media coverage; however, the outcome of the vote was not called into question. In March 2002, Ukraine held its most recent parliamentary elections, which were characterized by the Organization for Security and Cooperation in Europe (OSCE) as flawed, but an improvement over the 1998 elections. The pro-presidential For a United Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450 seats in parliament, with half chosen from party lists by proportional vote and half from individual constituencies. • [Figure: the text is passed to an Entity Extractor, which marks Person, Date, and Location entities]
Person Entity Recognition (1) Example: ‘الملك الأردني عبدالله الثاني’ The Jordanian king Abdullah II • We want to have a rule that recognizes a person name composed of a first name followed by optional last names, based on a preceding person indicator pattern.
Person Entity Recognition (2) The Rule component of this example: • Name Entity: عبدالله [Abdullah] • Indicator pattern: • an honorific such as "الملك" [The king] • Nasab: (optional) inflected from a location name "الأردني" [Jordanian]. • The rule also matches an optional ordinal number appearing at the end of some names such as "الثاني" [II].
Person Entity Recognition (3) ((honorific+(location(ية|ي))?)+ first_Name(last_Name)?+(number)?) • This (Regular Expression) rule can recognize: • الملك عبدالله • الملك الأردني عبدالله • الملك الأردني عبدالله الثاني • الملكة الأردنية رانيا • …
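The rule above maps naturally onto Python's `re` module. The sketch below is an assumption-laden illustration: the gazetteers are toy lists (the last-name entry in particular is only a placeholder), and the named groups mirror the components of the rule.

```python
import re

# Toy gazetteers; every entry is an illustrative assumption.
HONORIFICS = ["الملك", "الملكة"]
LOCATIONS = ["الأردن"]
FIRST_NAMES = ["عبدالله", "رانيا"]
LAST_NAMES = ["الحسين"]
ORDINALS = ["الثاني"]


def alternation(words):
    # Longest entries first, so "الملكة" is tried before "الملك".
    ordered = sorted(words, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in ordered) + ")"


PERSON = re.compile(
    rf"(?P<honorific>{alternation(HONORIFICS)})"
    rf"(?:\s+(?P<nasab>{alternation(LOCATIONS)}(?:ية|ي)))?"
    rf"\s+(?P<first_name>{alternation(FIRST_NAMES)})"
    rf"(?:\s+(?P<last_name>{alternation(LAST_NAMES)}))?"
    rf"(?:\s+(?P<number>{alternation(ORDINALS)}))?"
)


def find_person(text):
    """Return the matched rule components as a dict, or None."""
    match = PERSON.search(text)
    return match.groupdict() if match else None


king = find_person("الملك الأردني عبدالله الثاني")
queen = find_person("الملكة الأردنية رانيا")
```

Optional components that are absent come back as `None`, which makes it easy to normalize the extracted entity afterwards.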
Issues in the Arabic NER • Complex Morphological System (inflections) • Non-casing language (No initial capital for proper nouns) • Non-standardization and inconsistency in Arabic written text (typos, and spelling variants) • Ambiguity
Machine Translation • Direct • Transfer • Interlingua
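MT Approaches: MT Pyramid [Figure: pyramid with Direct at the base (source word → target word), Transfer in the middle (source syntax → target syntax), and Interlingua at the apex; Analysis runs up the source side and Generation runs down the target side]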
English-to-Arabic Transfer-based Approach • Source sentence (English) → Sentence Analysis (uses: morphological & syntactic analysis rules of English; English dictionary) → English parse tree → Transfer (uses: English-to-Arabic transformation rules; bilingual dictionary) → Arabic parse tree → Sentence Synthesis (uses: morphological generation & synthesis rules of Arabic; Arabic dictionary) → Target sentence (Arabic)
Transfer approach • Involves analysis, transfer, and generation components • If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component
Simple Transfer (1) [w_i:$1, w_i+1:$2, …, w_k:$k] (1 ≤ i ≤ k) → [w_k:$k, w_k-1:$k-1, …, w_i:$i] (1 ≤ i ≤ k)
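Simple Transfer example: Networks performance evaluation → (transfer) → تقييم أداء شبكة [Figure: the English parse tree np[noun networks (pl), noun performance (sg), noun evaluation (sg)] is transferred to the Arabic parse tree np[noun تقييم (sg), noun أداء (sg), noun شبكة (pl)], with the word order reversed]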
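The reordering in this example can be sketched as a reversal of the noun sequence combined with a bilingual dictionary lookup. The dictionary below contains only the three words of the example, and the flat list of (word, number) pairs is an assumed simplification of the parse trees; number features are carried over for a later morphological generation step.

```python
# Toy bilingual dictionary limited to the example (assumption).
BILINGUAL = {
    "networks": "شبكة",
    "performance": "أداء",
    "evaluation": "تقييم",
}


def simple_transfer(source_np):
    """Reverse a noun sequence and translate each word, keeping its features.

    source_np: list of (english_word, number) pairs.
    """
    return [(BILINGUAL[word], number) for word, number in reversed(source_np)]


source = [("networks", "pl"), ("performance", "sg"), ("evaluation", "sg")]
target = simple_transfer(source)
```

The output still carries the feature `pl` on “شبكة”; producing the inflected surface form is left to the Arabic synthesis component.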