Arabic NLP: Challenges & Opportunities • Dr. Samir Tartir • Scientific Day, Faculty of Information, Philadelphia University • May 15th, 2013
General Information • History • (Classical) Arabic has remained unchanged, intelligible, and functional for more than fifteen centuries. • Strategically important • 330 million speakers living in an important region (huge oil reserves, sacred sites). • 1.4 billion Muslims use it in their prayers. • Cultural and literary heritage • Closely associated with Islam
Varieties • Classical Arabic • Modern Standard Arabic (MSA) • Dialects
Arabic Language Characteristics • Highly structured • Highly derivational language • Rich morphology • Relatively free word order • Modern written Arabic usually omits diacritics (short vowels)
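The last point is easy to see in code. The following is a minimal Python sketch, not taken from the slides, that strips diacritics by relying on the fact that Unicode classifies the Arabic short-vowel marks as combining characters. The function name `strip_diacritics` and the three sample words are illustrative choices.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Three fully diacritized words: "he knew", "knowledge", "flag".
readings = ["عَلِمَ", "عِلْم", "عَلَم"]

# Without diacritics all three collapse to the same written form.
bare_forms = {strip_diacritics(word) for word in readings}
print(bare_forms)
```

All three readings reduce to one undiacritized spelling, which is why a reader (or a program) has to recover the intended word from context.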
Example* [slide image not preserved] *Source: Microsoft Arabic NLP Toolkit (ATK) for Academia in the Arab World, presentation, 11/2012
Arabic Language Characteristics • Synonymy and confusion of non-standardized terms • Thermometer: محر، محرار، مقياس حرارة، ميزان حرارة، ترمومتر (five competing terms, the last a transliteration) • Technical translation • Hydrometer: جهاز قياس كثافة السوائل (literally "device for measuring the density of liquids") • Uncle, parent…
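One common way to cope with non-standardized terms in search is to map every variant to a shared concept and expand queries accordingly. The sketch below is an assumption about how that might look, not a method from the slides: the variant list is the one given above, while the concept key `"thermometer"` and the helper `expand_query` are hypothetical names.

```python
# Variant terms from the slide, grouped under an illustrative concept key.
SYNONYMS = {
    "thermometer": ["محر", "محرار", "مقياس حرارة", "ميزان حرارة", "ترمومتر"],
}

# Reverse index: each variant term points back to its concept.
TERM_TO_CONCEPT = {
    term: concept for concept, terms in SYNONYMS.items() for term in terms
}

def expand_query(query: str) -> list[str]:
    """Return all known variants of the query term, or the term itself."""
    concept = TERM_TO_CONCEPT.get(query)
    return SYNONYMS[concept] if concept else [query]

print(expand_query("محرار"))
```

A search for any one variant would then also retrieve documents that use the other four.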
Letters • One letter, one sound • Letters change shape depending on their position in the word • Hamza is written in several forms • No capital letters • Normalization can reduce spelling variation
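A rough Python sketch of what such normalization might involve is shown here. The specific rules (folding hamza-carrying alef forms to bare alef, removing the tatweel stretching character, and using Unicode NFKC to map positional presentation forms back to base letters) are a common convention assumed for illustration, not a set of rules stated in the slides.

```python
import re
import unicodedata

# Assumed rule: alef with madda or hamza (آ أ إ) is folded to bare alef (ا).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"
TATWEEL = "\u0640"  # purely decorative stretching character

def normalize(text: str) -> str:
    # NFKC maps positional presentation forms (U+FE70..U+FEFF) to base letters.
    text = unicodedata.normalize("NFKC", text)
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)
    return text.replace(TATWEEL, "")

print(normalize("أحمد"), normalize("إبراهيم"))
```

Normalization of this kind trades precision for recall: spelling variants match each other, at the cost of merging a few words that differ only in hamza placement.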
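Ambiguity • Homographs • قدم ("foot", "he presented", or "he arrived", depending on the unwritten vowels) • Internal word structure ambiguity • بعقوبة (the city "Baqubah", or "with a punishment") • Syntactic ambiguity • قابلت مدير البنك الجديد ("I met the new bank manager" or "I met the manager of the new bank") • Semantic ambiguity • يحب علي احمد اكثر من ابراهيم ("Ali likes Ahmad more than Ibrahim does" or "Ali likes Ahmad more than he likes Ibrahim") • Anaphoric ambiguity • قابل الصحفي الوزير الذي انتقده ("The journalist met the minister who criticized him" or "the minister whom he criticized")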
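The internal word structure case can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real morphological analyzer: the two-entry `LEXICON`, the small `PREFIXES` table, and the function `analyses` are all hypothetical.

```python
# Hypothetical toy lexicon; a real system would use a morphological analyzer.
LEXICON = {
    "بعقوبة": "Baqubah (city name)",
    "عقوبة": "punishment",
}

# A few single-letter proclitics that attach directly to the next word.
PREFIXES = {"ب": "with/by", "و": "and", "ل": "for/to"}

def analyses(word: str) -> list[tuple[tuple[str, ...], str]]:
    """List every way the word can be read as a whole or as prefix + stem."""
    results = []
    if word in LEXICON:
        results.append(((word,), LEXICON[word]))
    for prefix, gloss in PREFIXES.items():
        stem = word[len(prefix):]
        if word.startswith(prefix) and stem in LEXICON:
            results.append(((prefix, stem), f"{gloss} + {LEXICON[stem]}"))
    return results

for segmentation, reading in analyses("بعقوبة"):
    print(segmentation, reading)
```

Both analyses are valid on the surface, so choosing between them needs context from the rest of the sentence.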
NLP Tasks • Automatic summarization • Machine translation • Named entity recognition (NER) • Natural language generation • Natural language understanding • Optical character recognition (OCR) • Question answering • Sentiment analysis • Speech recognition • Word sense disambiguation • Information retrieval (IR) • Speech processing • Text-to-speech • Natural language search • Automated essay scoring • etc.
Question Answering** [slide figure not preserved] **Source: Hammo et al., "QARAB: A Question Answering System to Support the Arabic Language," Workshop on Computational Approaches to Semitic Languages, ACL 2002
Arabic NLP Issues • Lack of tools • Lack of linguistic references • Lack of training data
Available Tools • Arabic Treebank • Arabic WordNet • MySQL database • SUMO Ontology • Java • Microsoft Arabic Toolkit (ATK)
Summary • Arabic is difficult to process automatically • Progress has been made • More work is being done on different parts of the problem • Any progress is valuable • Business • Personal • Governmental