Explore the orthography, morphology, and syntax of Semitic languages and how they impact machine translation. Focus on Semitic language similarities, variations, and translation divergences.
MT Summit IX Workshop: Machine Translation for Semitic Languages • Semitic Linguistic Phenomena and Variations • Nizar Habash • University of Maryland Institute for Advanced Computer Studies
Road Map • Introduction • Orthography • Morphology • Syntax • Translation Divergences • Conclusion
Introduction • What this talk is about • Similarities that define “the Semitic family” • Variations differentiating members within the family • Similarities do not go beyond morphology and syntax • Relevance to NLP and MT • Most researchers focus on one Semitic language • Modern Standard Arabic (henceforth, A) • Modern Hebrew (henceforth, H) • Arabic Dialect: Palestinian Arabic (henceforth, P)
Road Map • Introduction • Orthography • Phonology • Scripts • Spelling • Ambiguity • Morphology • Syntax • Translation Divergences • Conclusion
Orthography: Script • Alphabets • Graphemic Variants • ك ككك (27 out of 36), כ ך (5 out of 22) • Encoding issues • Optional diacritics • Some Vowels שַ שֵ سَ سُ • Lack of vowel שְ سْ • Consonantal Doubling שּ سّ
Orthography: Spelling • Mostly consonantal spelling • سلام = slam = salām, שלום = ʃlvm = ʃalom • Dual use of ا و ي / א ו י (a, w/v, j) as consonant and vowel • Diacritics as semantic markers • זכָר (zaxar, male) vs. זכַר (zaxar, to remember) • كتب (kataba, to write) vs. كُتب (kutiba, to be written)
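The effect of optional diacritics can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: it assumes that removing every Unicode combining mark (category `Mn`) is an acceptable stand-in for "undiacritized spelling", and the function name `strip_diacritics` is hypothetical. The vocalized forms are written with escapes so that each mark is explicit.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (category Mn): Arabic harakat, Hebrew niqqud."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Fully vocalized forms of the slide's examples (assumed vocalization).
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"      # kataba, "to write"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"      # kutiba, "to be written"
ZAXAR_MALE = "\u05D6\u05B8\u05DB\u05B8\u05E8"        # zaxar, "male"
ZAXAR_REMEMBER = "\u05D6\u05B8\u05DB\u05B7\u05E8"    # zaxar, "to remember"

# Without diacritics each pair collapses to one written form,
# which is the ambiguity an MT system has to resolve from context.
arabic_bare = {strip_diacritics(KATABA), strip_diacritics(KUTIBA)}
hebrew_bare = {strip_diacritics(ZAXAR_MALE), strip_diacritics(ZAXAR_REMEMBER)}
print(len(arabic_bare), len(hebrew_bare))
```

Both sets end up with a single member, so the distinction carried by the diacritics is lost in ordinary text.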
Orthography: Spelling • Hebrew • Full spelling vs. “defective” spelling (כתיב מלא, כתיב חסר) • kotel: כותל / כתל (wall) • Arabic • Morphophonemic spelling • Feminine marker ة (ta marbuta) • كبير (kabīr, big ♂), كبيرة (kabīra, big ♀) • Derivation marker • hawa: هوى (to love), هوا (air) • Hamza variants (6 characters for one phoneme) • ء أ آ إ ؤ ئ • بهاء بهاؤه بهائه
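The "6 characters for one phoneme" point can be made concrete with a small normalization sketch. This is an illustrative mapping chosen for this example, not a standard scheme and not something the talk prescribes: every hamza carrier is rewritten as the bare hamza, and alef madda is treated as hamza plus long a. The name `collapse_hamza` is hypothetical.

```python
HAMZA = "\u0621"  # bare hamza

# Assumed mapping: each hamza "seat" becomes the bare hamza.
_HAMZA_TABLE = str.maketrans({
    "\u0623": HAMZA,             # alef with hamza above
    "\u0625": HAMZA,             # alef with hamza below
    "\u0624": HAMZA,             # waw with hamza
    "\u0626": HAMZA,             # ya with hamza
    "\u0622": HAMZA + "\u0627",  # alef madda = hamza + long a
})


def collapse_hamza(word: str) -> str:
    """Rewrite all hamza variants with a single character."""
    return word.translate(_HAMZA_TABLE)


BAHA = "\u0628\u0647\u0627\u0621"              # the bare form from the slide
BAHA_WAW = "\u0628\u0647\u0627\u0624\u0647"    # slide form with hamza on waw
BAHA_YA = "\u0628\u0647\u0627\u0626\u0647"     # slide form with hamza on ya

print(collapse_hamza(BAHA_WAW) == collapse_hamza(BAHA_YA))
```

The two suffixed forms become identical after collapsing, which shows that the choice of hamza seat carries information (here, the vowel that follows) beyond the consonant itself, so this kind of normalization trades spelling robustness for lost detail.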
Orthography: Ambiguity (A, H) • [Figure: letters mapped to phonemes] • A letters: ى ئ ؤ إ آ أ ء ي و ه ن م ل ك ق ف غ ع ظ ط ض ص ش س ز ر ذ د خ ح ج ث ة ت ب ا • A phonemes: ī j ū w h n m l k q f ʁ ʕ ḍ̄ ṭ ḍ ṣ ʃ s z r đ d x ħ ʤ θ t b ā ʔ • H letters: ת ש ר ק צ פ ע ס נ מ ל כ י ט ח ז ו ה ד ג ב א • H phonemes: ʃ r ts p f s n m l k j e i t x z o u h d g v b a
Orthography: Ambiguity (P, H) • [Figure: letters mapped to phonemes] • P letters: ى ئ ؤ إ آ أ ء ي و ه ن م ل ك ق ف غ ع ظ ط ض ص ش س ز ر ذ د خ ح ج ث ة ت ب ا • P phonemes: ī j ū w h n m l k q f ʁ ʕ ḍ̄ ṭ ḍ ṣ ʃ s z r đ d x ħ ʤ θ t b ā ʔ ẓ ē ō • H letters: ת ש ר ק צ פ ע ס נ מ ל כ י ט ח ז ו ה ד ג ב א • H phonemes: ʃ r ts p f s n m l k j e i t x z o u h d g v b a
Road Map • Introduction • Orthography • Morphology • Derivational • Inflectional • Noun Inflections • Verb Inflections • Syntax • Translation Divergences • Conclusion
Morphology: Derivational • Roots and Patterns • Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random • Root K-T-B: A ك ت ب, H כ ת ב • Pattern: A مَ ? ? و ?, H ? ? ו ? • Result: A مكتوب, H כתוב
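Root-and-pattern derivation can be sketched as slot filling. The code below is a minimal illustration under stated assumptions: patterns are written with the digits 1 to 3 standing for the three root consonants, the function name `interdigitate` is hypothetical, and the transliterated Hebrew output ignores the b/v alternation, so it yields "katub" where the spoken form is katuv.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the digit slots 1-3 of a pattern with the root consonants."""
    if len(root) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


# Transliterated: the Arabic pattern ma-C1C2-ū-C3 and the Hebrew C1-a-C2-u-C3.
print(interdigitate("ktb", "ma12ū3"))  # maktūb
print(interdigitate("ktb", "1a2u3"))   # katub

# Same idea in Arabic script: meem + C1 C2 + waw + C3.
ARABIC_ROOT = "\u0643\u062A\u0628"
ARABIC_PATTERN = "\u0645" + "12" + "\u0648" + "3"
print(interdigitate(ARABIC_ROOT, ARABIC_PATTERN))
```

The sketch covers only the compositional part of the slide's formula; the "Idiosyncrasy" factor, meaning that the result is not fully predictable from root and pattern, is exactly what such a function cannot capture.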
Morphology: Root Meaning • KTB: writing “stuff” • كتاب (book) • كتب / כתב (write) • כתיב (spelling) • مكتبة (library) • מכתב / مكتوب (letter) • כתובת (address) • مكتب (office) • كاتب / כתב (writer)
Morphology: Root Meaning • LHM-1 • لحم laHm (meat) • לחם lexem (bread)
Morphology: Root Meaning • LHM-2 (battle sense) • ملحمة • Fierce battle, massacre, epic • מלחמה לוחמה לחם לוחם לחימה • War, battle, quarrel, conflict, combat, warfare, belligerence, fighting, quarreling, fighter, militarism, militancy, bellicosity
Morphology: Root Meaning • LHM-3 (Solder sense) • لحم تلاحم التحم ملتحم لُحمة • Weld, solder, get stuck, cling together, merged, fused, kinship • לחם הלחים מולחם מלחם • Solder, soldered, soldering iron
Morphology: Root Meaning • LHM-4 (Conjunctiva sense) • لحمية • conjunctiva • לחמית • conjunctiva
Morphology: Noun Inflections • Morpheme types: conj, prep, noun, article, plural, poss • وكبيوتنا • و + ك + بيوت + نا • And-like-houses-our • And like our houses • שבבית • ש + ב + ה + בית • That-in-the-house • Which is in the house • Arabic broken plurals • Hebrew ambiguous definiteness
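The two segmentations above can be reproduced with a short sketch. It assumes plain concatenation for the Arabic example and one orthographic rule for Hebrew: the article ה is not written after the prepositions ב, כ, ל. The function names are hypothetical, and the rule is a simplification that looks only at spelling.

```python
HE = "\u05D4"                                    # Hebrew definite article
ABSORBING_PREPS = {"\u05D1", "\u05DB", "\u05DC"}  # bet, kaf, lamed


def join_arabic(morphemes: list[str]) -> str:
    """Concatenate Arabic clitics and stem in order."""
    return "".join(morphemes)


def join_hebrew(morphemes: list[str]) -> str:
    """Concatenate, dropping the article when a preposition absorbs it."""
    out: list[str] = []
    for m in morphemes:
        if m == HE and out and out[-1] in ABSORBING_PREPS:
            continue  # the article leaves no trace in unvocalized spelling
        out.append(m)
    return "".join(out)


# and + like + houses + our
ARABIC = ["\u0648", "\u0643", "\u0628\u064A\u0648\u062A", "\u0646\u0627"]
# that + in + the + house, and the same without the article
HEBREW_DEF = ["\u05E9", "\u05D1", HE, "\u05D1\u05D9\u05EA"]
HEBREW_INDEF = ["\u05E9", "\u05D1", "\u05D1\u05D9\u05EA"]

print(join_arabic(ARABIC))
print(join_hebrew(HEBREW_DEF) == join_hebrew(HEBREW_INDEF))
```

The definite and indefinite Hebrew inputs produce the same written word, which is one concrete source of the "ambiguous definiteness" noted on the slide.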
Morphology: Verb Inflections • Morpheme types: conj, verb, tense, neg, subj, IOBJ, object • A: وسنكتبها • And-will-we-write-it • And we will write it • H: ואהבתיה • And-loved-I-her • And I loved her • P: وماحتستعمليلوش • And-not-will-use-you-for-it-not • And you will not use for it
Morphology: Verb Inflections • Perfect Verb Derivation (Suffixes only) • Imperfect Verb Derivation (Prefix+Suffix)
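The contrast between suffix-only and prefix-plus-suffix inflection can be sketched with a few Modern Standard Arabic forms in transliteration. This is a partial paradigm for a regular verb, chosen for illustration; the table names and the feature labels such as `"2fs"` are assumptions of this sketch, and duals, most plurals, and weak verbs are left out.

```python
# Perfect: stem + suffix. Imperfect: prefix + stem + suffix.
PERFECT_SUFFIXES = {
    "1s": "tu", "2ms": "ta", "2fs": "ti",
    "3ms": "a", "3fs": "at", "1p": "nā",
}
IMPERFECT_AFFIXES = {
    "1s": ("ʔa", "u"), "2ms": ("ta", "u"), "2fs": ("ta", "īna"),
    "3ms": ("ya", "u"), "3fs": ("ta", "u"), "1p": ("na", "u"),
}


def perfect(stem: str, features: str) -> str:
    """Inflect a perfect stem such as 'katab' with a suffix only."""
    return stem + PERFECT_SUFFIXES[features]


def imperfect(stem: str, features: str) -> str:
    """Inflect an imperfect stem such as 'ktub' with prefix and suffix."""
    prefix, suffix = IMPERFECT_AFFIXES[features]
    return prefix + stem + suffix


for f in ("1s", "3ms", "2fs", "1p"):
    print(f, perfect("katab", f), imperfect("ktub", f))
```

Note that the imperfect prefix alone is ambiguous (ta- appears in three cells here), so person, gender, and number are only recoverable from prefix and suffix together.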
Road Map • Introduction • Orthography • Morphology • Syntax • Sentence Structure • Noun Phrase Structure • Translation Divergences • Conclusion
Sentence Structure • Sentence types • Copular sentences • Verbal sentences • Copular sentences • Topic + Complement • Topic definite, complement indefinite • الكلب كبير / הכלב גדול • The-dog big (topic, comp) • * كلب كبير • dog big
Sentence Structure • Verbal sentences • The children wrote the poems • A: Verb Subject Object • كتب الاولاد الاشعار • Wrote the-children the-poems • H, P: Subject Verb Object • הילדים כתבו את השירים • The-children wrote obj the-poems • الاولاد كتبو الاشعار • The-children wrote the-poems
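The word-order contrast can be expressed as a tiny linearization table. The sketch below works on English glosses rather than on the actual words, and it assumes the orders given on the slide are the only ones generated. It also inserts the Hebrew object marker unconditionally, although the marker applies to definite objects; all names are hypothetical.

```python
# Constituent order per language, as stated on the slide.
WORD_ORDER = {
    "A": ("verb", "subject", "object"),
    "H": ("subject", "verb", "object"),
    "P": ("subject", "verb", "object"),
}
# Hebrew marks the (definite) direct object with a separate particle.
OBJECT_MARKER = {"H": "obj"}


def linearize(clause: dict[str, str], lang: str) -> str:
    """Order the glossed constituents of a verbal sentence for one language."""
    tokens: list[str] = []
    for role in WORD_ORDER[lang]:
        if role == "object" and lang in OBJECT_MARKER:
            tokens.append(OBJECT_MARKER[lang])
        tokens.append(clause[role])
    return " ".join(tokens)


CLAUSE = {"subject": "the-children", "verb": "wrote", "object": "the-poems"}
for lang in ("A", "H", "P"):
    print(lang, linearize(CLAUSE, lang))
```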
Noun Phrase • Noun Adjective • Noun-Adjective Agreement • number, gender, definiteness
Noun Phrase • اضافة / סמיכות (idafa/smixut) • Noun1 of Noun2 encoded structurally • Noun1-indefinite Noun2-definite • ملك الاردن / מלך ירדן • king Jordan = the king of Jordan / Jordan’s king • Noun1 form change • Feminine (H and P) • מלכה + ירדן → מלכת ירדן (Queen of Jordan) • Plural (A and H) • מלכים + ירדן → מלכי ירדן (Kings of Jordan) • Alternatives (only H and P) • Noun1 <particle> Noun2 • الملك تبع الاردن the-king belonging-to Jordan • המלך של ירדן the-king that-for Jordan
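The Noun1 form change in Hebrew can be sketched at the level of spelling. This is a deliberately naive illustration covering only the two changes shown on the slide: a final ה becomes ת, and the plural ending ים loses its final letter. Real construct forms involve vowel changes and many exceptions, and the function name is hypothetical.

```python
def hebrew_construct(noun: str) -> str:
    """Orthographic construct-state form for the two slide patterns."""
    if noun.endswith("\u05D9\u05DD"):   # plural ending yod + final mem
        return noun[:-1]                # drop the final mem
    if noun.endswith("\u05D4"):         # final he (feminine singular here)
        return noun[:-1] + "\u05EA"     # he becomes tav
    return noun                         # no written change


QUEEN = "\u05DE\u05DC\u05DB\u05D4"          # queen
KINGS = "\u05DE\u05DC\u05DB\u05D9\u05DD"    # kings
KING = "\u05DE\u05DC\u05DA"                 # king
JORDAN = "\u05D9\u05E8\u05D3\u05DF"         # Jordan

for head in (KING, QUEEN, KINGS):
    print(hebrew_construct(head) + " " + JORDAN)
```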
Road Map • Introduction • Orthography • Morphology • Syntax • Translation Divergences • Conclusion
Translation Divergences • Variations beyond syntax • How languages map semantics to syntax • As complex and diverse as any other language • Divergence Dimensions • Categorial Variation (develop → development) • Conflation (become frozen → freeze) • Inflation (freeze → become frozen) • Structural (enter the room → enter into the room) • Head Swap (swim across → cross swimming) • Thematic (John likes Mary → Mary pleases John)
Translation Divergences: conflation • I have a dog • A: عندي كلب • at-me dog • H: יש לי כלב • there for-me dog • [Figure: dependency trees; node labels: *, have, יש, عند, كلب, I, dog, כלב, ל, انا, אני]
Translation Divergences: conflation • I am not here • A: لست هنا • I-am-not here • H: אני לא פה • I not here • [Figure: dependency trees; node labels: ليس, be, *, انا, هنا, I, not, here, אני, לא, פה]
Translation Divergences: thematic • I am cold • A: انا بردان • I cold • H: קר לי • cold for-me • [Figure: dependency trees; node labels: *, be, *, انا, بردان, I, cold, קר, ל, אני]
Translation Divergences: structural • I found the man • A: عثرت على الرجل • found-I upon the-man • H: מצאתי את האיש • found-I obj the-man • [Figure: dependency trees; node labels: عثر, find, מצא, انا, على, I, man, אני, את, رجل, איש]
Translation Divergences: structural • I found the man • A: عثرت على الرجل • found-I upon the-man • P: لقيت الرجال • found-I the-man • [Figure: dependency trees; node labels: عثر, find, لقى, انا, على, I, man, انا, رجال, رجل]
Translation Divergences: head swap and categorial • I swam across the river quickly • A: اسرعت عبور النهر سباحة • I-sped crossing the-river swimming • [Figure: dependency trees; node labels: اسرع, انا, عبور, سباحة, نهر, swim, I, across, quickly, river]
Translation Divergences: head swap and categorial • I swam across the river quickly • H: חציתי את הנהר בשחיה במהירות • I-crossed obj river in-swim speedily • [Figure: dependency trees; node labels: חצה, אני, את, ב, ב, נהר, שחיה, מהירות, swim, I, across, quickly, river]
Translation Divergences: head swap and categorial • [Figure: the three dependency trees side by side with part-of-speech labels] • E: swim (verb), across (prep), quickly (adverb) • A: اسرع (verb), عبور (noun), سباحة (noun) • H: חצה (verb), שחיה (noun), מהירות (noun)
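The comparison on this slide can be turned into a small check for categorial variation and head swap. The sketch below rests on an alignment read off the glosses in the preceding slides (for example, that the Arabic verb اسرع carries the meaning of English "quickly"). That alignment, the `HEAD` table, and the function names are assumptions of this example, not data supplied by the talk.

```python
# concept -> language -> (word, part of speech); alignment is assumed.
ALIGNMENT = {
    "swim": {"E": ("swim", "verb"), "A": ("سباحة", "noun"), "H": ("שחיה", "noun")},
    "across": {"E": ("across", "prep"), "A": ("عبور", "noun"), "H": ("חצה", "verb")},
    "quickly": {"E": ("quickly", "adverb"), "A": ("اسرع", "verb"), "H": ("מהירות", "noun")},
}
# Which concept is realized as the main verb of the sentence.
HEAD = {"E": "swim", "A": "quickly", "H": "across"}


def categorial_divergences(src: str, tgt: str) -> list[str]:
    """Concepts whose part of speech differs between two languages."""
    return [c for c, forms in ALIGNMENT.items() if forms[src][1] != forms[tgt][1]]


def head_swap(src: str, tgt: str) -> bool:
    """True when the two languages pick different concepts as the head."""
    return HEAD[src] != HEAD[tgt]


for pair in (("E", "A"), ("E", "H"), ("A", "H")):
    print(pair, categorial_divergences(*pair), head_swap(*pair))
```

Under this alignment the Arabic and Hebrew sentences diverge from each other as well as from English, which illustrates the talk's point that translation divergences also occur within the Semitic family.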
Conclusion • Many defining features of the Semitic family • Orthographic conventions, morphological derivation and inflection, phrase structure, etc • Many variations that create different kinds of ambiguities and problems • Phonology of orthography, Semantics of derivation and inflection • Do similarities extend beyond morphology and syntax? • Translation divergences within Semitic family • Ambiguity preservation between Semitic languages