GH-MAP: Rule Based Token Mapping for Translation between Sibling Language Pair: Gujarati-Hindi
Kalyani Patel, K. S. School of Business Management, Gujarat University. patel_kalyani_05@yahoo.co.in
Dr. Jyoti Pareek, Department of Computer Science, Gujarat University. drjyotipareek@yahoo.com
Contents • Introduction • Hindi-Gujarati : A comparative study • The Rule Base • Translation • Token Mapping Engine • Algorithm • Example • Evaluation • Conclusion ICON 2009
Introduction
• GH-MAP is designed for one particular language pair, to take advantage of the similarity between the sibling languages Gujarati and Hindi.
• It uses rule based token mapping for effective word-to-word translation.
• GH-MAP can be utilized for MT (machine translation), CLIR (cross-language information retrieval), GurjerNet, and multilingual dictionaries.
Hindi-Gujarati : A comparative study
• Both languages belong to the Indo-Aryan family (Hindi, Bangla, Assamese, Punjabi, Marathi, Oriya and Gujarati).
• Being in the same group, they show a high degree of structural similarity.
• Hindi and Gujarati have
  • bijectively mappable characters (Varna Maala), excluding ળ.
  • relatively free word order, where the noun groups can come in any order, generally followed by the verb group.
• Nouns in Hindi and Gujarati are inflected for case (direct or oblique), number (singular or plural), and gender (masculine or feminine). In addition, Gujarati also has a common gender.
• Verbs in both languages are inflected for gender, number, person, tense, aspect, modality, formality, and voice.
• Many words in the two languages have a shared origin (from Sanskrit) and, because of the shared culture, they usually also share meaning, e.g. (book) ‘પુસ્તક’/‘puswaka’ in Gujarati is similar to ‘पुस्तक’/‘puswaka’ in Hindi.
• A sentence in one language can be mapped to a sentence in the other language by substituting each word group in the source language with the appropriate word group in the target language.
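The character-level mapping can be illustrated with a minimal Python sketch. It assumes the mapping can be approximated by the fixed offset between the Unicode Gujarati block (U+0A80 to U+0AFF) and the Devanagari block (U+0900 to U+097F), which holds for the letters used here; the function name `gujarati_to_devanagari` is hypothetical and not part of GH-MAP.

```python
# Illustrative only: approximate the Gujarati -> Devanagari character mapping
# with the fixed offset between the two Unicode blocks.
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF
BLOCK_OFFSET = 0x0A80 - 0x0900  # distance between the two blocks


def gujarati_to_devanagari(text: str) -> str:
    """Shift every Gujarati-block character into the Devanagari block."""
    return "".join(
        chr(ord(ch) - BLOCK_OFFSET)
        if GUJARATI_START <= ord(ch) <= GUJARATI_END
        else ch
        for ch in text
    )


print(gujarati_to_devanagari("પુસ્તક"))  # the 'book' example from the slide
print(gujarati_to_devanagari("ળ"))       # yields ळ, not the ल that Hindi uses
```

The second call suggests why ળ is excluded from the bijection: a pure offset produces ळ, whereas the worked example in this deck replaces ળ with ल through a substring rule.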
The Rule Base
Rule Base for Translation:
• Domain Specific monolingual data
  • Stores typologically different words and their relations
• Domain Independent bilingual data
  • Stores cases, pronouns, adjectives, adverbs, etc.
• Substring Substitution rules
  • Stores the Hindi substring corresponding to each Gujarati substring, and the location of the substring
• Stem-Suffix rules
  • Stores bilingual stem and suffix rules
• Phrases
  • Stores bilingual compound words
Translation
GH-MAP translates a text in the source language into a text in the target language, retaining a flavor of the source language. GH-MAP utilizes the Token Mapping Engine for translation.
[Flowchart: START → Sentences in language 1 → Phrase? (Yes / No) → Tokenize the sentence → Token Mapping Engine → Language 2 tokens → Sentences in language 2 → STOP]
Token Mapping Engine
The Token Mapping Engine uses the Rule Base to find the match for a given token in the target language.
TME Algorithm
• For each SL (Source Language) word (token):
• Search the word in:
  • Table of pronouns, cases, adjectives. If a match is found, get the TL (Target Language) word from the same table. Go to step 7.
  • Table of domain specific words. If a match is found, get the corresponding TL word from the table of TL domain specific words. Go to step 7.
• Remove the suffix. Search for the stem in the table of stems. If a match is found, get the TL stem and the corresponding TL suffix. Generate the TL word. Go to step 7.
• Search repeatedly for a substring (affix) in the SL word. If a match is found, substitute the SL substring with the corresponding TL substring. Go to step 6.
• (Step 6) Transliterate the remaining non-translated characters into TL characters.
• (Step 7) Next
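The lookup order above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the table names, the function names, and the Unicode block-offset transliteration are assumptions, and the tables hold only the entries from the worked example in this deck. The sketch also ignores the substring location that the rule base is said to store.

```python
# Hypothetical miniature rule base, Gujarati -> Hindi.
CLOSED_CLASS = {"નું": "का"}   # pronouns, cases, adjectives
DOMAIN_WORDS = {}               # domain specific words (left empty here)
STEMS = {"ખીલ": "खिल"}          # SL stem -> TL stem
SUFFIXES = {"વું": "ना"}         # SL suffix -> TL suffix
SUBSTRINGS = {"ળ": "ल"}         # SL substring -> TL substring

GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF
BLOCK_OFFSET = 0x0A80 - 0x0900  # assumed block-offset transliteration


def transliterate(text: str) -> str:
    """Shift any remaining Gujarati characters into the Devanagari block."""
    return "".join(
        chr(ord(ch) - BLOCK_OFFSET)
        if GUJARATI_START <= ord(ch) <= GUJARATI_END
        else ch
        for ch in text
    )


def map_token(token: str) -> str:
    """Map one SL token to a TL token, trying the tables in order."""
    # Closed-class table, then domain specific words.
    if token in CLOSED_CLASS:
        return CLOSED_CLASS[token]
    if token in DOMAIN_WORDS:
        return DOMAIN_WORDS[token]
    # Strip a known suffix and look the stem up in the stem table.
    for sl_suffix, tl_suffix in SUFFIXES.items():
        if token.endswith(sl_suffix):
            stem = token[: -len(sl_suffix)]
            if stem in STEMS:
                return STEMS[stem] + tl_suffix
    # Apply every substring rule, then transliterate what is left.
    for sl_sub, tl_sub in SUBSTRINGS.items():
        token = token.replace(sl_sub, tl_sub)
    return transliterate(token)


tokens = ["કમળ", "નું", "ખીલવું"]
print([map_token(t) for t in tokens])
```

Each early `return` plays the role of "go to step 7" (move on to the next token), while falling through to the substring rules leads into the transliteration of step 6.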
Example (‘The lotus blossoms’ (E))
• Tokenize the sentence: ‘કમળ’/kamalYa + ‘નું’/nuM + ‘ખીલવું’/KIlavuM.
• The tokens are given to the Token Mapping Engine.
• The first token ‘કમળ’/kamalYa is translated by substituting the substring ‘ળ’/lYa with ‘ल’/la; the remaining Gujarati characters ‘કમ’/kama are transliterated to ‘कम’/kama, generating ‘कमल’/kamala.
• The second token ‘નું’/nuM is translated to ‘का’/kA using the Case (Karaka) table.
• The third token ‘ખીલવું’/KIlavuM is translated by first removing the suffix ‘વું’/vuM to obtain the stem ‘ખીલ’/KIla. The stem is searched in the table of stems and the corresponding Hindi stem ‘खिल’/Kila is obtained. The target-language suffix corresponding to ‘વું’/vuM, i.e. ‘ना’/nA, is then attached to generate ‘खिलना’/KilanA.
Evaluation
Thus we can conclude that, for the given test bed, GH-MAP produced about 88% correct translations.
Conclusion
• To the best of our knowledge, this is the first attempt at rule based token mapping for the sibling language pair Hindi-Gujarati.
• In this model, only lexical analysis is carried out.
• It requires only limited linguistic effort and tools to achieve the stated goal.
• The test results for a small set of data are encouraging.
• GH-MAP has some limitations, which need to be addressed.
Limitations
• karaka: का/kA (H) can be mapped to નો/no, ની/nI, નું/nuM, or ના/nA (G) [of (E)].
• pronoun: उसे/use (H) can be mapped to તેનો/weno, તેની/wenI, or તેને/wene (G) [He/She/It (E)].
• adjective: नया/nayA (H) can be mapped to નવું/navuM or નવા/navA (G) [New (E)].
• Work is in progress towards overcoming these limitations.
• With further enhancement of the rule base, GH-MAP is expected to yield better results.
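The one-to-many cases listed under Limitations can be made concrete with a small Python sketch. The table below copies only the three examples from the slide; the names `HI_TO_GU_CANDIDATES`, `candidates`, and `is_ambiguous` are hypothetical and not part of GH-MAP.

```python
# Hindi -> Gujarati candidate lists, copied from the Limitations slide.
HI_TO_GU_CANDIDATES = {
    "का": ["નો", "ની", "નું", "ના"],      # karaka
    "उसे": ["તેનો", "તેની", "તેને"],       # pronoun
    "नया": ["નવું", "નવા"],              # adjective
}


def candidates(hindi_token: str) -> list[str]:
    """Return every Gujarati form listed for a Hindi token."""
    return HI_TO_GU_CANDIDATES.get(hindi_token, [])


def is_ambiguous(hindi_token: str) -> bool:
    """A token is ambiguous when a table lookup alone yields several forms."""
    return len(candidates(hindi_token)) > 1


for word in HI_TO_GU_CANDIDATES:
    print(word, candidates(word), is_ambiguous(word))
```

A purely lexical, token-by-token lookup has no basis for choosing among these candidates, which is presumably why the slide lists them as open problems.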