GH-MAP: Rule Based Token Mapping For Translation between Sibling Language Pair: Gujarati-Hindi

Kalyani Patel, K.S. School of Business Management, Gujarat University. patel_kalyani_05@yahoo.co.in
Dr. Jyoti Pareek, Department of Computer Science, Gujarat University. drjyotipareek@yahoo.com

Contents
• Introduction
• Hindi-Gujarati: A comparative study
• The Rule Base
• Translation
• Token Mapping Engine
• Algorithm
• Example
• Evaluation
• Conclusion
ICON 2009

Introduction
• GH-MAP is designed for one particular language pair, to take advantage of the similarity between the sibling languages Gujarati and Hindi.
• It uses rule-based token mapping for effective word-to-word translation.
• GH-MAP can be utilized for machine translation (MT), cross-language information retrieval (CLIR), GurjerNet, and multilingual dictionaries.

Hindi-Gujarati: A comparative study
• Both belong to the Indo-Aryan family (Hindi, Bangla, Assamese, Punjabi, Marathi, Oriya and Gujarati).
• Because they belong to the same group, there is a high degree of structural similarity.
• Hindi and Gujarati have:
  • bijectively mappable characters (Varna Maala), excluding ळ.
  • relatively free word order, where the noun groups can come in any order, generally followed by the verb group.
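The character-level mapping can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the mapping can be approximated by the fixed offset between the Unicode Gujarati block (U+0A80 to U+0AFF) and the Devanagari block (U+0900 to U+097F), and the function and constant names are hypothetical.

```python
# Assumption: core letters, vowel signs and digits sit at parallel positions
# in the Unicode Gujarati and Devanagari blocks, 0x180 code points apart.
GUJARATI_TO_DEVANAGARI_OFFSET = 0x0A80 - 0x0900


def _is_mappable_gujarati(ch: str) -> bool:
    """True for Gujarati signs, letters, vowel signs, virama and digits."""
    cp = ord(ch)
    return 0x0A81 <= cp <= 0x0ACD or 0x0AE6 <= cp <= 0x0AEF


def gujarati_to_devanagari(text: str) -> str:
    """Shift each mappable Gujarati character to its Devanagari counterpart.

    Any other character (spaces, Latin letters, punctuation) is kept as-is.
    """
    return "".join(
        chr(ord(ch) - GUJARATI_TO_DEVANAGARI_OFFSET)
        if _is_mappable_gujarati(ch)
        else ch
        for ch in text
    )


# Gujarati 'pustak' (book) becomes the identical Hindi word.
print(gujarati_to_devanagari("\u0aaa\u0ac1\u0ab8\u0acd\u0aa4\u0a95"))
```

Under this offset the Gujarati letter ળ lands on ळ, the character the slide lists as the exception, so a word containing it needs a rule-based substitution rather than plain transliteration.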

Hindi-Gujarati: A comparative study (continued)
• Nouns in Hindi and Gujarati are inflected for case (direct or oblique), number (singular or plural), and gender (masculine or feminine). Gujarati additionally has a common (neuter) gender.
• Verbs in both languages are inflected for gender, number, person, tense, aspect, modality, formality, and voice.

Hindi-Gujarati: A comparative study (continued)
• Many words in the two languages have a shared origin (Sanskrit) and, because of the shared culture, they usually also share meaning. For example, ‘પુસ્તક’/‘puswaka’ (book) in Gujarati is similar to ‘पुस्तक’/‘puswaka’ in Hindi.
• A sentence in one language can be mapped to a sentence in the other by substituting each word group in the source language with the appropriate word group in the target language.

The Rule Base
Rule Base for Translation:
• Domain-specific monolingual data
  • Stores typologically different words and their relations
• Domain-independent bilingual data
  • Stores cases, pronouns, adjectives, adverbs, etc.
• Substring substitution rules
  • Stores the Hindi substring corresponding to each Gujarati substring, and the location of the substring
• Stem-suffix rules
  • Stores bilingual stem and suffix rules
• Phrases
  • Stores bilingual compound words
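One way to picture the five components is as a small container of lookup tables. The sketch below is an assumption about layout, not the authors' schema: the class and field names are hypothetical, the domain-specific data is simplified to a single dictionary, and the `location` values are placeholders. The sample entries are the ones used in the slides' worked example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubstringRule:
    """One Gujarati-to-Hindi substring substitution."""
    source: str            # Gujarati substring
    target: str            # corresponding Hindi substring
    location: str = "any"  # hypothetical values: "start", "end", "any"


@dataclass
class RuleBase:
    # Domain-independent bilingual data: cases, pronouns, adjectives, adverbs.
    closed_class: dict[str, str] = field(default_factory=dict)
    # Domain-specific words (simplified here to one bilingual dictionary).
    domain_words: dict[str, str] = field(default_factory=dict)
    # Substring substitution rules.
    substrings: list[SubstringRule] = field(default_factory=list)
    # Stem-suffix rules, kept as two bilingual tables.
    stems: dict[str, str] = field(default_factory=dict)
    suffixes: dict[str, str] = field(default_factory=dict)
    # Bilingual compound words.
    phrases: dict[str, str] = field(default_factory=dict)


# Entries taken from the 'lotus blossoms' example in the slides.
example_rules = RuleBase(
    closed_class={"નું": "का"},
    substrings=[SubstringRule("ળ", "ल")],
    stems={"ખીલ": "खिल"},
    suffixes={"વું": "ना"},
)
```

Keeping each rule type in its own table mirrors the order in which the Token Mapping Engine consults them.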

Translation
• GH-MAP translates a text in the source language into a text in the target language, retaining the flavor of the source language.
• GH-MAP uses the Token Mapping Engine for translation.
[Flowchart: START, sentences in language 1, phrase? (yes/no), tokenize the sentence, Token Mapping Engine, language 2 tokens, sentences in language 2, STOP]
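The overall flow can be sketched as a small driver function. This is an illustration under stated assumptions: the "yes" branch of the phrase check is assumed to return the stored phrase translation directly, tokenization is simplified to whitespace splitting (the slides' example also separates a case marker from its noun), and `translate_sentence` and `map_token` are hypothetical names.

```python
from typing import Callable, Dict


def translate_sentence(
    sentence: str,
    phrases: Dict[str, str],
    map_token: Callable[[str], str],
) -> str:
    """Translate one sentence by phrase lookup or token-by-token mapping."""
    # Phrase branch: a known compound expression is translated as a unit.
    if sentence in phrases:
        return phrases[sentence]
    # Otherwise tokenize (assumption: whitespace-separated tokens) ...
    tokens = sentence.split()
    # ... map each token independently, and reassemble in the same order,
    # which preserves the source word order.
    return " ".join(map_token(token) for token in tokens)


# Toy usage with a stand-in mapper that only upper-cases tokens.
print(translate_sentence("a b c", {"x y": "Z"}, str.upper))  # prints "A B C"
print(translate_sentence("x y", {"x y": "Z"}, str.upper))    # prints "Z"
```

Because tokens are mapped one at a time and never reordered, the output keeps the source sentence structure, which is what lets the translation retain the flavor of the source language.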

Token Mapping Engine
The Token Mapping Engine uses the Rule Base to find the match for a given token in the target language.

TME Algorithm
1. For each SL (Source Language) word (token), search the word in:
2. The table of pronouns, cases, and adjectives. If a match is found, get the TL (Target Language) word from the same table. Go to step 7.
3. The table of domain-specific words. If a match is found, get the corresponding TL word from the table of TL domain-specific words. Go to step 7.
4. Remove the suffix and search for the stem in the table of stems. If a match is found, get the TL stem and the corresponding TL suffix, and generate the TL word. Go to step 7.
5. Search repeatedly for substrings (affixes) in the SL word. If a match is found, substitute the SL substring with the corresponding TL substring. Go to step 6.
6. Transliterate the remaining untranslated characters to TL characters.
7. Next token.
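The lookup cascade above can be sketched as a single function. This is a minimal illustration rather than the authors' engine: the table contents are only the entries from the slides' worked example, transliteration is approximated by the fixed Unicode offset between the Gujarati and Devanagari blocks, and all names (`map_token`, `CLOSED_CLASS`, and so on) are hypothetical.

```python
# Tables hold only the entries used in the slides' worked example.
CLOSED_CLASS = {"નું": "का"}   # pronouns, cases, adjectives
DOMAIN_WORDS = {}              # domain-specific words (empty in this sketch)
STEMS = {"ખીલ": "खिल"}         # bilingual stems
SUFFIXES = {"વું": "ना"}        # bilingual suffixes
SUBSTRINGS = {"ળ": "ल"}        # substring substitution rules

# Assumption: Gujarati and Devanagari code points differ by a fixed offset.
OFFSET = 0x0A80 - 0x0900


def transliterate(text: str) -> str:
    """Shift Gujarati letters and signs to Devanagari; keep everything else."""
    return "".join(
        chr(ord(ch) - OFFSET) if 0x0A81 <= ord(ch) <= 0x0ACD else ch
        for ch in text
    )


def map_token(token: str) -> str:
    """Map one Gujarati token to Hindi, trying the rule types in order."""
    # Table of pronouns, cases, adjectives.
    if token in CLOSED_CLASS:
        return CLOSED_CLASS[token]
    # Table of domain-specific words.
    if token in DOMAIN_WORDS:
        return DOMAIN_WORDS[token]
    # Strip a known suffix and look the stem up.
    for sl_suffix, tl_suffix in SUFFIXES.items():
        if token.endswith(sl_suffix):
            stem = token[: -len(sl_suffix)]
            if stem in STEMS:
                return STEMS[stem] + tl_suffix
    # Substitute every known substring.
    for sl_sub, tl_sub in SUBSTRINGS.items():
        token = token.replace(sl_sub, tl_sub)
    # Transliterate whatever Gujarati characters remain.
    return transliterate(token)


print([map_token(t) for t in ["કમળ", "નું", "ખીલવું"]])
# prints ['कमल', 'का', 'खिलना']
```

Ordering the checks from whole-word tables down to character transliteration means the most specific rule available always wins, and transliteration acts as the fallback for words that no rule covers.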

Example (‘The lotus blossoms’ (E))
• Tokenize the sentence: ‘કમળ’/kamalYa + ‘નું’/nuM + ‘ખીલવું’/KIlavuM.
• The tokens are given to the Token Mapping Engine.
• The first token ‘કમળ’/kamalYa is translated by substituting the substring ‘ળ’/lYa with ‘ल’/la; the remaining Gujarati characters ‘કમ’/kama are transliterated to ‘कम’/kama, generating ‘कमल’/kamala.

• The second token ‘નું’/nuM is translated to ‘का’/kA using the Case (Karaka) table.
• The third token ‘ખીલવું’/KIlavuM is translated by first removing the suffix ‘વું’/vuM to obtain the stem ‘ખીલ’/KIla. The stem is searched in the table of stems and the corresponding Hindi stem ‘खिल’/Kila is obtained. The target-language suffix corresponding to ‘વું’/vuM, i.e. ‘ना’/nA, is then attached to generate ‘खिलना’/KilanA.

[Figure: Contribution of various approaches in translation]

Evaluation
For the given test bed, GH-MAP produced about 88% correct translations.

Conclusion
• To the best of our knowledge, this is the first attempt at rule-based token mapping for the sibling language pair Hindi-Gujarati.
• In this model, only lexical analysis is carried out.
• It requires only limited linguistic effort and tools to achieve this goal.
• The test results for a small set of data are encouraging.
• GH-MAP has some limitations, which need to be addressed.

Limitations
• Karaka: का/kA (H) can be mapped to નો/no, ની/nI, નું/nuM or ના/nA (G) [of (E)].
• Pronoun: उसे/use (H) can be mapped to તેનો/weno, તેની/wenI or તેને/wene (G) [He/She/It (E)].
• Adjective: नया/nayA (H) can be mapped to નવું/navuM or નવા/navA (G) [New (E)].
• Work is in progress towards overcoming these limitations.
• With further enhancement of the rule base, GH-MAP is expected to yield better results.
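The common thread in these cases is a one-to-many mapping that a plain word table cannot resolve. The sketch below only illustrates that problem with the candidates listed above. It is not a proposed fix, and the names `HI_TO_GU_CANDIDATES`, `candidates_for` and `pick_unique` are hypothetical.

```python
from typing import List, Optional

# Hindi word -> Gujarati candidates, exactly as listed in the limitations.
HI_TO_GU_CANDIDATES = {
    "का": ["નો", "ની", "નું", "ના"],
    "उसे": ["તેનો", "તેની", "તેને"],
    "नया": ["નવું", "નવા"],
}


def candidates_for(word: str) -> List[str]:
    """Return every Gujarati candidate recorded for a Hindi word."""
    return HI_TO_GU_CANDIDATES.get(word, [])


def pick_unique(word: str) -> Optional[str]:
    """Return a translation only when the table alone is unambiguous."""
    options = candidates_for(word)
    return options[0] if len(options) == 1 else None


for hindi_word in HI_TO_GU_CANDIDATES:
    print(hindi_word, len(candidates_for(hindi_word)), pick_unique(hindi_word))
```

Every entry has more than one candidate, so `pick_unique` returns `None` for all three. Choosing among the candidates needs information beyond the token itself, such as the gender and number of the neighbouring noun, which purely lexical token mapping does not consult.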

Thank You
