This presentation discusses a syllabification approach for transliterated search, highlighting challenges in transliteration tasks and error analysis. The study aimed to improve local language support in web applications for domains like e-commerce. The approach involved morphological analysis, corpus-based identification of Hindi words, and a step-by-step algorithm for transliteration. Results, errors, and conclusions are elaborated, emphasizing the importance of precision in transliteration tasks.
FIRE 2013 Presentation on: Transliterated Search using Syllabification Approach. By: Hardik Joshi (1), Apurva Bhatt (1), Honey Patel (2) {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com (1) Department of Computer Science, Gujarat University, Ahmedabad, India. (2) L.J. College of Engineering, Ahmedabad, India. FIRE, 4th Dec 2013
Content: Introduction, Our Approach, Syllabification, Our Results, Error And Analysis, Conclusion
Introduction There is a need to provide local language support in web-based applications, because various domains such as e-commerce sites require knowledge of English. A challenge in transliteration is that one word can be romanized in many ways: for the word “राष्ट्रपति”, the spellings “rashtrapati”, “rashtrapathi”, “raashtrapathy” and “raashtrpati” are all possible, and deciding which one is correct is itself an issue. Transliteration tasks become difficult in the presence of out-of-vocabulary (OOV) words and noisy words.
In both subtasks, transliteration was performed using the syllabification approach. In subtask 1, we performed morphological analysis of English words and then used a corpus-based approach to identify frequently occurring Hindi words. In subtask 2, the queries were formulated in two forms for separate run submissions: one containing both Roman and Devanagari script, and one containing Roman script only.
Syllabification Approach Languages place constraints on the possible consonant and vowel sequences, and these constraints characterize not only the word structure of a language but also its syllable structure. The vowel at the center of a syllable is the nucleus, the consonants at the beginning are the onset, and the consonants at the end are the coda. The nucleus and coda together form the rhyme.
Syllable Structure Example Word: “sprint”
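The onset, nucleus and coda of a single syllable such as “sprint” can be illustrated with a minimal sketch. It assumes a simple five-letter vowel set and a one-syllable input; the function name `split_syllable` is hypothetical and is not the segmentation the authors used.

```python
VOWELS = set("aeiou")  # assumption: a plain Roman vowel set


def split_syllable(syllable):
    """Split one syllable into (onset, nucleus, coda)."""
    s = syllable.lower()
    # The nucleus starts at the first vowel.
    start = next((i for i, ch in enumerate(s) if ch in VOWELS), None)
    if start is None:
        # No vowel at all: treat everything as onset.
        return s, "", ""
    # The nucleus runs over the consecutive vowels.
    end = start
    while end < len(s) and s[end] in VOWELS:
        end += 1
    # Consonants before the nucleus are the onset, those after it the coda.
    return s[:start], s[start:end], s[end:]


onset, nucleus, coda = split_syllable("sprint")
print(onset, nucleus, coda)  # spr i nt
```

Here the rhyme is the nucleus plus the coda, that is "int".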
Training Format (Source => Target)
s u d a k a r => स ◌ु द ◌ा क र
c h h a g a n => छ ग ण
j i t e s h => ज ◌ि त ◌े श
n a r a y a n => न ◌ा र ◌ा य ण
s h i v => श ◌ि व
m a d h a v => म ◌ा ध व
m o h a m m a d => म ◌ो ह म ◌् म द
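A training file in this character-separated format could be produced along the following lines. This is only a sketch: the helper names `to_training_line` and `make_parallel_lines` are hypothetical, and it assumes each Unicode code point (including vowel signs and the virama) becomes one token, which matches the rows shown above.

```python
def to_training_line(word):
    """Separate every character (Unicode code point) of a word with a space."""
    return " ".join(word)


def make_parallel_lines(pairs):
    """Turn (roman, devanagari) word pairs into aligned source/target lines."""
    source_lines = [to_training_line(src) for src, _ in pairs]
    target_lines = [to_training_line(tgt) for _, tgt in pairs]
    return source_lines, target_lines


# Two of the name pairs from the slide.
pairs = [("shiv", "शिव"), ("madhav", "माधव")]
source_lines, target_lines = make_parallel_lines(pairs)

for src, tgt in zip(source_lines, target_lines):
    print(src, "=>", tgt)
```

Vowel signs printed on their own may render with a dotted-circle placeholder, as in the rows above.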
Algorithm for subtask-I
Step 1: Look up each word in an English dictionary.
Step 2: Perform spell-checking, stemming and morphological analysis for English; if there is no spelling error and a match is found, label the word as English (\E).
Step 3: If the word is not found as English, check it against an English corpus of US newspapers.
Step 4: If the word is found there, check it against an English corpus of Indian newspapers.
Step 5: If the word is found in the US newspaper corpus but not in the Indian newspaper corpus, label it as English (\E).
Step 6: Steps 2 and 5 are applied in parallel to identify English words, which are labelled \E.
Step 7: The remaining words are labelled \H and transliterated into Hindi.
Step 8: The transliteration of these words into Hindi is performed with the Moses toolkit.
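The labelling part of these steps can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: spell-checking, stemming and morphological analysis are reduced to a plain dictionary lookup, the Moses transliteration step is left out, and the function names and the three small word sets are hypothetical stand-ins for the real dictionary and newspaper corpora.

```python
def label_word(word, english_dictionary, us_news_vocab, indian_news_vocab):
    """Return "E" for an English word, "H" for a word to transliterate into Hindi."""
    w = word.lower()
    # Steps 1-2 (simplified): a dictionary match counts as English.
    if w in english_dictionary:
        return "E"
    # Steps 3-5: seen in US news but not in Indian news, so treat as English.
    if w in us_news_vocab and w not in indian_news_vocab:
        return "E"
    # Step 7: every remaining word is treated as Hindi.
    return "H"


def label_query(query, english_dictionary, us_news_vocab, indian_news_vocab):
    """Label every whitespace-separated token of a query."""
    return [
        (tok, label_word(tok, english_dictionary, us_news_vocab, indian_news_vocab))
        for tok in query.split()
    ]


# Toy, hypothetical resources (assumptions for illustration only).
ENGLISH_DICTIONARY = {"queen", "dream", "when"}
US_NEWS_VOCAB = {"touchdown", "rani"}
INDIAN_NEWS_VOCAB = {"rani", "mere"}

labels = label_query(
    "mere rani touchdown queen",
    ENGLISH_DICTIONARY, US_NEWS_VOCAB, INDIAN_NEWS_VOCAB,
)
print(labels)
# [('mere', 'H'), ('rani', 'H'), ('touchdown', 'E'), ('queen', 'E')]
```

In this sketch "rani" appears in both newspaper vocabularies, so the US-only test fails and the word falls through to the Hindi label.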
Results For Subtask 2 Run 1 (query in both Devanagari and Roman script): “मेरे सापनोन कि रानी काब् आयेगी तु mere sapnonki rani kabaayegitu”. Run 2 (query in Roman script only): “mere sapnonki rani kabaayegitu”.
Error And Analysis
Several problems in the transliteration decreased the precision.
Errors in the maatra (vowel signs): “sapnon” => “सापनोन”, “ki” => “की”, “kab” => “काब”, “main” => “मिन”, “mein” => “मीन”, “na” => “न”, “ka” => “क”.
Multiple mappings for one letter, e.g. t = त or ट: “tera” => “टेरा”, “tum” => “तूम”, “to” => “टो”, “teri” => “टेरि”.
Missing sounds (फ, ख, छ ‘chh’, ksh): for the word “accha” we got “आक्का”, and for “poochho” we got “पूछोट”.
Multiple transliterations for c and k.
Vowels not transliterated correctly: “lo” => “लॉ”, “ho” => “होर”, “ko” => “कॉ”.
Spelling variations (shree, shri).
Conjunct formation: “kya” => “केया”.
Missing vowels: ‘aktr khan’ (अक्तर खान).
‘y’ as a vowel: ‘anthony’ and ‘Shyam’.
Conclusion We used the syllabification approach and considered the most probable term in the transliteration process. The word labeling task was performed assuming that a term belongs either to English or to Hindi. We were able to get high English recall because the labeling approach combined morphological analysis with a dictionary lookup. However, with the syllabification model the transliteration did not achieve high precision, which lowered the precision of the transliteration task and, subsequently, the precision metrics in the song lyrics retrieval task.