Shared Task Proposal, FIRE 2012
Search in Transliterated Space
Monojit Choudhury, Microsoft Research Lab India
A Transliterated World Wide Web
• Song lyrics
• Reviews and forums
• Facebook and Twitter
• And a lot more
Beyond Indic languages
• Many languages use a non-Roman script:
  • Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
  • Persian
  • Languages of the Indian subcontinent (Indian languages, Dzongkha, Nepali, Sinhala)
  • Thai, Vietnamese
  • Cyrillic (Russian, Ukrainian)
  • Chinese, Japanese, Korean (rare)
Aspects of Transliterated Text
• Transliteration
• Errors, contractions
• Code mixing
IR Scenario - I
Mono-script Monolingual IR in transliterated space
• Query: thandee hava yeh chandni suhanee
• Results: only Roman transliterated documents
• Challenge: spelling variations
  • tandee hawa ye chandny soohaany
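One way to see the spelling-variation challenge is to map each Roman spelling to a rough normalized key, so that variants of the same word collide. The sketch below is only an illustration: the rewrite rules (w to v, dropping some h's, collapsing long vowels, y to i) are assumptions chosen to cover the two spellings on this slide, the word boundaries in the query are assumed, and the function names are hypothetical. A real system would need learned or far richer rules, since these merge sounds that are distinct in Hindi.

```python
import re


def normalize_token(token: str) -> str:
    """Map one Roman-transliterated word to a rough spelling-insensitive key."""
    t = token.lower()
    t = t.replace("w", "v")                   # hawa / hava
    t = re.sub(r"(?<=[bdgjkpt])h", "", t)     # thandee / tandee (aspiration dropped)
    t = re.sub(r"(?<=[aeiou])h$", "", t)      # yeh / ye (trailing h after a vowel)
    t = t[:1] + t[1:].replace("y", "i")       # chandny / chandni (non-initial y)
    for long_vowel, short_vowel in (("ee", "i"), ("oo", "u"), ("aa", "a")):
        t = t.replace(long_vowel, short_vowel)  # suhanee / soohaany
    return t


def normalize_query(query: str) -> str:
    """Normalize every whitespace-separated word of a query."""
    return " ".join(normalize_token(tok) for tok in query.split())


if __name__ == "__main__":
    print(normalize_query("thandee hava yeh chandni suhanee"))
    print(normalize_query("tandee hawa ye chandny soohaany"))
```

Both calls produce the same key, so an index built on normalized keys would match either spelling. The price is over-merging, which is why this is a sketch and not a recommended rule set.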
IR Scenario - II
Cross-script and Multi-script Monolingual IR in transliterated space
• Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
• Results: documents in Roman transliteration and in native script
• Challenge: transliteration
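Before a cross-script or multi-script system can transliterate anything, it has to know which script or scripts a query uses. A minimal sketch of that check, using Unicode character names from Python's standard `unicodedata` module, is shown here. It handles only Devanagari and Roman, and the function names are hypothetical.

```python
import unicodedata


def script_of(ch: str):
    """Return 'Devanagari', 'Roman', or None for a single character."""
    name = unicodedata.name(ch, "")
    if name.startswith("DEVANAGARI"):
        return "Devanagari"
    if name.startswith("LATIN"):
        return "Roman"
    return None  # spaces, digits, punctuation, other scripts


def query_scripts(query: str) -> set:
    """Return the set of scripts that appear in the query."""
    return {s for s in map(script_of, query) if s is not None}


if __name__ == "__main__":
    print(query_scripts("thandee hava yeh chandni"))
    print(query_scripts("ठंडी हवा ये चाँदनी"))
```

A Roman-only query could then be back-transliterated to reach native-script documents, while a query containing both scripts can be searched in each script directly.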
IR Scenario - III
Cross-script and Cross-lingual IR
• Query: death of mareech and subahoo
• Documents: Hindi (transliterated and Devanagari) and English
Shared Task on Retrieval
• Mono-script Monolingual IR
  • Transliterated query in Roman
  • Transliterated documents in Roman
• Cross-script Monolingual IR
  • Transliterated query in Roman
  • Transliterated documents in native script
• Multi-script Monolingual IR
  • Query in Roman or native script
  • Documents in Roman and native scripts
Shared Sub-Tasks
• Language identification of transliterated queries, documents, and code-mixed text
  kooda kazhikkan oru urgan split pea soup undaki
  ML    ML        ML  ML    EN    EN  EN   ML
• Transliteration
  • Forward: കഴിക്കാന് → kazhikkan
  • Backward: kazhikkan → കഴിക്കാന്
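The language identification sub-task asks for one tag per word. As a rough illustration only, the sketch below tags words by lookup in a toy English word list and assigns every other word the tag ML (Malayalam). The lexicon, the function name, and the word boundaries in the example sentence are assumptions. A realistic system would use large lexicons or a trained classifier, and would need to handle words that are valid in both languages.

```python
# Hypothetical toy lexicon, just large enough for the slide's example.
ENGLISH_WORDS = {"split", "pea", "soup"}


def tag_tokens(text, english_words=ENGLISH_WORDS, other_tag="ML"):
    """Tag each whitespace-separated word as EN or as the fallback language."""
    return [
        (tok, "EN" if tok.lower() in english_words else other_tag)
        for tok in text.split()
    ]


if __name__ == "__main__":
    for word, tag in tag_tokens("kooda kazhikkan oru urgan split pea soup undaki"):
        print(f"{word}\t{tag}")
```

Lookup alone cannot resolve ambiguous words, which is one reason the task is posed as a shared challenge.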
Available Data
• 20,000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
• 35,000 unique Hindi-Roman word pairs obtained by aligning Bollywood song lyrics
• More data under preparation from Facebook, covering mixtures of various languages
• Looking for partners to extend!
Available Data (continued)
• Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics
• Looking for partners to extend this to other (Indian) languages
• Other domains?
Thank you! monojitc@microsoft.com
Other resources
• Lexicons
• Pronunciation lexicons
• G2P (grapheme-to-phoneme) for some languages
• Stemmers and morphological analyzers
• Anything else?
Concluding Remarks
• We have built a multi-script Bollywood song search and are working on transliteration and code-mixing
• These are just some initial ideas that came out of our experience
• If you are interested, please let me know