Multilingual Information Access in a Digital Library
Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology Hyderabad, India
Context
• Digital Library of India
  • 155,000 English books
  • 145,000 books in other languages
• Population of literates
  • 20% of India understands English
  • 80% cannot
IIIT Hyderabad - http://dli.iiit.ac.in
Multilingual Access to Information
• Retrieve a book
  • By metadata
  • By keyword / content
  • Cross Lingual Information Retrieval
• Read a book
  • Help understand sentences in a language
  • Help understand sentences across languages
  • Machine Translation
Approaches to Multilingual Access
• Cross Lingual Retrieval
  • Translate Query to Document Language
  • Translate Document to Query Language
• Machine Translation
  • Knowledge Based Approaches
  • Corpus Based Approaches
  • Hybrid Approaches
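As a rough illustration of the first retrieval option above (translating the query into the document language), the following minimal Python sketch expands each query word through a bilingual dictionary and matches the result against a toy inverted index. The dictionary entries, the sample documents, and the function names `translate_query`, `build_index` and `search` are hypothetical; they are not taken from the DLI system, which the slides describe only at a high level.

```python
from collections import defaultdict

# Hypothetical toy bilingual dictionary (English -> romanized Hindi).
# Illustrative entries only; not taken from the Universal Dictionary.
EN_TO_HI = {
    "water": ["pani", "jal"],
    "book": ["kitab", "pustak"],
}

def translate_query(query, dictionary):
    """Replace each query word by all of its dictionary translations.
    Words missing from the dictionary are kept unchanged."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))
    return terms

def build_index(docs):
    """Build a simple inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(query, dictionary, index):
    """Return sorted ids of documents containing any translated query term."""
    hits = set()
    for term in translate_query(query, dictionary):
        hits |= index.get(term, set())
    return sorted(hits)

# Hypothetical document collection (two romanized Hindi, one English).
DOCS = {
    "d1": "jal hi jeevan hai",
    "d2": "yah pustak acchi hai",
    "d3": "the water cycle",
}
INDEX = build_index(DOCS)
```

Keeping every translation of a word, rather than picking one, trades precision for recall; that is a common choice when no disambiguation resources are available, but it is an assumption of this sketch rather than something the slides state.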
Challenges in Multilingual Access
• Corpus Based Approaches
  • Unavailability of parallel corpora for pairs of languages
  • Unavailability of computational linguistics resources
• Dictionary Based Approaches
  • Unavailability of multiple bilingual dictionaries
Resources
• Universal Dictionary
  • Conceived and implemented by Michael Shamos at CMU, USA
• ITRANS
  • A transcription scheme and associated tool built by IISc, IIIT and CMU
• Corpus
  • Data entry by TTD and DLI project
  • TIDES project
Universal Dictionary
How are we doing it?
• Cross Lingual Search (Identify Information)
  • Dictionary lookup
  • User feedback based
  • Lucene Search Engine
• Machine Translation (Understand Information)
  • Corpus based technique (EBMT)
  • Dictionary based word-word lookup
  • Good-enough translation vs Perfect translation
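The "good-enough translation" idea above can be sketched as a two-stage fallback: first look for a stored example sentence, then fall back to word-by-word dictionary lookup. This is only a sketch under stated assumptions. The exact-match example lookup is a heavily simplified stand-in for EBMT, which in practice matches and recombines sentence fragments, and the example pairs, dictionary entries and the `translate` function are hypothetical rather than part of the system described in the slides.

```python
# Hypothetical parallel examples (romanized Hindi -> English).
EXAMPLES = {
    "yah pustak acchi hai": "this book is good",
}

# Hypothetical word-level dictionary (romanized Hindi -> English).
HI_TO_EN = {
    "yah": "this",
    "pustak": "book",
    "jal": "water",
    "hai": "is",
}

def normalize(sentence):
    """Lowercase the sentence and collapse runs of whitespace."""
    return " ".join(sentence.lower().split())

def translate(sentence, examples=EXAMPLES, dictionary=HI_TO_EN):
    """Return (translation, method).

    Stage 1: exact match against stored example sentences.
    Stage 2: word-by-word lookup; unknown words are kept and
    marked as <word> so the reader can see what was not translated."""
    key = normalize(sentence)
    if key in examples:
        return examples[key], "example"
    words = [dictionary.get(w, f"<{w}>") for w in key.split()]
    return " ".join(words), "word-by-word"
```

The word-by-word output keeps source word order and leaves gaps visible, so it is rarely fluent; the design goal in this sketch is only that a reader can get the gist of a sentence.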
Cross Lingual Retrieval
Reading Assistant System
Reading Assistant
Status Today
• CLIR for 6 languages
• MT for 3 languages
  • Shakti (a knowledge based MT system)
  • Parallel corpus for Hindi-English
• UDICT
  • About 40 foreign languages
  • 6 Indian languages
What more is needed?
• UDICT
  • Improving coverage of existing languages
  • Adding new languages
• Machine Translation
  • Corpus acquisition
  • State-of-the-art techniques applied to Indian languages
  • Multi-way parallel corpus development
• Textual format for the books
  • Books are currently in image formats
  • OCR should be developed for textual content
Thank You
Questions?