IIIT Hyderabad’s CLIR experiments for FIRE-2008

Sethuramalingam S & Vasudeva Varma
IIIT Hyderabad, India

Outline
• Introduction
• Related Work in Indian Language IR
• Our CLIR experiments
• Evaluation & Analysis
• Future Work
IIIT-H @ FIRE-2008

Introduction
• Cross-language information retrieval (CLIR) is a subfield of information retrieval that deals with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia)
• Information: text, audio, video, speech, geographical information, etc.

CLIR: Indian languages (IL) scenario
• Goal: retrieve documents written in any IL when the user queries in one language
• Languages shown: मराठी (Marathi), हिन्दी (Hindi), తెలుగు (Telugu), தமிழ் (Tamil), বাংলা (Bengali)
Modified from source: D. Oard's Cross-Language IR presentation


Why CLIR for IL?
• Internet user growth in India between 2000 and 2008: 1,100.0% (Source: www.internetworldstats.com)
• Growth in Indian language content on the web between 2000 and 2007: 700%
• So, CLIR for IL becomes a necessity!

Related Work in Indian Language IR

Related Work in ILIR
• ACM TALIP, 2003: the surprise language exercises
  - Task was to build a CLIR system for English to Hindi and Cebuano
"The surprise language exercises", Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003.

Related Work in ILIR
• CLEF 2006
  - Ad-hoc bilingual track including two Indian languages, Hindi and Telugu
  - Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task
"Hindi and Telugu to English Cross Language Information Retrieval", Prasad Pingali and Vasudeva Varma. CLEF 2006.

Related Work in ILIR
• CLEF 2007
  - Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
  - Five teams, including ours, participated
  - Our task: Hindi and Telugu to English CLIR
"IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task", Prasad Pingali and Vasudeva Varma. CLEF 2007.

Related Work in ILIR
• Google's CLIR system for 34 languages, including Hindi

Our CLIR experiments

Our CLIR experiments
• Ad-hoc cross-lingual runs: Hindi to English, and English to Hindi
• Ad-hoc monolingual runs in Hindi and English
• 12 runs in total were submitted for the above 4 tasks

Problem statement
• The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language
• Example topic in Hindi (the title reads "Iran's nuclear programme"):
<top lang="hi">
  <num>28</num>
  <title>ईरान का परमाणु कार्यक्रम</title>
  <desc>ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc>
  <narr>ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए का निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी</narr>
</top>
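A topic in the format shown above can be read with a standard XML parser. The following is a minimal sketch, not the authors' code: it assumes each topic is available as a standalone `<top>` element, and the function name `parse_topic` and the shortened `SAMPLE_TOPIC` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example topic (narrative field omitted for brevity).
SAMPLE_TOPIC = """<top lang="hi">
<num>28</num>
<title>ईरान का परमाणु कार्यक्रम</title>
<desc>ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc>
</top>"""


def parse_topic(xml_text):
    """Parse one <top> element into a plain dict (hypothetical helper)."""
    top = ET.fromstring(xml_text)
    return {
        "lang": top.get("lang"),
        "num": int(top.findtext("num")),
        # findtext returns None for a missing field, so fall back to "".
        "title": (top.findtext("title") or "").strip(),
        "desc": (top.findtext("desc") or "").strip(),
        "narr": (top.findtext("narr") or "").strip(),
    }


topic = parse_topic(SAMPLE_TOPIC)
print(topic["num"], topic["lang"], topic["title"])
```

The resulting dict keeps the three text fields separate, which is convenient if query words are later weighted by where they occur in the topic.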

CLIR System architecture
• Query Processing module
  • Named entity identification
  • Query translation using lexicons
  • Transliteration
  • Query scoring
• Indexing module
  • Stop-word remover
  • A typical indexer using Lucene

Named entity identification
• Used to identify the named entities present in the queries for transliteration
• We used
  • Our CRF-based NER system (as a binary classifier) for Hindi queries
  • Stanford English NER system for English queries
• Identifies Person, Organization and Location names
"Experiments in Telugu NER: A Conditional Random Field Approach", Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.

CLIR System architecture
• Query Processing module
  • Named entity identification
  • Query translation using lexicons
  • Transliteration (mapping-based)
  • Query scoring
• Indexing module
  • Stop-word remover
  • A typical indexer using Lucene

Query translation
• Using bilingual lexicons
  • "Shabdanjali", an English-Hindi dictionary containing 26,633 entries
  • IIT Bombay Hindi Wordnet
  • Manually collected Hindi-English dictionary with 6,685 entries
Shabdanjali: http://www.shabdkosh.com/shabdanjali
Hindi Wordnet: http://www.cfilt.iitb.ac.in/wordnet/webhwn/
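Lexicon-based query translation amounts to a dictionary lookup per query word, with unmatched words set aside. The sketch below is only an illustration under assumptions: the two-entry `LEXICON` is a made-up stand-in for the real dictionaries, and `translate_query` is a hypothetical helper, not the system's actual interface.

```python
# Tiny illustrative Hindi-English lexicon; the real resources are far larger.
LEXICON = {
    "परमाणु": ["nuclear", "atomic"],
    "कार्यक्रम": ["programme", "program"],
}


def translate_query(words, lexicon):
    """Look up each source word; return (translations, out-of-vocabulary words)."""
    translations = {}
    oov = []
    for word in words:
        if word in lexicon:
            # Keep every candidate translation for the word.
            translations[word] = list(lexicon[word])
        else:
            # Not covered by the lexicon, e.g. a named entity or function word.
            oov.append(word)
    return translations, oov


translated, oov = translate_query("ईरान का परमाणु कार्यक्रम".split(), LEXICON)
print(translated)
print(oov)
```

Words left in `oov` are the ones a later stage has to handle, for example named entities passed on for transliteration.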

Transliteration
• Mapping-based approach
• For a given named entity (NE) in the source language
  • Derive the Compressed Word Format (CWF)
    E.g. academia → cdm
    E.g. abullah → bll
• Generate the list of named entities and their CWFs on the target language side
• Search for and map the CWF of the source language NE to the CWF of the right target language equivalent within the minimum modified edit distance

Transliteration
• Implementation
  • Named entities present in the Hindi and English corpora are identified and listed
  • Their CWFs are generated using a set of heuristic rewrite and remove rules
  • CWFs are added to the list of NEs
"Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm", Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM-2008.
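The CWF mapping idea can be sketched in a few lines. This is a toy version under stated assumptions, not the published algorithm: the single rule used here (drop vowels and the letter h) is chosen only because it reproduces the two examples above, the source NE is assumed to be already romanized, and plain Levenshtein distance stands in for the modified edit distance. The names `cwf`, `edit_distance` and `best_match` are hypothetical.

```python
def cwf(word):
    """Toy Compressed Word Format: drop vowels and 'h' (an assumed rule)."""
    return "".join(ch for ch in word.lower() if ch not in "aeiouh")


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def best_match(source_ne, target_nes, max_dist=1):
    """Return the target NE whose CWF is closest to the source CWF, or None."""
    source_cwf = cwf(source_ne)
    best = min(target_nes, key=lambda t: edit_distance(source_cwf, cwf(t)))
    if edit_distance(source_cwf, cwf(best)) <= max_dist:
        return best
    return None


targets = ["academia", "abdullah", "mumbai"]
print(cwf("academia"), cwf("abullah"))
print(best_match("akademia", targets))
```

Comparing compressed forms rather than full spellings makes the match tolerant of vowel differences between a romanized source name and its target-language spelling; the `max_dist` threshold is an arbitrary choice for this sketch.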

Query Scoring
• We generate a Boolean OR query with scored query words
• Query scoring is based on
  • Position of occurrence of the word in the topic
  • Number of occurrences of the word
  • Numbers and years are given greater weights
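A weighted Boolean OR query of this kind could be assembled as follows. This is a sketch under assumptions: "position" is interpreted as the topic field a word appears in, the weights in `FIELD_WEIGHTS` and `NUMBER_BONUS` are invented for illustration, and the `term^boost` notation is the boost syntax of Lucene's classic query parser, which the actual system may or may not have used.

```python
import string
from collections import defaultdict

# Hypothetical weights: earlier topic fields count more.
FIELD_WEIGHTS = {"title": 3.0, "desc": 2.0, "narr": 1.0}
NUMBER_BONUS = 2.0  # assumed multiplier for numbers and years


def score_terms(topic):
    """Sum a field weight per occurrence, then boost purely numeric terms."""
    scores = defaultdict(float)
    for field, text in topic.items():
        for raw in text.lower().split():
            word = raw.strip(string.punctuation)
            if word:
                scores[word] += FIELD_WEIGHTS.get(field, 1.0)
    return {w: s * NUMBER_BONUS if w.isdigit() else s for w, s in scores.items()}


def to_boolean_or(scores):
    """Render scored terms as a boosted OR query, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return " OR ".join(f"{word}^{score:.1f}" for word, score in ranked)


topic = {"title": "Iran nuclear programme 2006", "desc": "Iran nuclear policy"}
print(to_boolean_or(score_terms(topic)))
```

Because scores are summed over occurrences, a word that appears in several fields ends up with a higher boost than one that appears once.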

CLIR System architecture
• Query Processing module
  • Named entity identification
  • Query translation using lexicons
  • Transliteration (mapping-based)
  • Query scoring
• Indexing & Ranking module
  • Stop-word remover
  • A typical indexer using Lucene

Indexing module
• For the English corpus, stop words are removed and the remaining words are stemmed using Lucene
• For the Hindi corpus, a stop-word list of 246 words is generated from the given corpus based on frequency
• Documents are indexed using the Lucene indexer and ranked using the BM-25 algorithm in Lucene
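The two ideas on this slide, a frequency-derived stop-word list and BM25 ranking, can be illustrated in plain Python. This is a self-contained sketch rather than the Lucene-based pipeline: the function names are hypothetical, the parameters `k1=1.2` and `b=0.75` are common textbook defaults rather than values from the experiments, and the IDF form is one of several variants in use.

```python
import math
from collections import Counter


def frequency_stopwords(docs, k):
    """Treat the k most frequent tokens in the corpus as stop words."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {word for word, _ in counts.most_common(k)}


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


raw_docs = [
    ["the", "iran", "nuclear", "the"],
    ["the", "cricket", "match", "the"],
    ["the", "nuclear", "policy", "the", "the"],
]
stop = frequency_stopwords(raw_docs, k=1)
docs = [[t for t in d if t not in stop] for d in raw_docs]
print(stop, bm25_scores(["iran", "nuclear"], docs))
```

In the experiments the cut-off produced a list of 246 words; here `k=1` is used only because the toy corpus is tiny.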

Evaluation & Analysis

Evaluation (result figures from the slides are not reproduced here)
• English-Hindi cross-lingual run
• Hindi-English cross-lingual run
• Hindi-Hindi monolingual run
• English-English monolingual run
• English-Hindi vs Hindi-Hindi
• Hindi-English vs English-English

Evaluation
• Summary
  • Our English-Hindi CLIR performance was 58% of the monolingual run
  • Our Hindi-English CLIR performance was 25% of the monolingual run
  • Our Hindi-Hindi monolingual run retrieved 52% of total relevant documents
  • Our English-English monolingual run retrieved 91% of total relevant documents
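Percentages of this kind are simple ratios. The sketch below shows the arithmetic only; it assumes the cross-lingual comparison uses a single effectiveness score per run (for example mean average precision, which the slides do not name), and all input numbers are made-up placeholders, not results from these experiments.

```python
def percent_of_monolingual(clir_score, mono_score):
    """Cross-lingual effectiveness as a percentage of the monolingual baseline."""
    return round(100 * clir_score / mono_score)


def percent_relevant_retrieved(relevant_retrieved, total_relevant):
    """Share of all relevant documents that a run retrieved (recall)."""
    return round(100 * relevant_retrieved / total_relevant)


# Placeholder inputs for illustration only.
print(percent_of_monolingual(0.2, 0.4))
print(percent_relevant_retrieved(30, 40))
```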

Analysis
• Our English-Hindi CLIR performance can be attributed to factors like
  • Exact matching of English named entities
  • Good coverage of English words in our lexicons
• The relatively lower performance on Hindi-English CLIR is due to
  • Low dictionary coverage
  • Query formulation that was not complex enough

Future Work

Future Work
• Error analysis on a per-topic basis
• Work on more complex query formulations
• Work on other possible query translation techniques like
  • Building dictionaries from parallel corpora
  • Using the web
  • Using Wikipedia

Thank you!
