Cross Language Information Retrieval (CLIR)

Cross Language Information Retrieval Based on Query Keyword Translation: An Internet Search Application. Atsushi Fujii and Tetsuya Ishikawa. International Journal of Computer Processing of Oriental Languages, 2000, 13:1, 1-13.

Cross Language Information Retrieval (CLIR) • The user presents queries in one language to retrieve documents in another language.

Previous Research • In 1970, Salton showed that CLIR using a hand-crafted bilingual thesaurus was comparable with monolingual information retrieval in performance. • (The experiments used machine-stored documents that had already been identified and classified.)

Information and Internet • The internet is a big storage facility for documents/web pages. • We can search these pages to retrieve information.

CLIR • Through the 1990s, CLIR systems tried to access multilingual web pages. • By 2000, system performance (precision) was only 50 to 75 percent of that of monolingual systems.

What is involved in CLIR over the Internet • A CLIR system needs a translation process along with a multilingual retrieval process. • Usually bilingual dictionaries, corpora, thesauri, and Machine Translation (MT) systems are used to translate queries and/or documents.

Retrieval Methods • Query translation approach • Document translation approach • Interlingual representation approach

Query translation approach • Queries are translated into the document language prior to the retrieval process. • Three methods of translation are: • Dictionary-based • Corpus-based • Hybrid – corpora are used to resolve translation ambiguity in dictionaries.
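
The dictionary-based and hybrid methods above can be sketched as follows. This is a minimal illustration only, not the system from the paper: the English-to-Spanish dictionary `BILINGUAL_DICT`, the toy corpus `TARGET_CORPUS`, and both function names are hypothetical, and the hybrid step assumes plain corpus frequency as the disambiguation signal.

```python
from collections import Counter

# Hypothetical toy English-to-Spanish dictionary (an assumption for illustration).
BILINGUAL_DICT = {
    "bank": ["banco", "orilla"],
    "interest": ["interés"],
    "rate": ["tasa", "ritmo"],
}

# Hypothetical toy corpus in the document language.
TARGET_CORPUS = [
    "el banco sube la tasa de interés",
    "la tasa de interés del banco central",
    "paseo por la orilla del río",
]


def dictionary_translate(query):
    """Dictionary-based: keep every candidate translation of every term."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary are passed through unchanged.
        translated.extend(BILINGUAL_DICT.get(term, [term]))
    return translated


def hybrid_translate(query):
    """Hybrid: per term, keep the candidate that is most frequent in the corpus."""
    freq = Counter(word for doc in TARGET_CORPUS for word in doc.split())
    translated = []
    for term in query.lower().split():
        candidates = BILINGUAL_DICT.get(term, [term])
        translated.append(max(candidates, key=lambda c: freq[c]))
    return translated


print(dictionary_translate("bank interest rate"))
print(hybrid_translate("bank interest rate"))
```

The dictionary-only version keeps ambiguous candidates such as "orilla" (river bank), while the corpus frequencies let the hybrid version drop them before retrieval.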

Document translation approach • Translates the documents into the query language prior to retrieval. • Methods • Uses MT systems to conduct full translation • Translates only the indexed terms, using the dictionary-based translation method • Oard and Hackett (1997) showed empirically that the full document translation method outperformed the query translation method, but it is expensive.
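
The cheaper variant above, translating only index terms with a dictionary, might look like the sketch below. The Spanish-to-English dictionary `DOC_TO_QUERY_DICT`, the documents in `DOCS`, and the term-overlap scoring are all assumptions made for illustration; a real system would use a proper ranking function.

```python
from collections import defaultdict

# Hypothetical toy Spanish-to-English dictionary (document language -> query language).
DOC_TO_QUERY_DICT = {
    "banco": ["bank"],
    "tasa": ["rate"],
    "interés": ["interest"],
    "río": ["river"],
    "orilla": ["bank", "shore"],
}

# Hypothetical toy document collection.
DOCS = {
    "d1": "el banco sube la tasa de interés",
    "d2": "paseo por la orilla del río",
}


def build_translated_index(docs):
    """Build an inverted index keyed by query-language terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            # Only terms found in the dictionary are indexed.
            for translation in DOC_TO_QUERY_DICT.get(term, []):
                index[translation].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of query terms they match."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


index = build_translated_index(DOCS)
print(search(index, "interest rate"))
print(search(index, "river bank"))
```

Translation happens once at indexing time, so queries need no translation at search time; the cost moves to building and storing the translated index.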

Interlingual representation approach • Projects both queries and documents into language-independent representations • Thesaurus classes • Vector Space Models • However, these methods require manual alignment of bilingual thesauri/corpora. • Carbonell et al. (1997) showed that corpus-based query translation outperformed language-independent vector space models.
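
As a rough sketch of the thesaurus-class idea, the code below maps terms from both languages onto shared concept identifiers and compares the resulting vectors with cosine similarity. The `CONCEPTS` mapping and its class names are hypothetical; building such a mapping is the manual alignment step the slide mentions.

```python
import math
from collections import Counter

# Hypothetical bilingual thesaurus: term (either language) -> concept class.
CONCEPTS = {
    "bank": "C_FINANCIAL_INSTITUTION",
    "banco": "C_FINANCIAL_INSTITUTION",
    "interest": "C_INTEREST",
    "interés": "C_INTEREST",
    "river": "C_RIVER",
    "río": "C_RIVER",
}


def to_concept_vector(text):
    """Project text in either language onto language-independent concept counts."""
    return Counter(CONCEPTS[t] for t in text.lower().split() if t in CONCEPTS)


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = to_concept_vector("bank interest")
doc_finance = to_concept_vector("el banco sube el interés")
doc_river = to_concept_vector("el río")

print(cosine(query, doc_finance))
print(cosine(query, doc_river))
```

Neither the query nor the documents are translated into the other language; both are compared in the shared concept space.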

System

Probabilistic Translation • The authors use statistical models to select the best translation among the candidate words.
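
One common way to select translations statistically is to score every combination of candidates and keep the most probable one. The sketch below assumes a score that multiplies per-word translation probabilities by target-language word-pair probabilities; this formulation, the probability tables `TRANSLATION_PROB` and `PAIR_PROB`, and the function name are illustrative assumptions, not the exact model from the paper.

```python
from itertools import product

# Hypothetical translation probabilities: source word -> {target word: probability}.
TRANSLATION_PROB = {
    "bank": {"banco": 0.6, "orilla": 0.4},
    "rate": {"tasa": 0.5, "ritmo": 0.5},
}

# Hypothetical target-language probabilities for adjacent word pairs.
PAIR_PROB = {
    ("banco", "tasa"): 0.08,
    ("banco", "ritmo"): 0.01,
    ("orilla", "tasa"): 0.001,
    ("orilla", "ritmo"): 0.001,
}


def best_translation(source_terms, unseen_pair_prob=1e-6):
    """Return the candidate combination with the highest combined probability."""
    candidate_lists = [
        # Unknown source words pass through with probability 1.0.
        list(TRANSLATION_PROB.get(term, {term: 1.0}).items())
        for term in source_terms
    ]
    best, best_score = None, -1.0
    for combo in product(*candidate_lists):
        words = tuple(word for word, _ in combo)
        score = 1.0
        for _, prob in combo:
            score *= prob  # translation model
        for left, right in zip(words, words[1:]):
            score *= PAIR_PROB.get((left, right), unseen_pair_prob)  # target-side model
        if score > best_score:
            best, best_score = words, score
    return best


print(best_translation(["bank", "rate"]))
```

Exhaustive enumeration is fine for short queries; longer inputs would need dynamic programming or beam search to keep the candidate space manageable.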

Conclusion • The proposed method improves on baseline CLIR systems through the use of a compound translation system.
