Cross Lingual Information Retrieval (CLIR) • Rong Jin
The Problem • Increasing pressure for accessing information in foreign languages: • find information written in foreign languages • read and interpret that information • merge it with information in other languages • Need for multilingual information access
Why is Cross-Lingual IR Important? • The Internet is no longer monolingual, and non-English content is growing rapidly • Non-English speakers are the fastest-growing group of new Internet users • In 1997, 8.1 million Spanish-speaking users • In 2000, 37 million …
[Chart: Internet content by language, 2000 vs. 2005 • Source: Manning & Napier Information Services 2000, confidential, unpublished information]
2. Multilingual Text Processing • Character encoding • Language recognition • Tokenization • Stop word removal • Feature normalization (stemming) • Part-of-speech tagging • Phrase identification
Character Encoding • Language (alphabet) specific native encodings: • Chinese: GB, Big5 • Western European: ISO-8859-1 (Latin-1) • Russian: KOI-8, ISO-8859-5, CP-1251 • UNICODE (ISO/IEC 10646) • UTF-8: variable byte length • UTF-16: double-byte code units (variable length via surrogate pairs); UCS-2: fixed double-byte
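The difference between these encodings is easy to see by encoding the same text several ways; a minimal sketch in Python (the example strings are illustrative):

```python
# UTF-8 uses 1 byte per ASCII character but 3 bytes for most CJK
# characters; UTF-16 uses 2 bytes for any character in the Basic
# Multilingual Plane.
for text in ("train", "環法"):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")   # big-endian, no byte-order mark
    print(text, len(utf8), len(utf16))
# "train" → 5 bytes in UTF-8, 10 in UTF-16
# "環法"  → 6 bytes in UTF-8,  4 in UTF-16
```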
Tokenization • Punctuation separated from words, including word-separation characters • “The train stopped.” → “The”, “train”, “stopped”, “.” • String split into lexical units, including segmentation (Chinese) and compound splitting (German)
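The punctuation-separation step above can be sketched with a simple regular-expression tokenizer (one of many possible implementations, not the deck's actual tokenizer):

```python
import re

# Split text into word tokens and single punctuation tokens, so that
# "stopped." becomes two tokens: "stopped" and ".".
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The train stopped."))
# → ['The', 'train', 'stopped', '.']
```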
Chinese Segmentation • [Figure: Chinese word-segmentation example]
German Segmentation • Unrestricted compounding in German • Abendnachrichtensendungsblock • Use compound analysis together with the CELEX German dictionary (360,000 words) • Treuhandanstalt → { treuhand, anstalt } • Or use an n-gram representation • Treuhandanstalt → { treuha, reuhan, euhand, … }
CLIR – Approaches • Machine Translation • Bilingual Dictionaries • Parallel/Comparable Corpora • [Diagram: user query and document on opposite sides of the language barrier, each mapped to a query/document representation] • Example document: “Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …” • Example query: 誰在1998年贏得環法自行車大賽 (“Who won the Tour de France in 1998?”)
Machine Translation (MT) • Translate all documents into the query language • Not viable for large collections (MT is computationally expensive) • Not viable if there are many possible query languages • [Diagram: English documents → MT → Chinese documents → Lucene, searched with Chinese queries]
Machine Translation • Translate the query into the language(s) of the content being searched • Query translation is inadequate for CLIR: • little context for accurate translation • the system simply selects a preferred target term • [Diagram: Chinese queries → MT → English queries → Lucene over English documents]
Example of Translating Queries Who won the Tour de France in 1998?
Using Dictionaries • Bilingual machine-readable dictionaries (in-house or commercial) • Look up query terms in the dictionary and replace them with their translations in the document language • [Diagram: Chinese queries → bilingual dictionary → English queries → Lucene over English documents]
Using Dictionaries: Problems • ambiguity • many terms are out-of-vocabulary • lack of multiword terms • phrase identification • a bilingual dictionary is needed for every query–document language pair of interest
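The dictionary look-up step, and the ambiguity and out-of-vocabulary problems listed above, can be sketched with a toy bilingual dictionary (the entries below are invented for illustration; a real system would use an in-house or commercial resource):

```python
# Toy English→Chinese dictionary (assumption, not a real resource).
BILINGUAL_DICT = {
    "win":  ["贏得", "獲勝"],
    "tour": ["環遊", "巡迴"],
}

def translate_query(query_terms, dictionary=BILINGUAL_DICT):
    translated, oov = [], []
    for term in query_terms:
        if term in dictionary:
            # Ambiguity: with no context, every listed sense is kept.
            translated.extend(dictionary[term])
        else:
            oov.append(term)   # out-of-vocabulary term stays untranslated
    return translated, oov

terms, oov = translate_query(["win", "tour", "pantani"])
print(terms)  # all candidate translations, sense-ambiguous
print(oov)    # → ['pantani']  (proper name missing from the dictionary)
```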
Word Sense Disambiguation • Example mistranslation: “The sign for independent press to disappear”
Using Corpora • Parallel corpora • translation-equivalent texts • e.g. the UN corpus in French, Spanish, and English • Comparable corpora • similar in topic, style, time, etc. • e.g. Hong Kong TV broadcast news in both Chinese and English
Using Corpora • How do we bridge the language barrier using a parallel corpus? • Toy collection: d1 = (a a c e), d2 = (b c d a), d3 = (e d a) • Query: (A E)
Translate Query using Parallel Corpus (I) • Collection: d1 = (a a c e), d2 = (b c d a), d3 = (e d a) • Query: (A E) • Translated query: (c e)
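One way to read approach (I) is as pseudo-relevance feedback over the parallel corpus: match the query against the query-language side of the aligned pairs, then take the most frequent words from the matching pairs' document-language side as the translated query. The aligned pairs below are invented for illustration (the deck's actual corpus was shown graphically); uppercase letters are query-language words, lowercase are document-language words:

```python
from collections import Counter

# Toy aligned corpus: (query-language sentence, document-language sentence).
PARALLEL = [
    ("A B E".split(), "a c e".split()),
    ("A E".split(),   "c e".split()),
    ("B D".split(),   "b d".split()),
]

def translate_query(query, pairs=PARALLEL, k=2):
    counts = Counter()
    for tgt, src in pairs:
        if any(w in tgt for w in query):   # pair matches the query
            counts.update(src)             # count its source-side words
    return [w for w, _ in counts.most_common(k)]

print(translate_query(["A", "E"]))
# for this toy corpus the translated query is ['c', 'e']
```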
Translate Query using Parallel Corpus (II) • Learn word-to-word translation probabilities from a parallel corpus • Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q
Translate Query using Parallel Corpus (II) Word-to-Word Translation Probabilities Q = (A E), d1 = (a a c e)
Translate Query using Parallel Corpus (II) Q = (A E), d1 = (a a c e)
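The scoring rule behind approach (II) is the translation-model query likelihood p(Q|d) = Π_{q∈Q} Σ_{w∈d} t(q|w)·p(w|d). The translation table below is an assumption for the worked example Q = (A E), d1 = (a a c e), since the deck's actual probability table was in a figure:

```python
# Assumed word-to-word translation probabilities t(query_word | doc_word).
T = {
    ("A", "a"): 0.8, ("A", "c"): 0.1,
    ("E", "e"): 0.7, ("E", "c"): 0.2,
}

def score(query, doc, t=T):
    """p(Q|d): product over query words of the expected translation
    probability under a uniform p(w|d) = 1/|d| over document tokens."""
    p = 1.0
    for q in query:
        p *= sum(t.get((q, w), 0.0) * (1.0 / len(doc)) for w in doc)
    return p

print(score(["A", "E"], ["a", "a", "c", "e"]))  # p(Q | d1)
```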
Translate Query using Parallel Corpus (II) How to obtain the translation probabilities ?
Approach I: Co-occurrence Counting • Co-occurrence-based translation model, e.g. p(A|a) = co(A, a) / occ(a) = 4/4 = 1
Approach I: Co-occurrence Counting • p(B|c) = co(B, c) / occ(c) = 2/4 = 0.5
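Co-occurrence counting over aligned sentence pairs can be sketched as below. The toy corpus is invented to reproduce the deck's values in spirit (its actual corpus was shown in a figure):

```python
from collections import Counter

# Toy aligned pairs: (query-language sentence, document-language sentence).
PAIRS = [
    ("A B".split(), "a b".split()),
    ("A C".split(), "a c".split()),
    ("B C".split(), "b c".split()),
    ("A".split(),   "a".split()),
]

def cooccurrence_model(pairs):
    """p(t|s) = co(t, s) / occ(s): how often target word t appears in a
    pair aligned with source word s, normalised by s's occurrences."""
    co, occ = Counter(), Counter()
    for tgt, src in pairs:
        for s in src:
            occ[s] += 1
            for t in tgt:
                co[(t, s)] += 1
    return {(t, s): c / occ[s] for (t, s), c in co.items()}

probs = cooccurrence_model(PAIRS)
print(probs[("A", "a")])  # → 1.0: 'A' co-occurs with every occurrence of 'a'
print(probs[("B", "c")])  # → 0.5
```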
Approach I: Co-occurrence Counting • Any problems?
Approach I: Co-occurrence Counting • Many large translation probabilities • Usually a word in one language corresponds mostly to a single word in another language • We may over-count the co-occurrence statistics
Approach I: Over-counting • co(A, a) = 4 implies that every occurrence of ‘A’ is due to an occurrence of ‘a’
Approach I: Over-counting • If we believe that the first two occurrences of ‘A’ are due to ‘a’, then co(A, b) = 1, not 3 • But we have no way of knowing whether the first two occurrences of ‘A’ are due to ‘a’
How to Compute Co-occurrence? • IBM statistical translation models • A series of translation models published by IBM Research • We will only discuss IBM Translation Model 1 • It uses an iterative procedure to eliminate the over-counting problem
Step 1: Compute co-occurrence • Assume that translation probabilities are proportional to co-occurrence
Step 2: Compute Conditional Prob. • Normalize the co-occurrence counts into conditional translation probabilities, e.g. p(A|a) = co(A, a) / occ(a)
Step 3: Re-estimate co-occurrence • ‘A’ can be caused by any of the words ‘b’, ‘c’, ‘a’, ‘d’ in the aligned sentence • co(A, a) for sentence 1 should be computed taking this competition into account
Step 3: Re-estimate co-occurrence co(A,a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62
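The three steps above are the EM iteration of IBM Model 1: share each occurrence of a target word among the competing source words in proportion to the current translation probabilities, then renormalise. A minimal sketch, with an invented toy corpus (the deck's corpus was shown in a figure):

```python
from collections import defaultdict

# Toy aligned pairs: (target-language sentence, source-language sentence).
PAIRS = [
    ("A B".split(), "a b".split()),
    ("A C".split(), "a c".split()),
    ("A".split(),   "a".split()),
]

def ibm_model1(pairs, iterations=10):
    # Step 1: start from a uniform table (every pair equally likely).
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # fractional co-occurrence counts
        total = defaultdict(float)
        for tgt, src in pairs:
            for w in tgt:
                # Step 3: each occurrence of w is shared among the
                # competing source words, weighted by current t(w|s).
                z = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / z
                    count[(w, s)] += frac
                    total[s] += frac
        # Step 2: renormalise the counts into conditional probabilities.
        t = defaultdict(float, {k: v / total[k[1]] for k, v in count.items()})
    return t

t = ibm_model1(PAIRS)
print(round(t[("A", "a")], 3))  # approaches 1.0: 'a' alone explains 'A'
```

After a few iterations the over-counting disappears: because ‘B’ is fully explained by ‘b’ and ‘C’ by ‘c’, the mass for ‘A’ concentrates on ‘a’.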