Cross Lingual Information Retrieval (CLIR) • Rong Jin
The Problem • Increasing pressure for accessing information in foreign languages: • find information written in foreign languages • read and interpret that information • merge it with information in other languages • Need for multilingual information access
Why is Cross-Lingual IR Important? • The Internet is no longer monolingual, and non-English content is growing rapidly • Non-English speakers are the fastest-growing group of new Internet users • In 1997, 8.1 million Spanish-speaking users • In 2000, 37 million …
[Chart: Internet content by language, 2000 vs. 2005 • Source: Manning & Napier Information Services 2000, confidential, unpublished information]
2. Multilingual Text Processing • Character encoding • Language recognition • Tokenization • Stop word removal • Feature normalization (stemming) • Part-of-speech tagging • Phrase identification
Character Encoding • Language (alphabet) specific native encodings: • Chinese: GB, Big5 • Western European: ISO-8859-1 (Latin-1) • Russian: KOI-8, ISO-8859-5, CP-1251 • UNICODE (ISO/IEC 10646) • UTF-8: variable byte length • UTF-16: double-byte code units (variable length via surrogate pairs); UCS-2: fixed double-byte
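The difference between these encodings is easy to see by encoding the same text several ways; a minimal sketch in Python (the example strings are illustrative):

```python
# UTF-8 uses 1 byte per ASCII character but 3 bytes for most CJK
# characters; UTF-16 uses 2 bytes for any character in the Basic
# Multilingual Plane.
for text in ("train", "環法"):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")   # big-endian, no byte-order mark
    print(text, len(utf8), len(utf16))
# "train" → 5 bytes in UTF-8, 10 in UTF-16
# "環法"  → 6 bytes in UTF-8,  4 in UTF-16
```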
Tokenization • Punctuation separated from words, including word-separation characters • “The train stopped.” → “The”, “train”, “stopped”, “.” • String split into lexical units, including segmentation (Chinese) and compound splitting (German)
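The punctuation-separation step above can be sketched with a simple regular-expression tokenizer (one of many possible implementations, not the deck's actual tokenizer):

```python
import re

# Split text into word tokens and single punctuation tokens, so that
# "stopped." becomes two tokens: "stopped" and ".".
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The train stopped."))
# → ['The', 'train', 'stopped', '.']
```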
Chinese Segmentation • [Figure: Chinese word-segmentation example]
German Segmentation • Unrestricted compounding in German • Abendnachrichtensendungsblock • Use compound analysis together with the CELEX German dictionary (360,000 words) • Treuhandanstalt → { treuhand, anstalt } • Or use an n-gram representation • Treuhandanstalt → { treuha, reuhan, euhand, … }
CLIR – Approaches • Machine Translation • Bilingual Dictionaries • Parallel/Comparable Corpora • [Diagram: user query and document on opposite sides of the language barrier, each mapped to a query/document representation] • Example document: “Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …” • Example query: 誰在1998年贏得環法自行車大賽 (“Who won the Tour de France in 1998?”)
Machine Translation (MT) • Translate all documents into the query language • Not viable for large collections (MT is computationally expensive) • Not viable if there are many possible query languages • [Diagram: English documents → MT → Chinese documents → Lucene, searched with Chinese queries]
Machine Translation • Translate the query into the language(s) of the content being searched • Query translation is inadequate for CLIR: • little context for accurate translation • the system simply selects a preferred target term • [Diagram: Chinese queries → MT → English queries → Lucene over English documents]
Example of Translating Queries Who won the Tour de France in 1998?
Using Dictionaries • Bilingual machine-readable dictionaries (in-house or commercial) • Look up query terms in the dictionary and replace them with their translations in the document language • [Diagram: Chinese queries → bilingual dictionary → English queries → Lucene over English documents]
Using Dictionaries: Problems • ambiguity • many terms are out-of-vocabulary • lack of multiword terms • phrase identification • a bilingual dictionary is needed for every query–document language pair of interest
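The dictionary look-up step, and the ambiguity and out-of-vocabulary problems listed above, can be sketched with a toy bilingual dictionary (the entries below are invented for illustration; a real system would use an in-house or commercial resource):

```python
# Toy English→Chinese dictionary (assumption, not a real resource).
BILINGUAL_DICT = {
    "win":  ["贏得", "獲勝"],
    "tour": ["環遊", "巡迴"],
}

def translate_query(query_terms, dictionary=BILINGUAL_DICT):
    translated, oov = [], []
    for term in query_terms:
        if term in dictionary:
            # Ambiguity: with no context, every listed sense is kept.
            translated.extend(dictionary[term])
        else:
            oov.append(term)   # out-of-vocabulary term stays untranslated
    return translated, oov

terms, oov = translate_query(["win", "tour", "pantani"])
print(terms)  # all candidate translations, sense-ambiguous
print(oov)    # → ['pantani']  (proper name missing from the dictionary)
```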
Word Sense Disambiguation • Example mistranslation: “The sign for independent press to disappear”
Using Corpora • Parallel corpora • translation-equivalent texts • e.g. the UN corpus in French, Spanish, and English • Comparable corpora • similar in topic, style, time, etc. • e.g. Hong Kong TV broadcast news in both Chinese and English
Using Corpora • How do we bridge the language barrier using a parallel corpus? • Toy collection: d1 = (a a c e), d2 = (b c d a), d3 = (e d a) • Query: (A E)
Translate Query using Parallel Corpus (I) • Collection: d1 = (a a c e), d2 = (b c d a), d3 = (e d a) • Query: (A E) • Translated query: (c e)
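One way to read approach (I) is as pseudo-relevance feedback over the parallel corpus: match the query against the query-language side of the aligned pairs, then take the most frequent words from the matching pairs' document-language side as the translated query. The aligned pairs below are invented for illustration (the deck's actual corpus was shown graphically); uppercase letters are query-language words, lowercase are document-language words:

```python
from collections import Counter

# Toy aligned corpus: (query-language sentence, document-language sentence).
PARALLEL = [
    ("A B E".split(), "a c e".split()),
    ("A E".split(),   "c e".split()),
    ("B D".split(),   "b d".split()),
]

def translate_query(query, pairs=PARALLEL, k=2):
    counts = Counter()
    for tgt, src in pairs:
        if any(w in tgt for w in query):   # pair matches the query
            counts.update(src)             # count its source-side words
    return [w for w, _ in counts.most_common(k)]

print(translate_query(["A", "E"]))
# for this toy corpus the translated query is ['c', 'e']
```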
Translate Query using Parallel Corpus (II) • Learn word-to-word translation probabilities from a parallel corpus • Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q
Translate Query using Parallel Corpus (II) Word-to-Word Translation Probabilities Q = (A E), d1 = (a a c e)
Translate Query using Parallel Corpus (II) Q = (A E), d1 = (a a c e)
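The scoring rule behind approach (II) is the translation-model query likelihood p(Q|d) = Π_{q∈Q} Σ_{w∈d} t(q|w)·p(w|d). The translation table below is an assumption for the worked example Q = (A E), d1 = (a a c e), since the deck's actual probability table was in a figure:

```python
# Assumed word-to-word translation probabilities t(query_word | doc_word).
T = {
    ("A", "a"): 0.8, ("A", "c"): 0.1,
    ("E", "e"): 0.7, ("E", "c"): 0.2,
}

def score(query, doc, t=T):
    """p(Q|d): product over query words of the expected translation
    probability under a uniform p(w|d) = 1/|d| over document tokens."""
    p = 1.0
    for q in query:
        p *= sum(t.get((q, w), 0.0) * (1.0 / len(doc)) for w in doc)
    return p

print(score(["A", "E"], ["a", "a", "c", "e"]))  # p(Q | d1)
```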
Translate Query using Parallel Corpus (II) How to obtain the translation probabilities ?
Approach I: Co-occurrence Counting • Co-occurrence-based translation model, e.g. p(A|a) = co(A, a) / occ(a) = 4/4 = 1
Approach I: Co-occurrence Counting • p(B|c) = co(B, c) / occ(c) = 2/4 = 0.5
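Co-occurrence counting over aligned sentence pairs can be sketched as below. The toy corpus is invented to reproduce the deck's values in spirit (its actual corpus was shown in a figure):

```python
from collections import Counter

# Toy aligned pairs: (query-language sentence, document-language sentence).
PAIRS = [
    ("A B".split(), "a b".split()),
    ("A C".split(), "a c".split()),
    ("B C".split(), "b c".split()),
    ("A".split(),   "a".split()),
]

def cooccurrence_model(pairs):
    """p(t|s) = co(t, s) / occ(s): how often target word t appears in a
    pair aligned with source word s, normalised by s's occurrences."""
    co, occ = Counter(), Counter()
    for tgt, src in pairs:
        for s in src:
            occ[s] += 1
            for t in tgt:
                co[(t, s)] += 1
    return {(t, s): c / occ[s] for (t, s), c in co.items()}

probs = cooccurrence_model(PAIRS)
print(probs[("A", "a")])  # → 1.0: 'A' co-occurs with every occurrence of 'a'
print(probs[("B", "c")])  # → 0.5
```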
Approach I: Co-occurrence Counting • Any problems?
Approach I: Co-occurrence Counting • Many large translation probabilities • Usually a word in one language corresponds mostly to a single word in another language • We may over-count the co-occurrence statistics
Approach I: Over-counting • co(A, a) = 4 implies that every occurrence of ‘A’ is due to an occurrence of ‘a’
Approach I: Over-counting • If we believe that the first two occurrences of ‘A’ are due to ‘a’, then co(A, b) = 1, not 3 • But we have no way of knowing whether the first two occurrences of ‘A’ are due to ‘a’
How to Compute Co-occurrence? • IBM statistical translation models • A series of translation models published by IBM Research • We will only discuss IBM Translation Model 1 • It uses an iterative procedure to eliminate the over-counting problem
Step 1: Compute co-occurrence • Assume that translation probabilities are proportional to co-occurrence
Step 2: Compute Conditional Prob. • Normalize the co-occurrence counts into conditional translation probabilities, e.g. p(A|a) = co(A, a) / occ(a)
Step 3: Re-estimate co-occurrence • ‘A’ can be caused by any of the words ‘b’, ‘c’, ‘a’, ‘d’ in the aligned sentence • co(A, a) for sentence 1 should be computed taking this competition into account
Step 3: Re-estimate co-occurrence co(A,a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62
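The three steps above are the EM iteration of IBM Model 1: share each occurrence of a target word among the competing source words in proportion to the current translation probabilities, then renormalise. A minimal sketch, with an invented toy corpus (the deck's corpus was shown in a figure):

```python
from collections import defaultdict

# Toy aligned pairs: (target-language sentence, source-language sentence).
PAIRS = [
    ("A B".split(), "a b".split()),
    ("A C".split(), "a c".split()),
    ("A".split(),   "a".split()),
]

def ibm_model1(pairs, iterations=10):
    # Step 1: start from a uniform table (every pair equally likely).
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # fractional co-occurrence counts
        total = defaultdict(float)
        for tgt, src in pairs:
            for w in tgt:
                # Step 3: each occurrence of w is shared among the
                # competing source words, weighted by current t(w|s).
                z = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / z
                    count[(w, s)] += frac
                    total[s] += frac
        # Step 2: renormalise the counts into conditional probabilities.
        t = defaultdict(float, {k: v / total[k[1]] for k, v in count.items()})
    return t

t = ibm_model1(PAIRS)
print(round(t[("A", "a")], 3))  # approaches 1.0: 'a' alone explains 'A'
```

After a few iterations the over-counting disappears: because ‘B’ is fully explained by ‘b’ and ‘C’ by ‘c’, the mass for ‘A’ concentrates on ‘a’.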