跨語言資訊檢索導論 (Introduction to Cross-Language Information Retrieval) Hsin-Hsi Chen (陳信希) Department of Computer Science and Information Engineering National Taiwan University
Outline • Multilingual Environments • What is Cross-Language Information Retrieval? • Major Problems in CLIR • Major Approaches in CLIR • Case Study: CLIR in NPDM • Summary
Multilingual Collections • There are 6,703 languages listed in the Ethnologue • Digital libraries • OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages • World Wide Web • Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English
Real-World Language Populations (http://www.g11n.com/faq.htm): chart of speaker populations covering Spanish, Bengali, Arabic, Chinese, English, Japanese, Portuguese, Hindi, and Russian
(Statistics from Euro-Marketing Associates, 1998): chart of language populations covering Dutch, Portuguese, Italian, Spanish, Korean, Swedish, Chinese, French, German, and Japanese
Chinese population share (6.1%) < French population share (8.8%) (1998) (Statistics from Euro-Marketing Associates, 1999) http://www.glreach.com/globstats/
Internet Content (Network Wizards Jan 99 Internet Domain Survey): counts by language: English 33,878; Spanish 1,687; Swedish 1,684; Japanese 654; French 546; Chinese 546; Finnish 473; Dutch 458; German 432. 40% of Internet users do not understand English, yet 80% of Internet content is in English.
What is Cross-Language Information Retrieval? • Definition: Select information in one language based on queries in another. • Terminology • Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval) • Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)
MLIR Applications • Multilingual information access in multilingual countries, organizations, enterprises, etc. • Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary). • Monolingual users may retrieve images by taking advantage of multilingual captions. • Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.
Why is Cross-Language Information Retrieval Important? • More information workers with less time require fast access to global resources • global B2B interactions (virtual enterprises) • global B2C interactions (online trading, travelling) • time-critical information (translation comes too late)
History • 1970 Salton runs retrieval experiments with a small English/German dictionary • 1972 Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation • 1978 ISO Standard 5964 for developing multilingual thesauri (revised in 1985) • 1990 Latent Semantic Indexing (LSI) applied to CLIR
History (Continued) • 1994 1st PhD thesis on CLIR by Khaled Radwan • 1996 Similarity thesaurus applied to CLIR (ETH Zurich) • 1996 Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble) • 1997 Generalized Vector Space Model (GVSM) applied to CLIR (CMU)
History (Continued) • 1997 CLIR (Cross-Language Information Retrieval) track starts within TREC • 1998 NTCIR starts in Japan • 1999 TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S. • 2000 CLEF starts in Europe
Major Problems of CLIR • Queries and documents are in different languages. • translation • Words in a query may be ambiguous. • disambiguation • Queries are usually short. • expansion
Major Problems of CLIR (Continued) • Queries may have to be segmented. • segmentation • A document may contain text in several languages. • language identification
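As an illustration of the segmentation problem above, a common baseline is greedy forward maximum matching against a word list. This is a sketch only; the word list and query below are invented for illustration, not taken from the lecture.

```python
# Forward maximum matching: a classic baseline for segmenting
# unspaced text (e.g., Chinese queries) against a word list.
def maximum_matching(text, dictionary, max_word_len=4):
    """Greedily take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Illustrative word list and query (hypothetical data).
words = {"跨語言", "資訊", "檢索", "系統"}
print(maximum_matching("跨語言資訊檢索系統", words))
```

Greedy matching is fast but fails on overlap ambiguities, which is exactly why segmentation is listed as a CLIR problem.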
Enhancing Traditional Information Retrieval Systems • Which part(s) should be modified for CLIR? [Diagram: Documents -(1)-> Document Representation -(2)-> Comparison; Queries -(3)-> Query Representation -(4)-> Comparison]
Enhancing Traditional Information Retrieval Systems (Continued) • (1): text translation • (2): vector translation • (3): query translation • (4): term vector translation • (1) and (2), (3) and (4): interlingual form
What are the Problems? • Ambiguous terms (e.g., performance) • Multiword phrases may correspond to single-word terms (e.g., South Africa => 南非, Südafrika) • Coverage of the vocabulary • There is not a one-to-one mapping between two languages • Translating queries automatically (lack of syntax) • Translating documents automatically (performance, …) • Computing mixed result lists
Query Translation Based CLIR [Flow: English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents]
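A minimal sketch of this query-translation pipeline, assuming a toy bilingual dictionary and document collection; every name and data item here is illustrative, not from the lecture.

```python
# Sketch of the pipeline: English query -> translation device ->
# Chinese query -> monolingual Chinese retrieval system.
# Dictionary and documents are toy data.
BILINGUAL_DICT = {"information": ["資訊"], "retrieval": ["檢索"]}

def translate_query(english_terms):
    """Replace each English term with all of its dictionary translations."""
    chinese_terms = []
    for term in english_terms:
        chinese_terms.extend(BILINGUAL_DICT.get(term, []))
    return chinese_terms

def retrieve(chinese_terms, documents):
    """Rank documents by how many occurrences of the translated terms they contain."""
    scored = [(sum(doc.count(t) for t in chinese_terms), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

docs = ["資訊檢索系統", "自然語言處理"]
print(retrieve(translate_query(["information", "retrieval"]), docs))
```

Even this toy version exposes the problems listed earlier: any term missing from the dictionary is silently dropped, and an ambiguous term would inject every sense into the query.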
Translating the 400 Million non-English Pages of the WWW • ... would take 100,000 days (roughly 300 years) on one fast PC, or about 1 month on 3,600 PCs.
Knowledge-Based • Examples • Subject Thesaurus • Hierarchical and associative relations. • Unique term assigned to each node. • Concept List • Term space partitioned into concept spaces. • Term List • List of cross-language synonyms. • Lexicon • Machine readable syntax and/or semantics.
Ontology-Based Approaches • Exploit complex knowledge representations e.g., EuroWordNet • A Proposal for Conceptual Indexing using EuroWordNet
Dictionary-Based Approaches • Exploit machine-readable dictionaries. • Problems • translation ambiguity + target polysemy • coverage (unknown words, abbreviations, ...)
Dictionary-Based Approaches(Continued) • Issue 1: selection strategy • Select all. • Select N randomly. • Select best N. • Issue 2: which level • word • phrase
Selection Strategy: Select All • Hull and Grefenstette 1996 • Take the concatenation of all term translations.
E: politically motivated civil disturbances
F: troubles civils à caractère politique
trouble -> turmoil, discord, trouble, unrest, disturbance, disorder
civil -> civil, civilian, courteous
caractère -> character, nature
politique -> political, diplomatic, politician, policy
• Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of monolingual performance. • Errors: multi-word expressions and ambiguity
Selection Strategy: Select All (Continued) • Davis 1997 (TREC5) • Replace each English query term with all of its Spanish equivalents from the Collins bilingual dictionary. • Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.1% of monolingual performance
Evaluation Method • Average Precision (5-, 9-, 11-point) • Models:
(a) Monolingual: Spanish Query -> Mono IR Engine -> TREC Spanish Corpus
(b) All equivalents: English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine -> TREC Spanish Corpus
(c) POS disambiguation: English Query -> POS tagging + Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine -> TREC Spanish Corpus
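The 11-point interpolated average precision used in these evaluations can be sketched as follows; the ranked relevance judgments in the example are invented for illustration.

```python
# Sketch of 11-point interpolated average precision.
# `relevances` is 1/0 per ranked document; `total_relevant` is the
# number of relevant documents in the collection for this query.
def eleven_point_avg_precision(relevances, total_relevant):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / total_relevant)
    points = []
    for level in [i / 10 for i in range(11)]:
        # Interpolated precision: max precision at any recall >= level.
        attained = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(attained) if attained else 0.0)
    return sum(points) / 11

# Toy run: relevant documents at ranks 1 and 3, out of 2 relevant total.
print(eleven_point_avg_precision([1, 0, 1, 0], total_relevant=2))
```

The 5- and 9-point variants on the slide differ only in which recall levels are averaged.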
Selection Strategy: Select N • Simple word-by-word translation • Each query term is replaced by the word or group of words given for the first sense of the term's dictionary entry. • 50-60% drop in performance (average precision)
Selection Strategy: Select N (Continued) • word/phrase translation • Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary. • 30-50% worse than a good translation • Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements. • WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)
Selection Strategy: Select Best N • Hayashi, Kikui and Susaki 1997 • search for the dictionary entry corresponding to the longest sequence of words from left to right • choose the most frequently used word (or phrase) in a text corpus collected from the WWW • no performance report for this query translation approach • Davis 1997 (TREC5) • POS disambiguation • Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): nearly 67.3% of monolingual performance
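The three selection strategies can be sketched side by side over a toy dictionary entry. The translation lists and corpus frequencies below are illustrative, and "best N" here uses the target-corpus frequency heuristic in the spirit of Hayashi et al., not their exact method.

```python
# The three dictionary-translation selection strategies from the slides,
# applied to one toy dictionary entry. All data is invented.
import random

TRANSLATIONS = {"trouble": ["turmoil", "discord", "trouble", "unrest"]}
CORPUS_FREQ = {"turmoil": 3, "discord": 1, "trouble": 40, "unrest": 7}

def select_all(term):
    """Select all: take every listed translation (noisy but complete)."""
    return TRANSLATIONS.get(term, [])

def select_n_random(term, n, seed=0):
    """Select N randomly from the candidate translations."""
    candidates = TRANSLATIONS.get(term, [])
    return random.Random(seed).sample(candidates, min(n, len(candidates)))

def select_best_n(term, n):
    """Select best N: rank candidates by frequency in a target-language corpus."""
    candidates = TRANSLATIONS.get(term, [])
    return sorted(candidates, key=lambda t: CORPUS_FREQ.get(t, 0), reverse=True)[:n]

print(select_all("trouble"))
print(select_best_n("trouble", 2))
```

The reported numbers above follow the same ordering one would expect here: select-all injects the most noise, while frequency- or POS-informed selection recovers part of the lost precision.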
Corpus-Based Approaches • Categorization • Term-Aligned • Sentence-Aligned • Document-Aligned (Parallel, Comparable) • Unaligned • Usage • Setup Thesaurus • Vector Mapping
Term-Aligned Corpora • Fine-grained alignment in parallel corpora • Oard 1996 • Term alignment is a challenging problem. [Flow: Parallel Bilingual Corpus -> Co-occurrence Statistics -> Translation Tables; English Query -> Machine Translation System -> Spanish Query]
Sentence-Aligned Corpora • Davis & Dunning 1996 (TREC4) • High-frequency Terms
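One simple way to exploit a sentence-aligned corpus, in the spirit of the corpus-based approaches above, is to build a translation table from co-occurrence counts across aligned sentence pairs. The aligned pairs below are toy data, and real systems use far more robust estimation (e.g., statistical alignment models).

```python
# Sketch: derive a crude translation table from a sentence-aligned
# parallel corpus by counting cross-language co-occurrences.
from collections import Counter, defaultdict

# Toy sentence-aligned English/Chinese pairs (invented).
aligned = [
    (["information", "retrieval"], ["資訊", "檢索"]),
    (["information", "system"], ["資訊", "系統"]),
    (["retrieval", "system"], ["檢索", "系統"]),
]

cooccur = defaultdict(Counter)
for eng_sent, chi_sent in aligned:
    for e in eng_sent:
        for c in chi_sent:
            cooccur[e][c] += 1  # e and c appeared in aligned sentences

def best_translation(term):
    """Pick the target word that most often co-occurs with `term`."""
    return cooccur[term].most_common(1)[0][0]

print(best_translation("information"))
```

Because "資訊" co-occurs with "information" in two aligned pairs while the other words co-occur only once, the count table already separates the correct translation, which is the intuition behind the translation tables in the diagram above.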
Brief Summary • dictionary-based methods • Specialized vocabulary not in the dictionaries will not be translated. • Ambiguities will add extraneous terms to the query. • parallel/comparable corpora-based methods • Parallel corpora are not always available. • Available corpora tend to be relatively small or to cover only a small number of subjects. • Performance is dependent on how well the corpora are aligned.
Brief Summary (Continued) • Dictionaries are very useful. • They achieve about 50% of monolingual performance on their own • Parallel corpora have limitations. • Domain shifts • Term alignment accuracy • Dictionaries and corpora are complementary. • Dictionaries provide broad but shallow coverage. • Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.
Hybrid Methods • What knowledge can be employed? • lexical knowledge • corpus knowledge • ...
Hybrid Methods (Continued) • Query Expansion • Issue 1: context • pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents. • local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.
Hybrid Methods (Continued) • Issue 2: when • before query translation • after query translation
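Pseudo relevance feedback, the local-feedback expansion discussed above, can be sketched as follows. The documents and query are toy data; real systems weight candidate terms (e.g., with tf-idf) rather than using raw counts.

```python
# Sketch of pseudo relevance feedback (local feedback): expand the
# query with the most frequent terms from the top-ranked documents.
from collections import Counter

def expand_query(query_terms, top_documents, num_new_terms=2):
    """Add frequent terms from the top documents to the query."""
    counts = Counter()
    for doc in top_documents:
        counts.update(doc.split())
    # Keep the most frequent document terms not already in the query.
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:num_new_terms]

# Toy top-ranked documents for the query "cross language" (invented).
top_docs = ["cross language retrieval evaluation",
            "retrieval evaluation with parallel corpora"]
print(expand_query(["cross", "language"], top_docs))
```

In the CLIR setting the same routine can run before translation (on the source-language collection), after translation (on the target-language collection), or both, which is exactly the "when" issue on this slide.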
Hybrid Methods (Continued) • Ballesteros & Croft 1997 [Experiment: Original Spanish TREC Queries -> human translation -> English (BASE) Queries; pre-translation expansion: English Queries -> query expansion -> automatic dictionary translation -> Spanish Queries; post-translation expansion: English Queries -> automatic dictionary translation -> Spanish Queries -> query expansion; retrieval with INQUERY]
Hybrid Methods (Continued) • Performance Evaluation • pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%) • post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%) • combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%) • still 32% below a monolingual baseline
Cross-Language Evaluation Forum • A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST) • Extension of the CLIR track at TREC (1997-1999)
Main Goals • Promote research in cross-language system development for European languages by providing an appropriate infrastructure for: • CLIR system evaluation, testing and tuning • Comparison and discussion of results
CLEF 2000 Task Description • Four evaluation tracks in CLEF 2000 • multilingual information retrieval • bilingual information retrieval • monolingual (non-English) information retrieval • domain-specific IR
3M in Digital Libraries/Museums • Multi-media • Selecting suitable media to represent contents • Multi-linguality • Decreasing the language barriers • Multi-culture • Integrating multiple cultures