Cross-Language Information Retrieval
Applied Natural Language Processing
October 29, 2009
Douglas W. Oard
What Do People Search For? • Searchers often don’t clearly understand • The problem they are trying to solve • What information is needed to solve the problem • How to ask for that information • The query results from a clarification process • Dervin’s “sense making”: Need → Gap → Bridge
Taylor’s Model of Question Formation (end-user vs. intermediated search) • Q1: Visceral need • Q2: Conscious need • Q3: Formalized need • Q4: Compromised need (the query)
Design Strategies • Foster human-machine synergy • Exploit complementary strengths • Accommodate shared weaknesses • Divide-and-conquer • Divide task into stages with well-defined interfaces • Continue dividing until problems are easily solved • Co-design related components • Iterative process of joint optimization
Human-Machine Synergy • Machines are good at: • Doing simple things accurately and quickly • Scaling to larger collections in sublinear time • People are better at: • Accurately recognizing what they are looking for • Evaluating intangibles such as “quality” • Both are pretty bad at: • Mapping consistently between words and concepts
Supporting the Search Process
[Diagram: Source Selection (choose / predict / nominate) → Query Formulation → Query → Search (IR System) → Ranked List → Selection → Document Examination → Document → Delivery, with loops for query reformulation / relevance feedback and for source reselection]
Supporting the Search Process
[Diagram: the same pipeline from the system side — Document Acquisition → Collection → Document Indexing → Index, which the Search step consults to produce the Ranked List returned to the user]
Search Component Model
[Diagram: an Information Need is expressed through Query Formulation as a Query; Query Processing and Document Processing each apply a Representation Function to produce a Query Representation and a Document Representation; a Comparison Function combines the two into a Retrieval Status Value; Human Judgment of Utility sits above the automated components]
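To make the component roles concrete, here is a minimal Python sketch of the model above. The bag-of-words representation and inner-product comparison are illustrative stand-ins, and every name is hypothetical; the point is only the division of labor between representation functions and a comparison function that yields a retrieval status value.

```python
from typing import Dict, List

Representation = Dict[str, float]  # e.g., a bag-of-words vector


def represent(text: str) -> Representation:
    """Toy representation function: term frequencies over whitespace tokens."""
    rep: Representation = {}
    for term in text.lower().split():
        rep[term] = rep.get(term, 0.0) + 1.0
    return rep


def compare(query_rep: Representation, doc_rep: Representation) -> float:
    """Toy comparison function: inner product -> retrieval status value."""
    return sum(w * doc_rep.get(t, 0.0) for t, w in query_rep.items())


def rank(query: str, docs: List[str]) -> List[int]:
    """Rank documents by retrieval status value, highest first."""
    q = represent(query)
    scores = [compare(q, represent(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```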
Relevance • Relevance relates a topic and a document • Duplicates are equally relevant, by definition • Constant over time and across users • Pertinence relates a task and a document • Accounts for quality, complexity, language, … • Utility relates a user and a document • Accounts for prior knowledge
“Okapi” Term Weights
• TF component: tf(t,D) / (tf(t,D) + k1·((1 − b) + b·dl(D)/avdl))
• IDF component: log( (N − df(t) + 0.5) / (df(t) + 0.5) )
• The Okapi weight for term t in document D is the product of the two components
A Ranking Function: Okapi BM25
BM25(Q, D) = Σ over query terms t in Q of
  [ (k1 + 1)·tf(t,D) / (tf(t,D) + k1·((1 − b) + b·dl(D)/avdl)) ] × [ (k3 + 1)·qtf(t) / (k3 + qtf(t)) ] × log( (N − df(t) + 0.5) / (df(t) + 0.5) )
where tf(t,D) is the term frequency in document D, qtf(t) is the term frequency in the query, df(t) is the document frequency, dl(D) is the document length, avdl is the average document length, N is the number of documents, and k1, b, k3 are tuning constants.
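As a concrete illustration of the formula above, here is a minimal Python sketch of BM25 scoring. The function name, parameter defaults, and data structures are illustrative, not taken from the slides.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, df, n_docs, avdl,
               k1=1.2, b=0.75, k3=8.0):
    """Score one document against a query with Okapi BM25.

    query_terms / doc_terms: lists of tokens; df: dict mapping a term to its
    document frequency in the collection; n_docs: collection size;
    avdl: average document length.  k1, b, k3 are the usual tuning constants.
    """
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in qtf:
        if t not in tf or t not in df:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
        tf_part = ((k1 + 1) * tf[t]) / (tf[t] + k1 * ((1 - b) + b * dl / avdl))
        qtf_part = ((k3 + 1) * qtf[t]) / (k3 + qtf[t])
        score += idf * tf_part * qtf_part
    return score
```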
Estimating TF and DF for Query Terms
English term e1 translates to four French terms with these probabilities, term frequencies, and document frequencies:

Translation   p(f | e1)   TF    DF
f1            0.4         20    50
f2            0.3         5     40
f3            0.2         2     30
f4            0.1         50    200

Estimated TF(e1) = 0.4·20 + 0.3·5 + 0.2·2 + 0.1·50 = 14.9
Estimated DF(e1) = 0.4·50 + 0.3·40 + 0.2·30 + 0.1·200 = 58
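A short Python sketch of this translation-weighted estimate, mirroring the worked example above; the function name and data layout are illustrative.

```python
def translated_tf_df(translations, tf, df):
    """Estimate TF and DF for a query term from its translations.

    translations: dict mapping a target-language term to p(target | source);
    tf / df: dicts giving each target term's frequency in one document and
    its document frequency in the collection.
    """
    est_tf = sum(p * tf.get(f, 0) for f, p in translations.items())
    est_df = sum(p * df.get(f, 0) for f, p in translations.items())
    return est_tf, est_df


# Worked example from the slide:
translations = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}
print(translated_tf_df(translations, tf, df))  # (14.9, 58.0)
```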
Learning to Translate • Lexicons • Phrase books, bilingual dictionaries, … • Large text collections • Translations (“parallel”) • Similar topics (“comparable”) • Similarity • Similar pronunciation, similar users • People
[Image: the Rosetta Stone — the same text in Hieroglyphic, Demotic, and Greek]
Statistical Machine Translation
Spanish: Señora Presidenta , había pedido a la administración del Parlamento que garantizase
English: Madam President , I had asked the administration to ensure that
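Word translation probabilities are typically learned from sentence pairs like the one above; GIZA++ (used in the experiment setup below) implements the IBM models for this. Purely as an illustrative sketch, and not the GIZA++ code path, here is IBM Model 1 EM training in Python; the NULL-word convention is standard, but the function itself is hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate word translation probabilities t(f | e) with IBM Model 1 EM.

    sentence_pairs: list of (english_tokens, foreign_tokens) pairs taken
    from aligned sentences.  Returns a dict t[(f, e)] = p(f | e).  A NULL
    English word lets foreign words align to nothing.
    """
    NULL = "<NULL>"
    f_vocab = {f for _, fs in sentence_pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for es, fs in sentence_pairs:
            es = [NULL] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        for (f, e), c in count.items():   # M-step: renormalize
            t[(f, e)] = c / total[e]
    return dict(t)
```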
Bidirectional Translation
Query: wonders of ancient world (CLEF Topic 151); translations of “wonders”:
• Bidirectional: merveilles/0.92, merveille/0.03, emerveille/0.03, merveilleusement/0.02
• Unidirectional: se/0.31, demande/0.24, demander/0.08, peut/0.07, merveilles/0.04, question/0.02, savoir/0.02, on/0.02, bien/0.01, merveille/0.01, pourrait/0.01, si/0.01, sur/0.01, me/0.01, t/0.01, emerveille/0.01, ambition/0.01, merveilleusement/0.01, veritablement/0.01, cinq/0.01, hier/0.01
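One simple way to exploit both translation directions is to reward candidates that translate back to the original term, which concentrates probability mass on translations like “merveilles.” The sketch below scores each candidate by p(f|e)·p(e|f) and renormalizes; this is only an illustrative combination under that assumption, not necessarily the aggregation behind the numbers on the slide.

```python
def bidirectional_translations(p_f_given_e, p_e_given_f, e, top_n=None):
    """Combine two unidirectional translation tables for one source term.

    p_f_given_e: dict e -> {f: p(f|e)};  p_e_given_f: dict f -> {e: p(e|f)}.
    Each candidate f is scored by p(f|e) * p(e|f), then scores are
    renormalized so they sum to one.
    """
    scores = {}
    for f, p_fe in p_f_given_e.get(e, {}).items():
        p_ef = p_e_given_f.get(f, {}).get(e, 0.0)
        if p_fe * p_ef > 0:
            scores[f] = p_fe * p_ef
    z = sum(scores.values())
    if z == 0:
        return []
    ranked = sorted(((f, s / z) for f, s in scores.items()), key=lambda x: -x[1])
    return ranked[:top_n] if top_n else ranked
```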
Experiment Setup • Test collections • Document processing • Stemming, accent removal (CLEF French) • Word segmentation, encoding conversion (TREC Chinese) • Stopword removal (all collections) • Training statistical translation models (GIZA++)

Parallel corpus        Europarl                 FBIS et al.
Languages              English-French           English-Chinese
# of sentence pairs    672,247                  1,583,807
Models (iterations)    M1(10), HMM(5), M4(5)    M1(10)
Pruning Translations
Translation probabilities: f1(0.32), f2(0.21), f3(0.11), f4(0.09), f5(0.08), f6(0.05), f7(0.04), f8(0.03), f9(0.03), f10(0.02), f11(0.01), f12(0.01)

Cumulative probability threshold    Translations kept
0.0 – 0.3                           f1
0.4 – 0.5                           f1 f2
0.6                                 f1 f2 f3
0.7                                 f1 f2 f3 f4
0.8                                 f1 f2 f3 f4 f5
0.9                                 f1 f2 f3 f4 f5 f6 f7
1.0                                 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
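A minimal sketch of this pruning rule in Python, assuming the translation probabilities sum to at most 1; the function name is illustrative.

```python
def prune_translations(translations, threshold):
    """Keep the most probable translations until their cumulative
    probability reaches the threshold.

    translations: dict mapping a translation to its probability.
    For the slide's example, threshold 0.9 keeps f1..f7 (cumulative 0.90).
    """
    kept, cumulative = [], 0.0
    for term, p in sorted(translations.items(), key=lambda x: -x[1]):
        kept.append(term)
        cumulative += p
        if cumulative >= threshold:
            break
    return kept
```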
Unidirectional without Synonyms (PSQ): Q → D
[Plots: retrieval effectiveness on CLEF French and TREC-5,6 Chinese]
• Statistical significance vs. monolingual (Wilcoxon signed-rank test)
• CLEF French: worse at peak
• TREC-5,6 Chinese: worse at peak
Bidirectional with Synonyms (DAMM): (Q) → (D) vs. Q → D
[Plots: retrieval effectiveness on CLEF French and TREC-5,6 Chinese]
• DAMM significantly outperformed PSQ
• DAMM is statistically indistinguishable from monolingual at peak
• IMM: nearly as good as DAMM for French, but not for Chinese
Indexing Time
[Plot: indexing time for dictionary-based vector translation, single Sun SPARC, 2001]
Key Capabilities • Map across languages • For human understanding • For automated processing

The Problem Space • Retrospective search • Web search • Specialized services (medicine, law, patents) • Help desks • Real-time filtering • Email spam • Web parental control • News personalization • Real-time interaction • Instant messaging • Chat rooms • Teleconferences
Making a Market • Multitude of potential applications • Retrospective search, email, IM, chat, … • Natural consequence of language diversity • Limiting factor is translation readability • Searchability is mostly a solved problem • Leveraging human translation has potential • Translation routing, volunteers, caching