Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001
Agenda • Questions • Overview • The information • The users • Cross-Language Search • User Interaction
The Grand Plan • Phase 1: What makes up an IR system? • perspectives on the elephant • Phase 2: Representations • words, ratings • Phase 3: Beyond English text • ideas applied in many settings
A Driving Example • Visual History Foundation • Interviews with Holocaust survivors • 39 years’ worth of audio/video • 32 languages; accented, emotional speech • 30 people, 2 years : $12 million • Joint project: MALACH • VHF, IBM, JHU, UMD • http://www.clsp.jhu.edu/research/malach
Information Access Information Use Translingual Search Translingual Browsing Translation Select Examine Query Document
A Little (Confusing) Vocabulary • Multilingual document • Document containing more than one language • Multilingual collection • Collection of documents in different languages • Multilingual system • Can retrieve from a multilingual collection • Cross-language system • Query in one language finds document in another • Translingual system • Queries can find documents in any language
Who needs Cross-Language Search? • When users can read several languages • Eliminate multiple queries • Query in most fluent language • Monolingual users can also benefit • If translations can be provided • If it suffices to know that a document exists • If text captions are used to search for images
Motivations • Commerce • Security • Social
Global Internet Hosts Source: Network Wizards Jan 99 Internet Domain Survey
Global Web Page Languages Source: Jack Xu, Excite@Home, 1999
European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
European Web Size Projection Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
Global Internet Audio Almost 2000 Internet-accessible Radio and Television Stations source: www.real.com, Feb 2000
13 Months Later About 2500 Internet-accessible Radio and Television Stations source: www.real.com, Mar 2001
User Needs Assessment • Who are the potential users? • What goals do we seek to support? • What language skills must we accommodate?
Global Languages Source: http://www.g11n.com/faq.html
Global Trade Billions of US Dollars (1999) Source: World Trade Organization 2000 Annual Report
Global Internet User Population • Charts for 2000 and 2005 (labeled segments: English, Chinese) • Source: Global Reach
Agenda • Questions • Overview • Cross-Language Search • User Interaction
The Search Process • Author: Choose Document-Language Terms, producing the Document • Monolingual Searcher: Choose Document-Language Terms, producing the Query • Cross-Language Searcher: Choose Query-Language Terms, Infer Concepts, Select Document-Language Terms, producing the Query • Query-Document Matching
Some history: from controlled vocabulary to free text • 1964 International Road Research • Multilingual thesauri • 1970 SMART • Dictionary-based free-text cross-language retrieval • 1978 ISO Standard 5964 (revised 1985) • Guidelines for developing multilingual thesauri • 1990 Latent Semantic Indexing • Corpus-based free-text translingual retrieval
Multilingual Thesauri • Build a cross-cultural knowledge structure • Cultural differences influence indexing choices • Use language-independent descriptors • Matched to language-specific lead-in vocabulary • Three construction techniques • Build it from scratch • Translate an existing thesaurus • Merge monolingual thesauri
Free Text CLIR • What to translate? • Queries or documents • Where to get translation knowledge? • Dictionary or corpus • How to use it?
Translingual Retrieval Architecture Chinese Term Selection Monolingual Chinese Retrieval 1: 0.72 2: 0.48 Language Identification Chinese Term Selection Chinese Query English Term Selection Cross-Language Retrieval 3: 0.91 4: 0.57 5: 0.36
Evidence for Language Identification • Metadata • Included in HTTP and HTML • Word-scale features • Which dictionary gets the most hits? • Subword features • Character n-gram statistics
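The subword evidence above (character n-gram statistics) can be sketched as follows. This is a minimal illustration, not the method used in the course: the tiny training samples, the trigram size, and the add-one smoothing are all assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profile(samples, n=3):
    """Count character n-grams over a list of sample strings for one language."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts


def log_likelihood(text, profile, n=3):
    """Score text under a profile using add-one smoothed n-gram probabilities."""
    total = sum(profile.values())
    vocab = len(profile) + 1  # one extra slot for unseen n-grams
    return sum(math.log((profile[g] + 1) / (total + vocab))
               for g in char_ngrams(text, n))


def identify(text, profiles, n=3):
    """Pick the language whose profile gives the text the highest score."""
    return max(profiles, key=lambda lang: log_likelihood(text, profiles[lang], n))


# Hypothetical toy training data; a real system would use far more text.
TRAINING = {
    "en": ["the crew cannot understand the commands",
           "where is the information about the users"],
    "fr": ["vous ne comprenez pas les instructions de bord",
           "où est la bibliothèque de la ville"],
}
profiles = {lang: train_profile(samples) for lang, samples in TRAINING.items()}

print(identify("the users and the documents", profiles))
print(identify("les documents de la ville", profiles))
```

Word-scale evidence ("which dictionary gets the most hits?") would follow the same pattern with whole words in place of character n-grams.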
Query-Language Retrieval Chinese Query Terms English Document Terms Monolingual Chinese Retrieval 3: 0.91 4: 0.57 5: 0.36 Document Translation
Example: Modular use of MT • Select a single query language • Translate every document into that language • Perform monolingual retrieval
Is Machine Translation Enough? TDT-3 Mandarin Broadcast News Systran Balanced 2-best translation
Document-Language Retrieval Chinese Query Terms Query Translation English Document Terms Monolingual English Retrieval 3: 0.91 4: 0.57 5: 0.36
Query vs. Document Translation • Query translation • Efficient for short queries (not relevance feedback) • Limited context for ambiguous query terms • Document translation • Rapid support for interactive selection • Need only be done once (if query language is same) • Merged query and document translation • Can produce better effectiveness than either alone
The Short Query Challenge Source: Jack Xu, Excite@Home, 1999
Interlingual Retrieval Chinese Query Terms Query Translation English Document Terms Interlingual Retrieval 3: 0.91 4: 0.57 5: 0.36 Document Translation
Key Challenges in CLIR • Wrong segmentation • Which translation? • No translation? • Example terms from the figure: probe survey take samples cymbidium goeringii oil petroleum restrain
Sources of Evidence for Translation • Corpus statistics • Lexical resources • Algorithms • The user
Hieroglyphic Egyptian Demotic Greek
Types of Bilingual Corpora • Parallel corpora: translation-equivalent pairs • Document pairs • Sentence pairs • Term pairs • Comparable corpora: topically related • Collection pairs • Document pairs
Exploiting Parallel Corpora • Automatic acquisition of translation lexicons • Statistical machine translation • Corpus-guided translation selection • Document-linked techniques
Lexicon acquisition from the WWW • Components shown: STRAND, chunk-level alignment, word alignment (GIZA), association stats, frequency-based thresholding • Figures shown: 3378 document pairs, 63K chunks, 500K words, 170K entries • Example pair: … cannot understand crew commands… / ne comprenez pas les instructions de l’ equip…
Corpus-Guided Translation Selection • Rank translation alternatives for each term • Pick English word e that maximizes Pr(e) • Pick English word e that maximizes Pr(e|c) • Pick English words e1…en maximizing Pr(e1…en|c1…cm) = statistical machine translation! • Unigram language models are easy to build • Can use the collection being searched • Limits uncommon translation and spelling error effects
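A minimal sketch of the simplest rule above: pick the English word e that maximizes Pr(e), with Pr(e) estimated as a unigram model from the collection being searched. The candidate lists reuse alternatives that appear on the "Key Challenges in CLIR" slide; the three-document collection and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def unigram_model(collection):
    """Estimate Pr(e) as relative word frequency in the searched collection."""
    counts = Counter(tok for doc in collection for tok in doc.lower().split())
    total = sum(counts.values())
    return lambda word: counts[word] / total if total else 0.0


def select_translation(candidates, pr):
    """Return the candidate translation with the highest unigram probability."""
    return max(candidates, key=pr)


# Hypothetical English collection standing in for the documents being searched.
COLLECTION = [
    "oil prices rose as petroleum exports fell",
    "the survey of oil fields continues",
    "oil companies survey new fields",
]
pr = unigram_model(COLLECTION)

print(select_translation(["oil", "petroleum", "restrain"], pr))
print(select_translation(["probe", "survey"], pr))
```

Because a candidate that never occurs in the collection gets probability zero, rare or misspelled dictionary entries are never selected over attested ones, which is the effect the last bullet describes. Conditioning on the source term, Pr(e|c), would additionally require translation probabilities from a parallel corpus.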
Corpus-Based CLIR Example French Query Terms Top ranked French Documents Top ranked English Documents Parallel Corpus English Translations French IR System English IR System
Exploiting Comparable Corpora • Blind relevance feedback • Existing CLIR technique + collection-linked corpus • Lexicon enrichment • Existing lexicon + collection-linked corpus • Dual-space techniques • Document-linked corpus
Blind Relevance Feedback • Augment a representation with related terms • Find related documents, extract distinguishing terms • Multiple opportunities: • Before doc translation: Enrich the vocabulary • After doc translation: Mitigate translation errors • Before query translation: Improve the query • After query translation: Mitigate translation errors • Short queries get the most dramatic improvement
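The feedback loop above (find related documents, extract distinguishing terms, augment the representation) can be sketched as below. This is an illustrative toy under stated assumptions: the overlap-count ranking stands in for a real ranking function, the tf × log(N/df) term weight is one common choice rather than the one used in the experiments, and the documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query_terms, docs):
    """Rank document indices by query-term overlap (stand-in for a real IR system)."""
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(tokenize(doc))
        scored.append((sum(tf[t] for t in query_terms), i))
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ordered if score > 0]


def expand_query(query_terms, docs, top_docs=2, new_terms=2):
    """Add the most distinguishing terms from the top-ranked documents to the query."""
    top = rank(query_terms, docs)[:top_docs]
    # Document frequency over the whole collection.
    df = Counter(t for doc in docs for t in set(tokenize(doc)))
    weight = Counter()
    for i in top:
        for term, freq in Counter(tokenize(docs[i])).items():
            if term not in query_terms:
                weight[term] += freq * math.log(len(docs) / df[term])
    return list(query_terms) + [t for t, _ in weight.most_common(new_terms)]


# Hypothetical four-document collection.
DOCS = [
    "oil petroleum exports pipeline",
    "oil pipeline leak pipeline repair",
    "football match results",
    "election results announced",
]

print(expand_query(["oil"], DOCS, new_terms=1))
```

The same function can be applied before query translation (against a query-language collection) or after it (against the document-language collection), matching the "multiple opportunities" listed above.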
English Query Example: Post-Translation “Document Expansion” IR System Document to be Indexed Term Selection Top 5 IR System Results Single Document Term-to-Term Translation English Corpus Automatic Segmentation Mandarin Chinese Documents
Post-Translation Document Expansion Mandarin Newswire Text
Why Document Expansion Works • Story-length objects provide useful context • Ranked retrieval finds signal amid the noise • Selective terms discriminate among documents • Enrich index with low DF terms from top documents • Similar strategies work well in other applications • CLIR query translation • Monolingual spoken document retrieval
Lexicon Enrichment • Example: … Cross-Language Evaluation Forum … ? … Solto Extunifoc Tanixul Knadu … • Similar techniques can guide translation selection
Lexicon Enrichment • Use a bilingual lexicon to align “context regions” • Regions with high coincidence of known translations • Pair unknown terms with unmatched terms • Unknown: language A, not in the lexicon • Unmatched: language B, not covered by translation • Treat the most surprising pairs as new translations • Not yet tested in a CLIR application
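Learning From Document Pairs • Table columns: English Terms E1 E2 E3 E4 E5, Spanish Terms S1 S2 S3 S4 • Table rows (counts as extracted; column positions not recoverable): Doc 1: 4 2 2 1 • Doc 2: 8 4 4 2 • Doc 3: 2 2 1 2 • Doc 4: 2 1 2 1 • Doc 5: 4 1 2 1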
Similarity “Thesauri” • For each term, find most similar in other language • Terms E1 & S1 (or E3 & S4) are used in similar ways • Treat top related terms as candidate translations • Applying dictionary-based techniques • Performed well on comparable news corpus • Automatically linked based on date and subject codes
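A minimal sketch of the idea above: represent each term by its counts across the linked document pairs, then treat the most similar term in the other language as a candidate translation. Cosine similarity is an assumed choice of measure, and the count vectors are hypothetical; they are not the values from the "Learning From Document Pairs" table, only arranged so that E1 pairs with S1 and E3 with S4 as on the slide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_translations(src_vectors, tgt_vectors, top_n=1):
    """For each source term, list the top_n most similar target-language terms."""
    result = {}
    for term, vec in src_vectors.items():
        ranked = sorted(tgt_vectors,
                        key=lambda t: cosine(vec, tgt_vectors[t]),
                        reverse=True)
        result[term] = ranked[:top_n]
    return result


# Hypothetical counts: one entry per linked document pair (Doc 1 to Doc 5).
ENGLISH = {"E1": [4, 8, 0, 0, 4], "E3": [0, 0, 2, 2, 0]}
SPANISH = {"S1": [2, 4, 0, 0, 1], "S2": [1, 0, 2, 0, 1], "S4": [0, 0, 1, 1, 0]}

print(candidate_translations(ENGLISH, SPANISH))
```

The resulting candidate lists can then be used like any bilingual dictionary, which is what "applying dictionary-based techniques" refers to.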