Lecture 28: CLIR

Principles of Information Retrieval • Lecture 28: CLIR • Prof. Ray Larson • University of California, Berkeley • School of Information Management & Systems • Tuesday and Thursday 10:30 am - 12:00 pm • Spring 2006 • http://www.sims.berkeley.edu/academics/courses/is240/s06/

Mini-TREC • Proposed Schedule • February 14-16 – Database and previous Queries • March 2 – report on system acquisition and setup • March 2, New Queries for testing… • April 20, Results due • April 25, Results and system rankings (sort of) • May 9, Group reports and discussion

Results (with bad runs)

With new runs…

Mean Average Precision

Today • Review • NLP for IR • Text Summarization • Cross-Language Information Retrieval • Introduction • Cross-Language EVIs • Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

Natural Language Processing and IR • The main approach in applying NLP to IR has been to attempt to address • Phrase usage vs individual terms • Search expansion using related terms/concepts • Attempts to automatically exploit or assign controlled vocabularies

NLP and IR • Much early research showed that (at least in the restricted test databases tested) • Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically) • Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

NLP and IR • Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods • E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

General Framework of NLP • Morphological and Lexical Processing: "John runs." -> John (P-N) run+s (V 3-pre, N plu) • Syntactic Analysis: S -> NP VP, with NP -> P-N (John) and VP -> V (run) • Semantic Analysis: Pred: RUN, Agent: John • Context processing, Interpretation: "John is a student. He runs." • Slide from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Using NLP • Strzalkowski (in Reader) • [Figure labels: Text; NLP; repres; Dbase search] • NLP: TAGGER -> PARSER -> TERMS

Using NLP • INPUT SENTENCE: The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. • TAGGED SENTENCE: The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

Using NLP • TAGGED & STEMMED SENTENCE: the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per

Using NLP • PARSED SENTENCE: [assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]

Using NLP • EXTRACTED TERMS & WEIGHTS • president 2.623519 • soviet 5.416102 • president+soviet 11.556747 • president+former 14.594883 • hero 7.896426 • hero+local 14.314775 • invade 8.435012 • tank 6.848128 • tank+invade 17.402237 • tank+russian 16.030809 • russian 7.383342 • wisconsin 7.785689
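Several of the pair terms on this slide (president+former, hero+local, tank+russian) can be approximated from the tagged and stemmed sentence alone. The following is a minimal sketch, not Strzalkowski's parser-based method: it attaches each run of adjectives to the noun that follows. The function names are hypothetical, the slide's weights are not reproduced, and pairs that depend on the parse, such as tank+invade, are out of reach for this heuristic.

```python
TAGGED = (
    "the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt "
    "local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn "
    "invade/vbd wisconsin/np ./per"
)


def parse_tagged(sentence):
    """Split 'word/tag' tokens into (word, tag) tuples."""
    return [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]


def head_modifier_pairs(tokens):
    """Pair each noun with the adjectives that immediately precede it.

    Hypothetical heuristic: a stand-in for the head+modifier pairs that
    the slides derive from a full parse.
    """
    pairs = []
    pending = []  # adjectives seen since the last non-adjective token
    for word, tag in tokens:
        if tag == "jj":
            pending.append(word)
        elif tag in ("nn", "np"):
            pairs.extend(f"{word}+{adj}" for adj in pending)
            pending = []
        else:
            pending = []
    return pairs


if __name__ == "__main__":
    print(head_modifier_pairs(parse_tagged(TAGGED)))
```

Running it on the slide's sentence yields four of the five pair terms listed above; only the subject+verb pair is missing.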

NLP & IR • Indexing • Use of NLP methods to identify phrases • Test weighting schemes for phrases • Use of more sophisticated morphological analysis • Searching • Use of two-stage retrieval • Statistical retrieval • Followed by more sophisticated NLP filtering
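The two-stage search idea can be sketched as follows. This is an illustration under loud assumptions: the documents are invented, the first stage is a bare term-frequency score, and the second stage uses an adjacency (substring) check as a crude stand-in for "more sophisticated NLP filtering".

```python
from collections import Counter

# Hypothetical toy collection, for illustration only.
DOCS = {
    "d1": "the russian tank invaded the town",
    "d2": "a fish tank needs a russian made filter",
    "d3": "local hero honored by the president",
}


def stage_one(query, docs, k=2):
    """Statistical stage: rank documents by summed query-term frequency."""
    q_terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in q_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def stage_two(phrase, candidates, docs):
    """Filtering stage: keep candidates in which the phrase words are adjacent.

    A substring test is only a placeholder for real linguistic filtering.
    """
    return [d for d in candidates if phrase.lower() in docs[d]]


if __name__ == "__main__":
    candidates = stage_one("russian tank", DOCS)
    print(candidates, stage_two("russian tank", candidates, DOCS))
```

Both d1 and d2 survive the statistical stage because each contains both query words; only d1 survives the filter, since d2 is about a fish tank.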

NLP & IR • New “Question Answering” track at TREC has been exploring these areas • Usually statistical methods are used to retrieve candidate documents • NLP techniques are used to extract the likely answers from the text of the documents

Introduction to CLIR • Slides from Doug Oard…

Cross-Language IR • Given a query expressed in one language • Find info that may be expressed in another • Electronic texts • Document images • Recorded speech [101] • Sign language • [Figure labels: English Query; Retrieval System; French Documents]

Why Do Cross-Language IR? • When users can read several languages • Eliminates multiple queries • Query in most fluent language • Monolingual users can also benefit • If translations can be provided • If it suffices to know that a document exists • If text captions are used to search for images

What We Know • Dictionaries are very useful • Easily get to 50% of monolingual IR effectiveness • We can get to about 75% using: • Part-of-speech tags • Pseudo-relevance feedback • Phrase indexing • Multilingual training corpora are also useful • When the corpus is from the right domain

Related Issues • Multiscript text processing [12] • Character sets, writing system, direction, ... • Language identification [109] • Markup, detection • Language-specific processing [103] • Stemming, morphological roots, compounds, … • Document translation [51]

Cross-Language Text Retrieval • [Taxonomy figure; node labels: Query Translation; Document Translation; Text Translation; Vector Translation; Controlled Vocabulary; Free Text; Knowledge-based; Corpus-based; Ontology-based; Dictionary-based; Thesaurus-based; Term-aligned; Sentence-aligned; Document-aligned; Unaligned; Parallel; Comparable]

Free Text Developments • 1970, 1973 Salton • Hand coded bilingual dictionaries • 1990 Latent Semantic Indexing [53] • French/English using Hansard training corpus • 1994 European multilingual IR project [84] • Medium-scale recall/precision evaluation • 1996 SIGIR Cross-lingual IR workshop • And over 10 conferences and workshops since!

How Controlled Vocabulary Works • Thesaurus design [102] • Design a knowledge structure for domain • Assign a unique “descriptor” to each concept • Include “scope notes” and “lead-in vocabulary” • Document indexing • Read the document, assign appropriate descriptors • Retrieval • Select desired descriptors, use exact match retrieval

Multilingual Thesauri • Adapt the knowledge structure • Cultural differences influence indexing choices • Use language-independent descriptors • Matched to a unique term in each language • Three construction techniques [46] • Build it from scratch • Translate an existing thesaurus • Merge monolingual thesauri
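A language-independent descriptor matched to one term per language can be sketched as a small lookup structure. Everything below is hypothetical: the descriptor IDs, the English and French labels, and the document index are invented for illustration, and retrieval is plain exact match on descriptors, as described on the controlled-vocabulary slide.

```python
# Hypothetical descriptors and labels, for illustration only.
THESAURUS = {
    "C001": {"en": "information retrieval", "fr": "recherche d'information"},
    "C002": {"en": "machine translation", "fr": "traduction automatique"},
}

# Documents are indexed by descriptor, whatever language they are written in.
INDEX = {
    "doc_en_1": {"C001"},
    "doc_fr_1": {"C001", "C002"},
    "doc_fr_2": {"C002"},
}


def descriptor_for(term, lang):
    """Return the descriptor whose label in `lang` equals `term`, if any."""
    for descriptor, labels in THESAURUS.items():
        if labels.get(lang) == term.lower():
            return descriptor
    return None


def search(term, lang):
    """Exact-match retrieval: every document assigned the term's descriptor."""
    descriptor = descriptor_for(term, lang)
    if descriptor is None:
        return []
    return sorted(doc for doc, descs in INDEX.items() if descriptor in descs)


if __name__ == "__main__":
    print(search("traduction automatique", "fr"))
    print(search("information retrieval", "en"))
```

Because matching happens on descriptors rather than words, the English query also returns the French document indexed under the same concept.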

Advantages over Free Text • High-quality concept-based indexing • Descriptors need not appear in the document • Knowledge-guided searching • Good thesauri capture expert domain knowledge • Excellent cross-language effectiveness • Up to 100% of monolingual effectiveness • Understandable retrieval results • Efficient implementation

Limitations • Costly to create • Design knowledge structure, index each document • Costly to maintain • Document indexing, vocabulary and concept change • Hard to use • Vocabulary choice, knowledge structure navigation • Limited scope • Domain must be chosen at design time

Query vs. Document Translation • Query translation • Very efficient for short queries • Not as big an advantage for relevance feedback • Hard to resolve ambiguous query terms • Document translation • May be needed by the selection interface • And supports adaptive filtering well • Slow, but only need to do it once per document • Poor scale-up to large numbers of languages

Document Translation Example • Approach • Select a single query language • Translate every document into that language • Perform monolingual retrieval • Long documents provide enough context • And many translation errors do not hurt retrieval • Much of the generation effort is wasted • And choosing a single translation can hurt

Query Translation Example • Select controlled vocabulary search terms • Retrieve documents in desired language • Form monolingual query from the documents • Perform a monolingual free text search • [Figure labels: Information Need; Thesaurus; Controlled Vocabulary; French Query Terms; Multilingual Text Retrieval System; English Abstracts; Alta Vista; English Web Pages]

Machine Readable Dictionaries • Based on printed bilingual dictionaries • Becoming widely available • Used to produce bilingual term lists • Cross-language term mappings are accessible • Sometimes listed in order of most common usage • Some knowledge structure is also present • Hard to extract and represent automatically • The challenge is to pick the right translation

Unconstrained Query Translation • Replace each word with every translation • Typically 5-10 translations per word • About 50% of monolingual effectiveness • Main problem is ambiguity • Example: Fly (English) • 8 word senses (e.g., to fly a flag) • 13 Spanish translations (enarbolar, ondear, …) • 38 English retranslations (hoist, brandish, lift…)
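Unconstrained query translation amounts to a dictionary lookup with no sense selection. A minimal sketch, assuming a toy English-to-Spanish term list: "enarbolar" and "ondear" come from the slide, while the remaining entries and the function name are illustrative assumptions. Passing unknown words through untranslated is also a choice made here, not something the slides specify.

```python
# Toy English-to-Spanish term list. "enarbolar" and "ondear" come from the
# slide; every other entry is an illustrative assumption.
TERM_LIST = {
    "fly": ["enarbolar", "ondear", "volar"],
    "flag": ["bandera"],
}


def translate_query(query, term_list):
    """Replace each query word with every listed translation.

    Words missing from the term list are kept as-is (an assumption).
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated


if __name__ == "__main__":
    print(translate_query("fly flag", TERM_LIST))
```

Even in this toy case the two-word query grows to four terms, one of which ("volar", to fly through the air) belongs to the wrong sense. That growth is the ambiguity problem the slide describes.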

Phrase Indexing • Improves retrieval effectiveness two ways • Phrases are less ambiguous than single words • Idiomatic phrases translate as a single concept • Three ways to identify phrases • Semantic (e.g., appears in a dictionary) • Syntactic (e.g., parse as a noun phrase) • Cooccurrence (words found together often) • Semantic phrase results are impressive
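The "semantic" case, where a phrase appears in the dictionary and translates as a single concept, can be sketched with greedy longest-match lookup. The English-to-French entries below are hypothetical, and longest-match is one plausible strategy rather than the method behind the results mentioned on the slide.

```python
# Hypothetical English-to-French entries, for illustration only.
PHRASES = {
    ("hot", "dog"): ["hot-dog"],
}
WORDS = {
    "hot": ["chaud"],
    "dog": ["chien"],
    "stand": ["stand", "kiosque"],
}


def translate_with_phrases(query, phrases, words, max_len=3):
    """Translate dictionary phrases as units, falling back to single words."""
    tokens = query.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in phrases:
                out.extend(phrases[key])
                i += n
                break
        else:
            # No phrase starts here: translate the single word.
            out.extend(words.get(tokens[i], [tokens[i]]))
            i += 1
    return out


if __name__ == "__main__":
    print(translate_with_phrases("hot dog stand", PHRASES, WORDS))
```

Without the phrase entry, "hot dog" would be translated word by word into "chaud" and "chien", which is the kind of error phrase handling avoids.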

Types of Bilingual Corpora • Parallel corpora: translation-equivalent pairs • Document pairs • Sentence pairs • Term pairs • Comparable corpora • Content-equivalent document pairs • Unaligned corpora • Content from the same domain

Generating Parallel Corpora • Parallel corpora are naturally domain-tuned • Finding one for the right domain may be hard • Alternative is to build one • Start with a monolingual corpus • Automatic machine translation for second language • Worthwhile when IR technique is faster than MT • If translation errors don’t hurt the IR technique • Good results with Latent Semantic Indexing

Pseudo-Relevance Feedback • Enter query terms in French • Find top French documents in parallel corpus • Construct a query from English translations • Perform a monolingual free text search • [Figure labels: French Query Terms; French Text Retrieval System; Parallel Corpus; Top ranked French Documents; English Translations; Alta Vista; English Web Pages]
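The first three steps on this slide can be sketched as below. The parallel corpus, the stopword list, and the scoring (count of shared query terms) are all assumptions made for illustration; the final monolingual search is left out because it is an ordinary free text search over the resulting English terms.

```python
from collections import Counter

# Hypothetical parallel corpus: (French text, English translation) pairs.
PARALLEL = [
    ("le chat dort sur le tapis", "the cat sleeps on the rug"),
    ("le chat noir chasse la souris", "the black cat hunts the mouse"),
    ("la bourse de paris monte", "the paris stock market rises"),
]
STOPWORDS = {"the", "on", "a", "of"}


def cross_language_prf(french_query, corpus, top_docs=2, top_terms=3):
    """Build an English query from the translations of top French matches."""
    q = set(french_query.lower().split())
    # Step 1: rank the French sides by the number of shared query terms.
    ranked = sorted(
        corpus,
        key=lambda pair: len(q & set(pair[0].split())),
        reverse=True,
    )
    # Step 2: pool the English sides of the top-ranked matching documents.
    counts = Counter()
    for fr, en in ranked[:top_docs]:
        if q & set(fr.split()):
            counts.update(w for w in en.split() if w not in STOPWORDS)
    # Step 3: the most frequent English terms become the new query.
    return [term for term, _ in counts.most_common(top_terms)]


if __name__ == "__main__":
    print(cross_language_prf("chat", PARALLEL))
```

Note that no bilingual dictionary is consulted: the translation knowledge comes entirely from the document alignment in the parallel corpus.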

Similarity-Based Dictionaries • Automatically developed from aligned documents • Reflects language use in a specific domain • For each term, find most similar in other language • Retain only the top few (5 or so) • Performs as well as dictionary-based techniques • Evaluated on a comparable corpus of news stories [98] • Stories were automatically linked based on date and subject
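One way to read "for each term, find most similar in other language" is to compare the sets of aligned document pairs in which terms occur. The sketch below assumes binary occurrence vectors and cosine similarity, which the slides do not specify; the aligned pairs and function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical aligned document pairs (English, French).
ALIGNED = [
    ("cat sleeps", "chat dort"),
    ("cat eats", "chat mange"),
    ("dog sleeps", "chien dort"),
    ("dog eats", "chien mange"),
]


def occurrence_vectors(texts):
    """Map each term to the set of document-pair indices it occurs in."""
    vectors = defaultdict(set)
    for i, text in enumerate(texts):
        for term in text.split():
            vectors[term].add(i)
    return vectors


def cosine(a, b):
    """Cosine similarity of two binary vectors given as index sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_dictionary(pairs, top_n=1):
    """For each source term, keep the top_n most similar target terms."""
    src = occurrence_vectors(p[0] for p in pairs)
    tgt = occurrence_vectors(p[1] for p in pairs)
    result = {}
    for s_term, s_docs in src.items():
        ranked = sorted(tgt, key=lambda t: (-cosine(s_docs, tgt[t]), t))
        result[s_term] = ranked[:top_n]
    return result


if __name__ == "__main__":
    print(similarity_dictionary(ALIGNED))
```

The slide's "top few (5 or so)" corresponds to setting `top_n=5`; the toy corpus here is too small for more than one sensible candidate per term.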

Latent Semantic Indexing • Designed for better monolingual effectiveness • Works well across languages too [27] • Cross-language is just a type of term choice variation • Produces short dense document vectors • Better than long sparse ones for adaptive filtering • Training data needs grow with dimensionality • Not as good for retrieval efficiency • Always 300 multiplications, even for short queries
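The "300 multiplications" figure follows from the shape of the reduced vectors. As a sketch in standard LSI notation (the symbols below are assumptions, not taken from the slides): let the term-document matrix $X$ have a rank-$k$ truncated SVD, let $q$ be a query in term space, and let $\hat{d}$ be a document's $k$-dimensional representation.

```latex
\begin{align}
  X &\approx U_k \Sigma_k V_k^{\top}
     && \text{(rank-$k$ truncated SVD)} \\
  \hat{q} &= \Sigma_k^{-1} U_k^{\top} q
     && \text{(fold the query into the $k$-dimensional space)} \\
  \hat{q} \cdot \hat{d} &= \sum_{i=1}^{k} \hat{q}_i \, \hat{d}_i
     && \text{($k$ multiplications per document)}
\end{align}
```

Because $\hat{q}$ is dense, the inner product costs $k$ multiplications per document no matter how few terms the original query had; with $k = 300$ that gives the figure on the slide. A sparse term-space match, by contrast, needs at most one multiplication per query term for each document.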

Cooccurrence-Based Dictionaries • Align terms using cooccurrence statistics • How often do a term pair occur in sentence pairs? • Weighted by relative position in the sentences • Retain term pairs that occur unusually often • Useful for query translation • Excellent results when the domain is the same • Also practical for document translation • Term use variations to reinforce good translations
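Aligning terms by cooccurrence in sentence pairs can be sketched as follows. The Dice coefficient is swapped in here as the test for "unusually often"; the slides do not name a statistic, and the position weighting they mention is omitted. The sentence pairs and the threshold are hypothetical.

```python
from collections import Counter

# Hypothetical sentence-aligned pairs (English, French).
SENTENCE_PAIRS = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]


def dice_dictionary(pairs, threshold=0.9):
    """Keep term pairs whose Dice coefficient meets the threshold.

    Dice = 2 * (pairs containing both) / (pairs with s + pairs with t).
    Relative position in the sentence is ignored in this sketch.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_terms, t_terms = set(src.split()), set(tgt.split())
        src_count.update(s_terms)
        tgt_count.update(t_terms)
        pair_count.update((s, t) for s in s_terms for t in t_terms)
    kept = {}
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= threshold:
            kept[(s, t)] = dice
    return kept


if __name__ == "__main__":
    for pair, score in sorted(dice_dictionary(SENTENCE_PAIRS).items()):
        print(pair, score)
```

Frequent but wrong pairings such as ("the", "chat") cooccur in two sentence pairs yet score only 0.8, so the threshold removes them while keeping the five correct pairs.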

Language Identification • Can be specified using metadata • Included in HTTP and HTML • Determined using word-scale features • Which dictionary gets the most hits? • Determined using subword features • Letter n-grams in electronic and printed text • Phoneme n-grams in speech
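Identification from subword features can be sketched with letter trigram profiles. The training samples below are tiny and invented, and cosine similarity between trigram counts is an assumed scoring rule; a usable identifier would need far more text per language.

```python
import math
from collections import Counter

# Tiny hypothetical training samples; a real system would use far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps on the mat",
    "fr": "le chat est sur la table et le chien dort dans la maison",
}


def trigram_profile(text):
    """Count letter trigrams, padding with spaces to mark word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text's."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


if __name__ == "__main__":
    print(identify("the dog and the cat"))
    print(identify("le chien et le chat"))
```

The word-scale alternative on the slide ("which dictionary gets the most hits?") would replace the trigram profiles with per-language word sets and count matches instead.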

Research Directions • User needs assessment • Evaluation • Corpus construction • Word sense disambiguation • System integration • Probabilistic models • Adaptive filtering

Evaluation • Most critical need is for side-by-side tests • TREC did this for French/German/Italian • Domain shift metric • Domain shift hurts corpus-based techniques • Need a way to measure severity of the shift • Test collections for adaptive filtering • From cross-language recall/precision evaluation

Corpus Construction • Corpus-based techniques have great potential • Parallel corpora are rare and expensive • Find it, reverse engineer the links, clean it up • Unlinked corpora are of limited value • Context linking research could change that [77] • Comparable corpora offer middle ground • Need to develop automatic linking techniques • Also need a metric for degree of comparability

TIDES • Find and Interpret Information Vital to National Security • Find and retrieve information in unfamiliar languages • Translate it into English • Extract and correlate its content against other materials • The Tamil National leader, Mr. V. Pirapaharan, delivered a speech on 13 May 1998, the anniversary of the launch of Sri Lanka's biggest and longest assault on the Tamil homelands, describing how the LTTE defended against Sri Lanka's latest military ambitions. Here’s what he said: • 62 million people in South India and Sri Lanka can read this

The Challenges • [Figure labels: Tamil document; Tamil document analysis; (manual); (experimental); (special-purpose); (key sentences)] • Translation: Today is a significant day in the history of our national liberation struggle, it marks the end of a year during which we have resisted and fought against the biggest ever offensive operation launched by the Sri Lankan armed forces code named "Jayasikuru"... • Topic Detection: Liberation Tigers of Tamil Eelam (LTTE); Sri Lanka; Velupillai Pirapaharan; Rebellion • Extraction (columns: Org, Leader, HQ, Losses): Sinhala, Kumaratunga, 3000; LTTE, Pirapaharan, Wanni, 1300 • Summarization: The objective of the Sinhala chauvinists was to utilize maximum man power and fire power to destroy the military capability of the LTTE and to bring an end to the Tamil freedom movement. Before the launching of the operation "Jayasikuru" the Sri Lankan political and military high command miscalculated the military strength and determination of the LTTE.

Cross-Language IR on the Web • http://www.clis.umd.edu/dlrg/clir/ • Most workshop proceedings • Lots of papers and project descriptions • Links to working systems • Including 2 web search engines • Useful linguistic resources • BibTeX for the attached bibliography
