Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday 10:30 am - 12:00 pm Spring 2006 http://www.sims.berkeley.edu/academics/courses/is240/s06/. Lecture 28: CLIR. Principles of Information Retrieval. Mini-TREC.
Mini-TREC • Proposed Schedule • February 14-16 – Database and previous Queries • March 2 – report on system acquisition and setup • March 2, New Queries for testing… • April 20, Results due • April 25, Results and system rankings (sort of) • May 9, Group reports and discussion
Today • Review • NLP for IR • Text Summarization • Cross-Language Information Retrieval • Introduction • Cross-Language EVIs Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen
Natural Language Processing and IR • The main approach in applying NLP to IR has been to attempt to address • Phrase usage vs individual terms • Search expansion using related terms/concepts • Attempts to automatically exploit or assign controlled vocabularies
NLP and IR • Much early research showed that (at least in the restricted test databases tested) • Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically) • Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements
NLP and IR • Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods • E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches
General Framework of NLP • Morphological and Lexical Processing ("John runs." becomes "John run+s."; John: P-N; run: V; +s: 3-pre or N plu) • Syntactic Analysis (parse tree: S over NP and VP, with leaves P-N "John" and V "run") • Semantic Analysis (Pred: RUN, Agent: John) • Context processing and Interpretation ("John is a student. He runs.") Slide from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
Using NLP • Strzalkowski (in Reader) [Figure: processing pipeline with stages labeled Text, NLP, repres, Dbase search; the NLP stage is broken out into TAGGER, PARSER, TERMS]
Using NLP INPUT SENTENCE The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. TAGGED SENTENCE The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per
Using NLP TAGGED & STEMMED SENTENCE the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
Using NLP PARSED SENTENCE [assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]
Using NLP EXTRACTED TERMS & WEIGHTS
President 2.623519 | soviet 5.416102
President+soviet 11.556747 | president+former 14.594883
Hero 7.896426 | hero+local 14.314775
Invade 8.435012 | tank 6.848128
Tank+invade 17.402237 | tank+russian 16.030809
Russian 7.383342 | wisconsin 7.785689
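The compound terms on the slide above (president+former, tank+russian, and so on) pair a head word with a modifier. A minimal sketch of one way to get the adjective-noun pairs from the tagged and stemmed sentence follows. It is a simplification that uses only the part-of-speech tags, not the parser output shown on the slides, so it cannot produce subject-verb pairs such as tank+invade. The function name `head_modifier_pairs` is hypothetical, and no weights are computed.

```python
# Tagged and stemmed sentence, copied from the slide.
TAGGED = ("the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt "
          "local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn "
          "invade/vbd wisconsin/np ./per")


def head_modifier_pairs(tagged_sentence):
    """Pair each noun (nn) with the adjectives (jj) immediately before it.

    Assumption: adjacency in the tag sequence stands in for the syntactic
    head-modifier relation that a real parser would supply.
    """
    pairs = []
    pending_adjectives = []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        if tag == "jj":
            pending_adjectives.append(word)
        elif tag == "nn":
            pairs.extend(f"{word}+{adj}" for adj in pending_adjectives)
            pending_adjectives = []
        else:
            # Any other tag breaks the adjective run.
            pending_adjectives = []
    return pairs


pairs = head_modifier_pairs(TAGGED)
print(pairs)
```

The four pairs this yields all appear in the slide's term list; the remaining compound, tank+invade, needs the subject relation from the parse.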
NLP & IR • Indexing • Use of NLP methods to identify phrases • Test weighting schemes for phrases • Use of more sophisticated morphological analysis • Searching • Use of two-stage retrieval • Statistical retrieval • Followed by more sophisticated NLP filtering
NLP & IR • New “Question Answering” track at TREC has been exploring these areas • Usually statistical methods are used to retrieve candidate documents • NLP techniques are used to extract the likely answers from the text of the documents
Introduction to CLIR • Slides from Doug Oard…
Cross-Language IR • Given a query expressed in one language • Find info that may be expressed in another • Electronic texts • Document images • Recorded speech [101] • Sign language [Figure: an English query and French documents both feed a retrieval system]
Why Do Cross-Language IR? • When users can read several languages • Eliminates multiple queries • Query in most fluent language • Monolingual users can also benefit • If translations can be provided • If it suffices to know that a document exists • If text captions are used to search for images
What We Know • Dictionaries are very useful • Easily get to 50% of monolingual IR effectiveness • We can get to about 75% using: • Part-of-speech tags • Pseudo-relevance feedback • Phrase indexing • Multilingual training corpora are also useful • When the corpus is from the right domain
Related Issues • Multiscript text processing [12] • Character sets, writing system, direction, ... • Language identification [109] • Markup, detection • Language-specific processing [103] • Stemming, morphological roots, compounds, … • Document translation [51]
Cross-Language Text Retrieval [Figure: taxonomy of approaches. Node labels: Query Translation, Document Translation (Text Translation, Vector Translation); Controlled Vocabulary, Free Text; Knowledge-based (Ontology-based, Dictionary-based, Thesaurus-based); Corpus-based (Term-aligned, Sentence-aligned, Document-aligned, Unaligned); Parallel, Comparable]
Free Text Developments • 1970, 1973 Salton • Hand coded bilingual dictionaries • 1990 Latent Semantic Indexing [53] • French/English using Hansard training corpus • 1994 European multilingual IR project [84] • Medium-scale recall/precision evaluation • 1996 SIGIR Cross-lingual IR workshop • And over 10 conferences and workshops since!
How Controlled Vocabulary Works • Thesaurus design [102] • Design a knowledge structure for domain • Assign a unique “descriptor” to each concept • Include “scope notes” and “lead-in vocabulary” • Document indexing • Read the document, assign appropriate descriptors • Retrieval • Select desired descriptors, use exact match retrieval
Multilingual Thesauri • Adapt the knowledge structure • Cultural differences influence indexing choices • Use language-independent descriptors • Matched to a unique term in each language • Three construction techniques [46] • Build it from scratch • Translate an existing thesaurus • Merge monolingual thesauri
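A minimal sketch of how language-independent descriptors support exact-match retrieval across languages. The descriptor IDs, the lead-in vocabulary entries, and the document index below are all invented for illustration; a real thesaurus would also carry scope notes and a knowledge structure.

```python
# Hypothetical toy thesaurus: each descriptor ID is language-independent
# and is matched to one preferred term per language.
DESCRIPTORS = {
    "D001": {"en": "aircraft", "fr": "aéronef"},
    "D002": {"en": "flags", "fr": "drapeaux"},
}

# Lead-in vocabulary: words a searcher might type, mapped to a descriptor.
LEAD_IN = {
    ("en", "airplane"): "D001",
    ("en", "aircraft"): "D001",
    ("fr", "avion"): "D001",
    ("fr", "aéronef"): "D001",
    ("en", "flag"): "D002",
    ("fr", "drapeau"): "D002",
}

# Descriptors assigned to each document by a (human) indexer.
INDEX = {
    "doc_en_1": {"D001"},
    "doc_fr_1": {"D001", "D002"},
    "doc_fr_2": {"D002"},
}


def search(language, terms):
    """Exact-match retrieval: a document matches only if it carries
    every descriptor selected by the query terms.

    Raises KeyError for a term outside the lead-in vocabulary.
    """
    wanted = {LEAD_IN[(language, term.lower())] for term in terms}
    return sorted(doc for doc, assigned in INDEX.items() if wanted <= assigned)


print(search("en", ["airplane"]))
print(search("fr", ["avion", "drapeau"]))
```

Because matching happens on descriptor IDs, the English query retrieves the French document without any translation step at search time.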
Advantages over Free Text • High-quality concept-based indexing • Descriptors need not appear in the document • Knowledge-guided searching • Good thesauri capture expert domain knowledge • Excellent cross-language effectiveness • Up to 100% of monolingual effectiveness • Understandable retrieval results • Efficient implementation
Limitations • Costly to create • Design knowledge structure, index each document • Costly to maintain • Document indexing, vocabulary and concept change • Hard to use • Vocabulary choice, knowledge structure navigation • Limited scope • Domain must be chosen at design time
Query vs. Document Translation • Query translation • Very efficient for short queries • Not as big an advantage for relevance feedback • Hard to resolve ambiguous query terms • Document translation • May be needed by the selection interface • And supports adaptive filtering well • Slow, but only need to do it once per document • Poor scale-up to large numbers of languages
Document Translation Example • Approach • Select a single query language • Translate every document into that language • Perform monolingual retrieval • Long documents provide enough context • And many translation errors do not hurt retrieval • Much of the generation effort is wasted • And choosing a single translation can hurt
Query Translation Example • Select controlled vocabulary search terms • Retrieve documents in desired language • Form monolingual query from the documents • Perform a monolingual free text search [Figure labels: Information Need, Thesaurus, Controlled Vocabulary, Multilingual Text Retrieval System, English Abstracts, French Query Terms, Alta Vista, English Web Pages]
Machine Readable Dictionaries • Based on printed bilingual dictionaries • Becoming widely available • Used to produce bilingual term lists • Cross-language term mappings are accessible • Sometimes listed in order of most common usage • Some knowledge structure is also present • Hard to extract and represent automatically • The challenge is to pick the right translation
Unconstrained Query Translation • Replace each word with every translation • Typically 5-10 translations per word • About 50% of monolingual effectiveness • Main problem is ambiguity • Example: Fly (English) • 8 word senses (e.g., to fly a flag) • 13 Spanish translations (enarbolar, ondear, …) • 38 English retranslations (hoist, brandish, lift…)
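A minimal sketch of unconstrained query translation: every query word is replaced by every translation listed for it. The term list is a tiny invented sample (only a few of the Spanish translations of "fly" are included), and passing unknown words through untranslated is an assumption, not something the slide specifies.

```python
# Hypothetical bilingual term list (English -> Spanish), illustration only.
BILINGUAL_TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear"],
    "flag": ["bandera"],
}


def translate_query(query, term_list):
    """Replace each query word with all of its listed translations.

    Words missing from the term list are kept as-is (an assumption).
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated


print(translate_query("fly flag", BILINGUAL_TERM_LIST))
```

The translated query grows with every ambiguous word, which is the ambiguity problem the slide describes: nothing here chooses among the senses of "fly".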
Phrase Indexing • Improves retrieval effectiveness two ways • Phrases are less ambiguous than single words • Idiomatic phrases translate as a single concept • Three ways to identify phrases • Semantic (e.g., appears in a dictionary) • Syntactic (e.g., parse as a noun phrase) • Cooccurrence (words found together often) • Semantic phrase results are impressive
Types of Bilingual Corpora • Parallel corpora: translation-equivalent pairs • Document pairs • Sentence pairs • Term pairs • Comparable corpora • Content-equivalent document pairs • Unaligned corpora • Content from the same domain
Generating Parallel Corpora • Parallel corpora are naturally domain-tuned • Finding one for the right domain may be hard • Alternative is to build one • Start with a monolingual corpus • Automatic machine translation for second language • Worthwhile when IR technique is faster than MT • If translation errors don’t hurt the IR technique • Good results with Latent Semantic Indexing
Pseudo-Relevance Feedback • Enter query terms in French • Find top French documents in parallel corpus • Construct a query from English translations • Perform a monolingual free text search [Figure labels: French Query Terms, French Text Retrieval System, Parallel Corpus, Top ranked French Documents, English Translations, Alta Vista, English Web Pages]
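The first three steps above can be sketched as follows. This is a toy illustration under stated assumptions: the parallel corpus, the stopword list, ranking by simple term overlap, and selecting English terms by raw frequency are all invented for the example. The final monolingual search step is left out.

```python
from collections import Counter

# Hypothetical parallel corpus of (French, English) document pairs.
PARALLEL_CORPUS = [
    ("le parlement vote la loi sur la pêche",
     "parliament votes on the fishing law"),
    ("la pêche au saumon est réglementée",
     "salmon fishing is regulated"),
    ("le budget du parlement augmente",
     "the parliament budget increases"),
]
STOPWORDS = {"the", "on", "is"}  # assumed, minimal


def cross_language_query(french_query, corpus, top_docs=2, top_terms=3):
    """Build an English query from the translations of the top French hits."""
    query_terms = set(french_query.lower().split())
    # Rank French documents by how many query terms they contain.
    ranked = sorted(
        corpus,
        key=lambda pair: len(query_terms & set(pair[0].split())),
        reverse=True,
    )
    # Count content words in the English sides of the top-ranked pairs.
    counts = Counter()
    for _, english in ranked[:top_docs]:
        counts.update(w for w in english.split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_terms)]


english_query = cross_language_query("pêche saumon", PARALLEL_CORPUS)
print(english_query)
```

No dictionary is consulted at any point; the parallel corpus itself carries the query across the language boundary.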
Similarity-Based Dictionaries • Automatically developed from aligned documents • Reflects language use in a specific domain • For each term, find most similar in other language • Retain only the top few (5 or so) • Performs as well as dictionary-based techniques • Evaluated on a comparable corpus of news stories [98] • Stories were automatically linked based on date and subject
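A minimal sketch of the idea, assuming each term is represented by the set of aligned document pairs it occurs in and that similarity is the cosine between those binary occurrence vectors. The slide does not name the similarity measure, so cosine is an assumption, and the aligned documents are invented.

```python
import math
from collections import defaultdict

# Hypothetical aligned (English, French) document pairs.
ALIGNED_DOCS = [
    ("election vote president", "élection vote président"),
    ("election campaign", "élection campagne"),
    ("president speech", "président discours"),
    ("vote campaign", "vote campagne"),
]


def occurrence_vectors(texts):
    """Map each term to the set of document positions it appears in."""
    vectors = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for term in text.split():
            vectors[term].add(doc_id)
    return vectors


def cosine(a, b):
    """Cosine similarity of two binary vectors given as index sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_dictionary(aligned_docs, keep=5):
    """For each source term, keep the `keep` most similar target terms."""
    source = occurrence_vectors(s for s, _ in aligned_docs)
    target = occurrence_vectors(t for _, t in aligned_docs)
    dictionary = {}
    for term, docs in source.items():
        ranked = sorted(target, key=lambda t: cosine(docs, target[t]),
                        reverse=True)
        dictionary[term] = ranked[:keep]
    return dictionary


dictionary = similarity_dictionary(ALIGNED_DOCS, keep=1)
print(dictionary)
```

With `keep=1` the toy data gives a one-to-one mapping; the slide's "top few (5 or so)" corresponds to the default `keep=5`.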
Latent Semantic Indexing • Designed for better monolingual effectiveness • Works well across languages too [27] • Cross-language is just a type of term choice variation • Produces short dense document vectors • Better than long sparse ones for adaptive filtering • Training data needs grow with dimensionality • Not as good for retrieval efficiency • Always 300 multiplications, even for short queries
Cooccurrence-Based Dictionaries • Align terms using cooccurrence statistics • How often does a term pair occur in sentence pairs? • Weighted by relative position in the sentences • Retain term pairs that occur unusually often • Useful for query translation • Excellent results when the domain is the same • Also practical for document translation • Term use variations to reinforce good translations
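A minimal sketch of term alignment from sentence pairs. The slide does not say which statistic is used, so the Dice coefficient here is an assumption, as are the threshold and the toy sentence pairs. The position weighting mentioned above is omitted.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned (English, French) pairs.
SENTENCE_PAIRS = [
    ("the house is red", "la maison est rouge"),
    ("the house is big", "la maison est grande"),
    ("the car is red", "la voiture est rouge"),
]


def cooccurrence_dictionary(sentence_pairs, threshold=0.9):
    """Keep term pairs whose Dice coefficient meets the threshold.

    Dice = 2 * (sentence pairs containing both terms)
           / (pairs containing the source term + pairs containing the target term)
    """
    pair_counts = Counter()
    source_counts = Counter()
    target_counts = Counter()
    for source, target in sentence_pairs:
        s_terms, t_terms = set(source.split()), set(target.split())
        source_counts.update(s_terms)
        target_counts.update(t_terms)
        pair_counts.update(product(s_terms, t_terms))
    scores = {}
    for (s, t), joint in pair_counts.items():
        dice = 2 * joint / (source_counts[s] + target_counts[t])
        if dice >= threshold:
            scores[(s, t)] = dice
    return scores


scores = cooccurrence_dictionary(SENTENCE_PAIRS)
print(sorted(scores))
```

On this toy data the content words align cleanly, but words that occur in every sentence (the, is, la, est) all pair with one another. Cooccurrence counts alone cannot separate them, which is where a position weight would help.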
Language Identification • Can be specified using metadata • Included in HTTP and HTML • Determined using word-scale features • Which dictionary gets the most hits? • Determined using subword features • Letter n-grams in electronic and printed text • Phoneme n-grams in speech
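A minimal sketch of identification from subword features, using letter trigrams. The training sentences are invented and far too small for real use, and scoring by overlapping trigram counts is one simple choice among many, not a method the slide prescribes.

```python
from collections import Counter

# Hypothetical, tiny training samples per language.
TRAINING_TEXT = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "french": "le renard brun rapide saute par dessus le chien paresseux et le chat",
}


def trigram_profile(text):
    """Count letter trigrams, padding with spaces to mark word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING_TEXT.items()}


def identify_language(text):
    """Return the language whose profile shares the most trigrams with text."""
    sample = trigram_profile(text)

    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in sample.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))


print(identify_language("the dog"))
print(identify_language("le chat"))
```

The same profile-and-compare structure would apply to phoneme n-grams in speech, with phoneme sequences in place of characters.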
Research Directions • User needs assessment • Evaluation • Corpus construction • Word sense disambiguation • System integration • Probabilistic models • Adaptive filtering
Evaluation • Most critical need is for side by side tests • TREC-did this for French/German/Italian • Domain shift metric • Domain shift hurts corpus-based techniques • Need a way to measure severity of the shift • Test collections for adaptive filtering • From cross-language recall/precision evaluation
Corpus Construction • Corpus-based techniques have great potential • Parallel corpora are rare and expensive • Find it, reverse engineer the links, clean it up • Unlinked corpora are of limited value • Context linking research could change that [77] • Comparable corpora offer middle ground • Need to develop automatic linking techniques • Also need a metric for degree of comparability
TIDES • Find and retrieve information in unfamiliar languages • Translate it into English • Extract and correlate its content against other materials. Find and Interpret Information Vital to National Security. The Tamil National leader, Mr. V. Pirapaharan delivered a speech on 13 May 1998, the anniversary of the launch of Sri Lanka's biggest and longest assault on the Tamil homelands, describing how the LTTE defended against Sri Lanka's latest military ambitions. Here’s what he said: [Image: Tamil text] 62 million people in South India and Sri Lanka can read this
The Challenges [Figure: analysis of a Tamil document]
• Translation (manual): Today is a significant day in the history of our national liberation struggle, it marks the end of a year during which we have resisted and fought against the biggest ever offensive operation launched by the Sri Lankan armed forces code named "Jayasikuru"...
• Topic Detection (experimental): Liberation Tigers of Tamil Eelam (LTTE); Sri Lanka; Velupillai Pirapaharan; Rebellion
• Extraction (special-purpose):
Org | Leader | HQ | Losses
Sinhala | Kumaratunga | | 3000
LTTE | Pirapaharan | Wanni | 1300
• Summarization (key sentences): The objective of the Sinhala chauvinists was to utilize maximum man power and fire power to destroy the military capability of the LTTE and to bring an end to the Tamil freedom movement. Before the launching of the operation "Jayasikuru" the Sri Lankan political and military high command miscalculated the military strength and determination of the LTTE.
Cross-Language IR on the Web • http://www.clis.umd.edu/dlrg/clir/ • Most workshop proceedings • Lots of papers and project descriptions • Links to working systems • Including 2 web search engines • Useful linguistic resources • BibTeX for the attached bibliography