Evaluating Cross-language Information Retrieval Systems

Evaluating Cross-language Information Retrieval Systems Carol Peters IEI-CNR

Outline • Why IR System Evaluation is Important • Evaluation programs • An Example SPINN Seminar, Copenhagen 26-27 October 2001

What is an IR System Evaluation Campaign? • An activity which tests the performance of different systems on a given task (or set of tasks) under standard conditions • Permits contrastive analysis of approaches/technologies

How well does system meet information need? • System evaluation: how good are document rankings? • User-based evaluation: how satisfied is the user?

Why we need Evaluation • evaluation permits hypotheses to be validated and progress assessed • evaluation helps to identify areas where more R&D is needed • evaluation saves developers time and money • CLIR systems are still at an experimental stage, so evaluation is particularly important!

CLIR System Evaluation is Complex CLIR systems integrate multiple components and technologies: • need to evaluate single components • need to evaluate overall system performance • need to distinguish methodological aspects from linguistic knowledge

Technology vs. Usage Evaluation Usage Evaluation: • shows value of a technology for user • determines the technology thresholds that are indispensable for specific usage • provides directions for choice of criteria for technology evaluation Influence of language and culture on usability of technology needs to be understood

Organising an Evaluation Activity • select control task(s) • provide data to test and tune systems • define protocol and metrics to be used in results assessment Aim is an objective comparison between systems and approaches

Test Collection • Set of documents - must be representative of task of interest; must be large • Set of “topics” - statement of user needs from which system data structure (query) is extracted • Relevance judgments – judgments vary by assessor but no evidence that differences affect comparative evaluation of systems

Using Pooling to Create Large Test Collections • Assessors create topics. • A variety of different systems retrieve the top 1000 documents for each topic. • Form pools of unique documents from all submissions, which the assessors judge for relevance. • Systems are evaluated using relevance judgments. Taken from Ellen Voorhees – CLEF 2001 Workshop
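
The pooling step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_pools`, the run and document identifiers, and the `depth` cut-off are all hypothetical, and the slide does not say how many documents per run go into the pool.

```python
from collections import defaultdict


def build_pools(runs, depth=100):
    """Form one pool of unique documents per topic.

    runs:  dict mapping run id -> dict mapping topic id -> ranked list of doc ids
    depth: number of top-ranked documents taken from each run (assumed value)
    """
    pools = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            # A set keeps each document once, however many runs retrieved it.
            pools[topic_id].update(ranked_docs[:depth])
    return dict(pools)


# Toy submissions from two hypothetical systems.
runs = {
    "run_a": {"t1": ["d1", "d2", "d3"], "t2": ["d7", "d8"]},
    "run_b": {"t1": ["d2", "d4", "d1"], "t2": ["d8", "d9"]},
}
pools = build_pools(runs, depth=2)
```

Each pool is what the assessors would then judge for relevance; documents outside the pool stay unjudged.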

Cross-language Test Collections Consistency harder to obtain than for monolingual • parallel or comparable document collections • multiple assessors per topic creation and relevance assessment (for each language) • must take care when comparing different language evaluations (e.g., cross run to mono baseline) Pooling harder to coordinate • need to have large, diverse pools for all languages • retrieval results are not balanced across languages Taken from Ellen Voorhees – CLEF 2001 Workshop

Evaluation Measures • Recall: measures ability of system to find all relevant items: recall = (no. of rel. items retrieved) / (no. of rel. items in collection) • Precision: measures ability of system to find only relevant items: precision = (no. of rel. items retrieved) / (total no. of items retrieved) • Recall-Precision Graph is used to compare systems
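
As a worked illustration of the two measures, the sketch below computes precision and recall for one ranked result list, plus the (recall, precision) points from which a recall-precision graph could be drawn. The function names and the toy document ids are assumptions made for this example only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one result list."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant items retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def recall_precision_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant item appears."""
    relevant_set = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant_set:
            hits += 1
            points.append((hits / len(relevant_set), hits / rank))
    return points


ranking = ["d3", "d9", "d1", "d5", "d7"]   # system output, best first
relevant = {"d1", "d3", "d8", "d10"}       # relevance judgments for the topic
p, r = precision_recall(ranking, relevant)
points = recall_precision_points(ranking, relevant)
```

Here two of the five retrieved items are relevant and two of the four relevant items are found, so precision is 2/5 and recall is 2/4.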

Main CLIR Evaluation Programs • TIDES: sponsors TREC (Text REtrieval Conferences) and TDT (Topic Detection and Tracking) - Chinese-English tracks in 2000; TREC focussing on English/French - Arabic in 2001 • NTCIR: National Institute of Informatics, Tokyo; Chinese-English and Japanese-English C-L tracks • AMARYLLIS: focused on French; 1998-99 campaign included C-L track; 3rd campaign begins Sept. 2001 • CLEF: Cross Language Evaluation Forum - C-L evaluation for European languages

Cross-Language Evaluation Forum • Funded by DELOS Network of Excellence for Digital Libraries and US National Institute of Standards and Technology (2000-2001) • Extension of CLIR track at TREC (1997-1999) • Coordination is distributed - national sites for each language in multilingual collection

CLEF Partners (2000-2001) • Eurospider, Zurich, Switzerland (Peter Schäuble, Martin Braschler) • IEEC-UNED, Madrid, Spain (Felisa Verdejo, Julio Gonzalo) • IEI-CNR, Pisa, Italy (Carol Peters) • IZ Sozialwissenschaften, Bonn, Germany (Michael Kluck) • NIST, Gaithersburg MD, USA (Donna Harman, Ellen Voorhees) • University of Hildesheim, Germany (Christa Womser-Hacker) • University of Twente, The Netherlands (Djoerd Hiemstra)

CLEF - Main Goals Promote research by providing an appropriate infrastructure for: • CLIR system evaluation, testing and tuning • comparison and discussion of results • building of test-suites for system developers

CLEF 2001 Task Description Four main evaluation tracks in CLEF 2001: • multilingual information retrieval • bilingual IR • monolingual (non-English) IR • domain-specific IR plus • experimental track for interactive C-L systems

CLEF 2001 Data Collection • Multilingual comparable corpus of news agency and newspaper documents for six languages (DE, EN, FR, IT, NL, SP). Nearly 1 million documents • Common set of 50 topics (from which queries are extracted) created in 9 European languages (DE, EN, FR, IT, NL, SP + FI, RU, SV) and 3 Asian languages (JP, TH, ZH)

CLEF 2001 Creating the Queries • Title: European Industry • Description: What factors damage the competitiveness of European industry on the world's markets? • Narrative: Relevant documents discuss factors that render European industry and manufactured goods less competitive with respect to the rest of the world, e.g. North America or Asia. Relevant documents must report data for Europe as a whole rather than for single European nations. Queries are extracted from topics: 1 or more fields
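
To make "queries are extracted from topics: 1 or more fields" concrete, here is a minimal sketch that turns chosen topic fields into a bag of query terms. The dictionary layout, the `build_query` function, and the tiny stopword list are assumptions for illustration; each participating system used its own extraction method.

```python
import re

# Topic fields as shown on the slide (narrative omitted for brevity).
topic = {
    "title": "European Industry",
    "description": "What factors damage the competitiveness of European "
                   "industry on the world's markets?",
}

# Tiny illustrative stopword list, not taken from any CLEF system.
STOPWORDS = {"what", "the", "of", "on", "a", "an", "and", "or", "to", "in"}


def build_query(topic, fields=("title",)):
    """Concatenate the chosen fields and return unique non-stopword terms."""
    text = " ".join(topic[field] for field in fields)
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    terms = []
    for token in tokens:
        if token not in STOPWORDS and token not in terms:
            terms.append(token)  # keep first-occurrence order
    return terms


title_query = build_query(topic, ("title",))
td_query = build_query(topic, ("title", "description"))
```

A title-only query is short and ambiguous, while adding the description brings in more specific terms such as "competitiveness".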

CLEF 2001 Creating the Queries • Distributed activity (Bonn, Gaithersburg, Pisa, Hildesheim, Twente, Madrid) • Each group produced 13-15 queries (topics), 1/3 local, 1/3 European, 1/3 international • Topic selection at meeting in Pisa (50 topics) • Topics were created in DE, EN, FR, IT, NL, SP and additionally translated to SV, RU, FI and TH, JP, ZH • Cleanup after topic translation

CLEF 2001 Multilingual IR • Topics: in any one of DE, EN, FR, IT, FI, NL, SP, SV, RU, ZH, JP, TH • Documents: English, German, French, Italian, Spanish • Participant’s Cross-Language Information Retrieval System returns one result list of DE, EN, FR, IT and SP documents ranked in decreasing order of estimated relevance

CLEF 2001 Bilingual IR Task: query English or Dutch target document collections Goal: retrieve documents for target language, listing results in ranked list Easier task for beginners!

CLEF 2001 Monolingual IR Task: querying document collections in FR|DE|IT|NL|SP Goal: acquire better understanding of language-dependent retrieval problems • different languages present different retrieval problems • issues involved include word order, morphology, diacritic characters, language variants

CLEF 2001 Domain-Specific IR Task: querying a structured database from a vertical domain (social sciences) in German • German/English/Russian thesaurus and English translations of document titles • Monolingual or cross-language task Goal: understand implications of querying in domain-specific context

CLEF 2001 Interactive C-L Task: interactive document selection in an “unknown” target language Goal: evaluation of results presentation rather than system performance

CLEF 2001: Participation • 34 participants from 15 different countries • [Chart: participants by region: N. America, Asia, Europe]

Details of Experiments

Runs per Topic Language

Topic Fields

CLEF 2001 Participation • CMU • Eidetica • Eurospider * • Greenwich U • HKUST • Hummingbird • IAI * • IRIT * • ITC-irst * • JHU-APL * • Kasetsart U • KCSL Inc. • Medialab • Nara Inst. of Tech. • National Taiwan U • OCE Tech. BV • SICS/Conexor • SINAI/U Jaen • Thomson Legal * • TNO TPD * • U Alicante • U Amsterdam • U Exeter • U Glasgow * • U Maryland * (interactive only) • U Montreal/RALI * • U Neuchâtel • U Salamanca * • U Sheffield * (interactive only) • U Tampere * • U Twente (*) • UC Berkeley (2 groups) * • UNED (interactive only) (* = also participated in 2000)

CLEF 2001 Approaches All traditional approaches used: • commercial MT systems (Systran, Babelfish, Globalink Power Translator, ) • both query and document translation tried • bilingual dictionary look-up (on-line and in-house tools) • aligned parallel corpora (web-derived) • comparable corpora (similarity thesaurus) • conceptual networks (Eurowordnet, ZH-EN wordnet) • multilingual thesaurus (domain-specific task)

CLEF 2001 Techniques Tested Text processing for multiple languages: • Porter stemmer, Inxight commercial stemmer, on-site tools • simple generic “quick&dirty” stemming • language independent stemming • separate stopword lists vs single list • morphological analysis • n-gram indexing, word segmentation, decompounding (e.g. Chinese, German) • use of NLP methods, e.g. phrase identification, morphosyntactic analysis

CLEF 2001 Techniques Tested Cross-language strategies included: • integration of methods (MT, corpora and MRDs) • pivot language to translate from L1 -> L2 (DE -> FR, SP, IT via EN) • N-gram based technique to match untranslatable words • prior and post-translation pseudo-relevance feedback (query expanded by associating frequent cooccurrences) • vector-based semantic analysis (query expanded by associating semantically similar terms)
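
One way an n-gram based technique for untranslatable words might work is sketched below: a source-language term with no dictionary translation (often a proper name) is matched to the target-language index term sharing the most character trigrams. The Dice coefficient, the 0.4 threshold, the function names, and the toy vocabulary are assumptions for illustration, not a description of any participant's system.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_match(term, vocabulary, n=3, threshold=0.4):
    """Return the most similar vocabulary word, or None if nothing is close."""
    if not vocabulary:
        return None
    score, word = max((dice(term, candidate, n), candidate) for candidate in vocabulary)
    return word if score >= threshold else None


# German spelling of a place name matched against English index terms.
vocabulary = ["chernobyl", "chemistry", "nobel"]
match = best_match("Tschernobyl", vocabulary)
```

The appeal of this kind of matching is that it needs no translation resource at all, which is why it suits names and other out-of-dictionary words.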

CLEF 2001 Techniques Tested • Different strategies were tried for results merging • This remains an unsolved problem
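
The slide does not say which merging strategies were tried. As a hedged illustration of the problem, the sketch below shows two simple, commonly described baselines for combining per-language result lists into one ranking: round-robin interleaving and merging on min-max normalized scores. Function names, document ids, and scores are invented for the example.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave lists rank by rank: all rank-1 docs, then all rank-2 docs, etc."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for item in tier:
            if item is not None and item[0] not in seen:
                seen.add(item[0])
                merged.append(item[0])
    return merged


def normalized_score_merge(result_lists):
    """Rescale each list's scores to [0, 1], then sort all documents together."""
    rescored = []
    for results in result_lists:
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = high - low
        for doc, score in results:
            rescored.append(((score - low) / span if span else 1.0, doc))
    # sorted() is stable, so ties keep the input order of the lists.
    rescored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in rescored]


# Scores from different collections are not directly comparable.
german = [("de1", 12.0), ("de2", 9.0), ("de3", 6.0)]
french = [("fr1", 0.9), ("fr2", 0.5)]
by_rank = round_robin_merge([german, french])
by_score = normalized_score_merge([german, french])
```

Neither baseline knows how many relevant documents each language actually holds, which is one reason merging is hard.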

CLEF 2001 Workshop • Results of CLEF 2001 campaign presented at Workshop, 3-4 September 2001, Darmstadt, Germany • 50 researchers and system developers from academia and industry participated. • Working Notes containing preliminary reports and statistics on CLEF 2001 experiments distributed.

CLEF-2001 vs. CLEF-2000 • Most participants were back • Less MT • More Corpus-Based • People really start to try each other’s ideas/methods: • corpus-based approaches (parallel web, alignments) • n-grams • combination approaches

“Effect” of CLEF • Many more European groups • Dramatic increase of work in stemming/decompounding (for languages other than English) • Work on mining the web for parallel texts • Work on merging (breakthrough still missing?) • Work on combination approaches

CLEF 2002 • Accompanying Measure under IST programme: Contract No. IST-2000-31002. October 2001 • CLEF Consortium • IEI-CNR, Pisa; ELRA/ELDA, Paris; Eurospider, Zurich; UNED, Madrid; NIST, USA; IZ Sozialwissenschaften, Bonn • Associated Members • University of Hildesheim, University of Twente, University of Tampere (?)

CLEF 2002 Task Description Similar to CLEF 2001: • multilingual information retrieval • bilingual IR (not to English!) • monolingual (non-English) IR • domain-specific IR • interactive track Plus feasibility study for spoken document track (within DELOS – results reported at CLEF) Possible coordination with Amaryllis

CLEF 2002 Schedule • Call for Participation - November 2001 • Document release - 1 February 2002 • Topic Release - 1 April 2002 • Runs received - 15 June 2002 • Results communicated - 1 August 2002 • Paper for Working Notes - 1 September 2002 • Workshop - 19-20 September

Evaluation - Summing up • system evaluation is not a competition to find the best • evaluation provides opportunity to test, tune, and compare approaches in order to improve system performance • an evaluation campaign creates a community interested in examining the same issues and comparing ideas and experiences

Cross-Language Evaluation Forum For further information see: http://www.clef-campaign.org or contact: Carol Peters - IEI-CNR E-mail: carol@iei.pi.cnr.it
