From Research to Application in Multilingual Information Access: The Contribution of Evaluation Carol Peters, ISTI-CNR, Pisa, Italy LREC 2008
Outline • What is MLIA/CLIR? • What is the State-of-the-Art? • Where are the Problems? • What is the Contribution of Evaluation? • Where are the Problems? • What more can we do? • From CLEF to TrebleCLEF
Europe’s Linguistic Diversity
• There are 6,800 known languages spoken in 200 countries • 2,261 have writing systems (the others are only spoken) • Just 300 have some kind of language processing tools
What is MLIA? • MLIA-related research concerns the storage, access, retrieval and presentation of information in any of the world's languages. • Two main areas of interest: • multiple language access, browsing, display • cross-language information discovery and retrieval
Multi-Language Access, Browsing, Display The enabling technology: • character encoding • specific requirements of particular languages and scripts • internationalization & localization
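The character-encoding point above can be illustrated with a minimal sketch (not part of the original talk): in a multilingual index, the same accented word may arrive in composed or decomposed Unicode form, so terms are normalized and casefolded before comparison.

```python
import unicodedata

def normalize_term(term: str) -> str:
    """Casefold and NFC-normalize a term so that composed and
    decomposed encodings of the same accented word compare equal."""
    return unicodedata.normalize("NFC", term).casefold()

# "é" as a single code point vs. "e" + combining acute accent
composed = "caf\u00e9"
decomposed = "cafe\u0301"
assert composed != decomposed                       # raw strings differ
assert normalize_term(composed) == normalize_term(decomposed)
```

NFC is only one of the Unicode normalization forms; a real indexer would pick one form (and a casefolding policy) and apply it consistently to both documents and queries.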
Cross-Language Information Retrieval Crossing the language barrier… • querying a multilingual collection in one language against documents in many other languages… • filtering, selecting, ranking retrieved documents • presenting retrieved information in an interpretable and exploitable fashion
The Problem [diagram: a user query and a document, each mapped to a representation (query representation, document representation), must be matched across the language barrier]
CLIR methods • How is it done? • Pre-process & index both documents and queries – generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.) • Translate: queries or documents (or both) • Translation resources • Machine Translation (MT) • Parallel/comparable corpora • Bilingual Dictionaries • Multilingual Thesauri • Conceptual Interlingua • Find relevant documents in target collection(s) & present results
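The pipeline above can be sketched as a toy dictionary-based, query-translation CLIR system. Everything here is invented for illustration: the stop list, the tiny English-Italian dictionary, and the documents are hypothetical, and the term-overlap scorer stands in for a real ranking model.

```python
from collections import Counter

# Hypothetical resources -- real systems use full stop lists and
# translation resources such as those evaluated at CLEF.
EN_STOPWORDS = {"the", "of", "in", "a"}
EN_IT_DICT = {"language": ["lingua", "linguaggio"], "barrier": ["barriera"]}

def preprocess(text, stopwords=EN_STOPWORDS):
    """Tokenise, lowercase, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]

def translate_query(tokens, dictionary=EN_IT_DICT):
    """Dictionary-based query translation: keep every candidate
    translation; untranslatable (OOV) terms pass through unchanged."""
    out = []
    for t in tokens:
        out.extend(dictionary.get(t, [t]))
    return out

def retrieve(query_tokens, docs):
    """Rank target-language documents by simple term-overlap score."""
    scores = {}
    for doc_id, text in docs.items():
        doc_tokens = Counter(text.lower().split())
        scores[doc_id] = sum(doc_tokens[t] for t in query_tokens)
    return sorted(scores, key=scores.get, reverse=True)

docs_it = {"d1": "la barriera della lingua", "d2": "un documento qualsiasi"}
query = translate_query(preprocess("the language barrier"))
assert retrieve(query, docs_it)[0] == "d1"
```

Note how keeping all candidate translations ("lingua" and "linguaggio") side-steps word-sense disambiguation at the cost of query noise, and how an OOV term would simply pass through untranslated, which is exactly the proper-name problem listed among the difficulties below.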
CLIR for Multimedia • Retrieval from a mixed media collection is a non-trivial problem • Different media are processed in different ways and suffer from different kinds of indexing errors: • spoken documents indexed using speech recognition • handwritten documents indexed using OCR • images indexed using significant features • Need for complex integration of multiple technologies • Need for merging of results from different sources
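One common approach to the merging problem is score normalization followed by score fusion; a minimal sketch of min-max normalization plus CombSUM fusion, with invented runs and scores:

```python
def min_max_normalize(run):
    """Map raw scores of one run into [0, 1] so that runs from
    different media (text, speech, image) become comparable."""
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {d: 1.0 for d in run}
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def merge_runs(*runs):
    """Normalize each run, then sum scores per document (CombSUM)."""
    merged = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            merged[doc] = merged.get(doc, 0.0) + score
    return sorted(merged, key=merged.get, reverse=True)

text_run = {"d1": 12.0, "d2": 7.0, "d3": 3.0}   # hypothetical text-retrieval scores
asr_run = {"d2": 0.9, "d3": 0.1}                # hypothetical speech-recognition run
assert merge_runs(text_run, asr_run)[0] == "d2"  # evidence from two media wins
```

CombSUM is just one fusion rule; the point of the sketch is that raw scores from different indexing technologies are not directly comparable and must be normalized before any merge.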
Main CLIR Difficulties (I) • Language identification • Morphology: inflection, derivation, compounding, … • OOV terms, e.g. proper names, terminology • Multi-word concepts, e.g. phrases and idioms • Ambiguity, e.g. polysemy • Handling many languages: L1 -> Ln • Merging results from different sources / media • Presenting the results in a useful fashion
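Language identification, the first difficulty listed, is often approached with character n-gram profiles. A toy sketch, with two hypothetical one-sentence training samples standing in for the large per-language corpora a real identifier is trained on:

```python
from collections import Counter

def char_trigrams(text):
    """Count character trigrams, padding the text so that word
    boundaries contribute distinctive n-grams too."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

# Hypothetical training samples -- real profiles come from large corpora.
PROFILES = {
    "en": char_trigrams("the quick brown fox jumps over the lazy dog"),
    "it": char_trigrams("la volpe veloce salta sopra il cane pigro"),
}

def identify(text):
    """Pick the language whose trigram profile overlaps most with the text."""
    grams = char_trigrams(text)
    return max(PROFILES, key=lambda lang: sum((grams & PROFILES[lang]).values()))
```

Character n-grams work even for very short inputs because they capture language-typical letter sequences rather than whole words, which is why this family of techniques remains the standard first step before any language-dependent processing.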
Main CLIR Difficulties (II) • MLIA systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction) • MLIA systems need intelligent post-processing of results: merging / summarization / translation • MLIA systems need well-developed resources • Language Processing Tools • Language Resources • Resources are expensive to acquire, maintain, update
Cross-Language Evaluation Forum Objectives • Promote research and stimulate development of multilingual IR systems for European languages, through • Creation of evaluation infrastructure • Building of an MLIA/CLIR research community • Construction of publicly available test-suites Major Goal • Encourage development of truly multilingual, multimodal systems
CLEF Coordination: Institutions contributing to the organisation of the different tracks of CLEF 2007 Centre for the Evaluation of Human Language and Multimodal Communication Technologies (CELCT), Trento, Italy College of Information Studies and Institute for Advanced Computer Studies, U. Maryland, USA Dept. of Computer Science, U. Indonesia Depts. of Computer Science & Medical Informatics, RWTH Aachen U., Germany Dept. of Computer Science and Information Systems, U. Limerick, Ireland Dept. of Computer Science and Information Engineering, National U. Taiwan Dept. of Information Engineering, U. Padua, Italy Dept. of Information Science, U. Hildesheim, Germany Dept. of Information Studies, U. Sheffield, UK Evaluations and Language Resources Distribution Agency Sarl, Paris, France Fondazione Bruno Kessler FBK-irst, Trento, Italy German Research Centre for Artificial Intelligence, DFKI, Saarbrücken, Germany Information and Language Processing Systems, U. Amsterdam, Netherlands IZ Bonn, Germany Inst. for Information Technology, Hyderabad, India Inst. of Formal and Applied Linguistics, Charles University, Czech Republic LSI-UNED, Madrid, Spain Linguateca, Sintef, Oslo, Norway Linguistic Modelling Lab., Bulgarian Academy of Sciences Microsoft Research Asia NIST, USA Biomedical Informatics, Oregon Health and Science University, USA Research Computing Center of Moscow State U. Research Institute for Linguistics, Hungarian Academy of Sciences School of Computer Science and Mathematics, Victoria U., Australia School of Computing, DCU, Ireland UC Data Archive and School of Information Management and Systems, UC Berkeley, USA University "Alexandru Ioan Cuza", Iasi, Romania U. Hospitals and U. of Geneva, Switzerland Vienna University of Technology, Austria
Evolution of CLEF
CLEF Test Collections 2000 • News documents in 4 languages • GIRT German social science database 2007 • CLEF multilingual comparable corpus of more than 3M news docs in 13 languages: CZ, DE, EN, ES, FI, FR, IT, NL, RU, SV, PT, BG and HU • GIRT-4 social science database in EN and DE, Russian ISISS collection; Cambridge Sociological Abstracts • MALACH collection of conversational speech derived from the Shoah archives, EN & CZ • EuroGOV, 3.5M webpages crawled from European governmental sites • IAPR TC-12 photo database; PASCAL VOC 2006 training data • ImageCLEFmed radiological database consisting of 6 distinct datasets • IRMA collection in EN & DE for automatic medical image annotation Each track creates topics/queries & relevance assessments in diverse languages
Promoting Research through Evaluation Text Retrieval (from 2000) • Mono-, bi- and multilingual system performance tested using news documents (13 European languages) • bilingual task testing on unusual language combinations • multilingual system testing with many target languages • advanced tasks to monitor improvement in system performance over time, focused on the problem of merging results from different collections/languages • “robust” task emphasized the importance of stable performance across languages instead of high average performance • Since 2006, queries in non-European languages (Indian sub-task) • 2008: new tasks on library archives; tasks on non-European target collections; robust task uses WSD data
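The robust task's preference for stable performance over languages is usually captured by replacing the arithmetic mean of per-topic average precision (MAP) with its geometric mean (GMAP); a small sketch with invented per-topic scores:

```python
import math

def mean_average_precision(ap_scores):
    """MAP: arithmetic mean of per-topic average precision."""
    return sum(ap_scores) / len(ap_scores)

def geometric_map(ap_scores, eps=1e-5):
    """GMAP: geometric mean of per-topic average precision.
    A single near-zero topic drags the whole score down, so GMAP
    rewards systems that never fail badly on any topic."""
    return math.exp(sum(math.log(max(s, eps)) for s in ap_scores) / len(ap_scores))

stable = [0.30, 0.30, 0.30]
erratic = [0.60, 0.29, 0.01]   # same MAP as `stable`, but one failing topic
assert geometric_map(stable) > geometric_map(erratic)
```

Both systems have MAP = 0.30, yet GMAP is about 0.30 for the stable one and about 0.12 for the erratic one, which is exactly the distinction the robust task is designed to surface.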
Results: Cross-Language Text Retrieval Comparing bilingual results with monolingual baselines: • TREC-6, 1997: • EN→FR: 49% of best monolingual French system • EN→DE: 64% of best monolingual German system • CLEF 2002: • EN→FR: 83.4% of best monolingual French system • EN→DE: 85.6% of best monolingual German system • CLEF 2003 enforced the use of “unusual” language pairs: • IT→ES: 83% of best monolingual Spanish IR system • DE→IT: 87% of best monolingual Italian IR system • FR→NL: 82% of best monolingual Dutch IR system • CLEF 2007 best bilingual system 88% of best monolingual system
Other results: non-doc & non-text retrieval • Interactive CLEF • Cross-Lang. IR from a user-inclusive perspective • Multilingual Question Answering • 10 different target collections, real-time exercise, answer validation, QA on speech transcripts • Geographical CLIR • Cross-language image retrieval • Tasks on photo and medical archives, tasks for retrieval and classification • Cross-language spoken document & cross-language speech retrieval
CLEF Achievements • Stimulation of research activity in new, previously unexplored areas • Study and implementation of evaluation methodologies for diverse types of cross-language IR systems • Creation of a large set of empirical data about multilingual information access from the user perspective • Quantitative and qualitative evidence with respect to best practice in cross-language system development • Creation of reusable test collections for system benchmarking • Building of a strong, multidisciplinary research community BUT
BUT Notable lack of take-up by application communities
TrebleCLEF TrebleCLEF is a Coordination Action, funded under FP7 from 2008 to 2009, which aims at: • continuing to promote the development of advanced multilingual multimedia information access systems • disseminating know-how, tools, and resources to enable DL creators to make content and knowledge accessible, usable and exploitable over time, over media and over language boundaries i2010 Digital Library Initiative
Objectives I TrebleCLEF will promote R&D and industrial take-up of multilingual, multimodal information access functionality in the following ways: • by continuing to support the annual CLEF system evaluation campaigns, with particular focus on: • user modeling, e.g. the requirements of different classes of users when querying multilingual information sources • language-specific experimentation, e.g. looking at differences across languages in order to derive best practices for each language • results presentation, e.g. how results can be presented in the most useful and comprehensible way to the user.
Objectives II • by constituting a scientific forum for the MLIA community of researchers, enabling them to meet and discuss results, emerging trends, new directions • by providing a scientific digital library to make accessible and manage the scientific data and experiments produced during the course of an evaluation campaign, with tools to: • analyze, compare, and cite the data and experiments • curate, preserve, annotate, enrich them (promoting their re-use)
Objectives III • by acting as a virtual centre of competence providing a central reference point for anyone interested in studying or implementing MLIA functionality: • making publicly available sets of guidelines on best practices in MLIA (e.g. what stemmer to use, what stop list, what translation resources, how best to evaluate, etc., depending on the application requirements); • making tools and resources used in the evaluation campaigns freely available to a wider public whenever possible; otherwise providing links to where they can be acquired; • organising workshops, tutorials and training sessions.
Approach • Evaluation • test collections and laboratory evaluation • user evaluation • log analysis • Best Practices & Guidelines • system-oriented aspects of MLIA applications • collaborative user studies • user-oriented aspects of MLIA interfaces • Dissemination and Training • tutorials • workshops • summer school
Consortium • ISTI-CNR, Pisa, Italy • University of Padua, Italy • University of Sheffield, United Kingdom • Universidad Nacional de Educación a Distancia, Spain • Zurich University of Applied Sciences, Switzerland • Centre for the Evaluation of Language Communication Technologies, Italy • Evaluations & Language Resources Distribution Agency, France
Contacts • For further information see: • http://www.trebleclef.eu/ • or contact: • Carol Peters - ISTI-CNR • E-mail: carol.peters@isti.cnr.it