Overview of the Multilingual Question Answering Track Danilo Giampiccolo QA@CLEF 2006 Workshop
Outline • Tasks • Test set preparation • Participants • Evaluation • Results • Final considerations • Future perspectives
QA 2006: Organizing Committee • ITC-irst (Bernardo Magnini): main coordinator • CELCT (D. Giampiccolo, P. Forner): general coordination, Italian • DFKI (B. Sacaleanu): German • ELDA/ELRA (C. Ayache): French • Linguateca (P. Rocha): Portuguese • UNED (A. Peñas): Spanish • U. Amsterdam (Valentin Jijkoun): Dutch • U. Limerick (R. Sutcliffe): English • Bulgarian Academy of Sciences (P. Osenova): Bulgarian • Source languages only: • Depok University of Indonesia (M. Adriani): Indonesian • IASI, Romania (D. Cristea): Romanian • Wrocław University of Technology (J. Pietraszko): Polish
QA@CLEF-06: Tasks • Main task: • Monolingual: the language of the question (Source language) and the language of the news collection (Target language) are the same • Cross-lingual: the questions were formulated in a language different from that of the news collection • One pilot task: • WiQA: coordinated by Maarten de Rijke • Two exercises: • Answer Validation Exercise (AVE): coordinated by Anselmo Peñas • Real Time: a “time-constrained” QA exercise coordinated by the University of Alicante (Fernando Llopis)
Data set: Question format 200 questions of three kinds • FACTOID (loc, mea, org, oth, per, tim; ca. 150): What party did Hitler belong to? • DEFINITION (ca. 40): Who is Josef Paul Kleihues? • reduced in number (-25%) • two new categories added: • Object: What is a router? • Other: What is a tsunami? • LIST (ca. 10): Name works by Tolstoy • Temporally restricted (ca. 40): by date, by period, by event • NIL (ca. 20): questions that do not have any known answer in the target document collection • input format: question type (F, D, L) not indicated (see the sketch below)
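To make the question categories above concrete, here is a minimal Python sketch of how one test-set question might be represented. The field names are hypothetical and do not reproduce the official CLEF input format; in particular, the F/D/L type label is information the participating systems did not receive.

```python
# Illustrative only: hypothetical in-memory representation of one test-set question.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestQuestion:
    qid: str
    text: str
    # Known to the organisers only; NOT present in the input given to systems:
    qtype: Optional[str] = None          # 'F' (factoid), 'D' (definition), 'L' (list)
    answer_type: Optional[str] = None    # e.g. 'loc', 'mea', 'org', 'oth', 'per', 'tim'
    temporal_restriction: Optional[str] = None  # by date, by period, or by event
    nil: bool = False                    # True if no answer exists in the collection

q = TestQuestion(qid="0001", text="What party did Hitler belong to?",
                 qtype="F", answer_type="org")
```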
Data set: run format • Multiple answers: from one to ten exact answers per question • exact = neither more nor less than the information required • each answer has to be supported by • docid • one to ten text snippets justifying the answer (substrings of the specified document giving the actual context) (see the sketch below)
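A minimal sketch of the run constraints just described, again with hypothetical field names rather than the official run syntax: one to ten answers per question, each backed by a docid and one to ten supporting snippets.

```python
# Illustrative only: mirrors the cardinality constraints stated on this slide.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Answer:
    exact: str                 # the exact answer string, nothing more, nothing less
    docid: str                 # identifier of the supporting document
    snippets: List[str] = field(default_factory=list)  # 1-10 substrings of that document

@dataclass
class QuestionRun:
    qid: str
    answers: List[Answer] = field(default_factory=list)  # 1-10 answers per question

def is_valid(run: QuestionRun) -> bool:
    """Check the 1-10 answers and 1-10 snippets-per-answer constraints."""
    if not 1 <= len(run.answers) <= 10:
        return False
    return all(a.docid and 1 <= len(a.snippets) <= 10 for a in run.answers)
```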
Activated Tasks (at least one registered participant) • 11 Source languages (10 in 2005) • 8 Target languages (9 in 2005) • No Finnish task / New languages: Polish and Romanian
Activated Tasks • questions were not translated into all the languages • Gold Standard: questions in multiple languages only for tasks where there was at least one registered participant • More interest in cross-linguality
Participants
List of participants • Industrial Companies
Submitted runs
Number of answers and snippets per question [charts: number of RUNS with respect to the number of answers per question (1 answer, between 2 and 5 answers, more than 5 answers) and number of SNIPPETS for each answer (1, 2, 3, 4 or more snippets)]
Evaluation • As in previous campaigns • runs manually judged by native speakers • each answer: Right, Wrong, ineXact, Unsupported • up to two runs for each participating group • Evaluation measures • Accuracy (for F, D): main evaluation score, calculated for the FIRST ANSWER only • excessive workload: some groups could manually assess only a limited number of answers per question • 1 answer: Spanish and English • 3 answers: French • 5 answers: Dutch • all answers: Italian, German, Portuguese • P@N for List questions • Additional evaluation measures • K1 measure • Confidence Weighted Score (CWS) • Mean Reciprocal Rank (MRR) (see the sketch below)
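The ranked-answer measures named above can be illustrated with a short Python sketch. The formulations below follow the commonly cited definitions; the official QA@CLEF definitions are given in the track overview papers, so treat this only as an approximation. Judgements are assumed to be 'R' for Right (anything else counted as not right), and confidences are assumed to lie in [0, 1].

```python
def accuracy(first_answer_judgements):
    """Fraction of questions whose FIRST answer was judged Right."""
    return sum(j == 'R' for j in first_answer_judgements) / len(first_answer_judgements)

def mrr(ranked_judgements_per_question):
    """Mean Reciprocal Rank: 1/rank of the first Right answer, 0 if none."""
    total = 0.0
    for judgements in ranked_judgements_per_question:
        for rank, j in enumerate(judgements, start=1):
            if j == 'R':
                total += 1.0 / rank
                break
    return total / len(ranked_judgements_per_question)

def cws(judgements_by_confidence):
    """Confidence Weighted Score: questions sorted by decreasing confidence;
    average, over positions i, of (Right answers among the first i) / i."""
    total, right_so_far = 0.0, 0
    for i, j in enumerate(judgements_by_confidence, start=1):
        right_so_far += (j == 'R')
        total += right_so_far / i
    return total / len(judgements_by_confidence)

def k1(judgements, confidences):
    """K1 measure: mean of confidence * (+1 if Right, -1 otherwise); ranges over [-1, 1]."""
    return sum(c * (1 if j == 'R' else -1)
               for j, c in zip(judgements, confidences)) / len(judgements)
```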
Question Overlap among Languages, 2005-2006
Results: Best and Average scores * 49.47: this result is still under validation.
Best results in 2004-2005-2006 * 22.63: this result is still under validation.
Participants in 2004-2005-2006: compared best results
List questions • Best: 0.8333 (Priberam, Monolingual PT) • Average: 0.138 • Problems • Wrong classification of List questions in the Gold Standard • “Mention a Chinese writer” is not a List question! • Definition of List questions • “closed” List questions asking for a finite number of answers Q: What are the names of the two lovers from Verona separated by family issues in one of Shakespeare’s plays? A: Romeo and Juliet. • “open” List questions requiring a list of items as answer Q: Name books by Jules Verne. A: Around the World in 80 Days. A: Twenty Thousand Leagues Under the Sea. A: Journey to the Centre of the Earth. (see the P@N example below)
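The P@N score used for List questions (see the Evaluation slide) can be illustrated with a small worked example based on the Jules Verne question above. Here N is simply taken to be the number of answers the system returned; the official track definition may fix N differently, and the gold/returned lists below are invented for the example, so this is an illustration only.

```python
def precision_at_n(returned, gold):
    """Fraction of the returned items that are correct."""
    correct = sum(1 for item in returned if item in gold)
    return correct / len(returned)

# Hypothetical gold list and system output for "Name books by Jules Verne."
gold = {"Around the World in 80 Days",
        "Twenty Thousand Leagues Under the Sea",
        "Journey to the Centre of the Earth"}
returned = ["Around the World in 80 Days",
            "Twenty Thousand Leagues Under the Sea",
            "War and Peace"]
print(precision_at_n(returned, gold))  # 2 correct out of 3 returned -> 0.666...
```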
Final considerations • Increasing interest in multilingual QA • More participants (30, +25%) • Two new languages as source (Romanian and Polish) • More activated tasks (24, compared to 23 in 2005) • More submitted runs (77, +13%) • More cross-lingual tasks (35, +31.5%) • Gold Standard: questions not translated into all languages • No possibility of activating tasks at the last minute • Useful as a reusable resource: available in the near future.
Final considerations: 2006 main task innovations • Multiple answers: • good response • limited capacity of assessing large numbers of answers • feedback welcome from participants • Supporting snippets: • faster evaluation • feedback from participants • “F/D/L” labels not given in the input format: • positive, as apparently there was no real impact on List questions
Future perspectives: main task • For discussion: • Romanian as target • Very hard questions (implying reasoning and multiple-document answers) • Allow collaboration among different systems • Partial automated evaluation (right answers)