
Evaluating Multilingual Question Answering Systems at CLEF


Presentation Transcript


  1. Evaluating Multilingual Question Answering Systems at CLEF Pamela Forner 1, Danilo Giampiccolo 1, Bernardo Magnini 2, Anselmo Peñas 3, Álvaro Rodrigo 3, Richard Sutcliffe 4. 1 - CELCT, Trento, Italy; 2 - FBK, Trento, Italy; 3 - UNED, Madrid, Spain; 4 - University of Limerick, Ireland

  2. Outline • Background • QA at CLEF • Resources • Participation • Evaluation • Discussion • Conclusions

  3. Background – QA • A Question Answering (QA) system takes as input a short natural language question and a document collection and produces an exact answer to the question, taken from the collection • In Monolingual QA – Q and A in same language • In Cross-Lingual QA – Q and A in different languages

  4. Background – Monolingual Example Question: How many gold medals did Brian Goodell win in the 1979 Pan American Games? Answer: three gold medals Docid: LA112994-0248 Context: When comparing Michele Granger and Brian Goodell, Brian has to be the clear winner. In 1976, while still a student at Mission Viejo High, Brian won two Olympic gold medals at Montreal, breaking his own world records in both the 400- and 1,500-meter freestyle events. He went on to win three gold medals in the 1979 Pan American Games.
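As a minimal sketch of the task's input/output, a gold-standard item like the one above could be represented as follows. The field names are my own choice for illustration, not an official CLEF schema:

```python
from dataclasses import dataclass

@dataclass
class QAExample:
    """One gold-standard item: a question, its exact answer, and the
    supporting document (identified by docid) that justifies it."""
    question: str   # short natural language question
    answer: str     # exact answer string drawn from the collection
    docid: str      # identifier of the supporting document
    context: str    # snippet of the document containing the answer

# The monolingual example from the slide above
example = QAExample(
    question="How many gold medals did Brian Goodell win in the 1979 Pan American Games?",
    answer="three gold medals",
    docid="LA112994-0248",
    context="He went on to win three gold medals in the 1979 Pan American Games.",
)
```

In the cross-lingual case (next slide) the question is posed in one language while the answer, docid and context come from a collection in another language.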

  5. Background – Cross-Lingual Example Question: How high is the Eiffel Tower? Answer: 300 Meter Docid: SDA.950120.0207 Context: Der Eiffelturm wird jaehrlich von 4,5 bis 5 Millionen Menschen besucht. Das 300 Meter hohe Wahrzeichen von Paris hatte im vergangenen Jahr vier neue Aufzuege von der zweiten bis zur vierten Etage erhalten. [English: The Eiffel Tower is visited by 4.5 to 5 million people every year. Last year the 300-metre-high Paris landmark received four new lifts from the second to the fourth floor.]

  6. Background – Grouped Questions • With grouped questions, several questions on the same topic are asked together; they may be linked, even indirectly, by co-reference: • Question: Who wrote the song "Dancing Queen"? • Question: When did it come out? • Question: How many people were in the group?
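A grouped topic can be thought of as an ordered list of questions sharing a subject, where pronouns and definite descriptions in later questions must be resolved against earlier ones. A minimal sketch, with illustrative names rather than anything from the CLEF guidelines:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QuestionGroup:
    """A topic with several related questions; later questions may only be
    interpretable after resolving co-reference against earlier ones."""
    topic: str
    questions: List[str] = field(default_factory=list)

group = QuestionGroup(
    topic='The song "Dancing Queen"',
    questions=[
        'Who wrote the song "Dancing Queen"?',
        'When did it come out?',               # "it" -> the song
        'How many people were in the group?',  # "the group" -> the band behind the song
    ],
)
```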

  7. QA at CLEF - Eras • QA evaluation originated at the Text REtrieval Conference (TREC) from 1999 onwards; the term factoid was coined there • At CLEF, there have been three Eras • Era 1 (2003-06): Ungrouped; mainly factoid; monolingual newspapers; exact answers • Era 2 (2007-08): Grouped; mainly factoid; monolingual newspapers and Wikipedias; exact answers • Era 3 (2009-10): Ungrouped; factoid + others; multilingual aligned EU documents; passages and exact answers

  8. QA at CLEF - Tasks

  9. Resources - Documents • Originally various newspapers (different for each target language, but from the same years, 1994-95) • For Era 2 (linked questions) Wikipedia 2006 was added • With Era 3, the collection changed to the JRC-Acquis corpus – European agreements and laws • In 2010 Europarl was added (partly transcribed debates from the European Parliament) • Acquis and Europarl are parallel aligned (Ha Ha)

  10. Resources - Questions • In all years, questions are back-composed from the target-language corpus • They are carefully grouped into various categories (person, place, etc.) • However, they are not naturally occurring or real

  11. Resources – Back Translation of Questions • Each group composes questions in their own language, with answers in their target document collection • They translate these into English (pivot language) • All resulting English translations are pooled • Each group translates English questions into their language • Eras 1 & 2: Questions in a given target language can be asked in any source language • Era 3: Questions in any target language can be asked in any source language (Ho Ho)

  12. Resources – Back Translation (cont.) • Eras 1 & 2: Each participating group answers different questions, depending on the target language • Era 3: Each group answers the same questions • The Gold Standard comprising questions, answers and contexts in the target language is probably the most interesting thing to come out of the QA at CLEF activity • The back-translation paradigm (sketched below) was worked out for the first campaign
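The back-translation workflow of slides 11 and 12 can be sketched as a small pipeline. The function names and signatures below are illustrative only, not part of any CLEF tooling:

```python
from typing import Callable, Dict, List

def build_multilingual_question_set(
    composed: Dict[str, List[str]],            # language -> questions composed in that language
    to_english: Callable[[str, str], str],     # (question, source language) -> English pivot
    from_english: Callable[[str, str], str],   # (English question, target language) -> translation
) -> Dict[str, List[str]]:
    """Sketch of the back-translation paradigm:
    1. each group composes questions in its own language, answerable from its own collection;
    2. every question is translated into the English pivot and pooled;
    3. each group translates the pooled English questions into its own language.
    With a parallel corpus (Era 3) this yields the same question set in every language."""
    # Step 2: translate into the pivot language and pool
    english_pool = [to_english(q, lang) for lang, qs in composed.items() for q in qs]

    # Step 3: translate the pool into each participating language
    return {lang: [from_english(q, lang) for q in english_pool] for lang in composed}
```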

  13. Participation

  14. Evaluation - Measures • Right / Wrong / Unsupported / ineXact • These standard TREC judgements have been used all along • Accuracy: Proportion of answers judged Right • MRR: Reciprocal of the rank of the first correct answer; each question thus contributes 1, 0.5, 0.33, …, or 0 • C@1: Rewards a system for not answering wrongly • CWS: Rewards a system for being confident of correct answers • K1: Also links correctness and confidence
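A rough sketch of the simpler measures follows: accuracy over judgements, MRR over ranked answers, and c@1, which credits unanswered questions in proportion to overall accuracy rather than counting them as wrong. This is my reading of the published definitions; CWS and K1 are omitted, and the exact formulations are in the CLEF track overview papers:

```python
from typing import Optional, Sequence

def accuracy(judgements: Sequence[Optional[str]]) -> float:
    """Proportion of questions whose answer was judged Right ('R')."""
    return sum(j == "R" for j in judgements) / len(judgements)

def mrr(first_correct_ranks: Sequence[Optional[int]]) -> float:
    """Mean Reciprocal Rank: each question contributes 1/rank of its first
    correct answer (1, 0.5, 0.33, ...), or 0 if no returned answer is correct."""
    return sum(1.0 / r if r else 0.0 for r in first_correct_ranks) / len(first_correct_ranks)

def c_at_1(judgements: Sequence[Optional[str]]) -> float:
    """c@1: unanswered questions (None) are credited in proportion to the
    system's overall accuracy (n_right / n) instead of counting as wrong."""
    n = len(judgements)
    n_right = sum(j == "R" for j in judgements)
    n_unanswered = sum(j is None for j in judgements)
    return (n_right + n_unanswered * (n_right / n)) / n

judgements = ["R", "W", None, "R"]      # Right, Wrong, unanswered, Right
print(accuracy(judgements))             # 0.5
print(mrr([1, None, None, 2]))          # (1 + 0 + 0 + 0.5) / 4 = 0.375
print(c_at_1(judgements))               # (2 + 1 * (2/4)) / 4 = 0.625
```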

  15. Evaluation - Method • Originally, runs were inspected individually by hand • LIM used Perl TREC tools incorporating double judging • The WiQA group produced an excellent web-based system allowing double judging • CELCT produced a web-based system • Evaluation is very interesting work!

  16. Evaluation - Results

  17. Discussion – Era 1 (03-06) • Monolingual QA improved from 49% to 68% • The best system was for a different language each year! • Reason: Increasingly sophisticated techniques were used, mostly learned from TREC, plus CLEF and NTCIR • Cross-Lingual QA remained at 35-45% throughout • Reason: The required improvement in machine translation has not been realised by participants

  18. Discussion – Era 2 (07-08) • Monolingual QA improved from 54% to 64% • However, the range of results was greater, as only a few groups were capable of the more difficult task • Cross-Lingual QA deteriorated from 42% to 19%! • Reason: 42% was an isolated result and the general field was much worse

  19. Discussion – Era 3 (09-10) • In 2009, the task was only passage retrieval (easier) • However, the documents are much more difficult than newspapers and the questions reflect this • Monolingual Passage Retrieval was 61% • Cross-Lingual Passage Retrieval was 18%

  20. Conclusions - General • A lot of groups around Europe and beyond have been able to participate in their own languages • Hence, the general capability in European languages has improved considerably – both systems and research groups • However, people are often interested in their own language only, i.e. monolingual systems • Cross-lingual systems are mostly X->EN or EN->X, i.e. to or from English • Many language directions are supported by the track but not taken up

  21. Conclusions – Resources & Tools • During the campaigns, very useful resources have been developed – Gold Standards for each year • These are readily available and can be used by groups to develop systems even if they did not participate in CLEF • Interesting tools for devising questions and evaluating results have also been produced

  22. Conclusions - Results • Monolingual results have improved to the level of TREC English results • Thus new, more dynamic and more realistic QA challenges must be found for future campaigns • Cross-Lingual results have not improved to the same degree. High quality MT (on Named Entities especially) is not a solved problem and requires further attention
