EQUER: The French Evaluation Campaign of Question-Answering Systems
EVALDA / EQUER
Christelle Ayache, ELDA (ayache@elda.org)
Brigitte Grau, Anne Vilnat, LIMSI-CNRS ({grau, vilnat}@limsi.fr)
Outline
1. Presentation
2. Document collections
3. Question corpora
4. Evaluation process
5. Human assessment of the results
6. Scoring phase
7. Evaluation results
8. Conclusion
1. Presentation
Organizer: ELDA
Scientific project leader: LIMSI-CNRS
Data and tool providers:
- ELDA: general corpus
- The CISMEF team, AP/HP: medical data
- Systal-Pertimm: search engine
Participants:
- 3 private companies: France Télécom, Sinequa, Synapse Développement
- 5 public laboratories: CEA-List, LIA, LIMSI-CNRS, STIM-AP/HP & LIPN, Université de Neuchâtel
1. Presentation
Objectives:
- To provide an evaluation framework for question-answering systems for the French language.
- To assess the state of the art of this research activity in France and provide an up-to-date evaluation framework.
- To further the development of this activity by supplying corpora to researchers.
Three tasks were planned:
- A general task.
- A task in a restricted domain (the medical domain).
- A general Web task, which was not completed because obtaining the official rights to the documents proved too difficult.
2. Document Collections
2.1. The General corpus
About 1.5 GB, about 560,000 documents.
Constitution and normalization: ELDA.
Newspaper articles, news agency releases, and formal reports from the French Senate:
- Le Monde (1992-2000), source: XML
- Le Monde Diplomatique (1992-2000), source: XML
- French Swiss news agency releases (1994-1995), source: XML
- The French Senate (1996-2001), source: HTML
2. Document Collections
2.2. The Medical corpus
About 140 MB, about 6,000 documents.
Constitution and normalization: STIM-AP/HP.
Scientific articles and various references on "good medical practice" extracted from the following websites:
- Santé Canada
- Orphanet
- CHU Rouen
- FNLCC (Fédération Nationale de Lutte Contre le Cancer)
3. Question corpora
4 types of questions were defined (a sketch of a possible encoding follows):
- Factoid: Who is the president of Italy?
- Definition: Who is Salvador Dali?
- List: Which are the 4 main religions practiced in Hungary?
- Yes/No: Is there a TGV line from Valencia to Paris?
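As an illustration only (none of this code comes from the campaign materials, and all names are hypothetical), the four question types might be encoded like this:

```python
from dataclasses import dataclass
from enum import Enum

class QuestionType(Enum):
    """The four EQUER question types."""
    FACTOID = "factoid"
    DEFINITION = "definition"
    LIST = "list"
    YES_NO = "yes/no"

@dataclass
class Question:
    qid: int
    qtype: QuestionType
    text: str

q = Question(1, QuestionType.FACTOID, "Who is the president of Italy?")
```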
4. Evaluation process
16 July 2004: ELDA provided the evaluation data to the participants:
- the corpus of questions for the general task;
- the corpus of questions for the medical task;
- for each question, the IDs of the first 100 documents returned by Pertimm.
23 July 2004: the participants returned their systems' results to ELDA. Each participant could submit up to 2 runs per task.
5. Human assessment of the results
5.1. Evaluation specifications
The passages were systematically evaluated; evaluating the short answers was not mandatory.
There were 4 possible judgements for short answers (see the sketch after this list):
0 - Incorrect
1 - Correct
2 - Inexact
3 - Not justified
There were 2 possible judgements for the passages:
0 - Incorrect
1 - Correct
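A minimal sketch (my own, not from the campaign guidelines) of how the judgement codes might be encoded for the scoring phase; treating only code 1 as correct is an assumption:

```python
# Judgement codes for short answers, as listed in the EQUER specifications.
SHORT_ANSWER_JUDGEMENTS = {
    0: "incorrect",
    1: "correct",
    2: "inexact",        # assumption: answer string too long or too short
    3: "not justified",  # assumption: correct but unsupported by the cited document
}

# Judgement codes for passages.
PASSAGE_JUDGEMENTS = {0: "incorrect", 1: "correct"}

def is_correct(judgement: int) -> bool:
    # Assumption: inexact and unjustified answers count as wrong for scoring.
    return judgement == 1
```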
5. Human assessment of the results
5.2. The General task
Two students evaluated the results (1 month).
Inter-judge agreement: less than 5% disagreement, which validated the judgements (a sketch of this computation follows).
In total, 12 submissions were evaluated.
5.3. The Medical task
A specialist from CISMEF (CHU Rouen) evaluated the results (a couple of weeks).
In total, 7 submissions were evaluated.
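A minimal sketch of how such an inter-judge disagreement rate can be computed; the parallel-list input format is my assumption, not the campaign's actual data layout:

```python
def disagreement_rate(judge_a: list[int], judge_b: list[int]) -> float:
    """Fraction of items given different judgement codes by the two judges.
    EQUER reported less than 5% disagreement on the general task."""
    if len(judge_a) != len(judge_b):
        raise ValueError("judges must rate the same items")
    differing = sum(a != b for a, b in zip(judge_a, judge_b))
    return differing / len(judge_a)

# Example: the judges disagree on 1 of 20 items -> 0.05 (5%).
print(disagreement_rate([1] * 19 + [0], [1] * 20))
```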
6. Scoring phase
Two standard metrics were used (a combined sketch of both follows):
- MRR (Mean Reciprocal Rank): for the factoid, definition and yes/no questions.
- NIAP (Non-Interpolated Average Precision): for the list questions.
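A self-contained sketch of both metrics as they are standardly defined; the input format (per-question lists of correctness flags in rank order) is an assumption of mine, not the campaign's run format:

```python
def mean_reciprocal_rank(runs: list[list[bool]]) -> float:
    """MRR over a set of questions: average of 1/rank of the first
    correct answer, counting 0 when no correct answer is returned."""
    total = 0.0
    for ranked in runs:
        for rank, correct in enumerate(ranked, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(runs)

def niap(ranked: list[bool], n_relevant: int) -> float:
    """Non-Interpolated Average Precision for one list question:
    precision-at-k summed over the ranks k of the correct answers,
    divided by the total number of relevant items (so missed items
    lower the score)."""
    hits, precision_sum = 0, 0.0
    for rank, correct in enumerate(ranked, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_relevant if n_relevant else 0.0

# MRR: first correct answers at ranks 1 and 2 -> (1 + 1/2) / 2 = 0.75
print(mean_reciprocal_rank([[True, False], [False, True]]))
# NIAP: correct at ranks 1, 3, 4 out of 3 relevant -> ~0.806
print(niap([True, False, True, True], 3))
```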
7. Evaluation results
7.1. General task: short answers and passages [results figure]
7. Evaluation results
7.2. Medical task: short answers and passages [results figure]
8. Conclusion and dissemination
8.1. Conclusion
- Participants: there was strong interest in the project from both academic and industrial players in this domain. Some participants had never taken part in this kind of evaluation before.
- Evaluation: EQUER introduced an innovative question type, the Yes/No question. EQUER is one of the few projects to draw on the medical field for its QA task. EQUER is linked to CLEF, which since 2003 has provided a specialized track for the evaluation of QA systems in Europe.
8. Conclusion and dissemination
8.2. Dissemination
- Evaluation package: all the data provided to the participants during the campaign (guidelines, text and question corpora, tools, reports, etc.).
- This package will allow anyone to evaluate their system under the same conditions as in EQUER, and to compare their results to those of the 2004 EQUER evaluation.
- The first version of the EQUER package will be available in June.
7.3. General corpus: factoid, definition and yes/no questions [results figure]
7.4. Medical corpus: factoid, definition and yes/no questions [results figure]
8.5. TREC (USA) and NTCIR (Japan)
6. Scoring phase
- MRR (Mean Reciprocal Rank): this criterion takes into account the rank of the first correct answer found. If a correct answer is found several times, it is counted only once. Systems that do not find the correct answer at rank 1 are penalized.
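The standard MRR formula (the slide does not spell it out) over a question set Q, where rank_i is the rank of the first correct answer to question i and the term is taken as 0 when no correct answer is returned:

\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
\]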
6. Scoring phase
- NIAP (Non-Interpolated Average Precision): this criterion takes into account both the standard recall and precision metrics, as well as the order (rank) of the correct answers in the list. In its standard form, for one list question:

\[
\mathrm{NIAP} = \frac{1}{R} \sum_{k \,:\, \text{answer at rank } k \text{ is correct}} P(k)
\]

with:

\[
P(k) = \frac{\text{number of correct answers among the top } k}{k}
\]

and R the total number of relevant (correct) answers for the question.
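A worked toy example (my own, not from the slides): a system returns five answers to a list question with R = 3 relevant items, and the correct ones sit at ranks 1, 3 and 4. Then

\[
\mathrm{NIAP} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{4}\right) \approx 0.81
\]

so both a missed relevant item and a late rank pull the score down.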