
EQUER: The French Evaluation campaign of Question-Answering Systems



  1. EVALDA / EQUER. EQUER: The French Evaluation campaign of Question-Answering Systems. Christelle Ayache, ELDA (ayache@elda.org); Brigitte Grau, Anne Vilnat, LIMSI-CNRS ({grau, vilnat}@limsi.fr)

  2. Outline: 1. Presentation  2. Document collections  3. Question corpora  4. Evaluation process  5. Human assessment of the results  6. Scoring phase  7. Evaluation results  8. Conclusion

  3. Presentation
  Organizer: ELDA
  Scientific project leader: LIMSI-CNRS
  Data and tools providers:
  - ELDA: General corpus
  - The CISMEF team, AP/HP: Medical data
  - Systal-Pertimm: Search engine
  Participants:
  - 3 private institutions: France Télécom, Sinequa, Synapse Développement
  - 5 public laboratories: CEA-List, LIA, LIMSI-CNRS, STIM-AP/HP & LIPN, Université de Neuchâtel

  4. Presentation
  Objectives:
  - To provide an evaluation framework for Question-Answering systems for the French language.
  - To assess the state of the art of this research activity in France and keep the evaluation framework up to date.
  - To further the development of this activity by supplying corpora to researchers.
  3 tasks were planned:
  - A general task.
  - A task in a restricted domain (the medical domain).
  - A general web task, not completed because it proved very difficult to obtain the official rights.

  5. 2. Document Collections
  2.1. The General corpus: about 1.5 GB, about 560,000 docs. Constitution and normalization: ELDA.
  Press releases and formal reports from the French Senate:
  - Le Monde (1992-2000), source: xml
  - Le Monde Diplomatique (1992-2000), source: xml
  - French Swiss news agency releases (1994-1995), source: xml
  - The French Senate (1996-2001), source: html

  6. 2. Document Collections
  2.2. The Medical corpus: about 140 MB, about 6,000 docs. Constitution and normalization: STIM-AP/HP.
  Scientific articles and various references to "good medical practice" extracted from the following websites:
  - Santé Canada
  - Orphanet
  - CHU Rouen
  - FNLCC (Fédération Nationale de Lutte Contre le Cancer)

  7. 3. Corpora of questions
  4 types of questions were defined:
  - Factoid: Who is the president of Italy?
  - Definition: Who is Salvador Dali?
  - List: What are the 4 main religions practiced in Hungary?
  - Yes/No: Is there a TGV line from Valencia to Paris?
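
  As an illustration, this typology could be carried through a processing pipeline as sketched below; the QuestionType and Question names are hypothetical, only the four categories and the example questions come from the slide.

    from dataclasses import dataclass
    from enum import Enum

    class QuestionType(Enum):
        FACTOID = "factoid"
        DEFINITION = "definition"
        LIST = "list"
        YES_NO = "yes/no"

    @dataclass
    class Question:
        qid: int
        text: str
        qtype: QuestionType

    # The four example questions from the slide, tagged with their type.
    EXAMPLES = [
        Question(1, "Who is the president of Italy?", QuestionType.FACTOID),
        Question(2, "Who is Salvador Dali?", QuestionType.DEFINITION),
        Question(3, "What are the 4 main religions practiced in Hungary?", QuestionType.LIST),
        Question(4, "Is there a TGV line from Valencia to Paris?", QuestionType.YES_NO),
    ]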

  8. 4. Evaluation process
  16th July 2004: ELDA provided the "Evaluation data" to the participants:
  - Corpus of questions (General task)
  - Corpus of questions (Medical task)
  - For each question, the first 100 doc IDs returned by Pertimm.
  23rd July 2004: The participants returned their systems' results to ELDA. Each participant could submit up to 2 runs per task.

  9. 5. Human assessment of the results
  5.1. Evaluation specifications
  The passages were systematically evaluated; the evaluation of the short answers was not mandatory.
  There were 4 possible judgements for short answers:
  0 - Incorrect
  1 - Correct
  2 - Inexact
  3 - Not justified
  There were 2 possible judgements for the passages:
  0 - Incorrect
  1 - Correct
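
  As an illustration, these judgement scales could be represented as follows; the class and member names are hypothetical, only the numeric codes and labels come from the slide.

    from enum import IntEnum

    class ShortAnswerJudgement(IntEnum):
        """4-point scale used by the assessors for short answers."""
        INCORRECT = 0
        CORRECT = 1
        INEXACT = 2
        NOT_JUSTIFIED = 3

    class PassageJudgement(IntEnum):
        """2-point scale used by the assessors for answer passages."""
        INCORRECT = 0
        CORRECT = 1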

  10. 5. Human assessment of the results
  5.2. The General task
  Two students evaluated the results (1 month).
  Inter-judge agreement evaluation: less than 5% disagreement (see the sketch after this slide), so the judgements were validated.
  In total, 12 submissions were evaluated.
  5.3. The Medical task
  A specialist from CISMEF (CHU Rouen) evaluated the results (a couple of weeks).
  In total, 7 submissions were evaluated.
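
  A minimal sketch of the kind of disagreement-rate computation implied above, assuming a simple aligned-list layout for the two assessors' judgements (hypothetical, not the actual EQUER assessment format):

    def disagreement_rate(judge_a, judge_b):
        """Fraction of items on which two assessors gave different judgements.

        judge_a, judge_b: lists of judgement codes, aligned item by item.
        """
        if len(judge_a) != len(judge_b):
            raise ValueError("the two assessors must judge the same items")
        differing = sum(1 for a, b in zip(judge_a, judge_b) if a != b)
        return differing / len(judge_a)

    # Example: 1 disagreement out of 5 judged answers -> 0.2 (20%).
    # In EQUER, the observed rate was below 5%, so the judgements were validated.
    print(disagreement_rate([1, 0, 2, 1, 0], [1, 0, 2, 3, 0]))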

  11. 6. Scoring phase
  Use of 2 standard metrics:
  - MRR (Mean Reciprocal Rank): "Factual", "Definition" and "Yes/No" questions.
  - NIAP (Non Interpolated Average Precision): "List" questions.
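
  As an illustration, a minimal Python sketch of both metrics under a simplified input format (one ranked list of 0/1 judgements per question); the function names and the input layout are assumptions, not the actual EQUER scoring scripts.

    def mean_reciprocal_rank(question_judgements):
        """MRR over a set of questions.

        question_judgements: one list per question, containing 1 (correct) or
        0 (incorrect) judgements in rank order. The score of a question is
        1/rank of the first correct answer, or 0 if none is correct.
        """
        total = 0.0
        for judgements in question_judgements:
            for rank, correct in enumerate(judgements, start=1):
                if correct:
                    total += 1.0 / rank
                    break
        return total / len(question_judgements)

    def non_interpolated_average_precision(judgements, n_expected):
        """NIAP for one 'List' question.

        judgements: ranked 0/1 list for the returned answers.
        n_expected: number of distinct correct answers expected for the question.
        """
        hits, precision_sum = 0, 0.0
        for rank, correct in enumerate(judgements, start=1):
            if correct:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / n_expected if n_expected else 0.0

    # Example: first correct answer at rank 2 for Q1 and rank 1 for Q2 -> MRR = 0.75
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    # Example: 2 of 4 expected list items found, at ranks 1 and 3 -> NIAP ~ 0.42
    print(non_interpolated_average_precision([1, 0, 1, 0], 4))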

  12. 7. Evaluation results 7.1. General task, short answers and passages

  13. 7. Evaluation results 7.2. Medical task, short answers and passages

  14. 8. Conclusion and dissemination
  8.1. Conclusion
  Participants: There was major interest in the project from both academic and industrial players in this domain. Some participants had never taken part in this kind of evaluation before.
  Evaluation: EQUER introduced an innovative question type: the Yes/No question. EQUER is one of the few projects to draw upon the medical field for its QA task. EQUER is linked to CLEF, which has offered a specialized task for the evaluation of QA systems in Europe since 2003.

  15. 8. Conclusion and dissemination
  8.2. Dissemination
  Evaluation package: all the data provided to the participants during the campaign (guidelines, text and question corpora, tools, reports, etc.).
  This package will allow:
  - anyone to evaluate their system under the same conditions as those of EQUER;
  - users to compare their results to those of the 2004 EQUER evaluation.
  The first version of the EQUER package will be available in June.

  16. 7.3. General corpus: factual, definitions and yes/no

  17. 7.4. Medical corpus: factual, definitions and yes/no

  18. 8.5. TREC (USA) and NTCIR (Japan)


  21. 6. The scoring phase
  MRR: Mean Reciprocal Rank. This criterion takes into account the rank of the first correct answer found; if a correct answer is found several times, it is counted only once. Systems that do not find a correct answer at rank 1 are penalized.
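
  A standard formulation consistent with this description (the formula on the original slide is not in the transcript, so this is a reconstruction under the usual TREC-style definition):

    \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}

  where Q is the set of questions and rank_i is the rank of the first correct answer returned for question i; the term is taken as 0 when no correct answer is returned.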

  22. 6. The scoring phase
  NIAP: Non Interpolated Average Precision. This criterion takes into account both the standard Recall and Precision metrics as well as the order (rank) of the correct answers in the returned list, as captured by the formulas sketched below.
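
  The slide's own defining formulas were not transcribed; a standard formulation of non-interpolated average precision that matches the description is:

    \mathrm{NIAP} = \frac{1}{R} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)

  with:

    P(k) = \frac{|\{\text{correct answers among the first } k \text{ returned}\}|}{k}

  and:

    \mathrm{rel}(k) = 1 if the answer at rank k is correct, 0 otherwise,

  where R is the number of expected correct answers for the question and n the number of answers returned.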
