1 / 22

Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation

Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation. Romaric Besançon (1), Djamel Mostefa, Olivier Hamon, Khalid Choukri (2), Stéphane Chaudiron,Ismaïl Timimi (3). (1). (2). (3). Presentation of the INFILE track. Information Filtering Evaluation

Download Presentation

Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CEA LIST ELDA Univ. Lille 3 - Geriico Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation Romaric Besançon (1), Djamel Mostefa, Olivier Hamon, Khalid Choukri (2), Stéphane Chaudiron,Ismaïl Timimi (3) (1) (2) (3)

  2. CEA LIST ELDA Univ. Lille 3 - Geriico Presentation of the INFILE track • Information Filtering Evaluation • Filter documents from a document stream according to long-term information needs (user profiles) • Second edition of the INFILE track in CLEF • 1 participant in 2008 • use same data in 2009

  3. CEA LIST ELDA Univ. Lille 3 - Geriico Presentation of the INFILE track • Mutlilingual • English, French, Arabic for both documents and topics • Two tasks • batch filtering • the whole corpus is given to the participants, which must return a list of filtered documents for each topic • adaptive filtering • documents are provided to the participants one at a time through an interactive procedure, with possible automated feedback to adapt the filtering system • closer to real usage in a context of competitive intelligence

  4. CEA LIST ELDA Univ. Lille 3 - Geriico Document Collection • Built from a corpus of news from the AFP (Agence France Presse) • almost 1.5 million news in French, English and Arabic • For the information filtering task: • 100 000 documents to filter, in each language • NewsML format • standard XML format for news (IPTC)

  5. CEA LIST ELDA Univ. Lille 3 - Geriico Document example document identifier keywords headline

  6. CEA LIST ELDA Univ. Lille 3 - Geriico Document example IPTC category AFP category content

  7. CEA LIST ELDA Univ. Lille 3 - Geriico Topics • 50 interest profiles • 20 profiles in the domain of science and technology • developped by CI professionals from French institutes INIST, ARIS, Oto Research, Digiport • 30 profiles of general interest • Profiles developed in French/English • Translated into Arabic

  8. CEA LIST ELDA Univ. Lille 3 - Geriico Topics • Each profile contains 5 fields: • title: a few words description • description: a one-sentence description • narrative: a longer description of what is considered a relevant document • keywords: a set of key words, key phrases or named entities • sample: a sample of relevant document (one paragraph) • Participants may use any subset of the fields for their filtering

  9. CEA LIST ELDA Univ. Lille 3 - Geriico Topic Example

  10. CEA LIST ELDA Univ. Lille 3 - Geriico Some topic examples • in general domain • in scientific information domain

  11. CEA LIST ELDA Univ. Lille 3 - Geriico Constitution of the corpus • Same corpus as INFILE@CLEF 2008 • With simulated feedback, we need the ground truth before the campaign • To build the corpus of documents to filter: • find relevant documents for the profiles in the original corpus • use a pooling technique with results of IR tools • 4 IR engines (Lucene, Indri, Zettair and CEA search engine), on several query fields combinations • iterative pooling using Mixture-of-Experts model

  12. CEA LIST ELDA Univ. Lille 3 - Geriico Constitution of the corpus (2) collection test collection • keep all documents assessed • documents returned by IR systems by judged not relevant form a set of difficult documents • choose random documents (noise) retrieved random assessed relevant

  13. CEA LIST ELDA Univ. Lille 3 - Geriico Corpus Number of relevant documents for each topic, in each language

  14. CEA LIST ELDA Univ. Lille 3 - Geriico Tasks • Batch filtering (02/04 - 30/05) • documents and topics available to participants • return list of filtered documents per topic (unordered) • Adaptive filtering (03/06 - 10/07) • topics available to participants • documents available one at a time (one pass test) • interactive protocol using a client-server architecture (webservice communication) • new document available only if previous one has been filtered • available simulated user feedback • for adapatation • limited number of feedbacks (200)

  15. CEA LIST ELDA Univ. Lille 3 - Geriico Evaluation metrics • Standard precision / recall / F-measure • Utility (from TREC filtering tracks) • per profile and averaged on all profiles • adaptivity: evolution curve (values computed each 10000 documents) • two experimental measures • originality • number of relevant documents a system uniquely retrieves • anticipation • inverse rank of first relevant document detected

  16. CEA LIST ELDA Univ. Lille 3 - Geriico INFILE Participants • 9 registered • 5 submitted runs • batch filtering • 3 participants, 12 runs • interactive filtering • 2 participants, 3 runs 27

  17. CEA LIST ELDA Univ. Lille 3 - Geriico INFILE results • Repartition of runs by task and languages batch adaptive eng fre ara fre ara eng

  18. CEA LIST ELDA Univ. Lille 3 - Geriico INFILE results – monolingual batch filtering

  19. CEA LIST ELDA Univ. Lille 3 - Geriico INFILE results – crosslingual / adaptive filtering 57% best mono 90% same team mono crosslingual better than monolingual

  20. CEA LIST ELDA Univ. Lille 3 - Geriico INFILE results – anticipation/originality strongly correlated with recall too few pariticipants

  21. CEA LIST ELDA Univ. Lille 3 - Geriico Approaches • Filtering • adapted Information Retrieval tools (Lucene) • SVM classifier with external ressources (GoogleNews) • textual similarity measures with thresholds • reasoning model (human plausible reasoning) • Adaptation • adaptation of selection thresholds • user feedback as parameter in reasoning model • Crosslingual • bilingual dictionaries • machine translation

  22. CEA LIST ELDA Univ. Lille 3 - Geriico Conclusion and after… • Increasing participation, reasonable result, but not enough… • Currently, no INFILE track planned for next year • interest in multilingual filtering ? • 2/3 runs on monolingual English • not enough participants for crosslingual to have comparative results • no funding • INFILE evaluation kit will be made available • corpus of documents / topics / relevance assessments • tools for the interactive adaptive filtering procedure • tools for the evaluation • distributed by ELDA

More Related