On the Evaluation of Snippet Selection for Information Retrieval A. Overwijk, D. Nguyen, C. Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong
Contents • Properties of a good evaluation method • Evaluation method of WebCLEF • Approach • Results • Analysis • Conclusion
Good evaluation method • Reflects the quality of the system • Reusability
Evaluation method of WebCLEF • Recall • The sum of character lengths of all spans in the response of the system linked to nuggets (i.e. an aspect the user includes in his article), divided by the total sum of span lengths in the responses for a topic in all submitted runs. • Precision • The number of characters that belong to at least one span linked to a nugget, divided by the total character length of the system’s response.
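The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official WebCLEF scorer: spans are assumed to be half-open character offsets into the system's response, and the names `covered_chars`, `precision`, `recall` and `pooled_span_length` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a span is a half-open [start, end) pair of non-negative
# character offsets into the system's response, already linked to a nugget.
Span = Tuple[int, int]


def covered_chars(spans: List[Span]) -> int:
    """Count characters that belong to at least one span (overlaps count once)."""
    total = 0
    current_end = 0
    for start, end in sorted(spans):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            total += end - start
            current_end = end
    return total


def precision(spans: List[Span], response_length: int) -> float:
    """Characters covered by nugget-linked spans / length of the response."""
    if response_length == 0:
        return 0.0
    return covered_chars(spans) / response_length


def recall(spans: List[Span], pooled_span_length: int) -> float:
    """Sum of this run's span lengths / sum of span lengths over all runs."""
    if pooled_span_length == 0:
        return 0.0
    return sum(end - start for start, end in spans) / pooled_span_length


# Made-up numbers: a 200-character response with two overlapping spans,
# and 360 characters of spans pooled over all submitted runs for the topic.
spans = [(0, 50), (40, 80)]
p = precision(spans, response_length=200)
r = recall(spans, pooled_span_length=360)
```

Note that the recall denominator depends on which runs were submitted, so the same response can score differently against a different pool.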
Approach • Better system, better performance scores? • Similar system, same performance scores? • Worse system, lower performance scores?
Better system • Last year’s best performing system contains a bug:
our %stopwords = qw( 's a … zwischen );
for my $w … { next if exists $stopwords{$w}; … }
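The slide does not spell the bug out. A likely reading, stated here as an assumption, is that assigning a flat word list to a Perl hash consumes the words pairwise as key and value, so only every other stopword becomes a key and the `exists` test misses the rest. The Python analogue below mimics that behaviour; the helper `perl_hash_from_list` and the four example words are illustrative, not taken from the system.

```python
def perl_hash_from_list(words):
    """Mimic Perl's `%h = (list)`: items are consumed pairwise as key, value.

    Simplification: a trailing unpaired item is dropped here, whereas Perl
    would keep it as a key with an undefined value.
    """
    return dict(zip(words[0::2], words[1::2]))


# Illustrative stopwords only; the real list on the slide is elided.
words = ["the", "a", "of", "and"]

buggy = perl_hash_from_list(words)   # only "the" and "of" end up as keys
fixed = {w: 1 for w in words}        # every stopword is a key

tokens = ["the", "evaluation", "of", "a", "snippet"]
kept_buggy = [t for t in tokens if t not in buggy]  # "a" slips through
kept_fixed = [t for t in tokens if t not in fixed]
```

In Perl itself, the usual way to build such a lookup is `map { $_ => 1 } qw(...)`, which corresponds to the `fixed` dictionary above.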
Similar system • General idea • Almost identical snippets should have almost the same precision and recall • Experiment • Remove the last word for every snippet in the output of last year’s best performing system
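The perturbation used in this experiment is simple enough to sketch. The function below is a hypothetical reconstruction that assumes whitespace tokenisation; the slides do not say how words were delimited.

```python
def drop_last_word(snippet: str) -> str:
    """Return the snippet without its final whitespace-delimited word."""
    words = snippet.split()
    return " ".join(words[:-1])


# Made-up snippets standing in for a system's output.
original = ["snippet selection is hard", "evaluation must be reusable"]
perturbed = [drop_last_word(s) for s in original]
```

Because `split()` collapses runs of whitespace, the output is also whitespace-normalised, which by itself can change character offsets in the snippet.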
Worse system • Delivering snippets based on occurrence • 1st snippet = 1st paragraph of 1st document • 2nd snippet = 2nd paragraph of 2nd document • ... • No difference with search engines, except that documents are split up in snippets
Analysis • Pool of snippets • Implementation • Assessments
Conclusion • Evaluation method is not sufficient: • Biased towards participating systems • Correctness of a snippet is too strict • Recommendations: • N-grams (e.g. ROUGE) • Multiple assessors per topic
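To illustrate why an n-gram measure is less strict, here is a minimal sketch of an n-gram recall in the spirit of ROUGE-N. It is a simplification under stated assumptions (lowercased whitespace tokens, a single reference, no stemming) and not the official ROUGE implementation; `ngrams` and `ngram_recall` are hypothetical names.

```python
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Multiset of n-grams over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams also found in the candidate (clipped counts)."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    cand = ngrams(candidate, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


# Made-up example: the candidate is the reference minus its last word.
reference = "snippet selection for information retrieval"
candidate = "snippet selection for information"
unigram = ngram_recall(candidate, reference, n=1)
bigram = ngram_recall(candidate, reference, n=2)
```

With this kind of measure, a snippet that loses its last word keeps most of its score, which matches the expectation that almost identical snippets should receive almost the same precision and recall.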