On the Evaluation of Snippet Selection for Information Retrieval A. Overwijk, D. Nguyen, C. Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong
Contents • Properties of a good evaluation method • Evaluation method of WebCLEF • Approach • Results • Analysis • Conclusion
Good evaluation method • Reflects the quality of the system • Reusability: new systems can be evaluated without a new round of assessments
Evaluation method of WebCLEF • Recall • The summed character length of all spans in the system’s response that are linked to nuggets (i.e. aspects the user wants to include in his or her article), divided by the total summed span length in the responses for that topic across all submitted runs. • Precision • The number of characters in the system’s response that belong to at least one span linked to a nugget, divided by the total character length of that response.
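Expressed as a rough Perl sketch, the two measures look like this. The span lists and topic-wide totals below are assumptions about the assessment data, not the official WebCLEF format:

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# @$own_spans      : [start, end) character offsets in this system's
#                    response that assessors linked to a nugget
# $all_runs_total  : total linked-span length over all submitted runs
# $response_len    : character length of this system's response
sub recall {
    my ($own_spans, $all_runs_total) = @_;
    my $linked = sum0 map { $_->[1] - $_->[0] } @$own_spans;
    return $all_runs_total ? $linked / $all_runs_total : 0;
}

sub precision {
    my ($own_spans, $response_len) = @_;
    # Count each character at most once, even if several spans overlap.
    my %covered;
    for my $span (@$own_spans) {
        $covered{$_} = 1 for $span->[0] .. $span->[1] - 1;
    }
    return $response_len ? keys(%covered) / $response_len : 0;
}

# Hypothetical example: two linked spans in a 250-character response,
# with 350 characters of linked spans across all runs for this topic.
my @spans = ([0, 40], [100, 130]);
printf "recall=%.2f precision=%.2f\n",
    recall(\@spans, 350), precision(\@spans, 250);   # recall=0.20 precision=0.28
```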
Approach • Better system, better performance scores? • Similar system, same performance scores? • Worse system, lower performance scores?
Better system • Last year’s best performing system contains a bug:

```perl
# Assigning the qw() word list directly to a hash pairs the words up as
# key => value, so every second stopword becomes a value rather than a
# key and is never matched by the exists() test below.
our %stopwords = qw( ’s a … zwischen );

for my $w (…) {
    next if exists $stopwords{$w};
    …
}
```
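Given that idiom, a minimal sketch of a repair (not necessarily the authors’ actual fix) is to make every stopword an explicit hash key. If the evaluation method is sound, this strictly better system should receive higher scores:

```perl
# Sketch of a fix: every stopword becomes a key with a true value,
# so exists() now matches all of them ('…' elides the full list).
our %stopwords = map { $_ => 1 } qw( ’s a … zwischen );
```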
Similar system • General idea • Almost identical snippets should receive almost the same precision and recall • Experiment • Remove the last word from every snippet in the output of last year’s best performing system, as sketched below
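A minimal sketch of this transformation, assuming one snippet per line on standard input (the I/O format is an assumption):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Drop the final word of each snippet; everything else stays identical.
while (my $snippet = <STDIN>) {
    chomp $snippet;
    $snippet =~ s/\s+\S+\s*$//;   # strip the last word
    print "$snippet\n";
}
```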
Worse system • Delivering snippets based on occurrence • 1st snippet = 1st paragraph of 1st document • 2nd snippet = 2nd paragraph of 2nd document • ... • No different from a standard search engine, except that documents are split up into snippets
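A sketch of such a baseline, following the scheme on this slide (snippet i = paragraph i of the i-th retrieved document; the plain-text input handling is an assumption):

```perl
use strict;
use warnings;

local $/ = '';                  # paragraph mode: read blank-line-separated blocks
my $i = 0;
for my $file (@ARGV) {          # @ARGV: documents in retrieval order
    open my $fh, '<', $file or die "cannot open $file: $!";
    my @paras = <$fh>;
    close $fh;
    chomp @paras;
    print "$paras[$i]\n" if defined $paras[$i];   # i-th paragraph becomes snippet i
    $i++;
}
```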
Analysis • Pool of snippets (only spans occurring in submitted runs are assessed, biasing the pool towards participants) • Implementation • Assessments (a single assessor judges each topic)
Conclusion • The evaluation method is not sufficient: • It is biased towards participating systems • The correctness criterion for a snippet is too strict • Recommendations: • Use n-gram overlap measures (e.g. ROUGE) • Use multiple assessors per topic
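To make the n-gram recommendation concrete, here is a simplified, set-based n-gram recall in the spirit of ROUGE-N (the real ROUGE implementation uses clipped multiset counts; all names here are illustrative). Under such a measure the “similar system” above scores almost the same as the original instead of being penalized for dropping one word:

```perl
use strict;
use warnings;

# All word n-grams of a text, lowercased, joined with single spaces.
sub ngrams {
    my ($text, $n) = @_;
    my @w = split ' ', lc $text;
    return map { join ' ', @w[$_ .. $_ + $n - 1] } 0 .. $#w - $n + 1;
}

# Fraction of the reference's n-grams also present in the candidate.
sub ngram_recall {
    my ($candidate, $reference, $n) = @_;
    my %cand = map { $_ => 1 } ngrams($candidate, $n);
    my @ref  = ngrams($reference, $n);
    return 0 unless @ref;
    my $hits = grep { $cand{$_} } @ref;
    return $hits / @ref;
}

# A snippet clipped by one word still keeps 3 of 4 reference bigrams:
printf "%.2f\n",
    ngram_recall('the quick brown fox', 'the quick brown fox jumps', 2);  # 0.75
```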