Automatic Summary Evaluation Ross Greenwood
Recap • Automatically evaluate summaries of text documents • Evaluate content coverage • Compare against one or more ideal summaries
Pyramid Evaluation • Manually annotate texts for phrases expressing similar ideas (summary content units) • Judge content coverage by number of overlapping summary content units
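The SCU-overlap idea above can be sketched in a few lines. This is a minimal illustration, not the full pyramid method: the SCU ids and weights are hypothetical, and it assumes the common variant in which each SCU is weighted by how many model summaries express it, with the peer's score normalized by the best achievable score for the same number of SCUs.

```python
def pyramid_score(peer_scus, scu_weights):
    """Score a peer summary by the SCU weight it recovers.

    peer_scus:   set of SCU ids found in the peer summary
    scu_weights: dict mapping every annotated SCU id to its weight
                 (number of model summaries expressing that SCU)
    """
    observed = sum(scu_weights[s] for s in peer_scus)
    # Ideal: the |peer_scus| highest-weight SCUs in the pyramid.
    top = sorted(scu_weights.values(), reverse=True)[:len(peer_scus)]
    return observed / sum(top) if top else 0.0

# Hypothetical pyramid over four SCUs; peer expresses A and C.
weights = {"A": 3, "B": 2, "C": 1, "D": 1}
print(pyramid_score({"A", "C"}, weights))  # 4/5 = 0.8
```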
ROUGE: Four Summary Evaluation Measures • ROUGE-N: N-gram Co-Occurrence • Number of matching N-word substrings • ROUGE-L: Longest Common Subsequence • Allows for skipping words • Ex. “a b d f” is a subsequence of “a b c d e f” • ROUGE-W: Weighted LCS • Weight consecutive matches higher • ROUGE-S: Skip-bigram • Number of matching 2-word substrings with arbitrary gaps
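The matching counts behind these measures can be sketched directly; the following is a simplified illustration (clipped n-gram counts shown, but no length normalization or ROUGE-W weighting), using the subsequence example from the slide.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous n-word substrings of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n_matches(model, peer, n=2):
    """Clipped count of peer n-grams that also occur in the model."""
    m, p = Counter(ngrams(model, n)), Counter(ngrams(peer, n))
    return sum(min(count, m[g]) for g, count in p.items())

def lcs_len(a, b):
    """Longest common subsequence length (classic dynamic program)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def skip_bigrams(words):
    """All ordered word pairs with arbitrary gaps (ROUGE-S units)."""
    return {(words[i], words[j])
            for i in range(len(words)) for j in range(i + 1, len(words))}

model = "a b c d e f".split()
peer = "a b d f".split()
print(lcs_len(model, peer))                          # 4: "a b d f"
print(rouge_n_matches(model, peer, n=2))             # 1: only ("a", "b")
print(len(skip_bigrams(peer) & skip_bigrams(model))) # 6: all peer pairs
```

Because the peer is a subsequence of the model, every one of its skip-bigrams matches, even though only one contiguous bigram does; this is exactly the gap-tolerance that distinguishes ROUGE-S from ROUGE-2.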
Precision, Recall, and F-Measure • Precision = matches/num_words_peer • Recall = matches/num_words_models • F = 2/(1/P + 1/R)
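These three formulas translate directly into code; the example counts below are made up for illustration.

```python
def prf(matches, num_words_peer, num_words_models):
    """Precision, recall, and their harmonic mean (F-measure)."""
    precision = matches / num_words_peer
    recall = matches / num_words_models
    f = 2 / (1 / precision + 1 / recall) if matches else 0.0
    return precision, recall, f

# Hypothetical counts: 6 matching words, 10-word peer, 12 model words.
p, r, f = prf(6, 10, 12)
print(p, r, round(f, 3))  # 0.6 0.5 0.545
```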
Problems with ROUGE-N: False Positives • Homographs, ex: Model: … robbed the bank … Peer: … sat on the river bank …
Problems with ROUGE-N: False Negatives • Synonyms, ex: Model: … held up the financial institution … Peer: … robbed the bank …
Solution: WordNet • Lexical database • Synsets: organize words by concept • Method: • Tag words with POS • Tag words with word sense (SenseLearner) • Look up synset in WordNet
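The key idea is that two sense-tagged words should match when they belong to the same synset. The sketch below uses a tiny hand-built table in place of a real WordNet lookup (e.g. via NLTK's WordNet interface); the sense tags and synset ids are hypothetical stand-ins for SenseLearner output.

```python
# Hypothetical sense-key -> synset-id table standing in for WordNet.
SYNSET_OF = {
    "rob#v#1": "rob.v.01",
    "hold_up#v#1": "rob.v.01",   # synonym: maps to the same synset
    "bank#n#1": "bank.n.01",     # financial institution
    "bank#n#2": "bank.n.02",     # sloping land beside a river
}

def words_match(w1, w2):
    """Two sense-tagged words match iff they map to the same synset.

    Unknown words fall back to exact-string comparison.
    """
    return SYNSET_OF.get(w1, w1) == SYNSET_OF.get(w2, w2)

print(words_match("rob#v#1", "hold_up#v#1"))  # True: fixes the false negative
print(words_match("bank#n#1", "bank#n#2"))    # False: fixes the false positive
```

This resolves both failure modes from the previous slides: synonyms in different surface forms now match, while homographs in different senses no longer do.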
Architecture of Solution • Pipeline: Data → POS tagger → SenseLearner → WordNet → ROUGE → Results • Example WordNet query: querySense("run#v#3", "syns") → {go#v#7, pass#v#6, lead#v#6, extend#v#2}
Evaluating the Evaluator • Correlation with human evaluation scores (ROUGE, Basic Elements) • Success at reducing errors (i.e. number of false negatives/positives avoided vs. original ROUGE)
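Correlation with human judgments is typically measured with a coefficient such as Pearson's r; a minimal self-contained sketch (the score lists are invented for illustration, and in practice one would use a library routine such as scipy.stats.pearsonr):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return cov / var

# Hypothetical per-summary scores: human judgments vs. metric output.
human = [0.2, 0.5, 0.9]
metric = [0.3, 0.4, 0.8]
print(round(pearson(human, metric), 2))  # 0.97
```

A high correlation says the metric ranks summaries the way humans do; the false-positive/false-negative counts measure the synset extension's contribution more directly.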
References • Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out. • Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.