Colouring Summaries BLEU
Katerina Pastra and Horacio Saggion
Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.
Pastra and Saggion, EACL 2003
Machine Translation vs. Summarization
MT: accurate and fluent translation of the source document
Automatic Summarization: informative, reduced version of the source
We will focus on:
• Automatically generated extracts
• Single-document summarization (sentence-level compression)
• Automatic content-based evaluation
• Reuse of evaluation metrics across NLP areas
The challenge
MT: demanding content evaluation
Extracts: is their evaluation trivial by definition?
Idiosyncrasies of the extract evaluation task:
• Compression level and rate
• High human disagreement on extract adequacy
Could an MT evaluation metric be ported to automatic summarization (extract) evaluation? If so, which testing parameters should be considered?
BLEU
• Developed for MT evaluation (Papineni et al., 2001)
• Achieves high correlation with human judgement
• Is reliable even when run
  - on different documents
  - against a different number of model references
i.e. reliability is not affected by the use of either multiple references or just a single one
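To make the metric concrete, here is a minimal sketch of the BLEU computation described by Papineni et al.: clipped n-gram precisions combined geometrically with a brevity penalty. This is our own Python illustration, not the official BLEU implementation, and it omits smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU for one candidate token list against a list of reference token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Avoid log(0); a real implementation would smooth or return 0 here.
        p_n = clipped / total if clipped > 0 else 1e-9
        log_precisions.append(math.log(p_n))
    # Brevity penalty against the reference whose length is closest to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```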
Using BLEU in NLP
• NLG (Zajic and Dorr, 2002)
• Summarization (Lin and Hovy, 2002)
  - 0.66 correlation for single-document summaries at 100-word compression, against a single reference summary
  - 0.82 correlation when multiple-judged document units (a sort of multiple references) are used
Lin and Hovy conclude: the use of a single reference affects reliability
Evaluation experiment set-up
Variables: compression rate, text cluster, gold standard
HKNews corpus (English-Chinese):
• 18K documents in English
• 40 thematic clusters = 400 documents
• each sentence in a cluster assessed by 3 judges with utility values (0-10)
• encoded in XML
Evaluation software
• Semantic tagging and statistical analysis software
• Features: position, similarity with document, similarity with query, term distribution, NE scores, etc. (all normalised)
• Features are linearly combined to obtain sentence scores and sentence extracts (sketched below)
• GATE & Summarization classes
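The following is a hypothetical sketch of how normalised sentence features might be combined linearly into sentence scores and an extract; the feature names and weights are illustrative only, not the authors' actual GATE/Summarization implementation.

```python
def score_sentences(sentences, weights):
    """Each sentence is a dict of normalised feature values in [0, 1]."""
    scored = []
    for idx, features in enumerate(sentences):
        score = sum(weights[name] * features.get(name, 0.0) for name in weights)
        scored.append((score, idx))
    return scored

def extract(sentences, weights, n_sentences):
    """Indices of the top-scoring sentences, returned in document order."""
    top = sorted(score_sentences(sentences, weights), reverse=True)[:n_sentences]
    return sorted(idx for _, idx in top)

# Example usage with made-up feature values and weights.
weights = {"position": 0.3, "doc_similarity": 0.4, "query_similarity": 0.2, "ne_score": 0.1}
sentences = [
    {"position": 1.0, "doc_similarity": 0.6, "query_similarity": 0.2, "ne_score": 0.5},
    {"position": 0.5, "doc_similarity": 0.9, "query_similarity": 0.8, "ne_score": 0.1},
    {"position": 0.1, "doc_similarity": 0.3, "query_similarity": 0.1, "ne_score": 0.0},
]
print(extract(sentences, weights, n_sentences=2))   # [0, 1]
```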
Gold standards and summarisers
• QB = query-sentence similarity summary
• Simple 1 = document-sentence similarity summary
• Simple 2 = lead-based summary
• Simple 3 = end-of-document summary
• Reference n = utility-based extract built from the utilities given by judge n (n = 1, 2, 3)
• Reference all = utility-based extract built from the sum of the utilities given by the three judges (sketched below)
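A sketch of our reading of the utility-based reference extracts: each sentence carries a 0-10 utility per judge; "Reference n" ranks sentences by judge n's utilities, "Reference all" by the summed utilities, and the extract keeps the top-ranked sentences up to the desired compression rate (here taken as the fraction of sentences kept). The data layout is invented for illustration.

```python
def reference_extract(utilities, compression_rate, judge=None):
    """utilities: per-sentence lists of judge scores, e.g. [u_judge1, u_judge2, u_judge3]."""
    if judge is None:                        # "Reference all": sum over judges
        scores = [sum(u) for u in utilities]
    else:                                    # "Reference n": a single judge (0-indexed)
        scores = [u[judge] for u in utilities]
    n_keep = max(1, round(len(utilities) * compression_rate))
    ranked = sorted(range(len(utilities)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])           # sentence indices in document order

# A 3-sentence document, 3 judges, keeping roughly two thirds of the sentences.
utils = [[9, 8, 2], [2, 1, 7], [5, 6, 4]]
print(reference_extract(utils, 0.66))            # [0, 2]  (Reference all)
print(reference_extract(utils, 0.66, judge=2))   # [1, 2]  (Reference 3 disagrees)
```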
Experiment 1
Two references compared against the third, at 5 different compression rates, in two text clusters (all available combinations).
Are the results BLEU gives on inter-annotator agreement consistent?
=> Inconsistency both across text clusters and within clusters at different compression rates (the latter more consistent than the former)
=> The reliability of BLEU in summarization seems to depend on the values of the variables used. If so, how could one identify the appropriate values?
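A hypothetical sketch of the Experiment 1 protocol as we read it: each judge's reference extract is scored in turn with BLEU against the other two extracts used as models, giving a BLEU-based picture of inter-annotator agreement; the bleu() helper from the earlier sketch is assumed and the data layout is invented.

```python
def inter_annotator_bleu(extracts, bleu):
    """extracts: {judge_id: token list} for one cluster at one compression rate."""
    scores = {}
    for held_out, candidate in extracts.items():
        # The held-out judge's extract is the "candidate"; the other two are models.
        models = [tokens for judge, tokens in extracts.items() if judge != held_out]
        scores[held_out] = bleu(candidate, models)
    return scores
```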
Experiment 2
For reference X, within cluster Y, across compression rates, the ranking of the systems is not consistent.
Experiment 3
For reference X, at compression rate Y, across clusters, the ranking of the systems is not consistent.
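The consistency question in Experiments 2 and 3 (and in Experiment 4 below) can be read as follows: rank the summarisers (QB, Simple 1-3) by BLEU score under each experimental condition and check whether the orderings agree. The sketch below is an illustration with invented scores, not the reported results.

```python
def ranking(scores):
    """System names ordered best-first by score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

def consistent(conditions):
    """True if every experimental condition induces the same system ordering."""
    orderings = [ranking(s) for s in conditions.values()]
    return all(o == orderings[0] for o in orderings)

# Invented scores for two conditions (same cluster, two compression rates).
conditions = {
    "cluster1@10%": {"QB": 0.42, "Simple 1": 0.38, "Simple 2": 0.30, "Simple 3": 0.12},
    "cluster1@30%": {"QB": 0.40, "Simple 1": 0.41, "Simple 2": 0.33, "Simple 3": 0.15},
}
print(consistent(conditions))   # False: the ranking flips between conditions
```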
Experiment 4
For reference ALL, across clusters, at multiple compression rates, the ranking of the systems is (more) consistent.
Experiment 4 (cont.)
Is there a way to use BLEU with a single reference summary and still get reliable results back?
Notes on BLEU
• BLEU fails to capture semantic equivalences between n-grams across their various lexical and syntactic manifestations.
Examples:
"Of the 9,928 drug abusers reported in the first half of the year, 1,445 or 14.6% were aged under 21." vs. "...number of reported abusers"
"This represents a decrease of 17% over the 1,740 young drug abusers in the first half of 1998."
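A toy illustration of the point above: the two phrasings share content but few surface n-grams, so clipped n-gram precision (the quantity BLEU aggregates) barely rewards the paraphrase. Tokenisation is simplified and the snippets are shortened from the example sentences.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "the number of reported abusers decreased".split()
ref = "the 9,928 drug abusers reported in the first half of the year".split()
print(ngram_precision(cand, ref, 1))   # ~0.67: unigrams like "the", "reported", "abusers" match
print(ngram_precision(cand, ref, 2))   # 0.0: no bigram survives the re-ordering
```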
Conclusions
• Use of multiple reference summaries is needed when using BLEU in summarization
• Lack of such resources could probably be overcome using the average rank aggregation technique (sketched below)
• Future work:
  - scaling up the experiments
  - correlation of BLEU with other content-based metrics used in summarization
Pastra and Saggion, EACL 2003
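A sketch of the average-rank-aggregation idea as we read it: score every system with BLEU against each single reference separately, convert each run into a ranking, and order the systems by their mean rank across runs. The scores below are invented for illustration.

```python
def average_rank(scores_per_reference):
    """scores_per_reference: list of {system: BLEU score} dicts, one per single reference."""
    ranks = {}
    for scores in scores_per_reference:
        ordered = sorted(scores, key=lambda s: -scores[s])
        for position, system in enumerate(ordered, start=1):
            ranks.setdefault(system, []).append(position)
    # Order systems by mean rank (lower is better).
    return sorted(ranks, key=lambda s: sum(ranks[s]) / len(ranks[s]))

runs = [
    {"QB": 0.41, "Simple 1": 0.39, "Simple 2": 0.28},   # vs Reference 1
    {"QB": 0.37, "Simple 1": 0.40, "Simple 2": 0.30},   # vs Reference 2
    {"QB": 0.43, "Simple 1": 0.36, "Simple 2": 0.27},   # vs Reference 3
]
print(average_rank(runs))   # ['QB', 'Simple 1', 'Simple 2']
```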