The Pyramid Method at DUC05

The Pyramid Method at DUC05. Ani Nenkova Becky Passonneau Kathleen McKeown Other team members: David Elson, Advaith Siddharthan, Sergey Siegelman. Overview. Review of Pyramids (Kathy) Characteristics of the responses Analyses (Ani) Scores and Significant Differences

Presentation Transcript


  1. The Pyramid Method at DUC05 Ani Nenkova Becky Passonneau Kathleen McKeown Other team members: David Elson, Advaith Siddharthan, Sergey Siegelman

  2. Overview • Review of Pyramids (Kathy) • Characteristics of the responses • Analyses (Ani) • Scores and Significant Differences • Reliability of Pyramid scoring • Comparisons between annotators • Impact of editing on scores • Impact of Weight 1 SCUs • Correlation with responsiveness and Rouge • Lessons learned

  3. Pyramids • Uses multiple human summaries • Previous data indicated 5 needed for score stability • Information is ranked by its importance • Allows for multiple good summaries • A pyramid is created from the human summaries • Elements of the pyramid are content units • System summaries are scored by comparison with the pyramid

  4. Summarization Content Units • Near-paraphrases from different human summaries • Clause or less • Avoids explicit semantic representation • Emerges from analysis of human summaries

  5. SCU: A cable car caught fire (Weight = 4) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

  6. SCU: The cause of the fire is unknown (Weight = 1) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

  7. SCU: The accident happened in the Austrian Alps (Weight = 3) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

  8. Tiers of differentially weighted SCUs • Top: few SCUs, high weight • Bottom: many SCUs, low weight • Idealized representation (figure with tiers W=3, W=2, W=1)
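
As a rough illustration of how the tiers arise, the sketch below groups SCUs by weight, where an SCU's weight is the number of human summaries that express it. The three SCU labels and weights are taken from slides 5 to 7; the function name `build_pyramid` and the dictionary layout are assumptions made for this example and are not part of any DUC tooling.

```python
from collections import defaultdict

def build_pyramid(scu_weights):
    """Group SCU labels into tiers keyed by weight, highest tier first."""
    tiers = defaultdict(list)
    for label, weight in scu_weights.items():
        tiers[weight].append(label)
    # Sort tiers so the top of the pyramid (highest weight) comes first.
    return dict(sorted(tiers.items(), reverse=True))

# SCU labels and weights from slides 5-7.
scus = {
    "A cable car caught fire": 4,
    "The accident happened in the Austrian Alps": 3,
    "The cause of the fire is unknown": 1,
}

pyramid = build_pyramid(scus)
for weight, labels in pyramid.items():
    print(f"W={weight}: {labels}")
```

In a real pyramid the lower tiers hold many more SCUs than the top tier, which is what gives the structure its shape.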

  9. Creation of pyramids • Done for each of 20 out of 50 sets • Primary annotator, secondary checker • Held round-table discussions of problematic constructions that occurred in this data set • Comma-separated lists, e.g. "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation." • General vs. specific, e.g. "Eastern Europe" vs. "Hungary, Poland, Lithuania, and Turkey"

  10. Characteristics of the Responses • Proportion of SCUs of Weight 1 is large • 44% (D324) to 81% (D695) • Mean SCU weight: 1.9 • Agreement among human responders is quite low
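
The two statistics on this slide can be computed directly from the list of SCU weights in a pyramid. The following is a minimal sketch; the helper name `weight_profile` and the ten example weights are invented for illustration (chosen to land near the reported averages), not taken from any DUC05 pyramid.

```python
def weight_profile(weights):
    """Return (share of weight-1 SCUs, mean SCU weight) for one pyramid."""
    if not weights:
        raise ValueError("pyramid has no SCUs")
    share_w1 = sum(1 for w in weights if w == 1) / len(weights)
    mean_weight = sum(weights) / len(weights)
    return share_w1, mean_weight

# Hypothetical pyramid: 10 SCUs, six of them expressed by only one summarizer.
example_weights = [1, 1, 1, 1, 1, 1, 2, 3, 4, 4]
share_w1, mean_weight = weight_profile(example_weights)
print(f"weight-1 share: {share_w1:.0%}, mean weight: {mean_weight:.1f}")
```

A large weight-1 share together with a low mean weight is the numerical signature of low agreement among the human summarizers.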

  11. Number of SCUs at each weight (figure; x-axis: SCU weights)

  12. Pyramids: DUC 2003 • 100-word summaries (vs. 250-word) • 10 500-word articles per cluster (vs. 30 720-word articles) • 3 clusters (vs. 20 clusters) • Mean SCU weight (7 models): 2005 avg 1.9; 2003 avg 2.4 • Proportion of SCUs of W=1: 2005 avg 60%, range 44% to 81%; 2003 avg 40%, range 37% to 47%

  13. DUC03 / DUC05 (figure)

  14. Computing pyramid scores: Ideally informative summary • Does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well

  20. Original Pyramid Score • SCORE = D/MAX • D: sum of the weights of the SCUs in a summary • MAX: sum of the weights of the SCUs in an ideally informative summary • Measures the proportion of good information in the summary: precision
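
A minimal sketch of the D/MAX computation follows. The slide does not spell out how large the ideally informative summary is; this example assumes it contains as many SCUs as the summary being scored and is filled from the top tier down, as the preceding slide describes. The function names and the example weights are hypothetical.

```python
def max_weight(pyramid_weights, n_scus):
    """Weight of an ideally informative summary with n_scus SCUs.

    SCUs are taken from the highest tier downward, so no lower-tier SCU
    is used before every higher-tier SCU has been included.
    """
    top_down = sorted(pyramid_weights, reverse=True)
    return sum(top_down[:n_scus])

def original_pyramid_score(matched_weights, pyramid_weights):
    """SCORE = D / MAX, with MAX sized to the number of SCUs in the summary (assumption)."""
    d = sum(matched_weights)
    mx = max_weight(pyramid_weights, len(matched_weights))
    return d / mx if mx else 0.0

# Hypothetical pyramid: one weight per SCU.
pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]
# Weights of the SCUs found in a system summary (use 0 for content not in the pyramid).
matched = [4, 2, 1]

score = original_pyramid_score(matched, pyramid_weights)
print(f"D = {sum(matched)}, MAX = {max_weight(pyramid_weights, len(matched))}, score = {score:.2f}")
```

Because MAX shrinks with the number of SCUs the summary expresses, a short summary made only of top-tier content can still reach a score of 1, which is why the slide describes this variant as precision-like.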

  21. Modified pyramid score (recall) • EN = average number of SCUs in the human model summaries • This is the number of content units humans chose to convey about the story • W = weight of a maximally informative summary of size EN • D/W is the modified pyramid score • Shows the proportion of expected good information
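
The modified score can be sketched the same way, with the denominator fixed by the human summaries instead of by the peer. Rounding EN to a whole number of SCUs is an assumption of this example, as are the function name and all of the numbers.

```python
def modified_pyramid_score(matched_weights, pyramid_weights, model_scu_counts):
    """Modified score D / W, where W is the weight of an ideal summary of size EN."""
    # EN: average number of SCUs per human model summary (rounded; an assumption).
    en = round(sum(model_scu_counts) / len(model_scu_counts))
    # W: take the EN highest-weighted SCUs from the pyramid.
    w = sum(sorted(pyramid_weights, reverse=True)[:en])
    d = sum(matched_weights)
    return d / w if w else 0.0

pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]   # hypothetical pyramid
model_scu_counts = [5, 4, 6, 5]              # SCUs per human summary (hypothetical)
matched = [4, 2, 1]                          # SCU weights found in a system summary

modified = modified_pyramid_score(matched, pyramid_weights, model_scu_counts)
print(f"modified score = {modified:.2f}")
```

Since W no longer depends on how many SCUs the peer expresses, a summary that conveys little content is penalized even if what it does convey is highly weighted, which is the recall-like behavior the slide refers to.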

  22. Scoring Methods • Present scores for the 20 pyramid sets • Recompute Rouge for comparison • We compute Rouge using only 7 models • 8 and 9 reserved for computing human performance • Best because of significant topic effect • Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4 • Pyramid score computed from multiple humans • Responsiveness is just one human’s judgment • Rouge-SU4 equivalent to Rouge-2

  23. Preview of Results • Manual metrics • Large differences between humans and machines • No single system the clear winner • But a top group identified by all metrics • Significant differences • Different predictions from manual and automatic metrics • Correlations between metrics • Some correlation but one cannot be substituted for another • This is good

  24. Human performance / best system (peer ID: score)

      Pyramid      Modified     Resp       ROUGE-SU4
      B: 0.5472    B: 0.4814    A: 4.895   A: 0.1722
      A: 0.4969    A: 0.4617    B: 4.526   B: 0.1552
      ~~~~~~~~~~~~~~~~~
      14: 0.2587   10: 0.2052   4: 2.85    15: 0.139

      • Best system ~50% of human performance on manual metrics • Best system ~80% of human performance on ROUGE
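
The percentages quoted on this slide can be checked from its own numbers by dividing each best system score by the best human score on the same metric. The dictionary layout below is just a convenience for this sketch; the values are the ones shown on the slide.

```python
# Best human and best system score per metric, as listed on slide 24.
best_human = {"Pyramid": 0.5472, "Modified": 0.4814, "Resp": 4.895, "ROUGE-SU4": 0.1722}
best_system = {"Pyramid": 0.2587, "Modified": 0.2052, "Resp": 2.85, "ROUGE-SU4": 0.139}

# Ratio of best system to best human, per metric.
ratios = {metric: best_system[metric] / best_human[metric] for metric in best_human}

for metric, ratio in ratios.items():
    print(f"{metric}: best system reaches {ratio:.0%} of best human")
```

The manual metrics come out at roughly 43% to 58%, and ROUGE-SU4 at roughly 81%, in line with the rounded figures on the slide.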

  25. Scores by metric (system ID: score)

      Pyramid original   Modified     Resp       Rouge-SU4
      14: 0.2587         10: 0.2052   4: 2.85    15: 0.139
      17: 0.2492         17: 0.1972   14: 2.8    4: 0.134
      15: 0.2423         14: 0.1908   10: 2.65   17: 0.1346
      10: 0.2379         7: 0.1852    15: 2.6    19: 0.1275
      4: 0.2321          15: 0.1808   17: 2.55   11: 0.1259
      7: 0.2297          4: 0.177     11: 2.5    10: 0.1278
      16: 0.2265         16: 0.1722   28: 2.45   6: 0.1239
      6: 0.2197          11: 0.1703   21: 2.45   7: 0.1213
      32: 0.2145         6: 0.1671    6: 2.4     14: 0.1264
      21: 0.2127         12: 0.1664   24: 2.4    25: 0.1188
      12: 0.2126         19: 0.1636   19: 2.4    21: 0.1183
      11: 0.2116         21: 0.1613   6: 2.4     16: 0.1218
      26: 0.2106         32: 0.1601   27: 2.35   24: 0.118
      19: 0.2072         26: 0.1464   12: 2.35   12: 0.116
      28: 0.2048         3: 0.145     7: 2.3     3: 0.1198
      13: 0.1983         28: 0.1427   25: 2.2    28: 0.1203
      3: 0.1949          13: 0.1424   32: 2.15   27: 0.110
      1: 0.1747          25: 0.1406   3: 2.1     13: 0.1097

  29. Significant Differences • Manual metrics • Few differences between systems • Pyramid: 23 is worse • Responsive: 23 and 31 are worse • Both humans better than all systems • Automatic (Rouge-SU4) • Many differences between systems • One human indistinguishable from 5 systems

  30. Multiple and pairwise comparisons • Multiple comparisons • Tukey’s method • Controls for the experiment-wise type I error • Shows fewer significant differences • Pairwise comparisons • Wilcoxon paired test • Controls the error for individual comparisons • Appropriate for checking how your system did during development
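
To see why controlling the experiment-wise error leads to fewer reported differences, it helps to count how many pairwise tests are involved. The sketch below is a back-of-the-envelope calculation, not Tukey's procedure itself: it assumes the tests are independent (which they are not in practice) and uses a per-test level of 0.05 and 25 peers purely as example values.

```python
def experimentwise_error(n_peers, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes every pair of peers is tested at level alpha and that the tests
    are independent, which is a simplification.
    """
    n_pairs = n_peers * (n_peers - 1) // 2
    fwer = 1 - (1 - alpha) ** n_pairs
    return n_pairs, fwer

n_pairs, fwer = experimentwise_error(25)
print(f"{n_pairs} pairwise tests, experiment-wise error close to {fwer:.3f}")
```

With hundreds of uncorrected tests, at least one spurious difference is almost guaranteed, so a method that bounds the experiment-wise error has to be more conservative on each individual comparison.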

  31. Modified pyramid: significant differences (table columns: Peer, Better than) • One system accounts for most of the differences • Humans significantly better than all systems

  32. Responsiveness 1: Significant differences • Differences primarily between 2 systems • Differences between humans and each system

  33. Responsive-2 • Similar shape to original

  34. Skip-bigram: significant differences • Many more differences between systems than any manual metric • No difference between human and 5 systems

  35. Pairwise comparisons: Modified Pyramid

  36. Agreement between annotators

  37. Editing of participant annotations • To correct obvious errors • Ensures uniform checking • Predominantly involved correcting the splitting of non-matching SCUs • Average paired differences • Original: 0.0043 • Modified: 0.0005 • Average magnitude of the difference • Original: 0.0115 • Modified: 0.0032
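
The slide reports two different summaries of the same paired data: the average signed difference, where increases and decreases cancel out, and the average magnitude, where they do not. A small sketch of both, using made-up scores before and after editing (the function name and the numbers are hypothetical):

```python
def paired_difference_stats(before, after):
    """Return (mean signed difference, mean absolute difference) for paired scores."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = sum(diffs) / len(diffs)
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    return mean_diff, mean_abs_diff

# Hypothetical pyramid scores for four summaries, before and after editing.
before = [0.20, 0.25, 0.18, 0.22]
after = [0.21, 0.24, 0.18, 0.24]

mean_diff, mean_abs_diff = paired_difference_stats(before, after)
print(f"mean difference: {mean_diff:.4f}, mean magnitude: {mean_abs_diff:.4f}")
```

A signed mean near zero shows that editing did not push scores systematically in one direction, while the magnitude shows how much an individual score typically moved.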

  38. Excluding weight 1 SCUs • Removing weight 1 SCUs improves agreement • Kappa: 0.64 (was 0.57) • Annotating without weight 1 has negligible impact on scores • Set D324 done without weight 1 SCUs • Average magnitude of paired differences • On average 0.07 difference
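
The slides do not say which kappa variant was used or how annotation decisions were aligned. Purely as an illustration, the sketch below computes Cohen's kappa for two annotators who each assign one label (an SCU identifier or "none") to the same sequence of text spans; the labels are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical SCU assignments for six text spans.
annotator_a = ["s1", "s2", "none", "s1", "s3", "none"]
annotator_b = ["s1", "s2", "s2", "s1", "none", "none"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

Dropping weight-1 SCUs removes the labels that annotators are most likely to disagree on, which is one plausible reading of why the reported agreement rises.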

  39. Correlations: Pearson’s, 25 systems

  40. Correlations: Pearson’s, 25 systems • Questionable that responsiveness could be a gold standard
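
The correlation table itself did not survive in this transcript. For readers who want to see the computation, here is a plain Python sketch of Pearson's r applied to the six systems that lead the original pyramid ranking on slide 25, paired with their modified pyramid scores from the same slide. It is only a subset, so the result is not the 25-system figure the slide refers to.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Systems 14, 17, 15, 10, 4, 7 (scores from slide 25).
original = [0.2587, 0.2492, 0.2423, 0.2379, 0.2321, 0.2297]
modified = [0.1908, 0.1972, 0.1808, 0.2052, 0.1770, 0.1852]

r = pearson(original, modified)
print(f"Pearson r on this six-system subset: {r:.2f}")
```

Restricting the data to the top of the ranking narrows the score range, so a subset like this will usually show a weaker correlation than the full set of systems.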

  41. Pyramid and responsiveness • High correlation, but the metrics are not mutually substitutable

  42. Pyramid and Rouge • High correlation, but the metrics are not mutually substitutable

  43. Lessons Learned • Comparing content is hard • All kinds of judgment calls • We didn’t evaluate the NIST assessors in previous years • Paraphrases • VP vs. NP • Ministers have been exchanged • Reciprocal ministerial visits • Length and constituent type • Robotics assists doctors in the medical operating theater • Surgeons started using robotic assistants

  44. Modified scores better • Easier peer annotation • Can drop weight 1 SCUs • Better agreement • No emphasis on splitting non-matching SCUs

  45. Agreement between annotators • Participants can perform peer annotation reliably • Absolute difference between scores • Original: 0.0555 • Modified: 0.0617 • Empirical prediction of difference 0.06 (HLT 2004)

  46. Correlations • Original and modified can substitute for each other • High correlation between manual and automatic, but automatic not yet a substitute • Similar patterns between pyramid and responsiveness

  47. Current Directions • Automated identification of SCUs (Harnly et al. 05) • Applied to DUC05 pyramid data set • Correlation of .91 with modified pyramid scores

  48. Questions • What was the experience annotating pyramids? • Does it shed insight on the problem? • Are people willing to do it again? • Would you have been willing to go through training? • If you’ve done pyramid analysis, can you share your insights?
