Evaluating Summary Content Selection Pyramid Method: Work in Progress Rebecca Passonneau Ani Nenkova
OUTLINE • Motivation • Problems • DUC Evaluations • Pyramid Method: Current Status • Open Issues • Conclusions
EVALUATION GOALS • Define parameters of the problem • What is summarization? • Compare systems • Is the metric meaningful? • Track progress • When does output improve? • Cost Effectiveness • Can it be (partly) automated?
PICTURING CONTENT “OVERLAP” Philippine Airlines (PAL) experienced a crisis in 1998. Unable to make payments on a $2.1 billion debt, it was faced by a pilot's strike in June and the region's currency problems which reduced passenger numbers and inflated costs. On September 23 PAL shut down after the ground crew union turned down a settlement which it accepted two . . . Starting in May 1998, Philippine Airlines (PAL) laid off 5000 of its 13,000 workers. A 3-week pilots' strike in June and a currency crisis that reduced passenger numbers made payments on PAL's $2 billion debt impossible. President Estrada brokered an agreement to suspend collective bargaining for 10 years in exchange for 20% of PAL stock and union seats on its board. The large ground crew union initially voted no. After PAL shut down operations for 13 days starting Sept. 23rd, leaving much of the country without air service and foreign . . .
OBSTACLES • Humans select different content • Humans present same content differently • Lack clear standard of “good” summary [Contrasts with translation: L1(C)L2(C)] • Need objective method to get at subjective notion of what a summary IS
PREVIOUS WORK: Pessimism Human Judgments • Extraction • Low Agreement (Rath, 1961; Salton et al., 1997) • Inconsistent over time (Rath, 1961; Lin & Hovy, 2002) • Abstraction • Depends on individual’s orientation (Gerrig et al., 1991) Automated Evaluation • Extraction (Pastra & Saggion, 2003 EACL) • 3 humans; multiple “models”; inconclusive • Abstraction (Lin & Hovy, 2002 ACL) • Accepts inconsistent judgments as target • Difficult to extend
PREVIOUS WORK: Optimism Good design methodology leads to better understanding of areas of agreement • High compression rate leads to high agreement (Jing et al., 1998) • Content variation offset by logarithmic growth in pool of distinct content units (Halteren & Teufel, 2003) • Content can be reliably annotated (Beck et al., 1991)
HOW TO GET AT “CONTENT” FROM ITS “EXPRESSION” • ADAPT BLEU MT EVALUATION • Collect multiple “model” summaries • Quantify ngram overlap • IDENTIFY ABSTRACT CONTENT UNITS • DUC • Reading Comprehension • A THIRD WAY • Content unit “level” • Multiple expressions of same content unit
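The "quantify ngram overlap" step can be illustrated with a BLEU-style clipped n-gram precision over several model summaries. This is only a minimal sketch: the function names, the whitespace tokenizer, and the toy sentences are assumptions made for illustration, and it omits BLEU's brevity penalty and the combination of several n-gram orders.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(peer: str, models: List[str], n: int) -> float:
    """Share of peer n-grams found in at least one model summary.

    Each peer n-gram is credited at most as often as it occurs in the
    single model summary that contains it most often (clipping).
    """
    peer_counts = ngrams(peer.lower().split(), n)
    if not peer_counts:
        return 0.0
    max_model = Counter()
    for model in models:
        for gram, count in ngrams(model.lower().split(), n).items():
            max_model[gram] = max(max_model[gram], count)
    matched = sum(min(count, max_model[gram]) for gram, count in peer_counts.items())
    return matched / sum(peer_counts.values())


# Hypothetical toy data, loosely modelled on the PAL example.
models = ["PAL shut down", "PAL stopped all operations"]
peer = "PAL shut down operations"
unigram_precision = clipped_precision(peer, models, 1)
bigram_precision = clipped_precision(peer, models, 2)
```

Note that "stopped all operations" and "shut down operations" share almost no n-grams although they express the same content, which is the weakness that motivates identifying abstract content units instead.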
DUC: THE CURRENT APPROACH • Yearly evaluation of systems on new data sets • NIST evaluations performed by humans • Widely cited results • Does it work? • Compare current systems • Track individual system progress • Track community progress from year to year • Identify specific strengths/weaknesses • Can it eventually be automated?
DUC SCORING METHOD • Datasets: human/machine summaries • Designate “model” human summary • (Automatically) identify content units in “model” summary • Split “peer” summaries into sentences • Human judges evaluate “peer” against model
COMPUTE DUC SCORES • For each EDU: • Does peer sentence express any part? • How much? (0, 20, 40, 60, 80, 100%) • Average EDU percent overlap scores • Resulting score ranges from 0 to 1
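The averaging step above can be sketched as follows, assuming each unit of the model summary receives exactly one of the six coverage judgments. The function name and the sample judgments are hypothetical; this is not NIST's implementation.

```python
from typing import Sequence

# The six coverage levels a judge can assign to a model unit, in percent.
ALLOWED_JUDGMENTS = {0, 20, 40, 60, 80, 100}


def duc_coverage_score(judgments: Sequence[int]) -> float:
    """Average per-unit coverage judgments into a score between 0 and 1."""
    if not judgments:
        raise ValueError("at least one model unit is required")
    for judgment in judgments:
        if judgment not in ALLOWED_JUDGMENTS:
            raise ValueError(f"unexpected judgment: {judgment}")
    return sum(judgments) / (100 * len(judgments))


# Hypothetical judgments for a model summary with five units.
score = duc_coverage_score([100, 40, 0, 0, 20])  # 160 / 500
```

Because every unit counts equally and only one model summary is used, a peer that covers different but equally valid content scores low, which is the sensitivity the next slide criticizes.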
DRAWBACKS TO DUC SCORES • Very sensitive to choice of “model” • All “model” units created equal • Difficult to interpret scores • Human summary scores as low as 0.1 • Scores vary for same summarizer • Scores vary for same summary • Systems cannot be differentiated
FOUNDATION OF PYRAMID • A few CUs appear in many summaries • Humans can identify same/different CUs • Weight CUs differentially
MULTIPLE GOOD SUMMARIES This pyramid predicts 6 different good summaries consisting of 4 SCUs:
PAL PYRAMID TIER: W=3 (N=4) SCU1: PAL has $2.1 billion debt H2 [PAL’s $2 billion debt]1 I1 [and with a rising $2.1 billion debt,]1 J3 [PAL is buried under a $2.2 billion dollar debt]1 SCU2: PAL enforced a shutdown H5 [After PAL shut down operations]2 I1 [stopped all operations]2 J5 [by a]2 [shutdown]2 SCU3: PAL in crisis H1 [Philippine Airlines]3 I1 [Philippines Airlines (PAL),]3 [devastated]3 J1 [The fate]3 [is uncertain.]3
PAL PYRAMID TIER: W=2 (N=8) SCU5: PAL unable to repay debt H2 [made payments on]5 [impossible.]5 J3 [it cannot repay]5 SCU6: PAL experienced pilots' strike H2 [A]6 [pilots' strike]6 I1 [by pilot]6 [strikes]6 SCU7: this PAL crisis occurred in 1998 H1 [1998,]7 I1 [in 1998]7 . . .
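The two tiers shown above are consistent with weighting each SCU by the number of distinct human summaries (H, I, J) that contribute to it. A minimal sketch under that assumption follows; the pair list is read off the two PAL tier slides, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (summary id, SCU id) pairs taken from the W=3 and W=2 tiers shown above.
CONTRIBUTORS = [
    ("H", 1), ("I", 1), ("J", 1),
    ("H", 2), ("I", 2), ("J", 2),
    ("H", 3), ("I", 3), ("J", 3),
    ("H", 5), ("J", 5),
    ("H", 6), ("I", 6),
    ("H", 7), ("I", 7),
]


def scu_weights(pairs: Iterable[Tuple[str, int]]) -> Dict[int, int]:
    """Weight of an SCU = number of distinct summaries expressing it."""
    summaries_per_scu = defaultdict(set)
    for summary_id, scu_id in pairs:
        summaries_per_scu[scu_id].add(summary_id)
    return {scu: len(summaries) for scu, summaries in summaries_per_scu.items()}


def group_into_tiers(weights: Dict[int, int]) -> Dict[int, List[int]]:
    """Group SCU ids by weight; each weight is one tier of the pyramid."""
    tiers = defaultdict(list)
    for scu, weight in weights.items():
        tiers[weight].append(scu)
    return {weight: sorted(scus) for weight, scus in tiers.items()}


weights = scu_weights(CONTRIBUTORS)
tiers = group_into_tiers(weights)
```

A summary that contributes several text spans to one SCU is counted once, since the set keeps only distinct summary ids.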
ANNOTATION: KEEPING TRACK H1 [Starting in May]23 [1998,]7 [Philippine Airlines]3 [laid off 5000 of its 13,000 workers.]24 H2 [A]6 [3-week]25 [pilots' strike]6 [in June]11 [and a currency crisis]12 [that reduced passenger numbers]13 H3 [President Estrada brokered an agreement to suspend collective bargaining for 10 years]17 [in exchange for 20% of PAL stock and union seats on its board.]26 H4 [The large ground crew union initially voted no.]18 H5 [After PAL shut down operations]2 [for 13 days]4 [starting Sept. 23rd,]8 [leaving much of the country without air service]27 [and foreign carriers flying some domestic routes,]9 [61% voted yes.]19 . . .
RELIABILITY • Number of SCUs: 33 versus 37 (two annotators); 35 (consensus annotation) • Count of Pairwise Agreements (PAs) • SCU Label • SCU Members • Comparison of Annotations to Consensus • Recall/Precision not valid • 65/69 PAs • Most “disagreements” due to membership size • Only 2 “conflicts”
PYRAMID SCORE PART I • For N summaries, score each “peer” against a pyramid with N-1 tiers • “Peer” annotation • Gives SCU “size” • Yields a residue of SCUs not in pyramid • Compute D (Observed distribution) where D=sum of weights of SCUs • E.g., Summary A (D30042), size=20: D=(6x3) + (6x2) + (4x1) + (4x0) = 34
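The computation of D for Summary A can be reproduced in a few lines. The dictionary layout (weight mapped to the number of the peer's SCUs at that weight) and the function name are assumptions for illustration; SCUs in the residue, which are not in the pyramid, are given weight 0.

```python
from typing import Dict


def observed_score(scus_per_weight: Dict[int, int]) -> int:
    """D = sum of the weights of all SCUs found in the peer summary."""
    return sum(weight * count for weight, count in scus_per_weight.items())


# Summary A: 6 SCUs of weight 3, 6 of weight 2, 4 of weight 1,
# and 4 SCUs that do not appear in the pyramid (weight 0).
summary_a = {3: 6, 2: 6, 1: 4, 0: 4}

size_a = sum(summary_a.values())  # SCU "size" of the peer: 20
d_a = observed_score(summary_a)   # 18 + 12 + 4 + 0 = 34
```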
PYRAMID SCORE PART II • Compute Max = Ideal Sum of weights of SCUs, given the summary SCU size • Pyramid of H,I,J: • 9 SCUs in tier, w=3 • 10 SCUs in tier, w=2 • 12 SCUs in tier, w=1 • Size=20, Max=(9x3) + (10x2) + (1x1)=48 • P=D/Max • For Summary A: P=34/48=.71
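Max can be read as the score of an ideal summary of the same SCU size, built by drawing SCUs from the top tier downward until the size is used up. The sketch below follows that reading; the function name and data layout are hypothetical, and the value D=34 is taken from the Summary A example.

```python
from typing import Dict


def max_score(tier_sizes: Dict[int, int], size: int) -> int:
    """Best achievable sum of weights for a summary with `size` SCUs.

    `tier_sizes` maps a weight to the number of SCUs in that tier.
    SCUs are taken greedily from the heaviest tier first.
    """
    remaining = size
    total = 0
    for weight in sorted(tier_sizes, reverse=True):
        taken = min(tier_sizes[weight], remaining)
        total += weight * taken
        remaining -= taken
        if remaining == 0:
            break
    return total


# Pyramid built from human summaries H, I, J.
pyramid_hij = {3: 9, 2: 10, 1: 12}

max_a = max_score(pyramid_hij, size=20)  # 9*3 + 10*2 + 1*1 = 48
p_a = 34 / max_a                         # D / Max for Summary A
```

If a peer has more SCUs than the pyramid contains, the surplus adds nothing to Max in this sketch, mirroring the zero weight given to residue SCUs in D.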
MACHINE SUMMARY EXAMPLE African countries voted in June to ignore the U.N. flight ban which was imposed in 1992 to try and force Libya to hand over for trial two suspects wanted in the 1988 bombing of an American airliner over Lockerbie, Scotland. The reported jailing of the three officials comes as Gadhafi is under pressure to accept a plan to turn over for trial two other Libyans wanted for the 1988 bombing of Pan Am flight 103 over Lockerbie, Scotland, that led to 270 deaths. The visit was Farrakhan's…
MACHINE SUMMARIES System 6 PAL, Asia’s oldest airline, has been unable to make payments on dlrs 2.1 billion debt after being devasted by a pilot’s strike and by Asia’s currency crisis. PAL earlier accepted a preliminary investment offer from Cathay Pacific, Ailing Philippine Airlines and prospective investor Cathy Pacific Airways have clashed over . . .
MACHINE SUMMARIES System 16 President Joseph Estrada on Saturday urged militant unionists at Philippine Airlines to accept a vote by workers approving a 10-year no-strike deal to revive the debt-laden airline. President Joseph Estrada said Saturday the financially troubled airlines will resume its international flights on Sunday by flying him to Singapore . . .
MACHINE SUMMARIES System 17 Christmas is a sacred holiday in the Philippines, and nowhere is that more evident than at the headquarters of Philippine Airlines. But Ramos, who was intent on privatizing the economy, opened the industry to competition, licensing rivals like Air Philippines, Cebu Pacific, and Grand Air. PAL closed for nearly 2 weeks on Sep. 23 after . . .
OPEN ISSUES • Distribution of SCUs NOT an independent variable • Ordering • Knowledge • Informational Goal • Can Pyramid Scoring be Automated?
SCU INTERDEPENDENCIES • SCU4 presupposes SCU1: SCU1 (w=4): PAL has a debt > $2 billion SCU4 (w=3): PAL cannot make its debt payments • SCU7, SCU8 depend on SCU2: SCU2 (w=4): PAL shut down operations SCU7 (w=3): shutdown began on 9/23 SCU8 (w=3): shutdown lasted 2 weeks
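One way to make such dependencies visible is to flag SCUs that a summary includes without the SCU they presuppose. The slides leave open how scoring should treat these cases, so the sketch below only detects them; the dependency table encodes the three relations listed above, and the function name is hypothetical.

```python
from typing import Dict, List, Set

# Dependent SCU -> SCU it presupposes, as listed on the slide above.
PRESUPPOSES: Dict[int, int] = {4: 1, 7: 2, 8: 2}


def unsupported_scus(selected: Set[int]) -> List[int]:
    """Return selected SCUs whose presupposed SCU is not selected."""
    return sorted(
        scu
        for scu in selected
        if scu in PRESUPPOSES and PRESUPPOSES[scu] not in selected
    )


# A hypothetical summary that mentions the shutdown (2), its start date (7),
# its duration (8), and the missed payments (4) but never the debt itself (1).
flagged = unsupported_scus({2, 4, 7, 8})
```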
SCUs and DEPENDENCY/TAG GR A3 [On September 23]7 [PAL shut down]2 [after the ground crew union turned down a settlement]18 [which it accepted two weeks later.]19
SCU7
1 On IN 5 shut t0
2 September NNP 4 PAL t2
3 23 CD 4 PAL t2
“LARGE” CONSTITUENTS 1. PAL experienced a crisis in 1998. 2. Unable to make payments on a $2.1 billion debt, 3. it was faced by a pilot's strike in June 4. and the region's currency problems 5. which reduced passenger numbers and inflated costs. 6. On September 23 pal shut down 7. after the ground crew union turned down a settlement 8. which it accepted two weeks later. 9. PAL resumed domestic flights on October 7 10. and [resumed] international flights on October 26. 11. Resolution of the basic financial problems was elusive, however, 12. and as of December 18 pal was still $2.2 billion in debt 13. and [pal was] losing close to $1 million a day.
DOCSET TF*IDF TERMS: $2, airline, billion, day, debt, pal (6 of 13 LCs)
1 1. Philippine Airlines (pal) experienced a crisis in 1998. SCU3 w=3
3 2. Unable to make payments on a $2.1 billion debt, SCU1 w=4
1 6. On September 23 pal shut down SCU2 w=4 & SCU7 w=3
1 9. pal resumed domestic flights on October 7 SCU10 w=2
4 12. and as of December 18 pal was still $2.2 billion in debt NO SCU
1 13. and losing close to $1 million a day. SCU15 w=2
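The number in front of each large constituent (LC) above appears to be the count of distinct docset terms it contains. The sketch below reproduces those counts under that assumption. It takes the six top terms as given rather than computing TF*IDF, and the tokenizer (lowercase, with an optional leading "$" kept on a token) is a guess chosen so that "$2.1" matches "$2" while "Airlines" does not match "airline".

```python
import re
from typing import Dict

# Top docset terms as listed on the slide (not recomputed here).
DOCSET_TERMS = {"$2", "airline", "billion", "day", "debt", "pal"}

# The six large constituents shown on the slide, keyed by LC number.
LARGE_CONSTITUENTS: Dict[int, str] = {
    1: "Philippine Airlines (pal) experienced a crisis in 1998.",
    2: "Unable to make payments on a $2.1 billion debt,",
    6: "On September 23 pal shut down",
    9: "pal resumed domestic flights on October 7",
    12: "and as of December 18 pal was still $2.2 billion in debt",
    13: "and losing close to $1 million a day.",
}


def term_hits(text: str) -> int:
    """Count the distinct docset terms that occur as tokens in `text`."""
    tokens = set(re.findall(r"\$?\w+", text.lower()))
    return len(tokens & DOCSET_TERMS)


hits = {lc: term_hits(text) for lc, text in LARGE_CONSTITUENTS.items()}
```

LC 12 has the most term hits yet maps to no SCU, which suggests that term matching alone would not recover the pyramid's content units.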
CONCLUSIONS • Define parameters of the problem • What is summarization? • Compare systems and/or humans • Is the metric meaningful? • Track progress • When does output improve? • Cost Effectiveness • Can it be (partly) automated?