The Pythy Summarization System: Microsoft Research at DUC 2007

Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki, and Lucy Vanderwende (Microsoft Research), April 26, 2007

DUC Main Task Results • Automatic Evaluations (30 participants) • Human Evaluations • Did pretty well on both measures

Overview of Pythy • Linear sentence ranking model • Learns to rank sentences based on: • ROUGE scores against model summaries • Semantic Content Unit (SCU) weights of sentences selected by past peers • Considers simplified sentences alongside original sentences

PYTHY Training [diagram; components: Docs, Sentences, Simplified Sentences, Feature inventory, Targets (ROUGE Oracle, Pyramid/SCU, ROUGE X 2), Ranking Training, Model]

PYTHY Testing [diagram; components: Docs, Sentences, Simplified Sentences, Feature inventory, Model, Search, Dynamic Scoring, Summary]

Sentence Simplification • Extension of the simplification method used for DUC06 • Provides sentence alternatives rather than deterministically simplifying a sentence • Uses syntax-based heuristic rules • Simplified sentences evaluated alongside originals • In DUC 2007: • Average new candidates generated: 1.38 per sentence • Simplified sentences generated: 61% of all sentences • Simplified sentences in final output: 60%

Sentence-Level Features • SumFocus features: SumBasic (Nenkova et al 2006) + Task focus • cluster frequency and topic frequency • only these used in MSR DUC06 • Other content word unigrams: headline frequency • Sentence length features (binary features) • Sentence position features (real-valued and binary) • N-grams (bigrams, skip bigrams, multiword phrases) • All tokens (topic and cluster frequency) • Simplified Sentences (binary and ratio of relative length) • Inverse document frequency (idf)
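As an illustration of the cluster-frequency and topic-frequency features, here is a minimal sketch. It assumes unigram probabilities estimated over content words and averaged per sentence; the stopword list, the per-sentence normalization, and the function names are assumptions made for this example, not details given on the slides.

```python
from collections import Counter

# Illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}

def unigram_probs(token_lists):
    """Relative frequency of each content word across a set of tokenized texts."""
    counts = Counter(w for toks in token_lists for w in toks if w not in STOPWORDS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def frequency_features(sentence, cluster_probs, topic_probs):
    """Average cluster and topic probability of the sentence's content words."""
    content = [w for w in sentence if w not in STOPWORDS]
    n = len(content) or 1  # avoid division by zero for stopword-only sentences
    return {
        "cluster_freq": sum(cluster_probs.get(w, 0.0) for w in content) / n,
        "topic_freq": sum(topic_probs.get(w, 0.0) for w in content) / n,
    }

# Toy document cluster and topic description.
cluster = [["the", "storm", "hit", "the", "coast"], ["storm", "damage"]]
topic = [["storm", "damage"]]
feats = frequency_features(["the", "storm", "hit"],
                           unigram_probs(cluster), unigram_probs(topic))
```

Each feature is a single real value per sentence, so it can be placed directly into the feature vector of a linear ranking model.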

Pairwise Ranking • Define preferences for sentence pairs • Defined using human summaries and SCU weights • Log-linear ranking objective used in training • Maximize the probability of choosing the better sentence from each pair of comparable sentences • [Ofer et al. 03], [Burges et al. 05]
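The pairwise objective can be sketched as follows, assuming a linear scorer over sentence feature vectors and plain batch gradient ascent. The optimizer, learning rate, toy feature vectors, and function names are illustrative assumptions, not the system's actual training setup.

```python
import math

def score(w, f):
    """Linear score w . f for one sentence's feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def pair_log_prob(w, better, worse):
    """Log-probability that the model prefers `better` over `worse`."""
    return math.log(sigmoid(score(w, better) - score(w, worse)))

def train(pairs, dim, lr=0.1, epochs=200):
    """Gradient ascent on the summed log-probability of the preferred sentences."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for better, worse in pairs:
            # Probability mass currently assigned to the wrong choice.
            p_wrong = sigmoid(score(w, worse) - score(w, better))
            for k in range(dim):
                grad[k] += p_wrong * (better[k] - worse[k])
        w = [wk + lr * gk for wk, gk in zip(w, grad)]
    return w

# Toy (better, worse) pairs of 2-dimensional feature vectors.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.5], [0.3, 0.4])]
w = train(pairs, dim=2)
```

Only score differences within a pair enter the objective, which is why sentences from different document clusters never need comparable absolute scores.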

ROUGE Oracle Metric • Find an oracle extractive summary • the summary with the highest average ROUGE-2 and ROUGE-SU4 scores • All sentences in the oracle are considered “better” than any sentence not in the oracle • Approximate greedy search used for finding the oracle summary
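A greedy oracle search of this kind might look like the sketch below. It substitutes clipped bigram recall for the real ROUGE-2 and ROUGE-SU4 scores, so it is a simplified stand-in; the word budget, stopping rule, and function names are assumptions for illustration.

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent word pairs in one tokenized sentence."""
    return Counter(zip(tokens, tokens[1:]))

def recall(summary_bigrams, model_bigrams):
    """Clipped bigram recall: a simplified stand-in for ROUGE-2."""
    total = sum(model_bigrams.values())
    if total == 0:
        return 0.0
    return sum((summary_bigrams & model_bigrams).values()) / total

def greedy_oracle(sentences, models, max_words):
    """Repeatedly add the sentence that most improves average recall."""
    model_bigrams = [bigrams(m) for m in models]
    chosen, summary, n_words, best = [], Counter(), 0, 0.0
    remaining = list(range(len(sentences)))
    while remaining:
        scored = []
        for i in remaining:
            if n_words + len(sentences[i]) > max_words:
                continue  # would exceed the length budget
            cand = summary + bigrams(sentences[i])
            avg = sum(recall(cand, mb) for mb in model_bigrams) / len(model_bigrams)
            scored.append((avg, i))
        if not scored:
            break
        avg, i = max(scored)
        if avg <= best:
            break  # no remaining sentence improves the score
        chosen.append(i)
        summary += bigrams(sentences[i])
        n_words += len(sentences[i])
        best = avg
        remaining.remove(i)
    return chosen, best

sentences = ["the cat sat on the mat".split(),
             "dogs bark loudly".split(),
             "the mat was red".split()]
models = ["the cat sat on the red mat".split()]
oracle, oracle_score = greedy_oracle(sentences, models, max_words=10)
```

Sentences whose indices appear in `oracle` would then be treated as preferred over every sentence outside it when building training pairs.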

Pyramid-Derived Metric • University of Ottawa SCU-annotated corpus (Copeck et al 06) • Some sentences in the 05 & 06 document collections are: • known to contain certain SCUs • known not to contain any SCUs • Sentence score is the sum of the weights of all SCUs in the sentence • for un-annotated sentences, the score is undefined • A sentence pair is constructed for training: s1 > s2 iff w(s1) > w(s2)
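The pair-construction rule can be sketched directly. This assumes sentence scores arrive as a mapping from sentence id to SCU weight sum, with `None` marking un-annotated sentences; that data layout and the function name are hypothetical.

```python
from itertools import combinations

def scu_pairs(scores):
    """Build (better, worse) pairs from SCU weight sums.

    scores maps sentence id -> summed SCU weight, or None when the
    sentence is un-annotated (its score is undefined, so it is skipped).
    """
    annotated = [(sid, w) for sid, w in scores.items() if w is not None]
    pairs = []
    for (a, wa), (b, wb) in combinations(annotated, 2):
        if wa > wb:
            pairs.append((a, b))
        elif wb > wa:
            pairs.append((b, a))
        # Equal scores express no preference, so no pair is emitted.
    return pairs

# s2 is known to contain no SCUs (score 0); s3 is un-annotated.
scores = {"s1": 5, "s2": 0, "s3": None, "s4": 5}
pairs = scu_pairs(scores)
```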

Model Frequency Metrics • Based on unigram and skip bigram frequency • Computed for content words only • Sentence s_i is “better” than s_j if [formula missing from source]

Combining Multiple Metrics • From ROUGE oracle: all sentences in the oracle summary are better than other sentences • From SCU annotations: sentences with higher avg SCU weights are better • From model frequency: sentences with words occurring in models are better • Combined loss: the sum of the losses according to all metrics
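One way to write the combined loss, assuming each metric m contributes a set P_m of (better, worse) sentence pairs and that the pair probability takes the usual log-linear form over a linear score w · f(s). The symbols P_m, w, and f are notation introduced here for illustration, not taken from the slides.

```latex
L(w) = \sum_{m \in \{\text{ROUGE oracle},\ \text{SCU},\ \text{model frequency}\}}
       \; \sum_{(s_i, s_j) \in P_m}
       -\log \frac{\exp\big(w \cdot f(s_i)\big)}
                  {\exp\big(w \cdot f(s_i)\big) + \exp\big(w \cdot f(s_j)\big)}
```

Under this reading, a sentence pair preferred by several metrics simply appears in several inner sums and so counts more heavily.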


Dynamic Sentence Scoring • Eliminate redundancy by re-weighting • Similar to SumBasic (Nenkova et al 2006), re-weighting given previously selected sentences • Discounts for features that decompose into word frequency estimates
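For reference, the SumBasic-style re-weighting that this slide compares against can be sketched as below. This is the SumBasic update itself (squaring the probability of words already selected), not the discount scheme applied to the system's own features, whose details the slides do not give; the function name and toy data are illustrative.

```python
from collections import Counter

def sumbasic_select(sentences, n_select):
    """Pick sentences by average word probability, squaring the probability
    of every word already covered by a selected sentence."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    chosen = []
    candidates = set(range(len(sentences)))
    while candidates and len(chosen) < n_select:
        # Assumes non-empty sentences; ties go to the lowest index.
        best = max(sorted(candidates),
                   key=lambda i: sum(prob[w] for w in sentences[i]) / len(sentences[i]))
        chosen.append(best)
        candidates.remove(best)
        for w in set(sentences[best]):
            prob[w] **= 2  # discount words the summary already contains
    return chosen

sentences = [["storm", "hits", "coast"],
             ["storm", "hits", "coast", "hard"],
             ["rescue", "teams", "arrive"]]
selected = sumbasic_select(sentences, 2)
```

In this toy input the second sentence nearly repeats the first, so after re-weighting the third sentence is chosen instead.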

Search • The search constructs partial summaries and scores them • The score of a summary does not decompose into an independent sum of sentence scores • Global dependencies make exact search hard • Used multiple beams for each length of partial summaries • [McDonald 2007]
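A length-bucketed beam search over partial summaries might be sketched as follows. It assumes one beam per word count and treats the summary scorer as a black box over whole summaries; the bucket layout, beam size, and the toy coverage scorer are all assumptions made for this example.

```python
def beam_search(sentences, score_summary, max_words, beam_size=3):
    """Search for a high-scoring set of sentences within a word budget.

    sentences: list of non-empty token lists.
    score_summary: callable on a tuple of sentence indices; it may be any
    scorer, including one that does not decompose over sentences.
    """
    buckets = {0: {(): 0.0}}  # word count -> {sentence indices: score}
    best, best_score = (), float("-inf")
    for length in range(max_words + 1):
        bucket = buckets.pop(length, {})
        # Prune: keep only the top partial summaries of this length.
        beam = sorted(bucket.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
        for indices, s in beam:
            if indices and s > best_score:
                best, best_score = indices, s
            for i in range(len(sentences)):
                new_len = length + len(sentences[i])
                if i in indices or new_len > max_words:
                    continue
                # Summaries are treated as sets, so indices are kept sorted.
                cand = tuple(sorted(indices + (i,)))
                buckets.setdefault(new_len, {})[cand] = score_summary(cand)
    return best, best_score

sentences = [["a", "b"], ["a", "b", "c"], ["d", "e"]]

def coverage(indices):
    """Toy non-decomposable score: number of distinct words covered."""
    return len({w for i in indices for w in sentences[i]})

best, best_score = beam_search(sentences, coverage, max_words=5)
```

Bucketing by length keeps short and long partial summaries from competing for the same beam slots, which matters when the score tends to grow with summary length.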

Impact of Sentence Simplification • Trained on 05 data, tested on 06 data

Evaluating the Metrics • Trained on 05 data, tested on 06 data • Includes simplified sentences

Update Summarization Pilot • SVM novelty classifier trained on TREC 02 & 03 novelty track

Summary and Future Work • Summary • Combination of different target metrics for training • Many sentence features • Pair-wise ranking function • Dynamic scoring • Future work • Boost robustness • Sensitive to cluster properties (e.g., size) • Improve grammatical quality of simplified sentences • Reconcile novelty and (ir)relevance • Learn features over whole summaries rather than individual sentences

Thank You
