Developing Learning Strategies for Topic-based Summarization You Ouyang, Sujian Li, Wenjie Li [CIKM ‘07] Advisor: Dr. Koh Jia-Ling Reporter: Che-Wei, Liang Date: 2008/05/15
Outline • Introduction • Topic-based summarization • Feature Design • Training Data Construction • Model Learning • Experiment • Conclusion
Introduction • Sentence scoring • Key process for ranking and selecting sentences. • Selecting the appropriate features highly influences system performance. BUT, the combination of the features is also important.
Introduction • Simply combining features with a linear function has several shortcomings: • Performance is not predictable. • Complexity grows exponentially as the feature set becomes large. • Objective: • Explore how the optimal weights can be obtained automatically by developing learning strategies. • Two fundamental issues: • Learning models • Training data
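The linear-combination baseline criticized above can be sketched as follows. This is a minimal illustration, not the paper's actual feature set or weights; the feature names and numbers are made up for the example.

```python
# A sentence is scored as a weighted sum of its feature values.
# Feature names and weights here are illustrative placeholders,
# hand-tuned the way the baseline systems would require.
def linear_score(features, weights):
    """Score a sentence from its feature dict with fixed weights."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"word_match": 0.5, "centroid": 0.3, "position": 0.2}
features = {"word_match": 0.8, "centroid": 0.4, "position": 1.0}
score = linear_score(features, weights)  # 0.5*0.8 + 0.3*0.4 + 0.2*1.0
```

Tuning these weights by hand is exactly what makes the baseline unpredictable and exponentially costly; the paper's point is to learn them instead.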
Introduction • Apply a machine learning approach, • regarding sentence scoring as a regression problem. • Provides a way of combining the features automatically and effectively. • To make use of human summaries, • develop N-gram methods to approximately measure the “true” sentence scores.
Topic-based summarization • Topic description
Feature Design • Sentences are scored according to the features. • Design a set of features • Three topic dependent features, • Four topic independent features. • Design criterion: try to capture the important information that a sentence conveys.
Feature Design • Word Matching Feature • Semantics Matching Feature
Feature Design • Named Entity Matching Feature • Counts the named entities that appear in both the sentence s and the topic q. • Document Centroid Feature
Feature Design • Named Entity Number Feature • Stop Word Penalty Feature
Feature Design • Sentence Position Feature
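The feature slides above lost their formulas, so the sketch below uses plausible stand-in definitions (simple counts and ratios) for three of the features; the paper's exact formulas may differ. The tiny stop-word list is illustrative only.

```python
# Hedged stand-ins for three of the features described above.
STOP_WORDS = {"the", "a", "of", "in", "is", "to"}  # tiny illustrative list

def word_matching(sentence, topic):
    """Word Matching Feature: fraction of topic words found in the sentence."""
    s, q = set(sentence.lower().split()), set(topic.lower().split())
    return len(s & q) / len(q) if q else 0.0

def stop_word_penalty(sentence):
    """Stop Word Penalty Feature: fraction of stop words in the sentence."""
    words = sentence.lower().split()
    return sum(w in STOP_WORDS for w in words) / len(words) if words else 0.0

def sentence_position(index, n_sentences):
    """Sentence Position Feature: earlier sentences get higher values."""
    return 1.0 - index / n_sentences
```

Topic-dependent features (like word matching) compare the sentence against the topic description; topic-independent ones (position, stop-word penalty) look at the sentence alone.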
Training Data Construction • Hypothesis: • If the human summaries are excellent sum-ups of the documents and contain abundant information, then the document sentences that are more similar to the sentences in the human summaries should be more likely to be good sum-ups as well.
Training Data Construction • Given a document set D and a human summary set H = {H1,…,Hm}, each sentence s in D is assigned a score(s|H). • The scoring methods compute the N-gram probabilities of s being recognized as a summary sentence given the human summaries.
Frequency-based Methods • The probability of an N-gram t under a single human summary Hi can be calculated from its frequency in Hi. • Two strategies to obtain the probability of t under all human summaries: • Maximum strategy • Average strategy
Frequency-based Methods • Maximum: select the largest probability among the human summaries. • Average: average the probabilities over the human summaries. • Score(s|H)
Frequency-based Methods • Since all human summaries are almost of the same length, the average strategy can be simplified. • This gives two alternative sentence scoring methods, one for each strategy.
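The frequency-based scoring can be sketched as below. This is an approximation of the slides' lost formulas under stated assumptions: an n-gram's probability under one summary is its relative frequency there, the maximum or average over summaries gives its probability under H, and a sentence is scored by averaging over its n-grams.

```python
# Frequency-based sentence scoring against human summaries (sketch).
def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prob_in_summary(t, summary_tokens, n):
    """Relative frequency of n-gram t in one human summary."""
    grams = ngrams(summary_tokens, n)
    return grams.count(t) / len(grams) if grams else 0.0

def score_sentence(sentence_tokens, summaries, n=1, strategy="max"):
    """score(s|H): average n-gram probability under the summary set."""
    grams = ngrams(sentence_tokens, n)
    if not grams:
        return 0.0
    total = 0.0
    for t in grams:
        probs = [prob_in_summary(t, h, n) for h in summaries]
        total += max(probs) if strategy == "max" else sum(probs) / len(probs)
    return total / len(grams)
```

With uni-grams and bi-grams (n=1 or 2) and the two strategies, this yields the scoring variants the experiments compare.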
Appearance-based Methods • Binary N-gram appearance judgment • The sentence scoring methods are revised accordingly.
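The appearance-based variant can be sketched as follows, again as an approximation of the lost formula: each n-gram gets a binary 1/0 judgment for appearing in at least one human summary, and the sentence is scored by the covered fraction of its n-grams.

```python
# Appearance-based sentence scoring (binary n-gram judgment, sketch).
def appearance_score(sentence_tokens, summaries, n=1):
    """Fraction of the sentence's n-grams appearing in any human summary."""
    grams = [tuple(sentence_tokens[i:i + n])
             for i in range(len(sentence_tokens) - n + 1)]
    if not grams:
        return 0.0
    summary_grams = set()
    for h in summaries:
        summary_grams |= {tuple(h[i:i + n]) for i in range(len(h) - n + 1)}
    return sum(g in summary_grams for g in grams) / len(grams)
```

Replacing frequencies with a 0/1 appearance test removes the maximum/average distinction at the single-summary level and makes the score depend only on coverage.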
Model Learning: SVR-based Methods • Models are trained from the document sets D where the human summaries H are given. • Regression problem: • the task of predicting the score of a sentence s in another document set D’ given its feature vector F(s). • Generate a regression function from the training pairs.
Model Learning: SVR-based Methods • Once the regression function f0 is learned, it defines the sentence score. • Normalized by the sentence length, the score can be refined.
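The SVR step can be sketched with scikit-learn: fit a regressor from feature vectors F(s) to the N-gram-derived scores, then predict scores for sentences in unseen documents. The feature vectors and target scores below are toy values, not data from the paper.

```python
# SVR-based sentence score regression (sketch with toy training data).
import numpy as np
from sklearn.svm import SVR

X_train = np.array([[0.8, 0.4, 1.0],   # F(s) for training sentences
                    [0.1, 0.2, 0.3],
                    [0.5, 0.9, 0.7]])
y_train = np.array([0.9, 0.1, 0.6])    # score(s|H) from human summaries

model = SVR(kernel="rbf")               # learn the regression function f0
model.fit(X_train, y_train)

# Predict the score of a sentence from an unseen document set D'.
pred = model.predict(np.array([[0.7, 0.5, 0.8]]))[0]
```

Because the targets come from the N-gram scoring of human summaries, no manual sentence-level labels are needed to train the model.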
Redundancy Removal • Redundant information problem • If the terms of two sentences are very similar, the sentences will probably have similar scores. • Maximum Marginal Relevance (MMR) • Select sentences from highest score to lowest. • Compute the similarity with the previously selected sentences; select the sentence only if it is not too similar to any of them.
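The MMR-style greedy selection above can be sketched as follows. The Jaccard similarity and the 0.5 threshold are illustrative choices, not necessarily the paper's.

```python
# Greedy redundancy removal: walk candidates from highest to lowest
# score, keeping a sentence only if its similarity to every already
# selected sentence stays under a threshold.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(scored_sentences, threshold=0.5, k=3):
    """scored_sentences: list of (score, tokens); returns selected tokens."""
    selected = []
    for _, tokens in sorted(scored_sentences, key=lambda p: -p[0]):
        if all(jaccard(tokens, s) < threshold for s in selected):
            selected.append(tokens)
        if len(selected) == k:
            break
    return selected
```

Example: with a threshold of 0.5, a candidate sharing most of its words with an already-selected sentence is skipped in favor of a lower-scored but novel one.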
Experiment • Document sets • DUC 2006 and DUC 2005. • Present the results of the eight combinations, varying • N-grams (uni-gram or bi-gram), • probability calculations (frequency or appearance), • scoring strategies (maximum or average).
Experiment • The system developed with the N-gram methods even performs much better than a human summarizer.
Conclusion • Proposes methods for • constructing training data based on human summaries, • training sentence scoring models based on regression models. • The SVR-based system can achieve very good performance.