Developing Learning Strategies for Topic-based Summarization You Ouyang, Sujian Li, Wenjie Li [CIKM ‘07] Advisor: Dr. Koh Jia-Ling Reporter: Che-Wei, Liang Date: 2008/05/15
Outline • Introduction • Topic-based summarization • Feature Design • Training Data Construction • Model Learning • Experiment • Conclusion
Introduction • Sentence scoring • Key process for ranking and selecting sentences. • Selecting the appropriate features highly influences system performance. BUT, the combination of the features is also important.
Introduction • Simply combining features with a linear function has several shortcomings: • Performance is not predictable. • Complexity grows exponentially as the feature set becomes large. • Objective: • Explore how the optimal weights can be obtained automatically by developing learning strategies. • Two fundamental issues: • Learning models • Training data
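The linear-combination baseline criticized above can be sketched as follows. This is a minimal illustration, not the paper's actual feature set or weights; the feature names and numbers are made up for the example.

```python
# A sentence is scored as a weighted sum of its feature values.
# Feature names and weights here are illustrative placeholders,
# hand-tuned the way the baseline systems would require.
def linear_score(features, weights):
    """Score a sentence from its feature dict with fixed weights."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"word_match": 0.5, "centroid": 0.3, "position": 0.2}
features = {"word_match": 0.8, "centroid": 0.4, "position": 1.0}
score = linear_score(features, weights)  # 0.5*0.8 + 0.3*0.4 + 0.2*1.0
```

Tuning these weights by hand is exactly what makes the baseline unpredictable and exponentially costly; the paper's point is to learn them instead.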
Introduction • Apply a machine learning approach, • regarding sentence scoring as a regression problem. • Provides a way of combining the features automatically and effectively. • To make use of human summaries, • develop N-gram methods to approximately measure the “true” sentence scores.
Topic-based summarization • Topic description
Feature Design • Sentences are scored according to the features. • Design a set of features • Three topic dependent features, • Four topic independent features. • Design criterion: try to capture the important information that a sentence conveys.
Feature Design • Word Matching Feature • Semantics Matching Feature
Feature Design • Named Entity Matching Feature • Counts the named entities that appear in both the sentence s and the topic q. • Document Centroid Feature
Feature Design • Named Entity Number Feature • Stop Word Penalty Feature
Feature Design • Sentence Position Feature
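The feature slides above lost their formulas, so the sketch below uses plausible stand-in definitions (simple counts and ratios) for three of the features; the paper's exact formulas may differ. The tiny stop-word list is illustrative only.

```python
# Hedged stand-ins for three of the features described above.
STOP_WORDS = {"the", "a", "of", "in", "is", "to"}  # tiny illustrative list

def word_matching(sentence, topic):
    """Word Matching Feature: fraction of topic words found in the sentence."""
    s, q = set(sentence.lower().split()), set(topic.lower().split())
    return len(s & q) / len(q) if q else 0.0

def stop_word_penalty(sentence):
    """Stop Word Penalty Feature: fraction of stop words in the sentence."""
    words = sentence.lower().split()
    return sum(w in STOP_WORDS for w in words) / len(words) if words else 0.0

def sentence_position(index, n_sentences):
    """Sentence Position Feature: earlier sentences get higher values."""
    return 1.0 - index / n_sentences
```

Topic-dependent features (like word matching) compare the sentence against the topic description; topic-independent ones (position, stop-word penalty) look at the sentence alone.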
Training Data Construction • Hypothesis: • If the human summaries are excellent sum-ups of the documents and contain abundant information, then the document sentences that are more similar to the sentences in the human summaries should be more likely to be good sum-ups as well.
Training Data Construction • Given a document set D and a human summary set H = {H1,…,Hm}, each sentence s in D is assigned a score(s|H). • The scoring methods compute the N-gram probabilities of s being recognized as a summary sentence given the human summaries.
Frequency-based Methods • The probability of an N-gram t under a single human summary Hi can be calculated from its frequency in Hi. • Two strategies to obtain the probability of t under all human summaries: • Maximum strategy • Average strategy
Frequency-based Methods • Maximum: select the largest probability among the human summaries. • Average: average the probabilities over the human summaries. • Score(s|H)
Frequency-based Methods • Since all human summaries are almost of the same length, the average strategy can be simplified. • This gives two alternative sentence scoring methods, one for each strategy.
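The frequency-based scoring can be sketched as below. This is an approximation of the slides' lost formulas under stated assumptions: an n-gram's probability under one summary is its relative frequency there, the maximum or average over summaries gives its probability under H, and a sentence is scored by averaging over its n-grams.

```python
# Frequency-based sentence scoring against human summaries (sketch).
def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prob_in_summary(t, summary_tokens, n):
    """Relative frequency of n-gram t in one human summary."""
    grams = ngrams(summary_tokens, n)
    return grams.count(t) / len(grams) if grams else 0.0

def score_sentence(sentence_tokens, summaries, n=1, strategy="max"):
    """score(s|H): average n-gram probability under the summary set."""
    grams = ngrams(sentence_tokens, n)
    if not grams:
        return 0.0
    total = 0.0
    for t in grams:
        probs = [prob_in_summary(t, h, n) for h in summaries]
        total += max(probs) if strategy == "max" else sum(probs) / len(probs)
    return total / len(grams)
```

With uni-grams and bi-grams (n=1 or 2) and the two strategies, this yields the scoring variants the experiments compare.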
Appearance-based Methods • Binary N-gram appearance judgment • The sentence scoring methods are revised accordingly.
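The appearance-based variant can be sketched as follows, again as an approximation of the lost formula: each n-gram gets a binary 1/0 judgment for appearing in at least one human summary, and the sentence is scored by the covered fraction of its n-grams.

```python
# Appearance-based sentence scoring (binary n-gram judgment, sketch).
def appearance_score(sentence_tokens, summaries, n=1):
    """Fraction of the sentence's n-grams appearing in any human summary."""
    grams = [tuple(sentence_tokens[i:i + n])
             for i in range(len(sentence_tokens) - n + 1)]
    if not grams:
        return 0.0
    summary_grams = set()
    for h in summaries:
        summary_grams |= {tuple(h[i:i + n]) for i in range(len(h) - n + 1)}
    return sum(g in summary_grams for g in grams) / len(grams)
```

Replacing frequencies with a 0/1 appearance test removes the maximum/average distinction at the single-summary level and makes the score depend only on coverage.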
Model Learning: SVR-based Methods • Models are trained from the document sets D where the human summaries H are given. • Regression problem: • the task of predicting the score of a sentence s in another document set D’ given its feature vector F(s). • Generate a regression function from the training pairs.
Model Learning: SVR-based Methods • Once the regression function f0 is learned, it defines the sentence score. • Normalized by the sentence length, the score can be refined.
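The SVR step can be sketched with scikit-learn: fit a regressor from feature vectors F(s) to the N-gram-derived scores, then predict scores for sentences in unseen documents. The feature vectors and target scores below are toy values, not data from the paper.

```python
# SVR-based sentence score regression (sketch with toy training data).
import numpy as np
from sklearn.svm import SVR

X_train = np.array([[0.8, 0.4, 1.0],   # F(s) for training sentences
                    [0.1, 0.2, 0.3],
                    [0.5, 0.9, 0.7]])
y_train = np.array([0.9, 0.1, 0.6])    # score(s|H) from human summaries

model = SVR(kernel="rbf")               # learn the regression function f0
model.fit(X_train, y_train)

# Predict the score of a sentence from an unseen document set D'.
pred = model.predict(np.array([[0.7, 0.5, 0.8]]))[0]
```

Because the targets come from the N-gram scoring of human summaries, no manual sentence-level labels are needed to train the model.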
Redundancy Removal • Redundant information problem • If the terms of two sentences are very similar, the sentences will probably have similar scores. • Maximum Marginal Relevance (MMR) • Select sentences from highest score to lowest. • Compute the similarity with the previously selected sentences; select the sentence only if it is not too similar to any of them.
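The MMR-style greedy selection above can be sketched as follows. The Jaccard similarity and the 0.5 threshold are illustrative choices, not necessarily the paper's.

```python
# Greedy redundancy removal: walk candidates from highest to lowest
# score, keeping a sentence only if its similarity to every already
# selected sentence stays under a threshold.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(scored_sentences, threshold=0.5, k=3):
    """scored_sentences: list of (score, tokens); returns selected tokens."""
    selected = []
    for _, tokens in sorted(scored_sentences, key=lambda p: -p[0]):
        if all(jaccard(tokens, s) < threshold for s in selected):
            selected.append(tokens)
        if len(selected) == k:
            break
    return selected
```

Example: with a threshold of 0.5, a candidate sharing most of its words with an already-selected sentence is skipped in favor of a lower-scored but novel one.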
Experiment • Document sets • DUC 2006 and DUC 2005. • Present the results of the eight combinations, varying • N-grams (uni-gram or bi-gram), • probability calculations (frequency or appearance), • scoring strategies (maximum or average).
Experiment • The system developed with the N-gram methods even performs much better than a human summarizer.
Conclusion • Proposes methods for • constructing training data based on human summaries, • training sentence scoring models based on regression models. • The SVR-based system can achieve very good performance.