
Developing Learning Strategies for Topic-based Summarization



Presentation Transcript


  1. Developing Learning Strategies for Topic-based Summarization You Ouyang, Sujian Li, Wenjie Li [CIKM ‘07] Advisor: Dr. Koh Jia-Ling Reporter: Che-Wei, Liang Date: 2008/05/15

  2. Outline • Introduction • Topic-based summarization • Feature Design • Training Data Construction • Model Learning • Experiment • Conclusion

  3. Introduction • Sentence scoring • The key process for ranking and selecting sentences. • Selection of the appropriate features highly influences system performance. BUT, the combination of features is also important.

  4. Introduction • Simply combining features with a linear function has several shortcomings: • Performance is not predictable. • Complexity grows exponentially as the feature set becomes large. • Objective: • Explore how the optimal weights can be obtained automatically by developing learning strategies. • Two fundamental issues: • Learning models • Training data

  5. Introduction • Apply a machine learning approach • regarding sentence scoring as a regression problem. • Provides a way of combining the features automatically and effectively. • To make use of human summaries, • develop N-gram methods to approximately measure the “true” sentence scores.

  6. Topic-based summarization • Topic description

  7. Feature Design • Sentences are scored according to the features. • Design a set of features: • Three topic-dependent features, • Four topic-independent features. • Design criterion: try to capture the important information that a sentence conveys.

  8. Feature Design • Word Matching Feature • Semantics Matching Feature
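A minimal Python sketch of these two topic-dependent features, assuming word matching counts topic terms that occur in the sentence (normalized by sentence length) and semantics matching relaxes exact matching with a synonym lookup; the synonym table here is a hypothetical stand-in for the lexical resource actually used, and the paper's exact formulas may differ.

```python
# Word matching: overlap between sentence tokens and topic (query) tokens.
# Semantics matching: the same overlap, but a token also counts as a hit
# if one of its related words appears in the topic.
def word_matching(s_tokens, q_tokens):
    q_set = set(q_tokens)
    overlap = sum(1 for w in s_tokens if w in q_set)
    return overlap / max(len(s_tokens), 1)

def semantics_matching(s_tokens, q_tokens, synonyms):
    # synonyms: dict mapping a word to a set of semantically related words
    q_set = set(q_tokens)
    hits = 0
    for w in s_tokens:
        related = synonyms.get(w, set()) | {w}
        if related & q_set:
            hits += 1
    return hits / max(len(s_tokens), 1)

if __name__ == "__main__":
    q = "global warming causes".split()
    s = "rising temperatures are caused by greenhouse gases".split()
    syn = {"temperatures": {"warming"}, "caused": {"causes"}}
    print(word_matching(s, q), semantics_matching(s, q, syn))
```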

  9. Feature Design • Named Entity Matching Feature • The number of named entities appearing in both s and q. • Document Centroid Feature
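A minimal sketch of these two features, assuming named-entity matching is the count of entities shared by s and q (as stated on the slide) and the centroid feature is the cosine similarity between the sentence and the term-frequency centroid of the document set; the exact term weighting used in the paper may differ.

```python
from collections import Counter
import math

def ne_matching(s_entities, q_entities):
    # number of named entities occurring in both the sentence and the topic
    return len(set(s_entities) & set(q_entities))

def centroid_feature(s_tokens, doc_token_lists):
    # centroid = average term frequencies over all documents in the set
    centroid = Counter()
    for doc in doc_token_lists:
        for w, c in Counter(doc).items():
            centroid[w] += c / len(doc_token_lists)
    s_vec = Counter(s_tokens)
    dot = sum(s_vec[w] * centroid[w] for w in s_vec)
    norm = math.sqrt(sum(v * v for v in s_vec.values())) * \
           math.sqrt(sum(v * v for v in centroid.values()))
    return dot / norm if norm else 0.0
```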

  10. Feature Design • Named Entity Number Feature • Stop Word Penalty Feature

  11. Feature Design • Sentence Position Feature
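A minimal sketch of the topic-independent features on slides 10 and 11, under the assumptions that the named-entity number is the count of entities in the sentence, the stop-word penalty is the fraction of non-stop words, and the position feature decays with the sentence's position in its document; the paper's exact definitions may differ.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # toy list

def ne_number(s_entities):
    # count of named entities contained in the sentence
    return len(s_entities)

def stop_word_penalty(s_tokens):
    # fraction of tokens that are content words (penalizes stop-word-heavy sentences)
    content = [w for w in s_tokens if w.lower() not in STOP_WORDS]
    return len(content) / max(len(s_tokens), 1)

def sentence_position(index, n_sentences):
    # index: 0-based position of the sentence inside its document;
    # earlier sentences receive higher scores
    return 1.0 - index / max(n_sentences - 1, 1)
```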

  12. Training Data Construction • Hypothesis: • If the human summaries are excellent sum-ups of the documents and contain abundant information, then the document sentences that are more similar to the sentences in the human summaries should be more likely to be good sum-ups as well.

  13. Training Data Construction • Given a document set D and a human summary set H = {H1,…,Hm}, each sentence s in D is assigned a score, score(s|H). • The scoring methods compute the N-gram probabilities of s being recognized as a summary sentence given the human summaries.

  14. Frequency-based Methods • The probability of an N-gram t under a single human summary Hi can be calculated from its frequency in Hi. • Two strategies to obtain the probability of t under the whole set of human summaries: • Maximum strategy. • Average strategy.

  15. Frequency-based Methods • Maximum: select the largest probability among the human summaries. • Average: average the probabilities over the human summaries. • Score(s|H)

  16. Frequency-based Methods • Since all human summaries are of almost the same length, the formula can be simplified. • This gives two alternative sentence scoring methods.
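A minimal sketch of the frequency-based construction of the "true" sentence scores, assuming p(t|Hi) is the relative frequency of N-gram t in human summary Hi, the maximum/average strategies combine these per-summary probabilities, and score(s|H) averages the combined probabilities over the N-grams of s; the paper's exact normalization may differ.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_probs(summary_tokens, n):
    # relative frequency of each N-gram within one human summary
    counts = Counter(ngrams(summary_tokens, n))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def score_sentence(s_tokens, summaries, n=1, strategy="average"):
    per_summary = [ngram_probs(h, n) for h in summaries]
    grams = ngrams(s_tokens, n)
    score = 0.0
    for t in grams:
        ps = [p.get(t, 0.0) for p in per_summary]
        score += max(ps) if strategy == "maximum" else sum(ps) / len(ps)
    return score / max(len(grams), 1)
```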

  17. Appearance-based Methods • Binary N-gram appearance judgment. • The sentence scoring methods are revised accordingly.
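A minimal sketch of the appearance-based variant, which replaces the relative frequency with a binary judgment of whether the N-gram appears in a human summary at all; it reuses the ngrams helper from the frequency-based sketch above, and the same maximum/average strategies apply.

```python
def appearance_probs(summary_tokens, n):
    # binary judgment: 1.0 if the N-gram occurs anywhere in this summary
    return {t: 1.0 for t in set(ngrams(summary_tokens, n))}

def appearance_score(s_tokens, summaries, n=1, strategy="average"):
    per_summary = [appearance_probs(h, n) for h in summaries]
    grams = ngrams(s_tokens, n)
    score = 0.0
    for t in grams:
        ps = [p.get(t, 0.0) for p in per_summary]
        score += max(ps) if strategy == "maximum" else sum(ps) / len(ps)
    return score / max(len(grams), 1)
```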

  18. Model Learning: SVR-based Methods • Models are trained from the document sets D where the human summaries H are given. • Regression problem: • The task of predicting the score of a sentence s in another document set D’ given its feature vector F(s). • The goal is to generate a regression function based on the training data.

  19. Model Learning: SVR-based Methods

  20. Model Learning: SVR-based Methods • Once the regression function f0 is learned, the sentence score is defined from it. • Normalized by the sentence length, the score can be further refined.
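A minimal sketch of the SVR-based learning step, assuming scikit-learn's SVR, a feature function F(s) producing the seven feature values described above, and target scores taken from the N-gram methods; the length normalization at the end follows the slide's refinement, but its exact form is an assumption.

```python
import numpy as np
from sklearn.svm import SVR

def train_scorer(train_features, train_scores):
    # train_features: list of feature vectors F(s); train_scores: score(s|H)
    model = SVR(kernel="rbf", C=1.0, epsilon=0.01)
    model.fit(np.asarray(train_features), np.asarray(train_scores))
    return model

def score_new_sentences(model, features, sentence_lengths):
    # predict scores for sentences of an unseen document set D'
    raw = model.predict(np.asarray(features))
    # refinement: normalize the predicted score by sentence length
    # (the exact normalization used in the paper is an assumption here)
    return raw / np.asarray(sentence_lengths)
```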

  21. Redundancy Removal • Redundant information problem: • If the terms of two sentences are very similar, the sentences will probably have approximately the same score. • Maximum marginal relevance (MMR): • Select sentences from the highest score to the lowest. • Compute the similarity with the sentences selected before; select a sentence only if it is not too similar.
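A minimal sketch of the MMR-style selection described on this slide: walk the sentences from the highest predicted score to the lowest and keep a sentence only if its similarity to every already-selected sentence stays below a threshold; cosine similarity over term counts is used here as the similarity measure, which is an assumption.

```python
from collections import Counter
import math

def cosine(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_sentences(scored_sentences, max_sentences, sim_threshold=0.7):
    # scored_sentences: list of (score, token_list), in any order
    selected = []
    for _, tokens in sorted(scored_sentences, key=lambda x: x[0], reverse=True):
        if all(cosine(tokens, chosen) < sim_threshold for chosen in selected):
            selected.append(tokens)
        if len(selected) == max_sentences:
            break
    return selected
```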

  22. Experiment • Document set • DUC 2006 and DUC 2005. • Present the results of the eight combinations with consideration of • N-grams (unigram or bigram), • Probability calculations (frequency or appearance), • Scoring strategies (maximum or average).

  23. Experiment The system developed with the N-gram methods even performs much better than a human summarizer.

  24. Experiment

  25. Conclusion • The paper proposes methods for: • Constructing training data based on human summaries, • Training sentence scoring models based on regression models. • The SVR-based system can achieve very good performance.
