TAC Summarisation System

TAC Summarisation System. WING Meeting 8 Jul 2011. Ziheng Lin, Praveen Bysani , Jun-Ping Ng. Outline. Introduction Methodology Experimental Results Conclusion. Introduction. TAC 2011 Guided Summarization. Summarization: guided by “importance” of facts

Presentation Transcript


  1. TAC Summarisation System

    WING Meeting 8 Jul 2011 Ziheng Lin, Praveen Bysani, Jun-Ping Ng
  2. Outline

    Introduction
    Methodology
    Experimental Results
    Conclusion
  3. Introduction
  4. TAC 2011 Guided Summarization

    Summarization: guided by “importance” of facts
      Highly subjective and content-dependent
    Problems with generic summarization
      Sentence scoring based on term frequency
        Hindered by synonyms and paraphrases
      Redundancy
      Extractive: low readability and coherence
  5. TAC 2011 Guided Summarization

    Guided summarization:
      Topics: template-like categories, highly predictable elements
      A specific, unified information model
      Encourage abstractive summaries
    Task:
      Set A: input: 10 news articles and a topic; output: a 100-word summary
      Set B: input: the subsequent 10 news articles for the topic; output: a 100-word update summary
  6. TAC 2011 Guided Summarization

    Before TAC 2010, a topic used to be:
      Title: Southern Poverty Law Center
      Narrative: Describe the activities of Morris Dees and the Southern Poverty Law Center.
    New topic format: category + aspect
    5 topic categories:
      Accidents and Natural Disasters
      Attacks
      Health and Safety
      Endangered Resources
      Investigations and Trials
  7. TAC 2011 Guided Summarization

    Pre-defined aspects for each category:
      Health and Safety:
        WHAT: what is the issue
        WHO_AFFECTED: who is affected by the health/safety issue
        HOW: how they are affected
        WHY: why the health/safety issue occurs
        COUNTERMEASURES: countermeasures, prevention efforts
  8. TAC 2011 Guided Summarization

    Aim: achieve high ROUGE scores
    Direction: utilize the category and aspect info
  9. Methodology
  10. Design Principles

    Need for a testbed to develop and verify ideas and techniques
      Simple to maintain
      Easy to use
      Quick-footed and flexible
  11. Architecture

    Pipeline of modules
      Independent Ruby modules
        Can concentrate on specific parts
      Linked up with Linux pipes
        Simple and stable
        Intermediate results improve robustness
    Information exchange via JSON
      Easy to program
      Human readable (to a certain extent)
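The pipeline idea on this slide can be illustrated with a minimal sketch. It is written in Python rather than the Ruby the slides mention, and the JSON layout (a "sentences" list with a "text" field) and the stage name add_sentence_lengths are assumptions made purely for illustration, not the system's actual schema. Each stage reads one JSON document from an input stream and writes one JSON document to an output stream, so stages could be chained with ordinary pipes.

```python
import io
import json


def run_stage(transform, instream, outstream):
    """Read one JSON document, apply `transform`, write one JSON document."""
    payload = json.load(instream)
    json.dump(transform(payload), outstream, indent=2)
    outstream.write("\n")


def add_sentence_lengths(payload):
    """Hypothetical stage: annotate each sentence with its word count."""
    for sentence in payload["sentences"]:
        sentence["length"] = len(sentence["text"].split())
    return payload


# In-memory streams stand in for stdin/stdout here; a real stage would be
# called as run_stage(add_sentence_lengths, sys.stdin, sys.stdout).
source = io.StringIO('{"sentences": [{"text": "Menu Foods recalled pet food."}]}')
sink = io.StringIO()
run_stage(add_sentence_lengths, source, sink)
result = json.loads(sink.getvalue())
```

Because every stage emits plain JSON, the intermediate output of any stage can be saved to a file and inspected or replayed, which is one way to read the slide's point about intermediate results.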
  12. Overall Flow
  13. Summary Generation Pipeline
  14. Features
  15. Generic Word Importance

    Document Frequency: a successful feature in past summarization tasks
      Word-level feature
      Computed over all relevant documents in a cluster
      DF(w) = d / D, where d is the number of documents containing w and D is the number of documents in the cluster
    Extended from unigrams to bigrams
      Smoothed with unigrams for better recall during sentence scoring
      dfs = α · dfs_uni + (1 − α) · dfs_bi
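The document-frequency feature and its bigram extension can be sketched as follows. The slide does not say how word-level DF values are aggregated into a sentence score, so averaging over the n-grams of a sentence is an assumption here, as are the function names and the default α of 0.5.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(docs, n):
    """DF(g) = d / D over the tokenized documents of one cluster."""
    counts = {}
    for tokens in docs:
        for gram in set(ngrams(tokens, n)):  # count each document once per n-gram
            counts[gram] = counts.get(gram, 0) + 1
    return {gram: count / len(docs) for gram, count in counts.items()}


def sentence_score(tokens, df_uni, df_bi, alpha=0.5):
    """dfs = alpha * dfs_uni + (1 - alpha) * dfs_bi.

    Assumption: dfs_uni and dfs_bi are the mean DF of the sentence's
    unigrams and bigrams respectively.
    """
    uni, bi = ngrams(tokens, 1), ngrams(tokens, 2)
    dfs_uni = sum(df_uni.get(g, 0.0) for g in uni) / len(uni) if uni else 0.0
    dfs_bi = sum(df_bi.get(g, 0.0) for g in bi) / len(bi) if bi else 0.0
    return alpha * dfs_uni + (1 - alpha) * dfs_bi


# Toy cluster of three tokenized documents (made-up data).
cluster = [
    ["pet", "food", "recall"],
    ["pet", "food", "sold"],
    ["dog", "food", "recall"],
]
df_uni = document_frequency(cluster, 1)
df_bi = document_frequency(cluster, 2)
score = sentence_score(["pet", "food", "recall"], df_uni, df_bi, alpha=0.5)
```

With α closer to 1 the score leans on unigrams, which match more often; with α closer to 0 it leans on the sparser but more specific bigrams.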
  16. KL-Divergence

    Step 1: Get statistics of words over a reference corpus
    Step 2: Collapse words with similar distributions into the same equivalence class
      Similarity measured with KL-Divergence
    Step 3: Repeat (1) and (2) for the target document set
    Step 4: Naïve Bayes formulation to compute the likelihood of a word appearing in the document set
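Steps 1 and 2 can be sketched roughly as below. The slide does not specify the exact similarity or the clustering procedure, so this sketch makes two assumptions: an equal-weight divergence to the mean (the Jensen-Shannon divergence, built from two KL terms) stands in for the similarity measure, and a greedy single pass with a threshold stands in for the collapsing step. The word distributions are made-up numbers. Step 4 is not covered.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def divergence_to_mean(p, q):
    """Average KL divergence of p and q to their equal-weight mean."""
    mean = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mean) + 0.5 * kl_divergence(q, mean)


def collapse_words(word_dists, threshold):
    """Greedy grouping: a word joins the first class whose first member
    is within `threshold`; otherwise it starts a new class."""
    classes = []
    for word, dist in word_dists.items():
        for members in classes:
            if divergence_to_mean(dist, word_dists[members[0]]) <= threshold:
                members.append(word)
                break
        else:
            classes.append([word])
    return classes


# Hypothetical per-word distributions over two classes of the reference corpus.
word_dists = {
    "recall": [0.80, 0.20],
    "recalled": [0.75, 0.25],
    "verdict": [0.10, 0.90],
}
classes = collapse_words(word_dists, threshold=0.01)
```

Using the divergence to the mean avoids the division by zero that plain KL divergence hits when one word never occurs in a class where the other does.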
  17. Category relevance score

    DF extended to the category level
    Frequency in terms of both topics and documents in a category
    Weighted linear combination of both:
      crs = α · top_freq + (1 − α) · doc_freq
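A minimal sketch of the category relevance score follows. The slide does not define top_freq and doc_freq precisely; here they are assumed to be the fraction of topics, and the fraction of documents, in the category that contain the word. The nested-list data layout and the function name are likewise illustrative.

```python
def category_relevance(word, topics, alpha=0.5):
    """crs = alpha * top_freq + (1 - alpha) * doc_freq for one category.

    `topics` is a list of topics; each topic is a list of documents;
    each document is a list of tokens.
    """
    docs = [doc for topic in topics for doc in topic]
    if not docs:
        return 0.0
    # Assumed definitions: share of topics / documents containing the word.
    top_freq = sum(any(word in doc for doc in topic) for topic in topics) / len(topics)
    doc_freq = sum(word in doc for doc in docs) / len(docs)
    return alpha * top_freq + (1 - alpha) * doc_freq


# Toy category with two topics of two documents each (made-up data).
health_topics = [
    [["recall", "pet"], ["pet", "food"]],
    [["recall", "toys"], ["lead", "paint"]],
]
crs_recall = category_relevance("recall", health_topics)
crs_pet = category_relevance("pet", health_topics)
```

In this toy data "recall" appears in both topics but only half the documents, so it outscores "pet", which is frequent within a single topic only.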
  18. Category Differential Measures

    KL Divergence
      Computes the difference between two probability distributions
      Used to identify discriminative words for a category
      C-KLD of a word: divergence between the current category (c) and the rest of the categories (c^)
      The greater the divergence, the more discriminative the word is for the category
    Calculating importance
      Word lists with the highest divergence
      Average word divergence per sentence
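One way to turn the category KL divergence into a per-word score is sketched below. The slide does not give the formula, so this assumes the word's pointwise contribution p_c(w) · log(p_c(w) / p_c^(w)) to the KL divergence, with add-one smoothing; both choices, and the function names, are assumptions.

```python
import math
from collections import Counter


def word_divergence(category_tokens, rest_tokens):
    """Pointwise KL contribution of each word: p * log(p / q), where p and q
    are add-one-smoothed unigram probabilities in the category and in the
    remaining categories."""
    vocab = set(category_tokens) | set(rest_tokens)
    cat_counts, rest_counts = Counter(category_tokens), Counter(rest_tokens)
    cat_total = len(category_tokens) + len(vocab)
    rest_total = len(rest_tokens) + len(vocab)
    scores = {}
    for word in vocab:
        p = (cat_counts[word] + 1) / cat_total
        q = (rest_counts[word] + 1) / rest_total
        scores[word] = p * math.log(p / q)
    return scores


def sentence_divergence(tokens, scores):
    """Average word divergence over the tokens of a sentence."""
    if not tokens:
        return 0.0
    return sum(scores.get(token, 0.0) for token in tokens) / len(tokens)


# Made-up token streams for the current category and for all other categories.
category_tokens = ["recall", "recall", "pet", "food"]
rest_tokens = ["attack", "bomb", "food", "police"]
scores = word_divergence(category_tokens, rest_tokens)
sentence = sentence_divergence(["recall", "food"], scores)
```

Summing the scores over the whole vocabulary gives the full KL divergence between the two smoothed distributions; words that are equally likely in both contribute nothing.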
  19. Category Differential Measures (cont.)

    Relevance Frequency (RF)
      Term weighting scheme for text categorization (Lan, Tan et al.)
      Differs from idf and similar weights, which are defined in an IR context
      Measures the discriminative power of a word
      RF = log(2 + a/c)
        a: frequency of the word in C
        c: frequency of the word in C^
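The RF formula can be sketched directly. The slide gives neither the logarithm base nor what happens when c is zero; base 2 and a floor of 1 on c are assumptions made here so that the function is always defined. Counting a and c as the number of documents containing the word is also an assumption.

```python
import math


def relevance_frequency(a, c):
    """RF = log(2 + a / c).

    Assumptions: logarithm base 2, and c floored at 1 to avoid
    division by zero when the word never occurs outside the category.
    """
    return math.log2(2 + a / max(1, c))


def rf_from_documents(word, category_docs, other_docs):
    """Count a and c as the number of documents containing `word`."""
    a = sum(word in doc for doc in category_docs)
    c = sum(word in doc for doc in other_docs)
    return relevance_frequency(a, c)


# Made-up tokenized documents for category C and for the other categories C^.
category_docs = [["pet", "food", "recall"], ["recall", "toys"]]
other_docs = [["bomb", "attack"], ["trial", "verdict"]]
rf_recall = rf_from_documents("recall", category_docs, other_docs)
rf_bomb = rf_from_documents("bomb", category_docs, other_docs)
```

A word absent from the category gets the minimum weight log2(2) = 1, and the weight grows as the word becomes more concentrated in C relative to C^.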
  20. Exploration
  21. Named Entities

    In many categories, “who” and “where” are important aspects of the summary
    Named-entity recognition can identify people’s names and places
    How do we use this to improve our summaries?
  22. Results
  23. Baseline Experiments Trained on 2009 and tested on 2010
  24. Guided Experiments

    Test data (TAC 2010) split into two parts to test the efficiency of new features
    Features suffer from less category information in the training set
  25. Sample Summary

    Aspects: What, Who affected, How, Why, Countermeasures
    Category: Health issues
    Topic: Pet food recall

    An unknown number of cats and dogs suffered kidney failure and about 10 died after eating the affected pet food. Menu Foods, the Ontario-based company that produced the pet food, said Saturday it was recalling dog food sold under 48 brands and cat food sold under 40 brands including Iams, Nutro and Eukanuba. The food was distributed throughout the United States, Canada and Mexico by major retailers such as Wal-Mart, Kroger and Safeway. However, the recalled products were made using wheat gluten purchased from a new supplier, since dropped for another source. The company said it manufacturers for 17 of the top 20 North American retailers.
  26. Conclusion
  27. It’s Just The Beginning

    Preprocessing
    Scoring
      Features
      Beyond SVR and MMR
    Postprocessing
      Sentence re-ordering
      Language generation
  28. References
  29. Baker and McCallum, Distributional Clustering of Words for Text Classification, SIGIR 1998