Summarization Techniques

A. Bellaachia
Computer Science Department
School of Engineering and Applied Sciences
George Washington University
Washington, DC 20052
Research Team

• Abdelghani Bellaachia, Simon Berkovich, Avinash Kanal — Computer Science Department, George Washington University, Washington, DC
• Anandpal Mahajan — Web Methods, Virginia
• Abdel-Hamid Gooda — IBM Consulting, Washington, DC
Motivation
• Decide whether a document is relevant or not.
  • What is the first thing you read in a novel?
  • Get the summary of a book; often the summary is all that is read.
• Provide summaries of retrieved web pages related to a user query.
• Automatically abstract a technical paper: human-generated summaries are expensive.
Motivation (Cont’d)
• Think about your last-minute REQUIRED ABSTRACT!
• Document length: 3,073 words (no references and no title).
• Summary length: 135 words.
• Length of sentences extracted verbatim from the document: 81 words (60% of the summary).
What is a Summary?
• Informative summary
  • Purpose: replace the original document.
  • Example: executive summary.
• Indicative summary
  • Purpose: support a decision: do I want to read the original document, yes or no?
  • Example: headline, scientific abstract.
• Evaluative summary
  • Purpose: express the author's point of view on a given topic.
  • Example: "I think this document focuses more on …"
What Type of Summary?
• Two types of summary: abstract and extract.
• Abstract: a set of manually generated sentences.
• Extract: a set of sentences extracted from the document.
• Extract vs. abstract: an extracted summary remains closer to the original document, limiting the bias that might otherwise appear in a summary.
What Type of Summary? (Cont’d)
• Text summaries can also be categorized into two types:
• Query-relevant summaries:
  • The summary is created based on the terms in the input query.
  • Because they are "query-biased", they do not provide an overall sense of the document's content.
• Generic summaries:
  • A generic summary provides an overall sense of the document's contents and indicates which category it belongs to.
  • A good generic summary should cover the main topics of the document while keeping redundancy to a minimum.
  • This is a challenging task: it is generally hard to develop a high-quality generic summarization method.
Summarization Goals
• The goals of text summarizers can be categorized by their intent, focus, and coverage.
Summarization Goals (Cont’d)
• Intent: the potential use of the summary. Firmin and Chrzanowski divide a summary's intent into three main categories:
  • Indicative: gives an indication of the central topic of the original text, or enough information to judge the text's relevancy.
  • Informative: can serve as a substitute for the full document.
  • Evaluative: expresses the author's point of view on a given topic.
• Focus: is the summary generic or query-relevant?
• Coverage: the number of sentences that contribute to the summary.
Proposed Summarizers
• Three generic text summarization methods are presented.
• They create text summaries by ranking and extracting sentences from the original documents.
• Prior work:
  • SUMMARIZER 1: uses standard Information Retrieval (IR) methods to rank sentence relevance [Yihong Gong and Xin Liu, SIGIR 2001].
Proposed Summarizers (Cont’d)
• Proposed solutions:
  • SUMMARIZER 2: uses the IR TF*IDF weighting scheme to rank sentences and selects the top sentences to form a summary.
  • SUMMARIZER 3: uses the popular k-means clustering algorithm, where k is the number of sentences in the desired summary, and selects from each cluster the sentence with the highest TF*IDF weight (the sum of the weights of all terms in the sentence).
  • SUMMARIZER 4: uses the k-means clustering algorithm and generates a summary with a new k-NN-based classification algorithm.
Summarization Approach
• Each summarizer tries to select sentences that cover the main topics of the document.
• The summarization process follows a common procedure:
  • Segmentation: decompose the document into individual sentences and use these sentences to form the candidate sentence set S.
  • Vectorization: create (1) the weighted term-frequency vector Si for each sentence i ∈ S, and (2) the weighted term-frequency vector D for the whole document.
Vectorization: An IR Model
• Collect the set of all terms in the whole document and let n be the cardinality of this set.
• Each term represents one dimension of an n-dimensional space.
• The document and each sentence are vectors in this space:
  • D = (d1, d2, d3, d4, ..., dn)
  • Si = (si1, si2, si3, si4, ..., sin)
Vectorization: An IR Model (Cont’d)
• Possible similarity measure: cosine similarity, the inner product of D and Si divided by the product of their norms.
• Other measures: Euclidean distance.
[Figure: document vector D and sentence vector Si plotted in a three-term space with axes T1, T2, T3.]
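The vector-space measures above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; the vocabulary size and the example term frequencies are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean(a, b):
    """Euclidean distance, an alternative (dis)similarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Document vector D and a sentence vector Si over a 3-term vocabulary.
D  = [7, 3, 5]   # frequencies of terms T1, T2, T3 in the whole document
Si = [2, 0, 4]   # frequencies of the same terms in sentence i
print(cosine(D, Si))
```

Identical vectors give a cosine of 1, orthogonal ones give 0, so a sentence scoring near 1 against D shares the document's dominant terms.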
SUMMARIZER 1
• The main steps of SUMMARIZER 1 are:
  1. For each sentence i ∈ S, compute the relevance measure between Si and D: inner product, cosine similarity, or Jaccard coefficient.
  2. Select the sentence Sk with the highest relevance score and add it to the summary.
  3. Delete Sk from S and eliminate all the terms contained in Sk from the document vector and the sentence vectors. Re-compute the weighted term-frequency vectors (D and all Si).
  4. If the number of sentences in the summary reaches the predefined value, terminate; otherwise go to step 1.
SUMMARIZER 2
• This summarizer is the simplest of the proposed techniques.
• It uses the TF*IDF weighting scheme to select sentences. It works as follows:
  1. Create the weighted term-frequency vector Si for each sentence i ∈ S using TF*IDF (term frequency × inverse document frequency).
  2. Sum the TF*IDF scores of each sentence and rank the sentences.
  3. Select the predefined number of top-ranked sentences from S.
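A compact sketch of these three steps, with one simplifying assumption of mine: IDF is computed over the sentences themselves (treating each sentence as a "document"), since the slides do not say which collection the IDF statistics come from.

```python
import math

def summarizer2(sentences, size):
    """Rank sentences by their summed TF*IDF score and return the top
    `size` of them (steps 1-3 on the slide)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency of each term, counted over sentences.
    df = {}
    for toks in tokenized:
        for w in set(toks):
            df[w] = df.get(w, 0) + 1
    def tfidf_sum(toks):
        # Step 1 + 2: TF * IDF per term, summed over the sentence.
        return sum(toks.count(w) * math.log(n / df[w]) for w in set(toks))
    ranked = sorted(sentences,
                    key=lambda s: tfidf_sum(s.lower().split()),
                    reverse=True)
    return ranked[:size]               # step 3: top-ranked sentences
```

Terms that occur in every sentence get IDF = log(1) = 0, so the ranking is driven by terms that distinguish one sentence from the rest.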
SUMMARIZER 3
• This summarizer uses the popular k-means clustering algorithm, where k is the size of the summary.
• K-means:
  • Start with random positions for the k centroids.
  • Iterate until the centroids are stable:
    • Assign each point to its nearest centroid.
    • Move each centroid to the center of its assigned points.
[Figure: k-means example, iteration 0.]
SUMMARIZER 3 (Cont’d)
[Figures: the same k-means algorithm illustrated at iterations 1, 2, and 3.]
SUMMARIZER 3 (Cont’d)
• This summarizer works as follows:
  1. Create the weighted term-frequency vector Ai for each sentence Si using TF*IDF.
  2. Form a sentences-by-terms matrix and feed it to the k-means clustering algorithm to generate k clusters.
  3. Sum the TF*IDF scores of each sentence in each cluster.
  4. Pick the sentence with the highest TF*IDF score within each cluster and add it to the summary.
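The cluster-then-pick procedure can be sketched as below. This is an assumption-laden illustration: a tiny fixed-iteration k-means stands in for a full implementation, and plain term-count vectors stand in for the TF*IDF matrix of step 1.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: random initial centroids, then repeat
    assign-and-recenter for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # move centroid to the center of its assigned points
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return clusters

def summarizer3(vectors, sentences, k):
    """One sentence per cluster: the one with the highest summed weight
    (steps 2-4 on the slide)."""
    summary = []
    for cl in kmeans(vectors, k):
        if not cl:
            continue
        best_vec = max(cl, key=sum)
        summary.append(sentences[vectors.index(best_vec)])
    return summary
```

Because each summary sentence comes from a different cluster, the extract covers k distinct regions of the term space, which is how this method limits redundancy.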
SUMMARIZER 4
• This summarizer uses the Potential Attractor Class (PAC) technique to generate a summary.
• PAC is a k-nearest-neighbor (k-NN) based technique.
• How does k-NN work?
  • The training set includes a set of classes Ci; run k-means to generate the initial classes.
  • For each new item:
    • Examine the k items from the training classes that are nearest to this item.
    • Apply a decision rule to select the class to which the new item will belong.
  • k is determined empirically.
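The basic k-NN step described above, with simple majority voting as the decision rule, can be sketched as follows (labels, vectors, and the function name are illustrative).

```python
from collections import Counter

def knn_vote(train, query, k):
    """train: list of (vector, label) pairs. Return the majority label
    among the k nearest neighbors of `query` (squared Euclidean
    distance, which preserves the nearest-neighbor ordering)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The decision rules on the next slides (distance-weighted voting and PAC) replace only the final voting step; the neighbor search is the same.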
SUMMARIZER 4 (Cont’d)
[Figure: a new item "?" surrounded by training items from classes C1, C2, and C3.]
SUMMARIZER 4 (Cont’d)
• Decision rule: identify the class membership of the new item, i.e., what is its label?
  • Voting: the new item is assigned to the class with the largest number of items among the k closest neighbors.
  • Distance-weighted: an enhanced version of voting (next slide).
  • PAC: uses laws of physics to determine the membership of a new item (next slide).
SUMMARIZER 4 (Cont’d)
• Distance-weighted:
  • The new item is assigned to the class with the largest weight:
      Weighted Count(Ci) = Σ over neighbors j ∈ Ci of (dk − dj) / (dk − d0)
    where dk − dj is the distance gap between neighbor j of class Ci and the kth nearest neighbor, and dk − d0 is the gap between the first (nearest) neighbor and the kth nearest neighbor.
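A sketch of the distance-weighted rule above. The degenerate case where all k neighbors are equidistant (dk = d0) is handled by giving each neighbor weight 1, an assumption of mine since the slide does not cover it.

```python
def distance_weighted(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest
    neighbors. Each neighbor j contributes (dk - dj) / (dk - d0) to its
    class, so closer neighbors count for more; the class with the
    largest summed weight wins."""
    pairs = sorted(neighbors)            # nearest first
    d0, dk = pairs[0][0], pairs[-1][0]
    weights = {}
    for dj, label in pairs:
        w = 1.0 if dk == d0 else (dk - dj) / (dk - d0)
        weights[label] = weights.get(label, 0.0) + w
    return max(weights, key=weights.get)
```

Note how this differs from plain voting: with neighbors at distances 1 ("A"), 2 ("B"), 3 ("B"), class B has the majority, but A's single close neighbor gets weight 1 while B's two get 0.5 and 0, so A wins.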
SUMMARIZER 4 (Cont’d)
• PAC:
  • Step 1: get the k nearest neighbors.
  • Step 2: calculate the distance di between the new item q and the center of the nearest neighbors (within the k nearest) from each class.
  • Step 3: calculate the mass mi of the nearest neighbors from each class. This mass, per class, equals the number of nearest neighbors (within the k nearest) from that class.
  • Step 4: calculate the Class Force CF(Ci) that attracts sample q to each class, by analogy with gravitational attraction: CF(Ci) = mi / di².
  • Step 5: assign q to the class with the highest CF (the PAC decision rule).
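The five PAC steps can be sketched as below. The force formula CF(Ci) = mi / di² is a gravitational-style reading of "laws of physics" consistent with the mass and distance defined in steps 2-3; the original slide's exact formula was lost in extraction, so treat it and the zero-distance guard as assumptions.

```python
import math
from collections import defaultdict

def pac_classify(train, q, k):
    """train: list of (vector, label) pairs. Assign q to the class
    exerting the highest class force CF = mass / distance^2 among
    its k nearest neighbors (steps 1-5 on the slide)."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Step 1: the k nearest neighbors of q.
    neighbors = sorted(train, key=lambda t: dist(t[0], q))[:k]
    by_class = defaultdict(list)
    for vec, label in neighbors:
        by_class[label].append(vec)
    forces = {}
    for label, vecs in by_class.items():
        # Step 2: distance from q to the per-class neighbor center.
        center = [sum(col) / len(vecs) for col in zip(*vecs)]
        d = dist(q, center) or 1e-9       # guard: q exactly at center
        # Steps 3-4: mass = neighbor count; force = mass / d^2.
        forces[label] = len(vecs) / d ** 2
    # Step 5: the class with the highest class force wins.
    return max(forces, key=forces.get)
```

Like distance-weighted voting, PAC lets a small but very close group of neighbors outweigh a larger, more distant one, since force falls off with the square of the distance.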
Performance Evaluation
• Dataset: Document Understanding Conference (DUC) datasets from the National Institute of Standards and Technology (NIST).
• The dataset includes three sets of documents from each independent human evaluator/selector; each set has between 3 and 20 documents.
• Each selector builds a summary (abstract) of approximately 100 words for each document in the set.
• A sample of the DUC data was chosen for our evaluation: 2 sets of documents (one set from each of 2 selectors).
• The set from Selector 1 consists of 5 documents, whereas the set from Selector 2 contains 4 documents.
• For each document, we have a summary (abstract) from each selector.
Questions?

Thanks!