Summarization Technologies and Evaluation (自動摘要技術與評估) Hsin-Hsi Chen (陳信希), Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程學系)
Outline • Introduction • Architecture of a Summarization System • An Evaluation Model Using a Question Answering System • Multilingual News Summarizer • Conclusion
Why is Summarization Needed? • Owing to the widespread use of the Internet, a vast amount of multilingual information can be obtained • Summarization helps relieve bottlenecks on the information highway • Issues • How to absorb and employ information effectively • How to tackle the problem of multilingual document clustering
Where is Summarization? • Headline • Table of contents • Preview • Digest • Highlights • Abstract • Bulletin • Biography • …
What is Text Summarization? • The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) • Extract vs. Abstract • An extract is a summary consisting entirely of material copied from the input • An abstract is a summary at least some of whose material is not present in the input, e.g., subject categories, paraphrases of content, etc.
Characteristics of Summaries • Reduction of information content • Compression Rates • Target Length • Informativeness • Fidelity to Source • Relevance to User’s Interest • Well-formedness • Extracts: need to avoid gaps, dangling anaphora, etc. • Abstracts: need to produce grammatical, plausible output
Evaluation of Summaries • Intrinsic methods test the system in itself • Criteria • Coherence • Informativeness • Methods • Comparison against reference output • Comparison against summary input • Extrinsic methods test the system in relation to some other task • Time to perform tasks, accuracy of tasks, ease of use • Expert assessment of usefulness in the task
Summarization Approaches • Research on text summarization began very early (Luhn, 1958; Edmundson, 1964, 1969) • Single Document Summarization • Chen, Lin, Huang, and Chen, 1998; Kupiec, Pedersen, and Chen, 1995; Lin and Hovy, 1997; Brunn, Chali, and Pinchak, 2001, etc. • Multiple Document Summarization • Chen and Huang, 1999; McKeown and Radev, 1995; Mani and Bloedorn, 1997; Radev and McKeown, 1998; Lin and Hovy, 2002, etc. • SUMMAC-1 (1998) - Evaluation Tasks • Document Understanding Conference (2000)
SUMMAC Evaluation • Extrinsic measures • Those that ignore the content of the summary and assess it solely according to how useful it is in enabling an agent to perform some measurable task • Ad Hoc Task • Support a relevance judgment • Categorization Task • Support a categorization judgment • Intrinsic measures • Those that examine the content of the summary and attempt to pass some judgment on it directly • Question-Answering Task • Acceptability Task
Issues • In multi-document summarization • Deciding which documents deal with the same topic and which sentences touch on the same event is indispensable • How to measure similarity at different levels (i.e., words, sentences, and documents) • In multilingual multi-document summarization • Due to the multilinguality problem, how to measure the similarity of concepts, themes, and topics expressed in different languages • Because human assessors are involved, large-scale evaluation is nearly impossible
A News Clusterer • The tasks of the clusterer are listed below • Employing a segmentation system to identify Chinese words • Extracting named entities such as people, places, organizations, and time, date, and monetary expressions • Applying a tagger to determine the part of speech of each word • Clustering the news stream based on the named entities and other signatures, such as speech-act and locative verbs (see the sketch below)
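A minimal sketch of the clustering step, assuming a single-pass grouping by named-entity overlap with a Jaccard threshold; the `extract_entities` hook and the threshold value are illustrative assumptions, not details given in the slides:

```python
def entity_overlap(a: set, b: set) -> float:
    """Jaccard overlap between two named-entity sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_stream(stories, extract_entities, threshold=0.3):
    """Single-pass clustering: each incoming story joins the first
    cluster whose named-entity signature overlaps enough, otherwise
    it starts a new cluster.  extract_entities(story) is assumed to
    return a set of people/place/organization/date expressions."""
    clusters = []  # each cluster: (entity signature set, list of stories)
    for story in stories:
        entities = extract_entities(story)
        for signature, members in clusters:
            if entity_overlap(entities, signature) >= threshold:
                members.append(story)
                signature |= entities  # grow the cluster signature in place
                break
        else:
            clusters.append((set(entities), [story]))
    return clusters
```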
A News Summarizer • The tasks of the news summarizer are as follows: • Partitioning a Chinese text into several meaningful units (MUs) • Chinese writers often assign punctuation marks rather freely, so sentence boundaries are unclear; MUs are therefore used for clustering instead of sentences • An MU, composed of several sentence segments, denotes one complete meaning
A News Summarizer • Partitioning a Chinese text into several meaningful units (MUs) • Three kinds of linguistic knowledge are used to identify MUs • Punctuation marks (Yang, 1984) • Linking elements (Li and Thompson, 1981) • 因為天氣不好，飛機改在明天起飛。 (Because the weather is bad, the flight is rescheduled to take off tomorrow.) • 我想早一點來，可是我沒趕上公車。 (I wanted to come earlier, but I missed the bus.) • 他一邊走路，一邊唱歌。 (He sings while he walks.) • Topic chains (Chen, 1994) • 國民黨是靠組織起家的政黨，現在的組織體質卻很虛弱，所以選戰最後也要仰仗文宣，實在很可惜。 (The KMT is a party built on organization, yet its organizational strength is now weak, so the campaign must ultimately rely on publicity, which is a pity.)
A News Summarizer • For example: (A) 儘管警方大肆鎮壓與逮捕，無數反對自由貿易的示威群眾今天仍繼續向西雅圖市中心前進，他們發動和平集會，以抗議世界各國貿易部長即將在西雅圖舉行討論全球貿易自由化的會議。 (Despite heavy police suppression and arrests, countless demonstrators against free trade continued to march toward downtown Seattle today; they staged a peaceful rally to protest the meeting on global trade liberalization about to be held in Seattle by trade ministers from around the world.) => (A1) 儘管警方大肆鎮壓與逮捕，無數反對自由貿易的示威群眾今天仍繼續向西雅圖市中心前進 (A2) 他們發動和平集會，以抗議世界各國貿易部長即將在西雅圖舉行討論全球貿易自由化的會議 • A partitioning sketch follows
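A rough sketch of the punctuation-based part of MU partitioning, assuming segments are cut at Chinese sentence-internal punctuation and re-attached when they begin with a linking element; the linking-element list here is a small illustrative sample, not the system's full inventory:

```python
import re

# Illustrative linking elements: 可是 "but", 所以 "so", 以 "in order to",
# 一邊 "while", 就 "then".  A segment starting with one of these is
# attached to the previous segment, since the pair forms one MU.
LINKING_ELEMENTS = ("可是", "所以", "以", "一邊", "就")

def partition_into_mus(text: str):
    """Split text at Chinese punctuation, then merge segments that
    open with a linking element back into the preceding MU."""
    segments = [s for s in re.split("[，。；！？]", text) if s]
    mus = []
    for seg in segments:
        if mus and seg.startswith(LINKING_ELEMENTS):
            mus[-1] += "，" + seg   # continue the current MU
        else:
            mus.append(seg)        # start a new MU
    return mus
```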
A News Summarizer • Linking meaningful units that denote the same thing across different news reports • The similarity of two MUs is measured in terms of noun-similarity and verb-similarity (see the reconstruction below) • m (n): the number of matched nouns (verbs) • a, b: total numbers of nouns in MUs A and B • c, d: total numbers of verbs in MUs A and B
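The similarity formulas themselves did not survive extraction; given these variable definitions, a Dice-style measure is the natural reading. The following is a reconstruction under that assumption, not necessarily the authors' exact equation:

```latex
\text{noun-sim}(A,B) = \frac{2m}{a+b}, \qquad
\text{verb-sim}(A,B) = \frac{2n}{c+d}
```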
A News Summarizer • Several strategies in the similarity model (partially implemented in the sketch below): (S1) Nouns in one MU are matched to nouns in another MU, and so are the verbs. (S2) The operations in (S1) are exact matches. (S3) A Chinese thesaurus is employed during the matching; that is, the operations in (S1) may be relaxed to inexact matches. (S4) Each term specified in (S1) is matched only once. (S5) The order of nouns and verbs in an MU is not taken into consideration. (S6) The order of nouns and verbs in an MU is critical, but it is relaxed within a window. (S7) When continuous terms are matched, an extra score is added to the similarity measure. (S8) When the objects of transitive verbs are not matched, a score is subtracted from the similarity measure. (S9) When date/time expressions and monetary and percentage expressions are matched, an extra score is added to the similarity measure.
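A sketch of how strategies (S1), (S3), and (S4) might combine with the Dice-style measure above; `synsets`, a mapping from words to thesaurus classes, stands in for the Chinese thesaurus and is an assumption of this sketch:

```python
def matched_count(terms_a, terms_b, synsets):
    """(S1) nouns match nouns, verbs match verbs; (S3) a match may be
    relaxed to same-thesaurus-class terms; (S4) each term in B is
    consumed at most once."""
    remaining = list(terms_b)
    matched = 0
    for t in terms_a:
        for i, u in enumerate(remaining):
            ct, cu = synsets.get(t), synsets.get(u)
            if t == u or (ct is not None and ct == cu):
                matched += 1
                del remaining[i]   # (S4): match each term only once
                break
    return matched

def mu_similarity(nouns_a, verbs_a, nouns_b, verbs_b, synsets):
    """Dice-style noun- and verb-similarity of two MUs."""
    m = matched_count(nouns_a, nouns_b, synsets)
    n = matched_count(verbs_a, verbs_b, synsets)
    noun_sim = 2 * m / ((len(nouns_a) + len(nouns_b)) or 1)
    verb_sim = 2 * n / ((len(verbs_a) + len(verbs_b)) or 1)
    return noun_sim, verb_sim
```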
A News Summarizer • Displaying the summarization results in two modes: • Focusing summarization • a sequence of news presented by information decay • Browsing summarization • The MUs that are reported more than twice are selected (a sketch follows) • For each set of similar MUs, only the longest sentence is used in the summary
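A minimal sketch of the browsing-mode voting selection, assuming the MU clusters produced by the similarity step:

```python
def browsing_summary(mu_clusters, min_reports=2):
    """Keep MU clusters reported more than `min_reports` times and
    represent each kept cluster by its longest member sentence."""
    summary = []
    for cluster in mu_clusters:           # cluster: list of similar MUs
        if len(cluster) > min_reports:    # "reported more than twice"
            summary.append(max(cluster, key=len))
    return summary
```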
Browsing Summarization • [Screenshots of the browsing mode applied to the first through third articles of an event; images omitted]
Experiment • Preparation of the Test Corpus • Nine events that occurred between 1998/11/7 and 1998/12/8 were manually selected from Central Daily News, China Daily Newspaper, China Times Interactive, and FTV News Online in Taiwan • Each event comprised more than two articles reported on the same day
Experiment • Preparation of Test Corpus • (1) 社會役的實施 (military service): 6 articles • (2) 老丙建建築 (construction permit): 4 articles • (3) 三芝鄉土石流 (landslide in Shan Jr): 6 articles • (4) 總統布希之子 (Bush's sons): 4 articles • (5) 芭比絲颱風侵台 (Typhoon Babis): 3 articles • (6) 股市穩定基金 (stabilization fund): 5 articles • (7) 國父墨寶失竊案 (theft of Dr Sun Yat-sen's calligraphy): 3 articles • (8) 央行調降利率 (interest rate of the Central Bank): 3 articles • (9) 內閣總辭問題 (the resignation issue of the Cabinet): 4 articles
Experiment • Preparation of the Test Corpus • An annotator reads all the news articles and connects the MUs that discuss the same story • Five models, shown below, are constructed under various combinations of the strategies specified above: • (M1) strategies (S1)+(S3)+(S4)+(S5) • (M2) strategies (S1)+(S3)+(S4)+(S6) • (M3) strategies (S1)+(S3)+(S4)+(S5)+(S7)+(S8) • (M4) strategies (S1)+(S3)+(S4)+(S5)+(S7)+(S8)+(S9) • (M5) strategies (S1)+(S2)+(S4)+(S5)+(S7)+(S8)+(S9)
Experiment • Performance of MU similarity (results table omitted) • The thresholds of noun-similarity and verb-similarity are both set to 0.3
Discussion • Some issues • The compression rate is fixed by the system • The presentation order of sentences in a summary is based on their relative positions in the original documents instead of their importance • The voting strategy gives a shorter summary, which might miss unique information reported only once
Generating Summaries with Informative Words • The concepts of topic words and event words have been applied successfully to topic ranking (Fukumoto and Suzuki, 2000) • An event word associated with a story appears across paragraphs, but a topic word does not • Topic words appear frequently across all documents
Generating Summaries with Informative Words • We define words that have both high term frequency and high document frequency as informative words (salience words) • Sentences that contain more informative words are extracted to generate summaries • The more informative words an MU has, the more important the MU is
Generating Summaries with Informative Words • The score function for deciding the informative words is defined over the following quantities: • ntf: normalized term frequency • DF: document frequency • tf: term frequency • mtf: mean term frequency
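The score function itself is garbled in the source; assuming the normalization suggested by the variable names (term frequency divided by mean term frequency, combined multiplicatively with document frequency), one plausible reading is:

```latex
ntf(t) = \frac{tf(t)}{mtf}, \qquad IW(t) = ntf(t) \times DF(t)
```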
Generating Summaries with Informative Words • Only the 10 terms with the highest IW scores are chosen as informative words in a document • The score of each MU is the total number of informative words it contains, and the MUs with the highest scores are selected (see the sketch below) • The selected MUs in a summary are arranged in descending order of score
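A sketch of informative-word selection and MU scoring under the reconstructed score above; `doc_freq`, holding document frequencies over the clustered event, is an assumed input:

```python
from collections import Counter

def informative_words(doc_terms, doc_freq, top_k=10):
    """Rank a document's terms by the reconstructed IW score
    (tf / mtf) * DF and keep the top_k terms."""
    tf = Counter(doc_terms)
    mtf = sum(tf.values()) / max(len(tf), 1)      # mean term frequency
    scores = {t: (f / mtf) * doc_freq.get(t, 1) for t, f in tf.items()}
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])

def score_mu(mu_terms, iw):
    """An MU's score is the number of informative words it contains;
    MUs are then selected and ordered by descending score."""
    return sum(1 for t in mu_terms if t in iw)
```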
Generating Summaries with Informative Words • Experiment results (QA task; results table omitted)
Generating Summaries with Informative Words • Experiment results • Data • Collected from 6 news sites in Taiwan • 17,877 documents (nearly 13 MB) from 1/1/2001 to 5/1/2001 • After clustering, there are 3,146 events • 12 events were selected randomly for the experiment, and 60 questions (5 per event) were made manually, with answers tied to their related documents
Generating Summaries with Informative Words • Experiment results • 12 members of our laboratory, all graduate students majoring in computer science, were selected to conduct the experiments below • Full text (FULL) • Chen and Huang's system (BASIC) • Term frequency with vote strategy (TFWV) • Informative words with vote strategy (PSWV) • Term frequency without vote strategy (TFNV) • Informative words without vote strategy (PSNV)
Generating Summaries with Informative Words • Experiment results (results table omitted)
Generating Summaries with Informative Words • Discussion • Observations • The summaries of TFNV and PSNV are larger than those of BASIC (by nearly 15%), but their precision rates are lower than BASIC's • The summaries of TFWV and PSWV are smaller than those of BASIC, yet their precision rates are still lower than BASIC's • The precision rates of TFWV and PSWV are both higher than those of TFNV and PSNV
Generating Summaries with Informative Words • Discussion • Due to the limitations and drawbacks of human assessment, evaluation of this kind may be misleading • Because assessors have different backgrounds, the evaluation is hard to keep objective • Fatigue and limited working time may cause assessors to quit reading or to read too fast and miss information • Due to the high cost of assessors, large-scale evaluation is nearly impossible
Model Using Question Answering System • Question Answering System (Lin et al., 2001) • Three major modules (the third is sketched below) • Preprocessing the question sentences • Part-of-speech processing, stop-word removal • Canonical-form transformation and keyword expansion • Retrieving the documents containing answers • score(D) = … • Retrieving the sentences containing answers • The sentences that contain the most words of the expanded question sentence are retrieved
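A sketch of the third module, ranking candidate answer sentences by overlap with the expanded question keywords; the document-level score(D) formula is elided in the slides, so it is not reproduced here:

```python
def sentence_score(sentence_terms, question_terms):
    """Count how many expanded question keywords the sentence contains."""
    return len(set(sentence_terms) & set(question_terms))

def retrieve_answer_sentences(tokenized_sentences, question_terms, top_k=5):
    """tokenized_sentences: list of token lists (word-segmented).
    Return the top_k sentences sharing the most words with the
    expanded question sentence."""
    return sorted(tokenized_sentences,
                  key=lambda toks: sentence_score(toks, question_terms),
                  reverse=True)[:top_k]
```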
Model Using Question Answering System • MRR: Mean Reciprocal Rank (Voorhees, 2000), defined below (results table omitted)
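For reference, MRR is the standard measure of Voorhees (2000): the mean over all questions of the reciprocal rank of the first correct answer, counting 0 for a question with no correct answer returned:

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}
```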
Discussion • The difference between Table 2 and Table 3 • The QA_MRR values of TFNV and PSNV are larger than those of the corresponding TFWV and PSWV • The QA_MRR values of PSWV and PSNV are larger than those of the corresponding TFWV and TFNV • Comparing the precision of the QA task with the corresponding precision of the best-5 strategy, the Q&A system is better than the QA task (i.e., 0.576 > 0.502 and 0.559 > 0.513, respectively)
Experiments Using Large Documents and Results • Data set • 140 new questions were made, of which 93 have been answered • Sample questions (examples omitted)
Experiments Using Large Documents and Results • Results • Table 4. Results with Large-Scale Data (table omitted)
Experiments Using Large Documents and Results • Discussion • Due to the increase in document-set size, the QA_MRR of all models decreases • Due to the noise in FULL, its QA_MRR drops drastically, whereas the other models' QA_MRR values increase compared with Table 3 • The QA_MRR values of TFWV, PSWV, TFNV, and PSNV are also larger than that of BASIC
Experiments Using Large Documents and Results • Discussion • The QA_MRR values of PSWV and PSNV are also larger than those of TFWV and TFNV, respectively • Since each model performs similarly across the small-scale and large-scale experiments, it is feasible to introduce a Q&A system into the evaluation of summarization
Basic Architecture • The major issues behind the system • How to represent documents in different languages • How to measure the similarity among document representations in different languages • The granularity of similarity computation • Visualization of the summarization