Multiple Document Summarization using Principal Component Analysis incorporating Semantic Vector Space Model
Presenter: Suhan Yu
Introduction
• The ‘information content’ of a document can be measured by the relationship between the document and a corpus of related documents.
• Multiple Document Summarization System:
  • Find the common topics in a corpus by matching sentences that say different things about the same topic.
Introduction
[Figure: system overview. Components shown: Statistical Vector Space Model; Action Word Classifier (WordNet; action words and objects); Semantic Vector Space Model; PCA; sentence scoring with (1) sentence-length cut-off feature, (2) position feature, (3) keyword weight.]
Introduction
• Analysis of single-document summarization:
  • Kupiec et al.
  • Estimate the probability that a sentence belongs in the summary.
• Analysis of multiple-document summarization:
  • Regina Barzilay et al., Dragomir R. Radev et al.
  • Summarize multiple documents on the same topic.
  • Match sentences with the same meaning to align multiple documents.
Statistical VSM construction
[Figure: matrix with dimensions m and n]
• Define each unique word as a feature; terms are assumed to be independent.
• Give a weight to each feature:
  • Cue-phrase keyword
  • Topic keyword
  • Term frequency
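As a rough illustration of the statistical VSM, the sketch below builds a sentence-by-term count matrix. It covers only the term-frequency weight; the cue-phrase and topic-keyword weights are not modeled. The function name `build_term_sentence_matrix`, the tokenization rule, and the choice of sentences as rows and terms as columns are assumptions made for this example, not details taken from the slides.

```python
import re
from collections import Counter


def build_term_sentence_matrix(sentences):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in sentence i."""
    # Assumed tokenization: lowercase alphabetic runs only.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Each unique word becomes one independent feature (one column).
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in sentences]
    for i, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            matrix[i][column[term]] = count  # raw term frequency as the weight
    return vocabulary, matrix


sentences = [
    "The flood broke the dam.",
    "The dam was rebuilt after the flood.",
]
vocabulary, matrix = build_term_sentence_matrix(sentences)
```

Treating every unique word as its own column is what encodes the independence assumption: no column carries information about any other.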
Semantic VSM construction
• The semantic VSM is built using WordNet.
• WordNet: http://www.globalwordnet.org/
  • An online lexical reference system in which English nouns, verbs, adjectives and adverbs are organized into synonym sets (synsets).
  • With the help of WordNet, we can easily classify the words that belong to the ACTION class.
• Knowledge base (KB)
  • A seed wordlist of words that denote appearance or disappearance, such as:
    • Destruction
    • Broke
Identification of Action Words
• Decide whether each word is an ACTION word or not.
[Figure: flowchart. Input: T = {t1, t2, …, tn}. Each term is checked against the seed wordlist and WordNet for appearance/disappearance; the "yes" branches lead to "Action Word".]
Identification of Action Words
• For example:
  • Determine whether ‘devastation’ is an action word.
  • From WordNet, the following meanings are obtained:
    • Desolation: an event that results in total destruction
    • Ravaging
    • Destruction
  • The meanings of Desolation and Destruction clearly fall under the phenomenon of appearance/disappearance.
  • Therefore ‘devastation’ is an action word, and it is appended to the wordlist.
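The check described above can be sketched as follows. This is only an illustration under stated assumptions: `TOY_LEXICON` is a hypothetical hand-written stand-in for WordNet lookups (its 'devastation' entry copies the meanings listed on the slide), and `identify_action_words` is a name invented for this example. A real system would query WordNet instead of a dictionary literal.

```python
import re

# Hypothetical stand-in for WordNet: term -> related words and glosses.
TOY_LEXICON = {
    "devastation": [
        "desolation: an event that results in total destruction",
        "ravaging",
        "destruction",
    ],
    "river": ["a large natural stream of water"],
}


def identify_action_words(terms, seed_words, lexicon):
    """Return (action_words, updated_seed_set) for the given terms."""
    seeds = {w.lower() for w in seed_words}
    action_words = []
    for term in terms:
        key = term.lower()
        if key in seeds:
            # Already listed as an appearance/disappearance word.
            action_words.append(term)
            continue
        related_text = " ".join(lexicon.get(key, []))
        related_tokens = set(re.findall(r"[a-z]+", related_text.lower()))
        if related_tokens & seeds:
            # A meaning mentions a seed word: treat the term as an action
            # word and append it to the wordlist.
            action_words.append(term)
            seeds.add(key)
    return action_words, seeds


action_words, seeds = identify_action_words(
    ["devastation", "river"], ["destruction", "broke"], TOY_LEXICON
)
```

Because newly found action words are added to the seed set, later terms can match against them as well, which is one plausible reading of "append devastation in the wordlist".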
Finding the Objects of the Action
• Find the objects:
  • The objects are the nouns or adjectives nearest to the action word.
  • A POS tagger is used to find them: http://ilk.uvt.nl/~zavrel/tagtest.html
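A minimal sketch of the nearest-noun-or-adjective rule is shown below. It assumes the sentence has already been tagged and that the tags follow the Penn Treebank convention (noun tags start with NN, adjective tags with JJ); the slides do not say which tagset is used. The function name `nearest_objects` and the tie handling (return all candidates at the minimum distance) are choices made for this example.

```python
def nearest_objects(tagged_tokens, action_index):
    """Return the noun/adjective tokens closest to the action word."""
    candidates = [
        (abs(i - action_index), token)
        for i, (token, tag) in enumerate(tagged_tokens)
        # Skip the action word itself; keep nouns (NN*) and adjectives (JJ*).
        if i != action_index and tag.startswith(("NN", "JJ"))
    ]
    if not candidates:
        return []
    best = min(distance for distance, _ in candidates)
    # Keep every candidate at the minimum distance, in sentence order.
    return [token for distance, token in candidates if distance == best]


# Hand-tagged example sentence; the tags are assumed, not tagger output.
tagged = [
    ("The", "DT"), ("earthquake", "NN"), ("caused", "VBD"),
    ("devastation", "NN"), ("in", "IN"), ("coastal", "JJ"),
    ("villages", "NNS"),
]
objects = nearest_objects(tagged, action_index=3)
```

Here 'earthquake' and 'coastal' are both two tokens away from 'devastation', so both are returned.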
Classification of Contextual Words
• Contextual words are defined as those action words that apply to the important objects.
• Term frequency
Principal Component Analysis
• PCA is carried out using SVD.
[Figure: matrix with dimensions m and n]
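The standard relationship between SVD and PCA can be written as below. The notation is an assumption for illustration: X is taken to be an m × n data matrix whose columns have been mean-centered, with rows as observations, and the decomposition is the thin SVD. The slides do not state which of m and n indexes sentences and which indexes terms.

```latex
X = U \Sigma V^{\top},
\qquad
C = \frac{1}{m-1} X^{\top} X
  = V \left( \frac{\Sigma^{2}}{m-1} \right) V^{\top},
\qquad
X V = U \Sigma
```

Under these assumptions the columns of V are the principal directions, the variance along the k-th direction is \sigma_k^2 / (m-1), where \sigma_k is the k-th singular value, and U\Sigma gives the coordinates of the observations in the principal-component space.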
Sentence Extraction
• The following features are considered:
  • Sentence-length cut-off feature: consider only sentences longer than 4 words.
  • Position feature: consider whether the sentence is in the initial, middle or final part of the document.
  • Keywords: define a set of keywords, then count how many of them are present in the sentence.
  • Upper-case feature: sentences containing upper-case words are given additional weight.
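The four features above could be combined into a single score along the following lines. This is a sketch, not the scoring formula from the presentation: the additive combination, the position weights, the upper-case bonus, the split of the document into thirds, and the reading of "upper-case words" as fully capitalized words such as acronyms are all assumptions made for the example.

```python
import string


def score_sentence(sentence, index, total, keywords,
                   position_weights=(1.0, 0.5, 0.75), uppercase_bonus=0.5):
    """Score one sentence; index is its position among total sentences."""
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]
    # Sentence-length cut-off: only sentences longer than 4 words are scored.
    if len(words) <= 4:
        return 0.0
    # Position feature: initial, middle or final third (assumed split).
    if 3 * index < total:
        position = position_weights[0]
    elif 3 * index < 2 * total:
        position = position_weights[1]
    else:
        position = position_weights[2]
    # Keyword feature: number of words that are in the keyword set.
    keyword_hits = sum(1 for w in words if w.lower() in keywords)
    # Upper-case feature: bonus if any word is fully upper case (assumed reading).
    has_upper = any(len(w) > 1 and w.isupper() for w in words)
    return position + keyword_hits + (uppercase_bonus if has_upper else 0.0)


sentences = [
    "The UN reported total destruction across the flooded region.",
    "Rain fell.",
    "Relief teams reached the region after the flood receded.",
]
keywords = {"flood", "destruction", "region"}
scores = [
    score_sentence(s, i, len(sentences), keywords)
    for i, s in enumerate(sentences)
]
```

The highest-scoring sentences would then be extracted for the summary; the second sentence scores zero because it fails the length cut-off.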
Single Document Summary
• Compared with the MS Word Summarizer and the Gnome Summarizer.
Conclusion and Future work
• The semantic VSM is better than the statistical VSM.
• Rearrange the extracted sentences in multiple-document summarization to form an effective summary.
• Enhance the flexibility of the system so that it can summarize multiple documents that do not necessarily belong to the same topic.
• Develop a better methodology for incorporating the ACTION word score into the statistical VSM.
• Evaluate the system on a large sample of data.