Automatic Text Summarization • By: Asef Poormasoomi, autumn 2009
Introduction • Summary: a brief but accurate representation of the contents of a document
Motivation • Abstracts for scientific and other articles • News summarization (mostly multi-document summarization) • Classification of articles and other written data • Web pages for search engines • Web access from PDAs and cell phones • Question answering and data gathering
Genres • Extract vs. abstract • lists fragments of text vs. re-phrases content coherently. • example: “He ate a banana, an orange and an apple” => “He ate fruit” • Generic vs. query-oriented • provides the author’s view vs. reflects the user’s interest. • example: question answering system • Personal vs. general • considers the reader’s prior knowledge vs. assumes a general reader. • Single-document vs. multi-document source • based on one text vs. fuses together many texts. • Indicative vs. informative • used for quick categorization vs. content processing.
Summarization in 3 steps (Lin and Hovy 1997) • Content/Topic Identification • goal: find/extract the most important material. • techniques: methods based on position, cue phrases, concept counting, word frequency. • Conceptual/Topic Interpretation • application: only for abstract summaries • methods: merging or fusing related topics into more general ones, removing redundancies, etc. • example: • “He sat down, read the menu, ordered, ate and left” => “He visited the restaurant.” • Summary Generation • say it in your own words • simple if extraction is performed
Methods • Statistical scoring methods (Pseudo) • Higher semantic/syntactic structures • Network (graph) based methods • Other methods (rhetorical analysis, lexical chains, co-reference chains) • AI methods
Statistical scoring (Pseudo) • General method: • score each entity (sentence, word); • combine scores; • choose best sentence(s) • Scoring techniques: • Word frequencies throughout the text (Luhn 58) • Position in the text (Edmundson 69, Lin & Hovy 97) • Title method (Edmundson 69) • Cue phrases in sentences (Edmundson 69) • Bayesian classifier (Kupiec et al. 95)
Word frequencies (Luhn 58) • Very first work in automated summarization • Claim: words which are frequent in a document indicate the topic discussed • Clusters of frequent words indicate a summarizing sentence • Stemming should be used • “Stop words” (e.g. “the”, “a”, “for”, “is”) are ignored
Word frequencies (Luhn 58) • Calculate term frequency in the document: f(term) • Calculate inverse log-frequency in the corpus: if(term) • Words with a high f(term) × if(term) are indicative • The sentence with the highest sum of weights is chosen
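The frequency-based scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the stop-word list, the regex tokenizer, the smoothed logarithm used for if(term), and the function names are all assumptions made for the example, and stemming is left out.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption; a real system would use a fuller one).
STOP_WORDS = {"the", "a", "an", "for", "is", "of", "in", "and", "to"}

def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def best_sentence(sentences, corpus):
    """Score each sentence by the sum of f(term) * if(term) over its words."""
    doc_tokens = [tokenize(s) for s in sentences]
    tf = Counter(w for toks in doc_tokens for w in toks)   # f(term) in the document
    corpus_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_sets)

    def inverse_freq(word):
        # Smoothed inverse log-frequency in the corpus (one possible form of if(term)).
        df = sum(1 for d in corpus_sets if word in d)
        return math.log((1 + n_docs) / (1 + df))

    scores = [sum(tf[w] * inverse_freq(w) for w in toks) for toks in doc_tokens]
    return sentences[scores.index(max(scores))], scores

sentences = ["summarization summarization works", "weather"]
corpus = ["weather report", "weather today", "sports"]
best, scores = best_sentence(sentences, corpus)
```

Here "weather" is common in the corpus, so it gets a low weight, and the first sentence is chosen.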
Position in the text (Edmundson 69, Lin & Hovy 97) • Claim: important sentences occur in specific positions • Position depends on the type (genre) of text • the inverse of the position in the document works well for news • Important information occurs in specific sections of the document (introduction/conclusion) • Assign a score to sentences according to their location in the paragraph • Assign a score to paragraphs and sentences according to their location in the entire text
Title method (Edmundson 69) • Claim: the title of a document indicates its content (Duh!) • words in the title help find relevant content • create a list of title words, remove “stop words” • use those as keywords to find important sentences
Cue phrases method (Edmundson 69) • Claim: important sentences contain cue words/indicative phrases • “The main aim of the present paper is to describe…” (IND, indicative) • “The purpose of this article is to review…” (IND) • “In this report, we outline…” (IND) • “Our investigation has shown that…” (INF, informative) • Some words are considered bonus, others stigma • bonus: comparatives, superlatives, conclusive expressions, etc. • stigma: negatives, pronouns, etc. • Implemented for French (Lehman ’97) • Paice implemented a dictionary of <cue, weight> pairs • Grammar for indicative expressions • In + skip(0) + this + skip(2) + paper + skip(0) + we + ... • Cue words can be learned (Teufel ’98)
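One way to read the indicative-expression grammar above is as a word pattern in which skip(n) lets up to n arbitrary words intervene. The sketch below implements that reading; the interpretation of skip(n), the list encoding of the pattern, and the function name are assumptions for illustration only.

```python
def match_at(pattern, tokens, start=0):
    """Return True if the pattern matches the tokens beginning at index `start`.

    The pattern alternates words and maximum skip counts, e.g.
    ["in", 0, "this", 2, "paper", 0, "we"] for
    In + skip(0) + this + skip(2) + paper + skip(0) + we.
    """
    if not pattern:
        return True
    word, rest = pattern[0], pattern[1:]
    if start >= len(tokens) or tokens[start] != word:
        return False
    if not rest:
        return True
    max_skip, rest = rest[0], rest[1:]
    # Try skipping 0..max_skip words before the next pattern word.
    return any(match_at(rest, tokens, start + 1 + k) for k in range(max_skip + 1))

PATTERN = ["in", 0, "this", 2, "paper", 0, "we"]
hit = match_at(PATTERN, "in this short survey paper we outline".split())
miss = match_at(PATTERN, "in this very long and dull paper we outline".split())
```

The first sentence matches because only two words separate "this" and "paper"; the second does not because four words intervene.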
Feature combination (Edmundson ’69) • Linear contribution of 4 features • title, cue, keyword, position • the weights are adjusted using training data with any minimization technique • The following results were obtained • best system • cue + title + position
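A linear combination of the four features might look like the sketch below. The feature definitions (simple word counts and inverse position), the weight values, and all names are illustrative assumptions; the slide only states that the weights are fitted on training data.

```python
import re

def words(text):
    """Lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def sentence_features(sentence, index, title, cue_words, keywords):
    """Compute the four features for one sentence (index is 0-based)."""
    toks = words(sentence)
    title_words = set(words(title))
    return {
        "title": sum(t in title_words for t in toks),   # overlap with title words
        "cue": sum(t in cue_words for t in toks),       # cue words present
        "keyword": sum(t in keywords for t in toks),    # frequent-word hits
        "position": 1.0 / (index + 1),                  # inverse position
    }

def linear_score(features, weights):
    """Weighted sum of the features."""
    return sum(weights[name] * features[name] for name in weights)

# Hypothetical weights; keyword is set to 0 to mimic cue + title + position.
WEIGHTS = {"title": 1.0, "cue": 1.0, "keyword": 0.0, "position": 1.0}

feats = sentence_features(
    "This paper reviews text summarization.", 0,
    title="Automatic Text Summarization",
    cue_words={"paper"}, keywords={"summarization"},
)
score = linear_score(feats, WEIGHTS)
```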
Bayesian Classifier (Kupiec et al. 95) • Uses a Bayesian classifier: • Assuming statistical independence: • Higher-probability sentences are chosen to be in the summary • Performance: • For 25% summaries, 84% precision
Methods • Statistical scoring methods • problems: • Synonymy: one concept can be expressed by different words. • example: cycle and bicycle refer to the same kind of vehicle. • Polysemy: one word or concept can have several meanings. • example: cycle could mean life cycle or bicycle. • Phrases: a phrase may have a meaning different from the words in it. • example: an alleged murderer is not a murderer (Lin and Hovy 1997) • Higher semantic/syntactic structures • Network (graph) based methods • Other methods (rhetorical analysis, lexical chains, co-reference chains) • AI methods
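The formulas for the Bayesian classifier did not survive in this transcript. For reference, the classifier is usually written as follows, where s is a sentence, S the summary, and F_1, …, F_k the sentence features; the notation is supplied here from the standard statement of the method, not from the slide. The first line is Bayes' rule; the second applies the independence assumption.

```latex
\[
P(s \in S \mid F_1, \dots, F_k)
  = \frac{P(F_1, \dots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \dots, F_k)}
\]
\[
P(s \in S \mid F_1, \dots, F_k)
  = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}
\]
```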
Higher semantic/syntactic structures • Claim: Important sentences/paragraphs are the highest connected entities in more or less elaborate semantic structures. • Classes of approaches • lexical similarity (WordNet, lexical chains); • word co-occurrences; • co-reference; • combinations of the above.
Lexical chain • Lexical cohesion (Halliday and Hasan): • reiteration • synonym • antonym • hypernym • collocation • co-occurrence • example: “He works as a teacher at a school” (translated from the Persian original) • Lexical chain: • a sequence of words which have lexical cohesion (reiteration/collocation)
Lexical chain • Method for creating chains: • Select a set of candidate words from the text. • For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chains and the candidate word. • If such a chain is found, insert the word in this chain and update it accordingly; else create a new chain. • Scoring the chains: • synonym = 10, antonym = 7, hyponym = 4 • Strong chains must be selected • Sentence selection for the summary • H1: select the first sentence that contains a member of a strong chain • example chain: AI=2; Artificial Intelligence=1; Field=7; Technology=1; Science=1 • H2: select the first sentence that contains a “representative” (frequent) member of the chain • H3: identify a text segment where the chain is highly dense (density is the proportion of words in the segment that belong to the chain)
Lexical chain • example: Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
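The chain-building procedure from the lexical chain slides can be sketched as a greedy loop, shown here on a few candidate words taken from the Mr. Kenny passage. The RELATED table is a hand-made stand-in for a thesaurus lookup, scoring reiteration at 10 is an assumption (the slide lists only synonym, antonym and hyponym weights), and the greedy "best gain" choice is one simple relatedness criterion among many.

```python
# Hypothetical relation table standing in for a thesaurus such as WordNet.
RELATED = {
    frozenset(("machine", "device")): "synonym",
    frozenset(("machine", "pump")): "hyponym",
    frozenset(("device", "pump")): "hyponym",
}
# synonym/antonym/hyponym weights from the slide; reiteration=10 is assumed.
WEIGHTS = {"reiteration": 10, "synonym": 10, "antonym": 7, "hyponym": 4}

def relation(a, b):
    """Return the relation name between two words, or None if unrelated."""
    if a == b:
        return "reiteration"
    return RELATED.get(frozenset((a, b)))

def build_chains(candidates):
    """Insert each candidate into the most related chain, else start a new one."""
    chains = []  # each chain: {"words": [...], "score": int}
    for word in candidates:
        best = None  # (chain, gain)
        for chain in chains:
            rels = (relation(word, member) for member in chain["words"])
            gain = max((WEIGHTS[r] for r in rels if r), default=0)
            if gain and (best is None or gain > best[1]):
                best = (chain, gain)
        if best:
            best[0]["words"].append(word)
            best[0]["score"] += best[1]
        else:
            chains.append({"words": [word], "score": 0})
    return chains

chains = build_chains(["machine", "anesthetic", "machine", "device", "pump"])
strongest = max(chains, key=lambda c: c["score"])
```

The strongest chain can then feed heuristics H1 to H3, for example by picking the first sentence that contains one of its members.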
Network based method (Salton et al. ’97) • Vector Space Model • each text unit is represented as a vector • standard similarity metric • Construct a graph of paragraphs or other entities; the strength of a link is the similarity metric • Use a threshold to decide which paragraphs or entities are similar (pruning of the graph) • Paragraph selection heuristics • bushy path • select paragraphs with many connections to other paragraphs and present them in text order • depth-first path • select one paragraph with many connections; select a connected paragraph (in text order) which is also well connected; continue
Text relation map • [Figure: text relation map over paragraphs A to F. Pairs with sim > thr are linked; pairs with sim < thr are not. Values shown per node: A=3, B=1, C=2, D=1, E=3, F=2.]
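The bushy-path heuristic on a text relation map can be sketched as follows. Raw term-count vectors with cosine similarity, whitespace tokenization, and the threshold value are assumptions for the example; they stand in for whatever vector weighting and metric a real system would use.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def bushy_path(paragraphs, threshold, k):
    """Keep links with similarity above the threshold, then return the k
    paragraphs with the most links, presented in text order."""
    n = len(paragraphs)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(paragraphs[i], paragraphs[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    top = sorted(range(n), key=lambda i: -degree[i])[:k]
    return [paragraphs[i] for i in sorted(top)]

paragraphs = [
    "text summarization methods",
    "graph methods for text",
    "cooking pasta recipes",
    "summarization of text with graph",
]
summary = bushy_path(paragraphs, threshold=0.3, k=3)
```

The cooking paragraph shares no terms with the others, so it has no links and is left out.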
Motivation • Summaries which are generic in nature do not cater to the user’s background and interests • Results show that each person has a different perspective on the same text • Marcu 1997: found that the percent agreement of 13 judges over 5 texts from Scientific American is 71 percent. • Rath 1961: found that extracts selected by four different human judges had only 25 percent overlap • Salton 1997: found that the 20 most important paragraphs extracted by 2 subjects have only 46 percent overlap
User Feedback • Data click: • when a user clicks on a document, the document is considered to be of more interest to the user than the unclicked ones • Query history: • the most widely used implicit user feedback at present. • example: http://www.google.com/psearch • Attention time: • often referred to as display time or reading time • Other types of implicit user feedback: • scrolling, annotation, bookmarking and printing behaviors
Summarization Using Data Click • Use the extra knowledge in clickthrough data to improve Web-page summarization • A collection of clickthrough data can be represented by a set of triples <u, q, p> (user u, query q, clicked page p) • Typically, a user’s query words reflect the true meaning of the target Web-page content • Problems: • incomplete click problem • noisy click data
Attention Time • Main idea • The key idea is to rely on the attention (reading) time that individual users spend on single words in a document. • The prediction of user attention over every word in a document is based on the user’s attention during previous reads • The algorithm tracks a user’s attention times over individual words using a vision-based commodity eye-tracking mechanism. • It uses a simple web camera and an existing eye-tracking algorithm (the Opengazer project) • The error of the detected gaze location on the screen is 1–2 cm, depending on which area of the screen the user is looking at (on a 19” monitor).
Attention Time • Anchoring gaze samples onto individual words • The detected central gaze point is positioned at (x, y) in screen space • Compute the central display point of each word, denoted (xi, yi). • Each gaze sample detected by the eye-tracking module is assigned to the words in the document in this manner. • The overall attention that a word in the document receives is the sum of all the fractional gaze samples assigned to it in the above process • During processing, stop words are removed.
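The exact rule for splitting a gaze sample among nearby words is not preserved on the slide. The sketch below assumes a Gaussian weighting by distance between the gaze point (x, y) and each word center (xi, yi), normalized so that every sample contributes a total of one; the kernel, the sigma value (in pixels) and the function name are assumptions, not the method of the original work.

```python
import math

def assign_gaze(gaze_points, word_centers, sigma=50.0):
    """Distribute each gaze sample over the words as fractional shares.

    Assumed weighting: Gaussian in the distance between the gaze point and
    each word center, normalized per sample. Returns one attention total
    per word.
    """
    attention = [0.0] * len(word_centers)
    for gx, gy in gaze_points:
        weights = [
            math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * sigma ** 2))
            for x, y in word_centers
        ]
        total = sum(weights)
        if total == 0:
            continue  # gaze too far from every word to count
        for i, w in enumerate(weights):
            attention[i] += w / total
    return attention

centers = [(100, 100), (300, 100)]          # centers of two words on screen
gazes = [(100, 100), (102, 100)]            # two gaze samples near the first word
attention = assign_gaze(gazes, centers)
```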
Attention Time • Attention time prediction for a word is based on the semantic similarity between words. • For an arbitrary word w which is not among the words wi (i = 1,…, n) the user has already read, calculate the similarity between w and every wi • Select the k words which share the highest semantic similarity with w. • Predicting user attention for sentences
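The word-level prediction step can be sketched as a nearest-neighbour estimate. How the k most similar words are combined is not stated on the slide; a similarity-weighted average is assumed here, and the similarity function is passed in as a parameter because the actual semantic measure is not given. The character-overlap measure in the usage lines is a toy stand-in.

```python
def predict_attention(word, known, similarity, k=3):
    """Predict the attention time for `word` from the k most similar known words.

    known: dict mapping already-read words to measured attention times.
    similarity: function (word_a, word_b) -> non-negative float.
    """
    if word in known:
        return known[word]
    ranked = sorted(known, key=lambda w: similarity(word, w), reverse=True)[:k]
    total = sum(similarity(word, w) for w in ranked)
    if total == 0:
        return 0.0  # nothing similar has been read before
    return sum(similarity(word, w) * known[w] for w in ranked) / total

def char_overlap(a, b):
    """Toy similarity: Jaccard overlap of character sets (illustration only)."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

known = {"cat": 2.0, "car": 4.0, "dog": 10.0}
predicted = predict_attention("cab", known, char_overlap, k=2)
```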
Other types of implicit user feedback • Extract the personal information of the user using information available on the web • Submit the person’s full name to a search engine (the name is enclosed in double quotes, such as “Albert Einstein”) • The top n documents are retrieved. • After stop-word removal and stemming, a unigram language model is learned on the extracted text content. • User-specific sentence scoring: • sentence score:
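The sentence-score formula is missing from the slide, so the following is only a sketch of how a unigram language model learned from the user's web pages could score sentences. Add-one smoothing and the average log-probability score are assumptions chosen for the example, as are the function names and the toy profile text.

```python
import math
from collections import Counter

def unigram_model(tokens):
    """Return a function giving the add-one-smoothed probability of a word."""
    counts = Counter(tokens)
    total = len(tokens)
    vocab = len(counts) + 1  # one extra slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def sentence_score(sentence_tokens, prob):
    """Average log-probability of the sentence's words under the user model."""
    if not sentence_tokens:
        return float("-inf")
    return sum(math.log(prob(w)) for w in sentence_tokens) / len(sentence_tokens)

# Toy user profile text (after stop-word removal and stemming, in a real system).
profile = "parsing translation parsing language".split()
prob = unigram_model(profile)
nlp_score = sentence_score(["parsing", "language"], prob)
security_score = sentence_score(["firewall", "encryption"], prob)
```

A sentence using the user's own vocabulary scores higher than one about an unrelated field.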
Other types of implicit user feedback • Example • The topic of summary generation is “Microsoft to open research lab in India” • 8 articles published in different news sources form the news cluster • User A is from the NLP domain and User B from the network security domain. • Generic summary: The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. In Line With Microsoft’s Research Strategy Worldwide, The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said. Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company
Other types of implicit user feedback • User A specific summary: The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft’s Research Strategy Worldwide, The Bangalore Lab
Other types of implicit user feedback • User B specific summary: The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty
FarsiSum: A Persian text summarizer • By: Nima Mazdak, Martin Hassel • Department of Linguistics, Stockholm University, 2004
FarsiSum • Tokenizer: Sentence boundaries are found by searching for periods, exclamation marks, question marks, <BR> (the HTML new line) and the Persian question mark (؟). Word boundaries are found by searching for “.”, “,”, “!”, “?”, “<”, “>”, “:”, spaces, tabs and new lines • Sentence Scoring: Text lines are put into a key/value data structure called the text table
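A sentence splitter along the lines of the tokenizer description can be sketched with a regular expression. This is an illustration only, not FarsiSum's code: the pattern, the case-insensitive handling of <BR>, and the function name are assumptions.

```python
import re

# Sentence delimiters: period, exclamation mark, question mark,
# the Persian question mark, and the HTML line break <BR>.
SENTENCE_END = re.compile(r"(?:<BR>|[.!?؟])+", re.IGNORECASE)

def split_sentences(text):
    """Split text at sentence delimiters and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

parts = split_sentences("Hello world. How are you?<BR>Fine!")
```

As the notes on the current implementation point out, treating every period as a boundary also splits abbreviations and acronyms, and this sketch shares that limitation.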
FarsiSum • Sentence Scoring: • Word score = (word frequency) * (a keyword constant) • Sentence score = Σ word score (for all words in the current sentence) • Average sentence length (ASL) = Word-count / Line-count • Final sentence score = (ASL * Sentence score) / (number of words in the current sentence)
FarsiSum • Notes on the current implementation: • Word boundary ambiguity: • A full stop (.) marks a sentence boundary, but it may also appear in abbreviations or acronyms. • Compound words and light verb constructions may appear with or without a space. • Ambiguity in morphology • Word order: • The canonical word order in Persian is SOV, but Persian is a free word order language • Possessive construction
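The scoring formulas above translate directly into code. The sketch below assumes whitespace tokenization, a single keyword constant applied to every word, and one sentence per text line; the function name and default value are illustrative.

```python
from collections import Counter

def farsisum_scores(sentences, keyword_constant=1.0):
    """Length-normalized frequency scores following the formulas above."""
    tokenized = [s.split() for s in sentences]
    if not tokenized:
        return []
    freq = Counter(w for toks in tokenized for w in toks)
    word_count = sum(len(toks) for toks in tokenized)
    asl = word_count / len(tokenized)  # average sentence length
    scores = []
    for toks in tokenized:
        # Sentence score = sum of word scores (frequency * keyword constant).
        raw = sum(freq[w] * keyword_constant for w in toks)
        # Final score = ASL * sentence score / number of words in the sentence.
        scores.append(asl * raw / len(toks) if toks else 0.0)
    return scores

scores = farsisum_scores(["a b a", "c"])
```

Multiplying by ASL and dividing by the sentence's own length keeps long sentences from winning merely because they contain more words.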