
Text Summarisation based on Human Language Technologies and its Applications

Text Summarisation based on Human Language Technologies and its Applications. Elena Lloret Pastor. Supervisor: Dr. Manuel Palomar. Seminar, June 2011. Outline: Introduction; State of the Art; COMPENDIUM Text Summarisation Tool; Evaluation and Experiments; COMPENDIUM in HLT Applications


Presentation Transcript


  1. Text Summarisation based on Human Language Technologies and its Applications • Elena Lloret Pastor • Supervisor: Dr. Manuel Palomar • Seminar - June 2011

  2. Outline • Introduction • State of the Art • COMPENDIUM Text Summarisation Tool • Evaluation and Experiments • COMPENDIUM in HLT Applications • Conclusion

  3. Introduction: MOTIVATION • Human Language Technologies (HLT) • Allow people to communicate with machines by using natural language (Cole, 1997) • Intelligent applications based on HLT • Information Retrieval • Question Answering • Text Classification • Opinion Mining • Text Summarisation

  4. Introduction: MOTIVATION • Why is Text Summarisation (TS) needed? • To condense information while keeping the most relevant content • To help users manage and process large amounts of information • The 2008 Summer Olympics took place in Beijing, China, from August 8 to August 24, 2008. A total of 11,028 athletes from 204 National Olympic Committees (NOCs) competed in 28 sports and 302 events. It was the third time that the Summer Olympic Games were held in Asia, after Tokyo, Japan in 1964 and Seoul, South Korea in 1988. The program for the Beijing Games was quite similar to that of the 2004 Summer Olympics held in Athens. There were 28 sports and 302 events. Moreover, there were 43 new world records and 132 new Olympic records set at the 2008 Summer Olympics. Chinese athletes won the most gold medals, with 51, and 100 medals altogether, while the United States had the most medals total with 110. There were many memorable champions but it was Michael Phelps and Usain Bolt who stole the headlines. • Source documents: http://en.wikipedia.org/wiki/2008_Summer_Olympics • http://en.beijing2008.cn/# • http://www.olympic.org/beijing-2008-summer-olympics • 17.500.000 results

  5. State of the Art: TYPES OF SUMMARIES

  6. State of the Art: TEXT SUMMARISATION PROCESS • Topic identification • What the document is about • Interpretation or topic fusion • Important topics are expressed using a new formulation • Summary generation • Natural Language Generation is applied to build the final summary

  7. State of the Art: GENERATION OF SUMMARIES • Approaches • Statistical-based: tf, tf*idf (e.g. Orăsan, 2009) • Topic-based: event words (e.g. Kuo & Chen, 2008) • Graph-based: LexRank (e.g. Erkan & Radev, 2004) • Discourse-based: lexical chains (e.g. Barzilay & Elhadad, 1999) • Machine learning-based: neural nets (e.g. Svore et al., 2007)

  8. State of the Art: GENERATION OF SUMMARIES • New types of summaries • Personalised summaries: user profiles (e.g. Díaz & Gervás, 2007) • Update summaries: “history” (e.g. Li et al., 2008) • Sentiment-based summaries: multi-aspect rating model (e.g. Titov & McDonald, 2008) • Survey summaries: Wikipedia articles (e.g. Sauper & Barzilay, 2009) • Abstractive summaries: sentence compression (e.g. Filippova, 2010)

  9. State of the Art: GENERATION OF SUMMARIES • New scenarios • Literary text: books (e.g. Ceylan & Mihalcea, 2009) • Patent claims (e.g. Trappey et al., 2009) • Image captioning (e.g. Aker & Gaizauskas, 2010) • Web 2.0 textual genres: blogs (e.g. Balahur et al., 2009)

  10. State of the Art: EVALUATION OF SUMMARIES • Types of evaluation • Intrinsic: evaluate the summary on its own • Informativeness assessment • Quality assessment • Extrinsic: evaluate how good the summaries are for performing other tasks • Pyramid • QARLA • ROUGE • Basic Elements • Indicativeness • Grammaticality • Coherence • Non-redundancy

  11. COMPENDIUM TS tool: TYPES OF SUMMARIES

  12. COMPENDIUM TS tool: ARCHITECTURE • Legend • Core stages • Additional stages • Input • Input for the additional stages • Types of summaries (output)

  13. COMPENDIUM TS tool: CORE STAGES (SURFACE LINGUISTIC ANALYSIS) • Pre-process the input text by employing state-of-the-art tools • Sentence segmentation • Tokenisation • Part-of-Speech tagging • Stemming • Stop word identification
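A minimal sketch of three of these pre-processing steps (sentence segmentation, tokenisation and stop word identification), using only the Python standard library. The regular expressions, the tiny `STOP_WORDS` set and the function names are illustrative assumptions, not the tools COMPENDIUM actually uses; part-of-speech tagging and stemming are left out because they need a dedicated library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use a fuller one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "was", "it"}

def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    """Lowercase alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def content_words(sentence):
    """Tokens of the sentence with stop words filtered out."""
    return [t for t in tokenise(sentence) if t not in STOP_WORDS]

if __name__ == "__main__":
    text = "The Games took place in Beijing. There were 28 sports."
    for sentence in split_sentences(text):
        print(content_words(sentence))
```

The naive splitter will break on abbreviations such as "Dr."; a production pipeline would normally rely on a trained sentence segmenter.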

  14. COMPENDIUM TS tool: CORE STAGES (REDUNDANCY DETECTION) • Identify and remove repeated information • Textual Entailment (TE) (Ferrández, 2009) • The main idea behind using TE to detect redundancy is that sentences whose meaning is already contained in other sentences can be discarded, as the information has been previously mentioned • T: The man was killed last week; H: The man is dead (TRUE) • T: The man was shot in his shoulder; H: The man is dead (FALSE)
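The filtering loop around the entailment check can be sketched as follows. Note the loud assumption: real textual entailment is replaced here by a crude word-containment test, so this stand-in would not recognise that "The man was killed last week" entails "The man is dead". The `threshold` parameter and the function names are hypothetical.

```python
import re

def words(sentence):
    """Set of lowercase alphanumeric tokens in the sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def remove_redundant(sentences, threshold=0.8):
    """Keep a sentence unless most of its words already occur in one
    previously kept sentence (a stand-in for an entailment decision)."""
    kept = []
    for sent in sentences:
        w = words(sent)
        redundant = any(
            w and len(w & words(prev)) / len(w) >= threshold
            for prev in kept
        )
        if not redundant:
            kept.append(sent)
    return kept

if __name__ == "__main__":
    doc = [
        "There were 28 sports and 302 events in Beijing.",
        "There were 28 sports and 302 events.",
        "Chinese athletes won the most gold medals.",
    ]
    print(remove_redundant(doc))
```

Swapping the overlap test for a call to an entailment engine would leave the surrounding loop unchanged.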

  15. COMPENDIUM TS tool: CORE STAGES (TOPIC IDENTIFICATION) • Identify the most relevant topics • Term frequency (Luhn, 1958) • The most frequent words (excluding stop words) can be considered the main topics of a document
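Term-frequency topic identification can be illustrated in a few lines. This is a sketch under simple assumptions: the stop-word list is a small hypothetical sample, tokens are not stemmed, and the function names are invented for the example.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the one used by the tool).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is", "was", "were"}

def term_frequencies(text):
    """Count how often each non-stop word occurs in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def top_topics(text, n=3):
    """The n most frequent non-stop words, taken as the main topics."""
    return [word for word, _ in term_frequencies(text).most_common(n)]

if __name__ == "__main__":
    text = ("Olympic records and world records were set. "
            "The Olympic Games had many records.")
    print(top_topics(text, 2))
```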

  16. COMPENDIUM TS tool: CORE STAGES (RELEVANCE DETECTION) • Compute a weight for each sentence, depending on its importance • The Code Quantity Principle (Givón, 1990) • Coding element: noun phrase • Sentences containing noun phrases that include highly frequent words will be considered more important • Score for each sentence
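The slide does not give the scoring formula, so the following is only one plausible reading: a sentence's score is the summed term frequency of the words inside its noun phrases, averaged over the number of noun phrases. The noun phrases are assumed to come from an upstream chunker and are passed in as lists of words; `sentence_score` is a hypothetical name.

```python
def sentence_score(noun_phrases, tf):
    """Average, over the sentence's noun phrases, of the summed
    frequencies of the words those phrases contain.

    noun_phrases: list of noun phrases, each a list of lowercase words
    tf: mapping from word to its frequency in the document
    """
    if not noun_phrases:
        return 0.0
    total = sum(tf.get(word, 0) for phrase in noun_phrases for word in phrase)
    return total / len(noun_phrases)

if __name__ == "__main__":
    tf = {"olympic": 3, "records": 2, "athletes": 1}
    print(sentence_score([["olympic", "records"], ["athletes"]], tf))
```

Dividing by the number of noun phrases keeps long sentences from winning simply because they contain more phrases; whether the original system normalises this way is an assumption.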

  17. COMPENDIUM TS tool: CORE STAGES (SUMMARY GENERATION) • Summary size • number of words • compression rate • The highest-scored sentences up to a desired length are selected and extracted • Sentences are ordered as they appear in the document • Type of summaries (output) • Generic extracts: COMPENDIUM_E
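The selection step can be sketched as a greedy loop over sentences ranked by score, followed by a re-sort into document order. The greedy policy of skipping a sentence that does not fit and trying the next one is an assumption, as is measuring length by whitespace-separated words; `select_sentences` is a hypothetical name.

```python
def select_sentences(sentences, scores, max_words):
    """Pick the highest-scored sentences that fit within max_words,
    then return them in their original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return [sentences[i] for i in sorted(chosen)]

if __name__ == "__main__":
    sentences = ["A b c.", "D e.", "F g h i.", "J."]
    scores = [1.0, 3.0, 2.0, 0.5]
    print(select_sentences(sentences, scores, max_words=5))
```

A compression rate can be mapped onto the same function by setting `max_words` to the rate multiplied by the word count of the source document.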

  18. COMPENDIUM TS tool: ADDITIONAL STAGES • Query Similarity • Cosine similarity: qSim • Type of summaries (output) • Query-focused extract: COMPENDIUM_QE • Score for each sentence
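Cosine similarity between a query and a sentence can be computed over simple bag-of-words count vectors, as in the sketch below. Raw term counts without stemming or weighting are an assumption, and how qSim is combined with the relevance score is not stated on the slide, so that step is left out. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Term-count vector of the lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def q_sim(query, sentence):
    """Similarity between a user query and a candidate sentence."""
    return cosine_similarity(bag_of_words(query), bag_of_words(sentence))

if __name__ == "__main__":
    print(q_sim("gold medals", "Chinese athletes won the most gold medals"))
```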

  19. COMPENDIUM TS tool: ADDITIONAL STAGES • Subjective Information Detection • Opinion mining techniques (Balahur-Dobrescu et al., 2009) • Type of summaries (output) • Sentiment-based extract: COMPENDIUM_SE • Select the most relevant sentences among the subjective ones

  20. COMPENDIUM TS tool: ADDITIONAL STAGES • Information compression and fusion • Word graphs • Type of summaries (output) • Abstractive-oriented summary: COMPENDIUM_E-A • Combine extractive and new information

  21. EVALUATION AND EXPERIMENTS: EVALUATION METHODOLOGY • Type of evaluation • Intrinsic • What are we going to assess? • COMPENDIUM in different domains and contexts • Which criteria are we going to use for the evaluation? • Content (automatically): ROUGE (Lin, 2004) • Quality (manually): readability & user satisfaction
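To make the content metric concrete, the sketch below computes a simplified ROUGE-1 F-measure: clipped unigram overlap between a candidate summary and a single reference. It is not the official ROUGE toolkit, which also offers stemming, stop-word removal and multiple references; `rouge1_f` is a hypothetical name.

```python
import re
from collections import Counter

def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F-measure against one reference summary."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(rouge1_f("the games were held in beijing",
                   "the 2008 games took place in beijing"))
```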

  22. EVALUATION AND EXPERIMENTS: RESULTS • Newswire • Single-document generic extracts: ~45% (F-measure, ROUGE-1) • Multi-document: ~30% (F-measure, ROUGE-1) • Blogs • Multi-document sentiment-based summaries: ~64% (F-measure, Pyramid) • Image captions • Multi-document query-focused summaries: ~36% (F-measure, ROUGE-1) • Medical research papers • Single-document abstractive-oriented summaries: ~42% (F-measure, ROUGE-1)

  23. COMPENDIUM in HLT APPLICATIONS: QUESTION ANSWERING • Question answering • Allows users to formulate questions in natural language and provides them with the exact information required • Objective • Integrate COMPENDIUM with a Web-based question answering approach: COMPENDIUM_QE

  24. COMPENDIUM in HLT: QUESTION ANSWERING • Question analysis • Question type, focus and keywords • Information retrieval • Retrieve the first 20 documents in Google • Summarisation • COMPENDIUM_QE • Summary size: length of snippets • Answer extraction • Named Entities • Semantic roles • Proposed approach

  25. COMPENDIUM in HLT: QUESTION ANSWERING • Data • 100 factual questions • Person • Location • Temporal • Organization • Evaluation • Correct • Incorrect • Non-answered • F-measure (%)
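The slide does not say how the F-measure is derived from the three answer categories. One common convention, assumed here, takes precision over the answered questions and recall over all questions. The counts in the usage example are hypothetical and are not results from the presentation.

```python
def qa_f_measure(correct, incorrect, unanswered):
    """F-measure from answer counts.

    precision = correct / answered questions
    recall    = correct / all questions
    """
    answered = correct + incorrect
    total = answered + unanswered
    if correct == 0:
        return 0.0
    precision = correct / answered
    recall = correct / total
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical counts for a 100-question test set.
    print(qa_f_measure(correct=40, incorrect=20, unanswered=40))
```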

  26. COMPENDIUM in HLT: QUESTION ANSWERING • Results • Named entity-based QA • Semantic role-based QA • NE-based QA: 12% • SR-based QA: 48%

  27. CONCLUSION • The proposed techniques are appropriate for TS • Textual entailment: appropriate to tackle redundancy • Code Quantity Principle: detecting relevant information • Word graph-based algorithms: compress and merge information • Summaries, although imperfect in their nature, can improve the performance of other HLT tasks • Question Answering

  28. Text Summarisation based on Human Language Technologies and its Applications • Elena Lloret Pastor • Supervisor: Dr. Manuel Palomar • Seminar - June 2011
