CSC 9010: Text Mining Applications Document Summarization

Dr. Paula Matuszek
Paula_A_Matuszek@glaxosmithkline.com
(610) 270-6851

Document Summarization
• Provide a meaningful summary for each document.
• Examples:
  • Search tool returns “context”
  • Monthly progress reports from multiple projects
  • Summaries of news articles on the human genome
• Often part of a document retrieval system, to help the user judge documents better.
• Surprisingly hard to make sophisticated.
• Surprisingly easy to make effective.

Document Summarization: How
Three general approaches:
• Extract a predefined summary.
  • Useful in highly structured environments where you can specify the format. Typically yields very good summaries.
• Capture in an abstract representation, then generate a summary.
  • Useful in well-defined domains with clear-cut information needs.
• Extract representative sentences or clauses.
  • Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets the "general feel".

Extract Predefined Summary
• Documents have a well-defined format.
• The format includes a summary or abstract explicitly written by the document's author.
• Text mining may reorganize, regroup, or restructure the summaries.
• Example:
  • People working on multiple projects write monthly reports on what they have done, one sentence per project.
  • The reporting system collects the person-level reports and reorganizes them into project-level reports.

Extract Predefined Summary: Methods
• Extraction uses some or all of:
  • NLP for document parsing/chunking (finding the abstract)
  • standard computer science: database retrieval, string processing, etc.
• Reorganizing may be done using:
  • explicit fields specified by the author
  • keywords searched for in the documents
  • business rules that capture knowledge about who is working on which tasks and projects
• Grouping can shade into document classification when summaries are long or match the categories poorly.
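The reorganizing step from the monthly-report example can be sketched in a few lines of Python. This is only an illustration under assumptions: the people, project names, report sentences, and the function names `regroup_by_project` and `format_project_report` are all hypothetical, and the project is assumed to be an explicit field supplied by the author rather than found by keyword search or business rules.

```python
from collections import defaultdict

# Hypothetical person-level monthly reports: one sentence per project,
# with the project given as an explicit field chosen by the author.
person_reports = {
    "Alice": {
        "Genome DB": "Loaded the March sequence data.",
        "Search UI": "Fixed the result ranking bug.",
    },
    "Bob": {
        "Genome DB": "Tuned the nightly import job.",
    },
}


def regroup_by_project(reports):
    """Turn {person: {project: sentence}} into {project: [(person, sentence)]}."""
    by_project = defaultdict(list)
    for person in sorted(reports):
        for project, sentence in reports[person].items():
            by_project[project].append((person, sentence))
    return dict(by_project)


def format_project_report(project, entries):
    """Render one project-level report as plain text."""
    lines = [project]
    lines += [f"  {person}: {sentence}" for person, sentence in entries]
    return "\n".join(lines)


project_reports = regroup_by_project(person_reports)
for project in sorted(project_reports):
    print(format_project_report(project, project_reports[project]))
```

No text mining is needed when the author supplies the project field; keyword matching or business rules would replace the dictionary key lookup when the field is missing.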

Extracting Predefined Summaries: Advantages and Disadvantages
• Advantages
  • Summaries reflect the intent of the author.
  • If part of an overall reporting system, it can actually make things simpler for the author.
  • Incremental effort for the author is not large.
• Disadvantages
  • Incremental effort for the author is not zero either.
  • Only feasible in a structured situation where the requirement can be defined ahead of time.
  • Can't be used to summarize a group of documents.
  • Not all authors write good summaries.

Capture and Generate
• Documents can have an arbitrary format.
• The knowledge needed is well-defined.
• Often the information need is for summarization across multiple documents.
• Example:
  • Summarizing restaurant reviews: take newspaper articles and produce price range, kind of food, atmosphere, quality, and service.

Capture and Generate: Methods
• State of the art:
  • Create a "template" or "frame"
    • Represents the knowledge you want to capture
  • Extract information to fill in the frame
    • Standard information extraction problem
    • Typically relatively large frames with relatively few relations; mostly facts
  • Generate based on the template
    • Relatively simple: "fill-in-the-blank"
    • More complex: based on a parse tree
• Still basically research: parse the entire document into a parse tree tied to a rich semantic net; apply rules to trim the tree; generate a continuous narrative.
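A minimal sketch of the frame, fill, and generate steps, using the restaurant-review example. Everything here is an assumption made for illustration: the `RestaurantFrame` slots, the keyword lists, the price pattern, and the sample review are hypothetical, and simple keyword and regular-expression matching stands in for a real information extraction component.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists; a real system would use a proper
# information extraction component instead of keyword lookup.
CUISINES = ["italian", "thai", "mexican", "french"]
ATMOSPHERES = ["cozy", "noisy", "romantic", "casual"]


@dataclass
class RestaurantFrame:
    """The frame: the facts we want to capture from a review."""
    name: str
    cuisine: Optional[str] = None
    price_range: Optional[str] = None
    atmosphere: Optional[str] = None


def first_keyword(text, keywords):
    """Return the first keyword found as a whole word in text, else None."""
    lowered = text.lower()
    for word in keywords:
        if re.search(rf"\b{re.escape(word)}\b", lowered):
            return word
    return None


def fill_frame(name, review):
    """Fill the frame from one review using simple pattern matching."""
    price = re.search(r"\$\d+\s*-\s*\$?\d+", review)
    return RestaurantFrame(
        name=name,
        cuisine=first_keyword(review, CUISINES),
        price_range=price.group(0) if price else None,
        atmosphere=first_keyword(review, ATMOSPHERES),
    )


def generate_summary(frame):
    """Fill-in-the-blank generation; unfilled slots are skipped."""
    parts = []
    if frame.cuisine:
        parts.append(f"{frame.cuisine} food")
    if frame.atmosphere:
        parts.append(f"{frame.atmosphere} atmosphere")
    if frame.price_range:
        parts.append(f"price range {frame.price_range}")
    if not parts:
        return f"{frame.name}: no details extracted."
    return f"{frame.name}: " + "; ".join(parts) + "."


review = ("Luigi's is a cozy spot for Italian classics. "
          "Entrees run $12-$20 and service is quick.")
frame = fill_frame("Luigi's", review)
print(generate_summary(frame))
```

Because the frame is just a record of facts, frames filled from several reviews of the same restaurant could be merged before generation, which is how this approach extends to multiple documents.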

Capture and Generate: Advantages and Disadvantages
• Advantages
  • Produces very focused summaries.
  • Can readily incorporate multiple documents.
  • Not dependent on authors.
• Disadvantages
  • Assumes the information need is clearly defined.
  • Development time for the information extraction component is significant.
  • Document parsing is slow; probably not real-time.
• Comment
  • Makes no attempt to capture the author's intent.

Extract Representative Sentence
• Document format can be arbitrary.
• Document content can also be arbitrary; the information need is not clear-cut.
• The summary consists of text extracted directly from the document.
• Examples:
  • Context returned by Google for each hit
  • Google News summaries
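The "context" a search tool shows for a hit is the simplest case of extracted text. The sketch below is a hypothetical illustration, not how any particular search engine builds its snippets: the function name `snippet`, the `width` parameter, and the sample text are assumptions, and the window is cut at character positions, so it may split words.

```python
import re


def snippet(text, query, width=30):
    """Return the text around the first match of query, with '...' where trimmed."""
    match = re.search(re.escape(query), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


text = "Summaries of news articles on the human genome are published weekly."
print(snippet(text, "genome", width=10))
```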

Find Representative Sentences: Method
• Typically, choose representative individual terms, then broaden to capture the sentences containing those terms. The more terms a sentence contains, the more important it is.
• If summarizing in response to a search or other information request, the search terms are the representative terms.
• If there is no prior query, use TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.
• May add a layer of rules using position and specific phrases such as "In summary,".
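The method above can be sketched as follows. This is one possible reading under stated assumptions, not a reference implementation: the sentence splitter and tokenizer are naive, the IDF is smoothed and computed over a tiny hypothetical corpus, the cue-phrase list and its bonus of 1 are arbitrary, and all function names and sample texts are invented for the example. With a prior query, the query terms would be passed to `summarize` directly in place of the TF*IDF terms.

```python
import math
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion")  # hypothetical rule layer


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase bag-of-words tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def representative_terms(document, corpus, k=5):
    """Top-k terms of document by TF*IDF, with IDF taken over corpus."""
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    def idf(term):
        df = sum(term in s for s in doc_sets)
        return math.log((1 + n_docs) / (1 + df))  # smoothed IDF (an assumption)

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return set(ranked[:k])


def summarize(document, terms, n=1, cue_bonus=1):
    """Return the n sentences containing the most terms, in document order."""
    sentences = split_sentences(document)
    scored = []
    for index, sentence in enumerate(sentences):
        score = len(terms & set(tokenize(sentence)))
        if sentence.lower().startswith(CUE_PHRASES):
            score += cue_bonus  # rule layer: reward cue phrases
        scored.append((score, index))
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]


document = ("The genome project released new data today. "
            "Researchers said the genome data covers three species. "
            "The weather was pleasant. "
            "In summary, the genome data release helps researchers.")
corpus = [document,
          "The weather today was pleasant and warm.",
          "The project meeting was today."]

terms = representative_terms(document, corpus, k=3)
summary = summarize(document, terms, n=1)
print(summary)
```

Returning the chosen sentences in document order, rather than score order, is a small design choice that makes multi-sentence extracts read less choppily.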

Find Representative Sentences: Advantages and Disadvantages
• Advantages
  • Can be applied anywhere.
  • Relatively fast (compared to a full parse).
  • Provides a good general idea or feel for the content.
  • Can do multiple-document summaries.
• Disadvantages
  • Often choppy or hard to read.
  • Does poorly when the document doesn't contain good summary sentences.
  • Can miss major information.

Summary
• The appropriate approach depends on what is known about the documents, the domain, and the information need.
• All of the major approaches in use provide useful information in a reasonable time frame.
• None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.

Some Useful References
• This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:
  • http://www.cs.unm.edu/~storm/TSPresent.html : Detailed overview of text summarization history, methods, and current state.
  • http://www.summarization.com/ : Bibliography, tools, conferences, research. Some good resources.
  • http://clg.wlv.ac.uk/help/summarisation.php : Relatively simple overview with some good links.
  • http://citeseer.nj.nec.com/525002.html : Paper on summarization using GATE.
