Centrifuser Output
Min Yen Kan, 2001

Centrifuser’s output comes in three parts: navigation; an informative extract, based on similarities; and indicative generated text, based on differences. Centrifuser can currently produce this output for documents with the same domain and genre.
Part 1: Informative Summaries
Informative Summaries
• Informative = replaces the document with a shorter version
• Task: provide the most important aspects of the document(s)
• Interaction type: browsing
• Strategy: since search results are similar, put together similarities across documents
Algorithm
1. *Convert each document to a Document Topic Tree
2. *Compute Composite Topic Tree
3. Align query and topics across trees
4. Extract sentences
5. Order into summary
(* = done offline)
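The slides give no code; purely as orientation, the five steps might be wired together roughly as below. Every function name here is a hypothetical placeholder, not Centrifuser's actual API.

```python
# Hypothetical skeleton of the informative-summary pipeline (steps 1-5 above).
# Steps 1-2 run offline; the rest run at query time.

def build_document_topic_tree(document):            # step 1, offline per document
    raise NotImplementedError

def build_composite_topic_tree(doc_trees):          # step 2, offline per domain/genre
    raise NotImplementedError

def align_query(query, composite_tree, doc_trees):  # step 3, online
    raise NotImplementedError

def extract_sentences(aligned_topics, doc_trees):   # step 4
    raise NotImplementedError

def order_sentences(sentences, composite_tree):     # step 5
    raise NotImplementedError

def informative_summary(query, documents, composite_tree):
    doc_trees = [build_document_topic_tree(d) for d in documents]
    aligned = align_query(query, composite_tree, doc_trees)
    sentences = extract_sentences(aligned, doc_trees)
    return order_sentences(sentences, composite_tree)
```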
1. Document Topic Tree (done offline, per document)
• Hierarchical view of the document
• Built from layout (Hu et al. 99) and lexical chains (Hearst 94, Choi 00)
Example nodes from a "High Blood Pressure" guide: the root "High Blood Pressure" (level 1, prose, 3 headers) with second-level children such as "AHA Recommendation" (level 2, order 1, prose, 1 table), "See also in this guide" (level 2, order 3, prose, 5 items) and "Related AHA publications" (level 2, order 3, bulleted).
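To make the node attributes concrete, here is a minimal sketch of such a tree node; the field names mirror the attributes shown in the example but are not Centrifuser's actual representation.

```python
# Illustrative topic tree node (not Centrifuser's real data structure).
from dataclasses import dataclass, field
from typing import List

@dataclass
class TopicNode:
    title: str                  # e.g. "AHA Recommendation"
    level: int                  # depth in the document outline (1 = root topic)
    order: int = 0              # position among siblings
    style: str = "Prose"        # layout cue: "Prose", "Bulleted", ...
    contents: str = ""          # free-form description: "1 table", "5 items", ...
    children: List["TopicNode"] = field(default_factory=list)

# The "High Blood Pressure" example above, rebuilt as a tiny tree:
root = TopicNode("High Blood Pressure", level=1, contents="3 headers")
root.children = [
    TopicNode("AHA Recommendation", level=2, order=1, contents="1 table"),
    TopicNode("See also in this guide", level=2, order=3, contents="5 items"),
    TopicNode("Related AHA publications", level=2, order=3, style="Bulleted"),
]
```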
2. Composite Topic Tree (done offline, per domain and genre combination handled)
• The norm for a particular type of document
• Created by aligning topics in example document trees by similarity, joining nodes level by level (e.g. "disease" nodes joined at level 1, "symptoms" nodes at level 2, "nausea" nodes at level 3)
• Stores the order, frequency and variants of each topic
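A minimal sketch of the frequency-and-variant bookkeeping, assuming topics are matched by a crude lower-cased string comparison (standing in for the real similarity metric) and flattening the level-by-level joining for brevity:

```python
# Illustrative only: count how often each topic title appears across example
# document trees and record the title variants seen for it.
from collections import defaultdict

def build_composite(doc_topic_lists):
    """doc_topic_lists: one list of topic titles per example document."""
    counts = defaultdict(int)
    variants = defaultdict(set)
    for topics in doc_topic_lists:
        for title in set(topics):          # count each topic once per document
            key = title.lower()            # crude stand-in for similarity matching
            counts[key] += 1
            variants[key].add(title)
    n_docs = len(doc_topic_lists)
    return {key: {"freq": counts[key] / n_docs, "variants": sorted(variants[key])}
            for key in counts}

composite = build_composite([
    ["Symptoms", "Treatment", "Diet"],
    ["symptoms", "Causes", "Treatment"],
])
print(composite["symptoms"])   # {'freq': 1.0, 'variants': ['Symptoms', 'symptoms']}
```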
3. Topic Alignment (done online, to find the scope of information needed in the summary)
• Use a similarity metric to map the query to the composite and document trees
• The focus topic defines three regions: relevant, irrelevant and too detailed
Example: for the query "Hypertension", the root of a hypertension guide is the focus topic, while in a broader "Guide to Cardiac Diseases" the focus topic is a second-level subtopic.
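A rough sketch of the three-way split around the focus topic, assuming "relevant" means descendants of the focus topic down to a depth cutoff, "too detailed" means deeper descendants, and everything outside the focus subtree is "irrelevant"; the query-to-focus mapping via the similarity metric is omitted, and the cutoff value is an assumption.

```python
# Illustrative region labelling around a focus topic.  Tree nodes are plain
# dicts: {"title": str, "children": [...]}.
def classify_regions(root, focus_title, detail_cutoff=2):
    """Return {title: region} with regions focus / relevant / too detailed / irrelevant."""
    regions = {}

    def label_subtree(node, depth):
        regions[node["title"]] = "relevant" if depth <= detail_cutoff else "too detailed"
        for child in node["children"]:
            label_subtree(child, depth + 1)

    def walk(node):
        if node["title"] == focus_title:
            regions[node["title"]] = "focus"
            for child in node["children"]:
                label_subtree(child, 1)
        else:
            regions[node["title"]] = "irrelevant"
            for child in node["children"]:
                walk(child)

    walk(root)
    return regions

tree = {"title": "Hypertension", "children": [
    {"title": "Treatment", "children": [
        {"title": "Drugs", "children": []},
        {"title": "Dosage tables", "children": [{"title": "Milligram charts", "children": []}]},
    ]},
    {"title": "Local support groups", "children": []},
]}
print(classify_regions(tree, "Treatment", detail_cutoff=1))
```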
4. Sentence Extraction (cover as many topics as possible to ensure breadth of summary)
• Aligned topics are chosen in descending typicality, i.e. their frequency in the composite topic tree
• SimFinder is used to choose the sentences for each chosen topic
Example composite-tree frequencies: disease 1.0, treatment 0.9, causes 0.8, diagnosis 0.8, symptoms 0.8, drugs 0.7, for more information 0.7, diet 0.6, surgery 0.3, definition 0.2, nausea 0.2.
Extracted sentences, by typicality: 1.0 (hypertension) "Since blood is carried …", "If a drug that blocks …"; 0.9 (treatment) "How Can I Reduce High …", "How Do I Manage My …"; 0.8 (causes) "Blood pressure is …"; 0.7 (drugs) "Over-the-counter …"; 0.7 (for more information) "2000 Heart and Stroke …"; 0.6 (diet) "Everybody's looking for …".
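The following sketch illustrates typicality-ordered extraction; a trivial "first candidate sentence" stands in for SimFinder, and the sentence budget is an invented parameter.

```python
# Illustrative only: walk topics in descending typicality and pick one
# candidate sentence per topic until the budget is exhausted.
def extract_sentences(topic_freqs, topic_sentences, budget=5):
    """topic_freqs: {topic: typicality in [0, 1]};
    topic_sentences: {topic: [candidate sentences]}."""
    chosen = []
    for topic, freq in sorted(topic_freqs.items(), key=lambda kv: -kv[1]):
        if len(chosen) >= budget:
            break
        candidates = topic_sentences.get(topic, [])
        if candidates:
            chosen.append((freq, topic, candidates[0]))   # SimFinder would choose here
    return chosen

freqs = {"hypertension": 1.0, "treatment": 0.9, "causes": 0.8, "drugs": 0.7, "diet": 0.6}
sents = {
    "hypertension": ["Since blood is carried ..."],
    "treatment": ["How Can I Reduce High Blood Pressure?"],
    "causes": ["Blood pressure is ..."],
}
for freq, topic, sentence in extract_sentences(freqs, sents):
    print(f"{freq:.1f} ({topic}) {sentence}")
```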
5. Sentence Ordering (order by the norm to get the best results)
• Reorder the extracted sentences by the order of first appearance of their topics in the composite tree
Example: the sentences above, extracted in typicality order (1.0 hypertension, 0.9 treatment, 0.8 causes, 0.7 drugs, 0.7 for more information, 0.6 diet), are reordered by norm position: 1 (hypertension), 1.4 (causes), 1.5 (treatment), 1.5.1 (drugs), 1.5.2 (diet), 1.6 (for more information).
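A sketch of the reordering step, using the outline positions from the example as the norm order; encoding the positions as tuples is an assumption made for sorting convenience.

```python
# Illustrative only: re-sort extracted sentences by each topic's normal
# position in the composite tree instead of by typicality.
norm_position = {
    "hypertension": (1,), "causes": (1, 4), "treatment": (1, 5),
    "drugs": (1, 5, 1), "diet": (1, 5, 2), "for more information": (1, 6),
}

extracted = [  # (typicality, topic, sentence), as produced by the extraction step
    (1.0, "hypertension", "Since blood is carried ..."),
    (0.9, "treatment", "How Can I Reduce High Blood Pressure ..."),
    (0.8, "causes", "Blood pressure is ..."),
    (0.7, "drugs", '"Over-the-counter" ...'),
    (0.6, "diet", "Everybody's looking for ..."),
]

reordered = sorted(extracted, key=lambda item: norm_position[item[1]])
for _, topic, sentence in reordered:
    print(topic, "->", sentence)
```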
Part 2: Indicative Summaries
Indicative Summaries
• Indicative = helps the user decide whether a document is worth retrieving
• Task: show salient differences from other candidate documents
• Interaction type: searching
• Strategy: identify the content and non-content aspects in which each source differs
What goes into an Indicative Summary?
• Examine existing indicative summaries: library card catalog entries
• Examine multidocument scenarios
Corpus Parameters
• 82 summaries from CU's online catalog, healthcare domain
• Catalogued the types of information present: document-derived features and metadata features
Example catalog summary: "Practical Interventional Cardiology represents a practical reference for the interventional cardiologist and those in training, as well as the non-invasive cardiologist and physician. […] Rather than providing detailed and exhaustive reviews, the purpose of this book is to present practical information regarding cardiac interventional procedures. […]"
Corpus Analysis Results
Document-derived features, by frequency: Topicality 100%, Content Types 37%, Readability 18%, Internal Structure 17%, Special Content 7%.
Metadata features, by frequency: Title 31%, Revised/Edition 28%, Author/Editor 21%, Purpose 18%, Audience 17%, …
Analysis - Multidocument
• Examined prescriptive guidelines
• Open Directory Project website hierarchy guideline: "Make clear what makes a site different from the rest"
Differences are important:
1. Differences between documents
2. Differences from the norm
3. Differences relevant to the query (Grice '75)
Corpus Analysis Discussion
• Topicality (i.e. content) is most important, but other features have a strong role
• For Centrifuser: design the summary around topics; when space allows, add other features as needed, when a feature differs from the norm
• Future work: mimic the percentages found in the study
• Differences drive the text: both the query and the norm should affect the summary content
Algorithm
1. *Make Composite and Document Topic Trees
2. Align query and topics across trees
3. Use region ratios to compute document categories
4. Decide which messages to realize
5. Order messages
6. Generate the text
(* = done offline)
2. (recap) Align query and topics (attributes the effect of the query on the generated text)
• Map the query to a topic, the focus topic
• The focus topic divides nodes into relevant, irrelevant and intricate regions
Example: for the query "Angina" the root is the focus topic; for "Treatments of Angina" a second-level subtopic is the focus topic.
Classifying Topics – By Norm (attributes the effect of the norm on the generated text)
• Relevant nodes are divided into typical (frequency ≥ 0.5 in the composite topic tree) and rare (frequency < 0.5)
• Document topics with no counterpart in the composite tree remain unaligned
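A small sketch of this split; the 0.5 threshold is taken from the slide, while the dictionary-based bookkeeping and label names are illustrative.

```python
# Illustrative only: relevant topics become "typical" or "rare" depending on
# their frequency in the composite tree; other region labels carry over.
def classify_by_norm(doc_topic_regions, composite_freqs, threshold=0.5):
    """doc_topic_regions: {topic: region from query alignment};
    composite_freqs: {topic: fraction of example documents containing it}."""
    labels = {}
    for topic, region in doc_topic_regions.items():
        if region != "relevant":
            labels[topic] = region          # keep irrelevant / intricate as-is
        elif composite_freqs.get(topic, 0.0) >= threshold:
            labels[topic] = "typical"
        else:
            labels[topic] = "rare"
    return labels

regions = {"treatment": "relevant", "nausea": "relevant",
           "surgery": "intricate", "billing": "irrelevant"}
freqs = {"treatment": 0.9, "nausea": 0.2, "surgery": 0.3}
print(classify_by_norm(regions, freqs))
# {'treatment': 'typical', 'nausea': 'rare', 'surgery': 'intricate', 'billing': 'irrelevant'}
```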
3. Categorizing Documents
• The ratio of typical, rare, intricate and irrelevant topics determines the document's category
• 7 categories altogether, for example:
• Irrelevant document: 50+% of its topics are irrelevant (e.g. 3 typical, 2 rare, 2 intricate and 8 irrelevant)
• Specialized document: 50+% of its topics are typical, but they cover less than 50% of all possible typical topics (e.g. 5 typical, 2 rare, 2 intricate)
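A partial sketch covering only the two categories spelled out above; the remaining five categories and the exact thresholds of the real rule set are not reproduced here.

```python
# Illustrative only: categorize a document from the mix of topic labels.
from collections import Counter

def categorize(topic_labels, n_possible_typical):
    """topic_labels: {topic: 'typical' | 'rare' | 'intricate' | 'irrelevant'};
    n_possible_typical: number of typical topics in the composite tree."""
    counts = Counter(topic_labels.values())
    total = sum(counts.values())
    if counts["irrelevant"] / total >= 0.5:
        return "irrelevant document"
    if counts["typical"] / total >= 0.5 and counts["typical"] / n_possible_typical < 0.5:
        return "specialized document"
    return "other (one of the remaining five categories)"

# The slide's two examples:
irrelevant_doc = {f"t{i}": "typical" for i in range(3)}
irrelevant_doc.update({f"r{i}": "rare" for i in range(2)})
irrelevant_doc.update({f"i{i}": "intricate" for i in range(2)})
irrelevant_doc.update({f"x{i}": "irrelevant" for i in range(8)})

specialized_doc = {f"t{i}": "typical" for i in range(5)}
specialized_doc.update({f"r{i}": "rare" for i in range(2)})
specialized_doc.update({f"i{i}": "intricate" for i in range(2)})

print(categorize(irrelevant_doc, n_possible_typical=12))   # irrelevant document
print(categorize(specialized_doc, n_possible_typical=12))  # specialized document
```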
4. Forming Messages
Messages and the text that they eventually realize. Example messages for documents in the "atypical" category:
• Document category description: Relation: category-description; Args: [docCat: atypical]
• Documents belonging to the category: Relation: category-elements; Args: [docCat: atypical, element: AMA Guide, element: CU Guide]
• Topics in the category: Relation: has-topics; Args: [docCat: atypical, topic: definition, topic: risks]
Realized text: "More information on additional topics which are not included in the summary is available in these files (The American Medical Association family medical guide and The Columbia University College of Physicians and Surgeons complete home medical guide). The topics include "definition" and "what are …"
Other messages may include the number of categories in the summary and other optional information (e.g. content type).
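A sketch of what these message structures might look like in code; relation names and argument values follow the slide's examples, but the container itself is a guess.

```python
# Illustrative message container (not Centrifuser's actual representation).
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Message:
    relation: str                                   # e.g. "category-description"
    args: Dict[str, object] = field(default_factory=dict)

messages = [
    Message("category-description", {"docCat": "atypical"}),
    Message("category-elements", {"docCat": "atypical",
                                  "elements": ["AMA Guide", "CU Guide"]}),
    Message("has-topics", {"docCat": "atypical",
                           "topics": ["definition", "risks"]}),
]
for message in messages:
    print(message.relation, message.args)
```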
5. Ordering Messages
• Inter-category: by importance of the dominant topic type
• Intra-category: document category and elements before optional information
6. Text Generation
• Use a small grammar to realize the messages
• Referring expression issues: the size of referring expressions, and re-ordering the documents in the set
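A toy realization step standing in for the "small grammar": one string template per relation. The templates and wording are invented for illustration, and the referring-expression and re-ordering issues above are skipped.

```python
# Illustrative only: realize messages with per-relation templates.
TEMPLATES = {
    "category-elements": "These files ({elements}) belong to the {docCat} category.",
    "has-topics": "Their additional topics include {topics}.",
}

def realize(message):
    """message: {"relation": str, "args": {name: value or list of values}}."""
    template = TEMPLATES[message["relation"]]
    args = {name: ", ".join(value) if isinstance(value, list) else value
            for name, value in message["args"].items()}
    return template.format(**args)

messages = [
    {"relation": "category-elements",
     "args": {"docCat": "atypical", "elements": ["AMA Guide", "CU Guide"]}},
    {"relation": "has-topics", "args": {"topics": ["definition", "risks"]}},
]
print(" ".join(realize(m) for m in messages))
```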
Task-Based Evaluation
Scenario: "You've been diagnosed with cancer…"
• Compare against 3 real-world systems: an IR engine (Google), a hub (Yahoo) and a human expert (About.com)
• Goals: evaluate on subjective criteria using think-aloud techniques; see which document features best fit user needs
• Pilot study complete; full study going on now
Conclusion
• An application of summarization for IR
• Performs informative and indicative summarization, using extraction and text generation techniques, to support browsing and searching
http://centrifuser.cs.columbia.edu