Query Chain Focused Summarization
Tal Baumel, Rafi Cohen, Michael Elhadad
Jan 2014
Generic Summarization
• Generic Extractive Multi-doc Summarization:
  • Given a set of documents Di
  • Identify a set of sentences Sj s.t.
    • |Sj| < L
    • The "central information" in Di is captured by Sj
    • Sj does not contain redundant information
• Representative methods:
  • KLSum
  • LexRank
• Key concepts: centrality, redundancy
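Below is a minimal sketch of KLSum-style greedy selection, one of the representative methods named above. It is illustrative only, not the authors' implementation: it assumes whitespace-tokenized sentences, a word-count length budget, and add-one smoothing, and it greedily adds the sentence that keeps the summary's unigram distribution closest (in KL divergence) to the distribution of the whole document set.

```python
# Minimal KLSum-style sketch (illustrative; parameters are assumptions).
import math
from collections import Counter

def unigram_dist(tokens, vocab, smoothing=1.0):
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

def kl_divergence(p, q):
    # KL(P || Q); both distributions are smoothed, so q[w] > 0 for every w.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def klsum(sentences, length_budget):
    """Greedily add the sentence that keeps the summary's unigram
    distribution closest to the document distribution."""
    doc_tokens = [t for s in sentences for t in s.lower().split()]
    vocab = set(doc_tokens)
    doc_dist = unigram_dist(doc_tokens, vocab)
    remaining = list(sentences)
    summary, summary_tokens = [], []
    while remaining and len(summary_tokens) < length_budget:
        def divergence_after_adding(sent):
            candidate = summary_tokens + sent.lower().split()
            return kl_divergence(doc_dist, unigram_dist(candidate, vocab))
        best = min(remaining, key=divergence_after_adding)
        summary.append(best)
        summary_tokens.extend(best.lower().split())
        remaining.remove(best)
    return summary
```

LexRank, by contrast, scores centrality via eigenvector centrality over a sentence-similarity graph; the same greedy length-budget loop would then apply to the ranked sentences.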
Update Summarization
• Given a set of documents split as A = {ai} / B = {bj}, defined as background / new sets
• Select a set of sentences Sk s.t.
  • |Sk| < L
  • Sk captures central information in B
  • Sk does not repeat information conveyed by A
• Key concepts: centrality, redundancy, novelty
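As a rough illustration of these constraints (not a method from the slides), the sketch below scores each sentence of B by term-frequency centrality within B and subtracts a penalty for overlap with the background set A; the novelty_weight parameter and the simple tokenization are assumptions.

```python
# Illustrative update-summarization sketch: centrality in B minus overlap with A.
from collections import Counter

def toks(text):
    return text.lower().split()

def update_summary(background_sents, new_sents, length_budget, novelty_weight=2.0):
    a_vocab = {t for s in background_sents for t in toks(s)}
    b_counts = Counter(t for s in new_sents for t in toks(s))

    def score(sent):
        t = toks(sent)
        centrality = sum(b_counts[w] for w in t) / max(len(t), 1)
        background_overlap = sum(w in a_vocab for w in t) / max(len(t), 1)
        return centrality - novelty_weight * background_overlap

    summary, used = [], 0
    for sent in sorted(new_sents, key=score, reverse=True):
        if used + len(toks(sent)) > length_budget:
            continue
        summary.append(sent)
        used += len(toks(sent))
    return summary
```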
Query-Focused Summarization
• Given a set of documents Di and a query Q
• Select a set of sentences Sj s.t.:
  • |Sj| < L
  • Sj captures information in Di relevant to Q
  • Sj does not contain redundant information
• Key concepts: relevance, redundancy
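A minimal, assumed sketch of query-focused selection: rank sentences by term overlap with Q, then greedily skip sentences that are mostly covered by what is already in the summary. The 0.7 redundancy threshold is an arbitrary assumption, not a value from the slides.

```python
# Illustrative query-focused selection with a simple redundancy filter.
def toks(text):
    return set(text.lower().split())

def query_focused_summary(sentences, query, length_budget, redundancy_threshold=0.7):
    q = toks(query)
    summary, covered, used = [], set(), 0
    for sent in sorted(sentences, key=lambda s: len(toks(s) & q), reverse=True):
        t = toks(sent)
        # Redundancy: skip sentences mostly covered by the summary so far.
        if covered and len(t & covered) / max(len(t), 1) > redundancy_threshold:
            continue
        if used + len(sent.split()) > length_budget:
            continue
        summary.append(sent)
        covered |= t
        used += len(sent.split())
    return summary
```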
Query-Chain Focused Summarization
• We define a new task to clarify the distinctions among key concepts:
  • Relevance
  • Novelty
  • Contrast
  • Similarity
  • Redundancy
• The task is also useful for Exploratory Search
QCFS Task
• Given a set of topic-related documents Di and a chain of queries qj
• Output a chain of summaries {Sjk} s.t.:
  • |Sjk| < L
  • Sjk is relevant to qj
  • Sjk does not contain information already conveyed in Slk for l < j
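The loop below sketches these constraints only; it is not the system proposed in the paper. For each query qj it builds a summary from sentences relevant to qj while filtering sentences whose content was already conveyed by summaries for earlier queries. The overlap threshold and token-overlap relevance measure are assumptions.

```python
# Illustrative QCFS loop: summarize per query, filter already-conveyed content.
def toks(text):
    return set(text.lower().split())

def qcfs(sentences, query_chain, length_budget, overlap_threshold=0.5):
    summaries, covered = [], set()
    for query in query_chain:
        q = toks(query)
        step, used = [], 0
        for sent in sorted(sentences, key=lambda s: len(toks(s) & q), reverse=True):
            t = toks(sent)
            if covered and len(t & covered) / max(len(t), 1) > overlap_threshold:
                continue  # already conveyed by a summary for an earlier query
            if used + len(sent.split()) > length_budget:
                continue
            step.append(sent)
            used += len(sent.split())
        for sent in step:
            covered |= toks(sent)
        summaries.append(step)
    return summaries
```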
Query Chains
• Query Chains are observed in query logs:
  • PubMed search log mining
  • Extract query chains (length 3) from the same session / with related terms (manually)
• Query Chain evolution may correspond to:
  • Zoom in (asthma → atopic dermatitis)
  • Query reformulation (respiratory problem → pneumonia)
  • Focus change (asthma → cancer)
Query Chains vs. Novelty Detection
• TREC Novelty Detection Task (2005)
  • Task 1: Given a set of documents for the topic, identify all relevant and novel sentences.
  • Task 2: Given the relevant sentences in all documents, identify all novel sentences.
  • Task 3: Given the relevant and novel sentences in the first 5 docs only, find the relevant and novel sentences in the remaining docs.
  • Task 4: Given the relevant sentences from all documents and the novel sentences from the first 5 docs, find the novel sentences in the remaining docs.
Novelty Detection Task
• Create 50 topics:
  • Compose topic (textual description)
  • Select 25 relevant docs from a news collection
  • Sort docs chronologically
  • Mark relevant sentences
  • Among relevant sentences, mark novel ones (not covered by previous relevant sentences)
• 28 "events" topics / 22 "opinion" topics
TREC Novelty – Dataset Analysis
• Select parts of documents (not full docs)
• Statistics (events topics / opinion topics):
  • Relevant rate: 25% / 15%
  • Consecutive relevant sentences: 85% / 65%
  • Relevance agreement: 68% / 50%
  • Novelty rate: 38% / 42%
  • Novelty agreement: 45% / 29%
TREC Novelty Methods
• Relevance = similarity to topic
• Novelty = dissimilarity to past sentences
• Methods:
  • tf.idf and Okapi (BM25) with a threshold for retrieval
  • Topic expansion
  • Sentence expansion
  • Named entities as features
  • Coreference resolution
  • Named-entity normalization (entity linking)
• Results:
  • High recall / low precision
  • Almost no distinction between relevant and novel
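The sketch below illustrates a tf.idf-with-threshold baseline in the spirit of the methods listed above; the thresholds, tokenization, and exact idf formula are assumptions, not values from the track. Relevance is cosine similarity to the topic above a threshold, and a relevant sentence counts as novel if it is sufficiently dissimilar from every earlier relevant sentence.

```python
# Illustrative tf.idf baseline for relevance + novelty detection.
import math
from collections import Counter

def tfidf_vectors(texts):
    """Very small tf.idf: raw term counts weighted by log(N / df)."""
    bags = [Counter(t.lower().split()) for t in texts]
    n = len(bags)
    df = Counter(w for bag in bags for w in bag)
    return [{w: c * math.log(n / df[w]) for w, c in bag.items()} for bag in bags]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relevant_and_novel(topic, sentences, rel_threshold=0.05, nov_threshold=0.8):
    vecs = tfidf_vectors([topic] + sentences)
    topic_vec, sent_vecs = vecs[0], vecs[1:]
    relevant, novel, seen = [], [], []
    for sent, vec in zip(sentences, sent_vecs):
        if cosine(vec, topic_vec) < rel_threshold:
            continue                       # relevance = similarity to topic
        relevant.append(sent)
        # novelty = dissimilarity to every earlier relevant sentence
        if all(cosine(vec, prev) < nov_threshold for prev in seen):
            novel.append(sent)
        seen.append(vec)
    return relevant, novel
```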
QCFS and Contrast
• QCFS is different from Query-Focused Summarization:
  • When generating S2, we must take S1 into account.
• QCFS is different from Update Summarization:
  • The split A/B is not observed.
• QCFS is different from Novelty Detection:
  • Chronology is not relevant.
• Key concepts:
  • Query relevance
  • Query distinctiveness (how qi+1 contrasts with qi)
Contrastive IR
• CWS: A Comparative Web Search System (Sun et al., WWW 2006)
• Given 2 queries q1 and q2:
  • Rank a set of "contrastive pairs" (p1, p2), where p1 and p2 are snippets of relevant docs.
• Method:
  • Retrieve relevant snippets SR1 = {p1i} and SR2 = {p2j}
  • Score(p1, p2) = a·R(p1, q1) + b·R(p2, q2) + c·T(p1, p2, q1, q2)
  • T(p1, p2, q1, q2) = x·Sim(url1, url2) + (1 - x)·Sim(p1\q1, p2\q2)
  • Greedy ranking of pairs:
    • Rank all pairs (p1, p2) by score; take the top pair.
    • Remove p1top and p2top from all remaining pairs; iterate.
  • Cluster pairs into comparative clusters
  • Extract terms from comparative clusters
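A minimal sketch of the pair scoring and greedy ranking just described. Sim is reduced here to plain Jaccard overlap over whitespace tokens (including for URLs), the weights a, b, c, x default to assumed values, and snippets are represented as dicts with hypothetical "text" and "url" fields.

```python
# Illustrative CWS-style pair scoring and greedy pair ranking.
from itertools import product

def toks(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a or b else 0.0

def pair_score(p1, p2, q1, q2, a=1.0, b=1.0, c=1.0, x=0.5):
    """Score(p1, p2) = a*R(p1,q1) + b*R(p2,q2) + c*T(p1,p2,q1,q2)."""
    r1 = jaccard(toks(p1["text"]), toks(q1))
    r2 = jaccard(toks(p2["text"]), toks(q2))
    # T mixes URL similarity with similarity of the snippets minus their query terms.
    t = (x * jaccard(toks(p1["url"]), toks(p2["url"]))
         + (1 - x) * jaccard(toks(p1["text"]) - toks(q1),
                             toks(p2["text"]) - toks(q2)))
    return a * r1 + b * r2 + c * t

def greedy_pair_ranking(snippets1, snippets2, q1, q2, top_k=5):
    pairs = list(product(snippets1, snippets2))
    ranked = []
    while pairs and len(ranked) < top_k:
        best = max(pairs, key=lambda pr: pair_score(pr[0], pr[1], q1, q2))
        ranked.append(best)
        p1_top, p2_top = best
        # Remove the chosen snippets from all remaining pairs, then iterate.
        pairs = [(s1, s2) for s1, s2 in pairs if s1 is not p1_top and s2 is not p2_top]
    return ranked
```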
Document Clustering
• A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (Kummamuru et al., WWW 2004)
• Desirable properties of a clustering:
  • Coverage
  • Compactness
  • Sibling distinctiveness
  • Reach time
• Incremental algorithm:
  • Decide on the width n of the tree (# children / node)
  • Nodes are represented by "concepts" (terms)
  • Rank concepts by score and add the top ones under the current node
  • Score(Sak, cj) = a·ScoreC(Sak-1, cj) + b·ScoreD(Sak-1, cj)
    • ScoreC = document coverage
    • ScoreD = sibling distinctiveness
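The sketch below renders the concept-scoring step in a simplified form: coverage counts the documents a candidate concept would newly cover, and sibling distinctiveness is approximated as (negated) maximum document overlap with already-chosen siblings. The docs_by_concept mapping and the alpha/beta weights are assumptions, not the paper's exact formulation.

```python
# Illustrative concept scoring for monothetic clustering: coverage + distinctiveness.
def score_concept(candidate, chosen_siblings, docs_by_concept, alpha=1.0, beta=1.0):
    """Weighted sum of document coverage and sibling distinctiveness,
    with distinctiveness approximated as negated overlap with chosen siblings."""
    covered = set()
    for sib in chosen_siblings:
        covered |= docs_by_concept[sib]
    candidate_docs = docs_by_concept[candidate]
    coverage = len(candidate_docs - covered)                        # ~ ScoreC
    sibling_overlap = max((len(candidate_docs & docs_by_concept[s])
                           for s in chosen_siblings), default=0)    # ~ -ScoreD
    return alpha * coverage - beta * sibling_overlap

def pick_children(candidates, docs_by_concept, width):
    """Incrementally add the best-scoring concepts as children of the current node."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < width:
        best = max(pool, key=lambda c: score_concept(c, chosen, docs_by_concept))
        chosen.append(best)
        pool.remove(best)
    return chosen
```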