Topical Query Decomposition

Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis. Yahoo! Research, Barcelona, Spain. KDD '08

Abstract • Given a query and a document retrieval system, produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query. • Two formulations: • Set cover problem, solved with a greedy algorithm • Clustering problem, solved with a two-phase algorithm based on hierarchical agglomerative clustering (and dynamic programming)

Introduction • A query log L is a list of pairs <q, D(q)> • q: a query • D(q): its result, i.e., the set of documents that answer query q • Q(q): the maximal set of queries p_i such that, for each p_i, the set D(p_i) has at least one document in common with the documents returned by q
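The definition of Q(q) above can be illustrated with a minimal sketch. It assumes the query log is held as a dictionary from query string to a set of document ids; the function name `candidate_queries`, the toy data, and the choice to exclude q itself from Q(q) are assumptions made for illustration, not details taken from the slides.

```python
def candidate_queries(log, q):
    """Return Q(q): queries whose result set shares at least one document with D(q).

    Assumption: q itself is left out of its own candidate set.
    """
    target = log[q]
    return {p for p, docs in log.items() if p != q and docs & target}


# Hypothetical toy query log: query -> D(query) as a set of document ids.
log = {
    "jaguar": {1, 2, 3, 4},
    "jaguar car": {1, 2, 9},
    "jaguar animal": {3, 4},
    "python": {7, 8},
}

print(sorted(candidate_queries(log, "jaguar")))
```

On a real log, an inverted index from document id to queries would avoid scanning every query for each q.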

The goal is to compute a cover. • Select a subcollection C ⊆ Q(q) that covers almost all of D(q).

Problem Statement – 1/3 • Red-blue set cover problem • U = {b_1, …, b_n, r_1, …, r_m} (for a query q) • B = {b_1, …, b_n}: blue points (the documents in D(q)) • R = {r_1, …, r_m}: red points (documents returned by candidate queries but not in D(q)) • S = {S_1, …, S_k} is provided from the query log L • S_i ⊆ U • S_i^B: blue points in S_i (S_i^B = S_i ∩ B) • S_i^R: red points in S_i (S_i^R = S_i ∩ R) • Goal: find a subcollection C ⊆ S that covers many blue points of U without covering too many red points.
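A small sketch of how the red-blue instance could be built from a query log. It assumes the blue points are the documents in D(q) and the red points are documents returned by candidate queries that lie outside D(q); the function name `split_red_blue` and the toy data are hypothetical.

```python
def split_red_blue(log, q, candidates):
    """Build B, R and the per-set parts (S_i^B, S_i^R) for query q.

    Assumption: blue = D(q); red = documents of candidate queries outside D(q).
    """
    blue = set(log[q])
    parts = {p: (log[p] & blue, log[p] - blue) for p in candidates}
    red = set().union(*(r for _, r in parts.values()))
    return blue, red, parts


# Hypothetical toy query log: query -> set of document ids.
toy_log = {
    "jaguar": {1, 2, 3, 4},
    "jaguar car": {1, 2, 9},
    "jaguar animal": {3, 4},
}

blue, red, parts = split_red_blue(toy_log, "jaguar", ["jaguar car", "jaguar animal"])
print(blue, red, parts)
```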

Problem Statement – 2/3 • For each query q, the candidate queries are Q(q). • For each set S_i with blue and red points, its weight is given by its scatter sc(S_i) (coherence is the opposite of scatter).

Problem Statement – 3/3 • Our goal is to find a subcollection C ⊆ S that covers almost all the blue points of U and has large coherence. • More precisely, we want C to satisfy the following properties: • Cover-blue • Not-cover-red • Small-overlap • Coherence

Greedy Algorithm – 1/2 • At the i-th iteration, select the set S that minimizes s(S, V_B, V_R). • λ_C, λ_R, λ_O are parameters that weight the relative importance of the three terms. • V_B: blue points already covered by the sets selected in previous iterations • V_R: red points already covered by the sets selected in previous iterations • Reference: D. Peleg. Approximation algorithms for the Label-Cover_MAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 2007.

Greedy Algorithm – 2/2
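The greedy selection can be sketched as follows. The exact form of s(S, V_B, V_R) is not given in these slides, so the score used here is an assumption: a weighted sum of scatter, newly covered red points, and overlap with already covered blue points, divided by the number of newly covered blue points. The names `greedy_cover`, `lam_c`, `lam_r`, `lam_o`, and `target` are hypothetical.

```python
def greedy_cover(blue, sets, scatter, lam_c=1.0, lam_r=1.0, lam_o=1.0, target=0.9):
    """Pick sets greedily until a fraction `target` of the blue points is covered.

    sets: name -> (blue_part, red_part); scatter: name -> sc(S).
    The score below is an ASSUMED stand-in for s(S, V_B, V_R).
    """
    covered_blue, covered_red, chosen = set(), set(), []
    remaining = dict(sets)
    while remaining and len(covered_blue) < target * len(blue):
        best, best_score = None, float("inf")
        for name, (sb, sr) in remaining.items():
            new_blue = sb - covered_blue
            if not new_blue:
                continue  # adds no new blue point, so it cannot help coverage
            cost = (lam_c * scatter[name]              # coherence term
                    + lam_r * len(sr - covered_red)    # newly covered red points
                    + lam_o * len(sb & covered_blue))  # overlap with earlier picks
            score = cost / len(new_blue)
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break  # no remaining set covers a new blue point
        sb, sr = remaining.pop(best)
        covered_blue |= sb
        covered_red |= sr
        chosen.append(best)
    return chosen


# Hypothetical toy instance.
toy_blue = {1, 2, 3, 4}
toy_sets = {
    "a": ({1, 2}, {9}),
    "b": ({3, 4}, set()),
    "c": ({1, 2, 3, 4}, {10, 11, 12, 13, 14, 15}),
}
toy_scatter = {"a": 0.5, "b": 0.5, "c": 0.5}
print(greedy_cover(toy_blue, toy_sets, toy_scatter))
```

In this toy instance the broad set "c" covers every blue point at once but is penalized for its many red points, so two narrower sets are preferred.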

Integer Programming • S_1 + S_2 + … + S_l ≤ 10 • S_i ≤ 1

Clustering-Based Method • Two-phase approach • First phase: all points in set B are clustered using a hierarchical agglomerative clustering algorithm (CLUTO toolkit). • Second phase: match the clusters of the hierarchy G produced by the agglomerative algorithm with the sets of S. • The main idea is to match sets of S to clusters of G. • Every node T ∈ G corresponds to a cluster. • T_B denotes the set of points of B in the cluster of T.

Clustering-Based Method • Dendrogram G

Clustering-Based Method - Dynamic Programming - 1/2 • Complete coverage: • Each set S_i ∈ S is compared against each node T ∈ G. • Matching score m(T, S_i) • m*(T): the score of the best matching set in S • The dynamic program computes the optimal cost of covering the points of T_B with sets in S.

Clustering-Based Method - Dynamic Programming - 2/2 • Partial coverage: • λ_U weights the relative importance of the two terms: the scatter cost of the sets in S and the number of uncovered points.
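A minimal sketch of a bottom-up recurrence over the dendrogram, covering both the complete and the partial variant. The slides do not preserve the formulas, so everything numeric here is an assumption: the matching score m(T, S) is taken to be the scatter of S plus the size of the symmetric difference between T_B and the blue part of S, and a node is either matched to its best set, split into its two children, or (partial coverage only) left uncovered at a cost of λ_U per point. The names `Node`, `best_match`, and `cover_cost` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    points: frozenset                 # T_B: blue points in this cluster
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def best_match(node, sets, scatter):
    """m*(T) under the ASSUMED score: scatter plus mismatch with T_B."""
    return min(scatter[s] + len(node.points ^ sb) for s, sb in sets.items())


def cover_cost(node, sets, scatter, lam_u=None):
    """Cheapest way to handle T_B: match it, split it, or (if lam_u) skip it."""
    options = [best_match(node, sets, scatter)]
    if node.left is not None and node.right is not None:
        options.append(cover_cost(node.left, sets, scatter, lam_u)
                       + cover_cost(node.right, sets, scatter, lam_u))
    if lam_u is not None:
        options.append(lam_u * len(node.points))  # leave these points uncovered
    return min(options)


# Hypothetical toy dendrogram with two leaves, and candidate sets (blue parts only).
leaf_a = Node(frozenset({1, 2}))
leaf_b = Node(frozenset({3, 4}))
root = Node(frozenset({1, 2, 3, 4}), leaf_a, leaf_b)
dp_sets = {"x": {1, 2}, "y": {3, 4}, "z": {1, 2, 3, 4}}
dp_scatter = {"x": 0.1, "y": 0.1, "z": 1.0}

print(cover_cost(root, dp_sets, dp_scatter))
print(cover_cost(root, dp_sets, dp_scatter, lam_u=0.01))
```

For deep hierarchies the recursion could be replaced by an explicit post-order traversal; the recurrence itself stays the same.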

Application • Query log L: 2.9 million distinct queries • The majority of users look only at the first page of results; few request further result pages. • D(q): the result documents for q on the result pages navigated by any user who asked q in the query log • 24 million distinct documents seen by the users

Application - Candidate queries for the cover • For each query q, the candidate queries are Q_k(q).

Application - Results • A set of 100 queries was randomly picked from the top 10,000 queries submitted by users. • Evaluation measures: • Cost of k queries • The number of documents included outside the set D(q) • Average number of queries covering each element • Coverage after the top k candidates have been picked
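Three of the measures listed above can be computed directly from a chosen cover, as in the sketch below. It assumes the average number of queries per element is taken over the covered documents of D(q); the function name `evaluate_cover` and the toy data are hypothetical, and the cost measure is omitted because its definition is not given in the slides.

```python
def evaluate_cover(blue, chosen_sets):
    """Evaluate a cover given D(q) (`blue`) and the result sets of the chosen queries.

    Assumption: the average multiplicity is taken over covered documents of D(q).
    """
    covered = set().union(*chosen_sets)
    covered_blue = covered & blue
    hits = sum(len(s & blue) for s in chosen_sets)
    return {
        "coverage": len(covered_blue) / len(blue),
        "outside_docs": len(covered - blue),
        "avg_queries_per_doc": hits / len(covered_blue) if covered_blue else 0.0,
    }


# Hypothetical example: D(q) has four documents and two queries were chosen.
metrics = evaluate_cover({1, 2, 3, 4}, [{1, 2, 9}, {2, 3}])
print(metrics)
```

Passing the first k chosen sets for increasing k gives the coverage after the top k candidates have been picked.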

Conclusions • A novel problem: topical query decomposition • Elegant solutions: • Red-blue metric set cover • Clustering with predefined clusters (hierarchical agglomerative clustering) • The set-cover formulation provides solutions of better quality. • Code and data for reproducing the results shown in Table 3 are available at http://www.yr-bcn.es/querydecomp/ .
