Preference and Diversity-based Ranking in Network-Centric Information Management Systems PhD defense Marina Drosou Computer Science & Engineering Dept. University of Ioannina
Why diversify? Car Animal Sports Team “Mr. Jaguar’’ Marina Drosou, Preference and Diversity-based Ranking in Network-Centric Information Management Systems
Thesis Goal • This PhD thesis concerns the development, implementation and evaluation of models, algorithms and techniques for ranking the information presented to users of network-centric information management systems • This ranking is based on the importance of each piece of information. We consider that importance is influenced by both relevance to user information needs and diversity: • Relevance is important so that users are presented only with the results most useful for their needs • Diversity ensures that the received results do not all contain similar information.
Outline • Search Result Diversification: Introduction & Related Work • Content Diversification using Indices • DisC Diversity: Diversification based on Dissimilarity and Coverage • Poikilo: Evaluating the Results of Diversification Models and Algorithms • Summary
Outline • Search Result Diversification: Introduction & Related Work • Problem Definition • Variations • Algorithms • Content Diversification using Indices • DisC Diversity: Diversification based on Dissimilarity and Coverage • Poikilo: Evaluating the Results of Diversification Models and Algorithms • Summary
Problem Definition Given: • P = {p1, …, pn}: a set of items • k ≤ n • d: a distance metric • f: a diversity function Find: $S^* = \operatorname{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$, i.e., given a set P of items and a number k, select a subset S* of P with the k most diverse items of P
What it means • Given a set P of query results we want to select a representative diverse subset S* of P • What does diverse mean? • Content: dissimilar items • e.g., distant location on a map, different attribute values in tuples • Coverage: different aspects, perspectives, concepts • e.g., different interpretations of a keyword in web search, different topics • Novelty: items not seen in the past • e.g., novel results in a notification service
Content-based diversity
Coverage-based diversity • Basic idea: Find a set of results that cover different interpretations of the query • Common assumptions: • A taxonomy exists • Both queries and results may belong to many categories • Statistics on the distribution of user intents have been collected • Result independence • Probabilistic view of the problem
Novelty-based diversity • Novelty: the need to avoid redundancy (vs. Diversity: the need to resolve ambiguity) • Intuitively: an item should be returned in the i-th position of the list if • it is relevant • the previous (i-1) items do not contain the same information • Information is partitioned into “nuggets” • Often, human judges decide what is relevant or not for each nugget (IR approach)
Adding relevance in the mix • We must not forget: Relevance to the query is also important! • Results must be both relevant and diverse • Two alternatives: • Select the k most diverse results out of the top-m most relevant ones, m > k • Include diversity in the ranking criterion • Augmenting the diversity function with relevance • Adapting IR criteria, e.g., discounted cumulative gain (DCG) at position i
Adding relevance in the mix • Augmenting diversity functions with relevance: • MaxMin: [formula not preserved] • MaxSum: [formula not preserved] • Mono-objective formula: [formula not preserved] • and others • (the formulas use a relevance function; its symbol and the remaining parameter are not preserved)
Problem Complexity • The problem of choosing diverse items is NP-hard • This follows from the MAX COVERAGE/SET COVER problems • Intuitively: • To find the most diverse subset S* of all items P we have to compute all possible combinations of k items out of |P| and keep the one with the maximum diversity
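A minimal sketch of the exhaustive search described in the slide above, assuming Euclidean distance and the MaxMin diversity function (the minimum pairwise distance of the subset). The function names and the sample points are illustrative and do not come from the thesis. The number of candidate subsets, C(|P|, k), is what makes this approach infeasible for large P.

```python
from itertools import combinations
from math import comb, dist

def min_pairwise_distance(items):
    """MaxMin diversity of a set: the smallest distance between any two of its items."""
    return min(dist(p, q) for p, q in combinations(items, 2))

def brute_force_max_min(points, k):
    """Try every k-subset (k >= 2) and keep the one with the largest diversity."""
    return max(combinations(points, k), key=min_pairwise_distance)

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (10, 0)]
n_subsets = comb(len(points), 3)      # number of subsets that must be examined
best = brute_force_max_min(points, 3)
```

Even for this toy input, all `n_subsets` candidates are scored; the count grows combinatorially with |P| and k.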
Solving the problem • Thus, we use heuristics for approximate solutions • Greedy heuristics: • Select items one by one until we have k of them • Interchange heuristics: • Start with a random solution and interchange items when doing so improves the objective function • Also: • Neighborhood heuristics: Disqualify items close to the ones already selected • Simulated Annealing: Apply simulated annealing to avoid local maxima • and others
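The greedy heuristic can be sketched as follows for the MaxMin objective. This is one common variant, not necessarily the exact one used in the thesis: seeding with the farthest pair and using Euclidean distance are assumptions, and the names are illustrative.

```python
from itertools import combinations
from math import dist

def greedy_max_min(points, k):
    """Pick k items one by one, each time adding the item farthest from the selection."""
    # Assumption: seed with the two items that are farthest apart.
    selected = list(max(combinations(points, 2), key=lambda pair: dist(*pair)))
    remaining = [p for p in points if p not in selected]
    while len(selected) < k and remaining:
        # Distance of a candidate from the set = distance to its closest selected item.
        best = max(remaining, key=lambda p: min(dist(p, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected[:k]

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (10, 0)]
picked = greedy_max_min(points, 3)
```

Each step costs one pass over the remaining items, so the heuristic avoids enumerating subsets at the price of a possibly suboptimal result.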
Related Work
Outline • Search Result Diversification: Introduction & Related Work • Content Diversification using Indices • Model • Diverse set computation • Combining diversity & relevance • DisC Diversity: Diversification based on Dissimilarity and Coverage • Poikilo: Evaluating the Results of Diversification Models and Algorithms • Summary
Introduction • We focus on content-based diversification • MaxMin • Basic idea: employ indices for the efficient computation of diverse items • Cover Trees • We also define the Continuous k-diversity problem
The Cover Tree • A leveled tree where each level is a “cover” for all levels beneath it • Items at higher levels are farther apart from each other than items at lower levels
Cover Tree Invariants - Nesting • Nesting: $C_l \subseteq C_{l-1}$, i.e., once an item p appears at some level, then every lower level has a node associated with p
Cover Tree Invariants - Covering • Covering: For every $p_i \in C_{l-1}$, there exists a $p_j \in C_l$ such that $d(p_i, p_j) \le b^l$ and the node associated with $p_j$ is the parent of the node associated with $p_i$ • b: the “base” of the tree; l: the level index
Cover Tree Invariants - Separation • Separation: For all distinct $p_i, p_j \in C_l$, it holds that $d(p_i, p_j) > b^l$
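The three invariants can be checked mechanically. The sketch below assumes the standard Cover Tree forms of the invariants (nesting: C_l ⊆ C_{l-1}; covering within b^l; separation by more than b^l) and represents the tree only by its levels, without explicit parent pointers; the representation and names are illustrative.

```python
from itertools import combinations
from math import dist

def check_cover_tree_levels(levels, b=2.0):
    """levels maps a level number l to the set C_l of items (tuples) at that level."""
    for l in sorted(levels):
        c_l = levels[l]
        # Separation: distinct items of C_l are more than b**l apart.
        if any(dist(p, q) <= b ** l for p, q in combinations(c_l, 2)):
            return False
        if l - 1 in levels:
            below = levels[l - 1]
            # Nesting: C_l is a subset of C_{l-1}.
            if not c_l <= below:
                return False
            # Covering: every item of C_{l-1} has a candidate parent in C_l within b**l.
            if any(all(dist(p, q) > b ** l for q in c_l) for p in below):
                return False
    return True

good = {2: {(0.0,)}, 1: {(0.0,), (3.0,)}, 0: {(0.0,), (3.0,), (1.5,)}}
bad = {1: {(0.0,), (1.0,)}, 0: {(0.0,), (1.0,), (3.0,)}}  # violates separation
```

Here `check_cover_tree_levels(good)` holds, while `bad` fails because two items at level 0 are only b^0 = 1 apart.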
Example Items indexed at the first ten levels of the same Cover Tree
Cover Tree Representations • After an item p appears at some level $C_l$ of the tree, p is a child of itself at all levels below $C_l$ • Implicit representation: space depending on P • Explicit representation: O(n) space
Dynamic Construction • Items can be inserted into and deleted from a Cover Tree dynamically • Insertion: • Starting from the root, descend towards the candidate nodes that can cover the new item p • Continue until a level $C_l$ is reached where p is separated from all other items • Select as parent a candidate node of $C_{l+1}$ that covers p • Deletion: • Descend the tree looking for p, keeping note of candidate nodes that can cover the children of p • Remove p and reassign its children to the candidate nodes
Level Family of Algorithms • The higher the tree level, the farther apart its nodes. Thus, by selecting items from nodes at high levels, we retrieve more diverse results • Let $C_l$ be the first level (from the root) with at least k items • Level-Basic: Select k random items from $C_l$ • Level-Greedy: Greedily select k items from $C_l$ • Level-Inherit: Select all items in $C_{l+1}$ and greedily select $k - |C_{l+1}|$ items from $C_l$
Approximation Bound • Let P be a set of items, k ≥ 2, $d^{OPT}(P,k)$ the optimal minimum distance for the MaxMin problem and $d^{CT}(P,k)$ the minimum distance of the diverse set computed by the Level-Basic algorithm. It holds that: $d^{CT}(P,k) \ge \alpha \, d^{OPT}(P,k)$, where $\alpha = (b-1)/(2b^2)$ • (Proved by exploiting the covering invariant of the tree to bound the level where the least common ancestor of any two items of the optimal solution appears in the tree)
Cover Tree implementation of Greedy • Any Cover Tree can be employed for implementing the greedy heuristic • ½-approximation of the optimal solution • We perform k descents of the tree, using one of the following pruning rules: • CT PRUNING RULE: Let p and q be two sibling nodes at level l in a CT. If [condition not preserved], then no node in the subtree of q can be farther apart from S than p • WCT PRUNING RULE: Let p and q be two sibling nodes at level l in a CT. If [condition not preserved], then no node in the subtree of q can be farther apart from S than p
Batch Construction • If all items of P are available, we can perform a batch construction of the Cover Tree • We call such trees “Batch Cover Trees” (BCTs) • As we descend a BCT, we get items in the order selected by Greedy • Algorithm: • The leaf level $C_l$ contains all items in P • We greedily select items from $C_l$ with distance larger than $b^{l+1}$ and promote them to $C_{l+1}$ • The rest of the items in $C_l$ are distributed as children among the new nodes of $C_{l+1}$ • Continue until we reach the root level
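A rough sketch of the level-by-level promotion in the batch algorithm, under several assumptions that are not in the slides: Euclidean distance, a leaf level chosen as the highest l with b^l below the smallest pairwise distance, and seeding each level with the first item. Only the levels are built; assigning the remaining items to parents is omitted. All names are illustrative.

```python
from math import dist

def batch_levels(points, b=2.0):
    """Return {level: items}, from the leaf level up to a single-item root level.

    Assumes at least two items and that all items are distinct.
    """
    min_gap = min(dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])
    # Assumption: the leaf level is the highest l with b**l < min_gap,
    # so that separation already holds among all items.
    level = 0
    while b ** level >= min_gap:
        level -= 1
    while b ** (level + 1) < min_gap:
        level += 1
    levels = {level: list(points)}
    while len(levels[level]) > 1:
        current = levels[level]
        promoted = [current[0]]  # assumption: seed with the first item
        while True:
            # Farthest-first: the item with the largest distance from the promoted ones.
            cand = max(current, key=lambda p: min(dist(p, s) for s in promoted))
            if min(dist(cand, s) for s in promoted) <= b ** (level + 1):
                break  # every remaining item is covered by some promoted item
            promoted.append(cand)
        level += 1
        levels[level] = promoted
    return levels

levels = batch_levels([(0.0,), (1.5,), (3.0,), (10.0,)])
```

Because promotion stops only when no item is farther than b^(l+1) from the promoted ones, each new level satisfies both separation and covering with respect to the level below it.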
Adding relevance • Two approaches: • Incorporate relevance into the distance function • Use relevance to select items from the tree, e.g., MMR • Level-Greedy • Level-Hybrid: Greedily select k items from [level symbol not preserved] and the k most relevant descendants of items in [level symbol not preserved]
Continuous Model • We consider a streaming scenario, where new items arrive and older items expire • We want to provide users with a continuously updated subset of the top-k most diverse recent items in the stream • We consider a sliding-window model [Figure: consecutive windows $P_{i-1}$ and $P_i$; labels: jump step, w]
Continuous k-Diversity Problem • Let $P_{i-1}$, $P_i$ be two subsequent jumping windows • For each $P_i$, we seek to select a diverse subset $S_i^* \subseteq P_i$, where the additional two constraints hold: • Durability: $\forall p_j \in S_{i-1}^* \cap P_i$, $p_j \in S_i^*$ • Once selected as diverse, an item remains as such until it expires • Freshness: Let $p_l$ be the newest item in $S_{i-1}^*$. Then, $\nexists\, p_j \in P_i \setminus S_{i-1}^*$ with $j < l$, such that $p_j \in S_i^*$ • Items are selected in the same order they are produced
Continuity Requirements • Items in the tree are marked as valid or invalid: • Freshness: non-diverse items that are older than the newest diverse item from the previous window are marked as invalid in the cover tree and are not further considered. • Durability: Let r be the number of diverse items from previous windows that have not yet expired. We select k-r new valid diverse items from the new window.
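The two continuity requirements can be illustrated without a Cover Tree. The sketch below is a simplification under stated assumptions: items are hypothetical (arrival_index, point) pairs, and a plain greedy MaxMin step stands in for the tree-based selection of the k-r new items.

```python
from math import dist

def next_diverse_set(prev_selected, window, k):
    """prev_selected and window hold (arrival_index, point) pairs; window is the new window."""
    in_window = set(window)
    # Durability: previously selected items stay selected until they expire.
    kept = [item for item in prev_selected if item in in_window]
    # Freshness: unselected items older than the newest previously selected item are invalid.
    newest = max((idx for idx, _ in prev_selected), default=-1)
    valid = [item for item in window if item[0] > newest]
    selected = list(kept)
    # Fill the remaining k - r slots greedily (assumption: MaxMin-style greedy step).
    while len(selected) < k and valid:
        if selected:
            best = max(valid, key=lambda it: min(dist(it[1], s[1]) for s in selected))
        else:
            best = valid[0]
        selected.append(best)
        valid.remove(best)
    return selected

prev = [(2, (0, 0)), (4, (9, 9))]
window = [(3, (1, 1)), (4, (9, 9)), (5, (5, 5)), (6, (0, 10)), (7, (9, 0)), (8, (4, 4))]
current = next_diverse_set(prev, window, 3)
```

In this example item 2 has expired, item 4 is retained (r = 1), item 3 is invalid because it is older than item 4, and two new items are chosen among items 5 to 8.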
Building Batch Cover Trees • We measure the extra cost of building a BCT as compared to executing the greedy heuristic (GR) for k = n • This extra cost corresponds to assigning nodes to suitable parents to form the tree levels [Figure: extra cost; np: nearest parent heuristic (choose the closest candidate parent)] • The quality of the solution is the same for BCT and GR.
Building Incremental Cover Trees • Building ICTs requires a small fraction of the cost required for the corresponding BCTs • However, the quality of the solutions provided by ICTs is comparable to that of BCTs (and, thus, GR) [Figure: extra cost] • For trees with 10,000 items: • Insertion cost: ~2.6 msec • Deletion cost: ~10 msec • The cost of inserting/removing items after a window jump depends on the size of the window and the jump step, but it is much lower than that of re-building a BCT for the new set of items
Pruning • Pruning is even better for non-uniform datasets, since each selection of a diverse item results in pruning a larger number of items around it • Also, pruning is better for large values of λ
Streaming Data • We compare ICTs against SGR, a streaming version of GR: • At each window, we keep any remaining diverse items from the previous window (durability) and let GR select items from the new window satisfying freshness • Comparable achieved diversity, while ICTs are much faster • Retrieving the top-100 items from an ICT with 1,000-10,000 items requires ~1.5 msec • Executing SGR requires 3.2 sec for 5,000 items and more than 15 sec for 10,000 items
Summary • We proposed an index-based diversification approach based on Cover Trees • We provided a new suite of algorithms along with theoretical results for the quality of our approach • We studied the diversification problem in a dynamic setting, where items change over time, and defined continuity requirements that the diversified items must satisfy
Related Publications • M. Drosou and E. Pitoura, Diverse Set Selection over Dynamic Data, in IEEE TKDE (to appear) • M. Drosou and E. Pitoura, Dynamic Diversification of Continuous Data, EDBT 2012
Outline • Search Result Diversification: Introduction & Related Work • Content Diversification using Indices • DisC Diversity: Diversification based on Dissimilarity and Coverage • DisC Diversity • Algorithms • Comparison with other models • Incremental DisC • Poikilo: Evaluating the Results of Diversification Models and Algorithms • Summary
DisC Diversity • What is the right size for the diverse subset S? • What is a good k? • What if… instead of k, a radius r? • Given a result set P and a radius r, we select a representative subset S ⊆ P such that: • For each item in P, there is at least one similar item in S (coverage) • No two items in S are similar with each other (dissimilarity)
DisC Diversity • Small r: more and less dissimilar points (zoom in) • Large r: fewer and more dissimilar points (zoom out) • Local zooming at specific points
DisC Diversity • Formal definition: • Let P be a set of objects and r, r ≥ 0, a non-negative real number. A subset S ⊆ P is an r-Dissimilar-and-Covering diverse subset, or r-DisC diverse subset, of P, if the following two conditions hold: • coverage condition: ∀ pi ∈ P, ∃ pj ∈ $N_r^+(p_i)$, such that pj ∈ S, where $N_r^+(p_i)$ contains pi and all items within distance r of pi • dissimilarity condition: ∀ pi, pj ∈ S with pi ≠ pj, it holds that d(pi, pj) > r • A DisC set for a set P is not unique, so we seek a concise representation → the minimum DisC set
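The two conditions of the definition translate directly into a check. A minimal sketch, assuming Euclidean distance and points given as tuples; the function name and sample data are illustrative.

```python
from itertools import combinations
from math import dist

def is_disc_diverse(subset, points, r):
    """True if subset (distinct items taken from points) is an r-DisC diverse subset."""
    # Coverage: every item of P has an item of S within distance r (possibly itself).
    covered = all(any(dist(p, s) <= r for s in subset) for p in points)
    # Dissimilarity: distinct items of S are more than r apart.
    dissimilar = all(dist(p, q) > r for p, q in combinations(subset, 2))
    return covered and dissimilar

points = [(0, 0), (1, 0), (2, 0), (5, 0)]
small = [(1, 0), (5, 0)]            # valid and minimum for r = 1
larger = [(0, 0), (2, 0), (5, 0)]   # also valid for r = 1, but not minimum
```

Both `small` and `larger` satisfy the definition for r = 1, which illustrates why DisC subsets are not unique and why the minimum one is sought.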
Graph model • We use a graph to model the problem: • Each item is a vertex • There exists an edge between two vertices if their distance is at most r
Graph model • Finding a minimum r-DisC diverse subset of a set P is equivalent to finding a minimum Independent Dominating set of the corresponding graph • Independent: no edge between any two vertices in the set • Dominating: all vertices outside the set are connected with at least one vertex inside it • This is an NP-hard problem [Figure panels: dominating, not independent; dominating and independent]
Computing DisC subsets • A basic or greedy approach: • select random items or items with large neighborhoods
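A sketch of the greedy variant, assuming Euclidean distance and a linear scan for neighborhoods (the thesis may use an index instead; that is not shown here). At each step the uncovered item with the most uncovered neighbors is selected, and everything within r of it becomes covered. Names are illustrative.

```python
from math import dist

def greedy_disc(points, r):
    """Return an r-DisC diverse subset, preferring items with large neighbourhoods."""
    uncovered = list(points)
    selected = []
    while uncovered:
        # Pick the uncovered item with the most uncovered items within distance r.
        best = max(uncovered, key=lambda p: sum(dist(p, q) <= r for q in uncovered))
        selected.append(best)
        # The selected item and all of its neighbours are now covered.
        uncovered = [q for q in uncovered if dist(best, q) > r]
    return selected

points = [(0, 0), (1, 0), (2, 0), (5, 0)]
subset = greedy_disc(points, 1)
```

Every selected item was uncovered when chosen, so it is more than r away from all earlier selections (dissimilarity), and the loop ends only when no item is uncovered (coverage). Replacing the `max` with an arbitrary pick gives the basic variant.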
How much smaller is the (optimal) minimum DisC set? • The size of any r-DisC diverse subset S of P is at most B times the size of any minimum r-DisC diverse subset S*, where B is the maximum number of independent neighbors of any item in P • i.e., each item has at most B neighbors that are independent from each other • B depends on the distance metric and data cardinality • We have proved that: • for the Euclidean distance in the 2D plane: B = 5 • for the Manhattan distance in the 2D plane: B = 7 • for the Euclidean distance in 3D space: B = 24
Raising the dissimilarity condition • When we consider only coverage: Let Δ be the maximum number of neighbors of any item in P; the size of any covering (but not dissimilar) diverse subset S of P is at most ln Δ times larger than any minimum covering subset S*
Adding weights • We also consider weights • e.g., indicating relevance • We now seek the DisC set S with the minimum value of [formula not preserved] • When all weights are equal, the problem reduces to finding a minimum r-DisC subset
Multiple radii • We want to allow different areas of the data to contribute more or fewer items to the diverse set • The problem now loses its symmetry • Two interpretations: • pi can represent all items lying at a distance at most r(pi) around it (Covering problem) • pi can be represented only by items lying at a distance at most r(pi) around it (CoveredBy problem)
Multiple radii variations • The problem is now modeled via a directed graph • Directed graphs do not always have an independent dominating set! • We provide heuristic algorithms that always locate a valid DisC set • Covering: start with items with larger radii • CoveredBy: start with items with smaller radii [Figure panels: a set P; Covering; CoveredBy]