Dealing with Diversity in Mining and Query Processing. Jeffrey Xu Yu (于旭), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, yu@se.cuhk.edu.hk
Books on Social Networks • Social and Economic Networks by Matthew O. Jackson • Social Network Data Analysis by Charu C. Aggarwal • Exploratory Social Network Analysis with Pajek by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj • Networks, Crowds, and Markets: Reasoning about a Highly Connected World by David Easley and Jon Kleinberg • Networks: An Introduction by M.E.J. Newman
Some Online Courses • Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman) http://infolab.stanford.edu/~ullman/mmds.html • Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg http://www.cs.cornell.edu/home/kleinber/networks-book • Topics in Data Management & Mining – Social Networks, Laks V.S. Lakshmanan http://www.cs.ubc.ca/~laks/534l/cpsc534l.html
Stanford Large Network Dataset Collection http://snap.stanford.edu/data • Social networks • Communication networks • Citation networks • Collaboration networks • Web graphs • Amazon networks • Internet networks • Road networks • Autonomous systems • Signed networks • Wikipedia networks and metadata • Twitter and Memetracker
Graph Database http://en.wikipedia.org/wiki/Graph_database • Pregel: Google’s internal graph processing platform • Trinity: Microsoft Research Asia • Neo4j: commercial graph database • …
Diversified Ranking • Why diversified ranking? • Users' information needs are diverse • Queries are often incomplete
Problem Statement • For query-dependent diversified ranking, the goal is to find K nodes in a graph that are relevant to the query node and dissimilar to each other. • For query-independent diversified ranking, the goal is to find K prestigious nodes in a graph that are dissimilar to each other. • Main applications • Ranking nodes in a social network, ranking papers, etc.
Challenges • Diversity measures • There is no widely accepted diversity measure on graphs in the literature. • Scalability • Most existing methods do not scale to large graphs. • Lack of intuitive interpretation.
Existing Methods • Grasshopper [Zhu, et al., HLT-NAACL’07] • ManiRank [Zhu, et al., WWW’11] • DivRank [Mei, et al., KDD’10] • DRAGON [Tong, et al., KDD’11] • Resistive Graph Centers [Dubey, et al., KDD’11]
Grasshopper/ManiRank • The main idea • Work iteratively: select one node per iteration by random walk. • Turn the selected node into an absorbing node, then run the random walk again to select the next node. • Repeat for K iterations to obtain K nodes. • No diversity measure • Diversity is achieved only by intuition and experiments. • Cannot scale to large graphs (time complexity O())
Grasshopper/ManiRank • Initial random walk with no absorbing states • Absorbing random walk after ranking the first item
DivRank • Based on a vertex-reinforced random walk. • No diversity measure. • Convergence properties are not clear. • Time and space complexity is
DRAGON, Resistive Graph Centers • DRAGON [Tong, et al., KDD’11] • The diversity measure lacks a clear topological interpretation. • Resistive Graph Centers [Dubey, et al., KDD’11] • Based on personalized PageRank with a learnable teleportation parameter. • Does not scale to large graphs.
A Summary • Comparison with existing methods
Our Approach • The main idea • Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores. • Diversity of the top-K nodes is achieved by a large expansion ratio. • Expansion ratio of a set of nodes S: σ(S)=|N(S)|/n • A larger expansion ratio implies better diversity
The K-step Expansion • K-step expansion ratio of S: σk(S)=|Nk(S)|/n • Our diversity measures
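As a small illustration of the two ratios above, the following Python sketch computes σ(S) and σk(S) on an adjacency-list graph. It assumes that N(S) contains S itself plus every node reachable from S within one hop (k hops for Nk(S)); the slides do not spell out that definition, and the function names and the toy path graph are hypothetical.

```python
from collections import deque

def k_step_neighborhood(adj, seeds, k):
    """Nodes within k hops of any seed. Including the seeds themselves is an assumption."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seen)
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

def expansion_ratio(adj, seeds, k=1):
    """sigma_k(S) = |N_k(S)| / n, with n the number of nodes in the graph."""
    return len(k_step_neighborhood(adj, seeds, k)) / len(adj)

# Toy path graph 0-1-2-3-4-5 (made up for illustration).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clustered = expansion_ratio(adj, {0, 1})   # the two seeds sit next to each other
spread = expansion_ratio(adj, {0, 4})      # the two seeds are far apart
```

On this toy graph the spread-out seed set covers more of the graph than the clustered one, which is the intuition behind using the expansion ratio as a diversity measure.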
A Discrete Optimization Problem • Formulate the diversified ranking problem on a graph as a discrete optimization problem. • Submodularity • F(S) is shown to be submodular and non-decreasing. • The greedy algorithm • A 1-1/e approximation algorithm for solving Eq. (1). • Linear time and space complexity w.r.t. the size of the graph.
The Greedy Algorithm • Works in K rounds • In each round, select the node with the maximal marginal gain
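The K-round greedy selection can be sketched as follows. The objective is not reproduced on the slides, so this sketch assumes a hypothetical form F(S) = (1 - lam) * (sum of PageRank scores in S) + lam * |N(S)|/n, with N(S) taken as S plus its one-hop neighbors; the weight `lam`, the score values, and the toy graph are all made up for illustration.

```python
def greedy_diversified(adj, pagerank, K, lam=0.5):
    """Pick K nodes, each round taking the node with the largest marginal gain.

    Assumed objective (not from the slides):
    F(S) = (1 - lam) * sum(pagerank[u] for u in S) + lam * |N(S)| / n
    """
    n = len(adj)
    selected, covered = [], set()
    for _ in range(K):
        best, best_gain = None, -1.0
        for u in adj:
            if u in selected:
                continue
            newly_covered = ({u} | set(adj[u])) - covered
            gain = (1 - lam) * pagerank[u] + lam * len(newly_covered) / n
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break  # fewer than K nodes in the graph
        selected.append(best)
        covered |= {best} | set(adj[best])
    return selected

# Two clusters: {0, 1, 2, 3} around hub 0 and {4, 5, 6} around hub 4.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0], 4: [5, 6], 5: [4], 6: [4]}
pr = {0: 0.25, 1: 0.22, 2: 0.15, 3: 0.05, 4: 0.18, 5: 0.10, 6: 0.05}

diverse = greedy_diversified(adj, pr, K=2)             # relevance and expansion
by_score = greedy_diversified(adj, pr, K=2, lam=0.0)   # relevance only
```

With `lam=0` the selection degenerates to ranking by score alone and both picks come from the same cluster; with the expansion term, the second pick moves to the other cluster.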
Generalized Diversified Ranking Optimization • Maximize Fk(S) subject to the cardinality constraint • |S| <= K • Submodularity • Fk(S) is shown to be submodular and non-decreasing. • Randomized greedy algorithm • Near 1-1/e approximation algorithm. • Linear time and space complexity w.r.t. the size of the graph.
Generalized Diversified Ranking Optimization • Randomized greedy algorithm • Same idea as the greedy algorithm • Works in K rounds • In each round, select the node with the maximal marginal gain. However, evaluating the marginal gain exactly is expensive. • Our idea: use a probabilistic counting data structure to sketch the k-step neighborhood of each node.
FM Sketch and Its Properties • A probabilistic counting structure devised by Flajolet and Martin. • Used to estimate the cardinality of a multi-set using only log C+t bits, where C denotes the cardinality and t is a small constant. • Each FM sketch is a bitmap of log C+t bits. • Advantage: to estimate the cardinality of the union of two multi-sets, we only need a bitwise-OR between the two FM sketches.
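A minimal Python sketch of an FM-style counter illustrates the bitwise-OR property. It is a simplified variant under stated assumptions, not the structure used in the talk: it keeps several fixed-width 32-bit bitmaps (instead of log C+t bits), derives hash values from SHA-256, and averages the bitmaps to reduce variance. The class and parameter names are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis

class FMSketch:
    def __init__(self, num_bitmaps=64, bits=32):
        self.bits = bits
        self.bitmaps = [0] * num_bitmaps

    @staticmethod
    def _hash(item, seed):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        for i in range(len(self.bitmaps)):
            h = self._hash(item, i)
            # Index of the lowest set bit of the hash (geometrically distributed).
            rho = (h & -h).bit_length() - 1 if h else self.bits - 1
            self.bitmaps[i] |= 1 << min(rho, self.bits - 1)

    def union(self, other):
        """Sketch of the union of two multi-sets: a bitwise OR per bitmap."""
        merged = FMSketch(len(self.bitmaps), self.bits)
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self):
        """Estimate the number of distinct items from the lowest unset bit."""
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):
                r += 1
            total += r
        return 2 ** (total / len(self.bitmaps)) / PHI

def sketch_of(items):
    s = FMSketch()
    for x in items:
        s.add(x)
    return s

a = sketch_of(range(0, 600))
b = sketch_of(range(400, 1000))
merged = a.union(b)   # same bitmaps as sketching range(0, 1000) directly
```

Because adding an item only sets bits, duplicates never change the sketch, and the OR of two sketches is exactly the sketch of the union. That is what makes the marginal-gain evaluation cheap.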
The Randomized Greedy Algorithm • Randomized greedy algorithm • For each node u, use FM Sketch to sketch Nk({u}) • Use the following rule to sketch Nk({u}), which can be implemented in a recursive way • Use FM sketch to sketch Nk(S) • Evaluating the marginal gain can be implemented by a bitwise-OR between Nk(S) and Nk({u})
Experimental Studies • We conduct experiments on 5 real networks (3 collaboration networks, 1 citation network, and 1 social network). • We show some results on Flickr, a popular photo-sharing website (data from the ASU social computing data repository). • Undirected social network (80,513 nodes, 5,899,882 edges, and 195 different groups)
Make a Top-K Algorithm Diversified • Existing top-K search algorithms • Search results are ranked independently of each other • When searching for “apple” in Google image search, 9 of the top 15 results are the logo of Apple Inc. • [Figure: the result of searching “apple” in Google image search]
Structural Keyword Search (1) • Example: Keyword Search in Graphs • Input: a graph with text information on each node, and a user-given keyword query • Output: the top-k minimal Steiner trees that contain all user-given keywords • [Figure: a DBLP example for the keywords “graph patterns” and “keyword search”, showing four results v1 to v4. Figure labels: Author: Jiawei Han; Action: Write; Paper: Optimizing Index for Taxonomy Keyword Search; Paper: Mining Graph Patterns; Paper: Mining Significant Graph Patterns by Leap Search; Paper: Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents]
Structural Keyword Search (2) • Suppose the similarity of and is , e.g., • Let • is better than because and are similar to each other • is better than because has a larger total score • [Figure: result scores v1 = 0.8, v2 = 0.5, v3 = 0.5, v4 = 0.4; pairwise similarity labels 0.6, 0.6, 0.6, 0.6, 0.2, 0.2]
Diversified Top-K • We should consider both similarity and score • Let S = {v1, v2, …} be a list of search results • Let score(v) be the score of result v • Let sim(u, v) be the similarity of u and v • For any u, v in S, u and v are similar if sim(u, v) is above τ • τ: a user-given threshold • Diversified top-k results D(S): • At most k results: D(S) ⊆ S and |D(S)| ≤ k • No two results in D(S) are similar • The total score of the results in D(S) is maximized
A Diversity Graph • Diversity graph G(S) = (V, E) • Undirected graph • Each result in S is a node in V; there is an edge (u, v) in E if and only if u is similar to v • The diversified top-k result set is an independent set of G(S) • [Figure: a diversity graph on results v1 to v6; score labels shown: 8, 6, 7, 7, 10, 1]
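To make the construction concrete, here is a short Python sketch that builds a diversity graph from a pairwise similarity function and checks whether a candidate answer is valid (at most k results, no two of them similar). Treating two results as similar when their similarity is strictly greater than the threshold is an assumption, and the similarity values below are made up.

```python
from itertools import combinations

def build_diversity_graph(results, sim, tau):
    """Undirected graph with an edge between every pair of similar results.

    Assumption: 'similar' means sim(u, v) > tau.
    """
    adj = {v: set() for v in results}
    for u, v in combinations(results, 2):
        if sim(u, v) > tau:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_valid_answer(adj, chosen, k):
    """At most k results, forming an independent set in the diversity graph."""
    return len(chosen) <= k and all(
        v not in adj[u] for u, v in combinations(chosen, 2)
    )

# Hypothetical similarities: v1~v2 and v3~v4 are close, everything else is not.
pairwise = {frozenset(("v1", "v2")): 0.6, frozenset(("v3", "v4")): 0.6}

def sim(u, v):
    return pairwise.get(frozenset((u, v)), 0.2)

graph = build_diversity_graph(["v1", "v2", "v3", "v4"], sim, tau=0.5)
```

Once the graph is built, the search for diversified results no longer needs the similarity function; only adjacency and scores matter.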
Existing Top-K Search Frameworks • Most existing top-K search frameworks avoid exploring all search results by using an early stop condition. • Incremental Top-K • Results are generated one by one in ranked order. • Stops when K results have been output. • Bounding Top-K • Results are not necessarily generated in ranked order. • A non-increasing upper bound u on the score of any unseen result is maintained. • Stops when the K-th largest score generated so far is no smaller than u.
Our Framework • We support the existing top-K frameworks • Results are generated one by one • Stops if a certain stop condition is satisfied • Our framework • We extend the existing algorithms to get top-K diversified results with three new functions. • sufficient(): a new early stop condition • necessary(): the necessary stop condition • div-search(): search for the top-K diversified results among the current results • Step 1 • Check the stop condition sufficient() • Stop if sufficient() is satisfied • Step 2 • Generate the next result using the original top-K algorithm • Step 3 • Check the necessary() condition • If necessary() is satisfied, search for the diversified top-K results using div-search() • Go to Step 1
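The three-step loop can be sketched in Python with the three functions passed in as callables. The loop follows the steps on the slide; everything in the demo below it is a hypothetical stand-in: results arrive in decreasing score order, `necessary` simply waits for K results, `div_search` is a brute-force search, and `sufficient` is a simple bound that is valid only for that ordered stream. None of them is the condition defined in the talk.

```python
from itertools import combinations

def run_framework(next_result, sufficient, necessary, div_search):
    seen, current = [], []
    while True:
        if sufficient(seen, current):        # Step 1: early stop
            return current, seen
        result = next_result()               # Step 2: original top-K algorithm
        if result is None:                   # stream exhausted (assumption)
            return div_search(seen), seen
        seen.append(result)
        if necessary(seen):                  # Step 3: refresh diversified answer
            current = div_search(seen)

# ---- hypothetical demo -------------------------------------------------
K = 2
SIMILAR = {frozenset(("A", "B"))}
STREAM = iter([("A", 10), ("B", 9), ("C", 8), ("D", 3), ("E", 2)])

def total(results):
    return sum(score for _, score in results)

def div_search(seen):
    """Brute force: best set of at most K pairwise dissimilar results."""
    best = []
    for size in range(1, K + 1):
        for combo in combinations(seen, size):
            names = [name for name, _ in combo]
            if any(frozenset(p) in SIMILAR for p in combinations(names, 2)):
                continue
            if total(combo) > total(best):
                best = list(combo)
    return best

def sufficient(seen, current):
    """Toy bound: any answer using an unseen result scores at most
    (top K-1 seen scores) + (last seen score), since scores arrive sorted."""
    if not current:
        return False
    top = sorted((s for _, s in seen), reverse=True)[:K - 1]
    return total(current) >= sum(top) + seen[-1][1]

def necessary(seen):
    return len(seen) >= K

answer, generated = run_framework(
    lambda: next(STREAM, None), sufficient, necessary, div_search
)
```

In this demo the loop stops after three results have been generated, without ever looking at D and E, which is the purpose of the early stop condition.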
Sufficient Stop Condition • Sufficient stop condition sufficient() • : the set of current generated results • : an upper bound of the optimal solution calculated from current generated results • : the current diversified top- results with score • : the score upper bound of all unseen results • For each , in the ideal situation, for the unseen results, all the remaining results are set to be • We have • The sufficient stop condition is
Necessary Stop Condition • Necessary stop condition necessary() • : the set of current generated results • Assume the stop condition of the original algorithm is satisfied • Otherwise the algorithm cannot stop • : the set of results when the last time necessary() is satisfied (or if necessary() is never satisfied) • If for a certain , we need at least more results generated in order to get results • The necessary stop condition is
The Possible Search Algorithms • Given the diversity graph for the current generated result set • Finding the diversified top-K results on the diversity graph is an NP-hard problem • Greed is not good • [Figure: a diversity graph with nodes v0 to v100 and u1 to u100 and score labels 100, 99, 1, and 0.5. Optimal solution: score = 9900. Greedy solution: score = 199]
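The gap between greedy and optimal can be reproduced on a much smaller graph than the one in the figure. The Python sketch below uses a made-up star graph (not the figure's instance): a greedy pass that always takes the highest-scoring result compatible with its earlier picks, compared against an exhaustive search.

```python
from itertools import combinations

def greedy(adj, score, k):
    """Take results in decreasing score order, skipping any that are
    adjacent (similar) to a result already chosen."""
    chosen = []
    for v in sorted(adj, key=score.get, reverse=True):
        if len(chosen) < k and all(v not in adj[u] for u in chosen):
            chosen.append(v)
    return chosen

def brute_force(adj, score, k):
    """Best independent set with at most k nodes, by exhaustive search."""
    best = ()
    for size in range(1, k + 1):
        for combo in combinations(adj, size):
            independent = all(v not in adj[u] for u, v in combinations(combo, 2))
            if independent and sum(score[v] for v in combo) > sum(score[v] for v in best):
                best = combo
    return list(best)

# Hypothetical star: the center is similar to every leaf.
adj = {"c": {"l1", "l2", "l3"}, "l1": {"c"}, "l2": {"c"}, "l3": {"c"}}
score = {"c": 10, "l1": 9, "l2": 9, "l3": 9}

greedy_total = sum(score[v] for v in greedy(adj, score, k=3))
optimal_total = sum(score[v] for v in brute_force(adj, score, k=3))
```

Greedy commits to the center and is then blocked from every leaf, while the optimal answer skips the single best result and takes the three leaves.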
Three New Search Algorithms • We propose three exact algorithms • div-astar: an A* based approach • div-dp: decomposes div-astar using a combining operator • div-cut: further decomposes div-dp using two operators
An A* Based Approach • We use a heap to maintain partial solutions • Each partial solution records: • the set of results selected in the partial solution • the total score of the selected results • an upper bound on the score if the partial solution is expanded to a full solution • Entries in the heap are expanded in non-increasing order of their upper bound • The algorithm stops when the upper bound of the next solution is no larger than the score of the current best solution
An A* Based Approach • Calculation of • is the set of adjacent nodes of in • The equation is a relaxation of the optimal solution w.r.t. • is to avoid generating redundant results • can be calculated in time in the worst case s.t.
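A best-first search of this shape can be sketched in Python as follows. The bound formula on the slide is not recoverable, so the sketch substitutes a simple relaxation: the current score plus the highest-scoring remaining nodes that are not adjacent to the chosen ones, ignoring conflicts among those nodes. Redundant partial solutions are avoided by only extending a solution with nodes that come later in a fixed order. Scores are assumed non-negative, and all names and the toy graph are hypothetical.

```python
import heapq
from itertools import count

def div_astar(adj, score, k):
    """Exact best-first search for the best independent set of at most k nodes."""
    order = sorted(adj, key=score.get, reverse=True)  # fixed expansion order

    def bound(total, chosen, start, room):
        # Relaxed upper bound: greedily add the best compatible later nodes,
        # ignoring edges among them.
        for v in order[start:]:
            if room == 0:
                break
            if adj[v].isdisjoint(chosen):
                total += score[v]
                room -= 1
        return total

    tie = count()  # tie-breaker so the heap never compares solutions directly
    best_score, best = 0, ()
    heap = [(-bound(0, (), 0, k), next(tie), 0, (), 0)]
    while heap:
        neg_bound, _, total, chosen, start = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no remaining entry can beat the current best
        for i in range(start, len(order)):
            v = order[i]
            if not adj[v].isdisjoint(chosen):
                continue
            new_total, new_chosen = total + score[v], chosen + (v,)
            if new_total > best_score:
                best_score, best = new_total, new_chosen
            if len(new_chosen) < k:
                b = bound(new_total, new_chosen, i + 1, k - len(new_chosen))
                heapq.heappush(heap, (-b, next(tie), new_total, new_chosen, i + 1))
    return best_score, list(best)

# Hypothetical diversity graph: a 5-cycle a-b-c-d-e plus a pendant node f on e.
adj = {
    "a": {"b", "e"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d", "a", "f"}, "f": {"e"},
}
score = {"a": 8, "b": 6, "c": 7, "d": 7, "e": 10, "f": 1}
best_score, best_set = div_astar(adj, score, k=3)
```

On this graph the best answer for k = 3 contains only two results, which shows why the definition asks for at most k results rather than exactly k.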
An A* Based Approach • An example • [Figure: a diversity graph with labels 8, 6, 7, 7, 10, 3, shown over five steps; each of Steps 1 to 5 expands one heap entry] • After Step 5, the current best score is compared with the next best score and the search stops with the optimal solution
A DP Based Approach • The diversity graph may contain many disconnected components • It is costly to apply the A* algorithm to the whole diversity graph • Combine the results of the disconnected components using an operator based on Dynamic Programming (DP) • Dynamic Programming • Suppose the diversity graph contains two disconnected components • State: the optimal score of the diversified top-i results on a graph, for each i up to K • State transition equation:
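The component-wise combination can be sketched in Python. The transition used here is an assumed form, since the equation itself is missing from the slide: the best score with at most i results on two disconnected components is the maximum over j of (best with at most j on the first) + (best with at most i - j on the second). Each component is solved by brute force to keep the sketch short; the talk uses the A* search for that step. The toy graph is made up.

```python
from itertools import combinations

def components(adj):
    """Connected components of an undirected graph, via depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def best_by_budget(adj, score, comp, k):
    """table[i] = best score using at most i pairwise non-adjacent nodes of comp."""
    table = [0] * (k + 1)
    for size in range(1, min(k, len(comp)) + 1):
        for combo in combinations(comp, size):
            if all(v not in adj[u] for u, v in combinations(combo, 2)):
                table[size] = max(table[size], sum(score[v] for v in combo))
    for i in range(1, k + 1):
        table[i] = max(table[i], table[i - 1])  # "at most i" is monotone in i
    return table

def combine(t1, t2, k):
    """Assumed DP transition: split the budget i between the two components."""
    return [max(t1[j] + t2[i - j] for j in range(i + 1)) for i in range(k + 1)]

def div_dp(adj, score, k):
    table = [0] * (k + 1)
    for comp in components(adj):
        table = combine(table, best_by_budget(adj, score, comp, k), k)
    return table[k]

# Hypothetical graph: a path x-y-z and a separate edge p-q.
adj = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}, "p": {"q"}, "q": {"p"}}
score = {"x": 5, "y": 9, "z": 5, "p": 4, "q": 3}
```

The budget split matters: with k = 2 the best answer takes y and p, while with k = 3 it switches to x, z, and p.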
A DP Based Approach • An Example • [Figure: a diversity graph with disconnected components and score labels 10, 6, 8, 6, 9, 7, 7, 8, 7, 10, 1; the optimal solution shown contains four results]
A Cut Point Based Approach • Cut point of a graph • Suppose the graph is connected • A cut point is a node whose removal disconnects the graph • The graph can be further decomposed using cut points • Suppose the graph has a cut point; there are two situations • Excluded: the cut point is excluded from the final solution • After removing the cut point, the graph becomes several disconnected components • Included: the cut point is included in the final solution • After removing the cut point and all its adjacent nodes, the graph becomes several disconnected components • Add the cut point to each result of the included case • The results of the two situations are combined using an operator to compute the solution for the graph
A Cut Point Based Approach • Let be a cut point of • Let be the solution by excluding • Let be the solution by including • and are mutually exclusive with each other • : the optimal score of diversified top- results on • Calculating
A Cut Point Based Approach • Handling multiple cut points • Step 1: Construct a cut-point tree (cptree) • Each node: associated with a cut point (a leaf node is associated with a virtual cut point) • Each edge: associated with a subgraph that connects two cut points (the subgraph can be empty or disconnected) • A sample cptree: • Step 2: Search the cptree • In a bottom-up fashion
A Cut Point Based Approach • An Example • Suppose , , , have been computed • We now compute and
A Cut Point Based Approach • An Example • Computing • Computing • (Case 1) is excluded: • (Case 2) is included: • is the result after removing adjacent nodes of from • We have • can be computed similarly
A Cut Point Based Approach • An Example • Computing • Computing • (Case 1) is excluded: • (Case 2) is included: • We have • can be computed similarly • Do not forget to add {} to all the results of