The Community-search Problem and How to Plan a Successful Cocktail Party
Mauro Sozio and Aristides Gionis
Presented By: Raghu Rangan, Jialiang Bao, Ge Wang

Introduction
• Graphs are one of the most popular data representations
• They have a wide range of applications
• Communities and social networks as graphs have gained attention
  • People are represented as nodes
  • Connections between people are edges
• This paper focuses on the query-dependent variant of the community search problem

Planning a Cocktail Party
• Participants should be “close” to the organizers (e.g., a friend of a friend).
• Everybody should know some of the participants.
• The graph should be connected.
• The number of participants should not be too small
• Not too large either
• This is difficult
[Figure: example graph with nodes Bob, Alice, Charlie, David]

Community Search Problem
• Need to find the community that a given set of users belongs to.
• Given a graph and a set of nodes, find a densely connected subgraph containing the set of users given in input.

Related Work
• Connectivity Subgraphs
  • Work has been done to find a subgraph that connects a set of query nodes
  • Not enough
  • Need to extract the best community that the query nodes define
• Community Detection
  • Finding communities in large graphs and social networks
  • The typical approach optimizes a modularity measure
  • The problem is that most methods consider only the static community detection problem

Related Work
• Team Formation
  • Lappas et al. studied this problem
  • Given a network where nodes are labeled with a set of skills
  • Find a subgraph in which all skills are present and the communication cost is small
  • A variant of this problem is present for cocktail party planning

Problem definition
• Problem 1: Given an undirected (connected) graph G = (V, E), a set of query nodes Q, and a goodness function f, find the densest subgraph H = (V_H, E_H) of G such that:
  • V_H contains Q (all query nodes must be included)
  • H is connected
  • f(H) is maximized among all feasible choices of H (the larger the better)

Query node and goodness function?
• What are the query nodes?
  • They are the input nodes that the community must contain.
• What is the goodness function?
  • It measures how dense a subgraph is.
  • Candidates: average degree, minimum degree

Why not choose the average degree function?
• It leads to unintuitive results
• An unrelated but dense part of the graph can easily be added to the result
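
The drawback above can be illustrated with a small sketch. The graph, the node names, and the helper functions below are made up for illustration and do not come from the paper: a 4-clique around the query node "q" is linked by a two-edge bridge to an unrelated 5-clique. Average degree prefers the whole graph, while minimum degree prefers the 4-clique.

```python
from itertools import combinations

def make_graph(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def average_degree(adj, nodes):
    """Average degree of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    return sum(len(adj[v] & nodes) for v in nodes) / len(nodes)

def minimum_degree(adj, nodes):
    """Minimum degree of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    return min(len(adj[v] & nodes) for v in nodes)

# Hypothetical example graph (not from the paper).
near = ["q", "a", "b", "c"]               # 4-clique containing the query node
far = ["f1", "f2", "f3", "f4", "f5"]      # unrelated but denser 5-clique
edges = list(combinations(near, 2)) + list(combinations(far, 2))
edges += [("b", "x"), ("x", "f1")]        # bridge through node "x"
adj = make_graph(edges)

small = set(near)   # the intuitive community of "q"
large = set(adj)    # the whole graph

print(average_degree(adj, small), average_degree(adj, large))  # 3.0 3.6
print(minimum_degree(adj, small), minimum_degree(adj, large))  # 3 2
```

Under average degree the whole graph scores higher (3.6 versus 3.0), even though the 5-clique has nothing to do with "q"; under minimum degree the 4-clique wins (3 versus 2).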

Problem definition
• Problem 2: Given an undirected (connected) graph G = (V, E), a set of query nodes Q, a goodness function f, and a number d as the distance bound, find the densest subgraph H = (V_H, E_H) of G such that:
  • V_H contains Q (all query nodes must be included)
  • H is connected
  • D_Q(H) <= d
  • f(H) is maximized among all feasible choices of H (the larger the better)
• Problem 2 adds a distance constraint.

Maximizing the minimum degree
• Greedy algorithm, steps:
  1. Set G_0 = G.
  2. Delete the minimum-degree node and all its edges; go to 2.
• Termination condition, either:
  • At least one of the query nodes in Q has minimum degree
  • The query nodes in Q are no longer connected
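
The steps above can be sketched as follows. This is a minimal, unoptimized illustration rather than the authors' implementation: the function names, the example graph, and the choice to return the best intermediate graph's component containing the query nodes are assumptions made here. It recomputes connectivity at every step, so it does not achieve linear running time.

```python
from collections import deque
from itertools import combinations

def reachable(adj, alive, start):
    """Nodes reachable from `start` using only nodes in `alive` (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def greedy(adj, query):
    """Peel minimum-degree nodes; return (best node set, its recorded min degree)."""
    query = set(query)
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    best, best_min = None, -1
    while True:
        comp = reachable(adj, alive, next(iter(query)))
        if not query <= comp:
            break                          # query nodes are no longer connected
        d_min = min(deg[v] for v in alive)
        if d_min > best_min:
            best, best_min = comp, d_min   # best intermediate graph so far
        if any(deg[q] == d_min for q in query):
            break                          # a query node has minimum degree
        # Delete one minimum-degree node (ties broken by name for determinism).
        u = min((v for v in alive if deg[v] == d_min), key=str)
        alive.remove(u)
        for w in adj[u]:
            if w in alive:
                deg[w] -= 1
    return best, best_min

# Hypothetical example: a 4-clique around "q" bridged to an unrelated 5-clique.
near = ["q", "a", "b", "c"]
far = ["f1", "f2", "f3", "f4", "f5"]
edges = list(combinations(near, 2)) + list(combinations(far, 2))
edges += [("b", "x"), ("x", "f1")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

community, score = greedy(adj, {"q"})
print(sorted(community), score)   # ['a', 'b', 'c', 'q'] 3
```

In this example the bridge node "x" is deleted first, which separates the 5-clique; the next minimum degree is 3 and belongs to a query node, so the procedure stops.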

Time complexity?
• Greedy can be implemented in linear time.
• Idea:
  • Keep a separate list of the nodes with degree d, for d = 1, …, n
  • When a node u is removed from G, each neighbor of u with degree d is moved from list d to list d - 1, so the total number of moves is O(m) (m is the number of edges)
  • The minimum-degree node can be located in O(1) time, so the running time is O(n + m)
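
The bucket idea above can be sketched as follows, assuming the graph is an adjacency map of sets. The function name `peel_order` and the example graph are illustrative only. Sets stand in for the per-degree lists, and the search for the smallest non-empty bucket is constant time in the amortized sense: after a deletion the minimum degree can drop by at most one, so the pointer `d` moves O(n) steps in total.

```python
def peel_order(adj):
    """Return [(node, degree at deletion)] for min-degree peeling in O(n + m)."""
    n = len(adj)
    deg = {v: len(adj[v]) for v in adj}
    buckets = [set() for _ in range(n)]    # buckets[d] holds nodes of degree d
    for v, d in deg.items():
        buckets[d].add(v)
    removed = set()
    order = []
    d = 0
    for _ in range(n):
        d = max(d - 1, 0)                  # min degree drops by at most one
        while not buckets[d]:
            d += 1                         # advance to the smallest non-empty bucket
        u = buckets[d].pop()
        removed.add(u)
        order.append((u, d))
        for w in adj[u]:
            if w not in removed:
                # Move neighbor w from bucket deg[w] to bucket deg[w] - 1.
                buckets[deg[w]].remove(w)
                deg[w] -= 1
                buckets[deg[w]].add(w)
    return order

# Hypothetical example: a 4-clique with one pendant node "p" attached to "q".
adj = {
    "q": {"a", "b", "c", "p"},
    "a": {"q", "b", "c"},
    "b": {"q", "a", "c"},
    "c": {"q", "a", "b"},
    "p": {"q"},
}
order = peel_order(adj)
print([d for _, d in order])   # [1, 3, 2, 1, 0]
```

Each edge causes at most one bucket move per endpoint deletion, which is where the O(m) bound on moves comes from.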

Generalization to monotone functions
• The minimum-degree function is a member of the family of node-monotone functions.
• Sometimes we want other functions from this family to define density.

Problem definition
• Problem 3: Given an undirected (connected) graph G = (V, E), a set of query nodes Q, a node-monotone function f, and a number d as the distance bound, find the densest subgraph H = (V_H, E_H) of G such that:
  • V_H contains Q (all query nodes must be included)
  • H is connected
  • D_Q(H) <= d
  • f(H) is maximized among all feasible choices of H (the larger the better)
• Problem 3 replaces the goodness function with a node-monotone function.

GreedyGen
• Greedy algorithm, steps:
  1. Set G_0 = G.
  2. Instead of deleting the minimum-degree node, delete the node v for which f(G, v) is minimum, together with all its edges; go to 2.
• Termination condition, either:
  • At least one of the query nodes in Q has the minimum f(G, v)
  • The query nodes in Q are no longer connected
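
A minimal sketch of the generalized peeling follows. It assumes the node function is passed as a callable `f(adj, alive, v)` that scores node v in the subgraph induced by `alive`; this signature, the helper names, and the example graph are choices made here, not taken from the paper. The sketch also leaves out the distance constraint D_Q(H) <= d, which the slides do not define.

```python
from collections import deque
from itertools import combinations

def reachable(adj, alive, start):
    """Nodes reachable from `start` using only nodes in `alive` (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def greedy_gen(adj, query, f):
    """Peel the node with the lowest f-score; return (best node set, best score)."""
    query = set(query)
    alive = set(adj)
    best, best_score = None, float("-inf")
    while True:
        comp = reachable(adj, alive, next(iter(query)))
        if not query <= comp:
            break                               # query nodes are no longer connected
        scores = {v: f(adj, alive, v) for v in alive}
        lowest = min(scores.values())
        if lowest > best_score:
            best, best_score = comp, lowest
        if any(scores[q] == lowest for q in query):
            break                               # a query node has the minimum f
        u = min((v for v in alive if scores[v] == lowest), key=str)
        alive.remove(u)
    return best, best_score

def degree(adj, alive, v):
    """Degree of v inside the subgraph induced by `alive`."""
    return len(adj[v] & alive)

# Hypothetical example: a 4-clique around "q" bridged to an unrelated 5-clique.
near = ["q", "a", "b", "c"]
far = ["f1", "f2", "f3", "f4", "f5"]
edges = list(combinations(near, 2)) + list(combinations(far, 2))
edges += [("b", "x"), ("x", "f1")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

community, score = greedy_gen(adj, {"q"}, degree)
print(sorted(community), score)   # ['a', 'b', 'c', 'q'] 3
```

With `degree` as the node function, the sketch reduces to peeling by minimum degree; other scoring functions can be plugged in the same way.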

Communities with Size Restriction
• Drawback of the previous algorithms
  • They may return very large subgraphs.

Complexity
• Formal definition: minimum degree with an upper bound on the size
  • An integer k (size constraint)
  • Subgraph H has at most k nodes
• The problem is NP-hard

Algorithm
• Two heuristics that can be used to find communities with bounded size
• Inspired by the Greedy algorithm for maximizing the minimum degree
• GreedyDist, GreedyFast

Algorithm
• GreedyDist
  • The tighter the distance constraint is, the smaller the communities are

Algorithm
• GreedyDist
  • Invoke GreedyGen
  • If the query nodes are connected but the size constraint is not satisfied, re-execute GreedyGen with a tighter distance constraint
  • Repeat until the size constraint is satisfied or the query nodes are disconnected
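
The outer loop described above might look like the sketch below. Several things here are assumptions rather than details from the slides: `run_greedy_gen` is a hypothetical callable standing in for GreedyGen with distance bound d (returning a node set, or None when the query nodes end up disconnected), the bound is tightened by one unit per round, and None is returned when no community of size at most k is found.

```python
def greedy_dist(run_greedy_gen, d, k):
    """Re-run a distance-bounded solver with tighter bounds until size <= k."""
    while d >= 0:
        community = run_greedy_gen(d)
        if community is None:
            return None            # query nodes disconnected: give up
        if len(community) <= k:
            return community       # size constraint satisfied
        d -= 1                     # assumption: tighten the bound by one unit
    return None

# Toy stand-in for GreedyGen (NOT the real algorithm): on a path 0..10 with
# query node 5, return every node within distance d of the query node.
def ball_on_path(d):
    return {v for v in range(11) if abs(v - 5) <= d}

print(sorted(greedy_dist(ball_on_path, 4, 5)))   # [3, 4, 5, 6, 7]
```

In the toy run, bounds 4 and 3 give 9 and 7 nodes, both above k = 5; bound 2 gives 5 nodes and is accepted.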

Algorithm
• GreedyFast
  • Preprocess: the input graph is restricted to the k′ closest nodes to the query nodes
  • Execute Greedy on the restricted graph
  • Intuition: the closer a node is to the query nodes, the more related it is to them and the more likely it is to belong to their community
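
The preprocessing step can be sketched as follows. The slides do not say how closeness is measured, so this sketch assumes hop distance to the nearest query node, computed with a multi-source breadth-first search; it also assumes k′ is at least the number of query nodes. The function names and the example graph are illustrative.

```python
from collections import deque

def closest_nodes(adj, query, k_prime):
    """Return the k_prime nodes nearest to any query node (ties broken arbitrarily)."""
    query = list(query)
    seen = set(query)
    order = list(query)            # BFS discovers nodes in nondecreasing distance
    queue = deque(query)
    while queue and len(order) < k_prime:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                order.append(w)
                queue.append(w)
                if len(order) == k_prime:
                    break
    return set(order[:k_prime])

def restrict(adj, keep):
    """Subgraph induced by `keep`, as a new adjacency map."""
    return {v: adj[v] & keep for v in keep}

# Hypothetical example: a path 0 - 1 - ... - 6 with query node 3.
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 6} for i in range(7)}
keep = closest_nodes(adj, [3], 5)
small_graph = restrict(adj, keep)
print(sorted(keep))   # [1, 2, 3, 4, 5]
```

Any implementation of Greedy can then be run on `small_graph` instead of the full graph, which bounds the work by the size of the restricted graph.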

Experimental Evaluation
• DBLP
  • A coauthorship graph extracted from a recent snapshot of the DBLP database
  • 226K nodes, 1.4M edges
• Tag
  • A tag graph extracted from the Flickr photo-sharing portal
  • 38K nodes, 1.3M edges
• BIOMINE
  • A graph extracted from the database of the Biomine project
  • 16K nodes, 491K edges

Quantitative Results
• BASELINE: a simple and natural baseline algorithm
• |Q|: the number of query nodes
• d: distance bound
• k: size bound
• l: inter-distance between query nodes

Conclusion
• Aim: find a compact, densely connected community that contains the given query nodes
• Measurement based on constraints
  • Minimum degree
  • Distance
  • Size
• Heuristics
  • GreedyGen
  • GreedyDist
  • GreedyFast