200 likes | 465 Views
Discovery-Driven Graph Summarization. ICDE 2010. Ning Zhang , Yuanyuan Tian , Jignesh M. Patel University of Wisconsin-Madison, IBM Almaden Research Center, USA. Presented by Sung Eun, Park 9/26/2010. Intelligent Database Systems Lab. School of Computer Science & Engineering
E N D
Discovery-Driven Graph Summarization ICDE 2010 Ning Zhang , Yuanyuan Tian , Jignesh M. Patel University of Wisconsin-Madison, IBM Almaden Research Center, USA Presented by Sung Eun, Park 9/26/2010 Intelligent Database Systems Lab. School of Computer Science & Engineering Seoul National University Center for E-Business Technology Seoul National University Seoul, Korea
Contents • Introduction & Preliminaries • k-SNAP Summarization • Efficient Aggregation for Graph Summarization(Sigmod’08) • Categorization of Numerical Attributes • CANAL algorithm • Try to Merge Groups and Select Cutoffs • Automatic Discovery of Interesting Summaries • Measuring Interestingness of Summaries • Automatic Discovery of Interesting Summaries • Experimental Results • Conclusion
Introduction • Large Graph datasets are ubiquitous! • Graph summarization can assist in uncovering useful insights about the patterns hidden in the underlying data • Proposed Graph Summarization Approach (previous work)
Preliminaries • Proposed Graph Summarization Approach (previous work) • A summary graph by grouping nodes based on user-selected node attributes and relationships DB DB AI a3 a1 a2 5->LP 10->MP 30->HP DB AI OS DB AI a6 a4 a5 DB a2 a3 AI a5 23->HP 18->MP 28->HP a6 10->MP 30->HP 8->LP 28->HP
Preliminaries • Proposed Graph Summarization Approach (previous work) • SNAP : Nodes of each group are homogeneous with respect to user-selected attributes and relationships. • May result in a large number of small groups, in the worst case each node may end up an individual group. • k-SNAP : users can control the number of groups in the summary graph as k. → <k-SNAP: Top-down approach> < k-SNAP : different resoultion>
Preliminaries • k-SNAP • Low/Moderately/Highly Cited groups • Link represents the participation rate Strong relationship Weak relationship • participation rate=
Preliminaries • Δ-measure • How different it is to a hypothetical “ideal summary” • Given a graph G, the Δ-measure of a grouping of nodes Φ = {G1, G2, ..., Gk} is defined as follows: • Small Δ value indicates good summary For Every pair of groups , sum… Differences to the ideal summary
Introduction • k-SNAP • Produces summaries which themselves are also graphs • Two limitations • k-SNAP uses categorical node attributes, but even for domain experts, providing clear cutoffs is tricky and not always possible. →introduces CANAL alg’m that automatically categorize numerical attributes values based on both the attributes values and the link structures of nodes in the graph • k-SNAP allows summaries with different resolutions, but users may have to go through a large number of summaries until some interesting summaries are found. → propose a measure to assess the interestingness of summaries
Automatic Discovery of Interesting Summaries • CANAL algorithm • Input: Graph G, Numerical node attribute value a, Desired number category : C • Intuition : Find cutoffs that increases Δ value the most. • Every iteration.. Until there is only one group left • Pick the adjacent pair that has the most similar relationship pattern to the other groups • Merge the pair and calculate Δ increases after • Pick C-1 cutoffs that has the biggest Δ increases Calculate Δ increases nodes that containssame attribute value G1 G2 G3 G4 Gk Numerical node attribute Value … …
Experimental Setting • Datasets • DBLP DB Dataset : Bibliography data • Coauthorship graph(undirected) • NODE : authors, 7,445 nodes • EDGE : coauthorship, 19,971 edges • Attirbute : publication number • CiteSeer Dataset • Citation graphs(directed) • NODE : article and a directed • EDGE : a citation • Attirbute : Number of citations
Experimental Result • Effectiveness of the CANAL algorithm • cutoffs generated by CANAL vs. cutoffs manually selected in the previous work • The cutoffs produced by CANAL results in Δ/k values that are very close to the manually selected ones. Δ measure (good summary has small value)
Experimental Result • Efficiency of the CANAL Algorithm • Execution time of the CANAL algorithm nicely scales with increasing data sizes • C=3, note that different C values do not significantly affect the run time of the CANAL algorithm, because all the cutoff candidates are considered.
Experimental Result • Effectiveness of the Interestingness Measure • Interestingness of summaries
Experimental Result • Two types of Interesting summaries Overall Summary • Conciseness is relatively small : k=4 • Coverage : the nodes participating to the strong relationship ↑& - • Diversity : ↑ as much as LP of size 2680 & -
Experimental Result Only strongly collaborate to HP • Two types of Interesting summaries Informative Summary Only show weak relationship • Conciseness : a little bigger than “overall summary” • Coverage : ↑ as much as new LP group of size 1261,- • Diversity : ↑ as much as new LP group of size 1261, -
Conclusion • Overcome two key limitations in the previous work • introduces CANAL algorithm that automatically categorize numerical attributes values based on both the attributes values and the link structures of nodes in the graph • propose a measure to assess the interestingness of summaries
Q&A Thank you