Discovery-Driven Graph Summarization

Discovery-Driven Graph Summarization ICDE 2010 Ning Zhang , Yuanyuan Tian , Jignesh M. Patel University of Wisconsin-Madison, IBM Almaden Research Center, USA Presented by Sung Eun, Park 9/26/2010 Intelligent Database Systems Lab. School of Computer Science & Engineering Seoul National University Center for E-Business Technology Seoul National University Seoul, Korea

Contents • Introduction & Preliminaries • k-SNAP Summarization • Efficient Aggregation for Graph Summarization(Sigmod’08) • Categorization of Numerical Attributes • CANAL algorithm • Try to Merge Groups and Select Cutoffs • Automatic Discovery of Interesting Summaries • Measuring Interestingness of Summaries • Automatic Discovery of Interesting Summaries • Experimental Results • Conclusion

Introduction • Large Graph datasets are ubiquitous! • Graph summarization can assist in uncovering useful insights about the patterns hidden in the underlying data • Proposed Graph Summarization Approach (previous work)

Preliminaries • Proposed Graph Summarization Approach (previous work) • A summary graph by grouping nodes based on user-selected node attributes and relationships DB DB AI a3 a1 a2 5->LP 10->MP 30->HP DB AI OS DB AI a6 a4 a5 DB a2 a3 AI a5 23->HP 18->MP 28->HP a6 10->MP 30->HP 8->LP 28->HP

Preliminaries • Proposed Graph Summarization Approach (previous work) • SNAP : Nodes of each group are homogeneous with respect to user-selected attributes and relationships. • May result in a large number of small groups, in the worst case each node may end up an individual group. • k-SNAP : users can control the number of groups in the summary graph as k. → <k-SNAP: Top-down approach> < k-SNAP : different resoultion>

Preliminaries • k-SNAP • Low/Moderately/Highly Cited groups • Link represents the participation rate Strong relationship Weak relationship • participation rate=

Preliminaries • Δ-measure • How different it is to a hypothetical “ideal summary” • Given a graph G, the Δ-measure of a grouping of nodes Φ = {G1, G2, ..., Gk} is defined as follows: • Small Δ value indicates good summary For Every pair of groups , sum… Differences to the ideal summary

Introduction • k-SNAP • Produces summaries which themselves are also graphs • Two limitations • k-SNAP uses categorical node attributes, but even for domain experts, providing clear cutoffs is tricky and not always possible. →introduces CANAL alg’m that automatically categorize numerical attributes values based on both the attributes values and the link structures of nodes in the graph • k-SNAP allows summaries with different resolutions, but users may have to go through a large number of summaries until some interesting summaries are found. → propose a measure to assess the interestingness of summaries

Automatic Discovery of Interesting Summaries • CANAL algorithm • Input: Graph G, Numerical node attribute value a, Desired number category : C • Intuition : Find cutoffs that increases Δ value the most. • Every iteration.. Until there is only one group left • Pick the adjacent pair that has the most similar relationship pattern to the other groups • Merge the pair and calculate Δ increases after • Pick C-1 cutoffs that has the biggest Δ increases Calculate Δ increases nodes that containssame attribute value G1 G2 G3 G4 Gk Numerical node attribute Value … …

Automatic Discovery of Interestingness Summaries

Experimental Setting • Datasets • DBLP DB Dataset : Bibliography data • Coauthorship graph(undirected) • NODE : authors, 7,445 nodes • EDGE : coauthorship, 19,971 edges • Attirbute : publication number • CiteSeer Dataset • Citation graphs(directed) • NODE : article and a directed • EDGE : a citation • Attirbute : Number of citations

Experimental Result • Effectiveness of the CANAL algorithm • cutoffs generated by CANAL vs. cutoffs manually selected in the previous work • The cutoffs produced by CANAL results in Δ/k values that are very close to the manually selected ones. Δ measure (good summary has small value)

Experimental Result • Efficiency of the CANAL Algorithm • Execution time of the CANAL algorithm nicely scales with increasing data sizes • C=3, note that different C values do not significantly affect the run time of the CANAL algorithm, because all the cutoff candidates are considered.

Experimental Result • Effectiveness of the Interestingness Measure • Interestingness of summaries

Experimental Result • Two types of Interesting summaries Overall Summary • Conciseness is relatively small : k=4 • Coverage : the nodes participating to the strong relationship ↑& - • Diversity : ↑ as much as LP of size 2680 & -

Experimental Result Only strongly collaborate to HP • Two types of Interesting summaries Informative Summary Only show weak relationship • Conciseness : a little bigger than “overall summary” • Coverage : ↑ as much as new LP group of size 1261,- • Diversity : ↑ as much as new LP group of size 1261, -

Conclusion • Overcome two key limitations in the previous work • introduces CANAL algorithm that automatically categorize numerical attributes values based on both the attributes values and the link structures of nodes in the graph • propose a measure to assess the interestingness of summaries

Q&A Thank you

Discovery-Driven Graph Summarization

Discovery-Driven Graph Summarization

Presentation Transcript

Ontology Summarization Based on RDF Sentence Graph

Summarization

Data Driven Discovery in Science

Summarization of ego-centric video -Object driven Vs. Story driven

Summarization

LexRank : Graph-based Centrality as Salience in Text Summarization

Summarization

Summarization

Summarization

Dewey + DDC-driven resource discovery

SUMMARIZATION

Summarization

ONTOLOGY-DRIVEN DISCOVERY OF SCIENTIFIC COMPUTATIONAL ENTITIES

Unsupervised Constraint Driven Learning for Transliteration Discovery

Dynamic Faceted Search for Discovery-driven Analysis

Graph Summarization with Bounded Error

Summarization

ODD-Genes: Accelerating data-driven scientific discovery

Dewey + DDC-driven resource discovery

On Triangulation-based Dense Neighbourhood Graph Discovery

Video summarization by video structure analysis and graph optimization