TopK Interesting Subgraph Discovery in Information Networks

Manish Gupta, Jing Gao, Xifeng Yan, Hasan Cam, Jiawei Han
gmanish@microsoft.com

Real World Problems
• Team Selection (Organization Networks): Interestingness = Highest Historical Compatibility
• Network Bottlenecks Discovery (Computer Networks): Interestingness = Lowest Bandwidth
• Suspicious Relationships Discovery (Social Networks): Interestingness = Highest Negative Association Strength of Attribute Values
• Resource Allocation (Battlefield Networks): Interestingness = Lowest Distance between Entities

The Basic Underlying Problem
• Team Selection: Interestingness = Highest Historical Compatibility
• Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
• Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
• Resource Allocation: Interestingness = Lowest Distance
• Given
  • Edge-weighted Typed Network G
  • Typed Subgraph Query Q
  • Edge Interestingness measure
• Find
  • TopK matching subgraphs

Naïve Solution: Ranking After Matching
[Figure: an example edge-weighted typed Network G and a Query Q; the Matching step lists every match of Q in G, and the Ranking step then orders those matches by score.]
Why compute all matches? We need only top-2!
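The ranking-after-matching baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes interestingness is the sum of matched edge weights, uses brute-force enumeration, and runs on a small hypothetical graph (the slide's figure is not recoverable). The names `ranking_after_matching`, `node_type`, `weight`, `q_types`, and `q_edges` are made up for this example.

```python
import heapq
from itertools import permutations

def ranking_after_matching(node_type, weight, q_types, q_edges, k):
    """Enumerate every match of the typed query, then rank by total edge weight."""
    def w(u, v):
        # Undirected lookup: the edge may be stored in either direction.
        return weight.get((u, v), weight.get((v, u)))

    matches = []
    for nodes in permutations(node_type, len(q_types)):
        # Matching step: node types must agree with the query node types.
        if any(node_type[n] != t for n, t in zip(nodes, q_types)):
            continue
        ws = [w(nodes[i], nodes[j]) for i, j in q_edges]
        if all(x is not None for x in ws):
            matches.append((round(sum(ws), 6), nodes))
    # Ranking step: only now are the k best matches selected.
    return heapq.nlargest(k, matches), len(matches)

# Hypothetical toy network (not the graph from the slide).
node_type = {1: "A", 2: "A", 3: "B", 4: "A"}
weight = {(1, 2): 0.8, (2, 3): 0.7, (1, 3): 0.2, (2, 4): 0.5, (4, 3): 0.9}

# Path query A - A - B: query nodes 0, 1, 2 and query edges (0,1), (1,2).
top2, total = ranking_after_matching(node_type, weight, ["A", "A", "B"], [(0, 1), (1, 2)], k=2)
```

All four matches are materialized even though only two are returned, which is the cost the slide's question points at.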

Our Contributions
• New notion: TopK interesting subgraph detection in information networks
• Three new low-cost indexes
  • Graph topology index
  • Sorted edge lists
  • Graph maximum metapath weight index
• Novel top-K algorithm to answer interestingness queries on large graphs
• Detailed effectiveness and efficiency validation on several synthetic and real datasets

Relationship with Previous Work
• Subgraph matching
  • Approximate: fuzzy node/edge similarity
  • Exact: matching without ranking
  • RDF graphs, probabilistic graphs, temporal graphs
• TopK querying on graphs
  • H-hop aggregate queries
  • Keyword queries on RDF graphs
  • K most frequent patterns
  • Twig queries

System Overview
• Offline Index Construction (inputs: Network G, Distance D)
  • Sort Edges → Sorted Edge Lists
  • Breadth First Traversal from each Node up to Distance D → Graph Topology Index
  • Graph Maximum MetaPath Weight Index
• Online Query Processing (input: Query Q)
  • Find Candidate Nodes → Candidate Nodes
  • Top-K Computation → Top-K Subgraphs

Index Structures
G=(V,E), B = average number of neighbors, T = number of types
[Figure: example edge-weighted Network G with 13 nodes of types A, B, and C.]
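One of the three indexes, the sorted edge lists, can be sketched as below. The grouping key (an unordered pair of endpoint types) and the names `build_sorted_edge_lists`, `node_type`, and `weight` are assumptions made for illustration; the slides only state that edges are sorted.

```python
from collections import defaultdict

def build_sorted_edge_lists(node_type, weight):
    """Group edges by the pair of endpoint types, highest weight first."""
    lists = defaultdict(list)
    for (u, v), w in weight.items():
        # Undirected graph assumed, so the type pair is stored in sorted order.
        key = tuple(sorted((node_type[u], node_type[v])))
        lists[key].append((w, u, v))
    for key in lists:
        lists[key].sort(reverse=True)
    return dict(lists)

# Hypothetical toy network (not the graph from the slide).
node_type = {1: "A", 2: "A", 3: "B", 4: "A"}
weight = {(1, 2): 0.8, (2, 3): 0.7, (1, 3): 0.2, (2, 4): 0.5, (4, 3): 0.9}

edge_lists = build_sorted_edge_lists(node_type, weight)
```

With the lists in descending order, the first entry of each list doubles as an upper bound on any edge of that type pair, which is what the later scoring slides rely on.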

Find Candidate Nodes
[Figure: the topology of Query Q (node 1: B, nodes 2, 3, 4: A) is compared against the Graph Topology Index to find candidate nodes.]
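A rough sketch of candidate filtering with a topology index follows. It assumes the index stores, per node, how many nodes of each type lie within distance d for d up to D, and that a graph node is a candidate for a query node when its type matches and each of its counts is at least the query node's count. The exact contents of the authors' index are not given on the slides, so this layout and the names `topology_index` and `candidate_nodes` are hypothetical.

```python
from collections import Counter, deque

def topology_index(adj, node_type, max_dist):
    """For each node, count nodes of each type within distance 1..max_dist (BFS)."""
    index = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_dist:
                continue  # do not expand beyond distance D
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        counts = Counter()
        for v, d in dist.items():
            if v == src:
                continue
            # Cumulative counts: a node at distance d is also within d+1, d+2, ...
            for within in range(d, max_dist + 1):
                counts[(within, node_type[v])] += 1
        index[src] = counts
    return index

def candidate_nodes(g_index, g_type, q_index, q_type):
    """Keep graph nodes whose type matches and whose counts dominate the query node's."""
    return {
        q: sorted(n for n, have in g_index.items()
                  if g_type[n] == q_type[q]
                  and all(have[key] >= c for key, c in need.items()))
        for q, need in q_index.items()
    }

# Hypothetical toy network and a path query A - A - B.
g_adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3, 5], 5: [4]}
g_type = {1: "A", 2: "A", 3: "B", 4: "A", 5: "A"}
q_adj = {0: [1], 1: [0, 2], 2: [1]}
q_type = {0: "A", 1: "A", 2: "B"}

D = 2
cands = candidate_nodes(topology_index(g_adj, g_type, D), g_type,
                        topology_index(q_adj, q_type, D), q_type)
```

In this toy graph, node 5 is dropped as a candidate for the middle query node because it has no type-B neighbor.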

Finding and Scoring Matches: Key Idea
[Flowchart: Top-K Computation for Query Q. Steps: Start; Generate a Size-1 Candidate; Compute Actual and UB Score; Grow Candidates; Update Heap (Top-K Heap); Compute Max UB Score; Done. Decision points: More valid edges?; TopK Quit?; Candidate Size == |Q|?]
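The grow-and-prune loop from the flowchart can be illustrated with a heavily simplified sketch. It handles only path queries, grows candidates node by node from the first query node (the slides start from size-1 edge candidates instead), and bounds the remaining edges by the best weight per type pair. The function `topk_paths` and the toy data are hypothetical.

```python
import heapq

def topk_paths(adj_w, node_type, q_types, k):
    """Top-k matches of a typed path query, pruning with an edge-based upper bound."""
    # Best weight per unordered type pair, used as a bound for unmatched query edges.
    best = {}
    for u, nbrs in adj_w.items():
        for v, w in nbrs.items():
            key = frozenset((node_type[u], node_type[v]))
            best[key] = max(best.get(key, 0.0), w)
    m = len(q_types) - 1  # number of query edges
    # suffix_ub[i] bounds the total weight of query edges i, i+1, ..., m-1.
    suffix_ub = [0.0] * (m + 1)
    for i in range(m - 1, -1, -1):
        suffix_ub[i] = suffix_ub[i + 1] + best.get(frozenset(q_types[i:i + 2]), 0.0)

    heap, pruned = [], 0  # min-heap of (score, match)

    def grow(path, score):
        nonlocal pruned
        i = len(path) - 1  # query edges matched so far
        if len(heap) == k and score + suffix_ub[i] < heap[0][0]:
            pruned += 1  # prune (partial) or discard (fully grown)
            return
        if i == m:
            heapq.heappush(heap, (score, tuple(path)))
            if len(heap) > k:
                heapq.heappop(heap)
            return
        for v, w in adj_w[path[-1]].items():
            if v not in path and node_type[v] == q_types[i + 1]:
                grow(path + [v], score + w)

    for n in adj_w:
        if node_type[n] == q_types[0]:
            grow([n], 0.0)
    return sorted(heap, reverse=True), pruned

# Hypothetical toy network and the path query A - A - B.
node_type = {1: "A", 2: "A", 3: "B", 4: "A"}
adj_w = {1: {2: 0.8, 3: 0.2}, 2: {1: 0.8, 3: 0.7, 4: 0.5},
         3: {1: 0.2, 2: 0.7, 4: 0.9}, 4: {2: 0.5, 3: 0.9}}
top2, pruned = topk_paths(adj_w, node_type, ["A", "A", "B"], k=2)
```

Unlike the baseline, this version can skip a candidate once the heap holds k matches and the candidate's bound falls below the smallest heap score.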

Finding and Scoring Matches: Generating Size-1 Candidates
[Figure: size-1 candidates generated for Query Q, followed by candidate growth.]
• Special cases: query edge with both endpoints of same type; multiple query edges of the same type
• Order: (5,9), (3,4), (4,5), (2,3), (2,7), …
• Decisions during candidate growth: Prune? Grow? Heapify? Discard?

Finding and Scoring Matches: Actual Score and Upper Bound Score
[Figure: candidate growth example with Prune/Grow and Heapify/Discard decisions, using the useful edge lists.]
• Actual Score = 0.9
• UB Score = 0.9 + UB(non-considered edges) = 0.9 + (0.6 + 0.6) = 2.1
• Partially grown candidate
  • Prune if UBScore < min(heap)
  • Grow otherwise
• Fully grown candidate
  • Discard if UBScore < min(heap)
  • Update heap otherwise
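The scoring rule and the four decisions on this slide can be written down in a few lines. The sketch assumes the bound for each non-considered query edge is the top weight of its edge list; the list contents and the heap minimums used below are hypothetical, and only the numbers 0.9, 0.6, and 2.1 come from the slide.

```python
def upper_bound_score(actual, remaining_edge_lists):
    """Actual score plus the largest weight available for each non-considered edge."""
    return actual + sum(weights[0] for weights in remaining_edge_lists)

def decide(ub_score, fully_grown, heap_min):
    """Apply the slide's rules for partially and fully grown candidates."""
    if ub_score < heap_min:
        return "discard" if fully_grown else "prune"
    return "update heap" if fully_grown else "grow"

# Slide example: actual score 0.9 and two remaining edges bounded by 0.6 each.
# The tails of the lists (0.5, 0.3 and 0.4) are made-up filler values.
remaining = [[0.6, 0.5, 0.3], [0.6, 0.4]]
ub = upper_bound_score(0.9, remaining)  # approximately 2.1

# Hypothetical heap minimums to show both outcomes for a partial candidate.
decision_low_heap = decide(ub, fully_grown=False, heap_min=2.0)
decision_high_heap = decide(ub, fully_grown=False, heap_min=2.2)
```

For a fully grown candidate there are no remaining edges, so the bound passed to `decide` is simply its actual score.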

Finding and Scoring Matches: Global Top-K Quit
[Figure: example Network G and Query Q.]
• K=2
• TopK Heap: (4,3,2,7): 2.2, (3,4,5,6): 2.2
• Stop: 0.7 + 0.6 + 0.7 = 2.0 < 2.2
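The global quit test can be sketched as a single comparison. The sketch assumes the three weights on the slide (0.7, 0.6, 0.7) are the largest still-unprocessed weights for the three query edges, which the slide does not state explicitly; `should_quit` is a hypothetical helper name.

```python
import heapq

def should_quit(heap, k, next_edge_weights):
    """True when even the best unseen candidate cannot beat the current k-th score."""
    if len(heap) < k:
        return False  # the heap is not full yet, so every match is still useful
    max_unseen_ub = sum(next_edge_weights)
    return max_unseen_ub < heap[0][0]  # heap[0] is the smallest score in a min-heap

# Heap contents taken from the slide (K=2).
heap = [(2.2, (4, 3, 2, 7)), (2.2, (3, 4, 5, 6))]
heapq.heapify(heap)

quit_now = should_quit(heap, 2, [0.7, 0.6, 0.7])  # bound of about 2.0 against 2.2
```

Once this returns true, no remaining candidate needs to be generated at all, which is a stronger saving than pruning candidates one at a time.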

Faster Query Processing using Graph Maximum MetaPath Weight Index
[Figure: a query, a partial instantiation (partial candidate) of it, and the paths chosen to cover the non-considered edges.]
• Edge-based bound: UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)
• Using paths to cover non-considered edges: UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)
• Slight complication, with some edges to consider separately: UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)
• Using MMW Index!

Faster Query Processing using Graph Maximum MetaPath Weight Index
• K=2
• TopK Heap: (8,9,5,6): 2.1, (5,9,8,7): 2.0
• Edge-based UBScore = 0.9 + 0.8 + 0.7 = 2.4 > 2.0 → Grow
• Path-based UBScore (MMW Index) = 0.9 + UB(5-A-B) = 0.9 + 0.9 = 1.8 < 2.0 → Prune
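The gap between the edge-based and the path-based bound can be reproduced on a small example. The sketch assumes an MMW entry is the maximum total weight of a simple path that starts at a given node and follows a given sequence of node types, and it computes that value on the fly rather than reading it from a prebuilt index. The toy graph is hypothetical and was chosen only so that the arithmetic matches the slide (2.4 versus 1.8).

```python
def max_metapath_weight(adj_w, node_type, start, metapath):
    """Max total weight of a simple path from `start` whose next nodes follow `metapath`."""
    best = None

    def dfs(u, i, total, seen):
        nonlocal best
        if i == len(metapath):
            best = total if best is None else max(best, total)
            return
        for v, w in adj_w[u].items():
            if v not in seen and node_type[v] == metapath[i]:
                dfs(v, i + 1, total + w, seen | {v})

    dfs(start, 0, 0.0, {start})
    return best  # None when no such path exists

def max_edge_weight(adj_w, node_type, t1, t2):
    """Largest weight of any edge between types t1 and t2, anywhere in the graph."""
    return max(w for u, nbrs in adj_w.items() for v, w in nbrs.items()
               if {node_type[u], node_type[v]} == {t1, t2})

# Hypothetical toy network: the heavy A-B edge (4-5) is not reachable from node 1.
node_type = {1: "A", 2: "A", 3: "B", 4: "A", 5: "B"}
adj_w = {1: {2: 0.8}, 2: {1: 0.8, 3: 0.1}, 3: {2: 0.1}, 4: {5: 0.7}, 5: {4: 0.7}}

actual, heap_min = 0.9, 2.0  # actual score of the partial candidate; current min(heap)
edge_ub = actual + max_edge_weight(adj_w, node_type, "A", "A") \
                 + max_edge_weight(adj_w, node_type, "A", "B")
path_ub = actual + max_metapath_weight(adj_w, node_type, 1, ["A", "B"])
```

The edge-based bound combines the best A-A and A-B edges even though they do not form a path from node 1, so it stays above min(heap); the path-based bound is tighter and allows the candidate to be pruned.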

Discussions
• Queries with multiple edge semantics
• Directed graphs
• Homogeneous networks
• Weighted query edges
  • Weights signify expected amount of interestingness
  • Weights signify importance of query edge
• Faster computations versus index size

Low-cost Index Structures

Faster Query Execution
[Figures: Query Execution Time (msec) for Path Queries, Clique Queries, and Subgraph Queries (Graph G2 and indexes with D=2).]
• RAM: Ranking After Matching baseline
• RWM0: without the candidate node filtering
• RWM1: without the MMW index
• RWM2: same as RWM1, without pruning any partially grown candidates
• RWM3: same as RWM1, without the global top-K quit check
• RWM4: same as RWM1, but with the MMW index

Good Scalability
Good scalability thanks to effective pruning.
[Figures: Running Time (msec) for Different Query Sizes and Graph Sizes (D=2); Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes; Query Execution Time for Different Values of K.]

Real Dataset Case Studies
[Figure: the four queries used in the case studies.]
• Q1: 1: Author, 2: Conf, 3: Author
• Q2: 1: Author, 2: Conf, 3: Author, 4: Keyword
• Q3: 1: Person, 2: Film, 3: Person
• Q4: 1: Person, 2: Company, 3: Person, 4: Settlement

Real Dataset Case Studies
• DBLP
  • 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
    • Rohit Gupta -- computer networking
    • Vipin Kumar -- Data and Information Systems
    • BICoB -- International Conference on Bioinformatics and Computational Biology
  • 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
    • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
    • "mining" -- Data and Information Systems
    • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking

Real Dataset Case Studies
• Wikipedia
  • 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
    • Stacy Keach and John Huston starred in the movie “The Biggest Battle”
    • Stacy Keach (American), John Huston (American), movie is Italian
    • Stacy Keach (narration, comedy, music), John Huston (drama, documentary, adventure), movie (war)
  • 1: Medha Patkar, 2: BBC, 3: Felix D’Alviella, 4: Mogilino
    • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
    • Felix D’Alviella -- Belgian actor in the BBC soap opera Doctors
    • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria’s Abandoned Children" in 2007
    • A British company rewarding an Indian woman, covering a place in Bulgaria, or being linked to a person from Belgium is rare

Related Work (1)
• Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]
• Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]
• Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]

Related Work (2)
• Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]
• Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]
• Top-K queries
  • h-hop aggregate queries [Yan et al., 2010]
  • K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
  • Top-K keyword queries on RDF graphs [Tran et al., 2009]
  • Top-K similarity queries [Zou et al., 2007]
  • Twig queries [Gou and Chirkova, 2008]

Conclusion
• Given
  • Typed unweighted query
  • A heterogeneous edge-weighted information network
  • Edge interestingness measure
• Find
  • Top-K interesting subgraphs
• Investigated ranking after matching baseline
• Proposed three new graph indexes and exploited them for building a top-K solution
• Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets

Thanks!
