gApprox: Mining Frequent Approximate Patterns from a Massive Network

Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han [ICDM 2007]
Reporter: Che-Wei, Liang (10/16)

Outline
• Introduction
• Problem Formulation
• Algorithm
• Pattern Space Exploration
• Support Counting
• Experiment
• Conclusions

Introduction
• A set of graphs vs. a single network
• Many applications now produce graphs with massive sizes and complex structures, e.g., biological networks, social networks, and the Web.
• These graphs demand powerful data mining methods.
• The focus here is on patterns that frequently appear at many different places of a single network.

Introduction
[Figure: example from a Protein-Protein Interaction (PPI) network; Δ = degree of approximation = 5]

Two major complications
1. Mining frequent patterns in a single network
• Partition it into regions
• Each region contains one occurrence of the pattern
2. Due to inherent noise and data diversity, it is crucial to account for approximations so that all potentially interesting patterns can be captured.


Problem Formulation

Approximate Pattern Occurrences
• Injective function m: V_P → V_G, mapping each vertex v ∈ V_P to m(v) ∈ V_G.
• Quantify the degree of approximation that m incurs; approximations can only happen within the matchable list.
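The two conditions on this slide (m is injective, and label substitutions stay within the matchable list) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the dictionary layout, the labels "A" to "D", the helper names, and in particular `label_mismatches`, which simply counts substituted labels and is not the approximation measure defined in the paper.

```python
def is_injective(mapping):
    """True if no two pattern vertices map to the same network vertex."""
    return len(set(mapping.values())) == len(mapping)

def respects_matchable_lists(mapping, pattern_labels, network_labels, matchable):
    """True if every image carries the pattern vertex's own label or a label
    from that label's matchable list (hypothetical dict: label -> set of labels)."""
    for v, image in mapping.items():
        own = pattern_labels[v]
        allowed = matchable.get(own, set()) | {own}
        if network_labels[image] not in allowed:
            return False
    return True

def label_mismatches(mapping, pattern_labels, network_labels):
    """Illustrative stand-in only: number of vertices whose label was substituted."""
    return sum(pattern_labels[v] != network_labels[u] for v, u in mapping.items())

# Hypothetical toy data: a two-vertex pattern and a three-vertex network.
pattern_labels = {"p1": "A", "p2": "B"}
network_labels = {10: "A", 11: "C", 12: "D"}
matchable = {"B": {"C"}}          # assumption: "B" may be approximated by "C"

good = {"p1": 10, "p2": 11}       # "B" -> "C" is inside the matchable list
bad = {"p1": 10, "p2": 12}        # "B" -> "D" is not allowed

if __name__ == "__main__":
    print(is_injective(good),
          respects_matchable_lists(good, pattern_labels, network_labels, matchable),
          label_mismatches(good, pattern_labels, network_labels))
```

The sketch deliberately leaves out edge-level checks, since the slide does not say how they enter the measure.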


Pattern Support with Approximation


Algorithm
• Two major issues:
1. Pattern Space Exploration
2. Support Counting
   • Enumerate approximate occurrences of each pattern in the network.
   • Decide the maximal number of disjoint occurrences.

Pattern Space Exploration
• Decompose the pattern space:
• Find all connected vertex sets in G that contain vertex 1.
• Remove vertex 1 from G, and find all connected vertex sets in the new graph G’ that contain vertex 2.
• And so on.

Pattern Space Exploration
Example: generating all connected vertex sets starting from vertex 1.
Stage 1. Start from 1 and mark 1.
Stage 2. Expand from 1 to reach 2, 5, 6. Mark 2, 5, 6. There are seven connected vertex sets in total at this stage: {1,2}, {1,5}, {1,6}, {1,2,5}, {1,2,6}, {1,5,6}, {1,2,5,6}.
Stage 3. Taking each of the seven connected vertex sets from stage 2 as a starting point, continue the expansion.
Stage 4. Repeat until there are no more unmarked vertices.
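The decomposition by starting vertex and the staged expansion with marking can be sketched in Python as follows. This is a minimal illustration, not the paper's Explore() routine: the function names are made up, and the example graph `adj` is hypothetical apart from vertex 1 being adjacent to 2, 5 and 6 as in the example. Marks are kept per branch, so a vertex that was reachable at some stage but not chosen is never reconsidered in that branch.

```python
from itertools import combinations

# Hypothetical example graph (adjacency sets); only the neighbours of
# vertex 1 are taken from the slide, the rest is assumed.
adj = {1: {2, 5, 6}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {1, 4}, 6: {1}}

def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def expand(adj, current, frontier, marked):
    """Yield `current`, then every connected set grown from it stage by stage."""
    yield current
    # Unmarked vertices reachable from the vertices added in the last stage.
    candidates = set()
    for v in frontier:
        candidates |= adj[v] - marked
    if not candidates:
        return
    new_marked = marked | candidates          # mark all of them for this branch
    for chosen in nonempty_subsets(candidates):
        yield from expand(adj, current | chosen, chosen, new_marked)

def connected_sets_containing(adj, root, removed=frozenset()):
    """All connected vertex sets that contain `root` and avoid `removed`."""
    start = frozenset([root])
    yield from expand(adj, start, start, start | removed)

def all_connected_sets(adj):
    """Decompose by smallest vertex: sets containing 1, then (without 1) 2, ..."""
    removed = set()
    for root in sorted(adj):
        yield from connected_sets_containing(adj, root, frozenset(removed))
        removed.add(root)

if __name__ == "__main__":
    for s in connected_sets_containing(adj, 1):
        print(sorted(s))
```

Each connected set containing the root is produced exactly once because it corresponds to one sequence of stages: the vertices at distance 1, 2, 3, ... from the root inside the set itself.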

Theorem 1
Explore() in Algorithm 1 is both complete and redundancy-free, i.e., given a network G:
(1) it only generates connected vertex sets in G;
(2) it can generate all connected vertex sets in G;
(3) it does not generate the same connected vertex set more than once.

Support Counting
• A pattern P’s support is defined as the maximum number of “disjoint” occurrences that can be chosen from P’s approximate occurrences in the network.
• Computing it exactly corresponds to the maximum independent set problem, which is NP-complete.
• Algorithm 2 provides an upper bound on the support.
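To make the support definition concrete, here is a small Python sketch that brackets the support of a toy pattern. It is not Algorithm 2 from the paper. It assumes that occurrences are given as non-empty vertex sets and that "disjoint" means sharing no vertex; the three function names are hypothetical. The exact routine is a brute-force search that is only usable on tiny inputs, which is the reason a cheap bound is needed in practice.

```python
from itertools import combinations

def exact_support(occurrences):
    """Largest number of pairwise vertex-disjoint occurrences (exponential time)."""
    occs = [frozenset(o) for o in occurrences]
    for k in range(len(occs), 0, -1):
        for subset in combinations(occs, k):
            if all(a.isdisjoint(b) for a, b in combinations(subset, 2)):
                return k
    return 0

def greedy_lower_bound(occurrences):
    """Pick occurrences greedily (smallest first); the result never exceeds the support."""
    used, count = set(), 0
    for occ in sorted(map(frozenset, occurrences), key=len):
        if used.isdisjoint(occ):
            used |= occ
            count += 1
    return count

def vertex_count_upper_bound(occurrences):
    """Simple bound: disjoint occurrences cannot use more vertices than are covered.
    Assumes every occurrence is non-empty."""
    occs = [frozenset(o) for o in occurrences]
    if not occs:
        return 0
    covered = set().union(*occs)
    return len(covered) // min(len(o) for o in occs)

# Hypothetical occurrences of a three-vertex pattern, overlapping in a chain.
occurrences = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 9}]

if __name__ == "__main__":
    print(greedy_lower_bound(occurrences),
          exact_support(occurrences),
          vertex_count_upper_bound(occurrences))
```

On the toy data the three values satisfy lower bound ≤ support ≤ upper bound. A miner that prunes patterns below a support threshold can safely prune on an upper bound without discarding any frequent pattern.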


gApprox
• Combines pattern space exploration with support counting.
• Conditional branch on the 3rd line of Algorithm 1’s DFS_horizontal() function.

Experiment

Conclusions
• Give an approximation measure and show its impact on mining.
• Count a pattern’s support based on its approximate occurrences in the network.
• The techniques are general and can be applied to networks from other domains.
• The method can be modified to reach bigger, more interesting patterns even faster, with some sacrifice in the completeness of the mining results.
