Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism Jun Huan, Wei Wang, Jan Prins ICDM 2003
Outline • Introduction • Canonical Adjacency Matrix • Join, Extension and Suboptimal CAMs • SCAM Tree • Conclusion
Introduction • Mining patterns from graph databases is challenging because graph-related operations, such as subgraph testing, generally have higher time complexity than the corresponding operations on itemsets, sequences, and trees. • The frequent subgraph mining problem is to find all frequent subgraphs in a graph database (those whose support, i.e. the number of database graphs containing them, meets a given threshold). • Two challenging problems: • Subgraph isomorphism testing • An efficient scheme for enumerating all frequent subgraphs
Introduction • In this paper, we develop FFSM (Fast Frequent Subgraph Mining), which targets efficient subgraph testing and a better candidate subgraph enumeration scheme. Its key features: (1) a novel graph canonical form and two efficient candidate-proposing operations, FFSM-Join and FFSM-Extension; (2) a graph framework (the suboptimal CAM tree) that guarantees all frequent subgraphs are enumerated unambiguously; (3) avoiding subgraph isomorphism testing by maintaining an embedding set for each frequent subgraph
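The embedding-set idea in feature (3) can be illustrated with a minimal sketch. The slides do not describe FFSM's actual data structure, so everything here is an assumption made for illustration: the `Embedding` record, the `extend_embeddings` helper, and the toy two-graph database are all hypothetical. The point is only that, once each pattern carries a list of where it occurs, support is a simple count and a grown pattern is checked only against graphs already known to contain its parent.

```python
from collections import namedtuple

# Hypothetical record: which database graph the pattern occurs in, and which
# graph node each pattern node maps to (pattern node k -> nodes[k]).
Embedding = namedtuple("Embedding", ["graph_id", "nodes"])

def support(embeddings):
    """Number of distinct database graphs that contain the pattern."""
    return len({e.graph_id for e in embeddings})

def extend_embeddings(embeddings, graphs, src, edge_label, node_label):
    """Embeddings of the pattern grown by one new node attached to pattern node src.

    graphs[g] = (node_labels, adj) with adj[u] = {v: edge_label}.
    Only graphs that already contain the parent pattern are visited.
    """
    result = []
    for e in embeddings:
        node_labels, adj = graphs[e.graph_id]
        u = e.nodes[src]
        for v, lbl in adj[u].items():
            if lbl == edge_label and node_labels[v] == node_label and v not in e.nodes:
                result.append(Embedding(e.graph_id, e.nodes + (v,)))
    return result

# Toy database (hypothetical): graph 0 is a -x- b -y- b, graph 1 is a -x- b.
graphs = {
    0: (["a", "b", "b"], {0: {1: "x"}, 1: {0: "x", 2: "y"}, 2: {1: "y"}}),
    1: (["a", "b"], {0: {1: "x"}, 1: {0: "x"}}),
}

# Embeddings of the one-edge pattern a -x- b (pattern node 0 = a, node 1 = b).
edge_pattern = [Embedding(0, (0, 1)), Embedding(1, (0, 1))]
grown = extend_embeddings(edge_pattern, graphs, src=1, edge_label="y", node_label="b")

print(support(edge_pattern))  # 2
print(support(grown))         # 1
```

No subgraph isomorphism test is run against the database here: the grown pattern's occurrences are derived directly from its parent's embeddings.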
Canonical Adjacency Matrix (CAM) • In FFSM, every graph is represented by an adjacency matrix M: (1) each diagonal entry of M holds the label of the corresponding node; (2) each off-diagonal entry holds the label of the corresponding edge, or zero if there is no edge.
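As a small sketch of this representation, the snippet below builds such a matrix from node and edge labels. The helper name `adjacency_matrix`, the single-character labels, and the four-node example graph are assumptions chosen for illustration, not taken from the paper.

```python
def adjacency_matrix(node_labels, edges):
    """Build a labeled adjacency matrix as a list of rows of 1-char strings.

    node_labels: one label per node (list index = node id).
    edges: dict mapping an undirected (i, j) node pair to its edge label.
    """
    n = len(node_labels)
    m = [["0"] * n for _ in range(n)]      # "0" marks "no edge"
    for i, label in enumerate(node_labels):
        m[i][i] = label                    # diagonal: node labels
    for (i, j), label in edges.items():
        m[i][j] = label                    # off-diagonal: edge labels,
        m[j][i] = label                    # mirrored because edges are undirected
    return m

# Illustrative graph: one node labeled a, three labeled b.
labels = ["a", "b", "b", "b"]
edges = {(0, 1): "x", (0, 2): "x", (1, 2): "y", (1, 3): "y", (2, 3): "y"}
M = adjacency_matrix(labels, edges)
for row in M:
    print(" ".join(row))
# a x x 0
# x b y y
# x y b y
# 0 y y b
```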
CAM • Given an n x n adjacency matrix M of a graph G with n nodes • Define the code of M, denoted code(M) • code(M): the sequence of lower-triangular entries of M (including the diagonal entries) in the order m_{1,1}, m_{2,1}, m_{2,2}, …, m_{n,1}, m_{n,2}, …, m_{n,n-1}, m_{n,n} • We use the standard lexicographic order on sequences to define a total order on codes • The canonical adjacency matrix (CAM) of G is the adjacency matrix of G whose code is maximal among the codes of all possible adjacency matrices of G
[Figure] Top: code(M1) = “axbxyb0yyb” ≥ code(M2) = “axb0ybxyyb”. Bottom: [panel not recovered]
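The code and canonical form defined above can be sketched in a few lines. This is a brute-force illustration that tries every node ordering, not the method FFSM uses. The matrices M1 and M2 are reconstructed from the two code strings quoted for the figure. The symbol order is Python's default character order (`0` < `a` < `b` < `x` < `y`), which is an assumption: the slides do not state the label order, and a different order would select a different matrix as canonical.

```python
from itertools import permutations

def code(m):
    """Concatenate the lower-triangular entries of m row by row, diagonal included."""
    return "".join(m[i][j] for i in range(len(m)) for j in range(i + 1))

def permute(m, order):
    """Reorder the nodes of m: new node k is old node order[k]."""
    return [[m[a][b] for b in order] for a in order]

def canonical_code(m):
    """Brute force: the maximal code over all n! node orderings (small graphs only)."""
    return max(code(permute(m, p)) for p in permutations(range(len(m))))

# M1 as decoded from the string "axbxyb0yyb"; M2 swaps the last two nodes.
M1 = [list("axx0"), list("xbyy"), list("xyby"), list("0yyb")]
M2 = permute(M1, (0, 1, 3, 2))

print(code(M1))                                  # axbxyb0yyb
print(code(M2))                                  # axb0ybxyyb
print(code(M1) >= code(M2))                      # True
print(canonical_code(M1) == canonical_code(M2))  # True
```

Because M1 and M2 describe the same graph under different node orderings, they have different codes but the same canonical code, which is what makes the canonical form usable as a graph identifier.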
Join, Extension and Suboptimal CAMs • Current methods for enumerating all subgraphs fall into two categories: join and extension • Join: a single join may produce multiple candidates, and a candidate may be proposed redundantly by many join operations • Extension: the challenge is to restrict the nodes that a newly introduced edge may attach to • To achieve efficient subgraph enumeration: (1) Can we design a join operation such that every distinct CAM is generated only once? (2) Can we improve the join operation so that only a few (at most two) CAMs are generated from a single join? (3) Can we design an extension operation such that every new edge can be attached to only one node of a graph represented by its CAM?
Join, Extension and Suboptimal CAMs • To tackle these challenges, we augment the CAM tree with a set of suboptimal CAMs and introduce two new operations: FFSM-Join and FFSM-Extension
Join, Extension and Suboptimal CAMs • At the bottom of Fig-2 we show a case in which a graph might be proposed redundantly by FSG C(6, 2) = 15 times. As shown in the figure, FFSM-Join completely removes the redundancy after “sorting” the subgraphs by their CAMs. • Suboptimal CAM (SCAM), definition: given a graph G, a suboptimal CAM of G is an adjacency matrix M of G whose maximal proper submatrix is the CAM of the graph that submatrix represents. P.S. a proper SCAM is a SCAM that is not itself the CAM of the graph it represents.
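The count of 15 can be reproduced with a short calculation. The reading below is an assumption, since the slide gives only the number: a candidate with six edges has six subgraphs that each omit one edge, and if all six are distinct, connected, and frequent, any pair of them can be joined to propose the same candidate.

```python
from itertools import combinations
from math import comb

k = 6  # assumed number of edges in the candidate graph

# Identify each (k-1)-edge subgraph by the single edge it leaves out.
sub_patterns = list(range(k))

# Any two distinct (k-1)-edge subgraphs share a (k-2)-edge core, so each
# unordered pair is one join that proposes the same k-edge candidate.
pairs = list(combinations(sub_patterns, 2))

print(len(pairs), comb(k, 2))  # 15 15
```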
SCAM Tree • All SCAMs of a graph G can be organized as a tree
SCAM Tree • The SCAM tree is “complete” in the sense that every node can be enumerated by either a join or an extension operation.
Conclusion • This paper presents a new algorithm, FFSM, which introduces two operations and a graph framework to reduce the number of redundant candidates • Experiments demonstrate that FFSM achieves a performance gain over gSpan • gSpan: builds a new lexicographic order among graphs and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts depth-first search to mine frequent subgraphs efficiently.