Seminar 2009. Frequent Subgraph/Substructure Mining. Lei Shi, Department of Computer Science and Engineering, State University of New York at Buffalo. Outline: Introduction, Apriori-based Subgraph Mining, Pattern Growth Subgraph Mining, Summary. Graphs are everywhere.
Seminar 2009 Frequent Subgraph/ Substructure Mining Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo
Outline • Introduction • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary
Graph Mining Problems • Graph Pattern Mining • Frequent subgraph pattern mining • Pattern summarization • Optimal graph patterns • Graph patterns with constraints • Approximate graph patterns …. • Graph Classification • Graph clustering • Important node identification • Bridge and hub identification • Other Important Topics • Graph compression • Graph model • Social network analysis.
Subgraph Pattern Mining • Frequent subgraph • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of subgraph pattern mining • Mining biochemical structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis.
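The support definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited systems: the graph encoding (a vertex-label dict plus an edge-label dict keyed by `frozenset`), the helper names `make_graph`, `contains`, and `support`, and the toy dataset are all assumptions made for the example. Containment is tested by brute force over vertex mappings, which is only practical for very small graphs.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: vertex -> label; edges: list of (u, v, edge_label) for an undirected graph."""
    return labels, {frozenset((u, v)): lbl for u, v, lbl in edges}

def contains(graph, pattern):
    """True if `pattern` is subgraph-isomorphic to `graph` (brute-force search)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue  # vertex labels must match
        # Every pattern edge must exist in the graph with the same label.
        if all(g_edges.get(frozenset(mapping[x] for x in e)) == lbl
               for e, lbl in p_edges.items()):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

# Hypothetical toy dataset: a path, a star, and a triangle over labels A, B, C.
abc = {0: "A", 1: "B", 2: "C"}
dataset = [
    make_graph(abc, [(0, 1, "x"), (1, 2, "x")]),
    make_graph(abc, [(0, 1, "x"), (0, 2, "x")]),
    make_graph(abc, [(0, 1, "x"), (1, 2, "x"), (0, 2, "x")]),
]
ab = make_graph({0: "A", 1: "B"}, [(0, 1, "x")])
bc = make_graph({0: "B", 1: "C"}, [(0, 1, "x")])

min_sup = 3
print(support(dataset, ab), support(dataset, ab) >= min_sup)  # 3 True
print(support(dataset, bc), support(dataset, bc) >= min_sup)  # 2 False
```

Support here counts graphs, not embeddings: a pattern that occurs several times inside one graph still contributes 1.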
Frequent Subgraph Example [Figure: a graph dataset over vertex labels A, B, and C, and three subgraphs (1), (2), and (3) with supports 1, 3, and 3.]
Key Challenges in Subgraph Mining • Graph isomorphism • to detect if two graphs are identical in structure • Graph representation (Canonical Labeling) • A canonical label is a unique code of a given graph. • Canonical label should be the same no matter how graphs are represented, as long as graphs have the same topological structure and the same labeling of edges and vertices. • Subgraph candidate generation • generate candidate frequent subgraphs from datasets
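To make the canonical labeling requirement concrete, here is a minimal sketch that tries every vertex ordering and keeps the lexicographically smallest code. It is a brute-force illustration under assumed conventions (string labels, `"0"` for a missing edge, the hypothetical name `canonical_label`), not the optimized procedure of any of the systems listed in these slides. Its cost grows factorially with the number of vertices.

```python
from itertools import permutations

def canonical_label(labels, edges):
    """labels: vertex -> label (str); edges: frozenset({u, v}) -> edge label (str).

    The code for one vertex ordering is the vertex labels followed by the
    upper triangle of the adjacency matrix, with "0" marking a missing edge.
    The canonical label is the smallest such code over all orderings.
    """
    best = None
    for order in permutations(labels):
        code = [labels[v] for v in order]
        for i in range(len(order)):
            for j in range(i + 1, len(order)):
                code.append(edges.get(frozenset((order[i], order[j])), "0"))
        code = tuple(code)
        if best is None or code < best:
            best = code
    return best

# Two descriptions of the same labeled path A -x- B -y- A, with different vertex ids.
g1 = ({0: "A", 1: "B", 2: "A"},
      {frozenset((0, 1)): "x", frozenset((1, 2)): "y"})
g2 = ({"p": "A", "q": "A", "r": "B"},
      {frozenset(("p", "r")): "y", frozenset(("q", "r")): "x"})
# A different graph: both edges carry label x.
g3 = ({0: "A", 1: "B", 2: "A"},
      {frozenset((0, 1)): "x", frozenset((1, 2)): "x"})

print(canonical_label(*g1) == canonical_label(*g2))  # True
print(canonical_label(*g1) == canonical_label(*g3))  # False
```

With canonical labels in hand, checking whether two candidate subgraphs are duplicates reduces to comparing two tuples.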
Subgraph Mining Approaches • Apriori-based • AGM/AcGM: Inokuchi et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01). M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001 • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04) • FTOSM: Horvath et al. (KDD’06) • Pattern growth based • Subdue: Holder et al. (KDD’94) • MoFa: Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02). Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721 • Gaston: Nijssen and Kok (KDD’04) • CMTreeMiner: Chi et al. (TKDE’05) • LEAP: Yan et al. (SIGMOD’08)
Outline • Introduction and Background • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary
Apriori-based Approach • FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov. 2001. • Flattened representation as canonical labeling • Apriori-based method to generate subgraph candidates
Graph Representation in FSG • Flattened Representation
Graph Representation in FSG • Flattened Representation • Lexicographic (dictionary) order
Apriori-based method • Apriori Property • If a graph is frequent, all of its subgraphs are frequent. • Candidate Generation • Create a set of candidates of size k+1 - by joining two frequent k-subgraphs - that contain the same (k-1)-subgraph - one join may result in several candidates of size k+1
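The Apriori property is what justifies pruning during candidate generation. Written out as a short derivation (the notation sup(g) for the support of g in dataset D and σ for the minimum support threshold is assumed here, not taken from the slides):

```latex
% Every graph that contains g' also contains each subgraph g of g',
% so support can only shrink as a pattern grows:
g \subseteq g' \;\Longrightarrow\; \mathrm{sup}(g) \ge \mathrm{sup}(g')

% Contrapositive used for pruning: an infrequent k-subgraph rules out
% every (k+1)-candidate that contains it.
\mathrm{sup}(g) < \sigma \;\wedge\; g \subseteq g' \;\Longrightarrow\; \mathrm{sup}(g') < \sigma
```

In practice this means a size-(k+1) candidate only needs its support counted if all of its size-k subgraphs are already known to be frequent.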
Apriori-based method • Candidate generation example
Apriori-based method • FlowChart
Apriori-based method • Experimental Results - Chemical compound dataset containing 340 compounds and 24 different atoms (vertices)
Outline • Introduction • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary
Motivation of gSpan • Weaknesses of the Apriori-based approach • Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly. • Pruning false positives requires subgraph isomorphism tests, and subgraph isomorphism is NP-complete, so pruning is expensive. • gSpan: Graph-Based Substructure Pattern Mining • Changes the way a graph is represented (DFS code, from Depth First Search) • Uses pattern growth to generate new subgraph candidates.
gSpan: Graph-Based Substructure Pattern Mining • DFS (Depth First Search) Code • First Step: traverse the graph depth-first and represent it by its edges in the order they are visited. • Second Step: DFS Lexicographic Order • Pattern Growth subgraph generation
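DFS code • An edge is represented by a 5-tuple (i, j, l_i, l_(i,j), l_j): the DFS discovery indices of its two endpoints, the label of the first endpoint, the edge label, and the label of the second endpoint.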
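A small sketch of how one DFS code can be produced from a labeled graph. The adjacency encoding, the function name `dfs_code`, the rule of visiting neighbours in sorted order, and the example graph are assumptions for illustration. A graph has many DFS codes, one per traversal; this sketch emits just one of them and does not search for the minimum code.

```python
def dfs_code(labels, adj, start):
    """labels: vertex -> label; adj: vertex -> {neighbour: edge label} (undirected).

    Returns a list of 5-tuples (i, j, label_i, edge_label, label_j), where i and j
    are DFS discovery indices. Forward edges have i < j, backward edges have i > j.
    """
    index = {start: 0}   # discovery index of each visited vertex
    used = set()         # undirected edges already written to the code
    code = []

    def emit(v, w):
        used.add(frozenset((v, w)))
        code.append((index[v], index[w], labels[v], adj[v][w], labels[w]))

    def visit(v):
        # Backward edges first: from v to already-discovered vertices,
        # in ascending order of their discovery index.
        back = sorted((index[w], w) for w in adj[v]
                      if w in index and frozenset((v, w)) not in used)
        for _, w in back:
            emit(v, w)
        # Then forward edges to neighbours that are still undiscovered.
        for w in sorted(adj[v]):
            if w not in index:
                index[w] = len(index)
                emit(v, w)
                visit(w)

    visit(start)
    return code

# Hypothetical example: a triangle a-b-c with a tail c-d.
labels = {"a": "X", "b": "Y", "c": "X", "d": "Z"}
adj = {
    "a": {"b": "p", "c": "p"},
    "b": {"a": "p", "c": "q"},
    "c": {"a": "p", "b": "q", "d": "r"},
    "d": {"c": "r"},
}
for edge in dfs_code(labels, adj, "a"):
    print(edge)
# (0, 1, 'X', 'p', 'Y')
# (1, 2, 'Y', 'q', 'X')
# (2, 0, 'X', 'p', 'X')
# (2, 3, 'X', 'r', 'Z')
```

The third tuple is a backward edge closing the triangle; it is written immediately after vertex 2 is discovered and before the forward edge that grows the tail.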
DFS code • Second Step: DFS Lexicographic Order
Pattern Growth Approach • Pattern Growth (free extension)
Pattern Growth Approach • Duplicate Graphs
Pattern Growth Approach • Free extension
Pattern Growth Approach • Rightmost extension
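Rightmost extension restricts where a pattern may grow: backward edges start only at the rightmost vertex (the last one discovered), and forward edges start only at vertices on the rightmost path (the path from the root to the rightmost vertex). The sketch below computes those positions from a DFS code. The function names and the example code are assumptions, and labels for the new edges are left out because in a real miner they come from the data.

```python
def rightmost_path(code):
    """Discovery indices from the root (0) to the rightmost vertex of a DFS code."""
    if not code:
        return [0]
    cur = max(max(i, j) for i, j, *_ in code)   # rightmost vertex
    path = [cur]
    for i, j, *_ in reversed(code):
        if i < j and j == cur:                  # forward edge that discovered `cur`
            path.append(i)
            cur = i
    return path[::-1]

def rightmost_extensions(code):
    """Return (backward, forward) lists of (i, j) index pairs where an edge may be added."""
    path = rightmost_path(code)
    rm = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in code}
    # Backward: from the rightmost vertex to a vertex on the rightmost path.
    backward = [(rm, v) for v in path[:-1] if frozenset((rm, v)) not in existing]
    # Forward: from any vertex on the rightmost path to a brand-new vertex.
    forward = [(v, rm + 1) for v in reversed(path)]
    return backward, forward

# Hypothetical DFS code: triangle 0-1-2 plus a branch 1-3.
code = [
    (0, 1, "X", "p", "Y"),
    (1, 2, "Y", "q", "X"),
    (2, 0, "X", "p", "X"),
    (1, 3, "Y", "r", "Z"),
]
print(rightmost_path(code))        # [0, 1, 3]
print(rightmost_extensions(code))  # ([(3, 0)], [(3, 4), (1, 4), (0, 4)])
```

Vertex 2 is not on the rightmost path, so no extension touches it. Free extension would allow edges there too and would generate many more duplicate graphs.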
Pattern Growth Approach • Examples (cont.)
Pattern Growth Approach • Experimental results on chemical data • 340 molecules, with 66 atom types and 4 bond types as labels • On average only 27 vertices and 28 edges per graph
Summary • Graph representation: flattened representation vs. DFS code • Generation of candidate patterns: Apriori vs. pattern growth
Frequent Graph Pattern • Given a graph dataset D, find subgraph g, s.t. freq(g) ≥ θ, where freq(g) is the percentage of graphs in D that contain g and θ is the minimum support threshold. • Problem 1: Exponential Pattern Set • Problem 2: Threshold Setting
Difference between frequent itemset and frequent subgraph discovery
Framework of Subgraph Mining Algorithms • Search order: breadth vs. depth; complete vs. incomplete • Generation of candidate patterns: Apriori vs. pattern growth • Discovery order of patterns: DFS order; path, tree, graph • Elimination of duplicate subgraphs: passive vs. active • Support calculation: store embeddings or not
Frequent Subgraph Examples:
Outline • Introduction and Background • Apriori-based Subgraph Mining • Pattern Growth Subgraph Mining • Summary DFS code Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721