This paper presents a method for mining diverse substructures from a large class-labelled graph database using backbone refinement classes.
05/2009
Large-Scale Graph Mining Using Backbone Refinement Classes
Andreas Maunz (1), Christoph Helma (1,2), and Stefan Kramer (3)
1) FDM Universität Freiburg (D) 2) in-silico toxicology Basel (CH) 3) Technische Universität München (D)
BACKBONE REFINEMENT CLASS MINING
Efficient mining of diverse substructures from a large class-labelled graph database
BBRC Rationale
Figure: typical substructure frequencies for databases of small molecules.
Trees are the most frequent substructure type, yet they can be enumerated efficiently. However:
• Excessively large result sets are obtained even under high correlation and minimum frequency constraints.
BBRC Definitions
GASTON (GrAph, Sequence and Tree ExtractiON) by Nijssen and Kok [1]:
• Backbone of a tree: the longest path with the lowest sequence (assuming a canonical sequence ordering).
• Since every tree has exactly one backbone, backbones partition the partial order of trees into disjoint parts.
• Pre-order (depth-first) traversal is used within each partition to refine structures.
Backbone Refinement Class (BBRC): all tree refinements growing from a specific backbone.
1 Nijssen S. & Kok J.N.: “A Quickstart in Frequent Structure Mining can make a Difference”, KDD ’04: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM 2004: 647–652.
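The backbone definition above can be illustrated with a minimal Python sketch. It is not GASTON's implementation: it assumes node labels only (edge labels are ignored), takes "lowest sequence" to mean the lexicographically smallest label sequence read in either direction, and uses a brute-force search over all node pairs. The names `path_between` and `backbone` are hypothetical.

```python
from itertools import permutations


def path_between(adj, start, goal):
    """Return the unique node path from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            # In a tree it is enough to never step back to the previous node.
            if len(path) < 2 or nxt != path[-2]:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def backbone(adj, labels):
    """Longest path; ties are broken by the smallest label sequence (assumed ordering)."""
    if len(adj) == 1:
        return list(adj)
    best_key, best_path = None, None
    for u, v in permutations(adj, 2):  # both reading directions are covered
        path = path_between(adj, u, v)
        key = (-len(path), [labels[n] for n in path])
        if best_key is None or key < best_key:
            best_key, best_path = key, path
    return best_path


# Toy tree: chain 0-1-2-3 with an extra leaf 4 attached to node 1.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
labels = {0: "C", 1: "C", 2: "O", 3: "N", 4: "N"}
print(backbone(adj, labels))  # -> [0, 1, 2, 3]
```

In the toy tree two longest paths exist (through node 0 or through node 4); the label sequence C, C, O, N is the smallest, so the path through node 0 is selected.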
BBRC Example
Backbone: c:c:c-C=C-O-C
Figure: refinements of this backbone grouped into Class 1 and Class 2; structures shown: C-C(-O-C)(=C-c:c:c), C-C(=C(-O-C)(-C))(-c:c:c), C-C(=C-O-C)(-c:c:c). Backbones in gray.
BBRC Properties (1)
Some Properties
• Two types of BBRCs:
  • within a backbone: not disjoint (see figure on the left)
  • across backbones: disjoint
• A given backbone spans a maximum search tree: no node may be added without changing the backbone.
• BBRCs partition the search space structurally (as opposed to occurrence-based methods, such as open/closed features).
Figure: search space for two BBRCs within the same backbone.
BBRC Properties (2)
The Number of BBRCs
Consider the special case of a rooted perfect binary tree of height h.
Figure: perfect binary tree of height 3; backbone with branches in gray.
→ The number of Backbone Refinement Classes is governed by the (recursive) branches on this backbone.
BBRC Properties (3)
The Number of BBRCs (unpublished)
The number of backbone refinement classes of a branch of length l is [formula missing].
The set of BBRCs containing the root has size [formula missing].
The full set of subtrees containing the root has size [formula missing] [1], where q ≈ 1.50284.
1 L.A. Székely and H. Wang: “On subtrees of trees”, Advances in Applied Mathematics 34(1), January 2005: 138–155.
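For orientation, the size of the full set of subtrees containing the root of a perfect binary tree can be counted directly. The sketch below does not reproduce the slide's formulas, and it says nothing about the (unpublished) BBRC counts. It assumes height is measured in edges, so height 0 is a single node, and uses the elementary recurrence f(0) = 1, f(h) = (1 + f(h-1))^2: each child subtree is either left out or contributes one of its own root-containing subtrees. Both function names are hypothetical; the brute-force variant exists only to cross-check the recurrence on small trees.

```python
from itertools import combinations


def root_subtrees_recursive(height):
    """Count subtrees containing the root via f(h) = (1 + f(h-1))**2, f(0) = 1."""
    count = 1
    for _ in range(height):
        count = (1 + count) ** 2
    return count


def root_subtrees_bruteforce(height):
    """Enumerate node subsets; heap numbering, node i has parent i // 2."""
    n = 2 ** (height + 1) - 1
    others = range(2, n + 1)  # every node except the root (node 1)
    total = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            chosen = set(subset) | {1}
            # Connected and rooted iff every chosen node's parent is chosen.
            if all(i // 2 in chosen for i in subset):
                total += 1
    return total


for h in range(4):
    print(h, root_subtrees_recursive(h))
```

The counts grow doubly exponentially in the height (1, 4, 25, 676 for heights 0 to 3), which is why a structural compression of the tree search space matters.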
BBRC Properties (4)
Summary of Feature Counts
BBRC Implementation
Idea: use paths as candidate backbones. Mine BBRCs and represent each BBRC by its most (χ²-)significant member.
• In case of several most significant members, use the most general one.
• χ² thresholds cannot be used for anti-monotonic pruning; however, an upper bound for the χ² values of the refinements of a pattern exists [1] (Statistical Metric Pruning).
Dynamic Upper Bound Pruning: the χ² threshold may be increased during depth-first traversal, since we only search for the maximal elements of classes.
1 S. Morishita and J. Sese. Traversing Itemset Lattices with Statistical Metric Pruning. In Symposium on Principles of Database Systems, pages 226–236, 2000.
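The pruning idea can be sketched in Python as follows. This is an illustration, not the libfminer code. It assumes a two-class setting with n instances, m of them positive, and a pattern occurring in x instances, y of them positive. The upper bound is written in the form commonly attributed to Morishita and Sese: a refinement can only lose occurrences, so the largest χ² is reached when either only the positive or only the negative occurrences remain. All function names and the tuple-based refinement tree are hypothetical.

```python
def chi_square(n, m, x, y):
    """Chi-square of the 2x2 table pattern present/absent vs. positive/negative."""
    a, b = y, x - y                    # pattern present: positive, negative
    c, d = m - y, (n - m) - (x - y)    # pattern absent: positive, negative
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / den


def chi_square_upper_bound(n, m, x, y):
    """Bound on chi-square for any refinement of a pattern with support (x, y)."""
    return max(chi_square(n, m, y, y), chi_square(n, m, x - y, 0))


def search_class(node, n, m, threshold, best=None, visited=None):
    """Depth-first search for the most significant member of one class.

    node is (name, x, y, children). The threshold is raised whenever a
    better member is found; ties keep the earlier, more general pattern.
    """
    name, x, y, children = node
    if visited is not None:
        visited.append(name)
    score = chi_square(n, m, x, y)
    if score >= threshold and (best is None or score > best[1]):
        best, threshold = (name, score), score
    if chi_square_upper_bound(n, m, x, y) >= threshold:  # otherwise prune
        for child in children:
            best, threshold = search_class(child, n, m, threshold, best, visited)
    return best, threshold


# Toy refinement tree for n = 100 instances, m = 50 positives.
tree = ("A", 20, 15, [("AB", 12, 12, []), ("AC", 8, 3, [("ACD", 2, 2, [])])])
seen = []
best, final_threshold = search_class(tree, 100, 50, 3.84, visited=seen)
print(best[0], seen)
```

In the toy run the threshold rises from the starting value to the score of "AB", after which the bound for "AC" falls below it and the refinement "ACD" is never visited. A static threshold would still have explored it.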
BBRC Experiments (1)
Investigation of BBRCs regarding time efficiency, feature set sizes and expressiveness
• Class-balanced CPDB datasets:
  • Salmonella Mutagenicity (SM, 388 active / 810 compounds)
  • Rat Carcinogenicity (RC, 459 active / 1145 compounds)
  • Mouse Carcinogenicity (MoC, 428 active / 927 compounds)
  • Multicell Call (MuC, 553 active / 1067 compounds)
• Significant Trees: all trees that are frequent and significant.
• Open Trees [1]: most general significant trees with the same occurrences.
• BBRC Representatives: most significant representatives of the backbone refinement classes.
1 B. Bringmann, A. Zimmermann, L. de Raedt, and S. Nijssen. Don’t Be Afraid of Simpler Patterns. In Proceedings 10th PKDD, pages 55–66. Springer-Verlag, 2006.
BBRC Experiments (2)
Feature Set Sizes
Minimum frequency: 6
BBRC Experiments (3)
Time Efficiency
Minimum frequency: 6
BBRC Experiments (4)
Instance-based predictions
Figure: Accuracy, Sensitivity, Specificity.
• all: all predictions
• AD: top 80% confidence predictions
• wt.: predictions weighted by confidence
Legend: Black: Sign. Trees; Dark Gray: BBRC-R.; Light Gray: Open Trees
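One possible reading of the three evaluation modes is sketched below in Python. The slides do not define them formally, so two assumptions are made: "AD" keeps the 80% of predictions with the highest confidence, and "wt." counts each prediction with its confidence as weight. The `Prediction` class and both functions are hypothetical, and the five predictions are made-up toy values, not results from the experiments.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    actual: bool       # true class (active = True)
    predicted: bool    # predicted class
    confidence: float  # non-negative confidence of the prediction


def rates(preds, weighted=False):
    """Accuracy, sensitivity and specificity, optionally confidence-weighted."""
    tp = fn = tn = fp = 0.0
    for p in preds:
        w = p.confidence if weighted else 1.0
        if p.actual and p.predicted:
            tp += w
        elif p.actual:
            fn += w
        elif p.predicted:
            fp += w
        else:
            tn += w

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def top_confidence(preds, fraction=0.8):
    """Keep the given fraction of predictions with the highest confidence."""
    ranked = sorted(preds, key=lambda p: p.confidence, reverse=True)
    return ranked[: int(len(ranked) * fraction)]


# Toy data only.
toy = [
    Prediction(True, True, 0.9),
    Prediction(True, False, 0.1),
    Prediction(False, False, 0.8),
    Prediction(False, True, 0.6),
    Prediction(True, True, 0.7),
]
print("all:", rates(toy))
print("AD: ", rates(top_confidence(toy)))
print("wt.:", rates(toy, weighted=True))
```

With this reading, dropping or down-weighting low-confidence predictions changes the scores only when the errors are concentrated among those predictions.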
Large-Scale Analysis (1)
NCI Yeast Anticancer Drug Screen datasets (April 2002 release):
1. AC-One (stage 0): 87,264 compounds, 12,068 active
2. AC-All (stage 0): 87,264 compounds, 5,777 active
3. AC-All (stage 1): 10,924 compounds, 5,433 active
To the best of the authors' knowledge, datasets 1 and 2 are the largest labelled datasets that have been considered in correlated graph mining.
Large-Scale Analysis (2)
Effects of Minimum Frequency on Dataset Coverage
AC-One (stage 0), 87,264 compounds. Similar results were obtained for the other datasets*.
Figure legend: BBRC descriptors are more probable in lighter regions.
* The effects of not using aromatic perception, i.e. no special node and edge labels for aromatic bonds, were much greater. The number of descriptors per compound in this setting was > 80 for both thresholds.
Large-Scale Analysis (3)
Feature Count for Balanced Datasets (downsampling)
Max. Trees: the positive border as implied by minimum frequency and significance constraints [1].
1 M. Al Hasan et al. Origami: Mining Representative Orthogonal Graph Patterns. In ICDM 2007: Seventh IEEE International Conference on Data Mining, pages 153–162, Oct. 2007.
Large-Scale Analysis (4)
Time Efficiency and Accuracy
Accuracy: all: all predictions; AD: top 80% confidence predictions; wt.: predictions weighted by confidence
Time efficiency (Mining): Open Trees: mining times of 4-12 h
Time efficiency (Prediction): Open Trees: prediction times of > 60 s, impractical RAM demand
BBRC Experiments (5)
Euclidean embedding based on Co-Occurrences and Entropy [1]
Figure legend: Active / Inactive compounds; Activating / Deactivating features
Differently colored features are nearly perfectly separated.
Features are well distributed with few clusters.
1 H. Schulz, K. Kersting, and A. Karwath. ILP, the Blind, and the Elephant: Euclidean Embedding of Co-Proven Queries. In Proceedings of the 19th International Conference on Inductive Logic Programming (ILP 2009), forthcoming.
Summary (1)
Backbone Refinement Class Representatives
• Structurally heterogeneous descriptors, compression by structural invariant (backbone constraint)
• Good dataset coverage, robust against increasing minimum frequencies
• Applicable to large-scale graph databases through a novel statistical pruning technique
Summary (2)
Backbone Refinement Class Representatives
• Compression of 90% compared to all trees and 31% compared to open trees
• Time efficiency improved by 85% and 83% versus no statistical pruning and static upper bound pruning, respectively
• Discriminative potential similar to the complete set of trees, but significantly better than open trees
Acknowledgements
The authors would like to thank Björn Bringmann for providing a binary and for friendly cooperation in dataset testing, and Ulrich Rückert for providing datasets. The research was (partially) supported by the EU Seventh Framework Programme under contract no. Health-F5-2008-200787 (OpenTox).
http://www.opentox.org
C++ implementation: http://www.maunz.de/libfminer-doc