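Efficient Maximal Pattern Discovery from Massive Geometric Graphs (Japanese title: "Fast discovery of maximal subgraphs from large-scale geometric data"). Hiroki Arimura, Graduate School of Information Science and Technology, Hokkaido University; Takeaki Uno, National Institute of Informatics; Shinichi Shimozono, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology. This work is partly supported by MEXT Grant-in-Aid for Scientific Research for Specially Promoted Research on “Semi-structured Data Mining”.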
Background • Rapid growth in both the volume and the variety of nonstandard datasets in scientific, spatial, and relational domains. • Increasing demand for efficient methods to extract useful patterns and rules from weakly structured datasets. • Graph mining…
Graph mining • Finding interesting subgraphs appearing in an input collection of labeled graphs. • One of the most promising approaches for knowledge discovery from weakly structured datasets. • The most popular approach is frequent subgraph mining [Inokuchi et al. 2000], but it can often generate a huge number of redundant subgraphs, which severely degrades both efficiency and comprehensibility. • How to cope with this problem ...
Knowledge Discovery from Geometric Data • Network data with geometric information • Chemical compounds with 2D or 3D information on their atoms and edges [Kuramochi and Karypis, ICDE’02] • City maps with infrastructure information (Geographic Information Systems, GIS) • VLSI layouts with chips and wires • Geometric graphs ...
Geometric matching • P matches Q iff P is geometrically isomorphic to a subgraph of Q • Defined through invariance under a class of “rigid” geometric transformations and graph isomorphism. [Figure: a geograph with A-labeled vertices and g-labeled edges plotted in the x-y plane]
Maximal pattern discovery problem • A maximal pattern is a geometric graph that is not included in any properly larger subgraph having the same set of occurrences in D. • The maximal subgraph mining problem asks to find all maximal patterns (closed patterns) appearing in a given input geometric graph D without repetition. • The set M of all maximal patterns is expected to be much smaller than the set F of all frequent patterns.
Difficulties in maximal pattern mining • A number of efficient maximal pattern algorithms have been proposed for sets, sequences, and graphs [3, 9, 20, 22, 25]. • Some algorithms use explicit duplicate detection and maximality tests against a collection of already discovered patterns. • These approaches require large memory and long delays, and make it difficult to use efficient search techniques such as depth-first search. • Open problem: output-polynomial time computability of the maximal pattern problem for the class of geometric graphs.
Related work: Graph mining • Frequent subgraph mining: • AGM [Inokuchi, Washio, Motoda, PKDD’00] • TreeMiner [Zaki, KDD’02] • Freqt [Asai et al., SDM’02] • NK [Nijssen & Kok, MGTS’03] • Maximal/closed subgraph mining: • CloseGraph [Yan & Han, KDD’03] • CMTreeMiner [Chi, Yang, Xia, Muntz, PAKDD’04] • Dryade [Termier, Rousset, Sebag, ICDM’04] • CloATT [Arimura & Uno, ILP’05] • Combination with machine learning: • XRules [Zaki & Aggarwal, KDD’03] • Weighted Substructure Mining [Tsuda & Kudo, ICML’06]
Related work: Maximal/closed pattern mining • The first: flexible patterns • Classes of “elastic” or “flexible” patterns • Polynomial delay and space algorithms are developed for classes where a very simple “reverse search property” holds • CMTreeMiner [Chi et al., PAKDD’04], BIDE [Wang & Han, ICDE’04], and MaxFlex [Arimura & Uno, LLLL’07] • The second: rigid patterns • Deals with mining of “rigid” patterns which have ... • Polynomial delay and space algorithms based on the existence of least general generalization or closure-like operations • LCM [Uno et al., FIMI’03, ’04, DS’04] proposes ppc-extension for maximal sets, followed by CloATT [Arimura & Uno, ILP’05] and MaxMotif [Arimura & Uno, ISAAC’05] • The third: others • Heuristic algorithms • CloseGraph [25]: frequent pattern discovery augmented with a maximality test and duplicate detection • Difficult to achieve output-polynomial time computability
Def: Enumeration Algorithms • Efficient data mining algorithms = output-polynomial time algorithms
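For reference, the standard complexity notions for enumeration algorithms can be written as below. The symbols N (total input size), K (number of patterns output), T_total, T_delay, and S are notation introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
\text{output-polynomial time:}\quad & T_{\mathrm{total}} \le \mathrm{poly}(N, K) \\
\text{polynomial delay:}\quad & T_{\mathrm{delay}} \le \mathrm{poly}(N)
  \quad\text{(time between two consecutive outputs)} \\
\text{polynomial space:}\quad & S \le \mathrm{poly}(N)
\end{align*}
```

Polynomial delay implies output-polynomial time, since K outputs separated by poly(N) steps take at most K · poly(N) steps in total.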
Algorithm MaxGeo • A time- and space-efficient algorithm for mining all maximal geometric subgraphs • Depth-first search over the space of all maximal geometric subgraphs • To do this ... • Achieves polynomial delay and polynomial space for the first time
We develop techniques... • A polynomial-time computable canonical code for all geometric graphs, which is invariant under geometric transformations. • Characterization of M by the intersection operation (the least general generalization), and then a polynomial-time computable closure operation for geographs • The tree-shaped search route T for all maximal patterns in G • A new pattern growth technique combining reverse search and closure extension
Main result • Theorem: • Given an input geometric graph D, algorithm MaxGeo enumerates all frequent maximal patterns P in M without duplicates in $O(m(m+n)\,\|D\|^2 \log \|D\|) = O(n^8 \log n)$ time per pattern and in $O(m) = O(n^2)$ space, • where m is the maximum number of occurrences of a pattern other than trivial patterns, n is the number of vertices in D, and ||D|| is the number of vertices and edges in D. • Corollary: • The maximal pattern enumeration problem is solvable in polynomial delay and polynomial space in the total input size.
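The step from the bound in m and ||D|| to the bound in n can be spelled out as follows. This is a sketch that takes m = O(n^2) from the space bound stated in the theorem and assumes D is a simple graph, so that it has at most n(n-1)/2 edges.

```latex
\begin{align*}
m &= O(n^2), \qquad \|D\| = n + |E| \le n + \tbinom{n}{2} = O(n^2), \\
m(m+n) &= O(n^2 \cdot n^2) = O(n^4), \\
\|D\|^2 \log \|D\| &= O(n^4 \log n), \\
m(m+n)\,\|D\|^2 \log \|D\| &= O(n^4 \cdot n^4 \log n) = O(n^8 \log n).
\end{align*}
```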
Geometric graph (Geograph) • A vertex- and edge-labeled graph G = (V, E, l, m; c) • Having vertex labels l(v) and edge labels m(e), which represent geometric features and their relationships • Whose vertices v have coordinates c(v) in the 2D plane $\mathbb{R}^2$ • Example: a vertex v in V with l(v) = A and c(v) = (2.5, 1.0); an edge e in E with m(e) = g • Alphabets: $\Sigma_V = \{A, B, C\}$, $\Sigma_E = \{a, b\}$ [Figure: geograph G plotted in the x-y plane]
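A minimal sketch of how a geograph G = (V, E, l, m; c) might be stored. The class name `Geograph`, its field names, and the example coordinates other than (2.5, 1.0) are assumptions made for illustration, not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class Geograph:
    """Hypothetical container for a geograph G = (V, E, l, m; c)."""
    coords: dict   # vertex id -> (x, y), the coordinate map c
    vlabels: dict  # vertex id -> vertex label, the map l
    elabels: dict  # (u, v) vertex id pair -> edge label, the map m

    def size(self):
        # ||G||: the number of vertices plus the number of edges
        return len(self.coords) + len(self.elabels)

# A small example: three A-labeled vertices joined by two g-labeled edges.
G = Geograph(
    coords={1: (1.0, 1.0), 2: (2.5, 1.0), 3: (2.0, 2.0)},
    vlabels={1: "A", 2: "A", 3: "A"},
    elabels={(1, 2): "g", (2, 3): "g"},
)
```

Here `G.size()` counts 3 vertices and 2 edges, matching the definition of ||D|| used in the main theorem.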
Basics in Geometry • $\mathbb{R}^2$: the 2-dimensional Euclidean space • The set $\mathbb{R}^2$ of all points p = (x, y) (x, y: real numbers) • ||x||: the norm of a vector x • ||x - y||: the distance between x and y • c x: a scalar multiple • x + y: the addition of vectors • Ax: the product of a matrix A and a 2-vector x • det(A): the determinant of matrix A • $A^{-1}$: the inverse of matrix A • $f : \mathbb{R}^2 \to \mathbb{R}^2$: a geometric transformation • f(x) = Ax + b: an affine transformation
Geometric Isomorphism • Geograph P is geometrically isomorphic to Q iff there exists some F in $T_{geo}$ such that F(P) = Q • Class $T_{geo}$ of geometric transformations: any combination F of: • Translation M • Rotation R • Scaling S
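A small sketch of one such transformation, assuming 2D points are encoded as complex numbers so that a combination of rotation, scaling, and translation becomes f(z) = a·z + b. The helper names `similarity`, `snap`, and `apply` are hypothetical, and only vertices are transformed here to keep the example short.

```python
import cmath
import math

def similarity(scale, angle, shift):
    """Return f(z) = a*z + shift with a = scale * e^(i*angle)."""
    a = scale * cmath.exp(1j * angle)
    return lambda z: a * z + shift

def snap(z, digits=6):
    """Round a point so floating-point noise does not break set equality."""
    return complex(round(z.real, digits), round(z.imag, digits))

def apply(f, vertices):
    """Map a set of (position, label) vertices through f."""
    return {(snap(f(z)), label) for z, label in vertices}

# Pattern P, then its image under: rotate by 90 degrees, scale by 2, shift by (3, 1).
P = {(0 + 0j, "A"), (1 + 0j, "A"), (0 + 1j, "B")}
F = similarity(scale=2.0, angle=math.pi / 2, shift=3 + 1j)
Q = apply(F, P)
```

With this encoding, P is geometrically isomorphic to Q exactly when some choice of `scale`, `angle`, and `shift` makes `apply(F, P)` equal to Q (and likewise for the edges).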
Geometric matching • [Figure: the geometric matching function F, shown on geographs with A-labeled vertices and g-labeled edges in the x-y plane]
Geometric matching • P matches Q (P ≦ Q) iff • P is geometrically isomorphic to a subgraph of Q via geometric graph isomorphism under rigid geometric transformations. • (Geo, ≦): a partial ordering over geographs [Figure: a geograph with A-labeled vertices and g-labeled edges plotted in the x-y plane]
Occurrence and frequency • The location list L(P) of geograph P in the input geograph D • The set of all geometric transformations that match P to D. • The frequency of P in D: freq(P) = |L(P)| • Example (figure): for the pattern P and database D shown, L(P) = {f1, f2, f3} and freq(P) = 3
Equivalence of patterns • Two geographs P and Q are equivalent to each other in D if L(P) = L(Q) holds in D. • Example (figure): patterns P and Q with L(P) = L(Q) = {f1, f2, f3} and freq(P) = freq(Q) = 3
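A brute-force sketch of the location list, not the procedure used by MaxGeo. It assumes points are complex numbers, a geograph is a pair (vertex set, edge set), transformations have the form f(z) = a·z + b (rotation, scaling, translation, no reflection), and the pattern has at least two vertices. Every candidate f is fixed by sending two anchor vertices of P onto an ordered pair of vertices of D. All function names are hypothetical.

```python
from itertools import permutations

def snap(z, digits=6):
    """Round a point so floating-point noise does not break set membership."""
    return complex(round(z.real, digits), round(z.imag, digits))

def transform(a, b, geograph):
    """Image of (vertices, edges) under f(z) = a*z + b."""
    verts, edges = geograph
    tv = {(snap(a * z + b), lab) for z, lab in verts}
    te = {(frozenset(snap(a * z + b) for z in pair), lab) for pair, lab in edges}
    return tv, te

def location_list(P, D):
    """All (a, b) such that f(z) = a*z + b maps P into D."""
    # Two fixed anchor vertices of P determine each candidate transformation.
    (p1, _), (p2, _) = sorted(P[0], key=lambda t: (t[0].real, t[0].imag))[:2]
    found = set()
    for (q1, _), (q2, _) in permutations(D[0], 2):
        a = (q2 - q1) / (p2 - p1)
        b = q1 - a * p1
        tv, te = transform(a, b, P)
        if tv <= D[0] and te <= D[1]:   # every object of f(P) lies in D
            found.add((snap(a), snap(b)))
    return found

# D: a path A-B-A-B on the x-axis with g-labeled edges; P: one A-B edge.
D = ({(0j, "A"), (1 + 0j, "B"), (2 + 0j, "A"), (3 + 0j, "B")},
     {(frozenset({0j, 1 + 0j}), "g"), (frozenset({1 + 0j, 2 + 0j}), "g"),
      (frozenset({2 + 0j, 3 + 0j}), "g")})
P = ({(0j, "A"), (1 + 0j, "B")}, {(frozenset({0j, 1 + 0j}), "g")})
L = location_list(P, D)
freq = len(L)
```

In this toy database the pattern occurs three times: twice by translation and once by a rotation of 180 degrees.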
Maximal patterns • A maximal pattern • A geometric graph which is not included in any properly larger subgraph w.r.t. ≦ having the same set of occurrences in D. • A maximal element within the equivalence class of geographs w.r.t. location list equivalence. • Lemma 1 (unique maximal pattern): For any geometric pattern P, there exists a unique maximal pattern equivalent to P. • Proof: Take the intersection of all geographs in the equivalence class [P] = { Q in Geo : L(P) = L(Q) in D }. This is the unique maximal pattern equivalent to P. QED.
Maximal pattern mining • A maximal pattern • is a geometric graph which is not included in any properly larger subgraph having the same set of occurrences in D. • The maximal subgraph mining problem • asks to find all maximal patterns (closed patterns) appearing in a given input geometric graph D without repetition. • The set M of all maximal patterns • is expected to be much smaller than the set F of all frequent patterns and still contains the complete information of D.
Canonical form • Given a geograph P of size k • Define the canonical code Cano(P) of P as the lexicographically smallest code C(P, N) over all numberings N, where • C(P, N) is defined as follows: • Determine a numbering N of all the vertices of P in 1, 2, 3, ..., k • Sort the collection of all labeled vertices and labeled edges: • (c(v), l(v)) for all v in V • (c(u), c(v), m(u, v)) for all edges e = (u, v) in E • Let C(P, N) be the resulting list, taken as the code by N
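One way to obtain a code that is invariant under translation, rotation, and scaling is sketched below. It is an assumption made for illustration, not the construction on the slide: instead of minimizing over vertex numberings, it tries every ordered vertex pair (u, v) as a reference frame, maps u to 0 and v to 1, sorts the resulting objects, and keeps the lexicographically smallest list. Points are complex numbers, the pattern is assumed to have at least two vertices, and `canonical_code` is a hypothetical name.

```python
from itertools import permutations

def canonical_code(verts, edges, digits=6):
    """Smallest sorted object list over all frames sending u -> 0 and v -> 1."""
    def norm(z, a, b):
        w = a * z + b
        return (round(w.real, digits), round(w.imag, digits))

    best = None
    for (u, _), (v, _) in permutations(verts, 2):
        a = 1 / (v - u)          # rotation and scaling that send v - u to 1
        b = -a * u               # translation that sends u to the origin
        code = sorted(
            [("v", norm(z, a, b), lab) for z, lab in verts]
            + [("e", tuple(sorted(norm(z, a, b) for z in pair)), lab)
               for pair, lab in edges]
        )
        if best is None or code < best:
            best = code
    return best

def moved(verts, edges, f):
    """Copy of a pattern with every point mapped through f."""
    return ({(f(z), lab) for z, lab in verts},
            {(frozenset(f(z) for z in pair), lab) for pair, lab in edges})

P = ({(0j, "A"), (1 + 0j, "A"), (1j, "B")}, {(frozenset({0j, 1 + 0j}), "g")})
Q = moved(*P, lambda z: 2j * z + (3 + 1j))   # rotated, scaled, translated copy
R = moved(*P, lambda z: z.conjugate())       # mirror image, not in the class
```

The code of the transformed copy Q equals the code of P, while the mirror image R gets a different code because reflections are not among the allowed transformations. The cost is quadratic in the number of vertices times one sort, so the code stays polynomial-time computable.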
Intersection of geometric graphs • Lemma: The intersection of geographs T1 and T2 is the unique geograph Merge(T1, T2) = T whose object set is given by α(T) = α(T1) ∩ α(T2). • α(G) = the object set of G, that is, the set of all labeled vertices and labeled edges in a geometric graph G [Figure: T1, T2, and Merge(T1, T2) with their object sets α(T1), α(T2), and α(Merge(T1, T2))] ILP'05, Aug 2005, Hiroki Arimura, Hokkaido University
Closure of geograph • The intersection Merge(P1, P2) of a pair of geographs P1 and P2 • The intersection of P1 and P2 as first-order (relational) structures • The closure of geograph P • Closure(P) = Merge(L(P)) • Theorem: • P is maximal in D iff Closure(P) = P
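A sketch of the closure under one reading of Closure(P) = Merge(L(P)), which is an assumption here: for each occurrence f in L(P), view D through the inverse of f, then intersect the object sets of all these copies. Points are complex numbers, occurrences are given as pairs (a, b) meaning f(z) = a·z + b, and the function names are hypothetical.

```python
from functools import reduce

def snap(z, digits=6):
    """Round a point so floating-point noise does not break set intersection."""
    return complex(round(z.real, digits), round(z.imag, digits))

def pull_back(a, b, geograph):
    """Image of (vertices, edges) under the inverse of f(z) = a*z + b."""
    verts, edges = geograph
    inv = lambda z: snap((z - b) / a)
    return ({(inv(z), lab) for z, lab in verts},
            {(frozenset(inv(z) for z in pair), lab) for pair, lab in edges})

def merge(g1, g2):
    """Intersection of two geographs as object sets."""
    return (g1[0] & g2[0], g1[1] & g2[1])

def closure(occurrences, D):
    """Intersect the copies of D seen from every occurrence."""
    return reduce(merge, (pull_back(a, b, D) for a, b in occurrences))

# D: two copies of an A-B edge, each with a C vertex above the B vertex.
D = ({(0j, "A"), (1 + 0j, "B"), (1 + 1j, "C"),
      (10 + 0j, "A"), (11 + 0j, "B"), (11 + 1j, "C")},
     {(frozenset({0j, 1 + 0j}), "g"), (frozenset({10 + 0j, 11 + 0j}), "g")})
P = ({(0j, "A"), (1 + 0j, "B")}, {(frozenset({0j, 1 + 0j}), "g")})
occurrences = [(1 + 0j, 0j), (1 + 0j, 10 + 0j)]   # L(P): two translations
C = closure(occurrences, D)
```

Here the closure gains the C vertex that accompanies every occurrence of P, so Closure(P) differs from P and P is not maximal. A fuller implementation would also drop any edge whose endpoints did not both survive the vertex intersection.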
Tree-shaped search route for maximal patterns • The core of (the code of) a geograph P • The shortest prefix core(P) of code(P) such that L(P) = L(core(P)) • The parent of a maximal geograph P • Parent(P) = Closure(the proper prefix of core(the code of P)) • Theorem: The graph Tree(Geo) = (Geo, Parent(.)) forms a spanning tree for Geo with the empty geograph as root
Our algorithm MaxGeo: Basic Idea • Depth-first search over a tree-shaped search space for all maximal geographs • Jumping from one maximal geograph to another maximal geograph [Figure: the search tree Tree(Geo) = (Geo, Parent(.))]
Summary: We develop techniques... • A polynomial-time computable canonical code for all geometric graphs, which is invariant under geometric transformations. • Characterization of M by the intersection operation (the least general generalization), and then a polynomial-time computable closure operation for geographs • The tree-shaped search route T for all maximal patterns in G • A new pattern growth technique combining reverse search and closure extension
Conclusion • The class of geometric graphs • Maximal pattern discovery problem • A polynomial space and polynomial delay algorithm MaxGeo • Time and space complexity • Techniques