Mining for Tree-Query Associations in a Graph Jan Van den Bussche Hasselt University, Belgium joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium)
Graph Data A (directed) graph over a set of nodes N is a set G of edges: ordered pairs (i, j) with i, j ∈ N. Figure: snapshot of a graph representing the metabolic pathway of a human. Applications: life sciences, biology, social sciences, WWW, ...
Graph Mining Transactional category • dataset: set of many small graphs (transactions) • frequency: number of transactions in which the pattern occurs (at least once) • ILP: Warmr [AGM, FSG, TreeMiner, gSpan, FFSM, Horvath-Ramon-Wrobel] Single graph category • dataset: single large graph • frequency: number of copies of the pattern in the large graph [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom] Focus on pattern mining, little work on association rule mining!
Tree-Query Pattern • powerful tree-shaped pattern • inspired by conjunctive database queries • special features: • existential nodes • parameterized nodes • an occurrence of the pattern in G is any homomorphism from the pattern into G • frequency: number of x such that ∃z: (0,z) ∈ G ∧ (z,8) ∈ G ∧ (z,x) ∈ G
Association rules • fully fledged associations over tree-query patterns • example:
Experimental results: Real-life datasets • Food web • confidence = 89% • frequency = 176
Experimental results: Food web • 45% • 55%
Experimental results: Real-life datasets • Protein interaction graph • confidence = 10%
Experimental results: Protein interaction graph • 90%
Outline rest of the talk • Formal problem definition • Algorithm • overall approach • levelwise generation of tree patterns • generation of containment mappings • generation of parameter assignments • Equivalent association rules • Certhia • Performance and Experimental results • Future work
Tree pattern select distinct G3.to as x from G G1, G G2, G G3 where G1.from=5 and G1.to=G2.from and G1.to=G3.from and G2.to=8
Frequency frequency = 3
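The tree pattern query above can be tried out on a small in-memory database. The following is a minimal sketch using Python's standard `sqlite3` module. The toy edge list `TOY_EDGES` and the function name `pattern_frequency` are assumptions made for illustration and are not the graph shown on the slide. Because `from` and `to` are SQL keywords, the column names are quoted.

```python
import sqlite3

# Hypothetical toy graph (not the graph from the slides).
TOY_EDGES = [(5, 1), (1, 8), (1, 2), (1, 3), (5, 4), (4, 7)]

def pattern_frequency(edges):
    """Return the distinct values of x matched by the tree pattern query."""
    con = sqlite3.connect(":memory:")
    # "from" and "to" are SQL keywords, so they are quoted as identifiers.
    con.execute('create table G ("from" integer, "to" integer)')
    con.executemany("insert into G values (?, ?)", edges)
    rows = con.execute('''
        select distinct G3."to" as x
        from G G1, G G2, G G3
        where G1."from" = 5 and G1."to" = G2."from"
          and G1."to" = G3."from" and G2."to" = 8
    ''').fetchall()
    con.close()
    return sorted(row[0] for row in rows)

matches = pattern_frequency(TOY_EDGES)
print("matches:", matches, "frequency:", len(matches))
```

The frequency is the number of distinct answers for x, so on this toy graph it is the length of the returned list.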
Tree Query • Q = (H, P) • H: head • P: body
Association Rule • AR: Q1 ⇒ Q2 • Confidence(AR) = freq(Q2)/freq(Q1) • Q2 ⊆ Q1 • { (x1,x2,x3) | Q1(x1,x2,x3) occurs in G } ⊇ { (x,x,6) | Q2(x,x,6) occurs in G }
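The confidence formula can be illustrated with a brute-force evaluator. This is only a sketch under simplifying assumptions: a query is given as a list of pattern edges over variable names, a list of head variables, and a dictionary fixing parameter variables to constants. The names `answers`, `confidence`, `TOY_GRAPH`, `Q1` and `Q2` are hypothetical, and exhaustive enumeration of assignments is used instead of the SQL-based counting of the talk.

```python
from itertools import product

def answers(graph, pattern, head, params=None):
    """Distinct head tuples over all homomorphisms from the pattern into graph."""
    params = params or {}
    nodes = sorted({n for edge in graph for n in edge})
    variables = sorted({v for edge in pattern for v in edge})
    free = [v for v in variables if v not in params]
    result = set()
    # Try every assignment of graph nodes to the non-parameter variables.
    for values in product(nodes, repeat=len(free)):
        assignment = dict(params)
        assignment.update(zip(free, values))
        if all((assignment[a], assignment[b]) in graph for a, b in pattern):
            result.add(tuple(assignment[v] for v in head))
    return result

def confidence(graph, q_left, q_right):
    """freq(q_right) / freq(q_left), where freq counts distinct head tuples."""
    return len(answers(graph, *q_right)) / len(answers(graph, *q_left))

# Hypothetical toy graph and queries.
TOY_GRAPH = {(5, 1), (1, 8), (1, 2), (1, 3), (5, 4), (4, 7)}
Q1 = ([("r", "z"), ("z", "x")], ["x"], {"r": 5})
Q2 = ([("r", "z"), ("z", "x"), ("z", "y")], ["x"], {"r": 5, "y": 8})

print(confidence(TOY_GRAPH, Q1, Q2))
```

Here Q2 adds one more edge and one more parameter to Q1, so every answer of Q2 is also an answer of Q1 and the ratio lies between 0 and 1.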
Examples of Association Rules (1) (2)
Containment Mapping • Q2 ⊆ Q1 if and only if there is a containment mapping from Q1 to Q2
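To make the notion of a mapping between two patterns concrete, here is a simplified sketch that enumerates all edge-preserving maps from one pattern to another by brute force. It is an assumption-laden illustration: it checks edge preservation only, whereas a full containment mapping must also respect the head and the parameter nodes of the queries. The function name `edge_preserving_maps` and the example patterns are hypothetical.

```python
from itertools import product

def edge_preserving_maps(src_edges, dst_edges):
    """All maps f on the source nodes such that every source edge (a, b)
    is sent to an edge (f(a), f(b)) of the destination pattern."""
    src_nodes = sorted({n for edge in src_edges for n in edge})
    dst_nodes = sorted({n for edge in dst_edges for n in edge})
    dst = set(dst_edges)
    maps = []
    for image in product(dst_nodes, repeat=len(src_nodes)):
        f = dict(zip(src_nodes, image))
        if all((f[a], f[b]) in dst for a, b in src_edges):
            maps.append(f)
    return maps

# Hypothetical patterns: a root with two children, and a single edge.
FORK = [("a", "b"), ("a", "c")]
EDGE = [("r", "s")]

print(edge_preserving_maps(FORK, EDGE))
print(edge_preserving_maps(EDGE, FORK))
```

The fork collapses onto the single edge in exactly one way, while the single edge can be embedded into the fork in two ways, one per child.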
Problem statement: Mining tree queries Given a graph G and a threshold k, find all tree queries that have frequency at least k in G. Such queries are called frequent.
Problem statement: Association rules • Input: • a graph G • minsup • Qleft frequent in G • minconf • Output: all association rules Qleft ⇒ Q that are • frequent in G • confident in G.
Algorithm: mining tree queries Outer loop: Generate, incrementally, all possible trees of increasing sizes. Avoid generation of isomorphic trees. Inner loop: For each newly generated tree, generate all queries based on that tree, and test their frequency.
Outer loop • It is well known how to efficiently generate all trees uniquely up to isomorphism • Based on canonical form of trees. • [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]
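The idea of using a canonical form to avoid isomorphic duplicates can be sketched as follows. This is a deliberately naive illustration for rooted trees, not one of the efficient algorithms cited above. Trees are stored as parent arrays, each tree is extended by one leaf in every possible position, and duplicates are discarded by comparing a sorted-parenthesis encoding. The names `canonical_form` and `trees_up_to` are hypothetical.

```python
def canonical_form(parents):
    """Encode a rooted tree (parent array, root at index 0) as a string
    that is identical for isomorphic rooted trees."""
    children = [[] for _ in parents]
    for node, parent in enumerate(parents):
        if parent >= 0:
            children[parent].append(node)

    def encode(node):
        # Sorting the child encodings makes the result order-independent.
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(0)

def trees_up_to(max_size):
    """Map each size 1..max_size to one representative per isomorphism class."""
    current = {"()": (-1,)}          # canonical form -> parent array
    by_size = {1: list(current.values())}
    for size in range(2, max_size + 1):
        extended = {}
        for parents in current.values():
            # Attach a new leaf below every existing node.
            for p in range(len(parents)):
                candidate = parents + (p,)
                extended.setdefault(canonical_form(candidate), candidate)
        current = extended
        by_size[size] = list(current.values())
    return by_size

for size, trees in trees_up_to(5).items():
    print(size, len(trees))
```

Every rooted tree with n nodes arises from some rooted tree with n - 1 nodes by adding a leaf, so extending one representative per class is enough to reach every class of the next size.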
Inner loop: Levelwise approach • A query Q is characterized by • ΠQ: set of existential nodes • ΣQ: set of selected nodes • λQ: labeling of the selected nodes by constants. • Q2 specializes Q1 if Π1 ⊆ Π2, Σ1 ⊆ Σ2, and λ2 agrees with λ1 on Σ1. • If Q2 specializes Q1 then freq(Q2) ≤ freq(Q1) • Most general query: T = (∅, ∅, ∅)
Inner loop: Candidate generation • CanTabΠ,Σ = { λ | (Π, Σ, λ) is a candidate query } • FreqTabΠ,Σ = { λ | (Π, Σ, λ) is a frequent query } • Q’ = (Π’, Σ’, λ’) is a parent of Q = (Π, Σ, λ) if either: • Π = Π’ and Σ has precisely one more node than Σ’, or • Σ = Σ’ and Π has precisely one more node than Π’ • Join Lemma: Each candidacy table can be computed by taking the natural join of its parent frequency tables.
Inner loop: Frequency counting • Each candidacy table can be computed by a single SQL query (cf. Join Lemma). • Suppose G(from, to) is a table in the database; then each frequency table can also be computed with a single SQL query. • Σ = ∅: • formulate the query in SQL and count • Σ ≠ ∅: • formulate the query in SQL as E • natural join of E with CanTab • group by Σ • count each group
Inner loop: Example • Join expression: • CanTab{x2}{x1,x3} = FreqTab{}{x1,x3} ⋈ FreqTab{x2}{x1} ⋈ FreqTab{x2}{x3}
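The join expression can be mimicked in a few lines of Python. In this sketch, each table is a list of rows and each row is a dictionary from parameter node to constant. The function names `natural_join` and `candidate_table`, and the contents of the three parent tables, are made up for illustration.

```python
from functools import reduce

def natural_join(left, right):
    """Natural join of two tables given as lists of dict rows."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            # Rows combine when they agree on every shared column.
            if all(l_row[k] == r_row[k] for k in shared):
                joined.append({**l_row, **r_row})
    return joined

def candidate_table(parent_tables):
    """Join all parent frequency tables into one candidacy table."""
    return reduce(natural_join, parent_tables)

# Hypothetical parent frequency tables over the parameters x1 and x3.
FREQ_NONE_X1X3 = [{"x1": 5, "x3": 8}, {"x1": 5, "x3": 2}]
FREQ_X2_X1 = [{"x1": 5}]
FREQ_X2_X3 = [{"x3": 8}, {"x3": 7}]

print(candidate_table([FREQ_NONE_X1X3, FREQ_X2_X1, FREQ_X2_X3]))
```

A parameter assignment survives only if each of its restrictions is frequent in the corresponding parent table, which is the usual levelwise pruning step.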
Inner loop: Example • SQL expression E: select distinct G1.from as x1, G2.to as x3, G3.to as x4 from G G1, G G2, G G3 where G1.to = G2.from and G3.from = G2.from
Inner loop: Example • SQL expression for filling the frequency table: select distinct E.x1, E.x3, count(E.x4) from E, CanTab{x2}{x1,x3} as CT where E.x1 = CT.x1 and E.x3 = CT.x3 group by E.x1, E.x3 having count(E.x4) >= k
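The two SQL expressions can be run end to end with Python's standard `sqlite3` module. The sketch below makes several assumptions: the toy edges and candidate pairs are invented, the candidacy table is simply named `CanTab` because braces are not valid in an SQL identifier, E is stored as a view, and the threshold k is passed as a query parameter.

```python
import sqlite3

# Hypothetical toy graph and candidate (x1, x3) parameter pairs.
TOY_EDGES = [(5, 1), (1, 8), (1, 2), (1, 3), (5, 4), (4, 7)]
CANDIDATES = [(5, 8), (5, 2), (5, 7)]

def frequency_table(edges, candidates, k):
    """Return rows (x1, x3, count) for candidates with frequency >= k."""
    con = sqlite3.connect(":memory:")
    con.execute('create table G ("from" integer, "to" integer)')
    con.executemany("insert into G values (?, ?)", edges)
    con.execute("create table CanTab (x1 integer, x3 integer)")
    con.executemany("insert into CanTab values (?, ?)", candidates)
    # E lists every distinct (x1, x3, x4) matched by the pattern.
    con.execute('''
        create view E as
        select distinct G1."from" as x1, G2."to" as x3, G3."to" as x4
        from G G1, G G2, G G3
        where G1."to" = G2."from" and G3."from" = G2."from"
    ''')
    rows = con.execute('''
        select E.x1, E.x3, count(E.x4)
        from E, CanTab as CT
        where E.x1 = CT.x1 and E.x3 = CT.x3
        group by E.x1, E.x3
        having count(E.x4) >= ?
    ''', (k,)).fetchall()
    con.close()
    return sorted(rows)

print(frequency_table(TOY_EDGES, CANDIDATES, 2))
```

Because E is already distinct on (x1, x3, x4) and the candidate pairs are unique, the count in each group equals the number of distinct x4 values for that parameter pair.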
Algorithm: Mining association rules Loop 1: Generate incrementally all possible trees T of increasing sizes. Loop 2: For each T, generate all frequent tree patterns P based on T. Loop 3: For each P, generate all containment mappings f from Pleft to P. Loop 4: For each f, generate Q = (f(Hleft), P) and all parameter instantiations for Qleft ⇒ Q.
Pattern database • For each P, a table FreqTabP that contains all frequent parameter instantiations.