Mining Tree-Query Associations in a Graph

Mining Tree-Query Associations in a Graph Bart Goethals University of Antwerp, Belgium Eveline Hoekx Jan Van den Bussche Hasselt University, Belgium

Graph Data A (directed) graph over a set of nodes N is a set G of edges: ordered pairs ij with ij  N. Snapshot of a graph representing the complete metabolic pathway of a human.

Graph Mining Transactional category • dataset: set of many small graphs (transactions) • frequency: transactions in which the pattern occurs (at least once) • ILP:Warmr [AGM, FSG, TreeMiner, gSpan, FFSM] Single graph category • dataset: single large graph • frequency: copies of the pattern in the large graph [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom] Focus on pattern mining, few work on association rule mining!

Our work • Single graph category • Pattern + association rule mining • Patterns with: • Existential nodes • Parameters • Occurrence of the pattern in G is any homomorphism from the pattern in G. • So far only considered in the ILP (transactional) setting

Example of a pattern frequencyx z5z G z8 G zx G

Patterns are conjunctive queries. select distinct G3.to as x from G G1, G G2, G G3 where G1.from=5 and G1.to=G2.from and G1.to=G3.from and G2.to=8 frequencyx z5z G z8 G zx G

Example of an Association Rule

Features of the presented algorithms • Pattern mining phase + association mining phase • Restriction to trees => efficient algorithms • Equivalence checking • Apply theory of conjunctive database queries • Database oriented implementation

Outline rest of talk • Formal problem definition • Algorithms: • Pattern Mining • Overall approach • Outer loop: incremental • Inner loop: levelwise • Equivalence checking • Association Rule Mining • Result management • Experimental results • Future work

Formal definition of a tree pattern. A tree pattern is a tree P whose nodes are called variables, and: • some variables marked as existential • some variables are parameters(labeled with a constant) • remaining variables are called distinguished

Formal definition of a tree query. A tree queryQ is a pair (H,P) where: • P is a tree pattern, the body of Q • H is a tuple of distinguished variables and parameters of P. All distinguished variables of P must appear at least once in H, the head of Q

Formal definition of a matching A matching of a pattern P in a graph G is a homomorphism h: P  G, with hza, for parameters labeled a.

Example: Matching

Formal definition of frequency The frequency of Q in G is #answers in the answer set. We define the answer set of Q in G as follows: Q(G):={f(H)|f is a matching of P in G

Example: Matching   frequency 

Problem statement 1: Tree query mining Given a graph G and a threshold k, find all tree queries that have frequency at least k in G, those queries are called frequent.

Formal definition of an association rule An association rule (AR) is of the form Q1 Q2 with Q1 and Q2 tree queries. The AR is legal if Q2 Q1. The confidence of the AR in a graph G is defined as the frequency of Q2 divided by the frequency of Q1.

Problem statement 2: Association rule mining • Input: a graph G, minsup, a tree query Qleft frequent in G, minconf • Output: all tree queries Q such that QleftQ is a legal and confident association rule in G.

Outline rest of talk • Formal problem definition • Algorithms: • Pattern Mining • Overall approach • Outer loop: incremental • Inner loop: levelwise • Equivalence checking • Association Rule Mining • Result management • Experimental results • Future work

Pattern Mining Algorithm x1 x2 x4 x3 x        x2 x2 x1 x1 Outer loop: Generate,incrementally, all possible trees of increasing sizes. Avoid generation of isomorphic trees. Inner loop: For each newly generated tree, generate all queries based on that tree, and test their frequency. ...

Outer loop • It is well known how to efficiently generate all trees uniquely up to isomorphism • Based on canonical form of trees. • [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]

Inner loop: Levelwise approach • A query Q is characterized by • Q set of existential nodes • Q set of parameters • Labeling Qof the parameters by constants. • Qspecializes Q if  , and  agrees with  on . • If Qspecializes Q then freqQ freqQ • Most general query: T = (, , )

Inner loop: Candidate generation • CanTab is a candidate query FreqTabis a frequent query • Q’=’’ is aparent of Q= if either: • ’ and  has precisely one more node than ’, or • ’ and  has precisely one more node than ’ • Join Lemma: Each candidacy table can be computed by taking the natural join of its parent frequency tables.

Inner loop: Frequency counting • Each candidacy table can be computed by a single SQL query. (ref. Join lemma). • Suppose: Gfromto table in the database, then each frequency table can be computed with a single SQL query. •  • formulate in SQL and count •  • formulate  in SQLE • natural join of E with CanTab • group by  • count each group

Inner loop: Example x x x x x

Inner loop: Example x x x x x • Join expression: • CanTab{x}{x,x} = FreqTabxx⋈FreqTabxx ⋈FreqTabxx

Inner loop: Example x x x x x • SQL expression E for x select distinct G1.from as x1, G2.to as x3, G3.to as x4 from G G1, G G2, G G3 where G1.to = G2.from and G3.from = G2.from

Inner loop: Example x x x x x • SQL expression for filling the frequency table: select distinct E.x1, E.x3, count(E.x4) from E, CanTab{x2}{x1,x3} as CT where E.x1 = CT.x1 and E.x3 = CT.x3 group by E.x1, E.x3 having count(E.x4) >= k

Equivalent queries Queries Q and Q are equivalent if same answer sets on all graphs G (up to renaming of the distinguished variables) • 2 cases of equivalent queries: • Q1 has fewer nodes than Q2 • Q1 and Q2 have the same number of nodes

Equivalence theorem Two queries are equivalent if and only if there are containment mappings between them in both directions. A containment mapping from Q to Q is a h: QQ that maps distinguished variables ofQ one-to-one to distinguished variables of Q, and maps parameters of Q to parameters of Q, preserving labels

Case : Q fewer nodes than Q2 Redundancy lemma: Let Q be a tree query without selected nodes. Then Q has a redundancy if and only if it contains a subtree C in the form of a linear chain of  nodes (possibly just a single node), such that the parent of C has another subtree that is at least as deep asC. Redundant subtree

Case : Q and Q same number of nodes • Q and Q must be isomorphic. • Canonical form of queries: refine the canonical ordering of the underlying unlabeled tree, taking into account node labels.

Association Mining Algorithm • Input: a graph G, minsup, a tree query Qleft frequent in G, minconf • Output: all tree queries Q such that QleftQ is a legal and confident association rule in G.

Containment mappings • For each tree query, generate all containment mappings from Qleft to Q, ignoring parameter assignments.

Instantiations • For each containment mapping, generate all parameter assignments such that Qleft Q is frequent and confident.

Equivalent Association rules • Equivalence checking of association rules is as hard as general graph isomorphism testing.

Outline rest of talk • Result management • Experimental results • Future work

Result management • Output: frequency tables stored in a relational database. • Browser

Experimental results: Real-life datasets • Food webnodesedges frequency = 176

Experimental results: Real-life datasets • Food webnodesedges confidence = 11%

Experimental results: Performance • Fully implemented on top of IBM DB2 • Preliminary performance results: • pattern mining algorithm: • adequate performance • huge number of patterns • constant overhead per discovered pattern • association mining algorithm: • very fast • constant overhead per discovered rule

Future work • Applications: scientific data mining • Loosen restriction to trees

References • Bart Goethals, Eveline Hoekx and Jan Van den Bussche, Mining Tree Queries in a Graph, in Proceedings of the eleventh ACM SIGKDD International conference on Knowledge Discovery and Data Mining, p 61-69, ACM Press 2005 • Eveline Hoekx and Jan Van den Bussche, Mining for Tree-Query Associations in a Graph, to appear in Proceedings of the 2006 IEEE International Conference on Data Mining (ICDM 2006)

Mining Tree-Query Associations in a Graph

Mining Tree-Query Associations in a Graph

Presentation Transcript

Peta-Graph Mining

Mining Query Logs

The Graph Query Language: Towards a Unification of Graph Query Approaches

Query Preserving Graph Compression

Fast Frequent Free Tree Mining in Graph Databases

GRAPH MINING a general overview of some mining techniques

Graph Indexing: Tree + Δ ≥ Graph

Reachability Query Over A Large Graph

Query Tree Disassembly

Graph mining in bioinformatics

Large Graph Mining

Tree and Graph Drawing

Mining for Tree-Query Associations in a Graph

Large Graph Mining

Mining Associations

A Query-Based Routing Tree in Sensor Networks

Mining Query Logs

The Graph Query Language: Towards a Unification of Graph Query Approaches

Tree and Graph Drawing