Efficiently Mining Frequent Trees in a Forest

Mohammed J. Zaki

Frequent Structure Mining (FSM)
• Deals with extracting patterns (associations, sequences, frequent trees, graphs, etc.) from massive databases
• Typical applications:
  • Bioinformatics
  • Web mining
  • Mining semi-structured documents

Tree Mining Problems
• Goal: to efficiently enumerate all frequent subtrees in a forest (a database D of trees) according to a given minimum support (minsup).
• The support of a subtree S is the number of trees in D that contain at least one occurrence of S.
• A subtree S is frequent if its support is greater than or equal to a user-specified minsup value.

Rooted, Ordered & Labeled Tree
• A tree is an acyclic connected graph.
• Rooted: one vertex, the root, is distinguished from the others.
• Ordered: the children of each node in a rooted tree are ordered.
• Labeled: each node is associated with a label.
Every tree in the paper is a rooted, ordered and labeled tree.

Definition of Subtrees
• We denote a tree as T = {N, B}, where N is a set of labeled nodes and B is a set of branches.
• We say that a tree S = {Ns, Bs} is an embedded subtree of T = {N, B} if:
  • Ns is a subset of N
  • A branch between two vertices appears in S iff the two vertices are on the same path from the root to a leaf in T.
• A disconnected pattern is a sub-forest of T, not a subtree.
Hence, embedded subtrees allow not only direct parent-child branches, but also ancestor-descendant branches.

Examples of Subtrees
[Figure: a tree T, an embedded subtree S of T, and a disconnected pattern that is not a subtree of T but a sub-forest.]

Node Numbers and Labels
• Each node has a well-defined number, i, according to its position in a depth-first traversal of the tree.
• The label of each node is taken from a set of labels L = {0, 1, …, m-1}. It represents the value of the node.
[Figure: the example tree with 8 nodes, annotated with node numbers 0 to 7 and node labels.]

Scope of a Node
• The scope of each node ni is given as [i, r], i.e., the lower bound is the position (i) of the node itself, and the upper bound is the position (r) of its right-most leaf node.
• Assume two nodes x and y have the scopes Sx = [ix, rx] and Sy = [iy, ry].
• Sx is strictly less than Sy (Sx < Sy) iff rx < iy, i.e., Sx occurs before Sy. It means that y is an embedded sibling of x.
• Sx contains Sy iff ix <= iy and rx >= ry. It means that y is a descendant of x.
[Figure: the example tree annotated with node scopes: n0 [0,7], n1 [1,4], n2 [2,3], n3 [3,3], n4 [4,4], n5 [5,7], n6 [6,7], n7 [7,7].]
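
The two scope comparisons above can be sketched in a few lines of Python. This is only an illustration: the function names `strictly_less` and `contains` are made up here, and a scope is assumed to be a plain `(i, r)` tuple.

```python
from typing import Tuple

Scope = Tuple[int, int]  # (i, r): position of the node, position of its right-most leaf


def strictly_less(sx: Scope, sy: Scope) -> bool:
    """Sx < Sy: Sx ends before Sy begins, i.e. rx < iy."""
    return sx[1] < sy[0]


def contains(sx: Scope, sy: Scope) -> bool:
    """Sx contains Sy: ix <= iy and rx >= ry (as written, a scope contains itself)."""
    return sx[0] <= sy[0] and sx[1] >= sy[1]


# Scopes taken from the example tree on this slide.
print(strictly_less((1, 4), (5, 7)))  # n1 ends before n5 begins
print(contains((1, 4), (2, 3)))       # n2 lies inside the scope of n1
print(contains((1, 4), (5, 7)))       # n5 does not
```

Because both tests are constant-time integer comparisons, ancestor and sibling relationships can be checked without walking the tree.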

Representing Trees as Strings
• To create the string encoding, which is denoted as t, we perform a depth-first search starting (and also ending) at the root, adding the current node’s label x to t.
• Whenever we backtrack from a child to its parent, we add a special symbol -1 to the string.
• The string encoding of the example tree: 0 2 1 1 -1 -1 1 -1 -1 4 3 -1 2 -1 -1
[Figure: the tree being encoded.]
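
A minimal sketch of the encoding procedure, assuming a tree is given as a nested `(label, children)` tuple. That representation and the name `encode` are choices made for this example, not something the slides prescribe.

```python
from typing import List, Tuple

Tree = Tuple[int, list]  # (label, list of child trees, in left-to-right order)


def encode(tree: Tree) -> List[int]:
    """Depth-first string encoding: emit each label, and -1 on every backtrack."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(encode(child))
        out.append(-1)  # backtrack from the child to this node
    return out


# The tree whose encoding is given on this slide.
example = (0, [
    (2, [(1, [(1, [])]), (1, [])]),
    (4, [(3, []), (2, [])]),
])
print(" ".join(str(tok) for tok in encode(example)))
```

For a tree with n nodes the encoding has 2n - 1 symbols: n labels and n - 1 backtracks.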

Equivalence Classes
• Two k-subtrees X, Y are in the same prefix equivalence class iff they share a common prefix up to the (k-1)th node.
• Prefix string: 2 1 0 -1 3
• The following three subtrees are in the same prefix equivalence class:
  • 2 1 0 -1 3 -1 -1 x -1 // (x, 0)
  • 2 1 0 -1 3 -1 x -1 -1 // (x, 1)
  • 2 1 0 -1 3 x -1 -1 -1 // (x, 3)
• Element list: each element is a pair (label, position of the node to which x is attached): (x, 0); (x, 1); (x, 3)
• A valid element x may be attached only to nodes that lie on the path from the root to the right-most leaf of the prefix.
[Figure: the prefix tree with x attached at nodes 0, 1 and 3; attaching x to node 2 (label 0) is not a valid element.]
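
The rule about valid attachment points can be illustrated with a short sketch. The helper names `rightmost_path` and `attach` are hypothetical, and the prefix is assumed to be a list of integers without its trailing -1 symbols, as in the prefix string above.

```python
from typing import List


def rightmost_path(prefix: List[int]) -> List[int]:
    """Positions of the nodes on the path from the root to the right-most leaf."""
    stack: List[int] = []
    pos = -1
    for tok in prefix:
        if tok == -1:
            stack.pop()        # backtrack: leave the current node
        else:
            pos += 1           # next depth-first position
            stack.append(pos)
    return stack


def attach(prefix: List[int], x: int, i: int) -> List[int]:
    """Full encoding of the subtree obtained by attaching label x to node i."""
    path = rightmost_path(prefix)
    if i not in path:
        raise ValueError(f"node {i} is not on the right-most path {path}")
    depth = path.index(i)
    backtracks = len(path) - 1 - depth  # climb from the right-most leaf up to node i
    return prefix + [-1] * backtracks + [x] + [-1] * (depth + 1)


prefix = [2, 1, 0, -1, 3]
print(rightmost_path(prefix))
for i in rightmost_path(prefix):
    print(i, attach(prefix, 9, i))  # 9 stands in for the label x
```

Calling `attach(prefix, 9, 2)` raises `ValueError`, which corresponds to the "not a valid element" case on the slide.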

Candidate Generation
• Goal: given an equivalence class of k-subtrees, obtain the candidate (k+1)-subtrees.
• Main idea: consider each pair of elements in the class for extension, including self-extension.
• Theorem: Assume elements are kept sorted by node label as the primary key and position as the secondary key. Let P be a prefix class, and let (x, i) and (y, j) denote any two elements in the class. Px denotes the class representing the extensions of element (x, i). Define (y, j) join (x, i) as follows:
  • Case I (i = j): 1) If P ≠ ∅, add (y, j) and (y, j+1) to Px. 2) If P = ∅, add (y, j) to Px.
  • Case II (i > j): add (y, j) to Px.
  • Case III (i < j): no new element is possible in this case.
• The theorem as stated has a mistake: adding (y, j+1) in Case I does not always give a valid candidate.

Candidate Generation: Example
• Prefix: 1 2; element list: (3, 1); (4, 0)
• Extending with (3, 1) gives the class with prefix 1 2 3:
  • (3,1) join (3,1): (3,1) and (3,2)
  • (4,0) join (3,1): (4,0)
• Extending with (4, 0) gives the class with prefix 1 2 -1 4:
  • (4,0) join (4,0): (4,0) and (4,2)
• If we add (y, j+1), i.e., (4, 1), we get the following tree: 1 2 4 -1 4, wrong!
[Figure: the candidate trees produced by each join.]
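
The example can be reproduced with a small sketch of the join for a non-empty prefix. Two assumptions are made here. First, in Case I the second element is added at the position that x takes in Px (the number of nodes in the prefix) rather than at j+1, which is the correction this example suggests. Second, the empty-prefix case is left out. The function name `extend_class` is hypothetical.

```python
from typing import Dict, List, Tuple

Element = Tuple[int, int]  # (label, position of the node it attaches to)


def extend_class(prefix: List[int],
                 elements: List[Element]) -> Dict[Element, List[Element]]:
    """For each element (x, i), build the candidate element list of class Px."""
    # x becomes the next node in depth-first order, so its position in Px
    # equals the number of nodes already in the prefix (assumed correction).
    n_x = sum(1 for tok in prefix if tok != -1)
    classes: Dict[Element, List[Element]] = {}
    for (x, i) in elements:
        new: List[Element] = []
        for (y, j) in elements:
            if i == j:        # Case I: y as a sibling of x, or as a child of x
                new.append((y, j))
                new.append((y, n_x))
            elif i > j:       # Case II: y attaches higher up the right-most path
                new.append((y, j))
            # Case III (i < j): no candidate
        classes[(x, i)] = sorted(new)  # label first, then position
    return classes


for element, candidates in extend_class([1, 2], [(3, 1), (4, 0)]).items():
    print(element, candidates)
```

These are candidates only; each one still has to pass the frequency check before it is kept.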

TreeMiner Algorithm
• TreeMiner(D, minsup), where D is the database of trees (the forest):
  • F1 = { frequent 1-subtrees };
  • F2 = { classes [P]1 of frequent 2-subtrees };
  • For all [P]1, do Enumerate-Frequent-Subtree([P]1);
• Enumerate-Frequent-Subtree([P]):
  • For each element (x, i) ∈ [P] do
    • For each element (y, j) ∈ [P] do
      • (y, j) join (x, i) => at most two new candidate subtrees
      • For each candidate subtree, do the scope-list join
      • If it is frequent, add the subtree to the list of frequent subtrees
  • Repeat until all frequent subtrees have been enumerated.
Notation: P is a prefix class. [P]1 means the prefix size is 1, i.e., there is only one node in the prefix. Px refers to the new prefix tree formed by adding (x, i) to P. Fk is the set of all frequent subtrees of size k.

An Example of the TreeMiner Algorithm
D in horizontal format, (tree-id, string encoding):
• (T0, 1 2 -1 3 4 -1 -1)
• (T1, 2 1 2 -1 4 -1 -1 2 -1 3 -1)
• (T2, 1 3 2 -1 -1 5 1 2 -1 3 4 -1 -1 -1 -1)
D in vertical format, one list of (tree-id, scope) pairs per label:
• Label 1: 0, [0,3]; 1, [1,3]; 2, [0,7]; 2, [4,7]
• Label 2: 0, [1,1]; 1, [0,5]; 1, [2,2]; 1, [4,4]; 2, [2,2]; 2, [5,5]
• Label 3: 0, [2,3]; 1, [5,5]; 2, [1,2]; 2, [6,7]
• Label 4: 0, [3,3]; 1, [3,3]; 2, [7,7]
• Label 5: 2, [3,7]
[Figure: database D of 3 trees (T0, T1, T2), annotated with node numbers.]
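
The conversion from the horizontal to the vertical format can be sketched as follows. The helper names `scopes` and `vertical_format` are made up for this example, and each tree is assumed to be a list of integers holding its string encoding.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Scope = Tuple[int, int]


def scopes(encoding: List[int]) -> List[Tuple[int, Scope]]:
    """Return (label, [i, r]) for every node, in depth-first order."""
    labels: List[int] = []
    upper: List[int] = []
    stack: List[int] = []
    for tok in encoding:
        if tok == -1:
            node = stack.pop()
            upper[node] = len(labels) - 1  # last position numbered so far
        else:
            stack.append(len(labels))
            labels.append(tok)
            upper.append(-1)               # filled in when the node is closed
    for node in stack:                     # nodes still open (at least the root)
        upper[node] = len(labels) - 1
    return [(label, (i, upper[i])) for i, label in enumerate(labels)]


def vertical_format(db: List[Tuple[int, List[int]]]) -> Dict[int, List[Tuple[int, Scope]]]:
    """Group (tree-id, scope) pairs by node label."""
    lists: Dict[int, List[Tuple[int, Scope]]] = defaultdict(list)
    for tid, encoding in db:
        for label, scope in scopes(encoding):
            lists[label].append((tid, scope))
    return dict(lists)


D = [
    (0, [1, 2, -1, 3, 4, -1, -1]),
    (1, [2, 1, 2, -1, 4, -1, -1, 2, -1, 3, -1]),
    (2, [1, 3, 2, -1, -1, 5, 1, 2, -1, 3, 4, -1, -1, -1, -1]),
]
for label, entries in sorted(vertical_format(D).items()):
    print(label, entries)
```

A single pass over each encoding with a stack is enough, since a node's upper bound is simply the last position assigned before the node is closed.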

Scope-List Joins
Example: minsup = 100%
• Step 1: Calculate F1. Prefix = {}, element list: (1,-1), (2,-1), (3,-1), (4,-1). Infrequent element: (5,-1).
  Scope-list entries are (tree id, node scope); e.g., 0,[0,3] means tree id 0 and node scope [0,3].
  • 1: 0,[0,3]; 1,[1,3]; 2,[0,7]; 2,[4,7]
  • 2: 0,[1,1]; 1,[0,5]; 1,[2,2]; 1,[4,4]; 2,[2,2]; 2,[5,5]
  • 3: 0,[2,3]; 1,[5,5]; 2,[1,2]; 2,[6,7]
  • 4: 0,[3,3]; 1,[3,3]; 2,[7,7]
• Step 2: Calculate F2. Suppose prefix = {1}, element list: (2,0), (4,0). Infrequent elements: (1,0), (3,0).
  Scope-list entries are (tree id, node number (position) of the prefix {1}, scope of the element node); e.g., 0,0,[1,1].
  • (2,0): 0,0,[1,1]; 1,1,[2,2]; 2,0,[2,2]; 2,0,[5,5]; 2,4,[5,5]
  • (4,0): 0,0,[3,3]; 1,1,[3,3]; 2,0,[7,7]; 2,4,[7,7]
• Step 3: Calculate F3. Suppose prefix = {1,2}, element list: (4,0). Infrequent elements: (2,0), (2,1), (4,1).
  Scope-list entries are (tree id, node numbers (positions) of the prefix {1,2}, scope of the element node); e.g., 0,01,[3,3].
  • (4,0): 0,01,[3,3]; 1,12,[3,3]; 2,02,[7,7]; 2,05,[7,7]; 2,45,[7,7]
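
The scope-lists of Steps 2 and 3 can be reproduced with a short sketch. It assumes each entry is a triple (tree id, tuple of prefix node positions, scope) and uses two tests: one requiring the second scope to lie strictly inside the first (descendant), and one requiring the first scope to be strictly less than the second (embedded sibling). The names `in_scope_join`, `out_scope_join` and `support` are chosen for illustration.

```python
from typing import Callable, List, Tuple

Scope = Tuple[int, int]
Entry = Tuple[int, Tuple[int, ...], Scope]  # (tree id, prefix positions, scope)


def _join(lx: List[Entry], ly: List[Entry],
          test: Callable[[Scope, Scope], bool]) -> List[Entry]:
    out: List[Entry] = []
    for tx, mx, sx in lx:
        for ty, my, sy in ly:
            # Same tree, same occurrence of the prefix, and the scope test holds.
            if tx == ty and mx == my and test(sx, sy):
                out.append((tx, mx + (sx[0],), sy))
    return out


def in_scope_join(lx: List[Entry], ly: List[Entry]) -> List[Entry]:
    """y is a proper descendant of x: Sx contains Sy and y comes after x."""
    return _join(lx, ly, lambda sx, sy: sx[0] < sy[0] and sx[1] >= sy[1])


def out_scope_join(lx: List[Entry], ly: List[Entry]) -> List[Entry]:
    """y is an embedded sibling after x: Sx is strictly less than Sy."""
    return _join(lx, ly, lambda sx, sy: sx[1] < sy[0])


def support(entries: List[Entry]) -> int:
    """Number of distinct trees with at least one occurrence."""
    return len({tid for tid, _, _ in entries})


# Step 1 scope-lists for labels 1, 2 and 4 (empty prefix).
L1 = [(0, (), (0, 3)), (1, (), (1, 3)), (2, (), (0, 7)), (2, (), (4, 7))]
L2 = [(0, (), (1, 1)), (1, (), (0, 5)), (1, (), (2, 2)),
      (1, (), (4, 4)), (2, (), (2, 2)), (2, (), (5, 5))]
L4 = [(0, (), (3, 3)), (1, (), (3, 3)), (2, (), (7, 7))]

L12 = in_scope_join(L1, L2)       # Step 2, element (2,0) of prefix {1}
L14 = in_scope_join(L1, L4)       # Step 2, element (4,0) of prefix {1}
L12_4 = out_scope_join(L12, L14)  # Step 3, element (4,0) of prefix {1,2}
for entry in L12_4:
    print(entry)
print("support:", support(L12_4), "of 3 trees")
```

Support counts distinct tree ids rather than entries, so the three occurrences in tree 2 contribute only once. Joining `L12` with `L14` using `in_scope_join` instead yields an empty list, since label 4 never appears below label 2.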

Conclusion
• Introduces the notion of mining embedded subtrees in a database (forest) of trees
• Systematic candidate subtree generation: no subtree is generated more than once (but the theorem as stated has a mistake)
• Uses a string encoding of trees to store the dataset efficiently
• Uses node scopes to build scope-lists
• Introduces a new algorithm, TreeMiner
