Frequent Structure Mining. Robert Howe University of Vermont Spring 2014. Original Authors. This presentation is based on the paper Zaki MJ (2002). Efficiently mining frequent trees in a forest. Proceedings of the 8th ACM SIGKDD International Conference .
Frequent Structure Mining Robert Howe University of Vermont Spring 2014
Original Authors • This presentation is based on the paper Zaki MJ (2002). Efficiently mining frequent trees in a forest. Proceedings of the 8th ACM SIGKDD International Conference. • The author’s original presentation was used to make this one. • I further adapted this from Ahmed R. Nabhan’s modifications.
Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions
Why Graph Mining? • Graphs are convenient structures that can represent many complex relationships. • We are drowning in graph data: • Social Networks • Biological Networks • World Wide Web
Facebook Data • [Figure: friend network with groups labeled UVM, High School, and BU] • (Source: Wolfram|Alpha Facebook Report)
Facebook Data • (Source: Wolfram|Alpha Facebook Report)
Biological Data • (Source: KEGG Pathways Database)
Some Graph Mining Problems • Pattern Discovery • Graph Clustering • Graph Classification and Label Propagation • Structure and Dynamics of Evolving Graphs
Graph Mining Framework • Mining graph patterns is a fundamental problem in data mining. • Exponential Pattern Space • Relevant Patterns • Mine • Select • Graph Data • Structure Indices • Exploratory Task • Clustering • Classification
Basic Concepts • Graph – A graph G is a 3-tuple G = (V, E, L) where • V is the finite set of nodes. • E ⊆ V × V is the set of edges. • L is a labeling function for edges and nodes. • Subgraph – A graph G1 = (V1, E1, L1) is a subgraph of G2 = (V2, E2, L2) iff: • V1 ⊆ V2 • E1 ⊆ E2 • L1(v) = L2(v) for all v ∈ V1. • L1(e) = L2(e) for all e ∈ E1. • [Figure: example graphs with nodes labeled A, B, C, and D]
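The subgraph conditions above can be checked mechanically. The following is a minimal sketch, assuming a graph is stored as a tuple (V, E, L) where V is a set of nodes, E is a set of ordered node pairs, and L is a dict holding labels for both nodes and edges. The name `is_subgraph` and the sample graphs are made up for illustration.

```python
def is_subgraph(g1, g2):
    """Return True if g1 = (V1, E1, L1) is a subgraph of g2 = (V2, E2, L2)."""
    v1, e1, l1 = g1
    v2, e2, l2 = g2
    # V1 must be a subset of V2, and E1 a subset of E2.
    if not (v1 <= v2 and e1 <= e2):
        return False
    # Node labels and edge labels must agree wherever g1 is defined.
    if any(l1[v] != l2[v] for v in v1):
        return False
    return all(l1[e] == l2[e] for e in e1)


# Hypothetical example: a path a-b-c-d and its prefix a-b-c.
BIG = (
    {"a", "b", "c", "d"},
    {("a", "b"), ("b", "c"), ("c", "d")},
    {"a": "A", "b": "B", "c": "C", "d": "D",
     ("a", "b"): "x", ("b", "c"): "x", ("c", "d"): "x"},
)
SMALL = (
    {"a", "b", "c"},
    {("a", "b"), ("b", "c")},
    {"a": "A", "b": "B", "c": "C", ("a", "b"): "x", ("b", "c"): "x"},
)
```

Note that this only tests the definition for graphs that already share node names. Finding a subgraph up to renaming is the much harder subgraph isomorphism problem.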
Basic Concepts • Graph Isomorphism – “A bijection between the vertex sets of G1 and G2 such that any two vertices u and v which are adjacent in G1 are also adjacent in G2.” (Wikipedia) • Subgraph Isomorphism is even harder (NP-Complete!) • [Figure: two example graphs, one with nodes labeled A to E and one with nodes labeled 1 to 5]
Basic Concepts • Graph Isomorphism – Let G1 = (V1, E1, L1) and G2 = (V2, E2, L2). A graph isomorphism is a bijective function f: V1 → V2 satisfying • L1(u) = L2(f(u)) for all u ∈ V1. • For each edge e1 = (u, v) ∈ E1, there exists e2 = (f(u), f(v)) ∈ E2 such that L1(e1) = L2(e2). • For each edge e2 = (u, v) ∈ E2, there exists e1 = (f⁻¹(u), f⁻¹(v)) ∈ E1 such that L1(e1) = L2(e2).
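Verifying that a given mapping is an isomorphism is easy, even though finding one is not. The sketch below checks a candidate mapping against the labeled definition, assuming the same (V, E, L) tuple layout with sets for V and E and a dict for L. The function name and the two sample graphs are illustrative assumptions, not taken from the paper.

```python
def is_isomorphism(f, g1, g2):
    """Check whether the dict f (V1 -> V2) is a label-preserving isomorphism."""
    v1, e1, l1 = g1
    v2, e2, l2 = g2
    # f must be a bijection from V1 onto V2.
    if set(f) != v1 or set(f.values()) != v2 or len(set(f.values())) != len(f):
        return False
    # Node labels must be preserved by f.
    if any(l1[u] != l2[f[u]] for u in v1):
        return False
    inverse = {b: a for a, b in f.items()}
    # Every edge of G1 must map to an equally labeled edge of G2 ...
    for (u, v) in e1:
        image = (f[u], f[v])
        if image not in e2 or l1[(u, v)] != l2[image]:
            return False
    # ... and every edge of G2 must come from an equally labeled edge of G1.
    for (u, v) in e2:
        preimage = (inverse[u], inverse[v])
        if preimage not in e1 or l1[preimage] != l2[(u, v)]:
            return False
    return True


# Hypothetical graphs: the same labeled path under two node namings.
G1 = ({1, 2, 3}, {(1, 2), (2, 3)},
      {1: "A", 2: "B", 3: "C", (1, 2): "x", (2, 3): "y"})
G2 = ({"p", "q", "r"}, {("p", "q"), ("q", "r")},
      {"p": "A", "q": "B", "r": "C", ("p", "q"): "x", ("q", "r"): "y"})
```

A search for an isomorphism would have to try many such mappings, which is where the cost comes from.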
Discovering Subgraphs • TreeMiner and gSpan both employ subgraph or substructure pattern mining. • Graph or subgraph isomorphism can be used as an equivalence relation between two structures. • There is an exponential number of subgraph patterns inside a larger graph (a graph with n nodes has 2^n node subsets, and each subset can then carry different edge sets). • Finding frequent subgraphs (or subtrees) tends to be useful in data mining.
Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions
Mining Complex Structures • Frequent structure mining tasks • Item sets – Transactional, unordered data. • Sequences – Temporal/positional, text, biological sequences. • Tree Patterns – Semi-structured data, web mining, bioinformatics, etc. • Graph Patterns – Bioinformatics, Web Data • “Frequent” is a broad term • Maximal or closed patterns in dense data • Correlation and other statistical metrics • Interesting, rare, non-redundant patterns.
Anti-Monotonicity • The black line is always decreasing • A monotonic function is a consistently increasing or decreasing function*. • The author refers to a monotonically decreasing function as anti-monotonic. • The frequency of a super-graph cannot be greater than the frequency of a subgraph (similar to Apriori). • * Very Informal Definition • (Source: SIGMOD ’08)
Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions
Tree Mining – Motivation • Capture intricate (subspace) patterns • Can be used (as features) to build global models (classification, clustering, etc.) • Ideally suited for categorical, high-dimensional, complex, and massive data. • Interesting Applications • Semi-structured Data – Mine structure and content • Web usage mining – Log mining (user sessions as trees) • Bioinformatics – RNA secondary structures, Phylogenetic trees • (Source: University of Washington)
Classification Example • Subgraph patterns can be used as features for classification. • “Hexagons are a commonly occurring subgraph in organic compounds.” • Off-the-shelf classifiers (like neural networks or genetic algorithms) can be trained using feature vectors built from these patterns. • Feature selection is very useful too.
Contributions • Systematic subtree enumeration. • Extensions for mining unlabeled or unordered subtrees or sub-forests. • Optimizations • Representing trees as strings. • Scope-lists for subtree occurrences.
Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions
How does searching for patterns work? • Start with small graphs. • Extend each size-k graph by one node to generate candidate patterns of size k + 1. • Use a scoring function to evaluate each candidate. • A popular scoring function is support with a minimum threshold. Only graphs whose frequency is at least minisup are kept.
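The level-wise search can be sketched on itemsets, which are far simpler than graphs but follow the same grow-and-prune loop. This is an illustrative stand-in only, not TreeMiner or gSpan: the function name, the sample transactions, and the choice to keep patterns whose support is at least `minisup` are assumptions made for the example.

```python
def frequent_patterns(transactions, minisup):
    """Level-wise search: grow size-k patterns into size-(k + 1) candidates."""
    items = sorted({item for t in transactions for item in t})

    def support(pattern):
        # Number of transactions that contain every item of the pattern.
        return sum(1 for t in transactions if pattern <= t)

    # Level 1: frequent single items.
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minisup]
    frequent = {}
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern)
        # Extend every frequent pattern by one item; the set removes duplicates.
        candidates = {p | {i} for p in level for i in items if i not in p}
        # Score the candidates and keep only the frequent ones.
        level = [c for c in candidates if support(c) >= minisup]
    return frequent


# Hypothetical transaction database.
TRANSACTIONS = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
RESULT = frequent_patterns(TRANSACTIONS, minisup=2)
```

The set used for `candidates` silently removes duplicate candidates. For graphs and trees that deduplication is the expensive part, which is why candidate generation gets so much attention.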
How does searching for patterns work? • “The generation of size k + 1 subgraph candidates from size k frequent subgraphs is more complicated and more costly than that of itemsets” – Yan and Han (2002), on gSpan • Where do we add a new edge? • It is possible to add a new edge to a pattern and then find that the resulting pattern doesn’t exist in the database. • The main story of this presentation is on good candidate generation strategies.
TreeMiner • TreeMiner uses a technique for numbering tree nodes based on DFS. • This numbering is used to encode trees as vectors. • Subtrees sharing a common prefix (i.e. the first k numbers in their vectors) form an equivalence class. • Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees (as in Apriori).
TreeMiner • This is important because candidate subtrees are generated only once! • (Remember the subgraph isomorphism problem that makes it likely to generate the same pattern over and over)
Definitions • Tree – An undirected graph where there is exactly one path between any two vertices. • Rooted Tree – Tree with a special node called root. • This tree has no root node. • It is an unrooted tree. • This tree has a root node. • It is a rooted tree.
Definitions • Ordered Tree – The ordering of a node’s children matters. • Example: XML Documents • Exercise – Prove that ordered trees must be rooted. • [Figure: two trees on the nodes v1, v2, v3 that differ only in the order of the children, marked as unequal (≠)]
Definitions • Labeled Tree – Nodes have labels. • Rooted trees also have some special terminology. • Parent – The node one closer to the root. • Ancestor – The node n edges closer to the root, for any n. • Siblings – Two nodes with the same parent. • [Figure: example tree with the ancestor, parent, sibling, and embedded sibling relationships marked] • ancestor(X,Y) :- parent(X,Y). • ancestor(X,Y) :- parent(Z,Y), ancestor(X,Z). • sibling(X,Y) :- parent(Z,X), parent(Z,Y).
Definitions • Embedded Siblings – Two nodes sharing a common ancestor. • Numbering – The node’s position in a traversal (normally DFS) of the tree. • A node has a number ni and a label L(ni). • Scope – The scope of a node nl is [l, r], where nr is the rightmost leaf under nl (again, DFS numbering).
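Numbering and scopes can both be computed in a single depth-first traversal. The sketch below assumes the tree is given as a dict from a node name to its ordered list of children. The node names "a" to "i" and the function name `dfs_scopes` are made up for illustration.

```python
def dfs_scopes(children, root):
    """Return (number, scope): DFS preorder numbers and [l, r] scopes per node."""
    number = {}
    scope = {}
    counter = 0

    def visit(node):
        nonlocal counter
        left = counter            # this node's own DFS number
        number[node] = left
        counter += 1
        for child in children.get(node, []):
            visit(child)
        # After the subtree is done, the last number handed out belongs
        # to the rightmost leaf under this node.
        scope[node] = (left, counter - 1)

    visit(root)
    return number, scope


# Hypothetical nine-node tree: a has children b, g, h; b has c, d;
# d has e, f; h has i.
TREE = {"a": ["b", "g", "h"], "b": ["c", "d"], "d": ["e", "f"], "h": ["i"]}
NUMBER, SCOPE = dfs_scopes(TREE, "a")
```

A leaf always gets a scope of the form [l, l], and the root's scope spans the whole tree.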
Definitions • Embedded Subtrees – S = (Ns, Bs) is an embedded subtree of T = (N, B) if and only if the following conditions are met: • Ns ⊆ N (the nodes have to be a subset). • b = (nx, ny) ∊ Bs iff nx is an ancestor of ny in T. • For each subset of nodes Ns there is one embedded subtree or subforest. • [Figure: tree T with nodes v0 to v8, next to the embedded subtree on v1, v4, v5] • (Colors are only on this graph to show corresponding nodes)
Definitions • Match Label – The node numbers (DFS numbers) in T of the nodes in S with matching labels. • A match label uniquely identifies a subtree. • This is useful because a labeling function is not necessarily injective: several nodes may share a label, so labels alone cannot pinpoint an occurrence. • Example: {v1, v4, v5} or {1, 4, 5} • [Figure: tree T with nodes v0 to v8, next to the subtree on v1, v4, v5] • (Colors are only on this graph to show corresponding nodes)
Definitions • Subforest – A disconnected pattern generated in the same way as an embedded subtree. • [Figure: tree T with nodes v0 to v8, next to the subforest on v1, v4, v7, v8] • (Colors are only on this graph to show corresponding nodes)
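Whether a node subset yields an embedded subtree or a subforest can be decided by linking every chosen node to its nearest chosen ancestor. The sketch below is one way to do that, not the paper's procedure. The parent map is an assumed reconstruction of the nine-node example tree (v0 with children v1, v6, v7; v1 with v2, v3; v3 with v4, v5; v7 with v8), written with DFS numbers, and the function name is hypothetical.

```python
# Assumed example tree, as a child -> parent map over DFS numbers 0..8.
PARENT = {1: 0, 6: 0, 7: 0, 2: 1, 3: 1, 4: 3, 5: 3, 8: 7}


def embedded_pattern(parent, nodes):
    """Return (edges, roots) of the pattern induced by the node subset."""
    nodes = set(nodes)
    edges = set()
    roots = []
    for node in sorted(nodes):
        # Walk upward until another chosen node (an ancestor) is found.
        current = parent.get(node)
        while current is not None and current not in nodes:
            current = parent.get(current)
        if current is None:
            roots.append(node)          # no chosen ancestor: a pattern root
        else:
            edges.add((current, node))  # embedded ancestor-descendant edge
    return edges, roots


def is_embedded_subtree(parent, nodes):
    # One root means the pattern is connected; more roots mean a subforest.
    return len(embedded_pattern(parent, nodes)[1]) == 1
```

Under this assumed tree, {1, 4, 5} gives a single root and is a subtree, while {1, 4, 7, 8} gives two roots and is a subforest.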
Problem Definition • Given a database (forest) D of trees, find all frequent embedded subtrees. • Frequent – Occurring a minimum number of times (use user-defined minisup). • Support(S) – The number of trees in D that contain at least one occurrence of S. • Weighted-Support(S) – The number of occurrences of S across all trees in D.
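The difference between the two support measures shows up as soon as a pattern occurs more than once in the same tree. The sketch below assumes the occurrences of a pattern S have already been found and are stored per tree as a list of match labels. The tree names and match labels are invented sample data.

```python
def support(occurrences):
    """Number of trees that contain at least one occurrence of the pattern."""
    return sum(1 for matches in occurrences.values() if matches)


def weighted_support(occurrences):
    """Total number of occurrences of the pattern across all trees."""
    return sum(len(matches) for matches in occurrences.values())


# Hypothetical occurrence table: tree id -> match labels of S in that tree.
OCCURRENCES = {
    "T0": [(1, 4, 5), (1, 2, 5)],  # S occurs twice in T0
    "T1": [],                      # S does not occur in T1
    "T2": [(0, 1, 2)],             # S occurs once in T2
}
```

Here the support is 2 (trees T0 and T2) while the weighted support is 3.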
Exercise • Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or subforest, and why? Assume a labeling function L(x) = x. • [Figure: tree T with nodes v0 to v8, next to the answer pattern on v1, v2, v5] • This is an embedded subtree because all of the nodes are connected. • (*Cough* Exam Question *Cough*)
Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions
Main Ingredients • Pattern Representation • Trees as strings • Candidate Generation • No duplicates. • Pattern Counting • Scope-based List (TreeMiner) • Pattern-based Matching (PatternMatcher)
String Representation • With N nodes, M branches, and a max fanout of F: • An adjacency matrix takes (N)(F + 1) space. • An adjacency list requires 4N – 2 space. • A tree of (node, child, sibling) requires 3N space. • String representation requires 2N – 1 space.
String Representation • String representation is labels with a backtrack operator, –1. • [Figure: example tree with node labels 0, 1, 2, and 3]
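The encoding can be produced by a preorder walk that emits each label and then a –1 every time it returns from a child to its parent. The sketch below assumes a tree is written as a nested (label, children) tuple. The function name and the seven-node sample tree are illustrative.

```python
def encode(tree):
    """Encode a (label, children) tree as labels plus -1 backtrack markers."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(encode(child))
        out.append(-1)  # backtrack from the child to this node
    return out


# Hypothetical seven-node tree: root 0 with children 1 and 2; that 1 has
# children 3 and 2; that 3 has children 1 and 2.
SAMPLE = (0, [
    (1, [
        (3, [(1, []), (2, [])]),
        (2, []),
    ]),
    (2, []),
])
ENCODED = encode(SAMPLE)
```

Every node contributes one label and every edge contributes one –1, so a tree with N nodes needs N + (N – 1) = 2N – 1 entries.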
Candidate Generation • Equivalence Classes – Two subtrees are in the same equivalence class iff they share a common prefix string P up to the (k – 1)-th node. • This gives us simple equivalence testing of a fixed-size array. • Fast and parallel – Can be run on a GPU. • Caveat – The order of the tree matters.
Candidate Generation • Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees. • Consider each pair of elements in a class, including self-extensions. • Up to two new candidates for each pair of joined elements. • All possible candidate subtrees are enumerated. • Each subtree is generated only once!
Candidate Generation • Each class is represented in memory by a prefix string and a set of ordered pairs indicating nodes that exist in that class. • A class is extended by applying a join operator ⊗ on all ordered pairs in the class.
Candidate Generation • Equivalence Class • Prefix string: 1 2 • [Figure: the two members of the class, each extending the prefix tree on labels 1 and 2 by one node, labeled 3 or 4] • This generates two elements: (3, v1) and (4, v0) • The element notation can be confusing because the first item is a label and the second item is a DFS node number.
Candidate Generation • Theorem 1. Define a join operator ⊗ on two elements as (x, i) ⊗ (y, j). Then apply one of the following cases: • (1) If i = j and P is not empty, add (y, j) and (y, j + 1) to class [Px]. If P is empty, only add (y, j + 1) to [Px]. • (2) If i > j, add (y, j) to [Px]. • (3) If i < j, no new candidate is possible.
Candidate Generation • Consider the prefix class from the previous example: P = (1, 2) which contains two elements, (3, v1) and (4, v0). • Join (3, v1) ⊗ (3, v1) – Case (1) applies, producing (3, v1) and (3, v2) for the new class P3 = (1, 2, 3). • Join (3, v1) ⊗ (4, v0) – Case (2) applies. (Don’t worry, there’s an illustration on the next slide.)
Candidate Generation • [Figure: the self-join (3, v1) ⊗ (3, v1) on the prefix (1, 2) and its two results] • A class with prefix {1,2} contains a node with label 3. This is written as (3, v1), meaning a node labeled ‘3’ is added at position 1 in DFS order of nodes. • Prefix = (1, 2, 3) • New nodes = (3, v2), (3, v1)
Candidate Generation • [Figure: the joins (3, v1) ⊗ (3, v1) and (3, v1) ⊗ (4, v0) on the prefix (1, 2) and their results] • Prefix = (1, 2, 3) • New nodes = (3, v2), (3, v1), (4, v0)
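The three join cases translate almost directly into code. The sketch below follows the cases exactly as they are worded on the Theorem 1 slide, so treat it as an illustration of the slide rather than a faithful copy of the paper's operator. Elements are (label, position) pairs, and the function names are hypothetical.

```python
def join(x_elem, y_elem, prefix_is_empty=False):
    """Apply (x, i) ⊗ (y, j) and return the elements to add to class [Px]."""
    (_x, i), (y, j) = x_elem, y_elem
    if i == j:
        if prefix_is_empty:
            return [(y, j + 1)]
        return [(y, j), (y, j + 1)]  # case 1: two candidates
    if i > j:
        return [(y, j)]              # case 2: one candidate
    return []                        # case 3: no candidate


def extend_class(elements, x_elem, prefix_is_empty=False):
    """Join x_elem with every element of its class, including itself."""
    new_elements = []
    for y_elem in elements:
        new_elements.extend(join(x_elem, y_elem, prefix_is_empty))
    return new_elements


# The class with prefix (1, 2) from the slides, holding (3, v1) and (4, v0).
CLASS_12 = [(3, 1), (4, 0)]
CLASS_123 = extend_class(CLASS_12, (3, 1))
```

Extending the class by the element (3, v1) yields (3, v1), (3, v2), and (4, v0), the same three new nodes listed for the prefix (1, 2, 3).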
The Algorithm
TreeMiner(D, minisup):
    F1 = { frequent 1-subtrees }
    F2 = { classes [P]1 of frequent 2-subtrees }
    for all [P]1 ∈ F2 do
        Enumerate-Frequent-Subtrees([P]1)
Enumerate-Frequent-Subtrees([P]):
    for each element (x, i) ∈ [P] do
        [Px] = ∅
        for each element (y, j) ∈ [P] do
            R = { (x, i) ⊗ (y, j) }
            L(R) = { L(x) ∩⊗ L(y) }
            for each candidate R′ ∈ R that is frequent do
                [Px] = [Px] ∪ { R′ }
        Enumerate-Frequent-Subtrees([Px])
ScopeList Join • Recall that the scope of a node is the interval from its own DFS number to the DFS number of its highest-numbered descendant (for a leaf, the node itself). • This can be used to calculate support. • [Figure: example tree annotated with scopes: v0 [0, 8], v1 [1, 5], v2 [2, 2], v3 [3, 5], v4 [4, 4], v5 [5, 5], v7 [7, 8], v8 [8, 8]]
ScopeList Join • ScopeLists are used to calculate support. • Let x and y be nodes with scopes sx = [lx, ux], sy = [ly, uy]. • sx contains sy iff lx ≤ ly and ux ≥ uy. • A scope list represents the entire forest.
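The containment test is a pair of integer comparisons, which is what makes scope-based joins cheap. The sketch below assumes scopes are stored as (lower, upper) tuples. The function name `contains` is made up, and the sample values are the scopes of v1, v3, v4, and v7 from the annotated example tree.

```python
def contains(sx, sy):
    """Return True if scope sx = (lx, ux) contains scope sy = (ly, uy)."""
    (lx, ux), (ly, uy) = sx, sy
    return lx <= ly and ux >= uy


# Scopes taken from the annotated example tree.
SCOPE_V1 = (1, 5)
SCOPE_V3 = (3, 5)
SCOPE_V4 = (4, 4)
SCOPE_V7 = (7, 8)
```

Since a node's scope covers exactly the DFS numbers of its subtree, containment doubles as a descendant test: v4 lies under v1 and v3 but not under v7, and no tree traversal is needed to see that.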