Institute of Computer Science and Technology of Peking University
Graph Similarity
Instructor: Lei Zou
Outline • Maximal Common Subgraph • Minimal Edit Distance • Graph Similarity Search
Maximal Common Subgraph
Def. 1 (Induced Subgraph). An induced subgraph of a graph G consists of a set S of vertices of G together with all edges of G that have both endpoints in S.
Def. 2 (Maximum Common Induced Subgraph). A graph G12 is a common induced subgraph of graphs G1 and G2 if G12 is isomorphic to an induced subgraph of G1 and to an induced subgraph of G2. A maximum common induced subgraph (MCIS) is a common induced subgraph G12 with the largest number of vertices.
[Figure: two example pairs of graphs with vertex labels A, B, C, D, each shown with its MCIS]
Def. 3 (Maximum Common Edge Subgraph). A maximum common edge subgraph (MCES) is a subgraph consisting of the largest number of edges common to both G1 and G2.
[Figure: two example graphs with vertex labels A, B, C, D illustrating an MCES]
Finding Maximal Common Subgraph
• Maximum clique-based algorithm (for MCIS)
Def. 4. The modular product of two graphs G1 and G2 is defined on the vertex set V(G1) × V(G2). Two vertices (ui, vi) and (uj, vj), with ui ≠ uj and vi ≠ vj, are adjacent whenever condition 1 holds together with either condition 2 or condition 3:
1. ui and vi have the same vertex label, and so do uj and vj;
2. (ui, uj) ∈ E(G1) and (vi, vj) ∈ E(G2);
3. (ui, uj) ∉ E(G1) and (vi, vj) ∉ E(G2).
Maximal Clique
[Figure: graph G1 (vertices u1 to u4) and graph G2 (vertices v1 to v4), each with vertex labels A, B, C, D, and their modular product (association graph) on the vertices (u1, v1), (u2, v2), (u3, v3), (u4, v4)]
A maximal clique in the modular product corresponds to a maximal common induced subgraph.
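The modular product defined above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `modular_product`, the dict-based graph encoding, and the two example graphs are assumptions made here. Pairs with different vertex labels are simply left out of the product, since by condition 1 they could never be adjacent to anything.

```python
from itertools import combinations

def modular_product(labels1, edges1, labels2, edges2):
    """Return the modular product as an adjacency dict over (u, v) pairs.

    labels1/labels2 map vertex -> label; edges1/edges2 are lists of pairs.
    Only label-compatible pairs (u, v) are kept as product vertices.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = [(u, v) for u in labels1 for v in labels2
             if labels1[u] == labels2[v]]
    adj = {n: set() for n in nodes}
    for (ui, vi), (uj, vj) in combinations(nodes, 2):
        if ui == uj or vi == vj:
            continue  # a vertex cannot be matched twice
        # adjacent iff the edge is present in both graphs or absent in both
        if (frozenset((ui, uj)) in e1) == (frozenset((vi, vj)) in e2):
            adj[(ui, vi)].add((uj, vj))
            adj[(uj, vj)].add((ui, vi))
    return adj

# Hypothetical example: G1 is a 4-cycle, G2 is the same without edge v3-v4.
g1_labels = {"u1": "A", "u2": "B", "u3": "C", "u4": "D"}
g1_edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u4"), ("u3", "u4")]
g2_labels = {"v1": "A", "v2": "B", "v3": "C", "v4": "D"}
g2_edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4")]
product = modular_product(g1_labels, g1_edges, g2_labels, g2_edges)
```

In this example every pair of product vertices is adjacent except (u3, v3) and (u4, v4), because the edge u3-u4 exists in G1 while v3-v4 is missing in G2. The largest cliques therefore have three vertices, which matches a common induced subgraph on three vertices.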
Def. 5. A clique in a graph G is a subset of vertices such that each pair of vertices in the subset is connected by an edge in G.
A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique which does not exist exclusively within the vertex set of a larger clique.
A maximum clique is a clique of the largest possible size in a given graph.
The clique number ω(G) of a graph G is the number of vertices in a maximum clique in G.
[Figure: example graph on vertices 1, 2, 3, 4, 5]
Maximal cliques: (1,2,3) and (1,3,4,5)
A maximum clique: (1,3,4,5)
Finding Maximal Clique
• Bron–Kerbosch algorithm (1973)
Basic algorithm:
    R = ∅ and P = V(G)   // V(G) denotes all vertices in G
    FindingMaximalClique(R, P):
        if P is empty:
            report R as a maximal clique
        for each vertex v in P:
            FindingMaximalClique(R ⋃ {v}, P ⋂ N(v))   // N(v) denotes all neighbors of v
Problem: it may generate duplicate answers.
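[Figure: triangle on vertices 1, 2, 3 and part of the search tree: (1) → (1,2) → (1,2,3) and (1) → (1,3) → (1,3,2); the same clique is reported twice]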
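The duplicate problem of the basic recursion is easy to reproduce. The sketch below is only an illustration under assumptions made here: the function name `naive_cliques` and the adjacency-dict encoding are hypothetical, and R is kept as a list so that the order in which vertices were added stays visible.

```python
def naive_cliques(adj):
    """Basic recursion: extend R by every candidate in P, with no bookkeeping."""
    reported = []

    def expand(R, P):
        if not P:
            reported.append(R)  # R cannot be extended any further
            return
        for v in sorted(P):
            # candidates must be adjacent to every vertex already in R
            expand(R + [v], P & adj[v])

    expand([], set(adj))
    return reported

# Triangle on vertices 1, 2, 3
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
reports = naive_cliques(triangle)
distinct = {frozenset(r) for r in reports}
```

On the triangle, the single maximal clique {1, 2, 3} is reported once per vertex ordering, so `reports` holds six lists while `distinct` holds one set.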
    R = ∅ and P = V(G)   // V(G) denotes all vertices in G
    FindingMaximalClique(R, P):
        if P is empty:
            report R as a maximal clique
        for each vertex v in P:
            FindingMaximalClique(R ⋃ {v}, P ⋂ N(v))   // N(v) denotes all neighbors of v
            P = P \ {v}
Problem: it may report some non-maximal cliques.
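[Figure: triangle on vertices 1, 2, 3 and part of the search tree: (1) → (1,2) → (1,2,3) and (1) → (1,3); the clique (1,3) is reported although it is not maximal]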
Bron–Kerbosch algorithm
    R = ∅, P = V(G), and X = ∅   // V(G) denotes all vertices in G
    FindingMaximalClique(R, P, X):
        if P and X are both empty:
            report R as a maximal clique
        for each vertex v in P:
            FindingMaximalClique(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))   // N(v) denotes all neighbors of v
            P = P \ {v}
            X = X ⋃ {v}   // why?
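The three-set recursion can be written almost verbatim in Python. This is a minimal sketch without pivoting; the function name `bron_kerbosch` and the adjacency-dict encoding are assumptions made here, and the example graph is one edge set that yields the two maximal cliques (1,2,3) and (1,3,4,5) listed earlier.

```python
def bron_kerbosch(adj):
    """Enumerate all maximal cliques of a graph given as vertex -> neighbor set."""
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(sorted(R))  # nothing can extend R: it is maximal
            return
        for v in sorted(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}   # v is finished as a candidate
            X = X | {v}   # remember v so cliques containing it are not re-reported

    expand(set(), set(adj), set())
    return cliques

graph = {
    1: {2, 3, 4, 5},
    2: {1, 3},
    3: {1, 2, 4, 5},
    4: {1, 3, 5},
    5: {1, 3, 4},
}
maximal = bron_kerbosch(graph)
maximum = max(maximal, key=len)
```

The set X answers the "why?" in the pseudocode: a branch whose P is empty but whose X still holds a vertex could be extended by that vertex, so R is not maximal there and is not reported. A maximum clique is then just the longest entry of `maximal`.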
Theorem. Given a vertex u, consider that all the maximal cliques containing Q ∪ {u} have been generated. Then, every new maximal clique containing Q, but not Q ∪ {u}, must contain at least one vertex q that is not adjacent to u.
• Backtracking algorithms (e.g., the McGregor algorithm), for both MCIS and MCES
The search can be suitably described through a state space representation. Each state s represents a common subgraph of the two graphs under construction. This common subgraph is part of the MCS to be eventually formed.
Minimal Edit Distance
• Six edit operations
  • Insert an isolated vertex
  • Delete an isolated vertex
  • Change the label of a vertex
  • Insert an edge between two disconnected vertices
  • Delete an edge between two connected vertices
  • Change the label of an edge
• Graph edit distance:
  • The minimum number of edit operations needed to transform one graph into another (computing it is NP-hard)
Minimal Edit Distance
[Figure: example graphs G1 and G2 with vertex labels A, B, C, D and the intermediate graphs of an edit sequence; MED(G1, G2) = 4]
Minimal Edit Distance
Given two graphs G1 and G2, assume that they have the same number of vertices. Define a function f: V(G1) → V(G2). The distance under this function is:
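Minimal Edit Distance
The distance between G1 and G2 is defined as the minimum, over all functions f, of the distance under f. We can prove that this distance equals the minimal edit distance MED(G1, G2).
If G1 and G2 have different numbers of vertices, say |V(G1)| < |V(G2)|, we introduce |V(G2)| − |V(G1)| pseudo-vertices into G1, and the equation still holds.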
Minimal Edit Distance
[Figure: example graphs G1 and G2 with vertex labels A, B, C, D]
Minimal Edit Distance
• Exact algorithm (A* algorithm)
What is the A* algorithm? A* uses best-first search to find a least-cost path from a given initial node to one goal node (out of one or more possible goals). As A* traverses the graph, it always expands the node x with the lowest estimated total cost f(x) = g(x) + h(x), keeping a sorted priority queue of alternative path segments along the way. Here g(x) denotes the cost from the starting node to the current node x, and h(x) denotes the heuristic estimate (a lower bound) of the distance from x to the goal.
Minimal Edit Distance
Given two graphs G1 and G2 with the same number m of vertices, consider the following process.
Let N1 and N2 denote the vertices of G1 and G2 that have already been matched: N1 = (v1, v2, …, vn); N2 = (u1, u2, …, un).
Let M1 and M2 denote the vertices of G1 and G2 that have not yet been matched: M1 = (vn+1, vn+2, …, vm); M2 = (un+1, un+2, …, um).
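The matching process can be turned into a small A* search over partial vertex matchings. The sketch below rests on assumptions that are not in the slides: every edit operation costs 1, edges are unlabeled, both graphs have the same number of vertices, and the heuristic h only counts the vertex relabelings that the unmatched vertices must still incur. The function name `ged_astar` and the list-based encoding are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

def ged_astar(labels1, edges1, labels2, edges2):
    """A* over partial matchings; mapping[i] is the G2 vertex matched to G1 vertex i."""
    m = len(labels1)
    assert m == len(labels2), "sketch assumes equal vertex counts"
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}

    def h(mapping):
        # Lower bound: unmatched vertices whose labels cannot be paired up
        n = len(mapping)
        rest1 = Counter(labels1[n:])
        rest2 = Counter(labels2[j] for j in range(m) if j not in mapping)
        return (m - n) - sum((rest1 & rest2).values())

    tie = count()  # tie-breaker so the heap never compares mappings
    heap = [(h(()), next(tie), 0, ())]
    while heap:
        _, _, g, mapping = heapq.heappop(heap)
        n = len(mapping)
        if n == m:
            return g, mapping  # first complete matching popped is optimal
        for j in range(m):
            if j in mapping:
                continue
            step = int(labels1[n] != labels2[j])  # vertex relabeling
            for i, k in enumerate(mapping):       # edge insertions/deletions
                step += int((frozenset((i, n)) in e1) != (frozenset((k, j)) in e2))
            new = mapping + (j,)
            heapq.heappush(heap, (g + step + h(new), next(tie), g + step, new))

# Hypothetical example: relabel C to D, delete edge 0-2, insert edge 1-2
cost, matching = ged_astar(["A", "B", "C"], [(0, 1), (0, 2)],
                           ["A", "B", "D"], [(0, 1), (1, 2)])
```

Here the matched prefix of the mapping plays the role of N1 and N2, and the remaining vertices play the role of M1 and M2. Because h never overestimates the remaining cost, the first complete matching taken from the queue has minimum cost under the stated cost model.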
Comparing Stars
Comparing Stars: On Approximating Graph Edit Distance. Zhiping Zeng, Anthony K.H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou. VLDB 2009.
Comparing Stars
• Problem definition
Given a graph database D consisting of n graphs and a query graph q:
• Approximate full graph search
  • Find all graphs gi in D such that GED(q, gi) ≤ τ
• Approximate subgraph search
  • Find all graphs gi in D such that GED(q, r) ≤ τ for some subgraph r ⊆ gi
Comparing Stars
• Main idea: graph G → star structures
• Star structure: a triple (r, L, l), where
  r: root vertex
  L: the set of leaves
  l: labeling function
Comparing Stars
• Star edit distance
Comparing Stars
• Given two multisets of star structures S1 and S2, let P: S1 → S2 be a bijection.
• Assignment problem
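The star comparison and the assignment step can be sketched as follows. The star edit distance formula is not shown here, so the form used below is an assumption: a root term (0 if the root labels agree, 1 otherwise) plus a leaf term equal to the difference in leaf counts plus the number of leaf labels that cannot be paired up. The names `star_edit_distance`, `stars_of`, and `mapping_distance` are hypothetical, and the brute-force search over permutations merely stands in for a proper assignment solver such as the Hungarian algorithm.

```python
from collections import Counter
from itertools import permutations

EMPTY_STAR = (None, [])  # padding star when the multisets differ in size

def star_edit_distance(s1, s2):
    """Assumed form: root mismatch + | |L1| - |L2| | + max(|L1|, |L2|) - |common labels|."""
    (r1, leaves1), (r2, leaves2) = s1, s2
    root_cost = 0 if r1 == r2 else 1
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    size1, size2 = len(leaves1), len(leaves2)
    return root_cost + abs(size1 - size2) + max(size1, size2) - common

def stars_of(labels, adj):
    """One star per vertex: (root label, sorted labels of its neighbors)."""
    return [(labels[v], sorted(labels[w] for w in adj[v])) for v in adj]

def mapping_distance(stars1, stars2):
    """Minimum total star edit distance over all bijections (brute force)."""
    n = max(len(stars1), len(stars2))
    a = stars1 + [EMPTY_STAR] * (n - len(stars1))
    b = stars2 + [EMPTY_STAR] * (n - len(stars2))
    return min(sum(star_edit_distance(x, y) for x, y in zip(a, perm))
               for perm in permutations(b))

# Hypothetical example: g2 is g1 (path A-B-C) plus the edge A-C
labels = {0: "A", 1: "B", 2: "C"}
g1 = {0: {1}, 1: {0, 2}, 2: {1}}
g2 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
mu = mapping_distance(stars_of(labels, g1), stars_of(labels, g2))
```

In this example a single edge insertion changes two stars, each at cost 2, so the resulting value of `mu` is 4. The factorial-time search is only suitable for very small graphs.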
Comparing Stars
What is the relationship between GED(g1, g2) and the mapping distance μ(g1, g2)?
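Comparing Stars
A distance function f is a metric if and only if the following conditions hold for all x, y, z:
• Non-negativity: f(x, y) ≥ 0
• Identity: f(x, y) = 0 if and only if x = y
• Symmetry: f(x, y) = f(y, x)
• Triangle inequality: f(x, z) ≤ f(x, y) + f(y, z)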
Comparing Stars
We can prove that
• Graph edit distance is a metric (assuming that all edit operation costs are non-negative).
• Mapping distance is also a metric.
Comparing Stars
• Given two graphs g1 and g2, let P = (p1, p2, …, pk) be an alignment (a sequence of edit operations) transforming g1 into g2. Accordingly, there is a sequence of graphs g1 = h0 → h1 → … → hk = g2, where hi−1 → hi indicates that hi is the graph derived by performing pi on hi−1. As the mapping distance μ is a metric, the triangle inequality gives:
  μ(g1, g2) ≤ μ(h0, h1) + μ(h1, h2) + … + μ(hk−1, hk)
Comparing Stars
What is the relationship between one operation pi and the mapping distance μ(hi−1, hi)?
1. Edge insertion/deletion
One edge insertion/deletion affects at most two stars. Each star cost is at most 2. Thus, the mapping distance changes by at most 4 due to one edge insertion/deletion.
Comparing Stars
What is the relationship between one operation pi and the mapping distance μ(hi−1, hi)?
2. Vertex insertion/deletion
One vertex insertion/deletion affects at most one star. Each star cost is at most 1. Thus, the mapping distance changes by at most 1 due to one vertex insertion/deletion.
Comparing Stars
What is the relationship between one operation pi and the mapping distance μ(hi−1, hi)?
3. Vertex relabeling
Relabeling one vertex v0 affects at most deg(v0) + 1 stars.
Comparing Stars Lower Bound
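The per-operation bounds can be combined into a lower bound on the edit distance. The derivation below is a sketch under assumptions made here: μ denotes the mapping distance, δ is taken to bound the degree of every vertex involved, each star affected by a relabeling is assumed to change by at most 1, and k is the length of a shortest edit sequence, so that k = GED(g1, g2). One edge operation changes at most two stars by at most 2 each (at most 4 in total), one vertex insertion or deletion changes μ by at most 1, and one relabeling of v0 touches at most deg(v0) + 1 ≤ δ + 1 stars.

```latex
\mu(g_1, g_2)
  \;\le\; \sum_{i=1}^{k} \mu(h_{i-1}, h_i)
  \;\le\; k \cdot \max\{4,\ \delta + 1\}
\qquad\Longrightarrow\qquad
\mathrm{GED}(g_1, g_2) \;\ge\; \frac{\mu(g_1, g_2)}{\max\{4,\ \delta + 1\}}
```

Since μ comes from an assignment problem that can be solved in polynomial time, the right-hand side gives a cheap filter: any graph whose bound already exceeds the threshold can be discarded without computing the exact edit distance.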
Comparing Stars
Upper bound: based on the bipartite graph matching, we can define an upper bound for the edit distance.
Comparing Stars
Experiment datasets
• Real dataset
  • AIDS antiviral screen compound dataset: 42,687 chemical compounds
• Synthetic dataset
  • 1000 graphs, average size: 10
GSimJoin
Efficient Graph Similarity Joins with Edit Distance Constraints. Xiang Zhao, Chuan Xiao, Xuemin Lin, and Wei Wang. ICDE 2012.
GSimJoin
• Problem definition
Given two sets of graphs R and S, a graph similarity join with edit distance threshold τ returns all pairs of graphs, one from each set, whose graph edit distance is no larger than τ, i.e., { ⟨r, s⟩ | ged(r, s) ≤ τ, r ∈ R, s ∈ S }.
• The paper focuses on the self-join case: { ⟨ri, rj⟩ | ged(ri, rj) ≤ τ ∧ ri.id < rj.id, ri ∈ R, rj ∈ R }.
GSimJoin
• Definition (path-based q-gram): A path-based q-gram in a graph r is a simple path of length q.
GSimJoin
Let Qr denote the multiset of q-grams in a graph r, and let Qru denote the multiset of q-grams that contain the vertex u.
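Path-based q-grams can be enumerated with a short depth-first search. This is a minimal sketch under assumptions made here: a path of length q has q edges, a q-gram is represented only by the label sequence of its vertices (edge labels are ignored), and a path and its reverse count as the same q-gram. The function name `path_qgrams` and its parameter `u` are hypothetical.

```python
from collections import Counter

def path_qgrams(labels, adj, q, u=None):
    """Multiset of simple paths with q edges; if u is given, only paths through u."""
    grams = Counter()

    def extend(path):
        if len(path) == q + 1:
            # Every undirected path is reached from both ends; keep one direction
            if (q == 0 or path[0] < path[-1]) and (u is None or u in path):
                seq = tuple(labels[v] for v in path)
                grams[min(seq, seq[::-1])] += 1  # direction-independent key
            return
        for w in adj[path[-1]]:
            if w not in path:  # simple path: no repeated vertices
                extend(path + [w])

    for v in adj:
        extend([v])
    return grams

# Hypothetical example: path graph A - B - C
labels = {0: "A", 1: "B", 2: "C"}
adj = {0: {1}, 1: {0, 2}, 2: {1}}
q1_all = path_qgrams(labels, adj, 1)        # plays the role of Qr for q = 1
q1_at_0 = path_qgrams(labels, adj, 1, u=0)  # plays the role of Qru for u = 0
q2_all = path_qgrams(labels, adj, 2)
```

A `Counter` is used because two different paths can carry the same label sequence, which is why the slides speak of a multiset rather than a set.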