Inves: Incremental Partitioning-based Verification for Graph Similarity Search

Jongik Kim (Chonbuk National University, South Korea), Dong-Hoon Choi (Korea Institute of Science and Technology Information, South Korea), and Chen Li (University of California, Irvine)

Introduction- Graph Similarity Search
• Graph Data Model
  • Graphs are ubiquitous and abundant in real-world data
  • Finding occurrences of a graph in a database is an essential operation
  • We need to tolerate noise, distortion, and different representations of graphs
  → Calls for graph similarity search
• Graph Similarity Search
  • An important access method in many research areas
  • Cheminformatics: predicting properties of chemicals, drug design
  • Bioinformatics: similar DNA interactions
  • CV&PR: object detection, fingerprint identification, …

Introduction- Graph Edit Distance
• Graph Edit Distance (GED)
  • A general metric to measure the similarity between two graphs
  • The minimum number of graph edit operations needed to transform one graph into the other
    • insertion of a single vertex or edge
    • deletion of a single vertex or edge
    • substitution of the label of a single vertex or edge
• GED computation is NP-hard
[Figure: two molecule graphs x and y with GED(x, y) = 3; vertex labels denote atom symbols, and edge labels (single and double lines) denote chemical bonds.]

Graph Similarity Search
• Graph Similarity Search with a GED constraint
  • Given a graph database and a query graph with a GED threshold τ, graph similarity search finds all graphs in the database whose GED from the query graph is within τ
• Filtering-and-Verification Framework
  • Filtering: using a feature-based index, filter the data graphs to generate candidate graphs (the main focus of existing work)
  • Verification: verify each candidate graph by computing its GED to the query graph

Previous Work- Partition-based Approach
• Partition-based Approach
  • Given two graphs x and y, suppose x is decomposed into τ + 1 partitions
  • GED(x, y) ≤ τ ⇒ at least one partition of x is contained in y (i.e., no partition of x is contained in y ⇒ GED(x, y) > τ)
• Filtering with a Partition-based Index [Pars, MLIndex]
[Figure: example with τ = 1. Offline processing: the data graphs g1 and g2 in the DB are each decomposed into τ + 1 partitions, which are stored in an index with entries Ip1, Ip2, and Ip3. For the query graph q, p1 is contained in q, so g1 is a candidate graph.]
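The filtering principle above is a pigeonhole argument, which can be spelled out as follows. This is a sketch under a simplifying assumption that is not stated on the slide: each edit operation affects at most one partition. The symbols P(x) and p_i are notation chosen here for illustration.

```latex
\begin{align*}
&P(x) = \{p_1, \dots, p_{\tau+1}\}, \quad
  \text{each edit operation affects at most one } p_i \\
&\mathrm{GED}(x, y) \le \tau
  \;\Rightarrow\; \text{at most } \tau \text{ partitions of } x \text{ are affected} \\
&\phantom{\mathrm{GED}(x, y) \le \tau}
  \;\Rightarrow\; \exists\, p_i \in P(x) \text{ that is unaffected, hence contained in } y
\end{align*}
```

The contrapositive is the filter: if none of the τ + 1 partitions of x is contained in y, then GED(x, y) > τ and the pair can be discarded without computing the GED.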

Motivation of Our Work (1/2)
• Problem with Existing Index-based Filtering
  • An offline partitioning of a data graph cannot work well for all queries
  → Suffers from many candidate graphs and an expensive verification phase
[Figure: the τ = 1 example again, comparing the original partitioning of g1 with an alternative partitioning of g1. With the original partitioning, p1 is contained in the query graph q, so g1 is a candidate graph.]

Motivation of Our Work (2/2)
• Motivation
  • Refine each candidate by partitioning it based on the query graph
  • Cost for partitioning and containment tests << Cost for GED computation
[Figure: pipeline Candidate Generation → Candidate Refinement → GED Computation. Candidate Generation is the filtering phase; Candidate Refinement and GED Computation form the verification phase (the scope of our work).]

Candidate Verification Scheme
• Partition-based GED Lower Bound
  • Given two graphs x and y with a partitioning of x, P(x) = {p1, p2, …, pk}, a GED lower bound between x and y is
    lb(x, y) = |{p | p ∈ P(x) and p is not contained in y}|
  • A partition p that is not contained in y is called a mismatching partition
• Candidate Verification
  • For a candidate x and a query y with a GED threshold τ,
    • if lb(x, y) > τ then prune x
    • else if GED(x, y) > τ then prune x
    • else x is an answer of the query
  → We compute the GED only when the lower bound is not greater than τ
• Goal
  • Tightening the lower bound by developing a novel partitioning strategy
  • Exploiting partitioning results to accelerate GED computation
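The verification scheme above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names and the `contains` and `ged` callables (a containment test and an exact GED routine) are hypothetical placeholders supplied by the caller.

```python
from typing import Any, Callable, Sequence


def partition_lower_bound(
    partitions: Sequence[Any],
    y: Any,
    contains: Callable[[Any, Any], bool],
) -> int:
    """lb(x, y): the number of partitions of x that are not contained in y."""
    return sum(1 for p in partitions if not contains(p, y))


def verify(
    partitions: Sequence[Any],
    x: Any,
    y: Any,
    tau: int,
    contains: Callable[[Any, Any], bool],
    ged: Callable[[Any, Any], int],
) -> bool:
    """Return True if candidate x is an answer for query y with threshold tau."""
    # Cheap test first: prune when the lower bound already exceeds tau.
    if partition_lower_bound(partitions, y, contains) > tau:
        return False
    # The expensive GED computation runs only for surviving candidates.
    return ged(x, y) <= tau
```

The point of the ordering is that `ged` is never called for a candidate whose lower bound exceeds τ, so a tighter lower bound directly removes GED computations.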

Tightening the Lower Bound- Measure for a Good Partitioning
For every mismatching partition p in P(x),
• C1: Edit errors in p are indivisible and minimal
  • indivisibility – p cannot be decomposed into two mismatching partitions
  • minimality – p becomes a matching partition if we remove any vertex in p
• C2: An edit error in a bridge of p is captured by p, while preserving C1
  • bridge – an edge connecting p to another partition
See the paper for a detailed analysis of the tightness of the partition-based lower bound
[Figure: Example showing four partitionings of x against y, with panels labeled lb(x, y) = 1, lb(x, y) = 2, lb(x, y) = 3, and lb(x, y) = 4.]

Tightening the Lower Bound- Incremental Partitioning
• Incremental Partitioning Strategy
  1. Perform a containment test of x against y by investigating the vertices in x one after another
  2. As soon as the test fails, isolate the investigated vertices and the edges connecting them into a separate partition
  3. Repeat with the remaining part of x
• A mismatching partition p produced this way cannot be decomposed into two mismatching partitions
  → exactly meets the indivisibility constraint of the measure (see the paper for the proof)
[Figure: running example with lb(x, y) = 2, showing an occurrence o of p3.]
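The three steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the graph encoding (a dict of vertex labels plus a dict of edge labels keyed by `frozenset` pairs), the function names, and the brute-force containment test that is rerun from scratch at every step are all choices made here. Edges between partitions (bridges) are ignored.

```python
from itertools import permutations


def contained(p_labels, p_edges, y_labels, y_edges):
    """Brute-force test: is the labelled graph p a subgraph of y?"""
    p_vertices = list(p_labels)
    for image in permutations(y_labels, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        if any(p_labels[u] != y_labels[m[u]] for u in p_vertices):
            continue
        # Every edge of p must map onto an edge of y with the same label.
        if all(y_edges.get(frozenset(m[u] for u in e)) == lbl
               for e, lbl in p_edges.items()):
            return True
    return False


def incremental_partition(x_labels, x_edges, y_labels, y_edges, order):
    """Return a list of (vertices, is_mismatching) partitions of x."""
    partitions, current = [], []
    for v in order:
        current.append(v)
        sub_labels = {u: x_labels[u] for u in current}
        sub_edges = {e: l for e, l in x_edges.items() if e.issubset(current)}
        if not contained(sub_labels, sub_edges, y_labels, y_edges):
            # Step 2: the test failed, so isolate what was investigated.
            partitions.append((current, True))
            current = []  # Step 3: continue with the remaining part of x.
    if current:
        partitions.append((current, False))  # leftover matching partition
    return partitions


# Toy example (hypothetical data): chain C-N-O-C against chain C-N-S-C.
x_labels = {1: "C", 2: "N", 3: "O", 4: "C"}
x_edges = {frozenset((1, 2)): "single", frozenset((2, 3)): "single",
           frozenset((3, 4)): "single"}
y_labels = {"a": "C", "b": "N", "c": "S", "d": "C"}
y_edges = {frozenset(("a", "b")): "single", frozenset(("b", "c")): "single",
           frozenset(("c", "d")): "single"}

parts = incremental_partition(x_labels, x_edges, y_labels, y_edges, [1, 2, 3, 4])
lb = sum(1 for _, mismatching in parts if mismatching)
print(parts, lb)
```

In the toy example the test first fails when vertex 3 (label O) is added, so vertices 1 to 3 form one mismatching partition and vertex 4 remains as a matching partition.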

Tightening the Lower Bound- Bridge Constraint
• Bridge: an edge connecting one partition to another partition
• B(p3, o): edit errors in the bridges of p3, i.e., the bridge difference between p3 and its occurrence o
  • In the example, the matching partition p3 (vertices u6, u7, u8) has the occurrence o (vertices v3, v6, v7), and B(p3, o) = 3 + 0 + 0 = 3
• An error in a bridge can be counted twice → a partition can use half of the errors in its bridges
• Bridge Constraint: if B(p, o) > 1, p is mismatching with o (see the paper for the proof)
• Pushing the bridge constraint into the containment test approximately meets the bridge error condition in the measure
[Figure: running example with lb(x, y) = 2; p3 and its occurrence o, where p3 becomes mismatching under the bridge constraint.]

Tightening the Lower Bound- Rematch Method
• Edit errors in a mismatching partition p are mainly caused by the last vertex (without the last vertex, p is a matching partition)
• Rematch Method
  • Reorder the vertices in p
    • Fix the last vertex as the start vertex
    • Place infrequent vertices and edges first while preserving the vertex connectivity
  • Rematch p with the new vertex ordering
  → We can expect the edit errors to be detected in a smaller substructure
• Further optimization
  • Repeat rematching while the size of p decreases
  → approximately meets the minimality constraint in the measure
[Figure: example with the vertex ordering u4, u3, u2, u1; panels labeled lb(x, y) = 1 and lb(x, y) = 2.]

Improving GED Computation- Exploiting information from partitioning
• Existing GED Computation Method
  • The most widely used GED computation method is based on A*
  • It considers all possible vertex mappings between two graphs in a best-first fashion
  • Each internal state of the state-space tree denotes a partial vertex mapping
  • For each active state, it calculates an estimated distance as the sum of the existing distance in the mapped vertices and edges and an estimated distance of the unmapped parts
  • It selects a state having a minimum distance and expands the state-space tree
→ We have a method to accurately estimate the distance of a partial vertex mapping (see the paper for the details)
• Place vertices in mismatching partitions first!
  • Since the existing edit errors in mapped vertices and edges are exactly calculated, we can find many edit errors at higher levels of the state-space tree
  → Significantly reduces the search space of the A* algorithm
  • This approach can be accelerated by our incremental partitioning technique because our technique makes the size of a mismatching partition as small as possible
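The "mismatching partitions first" rule amounts to choosing the order in which A* maps the vertices of x. A minimal sketch, assuming partitions are given as (vertices, is_mismatching) pairs; this encoding and the function name `mapping_order` are hypothetical and not taken from the paper.

```python
def mapping_order(partitions):
    """Return the vertices of x with those in mismatching partitions first.

    partitions: list of (vertex_list, is_mismatching) pairs.
    """
    # sorted() is stable, so partitions keep their relative order
    # within the mismatching group and within the matching group.
    ordered = sorted(partitions, key=lambda part: not part[1])
    return [v for vertices, _ in ordered for v in vertices]


# Hypothetical partitioning: one matching and one mismatching partition.
print(mapping_order([([4], False), ([1, 2, 3], True)]))
```

With this ordering, the vertices that are known to carry edit errors are mapped near the root of the state-space tree, where their exactly computed cost can prune states early.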

Experiments- Experimental Setup
• Platform
  • Intel Core i7 at 3.4GHz with 32GB RAM, running a 64-bit Ubuntu OS
• Dataset
  • Synthetic dataset – see the paper for the details and results
• Query Workloads
  • Randomly selected from the datasets
  • For each dataset, 100 queries are selected
  • Results are reported on the basis of the 100 queries
• Search Algorithms
  • G - GSimSearch [ICDE 2012, VLDB J. 2013]
  • P - Pars [PVLDB 2013]
  • M - MLIndex [ICDE 2017]

Experiments- AIDS Dataset ✽ y axes are log scaled in all figures

Experiments- PubChem Dataset ✽ y axes are log scaled in all figures

Experiments- Protein Dataset ✽ y axes are log scaled in all figures

Conclusions
• Observation: Online dynamic partitioning of a candidate graph can reduce the cost of verification
• Key Idea:
  • Judiciously and incrementally partitioning a candidate graph to tighten the GED lower bound
  • Exploiting the information collected during partitioning to accelerate GED computation
• Key Results: Enhanced the performance of graph similarity search significantly
Thank you!

Improving GED Computation- Existing A* Algorithm for GED
• Estimating an edit distance of the partial mapping {u1→v1}: g + h
  • g: existing edit distance of the mapped part; here g = 0
  • h: estimated edit distance of the unmapped part = label differences of unmapped edges and unmapped vertices; here h = 0 + 1 = 1
    • unmapped edges – x: 5 single bonds and 1 double bond; y: 5 single bonds and 1 double bond → no difference
    • unmapped vertices – x: 1 S, 2 C's, and 1 O; y: 1 S and 3 C's → 1 difference (substituting O with C)
  • The state {u1→v1} therefore has an estimated distance of g + h = 1
[Figure: graphs x and y, and the state-space tree with the root and the state {u1→v1} annotated with 1.]
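The label-difference estimate on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function names `label_difference` and `estimate_h` are hypothetical, and the multiset formula used (size of the larger multiset minus the number of common labels) is one common way to count label differences, assumed here rather than taken from the slides.

```python
from collections import Counter


def label_difference(labels_x, labels_y):
    """Number of labels that must be substituted, inserted, or deleted
    to turn one multiset of labels into the other."""
    cx, cy = Counter(labels_x), Counter(labels_y)
    common = sum((cx & cy).values())  # size of the multiset intersection
    return max(sum(cx.values()), sum(cy.values())) - common


def estimate_h(vertices_x, vertices_y, edges_x, edges_y):
    """h = label differences of unmapped edges + unmapped vertices."""
    return (label_difference(edges_x, edges_y)
            + label_difference(vertices_x, vertices_y))


# Unmapped parts for the partial mapping {u1 -> v1}, as listed on the slide.
vertices_x = ["S", "C", "C", "O"]
vertices_y = ["S", "C", "C", "C"]
edges_x = ["single"] * 5 + ["double"]
edges_y = ["single"] * 5 + ["double"]

g = 0  # existing edit distance of the mapped part
h = estimate_h(vertices_x, vertices_y, edges_x, edges_y)
print(g + h)  # 1
```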

Improving GED Computation- Existing A* Algorithm for GED root {u1v1} {u1v1} {u1v4} {u1v5} {u1v2} {u1v3} {u1 ɛ} 1 3 2 2 2 3 {u1v1, u2v5} {u1v1, u2 ɛ} {u1v1, u2v4} {u1v1, u2v3} {u1v1, u2v2} 1 4 4 4 5

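Improving GED Computation- Existing A* Algorithm for GED
• Pruning with a given threshold τ (e.g., τ = 2): states whose estimated distance exceeds τ are pruned
• Repeat the same procedure until
  • there is no active node, or
  • a leaf node is found
• In general, the performance of A* depends on the accuracy of the estimated distance
[Figure: the same state-space tree after expanding {u1→v1}. The remaining first-level states {u1→v4}, {u1→v5}, {u1→v2}, {u1→v3}, and {u1→ɛ} are annotated with 3, 2, 2, 2, 3, and the second-level states {u1→v1, u2→v5}, {u1→v1, u2→ɛ}, {u1→v1, u2→v4}, {u1→v1, u2→v3}, and {u1→v1, u2→v2} with 1, 4, 4, 4, 5.]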

Improving GED Computation- Improved Estimated Distance
• Consider the partial vertex mapping {u1→v1, u2→v2, u3→v3}; the existing distance is g = 0
• Previous work: label differences in the unmapped part
  • Unmapped edges – x: 3 single bonds and 1 double bond; y: 3 single bonds and 1 double bond → no difference
  • Unmapped vertices – x: 1 C and 1 O; y: 2 C's → 1 difference (substituting O with C)
  • h = 0 + 1 = 1
• Our approach: distinguish bridges from unmapped edges
  • Unmapped edges – x: 1 single bond; y: 1 single bond → no difference
  • Unmapped vertices – x: 1 C and 1 O; y: 2 C's → 1 difference
  • Bridge difference – between u1 and v1 = 1, between u2 and v2 = 0, between u3 and v3 = 1 → 2 differences
  • h = 0 + 1 + 2 = 3 → a much more accurate estimation
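The two estimates on this slide can be written side by side as a worked equation. The symbols are notation introduced here for illustration: Δ_E and Δ_V are the label differences of unmapped edges and unmapped vertices, Δ_E' is the label difference of unmapped edges once bridges are excluded, and B(u_i, v_i) is the bridge difference between a mapped pair.

```latex
\begin{align*}
h_{\text{prev}} &= \Delta_E + \Delta_V = 0 + 1 = 1 \\
h_{\text{bridge}} &= \Delta_{E'} + \Delta_V + \sum_{i=1}^{3} B(u_i, v_i)
                   = 0 + 1 + (1 + 0 + 1) = 3
\end{align*}
```

Both estimates use the same vertex term; the gain comes entirely from the bridge term, which counts edge errors that the plain label comparison cannot see because the two graphs have the same multiset of edge labels.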
