
Inves: Incremental Partitioning-based Verification for Graph Similarity Search

This presentation describes a partition-based approach to efficient graph similarity search. It tightens the GED lower bound with a novel partitioning scheme and uses the partitioning results to accelerate GED computation.



Presentation Transcript


  1. Inves: Incremental Partitioning-based Verification for Graph Similarity Search
     Jongik Kim (1), Dong-Hoon Choi (2), and Chen Li (3)
     (1) Chonbuk National University, South Korea
     (2) Korea Institute of Science and Technology Information, South Korea
     (3) University of California, Irvine

  2. Introduction - Graph Similarity Search
     • Graph Data Model
       • Graphs are ubiquitous and abundant in real-world data
       • Finding occurrences of a graph in a database is an essential operation
       • We need to tolerate noise, distortion, and different representations of graphs, which calls for graph similarity search
     • Graph Similarity Search
       • Important access method in many research areas
       • Cheminformatics: predicting properties of chemicals, drug design
       • Bioinformatics: similar DNA interactions
       • CV&PR: object detection, fingerprint identification, …

  3. Introduction - Graph Edit Distance
     • Graph Edit Distance (GED)
       • A general metric to measure the similarity between two graphs
       • The minimum number of graph edit operations needed to transform one graph into the other:
         • insertion of a single vertex or edge
         • deletion of a single vertex or edge
         • substitution of the label of a single vertex or edge
     • GED computation is NP-hard
     [Figure: two molecule graphs x and y with GED(x, y) = 3. Vertex labels denote atom symbols; edge labels (single and double lines) denote chemical bonds.]

  4. Graph Similarity Search
     • Graph Similarity Search with a GED constraint
       • Given a graph database and a query graph with a GED threshold τ, graph similarity search finds all graphs in the database whose GED from the query graph is within τ
     • Filtering-and-Verification Framework
       • Filtering: use a feature-based index to filter data graphs and generate candidate graphs
       • Verification: verify each candidate graph by computing its GED with the query graph
     [Callout on the slide: main focus of existing work]

  5. Previous Work - Partition-based Approach
     • Partition-based Approach
       • Given two graphs x and y, suppose x is decomposed into τ + 1 partitions
       • GED(x, y) ≤ τ ⇒ at least one partition of x is contained in y (i.e., no partition of x is contained in y ⇒ GED(x, y) > τ)
     • Filtering with a Partition-based Index [Pars, MLIndex]
     [Figure: offline processing decomposes each data graph in DB = {g1, g2} into τ + 1 partitions (τ = 1) and builds an index with lists Ip1, Ip2, Ip3 over g1 and g2. For the query graph q, partition p1 is contained in q, so g1 is a candidate graph.]

  6. Motivation of Our Work (1/2)
     • Problem with Existing Index-based Filtering
       • An offline partitioning of a data graph cannot work well for all queries
       • As a result, existing methods suffer from many candidate graphs and an expensive verification phase
     [Figure: the partition-based index example with τ = 1 and DB = {g1, g2}, contrasting the original partitioning of g1, under which p1 is contained in the query graph q and g1 is a candidate graph, with an alternative partitioning of g1.]

  7. Motivation of Our Work (2/2)
     • Motivation
       • Refine each candidate by partitioning it based on the query graph
       • Cost of partitioning and containment tests << cost of GED computation
     [Figure: processing pipeline. Filtering phase: candidate generation. Verification phase (scope of our work): candidate refinement, then GED computation.]

  8. Candidate Verification Scheme
     • Partition-based GED Lower Bound
       • Given two graphs x and y with a partitioning of x, P(x) = {p1, p2, …, pk}, a GED lower bound between x and y is
         lb(x, y) = |{p | p ∈ P(x) and p is not contained in y}|
       • A partition p that is not contained in y is called a mismatching partition
     • Candidate Verification
       • For a candidate x and a query y with a GED threshold τ:
         • if lb(x, y) > τ then prune x
         • else if GED(x, y) > τ then prune x
         • else x is an answer of the query
       • We compute the GED only when the lower bound is not greater than τ
     • Goal
       • Tightening the lower bound by developing a novel partitioning strategy
       • Exploiting partitioning results to accelerate GED computation
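
The verification scheme on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `is_contained` and `ged` are hypothetical callables supplied by the caller (a subgraph containment test and an exact GED routine), and partitions and graphs are treated as opaque objects.

```python
from typing import Callable, Iterable, TypeVar

P = TypeVar("P")  # a partition of the candidate graph x
G = TypeVar("G")  # a graph


def partition_lower_bound(partitions: Iterable[P], y: G,
                          is_contained: Callable[[P, G], bool]) -> int:
    """lb(x, y): the number of mismatching partitions, i.e. partitions
    of x that are not contained in y."""
    return sum(1 for p in partitions if not is_contained(p, y))


def verify(x: G, partitions: Iterable[P], y: G, tau: int,
           is_contained: Callable[[P, G], bool],
           ged: Callable[[G, G], int]) -> bool:
    """Return True if candidate x is an answer for query y under threshold tau.

    The expensive ged() call is made only when the cheap lower bound
    does not already exceed tau.
    """
    if partition_lower_bound(partitions, y, is_contained) > tau:
        return False  # pruned by the lower bound, no GED computation
    return ged(x, y) <= tau
```

The point of the design is the short circuit in `verify`: every candidate pruned by the lower bound skips the GED computation entirely.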

  9. Tightening the Lower Bound - Measure for a Good Partitioning
     • For every mismatching partition p in P(x):
       • C1: Edit errors in p are indivisible and minimal
         • indivisibility: p cannot be decomposed into two mismatching partitions
         • minimality: p becomes a matching partition if we remove any vertex in p
       • C2: An edit error in a bridge of p is captured by p, while preserving C1
         • bridge: an edge connecting p to another partition
     • See the paper for a detailed analysis of the tightness of the partition-based lower bound
     • Example
       [Figure: four different partitionings of x against y, yielding lower bounds lb(x, y) of 1, 2, 3, and 4.]

  10. Tightening the Lower Bound - Incremental Partitioning
      • Incremental Partitioning Strategy
        1. Perform a containment test of x against y by investigating the vertices in x one after another
        2. As soon as the test fails, isolate the investigated vertices and the edges connecting them into a separate partition
        3. Repeat with the remaining part of x
      • A mismatching partition p produced this way cannot be decomposed into two mismatching partitions (see the paper for the proof)
      • The strategy therefore exactly meets the indivisibility constraint of the measure
      [Figure: incremental partitioning example with lb(x, y) = 2.]
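
The three steps above can be sketched in Python. This is an illustrative sketch only, with several assumptions that are not from the slides: a graph is a pair `(vertex_labels, edges)` where `edges` maps `frozenset({u, v})` to an edge label, the containment test is a brute-force label-preserving subgraph embedding (practical only for tiny graphs), the vertex order is supplied by the caller, and the bridge constraint and rematch method are omitted. The names `is_contained`, `incremental_partition`, and `path` are hypothetical.

```python
from itertools import permutations


def is_contained(part, x, y):
    """Brute force: does the subgraph of x induced by `part` embed in y
    with all vertex and edge labels preserved?"""
    x_labels, x_edges = x
    y_labels, y_edges = y
    inside = set(part)
    part_edges = [(e, lab) for e, lab in x_edges.items() if e <= inside]
    for image in permutations(y_labels, len(part)):
        m = dict(zip(part, image))  # injective mapping part -> vertices of y
        if any(x_labels[u] != y_labels[m[u]] for u in part):
            continue
        if all(y_edges.get(frozenset(m[u] for u in e)) == lab
               for e, lab in part_edges):
            return True
    return False


def incremental_partition(x, y, order):
    """Investigate the vertices of x in `order`; whenever the containment
    test fails, isolate the investigated vertices as a mismatching partition
    and restart with the remaining vertices."""
    mismatching, current = [], []
    for u in order:
        current.append(u)
        if not is_contained(current, x, y):
            mismatching.append(current)
            current = []
    return mismatching, current  # `current` is the final matching part


def path(labels):
    """Hypothetical helper: a path graph whose edges are all single bonds."""
    vertex_labels = dict(enumerate(labels, start=1))
    edges = {frozenset({i, i + 1}): "-" for i in range(1, len(labels))}
    return vertex_labels, edges


x = path(["O", "C", "C", "F"])
y = path(["N", "C", "C", "S"])
mismatching, matching = incremental_partition(x, y, order=[1, 2, 3, 4])
print(mismatching, matching)  # [[1], [2, 3, 4]] []
```

In this toy run the lower bound is `len(mismatching) == 2`, which equals the actual GED of the two paths (two label substitutions). Edges between partitions are bridges and are deliberately ignored here.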

  11. Tightening the Lower Bound - Bridge Constraint
      • Bridge: an edge connecting one partition to another partition
      • B(p, o): edit errors in the bridges of p with respect to its occurrence o
      • An error in a bridge can be counted twice, once by each of the two partitions it connects
      • A partition can therefore use a half of the errors in its bridges
      • Bridge Constraint: if B(p, o) > 1, p is mismatching with o (see the paper for the proof)
      • Pushing the bridge constraint into the containment test approximately meets the bridge error condition in the measure
      [Figure: matching partition p3 (vertices u6, u7, u8) and its occurrence o in y (vertices v3, v6, v7). The bridge difference between p3 and o is 3 + 0 + 0 = 3 = B(p3, o), so p3 becomes mismatching; lb(x, y) = 2.]

  12. Tightening the Lower Bound - Rematch Method
      • Edit errors in a mismatching partition p are mainly caused by the last vertex (without the last vertex, p is a matching partition)
      • Rematch Method
        • Reorder the vertices in p
          • Fix the last vertex as the start vertex
          • Place infrequent vertices and edges first while preserving the vertex connectivity
        • Rematch p with the new vertex ordering
        • We can expect the edit errors to be detected in a smaller substructure
      • Further optimization
        • Repeat rematching while the size of p decreases
        • This approximately meets the minimality constraint in the measure
      [Figure: rematching example with the vertex ordering u4, u3, u2, u1; lower bounds shown are lb(x, y) = 1 and lb(x, y) = 2.]

  13. Improving GED Computation - Exploiting information from partitioning
      • Existing GED Computation Method
        • The most widely used GED computation method is based on A*
        • It considers all possible vertex mappings between two graphs in a best-first fashion
        • Each internal state of the state-space tree denotes a partial vertex mapping
        • For each active state, it calculates an estimated distance as the sum of the existing distance in mapped vertices and edges and an estimated distance of the unmapped parts
        • It selects a state with the minimum distance and expands the state-space tree
      • We have a method to accurately estimate the distance of a partial vertex mapping (see the paper for the details)
      • Place vertices in mismatching partitions first!
        • Since the existing edit errors in mapped vertices and edges are exactly calculated, we can find many edit errors at higher levels of the state-space tree
        • This significantly reduces the search space of the A* algorithm
        • This approach can be accelerated by our incremental partitioning technique because our technique makes the size of a mismatching partition as small as possible
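
The "mismatching partitions first" idea amounts to choosing the order in which A* maps the vertices of the candidate. A minimal sketch, assuming partitions are given as lists of vertex identifiers; the function name `mismatch_first_order` is hypothetical, and the slides do not specify any ordering within or among the mismatching partitions, so this sketch simply keeps the given order:

```python
def mismatch_first_order(mismatching, matching):
    """Vertex mapping order for A*: all vertices of mismatching partitions
    come first, followed by the vertices of the matching part."""
    order = [u for part in mismatching for u in part]
    order.extend(matching)
    return order


# Two mismatching partitions, [4] and [7, 2], and a matching part [1, 3].
print(mismatch_first_order([[4], [7, 2]], [1, 3]))  # [4, 7, 2, 1, 3]
```

With such an order, the vertices known to carry edit errors are mapped near the root of the state-space tree, which is where the slide expects the errors to be found.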

  14. Experiments - Experimental Setup
      • Platform
        • Intel Core i7 at 3.4 GHz with 32 GB RAM, running a 64-bit Ubuntu OS
      • Datasets
        • [Dataset table not captured in the transcript]
        • Synthetic dataset: see the paper for the details and results
      • Query Workloads
        • Queries are randomly selected from the datasets
        • For each dataset, 100 queries are selected
        • Results are reported on the basis of 100 queries
      • Search Algorithms
        • G: GSimSearch [ICDE 2012, VLDB J. 2013]
        • P: Pars [PVLDB 2013]
        • M: MLIndex [ICDE 2017]

  15. Experiments- AIDS Dataset ✽ y axes are log scaled in all figures

  16. Experiments- PubChem Dataset ✽ y axes are log scaled in all figures

  17. Experiments- Protein Dataset ✽ y axes are log scaled in all figures

  18. Conclusions
      • Observation: Online dynamic partitioning of a candidate graph can reduce the cost of verification
      • Key Idea:
        • Judiciously and incrementally partitioning a candidate graph to tighten the GED lower bound
        • Exploiting the information collected in partitioning to accelerate GED computation
      • Key Results: Enhanced the performance of graph similarity search significantly
      Thank you!

  19. Improving GED Computation - Existing A* Algorithm for GED
      • Estimating an edit distance of the partial mapping {u1→v1}: g + h
        • g: existing edit distance of the mapped part
        • h: estimated edit distance of the unmapped part
      • h = label differences of unmapped edges and unmapped vertices
        • Unmapped vertices: x has 1 S, 2 C's, and 1 O; y has 1 S and 3 C's → 1 difference (substituting O with C)
        • Unmapped edges: x has 5 single bonds and 1 double bond; y has 5 single bonds and 1 double bond → no difference
        • h = 0 + 1 = 1
      • With g = 0, the estimated distance of {u1→v1} is g + h = 0 + 1 = 1
      [Figure: graphs x and y and the state-space tree with the root and the state {u1→v1}, labeled 1.]
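
The label-difference estimate h on this slide can be reproduced with multisets. The slide does not spell out the formula, so the sketch below assumes the common definition of the difference between two label multisets: the size of the larger multiset minus the size of their multiset intersection. The function names are hypothetical.

```python
from collections import Counter


def label_difference(a, b):
    """Assumed definition: max(|A|, |B|) - |A ∩ B| over label multisets."""
    ca, cb = Counter(a), Counter(b)
    overlap = sum((ca & cb).values())  # size of the multiset intersection
    return max(sum(ca.values()), sum(cb.values())) - overlap


def estimate_h(x_vertices, y_vertices, x_edges, y_edges):
    """h = label differences of unmapped edges + unmapped vertices."""
    return (label_difference(x_edges, y_edges)
            + label_difference(x_vertices, y_vertices))


# Unmapped parts from the slide, after mapping u1 to v1.
x_vertices = ["S", "C", "C", "O"]
y_vertices = ["S", "C", "C", "C"]
x_edges = ["-"] * 5 + ["="]  # 5 single bonds and 1 double bond
y_edges = ["-"] * 5 + ["="]

print(estimate_h(x_vertices, y_vertices, x_edges, y_edges))  # 1
```

The result matches the slide: no edge label difference, one vertex label difference (O versus C), so h = 0 + 1 = 1.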

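  20. Improving GED Computation - Existing A* Algorithm for GED
      [Figure: state-space tree.
       Level 1 (children of the root): {u1→v1}, {u1→v4}, {u1→v5}, {u1→v2}, {u1→v3}, {u1→ɛ}; estimated distances as extracted: 1, 3, 2, 2, 2, 3.
       Level 2 (children of {u1→v1}): {u1→v1, u2→v5}, {u1→v1, u2→ɛ}, {u1→v1, u2→v4}, {u1→v1, u2→v3}, {u1→v1, u2→v2}; estimated distances as extracted: 1, 4, 4, 4, 5.]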

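  21. Improving GED Computation - Existing A* Algorithm for GED
      • Pruning with a given threshold τ (e.g., τ = 2): states whose estimated distance exceeds τ are pruned
      • Repeat the same procedure until
        • there is no active node, or
        • a leaf node is found
      • In general, the performance of A* depends on the accuracy of the estimated distance
      [Figure: state-space tree.
       Level 1: {u1→v1}, {u1→v4}, {u1→v5}, {u1→v2}, {u1→v3}, {u1→ɛ}; estimated distances as extracted for the last five states: 3, 2, 2, 2, 3.
       Level 2: {u1→v1, u2→v5}, {u1→v1, u2→ɛ}, {u1→v1, u2→v4}, {u1→v1, u2→v3}, {u1→v1, u2→v2}; estimated distances as extracted: 1, 4, 4, 4, 5.]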

  22. Improving GED Computation - Improved Estimated Distance
      • Consider the partial vertex mapping {u1→v1, u2→v2, u3→v3}
        • The existing distance g = 0
      • Previous work: label differences in the unmapped part
        • Unmapped vertices: x has 1 C and 1 O; y has 2 C's → 1 difference (substituting O with C)
        • Unmapped edges: x has 3 single bonds and 1 double bond; y has 3 single bonds and 1 double bond → no difference
        • h = 0 + 1 = 1
      • Our approach: distinguish bridges from unmapped edges
        • Unmapped vertices: x has 1 C and 1 O; y has 2 C's → 1 difference
        • Unmapped edges: x has 1 single bond; y has 1 single bond → no difference
        • Bridge difference: between u1 and v1 = 1, between u2 and v2 = 0, between u3 and v3 = 1 → 2 differences
        • h = 0 + 1 + 2 = 3, a much more accurate estimation
