
Ye Yuan 1 , Guoren Wang 1 , Haixun Wang 2 , Lei Chen 3

This paper explores query processing on uncertain graphs, which is crucial in bioinformatics and social networks, and proposes a framework, problem definition, and solutions for probabilistic subgraph search.


Presentation Transcript


  1. Efficient Subgraph Search over Large Uncertain Graphs • Ye Yuan¹, Guoren Wang¹, Haixun Wang², Lei Chen³ • ¹ Northeastern University, China; ² Microsoft Research Asia; ³ HKUST

  2. Outline • Ⅰ Background • Ⅱ Problem Definition • Ⅲ Query Processing Framework • Ⅳ Solutions • Ⅴ Conclusions

  3. Background • A graph is a complex data structure that is used in many real applications. • Bioinformatics: gene regulatory networks, yeast PPI networks

  4. Background • Compounds: benzene ring, compound database

  5. Background • Social networks: Web 2.0, EntityCube

  6. Background • In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs. • The STRING database (http://string-db.org) is a data source that contains PPIs with uncertain edges provided by biological experiments. • In visual pattern recognition, uncertain graphs are used to model visual objects. • In social networks, uncertain links represent possible relationships or the strength of influence between people. • Therefore, it is important to study query processing on large uncertain graphs.

  7. Outline • Ⅰ Background • Ⅱ Problem Definition • Ⅲ Query Processing Framework • Ⅳ Solutions • Ⅴ Conclusions

  8. Problem Definition • Probabilistic subgraph search • Uncertain graph: • Vertex uncertainty (existence probability) • Edge uncertainty (existence probability given its two endpoints)

  9. Problem Definition • Probabilistic subgraph search • Possible worlds: every combination of present and absent uncertain edges and vertices

  10. Problem Definition • Probabilistic subgraph search • Given: an uncertain graph database G = {g1, g2, …, gn}, a query graph q, and a probability threshold ε • Query: find all gi ∈ G such that the subgraph isomorphic probability between q and gi is not smaller than ε. • Subgraph isomorphic probability (SIP): the SIP between q and gi is the sum of the probabilities of gi's possible worlds to which q is subgraph isomorphic

  11. Problem Definition • Probabilistic subgraph search • Subgraph isomorphic probability (SIP): in the slide's example, the probabilities of the possible worlds of g that contain q sum to 0.27 • Calculating SIP is #P-complete
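The SIP definition above can be made concrete with a brute-force sketch. This is an illustration, not the authors' code: it assumes only edges are uncertain, that edge probabilities are independent, and that graphs are undirected and unlabeled. The function names `is_subgraph` and `sip_brute_force` are hypothetical. The loop over all 2^|E| possible worlds shows why the naive computation does not scale.

```python
from itertools import permutations, product


def is_subgraph(q_vertices, q_edges, g_vertices, g_edges):
    """True if q maps injectively into g so that every query edge exists in g."""
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_vertices, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in g_edge_set
               for u, v in q_edges):
            return True
    return False


def sip_brute_force(q_vertices, q_edges, g_vertices, edge_probs):
    """Sum the probabilities of all possible worlds of g that contain q.

    edge_probs maps each uncertain edge (u, v) to its existence probability;
    independence between edges is an assumption of this sketch.
    """
    edges = list(edge_probs)
    total = 0.0
    for present in product([False, True], repeat=len(edges)):
        world_prob = 1.0
        world_edges = []
        for edge, keep in zip(edges, present):
            p = edge_probs[edge]
            world_prob *= p if keep else 1.0 - p
            if keep:
                world_edges.append(edge)
        if is_subgraph(q_vertices, q_edges, g_vertices, world_edges):
            total += world_prob
    return total
```

For a triangle whose three edges each exist with probability 0.5, a single-edge query is contained in every world except the empty one, so its SIP is 1 − 0.5³ = 0.875.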

  12. Outline • Ⅰ Background • Ⅱ Problem Definition • Ⅲ Query Processing Framework • Ⅳ Solutions • Ⅴ Conclusions

  13. Query Processing Framework • Probabilistic subgraph query processing framework • Naïve method: sequentially scan the database and, for each gi, decide whether the SIP between q and gi is not smaller than the threshold ε. • Deciding whether one graph is subgraph isomorphic to another: NP-complete • Calculating SIP: #P-complete • The naïve method is very costly and infeasible!

  14. Query Processing Framework • Probabilistic subgraph query processing framework • Filter-and-verification: database {g1, g2, …, gn} and query q → Filtering → candidates {g′1, g′2, …, g′m} → Verification → answers {g″1, g″2, …, g″k}
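The filter-and-verification flow can be sketched as a small driver loop. This is a minimal illustration under assumptions: `can_prune`, `can_accept`, and `exact_sip` are hypothetical callables standing in for the filtering rules and the verification step, and the database is a plain dictionary from graph id to graph object.

```python
def filter_and_verify(database, query, epsilon, can_prune, can_accept, exact_sip):
    """Return the ids of graphs whose SIP with the query is at least epsilon.

    can_prune(graph, query, epsilon)  -> True if the graph is certainly not an answer
    can_accept(graph, query, epsilon) -> True if the graph is certainly an answer
    exact_sip(graph, query)           -> expensive verification, called only on
                                         candidates that the filters cannot decide
    """
    answers = []
    for graph_id, graph in database.items():
        if can_prune(graph, query, epsilon):
            continue  # filtered out without any verification
        if can_accept(graph, query, epsilon):
            answers.append(graph_id)  # accepted without verification
        elif exact_sip(graph, query) >= epsilon:
            answers.append(graph_id)  # survived verification
    return answers
```

The point of the design is that the expensive `exact_sip` call runs only on the candidates left undecided by the cheap filters.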

  15. Outline • Ⅰ Background • Ⅱ Problem Definition • Ⅲ Query Processing Framework • Ⅳ Solutions • Ⅴ Conclusions

  16. Solutions • Filtering: structural pruning • Principle: if we remove all the uncertainty from g and the resulting graph g^c still does not contain q, then the original uncertain graph cannot contain q. • Theorem: if q ⊄ g^c, then Pr(q ⊆ g) = 0
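Structural pruning can be sketched as a deterministic containment test on g^c. The code below is an illustration only: it assumes undirected, unlabeled graphs with uncertain edges, uses a brute-force containment check where a real system would use a subgraph-isomorphism algorithm or a feature index, and the names `contains` and `structurally_pruned` are hypothetical.

```python
from itertools import permutations


def contains(q_vertices, q_edges, g_vertices, g_edges):
    """Brute-force test: is q subgraph isomorphic to the deterministic graph g?"""
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_vertices, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in g_edge_set
               for u, v in q_edges):
            return True
    return False


def structurally_pruned(q_vertices, q_edges, g_vertices, edge_probs):
    """True if g can be discarded because even g^c does not contain q."""
    # g^c keeps every uncertain edge and simply ignores its probability.
    certain_edges = list(edge_probs)
    return not contains(q_vertices, q_edges, g_vertices, certain_edges)
```

If this returns True, no possible world of g contains q, so Pr(q ⊆ g) = 0 and g is dropped for any positive threshold.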

  17. Solutions • Probabilistic pruning: let f be a feature of g^c, i.e., f ⊆ g^c • Rule 1: if f ⊆ q and UpperB(Pr(f ⊆ g)) < ε, then g is pruned. ∵ f ⊆ q, ∴ Pr(q ⊆ g) ≤ Pr(f ⊆ g) < ε • (Figure: query and ε, uncertain graph, feature)

  18. Solutions • Rule 2: if q ⊆ f and LowerB(Pr(f ⊆ g)) ≥ ε, then g is an answer. ∵ q ⊆ f, ∴ Pr(q ⊆ g) ≥ Pr(f ⊆ g) ≥ ε • Two main issues for probabilistic pruning: • How to derive lower and upper bounds of SIP? • How to select features with great pruning power? • (Figure: query and ε, uncertain graph, feature)
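Rules 1 and 2 amount to a three-way decision per graph. The sketch below assumes the bounds for each indexed feature are already available; the record layout (`relation`, `lower`, `upper`) and the function name `probabilistic_pruning` are hypothetical choices for illustration.

```python
def probabilistic_pruning(features, epsilon):
    """Classify one uncertain graph g as 'pruned', 'answer', or 'candidate'.

    Each feature is a dict with:
      relation: 'sub'   if the feature f is a subgraph of the query q (Rule 1)
                'super' if the query q is a subgraph of the feature f (Rule 2)
      lower, upper: bounds on Pr(f ⊆ g)
    """
    # Rule 1: f ⊆ q implies Pr(q ⊆ g) <= Pr(f ⊆ g), so a small upper bound prunes g.
    for f in features:
        if f["relation"] == "sub" and f["upper"] < epsilon:
            return "pruned"
    # Rule 2: q ⊆ f implies Pr(q ⊆ g) >= Pr(f ⊆ g), so a large lower bound accepts g.
    for f in features:
        if f["relation"] == "super" and f["lower"] >= epsilon:
            return "answer"
    return "candidate"  # undecided: must go to verification
```

Only graphs classified as "candidate" need the expensive verification step.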

  19. Solutions • Technique 1: calculation of lower and upper bounds • Lemma: let Bf1, …, Bf|Ef| be all embeddings of f in g^c; then Pr(f ⊆ g) = Pr(Bf1 ∨ … ∨ Bf|Ef|). • UpperB(Pr(f ⊆ g)): (formula shown on the slide)

  20. Solutions • Technique 1: calculation of lower and upper bounds • LowerB(Pr(f ⊆ g)): (formula shown on the slide) • Tightest LowerB(f): reduces to computing the maximum weight clique of graph bG, which is NP-hard.
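The bound formulas themselves did not survive in the transcript, so the following is a generic stand-in rather than the authors' bounds. It assumes independent edge probabilities and represents each embedding as a set of edges. The upper bound is the plain union bound. The lower bound uses a greedily chosen set of pairwise edge-disjoint embeddings, whose events are independent under this assumption; the greedy choice replaces the NP-hard maximum weight clique computation mentioned on the slide. All names are hypothetical.

```python
from math import prod


def embedding_prob(embedding, edge_probs):
    """Probability that every edge of one embedding exists (independent edges)."""
    return prod(edge_probs[e] for e in embedding)


def upper_bound(embeddings, edge_probs):
    """Union bound: Pr(B1 ∨ ... ∨ Bn) <= min(1, sum of Pr(Bi))."""
    return min(1.0, sum(embedding_prob(b, edge_probs) for b in embeddings))


def lower_bound(embeddings, edge_probs):
    """Lower bound from a greedy set of pairwise edge-disjoint embeddings.

    Disjoint embeddings are independent events, so the probability that at
    least one of them exists is 1 - prod(1 - Pr(Bi)); the union over all
    embeddings can only be larger.
    """
    ranked = sorted(embeddings, key=lambda b: -embedding_prob(b, edge_probs))
    used_edges = set()
    miss_all = 1.0
    for b in ranked:
        if used_edges.isdisjoint(b):
            used_edges |= set(b)
            miss_all *= 1.0 - embedding_prob(b, edge_probs)
    return 1.0 - miss_all
```

For embeddings {e1, e2}, {e2, e3}, {e4} with probabilities 0.5, 0.5, 0.5, 0.2, the exact union probability is 0.5, and this sketch brackets it between 0.4 and 0.7.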

  21. Solutions • Technique 1: calculation of lower and upper bounds • Exact value vs. upper and lower bounds (Figure panels: value, computing time)

  22. Solutions • Technique 2: optimal feature selection • Indexing all features gives the index with the most pruning power, but such an index is very costly to query. We therefore select a small number of features with the greatest pruning power. • Cost model: • Max gain = sequential scan cost − index query cost • Maximum set coverage is NP-complete; we use a greedy algorithm to approximate it.

  23. Solutions • Technique 2: optimal feature selection • Maximum coverage: greedy algorithm • The greedy index approximates the optimal index within a factor of 1 − 1/e • (Figure: feature matrix, probabilistic index)
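The greedy maximum-coverage step can be sketched as follows. This illustrates the standard greedy algorithm, not the authors' cost model: here each feature is mapped to a hypothetical set of items it covers (for example, the graphs it can prune for a sample workload), and `greedy_max_coverage` is an assumed name.

```python
def greedy_max_coverage(coverage, k):
    """Pick up to k features, each time taking the one that covers the most
    items not yet covered. This is the classic greedy with a 1 - 1/e guarantee.

    coverage: dict mapping feature name -> set of covered items
    Returns (chosen feature names in pick order, set of all covered items).
    """
    chosen = []
    covered = set()
    for _ in range(k):
        remaining = [f for f in coverage if f not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda f: len(coverage[f] - covered))
        if not coverage[best] - covered:
            break  # no remaining feature adds anything new
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered
```

With coverage sets f1 = {1, 2, 3}, f2 = {3, 4}, f3 = {4, 5, 6, 7} and k = 2, the sketch picks f3 first and then f1, covering all seven items.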

  24. Solutions • Probabilistic Index • Construct a string for each feature • Construct a prefix tree for all feature strings • Construct an inverted list for all leaf nodes
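A prefix tree with inverted lists can be sketched as below. The details are assumptions for illustration: how a feature is encoded as a string is not given in the transcript, so the feature strings here are arbitrary, and each posting stores a graph id with lower and upper bounds on Pr(f ⊆ g). As a simplification, postings hang off the node where a feature string ends, which is a leaf only when no other feature string extends it. The class names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.postings = []   # inverted list: (graph_id, lower_bound, upper_bound)


class ProbabilisticIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, feature_string, graph_id, lower, upper):
        """Record that graph_id contains the feature, with bounds on its SIP."""
        node = self.root
        for ch in feature_string:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.append((graph_id, lower, upper))

    def lookup(self, feature_string):
        """Return the inverted list for a feature, or [] if it is not indexed."""
        node = self.root
        for ch in feature_string:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.postings
```

At query time, looking up each feature of q returns the graphs that contain it together with the bounds needed by the pruning rules.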

  25. Solutions • Verification: iterative bound pruning • Lemma: Pr(q ⊆ g) = Pr(Bq1 ∨ … ∨ Bq|Eq|) • Unfolding: (formula shown on the slide) • Based on the inclusion-exclusion principle: iterative bound pruning
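One way to read iterative bound pruning is through the Bonferroni inequalities: truncating the inclusion-exclusion sum after an odd number of levels gives an upper bound, and after an even number a lower bound, so verification can stop as soon as a bound decides the threshold test. The sketch below is an interpretation under assumptions, not the authors' algorithm: edges are independent, embeddings are edge sets, and the function name is hypothetical.

```python
from itertools import combinations
from math import prod


def verify_by_iterative_bounds(embeddings, edge_probs, epsilon):
    """Decide whether Pr(B1 ∨ ... ∨ Bn) >= epsilon, stopping early when possible.

    Returns (is_answer, levels_used), where levels_used is the number of
    inclusion-exclusion levels that had to be evaluated.
    """
    n = len(embeddings)
    partial = 0.0
    lower, upper = 0.0, 1.0
    for k in range(1, n + 1):
        # Level k: sum over all k-subsets of Pr(all k embeddings exist),
        # which is the product over the union of their edges.
        term = sum(
            prod(edge_probs[e] for e in frozenset().union(*combo))
            for combo in combinations(embeddings, k)
        )
        if k % 2 == 1:
            partial += term
            upper = min(upper, partial)   # odd truncation is an upper bound
        else:
            partial -= term
            lower = max(lower, partial)   # even truncation is a lower bound
        if lower >= epsilon:
            return True, k
        if upper < epsilon:
            return False, k
    return partial >= epsilon, n          # all levels used: partial is exact
```

For embeddings {e1, e2}, {e2, e3}, {e4} with probabilities 0.5, 0.5, 0.5, 0.2, a threshold of 0.8 is rejected after one level (upper bound 0.7) and a threshold of 0.3 is accepted after two levels (lower bound 0.475).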

  26. Solutions • Performance Evaluation • Real dataset: uncertain PPI • 1500 uncertain graphs • Average 332 vertices and 584 edges • Average probability: 0.367 • Synthetic dataset: AIDS dataset • Probabilities generated using a Gaussian distribution • 10k uncertain graphs • Average 24.3 vertices and 26.5 edges

  27. Solutions • Performance Evaluation • Results on real dataset

  28. Solutions • Performance Evaluation • Results on real dataset

  29. Solutions • Performance Evaluation • Response and construction time

  30. Solutions • Performance Evaluation • Results on synthetic dataset (Figure panels: mean, variance)

  31. Outline • Ⅰ Background • Ⅱ Problem Definition • Ⅲ Query Processing Framework • Ⅳ Solutions • Ⅴ Conclusions

  32. Conclusion • We propose the first efficient solution to answer threshold-based probabilistic subgraph search over uncertain graph databases. • We employ a filter-and-verification framework and develop probability bounds for filtering. • We design a cost model to select a minimum number of features with the largest pruning ability. • We demonstrate the effectiveness of our solution through experiments on real and synthetic data sets.

  33. Thanks!
