Ye Yuan 1 , Guoren Wang 1 , Haixun Wang 2 , Lei Chen 3

Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan1, Guoren Wang1, Haixun Wang2, Lei Chen3 1. Northeastern University, China 2. Microsoft Research Asia 3. HKUST

Outline Ⅰ Background Ⅱ Problem Definition Ⅲ Query Processing Framework Ⅳ Solutions Ⅴ Conclusions

Background • Graphs are a complex data structure used in many real applications. • Bioinformatics: gene regulatory networks, yeast PPI networks

Background • Chemical compounds (figure: benzene ring, compound database)

Background • Social networks (figure: Web 2.0, EntityCube)

Background • In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs. • The STRING database (http://string-db.org) is a data source that contains PPIs with uncertain edges provided by biological experiments. • In visual pattern recognition, uncertain graphs are used to model visual objects. • In social networks, uncertain links are used to represent possible relationships or the strength of influence between people. • Therefore, it is important to study query processing on large uncertain graphs.

Outline Ⅰ Background Ⅱ Problem Definition Ⅲ Query Processing Framework Ⅳ Solutions Ⅴ Conclusions

Problem Definition • Probabilistic subgraph search • Uncertain graph: • Vertex uncertainty (existence probability) • Edge uncertainty (existence probability given its two endpoints)

Problem Definition • Probabilistic subgraph search • Possible worlds: each possible world is one combination of present and absent uncertain edges and vertices

Problem Definition • Probabilistic subgraph search • Given: an uncertain graph database G={g1,g2,…,gn}, a query graph q, and a probability threshold ε • Query: find all gi ∈ G such that the subgraph isomorphic probability between q and gi is not smaller than ε. • Subgraph isomorphic probability (SIP): the SIP between q and gi is the sum of the probabilities of gi’s possible worlds to which q is subgraph isomorphic
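The SIP definition above can be made concrete on a toy graph. The following is a minimal sketch, not the authors' method: it assumes unlabeled graphs, certain vertices, and mutually independent edge probabilities, all of which are simplifications. The function names `contains` and `sip_by_enumeration` and the triangle data are hypothetical.

```python
from itertools import permutations, product


def contains(query_vertices, query_edges, world_vertices, world_edges):
    """Brute-force test: is the query subgraph isomorphic to this possible world?"""
    present = {frozenset(e) for e in world_edges}
    for image in permutations(world_vertices, len(query_vertices)):
        mapping = dict(zip(query_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in present for u, v in query_edges):
            return True
    return False


def sip_by_enumeration(vertices, edge_probs, query_vertices, query_edges):
    """Sum the probabilities of all possible worlds that contain the query.

    Assumption (not from the slides): edges exist independently of each other.
    """
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        world_prob = 1.0
        world_edges = []
        for edge, keep in zip(edges, mask):
            world_prob *= edge_probs[edge] if keep else 1.0 - edge_probs[edge]
            if keep:
                world_edges.append(edge)
        if contains(query_vertices, query_edges, vertices, world_edges):
            total += world_prob
    return total


# Hypothetical toy data: a triangle whose three edges each exist with probability 0.5.
triangle = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
path_sip = sip_by_enumeration(
    ["a", "b", "c"], triangle, ["x", "y", "z"], [("x", "y"), ("y", "z")]
)
```

The loop visits 2^|E| possible worlds, so this sketch is only usable on very small graphs.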

Problem Definition • Probabilistic subgraph search • Subgraph isomorphic probability (SIP), example: summing the probabilities of the possible worlds of g that contain q gives SIP = 0.27 • It is #P-complete to calculate SIP

Outline Ⅰ Background Ⅱ Problem Definition Ⅲ Query Processing Framework Ⅳ Solutions Ⅴ Conclusions

Query Processing Framework • Probabilistic subgraph query processing framework • Naïve method: sequentially scan the database and, for each gi, decide whether the SIP between q and gi is not smaller than the threshold ε. • Deciding whether g1 is subgraph isomorphic to g2: NP-complete • Calculating SIP: #P-complete • The naïve method is very costly and infeasible.

Query Processing Framework • Probabilistic subgraph query processing framework • Filter-and-verification: database {g1,g2,..,gn} and query q → Filtering → candidates {g’1,g’2,..,g’m} → Verification → answers {g”1,g”2,..,g”k}
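The control flow of the filter-and-verification framework can be sketched as follows. This is an illustrative skeleton only: `may_qualify` (a cheap filter) and `exact_sip` (an expensive verifier) are hypothetical callables supplied by the caller, and the toy numbers are made up for the example.

```python
def filter_and_verify(database, query, epsilon, may_qualify, exact_sip):
    """Return (candidates, answers) for a threshold query.

    may_qualify: cheap test that must never reject a true answer.
    exact_sip:   expensive computation, run only on surviving candidates.
    """
    candidates = [g for g in database if may_qualify(g, query, epsilon)]
    answers = [g for g in candidates if exact_sip(g, query) >= epsilon]
    return candidates, answers


# Hypothetical toy values standing in for real bounds and exact probabilities.
toy_upper = {"g1": 0.9, "g2": 0.2, "g3": 0.7}
toy_exact = {"g1": 0.8, "g2": 0.1, "g3": 0.4}

candidates, answers = filter_and_verify(
    ["g1", "g2", "g3"],
    "q",
    0.5,
    may_qualify=lambda g, q, eps: toy_upper[g] >= eps,
    exact_sip=lambda g, q: toy_exact[g],
)
```

In this toy run the verifier is called for two graphs instead of three, which is the point of the filtering step.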

Solutions • Filtering: structural pruning • Principle: if we remove all the uncertainty from g and the resulting graph gc still does not contain q, then the original uncertain graph cannot contain q. • Theorem: if q ⊈ gc, then Pr(q ⊆ g) = 0

Solutions • Probabilistic pruning: let f be a feature of gc, i.e., f ⊆ gc • Rule 1: if f ⊆ q and UpperB(Pr(f ⊆ g)) < ε, then g is pruned. ∵ f ⊆ q, ∴ Pr(q ⊆ g) ≤ Pr(f ⊆ g) < ε (figure: query q with threshold ε, uncertain graph g, feature f)

Solutions • Rule 2: if q ⊆ f and LowerB(Pr(f ⊆ g)) ≥ ε, then g is an answer. ∵ q ⊆ f, ∴ Pr(q ⊆ g) ≥ Pr(f ⊆ g) ≥ ε • Two main issues for probabilistic pruning: • How to derive lower and upper bounds of SIP? • How to select features with great pruning power? (figure: query q with threshold ε, uncertain graph g, feature f)
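The two pruning rules amount to a three-way decision per uncertain graph. A minimal sketch, assuming each indexed feature of g is summarized as a hypothetical tuple `(relation, lower, upper)`, where `relation` is `"sub"` when the feature is a subgraph of the query and `"super"` when the query is a subgraph of the feature:

```python
def apply_pruning_rules(feature_bounds, epsilon):
    """Classify one uncertain graph as 'pruned', 'answer', or 'candidate'.

    feature_bounds: iterable of (relation, lower, upper) tuples (hypothetical
    layout), where lower <= Pr(feature in g) <= upper.
    """
    for relation, lower, upper in feature_bounds:
        # Rule 1: f is contained in q, so Pr(q in g) <= Pr(f in g) <= upper.
        if relation == "sub" and upper < epsilon:
            return "pruned"
        # Rule 2: q is contained in f, so Pr(q in g) >= Pr(f in g) >= lower.
        if relation == "super" and lower >= epsilon:
            return "answer"
    # Neither rule fired: the graph must go through verification.
    return "candidate"


decision_a = apply_pruning_rules([("sub", 0.1, 0.3)], 0.5)
decision_b = apply_pruning_rules([("super", 0.6, 0.9)], 0.5)
decision_c = apply_pruning_rules([("sub", 0.2, 0.7), ("super", 0.1, 0.4)], 0.5)
```

Only graphs classified as `"candidate"` need the expensive exact computation.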

Solutions • Technique 1: calculation of lower and upper bounds • Lemma: let Bf1,…,Bf|Ef| be all embeddings of f in gc; then Pr(f ⊆ g) = Pr(Bf1 ∨ … ∨ Bf|Ef|). • UpperB(Pr(f ⊆ g)): [formula not preserved]

Solutions • Technique 1: calculation of lower and upper bounds • LowerB(Pr(f ⊆ g)): [formula not preserved] • Tightest LowerB(f): reduces to computing the maximum-weight clique of graph bG, which is NP-hard.
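One elementary way to obtain a lower bound on the probability that at least one embedding exists is to restrict attention to pairwise edge-disjoint embeddings. The sketch below is not the formula from the slide, which is not preserved here. It assumes independent edge probabilities, represents each embedding as a frozenset of edge identifiers, and picks disjoint embeddings greedily instead of solving the maximum-weight clique problem. All names are hypothetical.

```python
from math import prod


def embedding_prob(embedding, edge_probs):
    """Probability that every edge of one embedding exists (independent edges assumed)."""
    return prod(edge_probs[e] for e in embedding)


def greedy_disjoint_lower_bound(embeddings, edge_probs):
    """Lower-bound Pr(at least one embedding exists) using edge-disjoint embeddings.

    Edge-disjoint embeddings are independent events under the independence
    assumption, so their union probability is 1 - prod(1 - p_i). Dropping the
    remaining embeddings can only decrease the union probability.
    """
    chosen, used_edges = [], set()
    ranked = sorted(embeddings, key=lambda emb: embedding_prob(emb, edge_probs), reverse=True)
    for emb in ranked:
        if used_edges.isdisjoint(emb):
            chosen.append(emb)
            used_edges.update(emb)
    return 1.0 - prod(1.0 - embedding_prob(emb, edge_probs) for emb in chosen)


# Hypothetical toy data: three overlapping embeddings over four edges.
toy_edge_probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5, "e4": 0.5}
toy_embeddings = [
    frozenset({"e1", "e2"}),
    frozenset({"e2", "e3"}),
    frozenset({"e3", "e4"}),
]
lower = greedy_disjoint_lower_bound(toy_embeddings, toy_edge_probs)
```

On this toy input the exact union probability is 0.5, so the greedy bound is valid but not tight. Choosing the best disjoint set is where the maximum-weight clique formulation comes in.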

Solutions • Technique 1: calculation of lower and upper bounds • Exact value vs. upper and lower bounds (figure: value and computing time)

Solutions • Technique 2: optimal feature selection • If we index all features, the index has the most pruning power, but querying such an index is also very costly. Thus we would like to select a small number of features with the greatest pruning power. • Cost model: • Max gain = sequential scan cost - query index cost • Maximum set coverage: NP-complete; use a greedy algorithm to approximate it.

Solutions • Technique 2: optimal feature selection • Maximum coverage: greedy algorithm • Approximates the optimal index within a factor of 1 - 1/e (figure: feature matrix, probabilistic index)
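The greedy approximation for maximum coverage can be sketched as below. The mapping from each feature to the set of items it can prune is a hypothetical stand-in for the feature matrix. The slides do not say what the matrix entries are, so the item identifiers here are purely illustrative.

```python
def greedy_max_coverage(feature_cover, k):
    """Pick up to k features, each time taking the one that covers the most new items.

    feature_cover: dict mapping a feature name to the set of items it covers
    (hypothetical representation of the feature matrix).
    """
    remaining = dict(feature_cover)
    covered, chosen = set(), []
    for _ in range(k):
        if not remaining:
            break
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no feature adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical toy feature matrix.
toy_cover = {
    "f1": {1, 2, 3, 4},
    "f2": {3, 4, 5},
    "f3": {5, 6},
    "f4": {1},
}
chosen, covered = greedy_max_coverage(toy_cover, 2)
```

Each round re-evaluates the marginal gain, which is what gives the greedy strategy its 1 - 1/e guarantee for maximum coverage.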

Solutions • Probabilistic Index • Construct a string for each feature • Construct a prefix tree for all feature strings • Construct an inverted list for each leaf node
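The three construction steps can be sketched with a small prefix tree whose terminal nodes carry inverted lists. This is an assumption-laden illustration: the feature strings are taken as given (the slides do not specify the encoding), and each inverted list is modeled as a hypothetical dict from graph id to a `(lower, upper)` bound pair.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.postings = None  # inverted list, set only where a feature string ends


class FeatureIndex:
    """Prefix tree over feature strings with an inverted list per feature."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, code, postings):
        node = self.root
        for symbol in code:
            node = node.children.setdefault(symbol, TrieNode())
        node.postings = dict(postings)

    def lookup(self, code):
        node = self.root
        for symbol in code:
            node = node.children.get(symbol)
            if node is None:
                return {}
        return node.postings or {}


# Hypothetical feature strings and bounds.
index = FeatureIndex()
index.insert("ab", {"g1": (0.2, 0.5)})
index.insert("abc", {"g2": (0.1, 0.3)})
```

A lookup walks one path from the root, so features that share a prefix share the corresponding nodes.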

Solutions • Verification: iterative bound pruning • Lemma: Pr(q ⊆ g) = Pr(Bq1 ∨ … ∨ Bq|Eq|) • Unfolding: [formula not preserved] • Based on the inclusion-exclusion principle • Iterative bound pruning
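The idea of tightening bounds term by term with the inclusion-exclusion principle can be illustrated with the classical Bonferroni inequalities: truncating the expansion after an odd number of terms gives an upper bound, and after an even number of terms a lower bound. The sketch below is not the authors' exact procedure. It assumes independent edges and embeddings given as frozensets of edge identifiers, and all names are hypothetical.

```python
from itertools import combinations
from math import prod


def decide_by_inclusion_exclusion(embeddings, edge_probs, epsilon):
    """Decide whether Pr(at least one embedding exists) >= epsilon.

    Returns (decision, terms_used). Stops as soon as a truncated sum settles
    the comparison with epsilon.
    """
    partial = 0.0
    for k in range(1, len(embeddings) + 1):
        # k-th inclusion-exclusion term: all k-wise intersections of embeddings.
        term = sum(
            prod(edge_probs[e] for e in frozenset().union(*combo))
            for combo in combinations(embeddings, k)
        )
        partial += term if k % 2 == 1 else -term
        if k % 2 == 1 and partial < epsilon:
            return False, k  # upper bound already below the threshold
        if k % 2 == 0 and partial >= epsilon:
            return True, k  # lower bound already reaches the threshold
    return partial >= epsilon, len(embeddings)


# Hypothetical toy data: exact probability of the union is 0.5.
ie_edge_probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5, "e4": 0.5}
ie_embeddings = [
    frozenset({"e1", "e2"}),
    frozenset({"e2", "e3"}),
    frozenset({"e3", "e4"}),
]
high = decide_by_inclusion_exclusion(ie_embeddings, ie_edge_probs, 0.8)
low = decide_by_inclusion_exclusion(ie_embeddings, ie_edge_probs, 0.4)
close = decide_by_inclusion_exclusion(ie_embeddings, ie_edge_probs, 0.45)
```

Thresholds far from the true probability are settled after one or two terms. Only thresholds close to it force the full expansion.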

Solutions • Performance Evaluation • Real dataset: uncertain PPI • 1,500 uncertain graphs • Average 332 vertices and 584 edges • Average probability: 0.367 • Synthetic dataset: AIDS dataset • Generate probabilities using a Gaussian distribution • 10k uncertain graphs • Average 24.3 vertices and 26.5 edges

Solutions • Performance Evaluation • Results on real dataset

Solutions • Performance Evaluation • Response and construction time

Solutions • Performance Evaluation • Results on synthetic dataset (figure panels: mean, variance)

Conclusion • We propose the first efficient solution for answering threshold-based probabilistic subgraph search over uncertain graph databases. • We employ a filter-and-verification framework and develop probability bounds for filtering. • We design a cost model to select a minimum number of features with the largest pruning ability. • We demonstrate the effectiveness of our solution through experiments on real and synthetic data sets.

Thanks!
