Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan1, Guoren Wang1, Haixun Wang2, Lei Chen3 1. Northeastern University, China 2. Microsoft Research Asia 3. HKUST
Outline Ⅰ Background Ⅱ Problem Definition Ⅲ Query Processing Framework Ⅳ Solutions Ⅴ Conclusions
Background • Graphs are a complex data structure used in many real applications. • Bioinformatics: gene regulatory networks, yeast PPI networks
Background • Chemical compounds: benzene rings, compound databases
Background • Social networks: Web 2.0, EntityCube
Background • In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs. • The STRING database (http://string-db.org) contains PPIs whose uncertain edges are derived from biological experiments. • In visual pattern recognition, uncertain graphs are used to model visual objects. • In social networks, uncertain links represent possible relationships or the strength of influence between people. • Therefore, it is important to study query processing on large uncertain graphs.
Outline Ⅰ Background Ⅱ Problem Definition Ⅲ Query Processing Framework Ⅳ Solutions Ⅴ Conclusions
Problem Definition • Probabilistic subgraph search • Uncertain graph: • Vertex uncertainty (existence probability) • Edge uncertainty (existence probability given its two endpoints)
Problem Definition • Probabilistic subgraph search • Possible worlds: the deterministic graphs obtained by instantiating every uncertain vertex and edge as present or absent (enumerated in the sketch below)
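To make the possible-world semantics concrete, here is a minimal Python sketch (not from the paper; the graph, names, and probabilities are illustrative) that enumerates the possible worlds of a tiny uncertain graph:

```python
from itertools import product

# Uncertain graph: vertex and edge existence probabilities (illustrative).
# An edge's probability is conditional on both endpoints existing.
vertices = {"A": 0.9, "B": 0.8}
edges = {("A", "B"): 0.5}

def possible_worlds(vertices, edges):
    """Yield (present_vertices, present_edges, probability) triples."""
    vnames = list(vertices)
    for vmask in product([True, False], repeat=len(vnames)):
        present = {v for v, keep in zip(vnames, vmask) if keep}
        p = 1.0
        for v, keep in zip(vnames, vmask):
            p *= vertices[v] if keep else 1.0 - vertices[v]
        # Only edges whose endpoints both survive can be instantiated.
        live = [e for e in edges if e[0] in present and e[1] in present]
        for emask in product([True, False], repeat=len(live)):
            w, world_edges = p, set()
            for e, keep in zip(live, emask):
                if keep:
                    w *= edges[e]
                    world_edges.add(e)
                else:
                    w *= 1.0 - edges[e]
            yield present, world_edges, w

# The worlds form a probability distribution: their probabilities sum to 1.
assert abs(sum(w for _, _, w in possible_worlds(vertices, edges)) - 1.0) < 1e-9
```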
Problem Definition • Probabilistic subgraph search • Given: an uncertain graph database G = {g1,g2,…,gn}, a query graph q, and a probability threshold ε • Query: find all gi ∈ G such that the subgraph isomorphism probability between q and gi is not smaller than ε • Subgraph isomorphism probability (SIP): the SIP between q and gi is the sum of the probabilities of gi's possible worlds to which q is subgraph isomorphic
Problem Definition • Probabilistic subgraph search • Subgraph isomorphism probability (SIP): [figure: the possible worlds of g to which q is subgraph isomorphic; their probabilities sum to Pr(q⊆g) = 0.27] • It is #P-complete to calculate the SIP (the brute-force sketch below shows why enumeration blows up)
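As an illustration of the cost, this sketch computes the SIP by brute force, reusing the possible_worlds helper and the example graph from the sketch above; networkx's GraphMatcher supplies a (node-induced) subgraph isomorphism test. All names are illustrative, not the paper's code:

```python
import networkx as nx
from networkx.algorithms import isomorphism

def sip(q, vertices, edges):
    """Sum the probabilities of the possible worlds containing q."""
    prob = 0.0
    for present, world_edges, w in possible_worlds(vertices, edges):
        world = nx.Graph()
        world.add_nodes_from(present)
        world.add_edges_from(world_edges)
        # True iff q is isomorphic to a (node-induced) subgraph of world.
        if isomorphism.GraphMatcher(world, q).subgraph_is_isomorphic():
            prob += w
    return prob

q = nx.Graph([("x", "y")])      # query: a single edge
print(sip(q, vertices, edges))  # 0.9 * 0.8 * 0.5 = 0.36
```

The number of worlds is exponential in the number of uncertain vertices and edges, which is the source of the #P-completeness mentioned above.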
Outline Ⅰ Query Processing Framework Problem Definition Conclusions Solutions Background Ⅱ Ⅲ Ⅳ V
Query Processing Framework • Probabilistic subgraph query processing framework • Naïve method: sequentially scan the database G and decide, for each gi, whether the SIP between q and gi is not smaller than the threshold ε • Testing whether g1 is subgraph isomorphic to g2: NP-complete • Calculating the SIP: #P-complete • The naïve method is therefore very costly and infeasible!
Query Processing Framework • Probabilistic subgraph query processing framework • Filter-and-verification (schematized below): Query q → Filtering → Candidates {g'1,g'2,…,g'm} ⊆ {g1,g2,…,gn} → Verification → Answers {g"1,g"2,…,g"k}
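A schematic of the framework's control flow might look as follows; upper_bound, lower_bound, and exact_sip are hypothetical callables standing in for the paper's index-derived bounds and verification step, not its actual API:

```python
def query(database, q, eps, upper_bound, lower_bound, exact_sip):
    """Filter-and-verification skeleton for a threshold query."""
    answers = []
    for g in database:
        if upper_bound(q, g) < eps:
            continue                  # filtered: SIP cannot reach eps
        if lower_bound(q, g) >= eps:
            answers.append(g)         # accepted without exact computation
            continue
        if exact_sip(q, g) >= eps:    # expensive #P-hard verification
            answers.append(g)
    return answers
```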
Outline Ⅰ Query Processing Framework Problem Definition Conclusions Solutions Background Ⅱ Ⅲ Ⅳ V
Solutions • Filtering: structural pruning • Principle: if we remove all the uncertainty from g and the resulting certain graph gc still does not contain q, then the original uncertain graph cannot contain q. • Theorem: if q ⊄ gc (q is not subgraph isomorphic to gc), then Pr(q⊆g) = 0
Solutions • Probabilistic pruning: let f be a feature of gc, i.e., f ⊆ gc • Rule 1: if f ⊆ q and UpperB(Pr(f⊆g)) < ε, then g is pruned. ∵ f ⊆ q, ∴ Pr(q⊆g) ≤ Pr(f⊆g) ≤ UpperB(Pr(f⊆g)) < ε • [figure: query q, uncertain graph g, and feature f]
Solutions • Rule 2: if q ⊆ f and LowerB(Pr(f⊆g)) ≥ ε, then g is an answer (both rules are applied in the sketch below). ∵ q ⊆ f, ∴ Pr(q⊆g) ≥ Pr(f⊆g) ≥ LowerB(Pr(f⊆g)) ≥ ε • Two main issues for probabilistic pruning: • How to derive lower and upper bounds of the SIP? • How to select features with great pruning power?
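The two rules could be applied per database graph roughly as in this sketch; subgraph_iso, upperb, and lowerb are hypothetical callables (in the paper, the bounds come precomputed from the index):

```python
def apply_rules(q, g, features, eps, subgraph_iso, upperb, lowerb):
    """Classify g as 'pruned', 'answer', or 'candidate' using features."""
    for f in features:
        # Rule 1: f ⊆ q implies Pr(q⊆g) <= Pr(f⊆g) <= upperb(f, g).
        if subgraph_iso(f, q) and upperb(f, g) < eps:
            return "pruned"
        # Rule 2: q ⊆ f implies Pr(q⊆g) >= Pr(f⊆g) >= lowerb(f, g).
        if subgraph_iso(q, f) and lowerb(f, g) >= eps:
            return "answer"
    return "candidate"   # undecided: must be verified exactly
```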
Solutions • Technique 1: calculation of lower and upper bounds • Lemma: let Bf1, …, Bf|Ef| be all embeddings of f in gc; then Pr(f⊆g) = Pr(Bf1 ∨ … ∨ Bf|Ef|) (computed exactly in the sketch below) • UpperB(Pr(f⊆g)): [formula given on the slide]
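The lemma's union probability can, in principle, be computed exactly by inclusion-exclusion over the embeddings, as in this sketch; it is exponential in the number of embeddings, which is precisely why cheaper bounds are needed. Element names and probabilities are illustrative:

```python
from itertools import combinations

def pr_union(embeddings, elem_prob):
    """Pr(B1 ∨ … ∨ Bk), where Bi = 'all elements of embeddings[i] exist'."""
    total = 0.0
    for r in range(1, len(embeddings) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(embeddings, r):
            elems = set().union(*subset)  # independent vertices/edges
            p = 1.0
            for e in elems:
                p *= elem_prob[e]
            total += sign * p
    return total

# Two overlapping embeddings sharing element "e2":
emb = [{"e1", "e2"}, {"e2", "e3"}]
pr = {"e1": 0.5, "e2": 0.8, "e3": 0.4}
print(pr_union(emb, pr))  # 0.40 + 0.32 - 0.16 = 0.56
```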
Solutions • Technique 1: calculation of lower and upper bounds • LowerB(Pr(f⊆g)): [formula given on the slide] • Finding the tightest LowerB(f) converts into computing the maximum-weight clique of a graph bG, which is NP-hard
Solutions • Technique 1: calculation of lower and upper bounds • [charts: exact value vs. upper and lower bounds, comparing bound values and computing time]
Solutions • Technique 2: optimal feature selection • Indexing all features would give the greatest pruning power, but querying such an index is very costly. We therefore select a small number of features with the greatest pruning power. • Cost model: max gain = sequential-scan cost − query-index cost • This is a maximum set coverage problem (NP-complete); we use the greedy algorithm to approximate it.
Solutions • Technique 2: optimal feature selection • Maximum coverage: the greedy algorithm (sketched below) approximates the optimal index within a factor of 1 − 1/e • [figure: feature matrix → probabilistic index]
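The standard greedy algorithm for maximum coverage, which this selection step relies on, can be sketched as follows; coverage is a hypothetical map from each feature to the set of items it can prune:

```python
def greedy_select(coverage, k):
    """Pick up to k features, maximizing newly covered items per step."""
    selected, covered = [], set()
    for _ in range(k):
        best = max(coverage, key=lambda f: len(coverage[f] - covered),
                   default=None)
        if best is None or not (coverage[best] - covered):
            break                    # no feature covers anything new
        selected.append(best)
        covered |= coverage[best]
    return selected                  # within 1 - 1/e of optimal coverage
```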
Solutions • Probabilistic index • Construct a string for each feature • Construct a prefix tree over all feature strings • Construct an inverted list for each leaf node (sketched below)
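A minimal sketch of such an index structure; the feature-string encoding, names, and stored bounds are illustrative, not the paper's:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.inverted = []   # (graph id, lower bound, upper bound) entries

def insert(root, feature_string, graph_id, lowerb, upperb):
    """Walk/extend the prefix tree and record the graph at the leaf."""
    node = root
    for ch in feature_string:
        node = node.children.setdefault(ch, TrieNode())
    node.inverted.append((graph_id, lowerb, upperb))

root = TrieNode()
insert(root, "C-C-C", 1, 0.2, 0.7)  # illustrative string and bounds
```

Sharing prefixes lets many features be matched against q in one traversal, and the inverted lists supply the per-graph bounds used by the pruning rules.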
Solutions • Verification: iterative bound pruning • Lemma: Pr(q⊆g) = Pr(Bq1 ∨ … ∨ Bq|Eq|), where Bq1, …, Bq|Eq| are all embeddings of q in gc • Unfolding this union with the inclusion-exclusion principle yields partial sums that alternately upper- and lower-bound Pr(q⊆g); verification stops as soon as one of these bounds decides the comparison with ε (iterative bound pruning, sketched below)
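One way to realize iterative bound pruning, assuming the truncated inclusion-exclusion partial sums serve as the alternating bounds (the Bonferroni inequalities): stop as soon as a partial sum settles the comparison with ε. A sketch, with inputs shaped as in the earlier union-probability example:

```python
from itertools import combinations

def verify(embeddings, elem_prob, eps):
    """Decide Pr(union) >= eps, stopping early via Bonferroni bounds."""
    partial = 0.0
    for r in range(1, len(embeddings) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(embeddings, r):
            p = 1.0
            for e in set().union(*subset):
                p *= elem_prob[e]
            partial += sign * p
        if r % 2 == 1 and partial < eps:
            return False   # odd level: partial is an upper bound
        if r % 2 == 0 and partial >= eps:
            return True    # even level: partial is a lower bound
    return partial >= eps  # full expansion: the exact probability
```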
Solutions • Performance Evaluation • Real dataset: uncertain PPI • 1,500 uncertain graphs • Average 332 vertices and 584 edges • Average probability: 0.367 • Synthetic dataset: AIDS dataset • Probabilities generated from a Gaussian distribution • 10k uncertain graphs • Average 24.3 vertices and 26.5 edges
Solutions • Performance Evaluation • Results on the real dataset [charts]
Solutions • Performance Evaluation • Index response time and construction time [charts]
Solutions • Performance Evaluation • Results on the synthetic dataset, varying the mean and variance of the generated probabilities [charts]
Outline Ⅰ Background Ⅱ Problem Definition Ⅲ Query Processing Framework Ⅳ Solutions Ⅴ Conclusions
Conclusions • We propose the first efficient solution for threshold-based probabilistic subgraph search over uncertain graph databases. • We employ a filter-and-verification framework and develop probability bounds for filtering. • We design a cost model to select a minimum number of features with the greatest pruning power. • We demonstrate the effectiveness of our solution through experiments on real and synthetic datasets.