Distance-Constraint Reachability Computation in Uncertain Graphs Ruoming Jin, Lin Liu (Kent State University), Bolin Ding (UIUC), Haixun Wang (MSRA)
Why Uncertain Graphs? • Increasing importance of graph/network data: social networks, biological networks, traffic/transportation networks, peer-to-peer networks • The probabilistic perspective is getting more and more attention; uncertainty is ubiquitous! • Examples: protein-protein interaction networks (false positive rate > 45%), social networks with probabilistic trust/influence models
Uncertain Graph Model • Edge independence: each edge exists independently with its own existence probability • Possible worlds: 2^|E| of them, e.g., G1, G2, … • Weight of possible world G2: Pr(G2) = 0.5 * 0.7 * 0.2 * 0.6 * (1-0.5) * (1-0.4) * (1-0.9) * (1-0.1) * (1-0.3) = 0.0007938
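To make the possible-world weight concrete, here is a minimal Python sketch under the edge-independence assumption. The edge names e1 … e9 are placeholders of ours; the nine probabilities are the values appearing in the slide's Pr(G2) example.

```python
def world_probability(edge_probs, included):
    """Pr(G') = product of p(e) over kept edges times (1 - p(e)) over dropped edges."""
    pr = 1.0
    for e, p in edge_probs.items():
        pr *= p if e in included else (1.0 - p)
    return pr

# Hypothetical edge names; probabilities taken from the slide's example.
edge_probs = {"e1": 0.5, "e2": 0.7, "e3": 0.2, "e4": 0.6,
              "e5": 0.5, "e6": 0.4, "e7": 0.9, "e8": 0.1, "e9": 0.3}
included = {"e1", "e2", "e3", "e4"}              # the remaining five edges are absent
print(world_probability(edge_probs, included))   # ~0.0007938, as in the slide
```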
Distance-Constraint Reachability (DCR) Problem • Given a distance constraint d and two vertices s (source) and t (target), what is the probability that s can reach t within distance d? • A generalization of the two-terminal network reliability problem, which has no distance constraint.
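A hedged formalization of this quantity in our own notation (the paper's symbols may differ): the DCR probability is the total probability of the possible worlds in which s reaches t within d.

```latex
R^{d}_{s,t}(\mathcal{G}) \;=\; \sum_{G \sqsubseteq \mathcal{G}} I^{d}_{s,t}(G)\,\Pr(G),
\qquad
I^{d}_{s,t}(G) \;=\;
\begin{cases}
  1 & \text{if } \operatorname{dist}_{G}(s,t) \le d,\\
  0 & \text{otherwise,}
\end{cases}
```

where G ranges over the possible worlds of the uncertain graph 𝒢.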
Important Applications • Peer-to-Peer (P2P) Networks: communication happens only when the distance between nodes is limited • Social Networks: trust/influence can be propagated only through a small number of hops • Traffic Networks: travel distance (travel time) queries, e.g., what is the probability that we can reach the airport within one hour?
Example: Exact Computation • d = 2: what is the probability that s can reach t within distance 2? • First step: enumerate all 2^9 possible worlds G1, G2, G3, G4, … with probabilities Pr(G1), Pr(G2), Pr(G3), Pr(G4), … • Second step: check each world for distance-constraint connectivity and sum the weighted indicators: … + Pr(G1) * 0 + Pr(G2) * 1 + Pr(G3) * 0 + Pr(G4) * 1 + …
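The two steps can be sketched directly in Python. Everything below is illustrative: the 4-edge toy graph, its probabilities, and the helper names (EDGES, within_distance, exact_dcr) are ours, not the paper's; the slide's actual example graph has 9 edges, hence 2^9 worlds.

```python
from collections import deque
from itertools import product

# Toy uncertain graph: undirected edge -> existence probability (illustrative values).
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.7, ("s", "b"): 0.6, ("b", "t"): 0.4}

def within_distance(present, s, t, d):
    """BFS in one possible world; True iff s reaches t within d hops."""
    adj = {}
    for u, v in present:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist, queue = {s: 0}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:            # do not expand beyond the distance constraint
            continue
        for w in adj.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return False

def exact_dcr(edges, s, t, d):
    """Step 1: enumerate all 2^|E| worlds; step 2: sum Pr(G_i) * I(G_i)."""
    edge_list, reach = list(edges), 0.0
    for bits in product([True, False], repeat=len(edge_list)):
        pr, present = 1.0, []
        for e, keep in zip(edge_list, bits):
            pr *= edges[e] if keep else (1.0 - edges[e])
            if keep:
                present.append(e)
        if within_distance(present, s, t, d):    # the indicator in the slide's sum
            reach += pr
    return reach

print(exact_dcr(EDGES, "s", "t", d=2))   # 0.506 = 1 - (1 - 0.35) * (1 - 0.24)
```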
Approximating Distance-Constraint Reachability Computation • Hardness: two-terminal network reliability is #P-complete, and DCR is a generalization of it • Our goal is to approximate DCR through sampling, with an unbiased estimator, minimal variance, and low computational cost
Direct Sampling Approach • Sampling process: sample n graphs, each one by including every edge independently according to its existence probability
Direct Sampling Approach (Cont’) • Estimator: the average of the indicator function over the sampled graphs, where I(Gi) = 1 if s reaches t within d in Gi, and 0 otherwise • Unbiased • Variance: R(1-R)/n, where R is the true reachability probability
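A sketch of the direct sampling estimator under these definitions; the function and parameter names are hypothetical, and within_distance is the BFS helper sketched after the exact-computation slide.

```python
import random

def direct_sampling_estimate(edges, s, t, d, n, within_distance, seed=0):
    """Sample n possible worlds; return the fraction in which s reaches t within d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # One independent coin per edge realizes a possible world G_i.
        present = [e for e, p in edges.items() if rng.random() < p]
        hits += within_distance(present, s, t, d)   # indicator I(G_i)
    # Unbiased; the variance of this mean of n Bernoulli(R) indicators is R(1-R)/n.
    return hits / n
```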
Path-Based Approach • Generate the path set: enumerate all paths from s to t with length ≤ d • Enumeration methods, e.g., DFS
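One possible enumeration routine: a DFS that lists every simple s-t path using at most d edges. The adjacency list and names below are illustrative, not the paper's.

```python
def enumerate_paths(adj, s, t, d):
    """Return all simple paths from s to t with at most d edges.
    adj: dict mapping each vertex to its list of neighbors."""
    paths = []

    def dfs(u, path):
        if u == t:
            paths.append(list(path))
            return
        if len(path) - 1 == d:          # already used d edges; cannot extend
            return
        for w in adj.get(u, []):
            if w not in path:           # keep the path simple
                path.append(w)
                dfs(w, path)
                path.pop()

    dfs(s, [s])
    return paths

adj = {"s": ["a", "b"], "a": ["s", "t"], "b": ["s", "t"], "t": ["a", "b"]}
print(enumerate_paths(adj, "s", "t", d=2))   # [['s', 'a', 't'], ['s', 'b', 't']]
```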
Path-Based Approach (Cont’) • Reachability over the path set: computed exactly by the inclusion-exclusion principle, or approximated by the Monte-Carlo algorithm of R. M. Karp and M. G. Luby • Unbiased • Variance
Divide-and-Conquer Methodology • Example: build an enumeration tree that branches on one edge at a time, e.g., +(s,a) vs. -(s,a), then +(s,b) vs. -(s,b), then +(a,t) vs. -(a,t), …
Divide and Conquer (Cont’) • The root is the set of all possible worlds; conditioning on e1 splits it into the graphs containing e1 and the graphs not containing e1 • Summary: the # of leaf nodes is smaller than 2^|E| • Each possible world belongs to exactly one leaf node • Reachability is the sum of the weights of the blue leaf nodes (where s can reach t); in the red leaf nodes s cannot reach t • The leaf nodes form a nice sample space
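A hedged sketch of the divide-and-conquer computation implied by this tree: condition on one uncertain edge e at a time, so that R = p(e) * R(e present) + (1 - p(e)) * R(e absent), and stop a branch as soon as its reachability is decided (which is why there are fewer than 2^|E| leaves). within_distance is again the assumed BFS helper from the earlier sketch.

```python
def exact_dcr_recursive(edges, s, t, d, within_distance,
                        present=(), undecided=None):
    """Exact DCR of the subspace where edges in `present` exist, edges in
    `undecided` are still uncertain, and all other edges are absent."""
    if undecided is None:
        undecided = list(edges)
    present = list(present)
    # Leaf test 1: already reachable using only the edges fixed as present.
    if within_distance(present, s, t, d):
        return 1.0
    # Leaf test 2: unreachable even if every still-undecided edge were present.
    if not within_distance(present + undecided, s, t, d):
        return 0.0
    e, rest = undecided[0], undecided[1:]
    p = edges[e]
    return (p * exact_dcr_recursive(edges, s, t, d, within_distance,
                                    present + [e], rest)
            + (1 - p) * exact_dcr_recursive(edges, s, t, d, within_distance,
                                            present, rest))
```

Called as exact_dcr_recursive(EDGES, "s", "t", 2, within_distance), this returns the same value as the brute-force enumeration on the toy graph, while pruning decided branches early.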
How do we sample? • Unequal probability sampling: Hansen-Hurwitz (HH) estimator and Horvitz-Thompson (HT) estimator • A sample unit is a leaf node of the enumeration tree • Pr_i: sample unit weight, the sum of the probabilities of the possible worlds in that node • q_i: sampling probability, determined by the coin tosses along the root-to-leaf path
Hansen-Hurwitz (HH) Estimator • Each sampled leaf contributes 1 if it is a blue (reachable) node and 0 if it is a red (unreachable) node • Estimator: average over the samples of the leaf weight divided by its sampling probability • Unbiased • To minimize the variance, we need Pr_i = q_i, where Pr_i is the leaf node weight and q_i is the sampling probability • This is achieved by descending the tree and tossing, for each edge e, a coin that takes the +e branch with probability p(e) and the -e branch with probability 1 - p(e), so that q_i = Pr_i = p(e1) * p(e2) * (1 - p(e3)) * …
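A sketch of this coin-tossing descent, assuming the helper names from the earlier sketches. With q_i = Pr_i the weight and the sampling probability cancel, so the HH estimate reduces to the average of the 0/1 values of the n sampled leaves; the leaf tests decide a leaf's colour before all coins are tossed.

```python
import random

def hh_estimate(edges, s, t, d, n, within_distance, seed=0):
    rng = random.Random(seed)
    edge_list = list(edges)
    total = 0
    for _ in range(n):
        present, value = [], None
        for i, e in enumerate(edge_list):
            # Leaf tests before tossing the next coin (early termination).
            if within_distance(present, s, t, d):
                value = 1                                   # blue leaf
                break
            if not within_distance(present + edge_list[i:], s, t, d):
                value = 0                                   # red leaf
                break
            if rng.random() < edges[e]:                     # coin toss for edge e
                present.append(e)                           # take the +e branch
        if value is None:                                   # all coins tossed
            value = int(within_distance(present, s, t, d))
        total += value
    return total / n    # with q_i = Pr_i, the HH estimate is the mean leaf value
```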
Horvitz-Thompson (HT) Estimator • Estimator: defined over the unique sample units (duplicates are counted once) • Unbiased • To minimize the variance, we again find Pr_i = q_i • Smaller variance than the HH estimator
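One way to realize the HT estimator under the same q_i = Pr_i scheme is sketched below, treating each fully realized possible world as a sample unit; a unit drawn with probability Pr_i in n independent draws has inclusion probability 1 - (1 - Pr_i)^n. All names are ours, and the paper's construction on enumeration-tree leaves may differ in detail.

```python
import random

def ht_estimate(edges, s, t, d, n, within_distance, seed=0):
    rng = random.Random(seed)
    edge_list = list(edges)
    seen = {}                                   # world signature -> (Pr_i, reachable)
    for _ in range(n):
        bits = tuple(rng.random() < edges[e] for e in edge_list)
        if bits in seen:                        # count unique sample units only
            continue
        pr, present = 1.0, []
        for e, keep in zip(edge_list, bits):
            pr *= edges[e] if keep else (1.0 - edges[e])
            if keep:
                present.append(e)
        seen[bits] = (pr, within_distance(present, s, t, d))
    # Horvitz-Thompson weighting: each reachable unit contributes Pr_i / pi_i.
    return sum(pr / (1.0 - (1.0 - pr) ** n)
               for pr, reachable in seen.values() if reachable)
```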
Recursive Estimator • Instead of sampling the entire space n times, sample the two sub-spaces n1 and n2 times, with n1 + n2 = n • Unbiased • The variance depends on the sub-space reachabilities τ1 and τ2, so we cannot minimize it without knowing τ1 and τ2. Then what can we do?
Sample Allocation • We guess: what if n1 = n*p(e) and n2 = n*(1-p(e))? • We find: the variance is reduced, for both the HH estimator and the HT estimator!
Sample Allocation (Cont’) • Starting with sample size n, directly allocate samples down the enumeration tree: n1 = n*p(e1), n2 = n*(1-p(e1)), n3 = n1*p(e2), n4 = n1*(1-p(e2)), … • Toss coins only when the sample size becomes small • Sampling time reduced!
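A hedged sketch of a recursive HH estimator with this sample allocation: at each tree node the budget n is split deterministically into n1 = n*p(e) and n2 = n*(1-p(e)), and the routine falls back to plain coin-toss descents once the budget (or an allocated share) becomes too small. The threshold value and helper names are ours; within_distance is the BFS helper assumed earlier.

```python
import random

def recursive_hh(edges, s, t, d, n, within_distance,
                 present=(), undecided=None, threshold=4, rng=None):
    """Estimate the DCR of the subspace where edges in `present` exist and
    edges in `undecided` are still uncertain."""
    if rng is None:
        rng = random.Random(0)
    if undecided is None:
        undecided = list(edges)
    present = list(present)
    # Leaf tests, as in the divide-and-conquer exact computation.
    if within_distance(present, s, t, d):
        return 1.0
    if not within_distance(present + list(undecided), s, t, d):
        return 0.0
    e, rest = undecided[0], undecided[1:]
    p = edges[e]
    n1 = int(round(n * p))
    n2 = n - n1
    if n <= threshold or n1 == 0 or n2 == 0:
        # Small or degenerate budget: plain coin-toss (HH-style) descents.
        total = 0.0
        for _ in range(max(n, 1)):
            branch = present + [e] if rng.random() < p else present
            total += recursive_hh(edges, s, t, d, 1, within_distance,
                                  branch, rest, threshold, rng)
        return total / max(n, 1)
    # Large budget: allocate n1 = n*p(e) and n2 = n*(1-p(e)) samples to the two
    # branches and combine their estimates with weights p(e) and 1 - p(e).
    return (p * recursive_hh(edges, s, t, d, n1, within_distance,
                             present + [e], rest, threshold, rng)
            + (1 - p) * recursive_hh(edges, s, t, d, n2, within_distance,
                                     present, rest, threshold, rng))
```

Both branches of the recursion return unbiased estimates of their sub-space reachabilities, so the weighted combination stays unbiased while the deterministic split avoids coin-tossing randomness at the upper levels of the tree.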
Experimental Setup • Evaluation goals: relative error, variance, computational time • System specification: 2.0GHz dual-core AMD Opteron CPU, 4.0GB RAM, Linux
Experimental Results • Synthetic datasets: Erdős–Rényi random graphs • Vertex #: 5000, edge density: 10, sample size: 1000 • Queries categorized by extracted-subgraph size (# of edges) • For each category, 1000 queries
Experimental Results • Real datasets • DBLP: 226,000 vertices, 1,400,000 edges • Yeast PPIN: 5,499 vertices, 63,796 edges • Fly PPIN: 7,518 vertices, 51,660 edges • Extracted subgraph size: 20–50 edges
Conclusions • We are the first to propose the novel s-t distance-constraint reachability problem in uncertain graphs. • An efficient exact computation algorithm is developed based on a divide-and-conquer scheme. • Compared with two classic reachability estimators, we study two unequal probability sampling estimators: the Hansen-Hurwitz (HH) estimator and the Horvitz-Thompson (HT) estimator. • Based on the enumeration tree framework, two recursive estimators, Recursive HH and Recursive HT, are constructed to reduce estimation variance and time. • Experiments demonstrate the accuracy and efficiency of our estimators.
Thank you! Questions?