540 likes | 680 Views
Measuring and Extracting Proximity in Complex Networks. Emden Gansner, Yehuda Koren, Stephen North , Chris Volinsky AT&T Labs Research. AT&T “Safe Harborâ€.
E N D
Measuring and ExtractingProximity in Complex Networks Emden Gansner, Yehuda Koren, Stephen North, Chris VolinskyAT&T Labs Research
AT&T “Safe Harbor” The following contains "forward-looking statements" which are based on management's beliefs as well as on a number of assumptions concerning future events made by and information currently available to management. Readers are cautioned not to put undue reliance on such forward-looking statements, which are not a guarantee of performance and are subject to a number of uncertainties and other factors, many of which are outside AT&T's control, that could cause actual results to differ materially from such statements. For a more detailed description of the factors that could cause such a difference, please see AT&T's filings with the Securities and Exchange Commission. AT&T disclaims any intention or obligation to update or revise any forward-looking statements, whether as a result of new information, future events or otherwise.
large social networks data source |V| |E|
Connecting Co-Authors in DBLP 18 node subgraph Proximity: 1.35e+01 Captured: 1.31e+01(97%) Adam?Glenn?Emden?
95% of communication between… 5 node subgraph Proximity: 7.10e+00 Captured: 6.74e+00(95%(
Our goals • Measure proximity between nodes. • Explain proximity by extracting connection subgraphs that are readily visualized.
What is proximity? • proximity [prox·im·i·ty || prɑk'sɪmətɪ /prɒ-]n. adjacency, nearness, closeness, vicinity • Network proximity is an elusive notion! • Let’s work by refining a series of definitions.
Measuring proximity • Simplest approach – length of shortest path • Easily visualized
Measuring proximity • Simplest approach – length of shortest path • Easily illustrated • Disregards alternative paths Captures 56% Captures 98%
Measuring proximity • Simplest approach – length of shortest path • Easily visualized • Disregards alternative paths • Naïve calculation will be fooled by high degreesExample from a telephone call graph…
Meaningful connection Shankar Suresh Stephen Lefty Which pair is closer? • Both paths are 2-hops, about the same lengths • But when considering node-degrees… Random connection?
Measuring proximity – 2nd try • Net network flow between the nodes • Accounts for multiple paths • Distance indifferent – might favor long paths • High degree are still an issue
Measuring proximity – 3rd try • Delivered electric current(effective conductance) • Resistor network model • Accounts for multiple paths • Penalizes long paths • High degrees?? • Getting us closer… • “intuitive” • Physical analogy is not perfect! 1V 0V edge weights conductance, inverse-resistance
When is the electrical current analogy misleading? Significant connection Noise? What does current flow mean?
When is the electric current analogy misleading? • Same current flow in both cases! • Degree-1 nodes are neutral (attract no flow) • Degree-1 nodes are very common, due to incomplete information Significant connection Noise?
Augment network by a universal sink[Faloutsos, McCurley & Tomkins, KDD 2004] • Connect all nodes to a grounded universal sink (with 0V) • Tax each node - deliver portion of the flow to the sink • No internal nodes of degree 1 (above problem solved) • Penalizes long paths • A new parameter to worry about:Which tax system? - Constant tax? Proportional tax? Tax brackets? How much? • There is a worse problem…
Universal sink and (non-)monotonicity • In our previous notions of proximity, adding nodes/edges to the network couldn’t decrease proximity • Hmmm…this “blind monotonicity” was part of their shortcoming… Proximity Network size
Universal sink and (non-)monotonicity • For all previous measures, adding nodes/edges to the network couldn’t decrease proximity • With universal sink – no monotonicity:Larger network proximity tends to zero, sink attracts more flow • Even adding s—t paths can decrease proximity! Proximity Network size
Universal sink and (non-)monotonicity • Problems with non-monotonicity: • Counter-intuitive and hard to use • Size bias makes proximity-comparison across different pairs completely unreliable • Impossible to explain (size-dependent) proximity using a connection subgraph Proximity Network size
A random-walk perspective • Current-flow model has a direct r.w. interpretation • Reminder:We defined proximity by “delivered current” or “effective conductance” • The escape probability, Pesc(st), is the probability that a r.w. originating at s will reacht before visiting sagain • Let Deg(s) be the number of r.w.’s originating at s • The effective conductance between s and t, is Pesc(st)*Deg(s)
“Dead end” paths have no influence on escape probability • Both graphs have the same escape-probability from red to green Lowerredgreen escape probability Higherredgreen escape probability In both cases higher effective conductance by Rayleigh’s Monotonicity Law
Extending escape probability • The escape probability, Pesc(st), is the probability that a r.w. originating at s will reacht before visiting sagain • The cycle-free escape probability, Pc.f.esc(st) is the probability that a r.w. originating at s will reacht without visiting any node more than once • Multiply by degree to get an absolute quantity (accounting for the number of "actually initiated" r.w.'s):The c.f. effective conductance between s and t is Pc.f.esc(st)*Deg(s)
The c.f. effective conductance is a good candidate proximity measure: • Accounts for multiple paths • Favors short paths • Penalizes high-degree nodes • Penalizes dead-end paths • Parameter free • Has the “right” monotonicity • Accommodates edge directions • Has a natural extension to multiple endpoints Higherredgreen c.f. escape probability Lowerredgreen c.f. escape probability
Computing c.f. escape probability • Unlike previous measures, exact computation is impossible • Practically, we can estimate it extremely well • Probability of paths declines exponentially (e.g., 100th path is x106 less probable than the first one.) • Estimate using the most probable paths: =
Finding k most probable paths • For an edge u-v of weight w(u,v), define its length • Edge lengths are positive • Exp(-<length of path>) = Prob(path) • Short path High-probable path • Compute k shortest simple paths in O(k|E|log|E|) time [Katoh, Ibarki and Mine, 1982] • Stop searching when probability drops below “10-6”of first path
Extracting proximity • Cycle free effective conductance (CFEC) depends on the full graph • Find a small subgraph that captures the most proximity • A tradeoff between “size” and “captured proximity”, can be expressed in alternative ways: • Extract a subgraph with at most Bnodes that captures maximal CFEC • Maybe with B+1 nodes we can capture much more??? • Extract a minimal-sized subgraph that captures at least P% of total CFEC • Maybe we can capture (P-1)%of total CFEC with a much smaller subgraph???
Extracting proximity • Find a small subgraph that captures most proximity • Achieve an efficient balance between “size” and “proximity” by maximizing the ratio: • Larger α emphasize proximity larger subgraph • α=0 returns only the shortest path • α=∞ return all paths • Optionally, explicitly fix lower and upper bounds on subgraph size
What solutions do we seek? • Overlapping paths delivering the most flow
The path merger algorithm • We already have a collection of paths • Find the subset of the paths that maximizes • Combine the selected paths into a “proximity subgraph” • Overlapping paths are cheaper to add • An NP-hard problem…
Optimal algorithm • Scanning all subsets takes O(2k)time (can we do better?) • A branch-and-bound pruning significantly reduces running time • Huge deviations in path-quality make this approach effectivee.g. often it is clear that the best-subset must contain first path(s) • Prematurely terminate exponential algorithm after scanning “too many” subsets
Agglomerative algorithm • If optimal algorithm couldn’t finish, improve current result by an agglomerative algorithm • Iteratively, merge the two subsets that maximize the ratio • Record the best subset discovered
Working with large graphs in external storage • Dealing with full graph is sometimes infeasible and usually unnecessary • Prior to running the algorithm, we construct a candidate graph in main memory • We begin by growing increasing neighborhoods around the endpoints
S T
Dist(S,i)=2 Dist(T,i)=2 S T
Dist(S,i)=3 Dist(T,i)=3 S T
Dist(S,i)=4 Dist(T,i)=4 S T
Dist(S,i)=5 Dist(T,i)=5 S T Shortest path of length 10
Most probable path of length 10 was found No use for low-probability paths... Paths longer than “24” unneeded!
S T Dist(S,i)=12 Dist(T,i)=12
S T i Dist(S,i)=12 Dist(T,i)=12 • Stop adding nodes • Any s—t path through unscanned node must be longer than “24”, thus useless • Can we prune the resulting graph? • Yes! • From two circles into an ellipse…
Pruning the candidate graph • We can safely prune a significant portion of the candidate graph • Use the fact: dist(i,s)+dist(i,t)>L all s—tpaths going via i are longer than L • We ignore much less probable pathsPaths longer than “24” are not interesting • Take only nodes within the ellipse defined by:dist(i,s)+dist(i,t)<24
From2-centers of circlesto2-foci of ellipse Dist(S,i)+Dist(T,i)=24 S T Dist(S,i)=12 Dist(T,i)=12