What is Proximity?

Measuring and Extracting Proximity in Networks. Yehuda Koren, Stephen North and Chris Volinsky. KDD 2006, Philadelphia.

What is Proximity? • proximity n. adjacency, nearness, closeness, vicinity • What is the distance (closeness) between two nodes in a social network?

Why proximity? Proximity measures… • Social “closeness” • similarity • Information flow …and can be used for… • Missing Data • Link Prediction …in the following applications… • Fraud detection • Viral marketing • Identifying clusters

Our Goals • Measure and visualize proximity between nodes. • Measurement should have the following qualities: • “Close” nodes are intuitive • Short graph distance • Multiple paths • High weights on edges • Low degree nodes in the paths • Monotonicity • Generalizes to n > 2.

Our goals • Explain proximity by extracting proximity subgraphs that are readily visualized and contain a large percentage of overall proximity. • Idea comes from “connection subgraphs” (Faloutsos, McCurley and Tomkins 2004). [Figure labels: Prox = .0048; Prox = .0053]

Measuring proximity • Many proposals in the literature (see Liben-Nowell and Kleinberg 2003) • Graph distance: shortest path • Doesn’t account for multiple paths or high-degree nodes (A,D) • Maximum network flow • Disregards path length and high-degree nodes (A,B) (A,C) • Depends on bottlenecks (B,E) • Electrical networks, or “effective conductance” (e.g. Doyle and Snell 1984) • High-degree nodes are still a problem

When is the electric current analogy misleading? • Same current flow in both cases! • Degree-1 nodes are neutral [Figure labels: Significant connection; Noise?]

Sink-augmented effective conductance [Faloutsos, McCurley & Tomkins, KDD 2004] • Connect all nodes to a grounded universal sink (with 0V) • Tax each node: deliver a portion of the flow to the sink • No nodes of degree 1 (above problem solved) • Penalizes long paths • How do we set the taxing system? • Doesn’t generalize to n > 2 • No monotonicity…

Universal sink and (non-)monotonicity • With a universal sink there is no monotonicity: • For larger networks, proximity tends to zero, creating a “size bias”. • Adding s-t paths can either increase or decrease proximity! [Figure: proximity vs. network size]

Electrical networks = random walks • Current-flow notions have a direct random walk interpretation • Take a random walk starting at s, following each edge with probability proportional to its weight (conductance). • Let Deg(s), the degree of s, be the number of random walks originating at s. Then: • The escape probability, EP(s→t), is the probability that a walk originating at s will reach t before visiting s again, and • The effective conductance between s and t: EC(s,t) = EP(s→t) * Deg(s)
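The relation EC(s,t) = EP(s→t) * Deg(s) can be checked on toy graphs. The following is a minimal sketch, not the authors' code: it assumes a connected undirected graph stored as a dict of dicts of edge weights, and it solves the standard "hit t before s" linear system exactly with fractions. All function names and both example graphs are made up for illustration.

```python
from fractions import Fraction

def degree(graph, v):
    """Weighted degree: sum of the conductances on the edges at v."""
    return sum(graph[v].values())

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (assumes a is non-singular)."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def escape_probability(graph, s, t):
    """EP(s->t): probability that a walk leaving s reaches t before revisiting s."""
    inner = [v for v in graph if v not in (s, t)]
    index = {v: i for i, v in enumerate(inner)}
    # Unknowns h(v) = P(hit t before s | start at v), with h(s) = 0 and h(t) = 1.
    a = [[Fraction(0)] * len(inner) for _ in inner]
    b = [Fraction(0)] * len(inner)
    for v in inner:
        i = index[v]
        a[i][i] = Fraction(1)
        for u, w in graph[v].items():
            p = Fraction(w) / degree(graph, v)
            if u == t:
                b[i] += p
            elif u != s:
                a[i][index[u]] -= p
    h = dict(zip(inner, solve(a, b)))
    h[s], h[t] = Fraction(0), Fraction(1)
    return sum(Fraction(w) / degree(graph, s) * h[u] for u, w in graph[s].items())

def effective_conductance(graph, s, t):
    """EC(s,t) = EP(s->t) * Deg(s)."""
    return escape_probability(graph, s, t) * degree(graph, s)

# Toy graphs (hypothetical): a chain s - a - t, and the same chain with a
# degree-1 node d hanging off a.
chain = {"s": {"a": 1}, "a": {"s": 1, "t": 1}, "t": {"a": 1}}
dead_end = {"s": {"a": 1}, "a": {"s": 1, "t": 1, "d": 1},
            "d": {"a": 1}, "t": {"a": 1}}

print(effective_conductance(chain, "s", "t"))     # 1/2
print(effective_conductance(dead_end, "s", "t"))  # 1/2
```

Both graphs give the same value, which matches the observation that degree-1 nodes are neutral for effective conductance.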

Electrical networks = random walks With the random walk perspective, you can see that the 1-degree nodes have no influence. By discouraging “backtracking”, we now can properly account for high degree nodes

Our proximity: cycle-free effective conductance • The cycle-free escape probability, CFEP(s→t), is the probability that a random walk originating at s will reach t without visiting any node more than once • Multiplying by the degree of the source gives an absolute quantity (accounting for the number of "actually initiated" walks) • The cycle-free effective conductance between s and t: CFEC(s,t) = CFEP(s→t) * Deg(s)
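On a small graph, the definition above can be evaluated directly by enumerating every simple path from s to t and summing the path probabilities. This is a brute-force sketch for illustration only (exponential time, hypothetical function names, made-up toy graphs); it assumes the same dict-of-dicts weighted graph representation as any small example would.

```python
from fractions import Fraction

def degree(graph, v):
    """Weighted degree: sum of the weights on the edges at v."""
    return sum(graph[v].values())

def simple_paths(graph, s, t, path=None):
    """Yield every simple (cycle-free) path from s to t by depth-first search."""
    path = [s] if path is None else path
    if path[-1] == t:
        yield list(path)
        return
    for u in graph[path[-1]]:
        if u not in path:
            path.append(u)
            yield from simple_paths(graph, s, t, path)
            path.pop()

def path_probability(graph, path):
    """Probability that a random walk follows exactly this path."""
    prob = Fraction(1)
    for v, u in zip(path, path[1:]):
        prob *= Fraction(graph[v][u]) / degree(graph, v)
    return prob

def cfep(graph, s, t):
    """Cycle-free escape probability: sum over all simple s-t paths."""
    return sum((path_probability(graph, p) for p in simple_paths(graph, s, t)),
               Fraction(0))

def cfec(graph, s, t):
    """CFEC(s,t) = CFEP(s->t) * Deg(s)."""
    return cfep(graph, s, t) * degree(graph, s)

# Toy graphs (hypothetical): a chain, the chain plus a dead end, two parallel paths.
chain = {"s": {"a": 1}, "a": {"s": 1, "t": 1}, "t": {"a": 1}}
dead_end = {"s": {"a": 1}, "a": {"s": 1, "t": 1, "d": 1},
            "d": {"a": 1}, "t": {"a": 1}}
parallel = {"s": {"a": 1, "b": 1}, "a": {"s": 1, "t": 1},
            "b": {"s": 1, "t": 1}, "t": {"a": 1, "b": 1}}

print(cfec(chain, "s", "t"))     # 1/2
print(cfec(dead_end, "s", "t"))  # 1/3
print(cfec(parallel, "s", "t"))  # 1
```

In these toy cases the dead-end node lowers the score from 1/2 to 1/3, while a second parallel path raises it to 1.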

Higherredgreen c.f. escape probability Lowerredgreen c.f. escape probability • Properties of CFEC as a proximity measure: • Favors multiple paths • Favors short paths • Penalizes high-degree nodes • Penalizes dead-end paths • Parameter free • Has the “right” monotonicity • Accommodates edge directions • Has a natural extension to n > 2

Computing CFEC • Unlike previous measures, exact computation is infeasible in general. Let SP(s,t) be the set of all simple paths from s to t, and Prob(R) the probability that the walk follows path R: CFEP(s→t) = Σ_{R ∈ SP(s,t)} Prob(R) • Note that the probability of paths declines exponentially (e.g., the 100th path is 10^6 times less probable than the first one) • Estimate using the set of most probable paths, SP’: CFEP(s→t) ≈ Σ_{R ∈ SP’} Prob(R) • Finding the k shortest simple paths takes O(k|E|log|E|) time [Katoh, Ibaraki and Mine, 1982]
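The idea of estimating CFEC from only the most probable paths can be sketched with a simple best-first search. This is a substitute technique chosen for brevity, not the Katoh, Ibaraki and Mine algorithm cited above: because every step multiplies the probability by at most 1, popping partial paths from a max-heap yields complete s-t paths in non-increasing probability order. The function names, the `eps` cutoff parameter, and the toy graph are assumptions for illustration.

```python
import heapq
from itertools import count

def degree(graph, v):
    """Weighted degree: sum of the weights on the edges at v."""
    return sum(graph[v].values())

def most_probable_paths(graph, s, t, k, eps=0.0):
    """Yield up to k simple s-t paths as (probability, path), most probable first."""
    tie = count()                      # tie-breaker so the heap never compares paths
    heap = [(-1.0, next(tie), (s,))]   # store negated probabilities (min-heap)
    found = 0
    while heap and found < k:
        neg_prob, _, path = heapq.heappop(heap)
        prob = -neg_prob
        if prob < eps:                 # everything left is even less probable
            break
        last = path[-1]
        if last == t:
            found += 1
            yield prob, list(path)
            continue
        for u, w in graph[last].items():
            if u not in path:          # keep the path simple (cycle-free)
                step = w / degree(graph, last)
                heapq.heappush(heap, (-(prob * step), next(tie), path + (u,)))

def cfec_estimate(graph, s, t, k, eps=0.0):
    """Approximate CFEC(s,t) using only the k most probable simple paths."""
    total = sum(p for p, _ in most_probable_paths(graph, s, t, k, eps))
    return degree(graph, s) * total

# Toy graph (hypothetical): a short route s-a-t and a longer route s-b-c-t.
g = {"s": {"a": 1, "b": 1}, "a": {"s": 1, "t": 1},
     "b": {"s": 1, "c": 1}, "c": {"b": 1, "t": 1},
     "t": {"a": 1, "c": 1}}

for prob, path in most_probable_paths(g, "s", "t", k=2):
    print(prob, path)
print(cfec_estimate(g, "s", "t", k=1), cfec_estimate(g, "s", "t", k=2))
```

The worst case of this search is still exponential in the graph size, so it only illustrates the truncation idea; a dedicated k-shortest-simple-paths algorithm would be the choice for real graphs.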

Extracting proximity graphs Recall FMT’04 “connection subgraphs”, the small subgraph that best captures the connections between two nodes of the graph

Extracting proximity graphs • Achieve an efficient balance between “size” and “proximity” by maximizing the ratio: • Larger α → more emphasis on proximity → larger subgraph • α = 0 → return the shortest path • α = ∞ → return all paths

Extracting proximity graphs • We already have the collection R_k of shortest paths {P1, P2, …, Pk} • Find the subset of the paths that maximizes … and combine the selected paths into a “proximity graph” • This is an NP-hard problem, but recall that we have a list of paths sorted by probability • Use a branch-and-bound path-merging algorithm

Working with large graphs • Dealing with the full graph is sometimes infeasible and usually unnecessary • Prior to running the algorithm, we construct a candidate graph in main memory (also FMT ’04). [Figure: full network (N ~ 350M) → candidate graph (N ~ 10,000) → proximity graph (N ~ 20)]

Finding the candidate graph [Figure sequence around S and T with growing distance labels: Dist(S,i)=2, Dist(T,i)=2; then Dist(S,i)=5, Dist(T,i)=5, shortest path of length 10; then Dist(S,i)=12, Dist(T,i)=12] • Once we have this candidate graph, apply the CFEC algorithm to extract the proximity graph. • Stop adding nodes when path probabilities are below ε • Any path through an unscanned node is likely to have low probability
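One way to picture the candidate-graph step is to grow breadth-first neighborhoods around S and T in alternation and keep the induced subgraph. The sketch below is an assumption-laden simplification, not the procedure from the talk: it stops on a node budget (`max_nodes`, a hypothetical parameter) rather than on the path-probability threshold ε mentioned above, and the function name and toy graph are made up.

```python
from collections import deque

def candidate_graph(adj, s, t, max_nodes):
    """Grow BFS neighborhoods of s and t alternately until max_nodes is reached."""
    seen = {s, t}
    frontiers = [deque([s]), deque([t])]
    side = 0
    while len(seen) < max_nodes and (frontiers[0] or frontiers[1]):
        frontier = frontiers[side]
        for _ in range(len(frontier)):      # expand exactly one BFS level
            v = frontier.popleft()
            for u in adj[v]:
                if u not in seen and len(seen) < max_nodes:
                    seen.add(u)
                    frontier.append(u)
        side = 1 - side                     # switch between the S side and the T side
    # Return the subgraph induced by the selected nodes.
    return {v: {u: w for u, w in adj[v].items() if u in seen} for v in seen}

# Toy graph (hypothetical): s - a - b - t with a long tail t - x1 - x2 - x3.
adj = {"s": {"a": 1}, "a": {"s": 1, "b": 1}, "b": {"a": 1, "t": 1},
       "t": {"b": 1, "x1": 1}, "x1": {"t": 1, "x2": 1},
       "x2": {"x1": 1, "x3": 1}, "x3": {"x2": 1}}

small = candidate_graph(adj, "s", "t", max_nodes=5)
print(sorted(small))
```

The proximity computation would then run on the returned subgraph instead of the full network.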

Summary: Proximity Graphs • We have a measure of proximity which fulfills our desired criteria • Intuitive sense of closeness • Generalizes to n>2 • Parameter free • Using this measure of proximity we can efficiently extract the proximity graph. • Let’s apply to real data

Application: call detail • AT&T’s call detail graph is large (350M nodes, several billion edges). • To calculate proximity, we just need an adjacency list • Dynamic, efficient creation of adjacency lists for transaction graphs (Cortes, Pregibon, and Volinsky 2003) • Select a random sample of 2000 residential TNs (telephone numbers) and calculate proximity between them. • We found a path for 1808 of them • For those where we found a path, we calculated proximity and rendered a proximity graph.

Distribution of proximities in phone-call network

Application: call detail • Capturing proximity in a proximity graph… • Studying α • Low α: smaller graphs, less proximity captured • α = 10 seems to give a good tradeoff

[Figure axes: #Graphs; % Captured Proximity; Size of graph]

Proximity as link predictor • Calculate proximities for a sample of pairs in the network that have never communicated. • Look in the future to see which of these pairs communicate in the next time period t. • Did the pairs that eventually communicate have closer proximities? • I.e., is proximity predictive of future communication?

Proximity as link predictor Mean log proximity: Communicators = -2.4 Non-comm. = -5.9

Using Visualization • Different Visualizations bring out different aspects of the proximity graph, especially for n>2.

So who’s closer? + 1.20 - 0.53 - 3.02

Using a hierarchical layout for n=2 shows different eras of movie stars

Prox webpage http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl

Summary • Proposed cycle-free effective conductance (CFEC), with a random walk interpretation, to measure “proximity” in social networks and other ad-hoc networks • Extensions • Compare to other proximity measures (Katz, PageRank, and other methods compared in Liben-Nowell and Kleinberg (2003)) • Quantify proximity across different kinds of networks • Extend CFEC to directed edges (Hanghang Tong) • Try it out! http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl
