Measuring and Extracting Proximity in Networks. Yehuda Koren, Stephen North and Chris Volinsky. KDD 2006, Philadelphia
What is Proximity? • proximity n. adjacency, nearness, closeness, vicinity • What is the distance (closeness) between two nodes in a social network?
Why proximity? Proximity measures… • Social “closeness” • similarity • Information flow …and can be used for… • Missing Data • Link Prediction …in the following applications… • Fraud detection • Viral marketing • Identifying clusters
Our Goals • Measure and visualize proximity between nodes. • Measurement should have the following qualities: • “Close” nodes are intuitive • Short graph distance • Multiple paths • High weights on edges • Low degree nodes in the paths • Monotonicity • Generalizes to n > 2.
Our goals • Explain proximity by extracting proximity subgraphs that are readily visualized and contain a large percentage of overall proximity. • Idea comes from “connection subgraphs” (Faloutsos, McCurley and Tomkins 2004). Prox = .0048 Prox = .0053
Measuring proximity • Many proposals in the literature (n.b. Liben-Nowell and Kleinberg 2003) • Graph distance: shortest path • Doesn’t account for path length, multiple paths, or high-degree nodes (A,D) • Maximum network flow • Disregards path length and high-degree nodes (A,B) (A,C) • Depends on bottlenecks (B,E) • Electrical networks, or “effective conductance” (e.g. Doyle and Snell 1984) • High-degree nodes are still a problem
When is the electric current analogy misleading? • Same current-flow in both cases! • Degree-1 nodes are neutral Significant connection Noise?
Sink-augmented effective conductance [Faloutsos, McCurley & Tomkins, KDD 2004] • Connect all nodes to a grounded universal sink (with 0V) • Tax each node: deliver a portion of the flow to the sink • No nodes of degree 1 (above problem solved) • Penalizes long paths • How do we set the taxing system? • Doesn’t generalize to n > 2 • No monotonicity…
Universal sink and (non-)monotonicity With universal sink, no monotonicity: • For larger networks, proximity tends to zero, creating a “size bias”. • Adding s-t paths can either increase or decrease proximity! (Figure: proximity plotted against network size.)
Electrical networks = random walks • Current-flow notions have a direct random walk interpretation • Take a random walk starting at s, following edges of the graph with probability proportional to their weight (conductance). • Let Deg(s), the degree of s, be the number of random walks originating at s. Then: • The escape probability, EP(s→t), is the probability that a walk originating at s will reach t before visiting s again, and • The effective conductance between s and t: EC(s,t) = EP(s→t) * Deg(s)
Electrical networks = random walks With the random walk perspective, you can see that the 1-degree nodes have no influence. By discouraging “backtracking”, we now can properly account for high degree nodes
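The random walk definition above can be checked on a toy graph. The sketch below is not from the talk: it assumes an undirected, connected graph stored as a dict of dicts with hypothetical node names, and computes EP(s→t) by solving the linear system for h(u), the probability that a walk from u reaches t before s. It also illustrates the point that a degree-1 node hanging off the path leaves the escape probability unchanged.

```python
def solve(aug):
    """Gauss-Jordan elimination with partial pivoting on an augmented matrix."""
    n = len(aug)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def escape_probability(graph, s, t):
    """EP(s->t): probability that a walk from s reaches t before returning to s.

    graph maps node -> {neighbor: weight}; assumed undirected, connected,
    and free of self-loops (assumptions of this sketch).
    """
    inner = [u for u in graph if u not in (s, t)]
    idx = {u: i for i, u in enumerate(inner)}
    n = len(inner)
    # One equation per inner node u: deg(u)*h(u) - sum_v w(u,v)*h(v) = w(u,t),
    # using the boundary values h(s) = 0 and h(t) = 1.
    aug = [[0.0] * (n + 1) for _ in range(n)]
    for u in inner:
        i = idx[u]
        aug[i][i] = sum(graph[u].values())
        for v, w in graph[u].items():
            if v == t:
                aug[i][n] += w
            elif v != s:
                aug[i][idx[v]] -= w
    h = solve(aug)
    deg_s = sum(graph[s].values())
    return sum(w / deg_s * (1.0 if v == t else h[idx[v]])
               for v, w in graph[s].items())


def effective_conductance(graph, s, t):
    """EC(s,t) = EP(s->t) * Deg(s)."""
    return escape_probability(graph, s, t) * sum(graph[s].values())


# Toy graphs (hypothetical): a path s - a - t, and the same path with a
# dead-end node d attached to a.
chain = {"s": {"a": 1}, "a": {"s": 1, "t": 1}, "t": {"a": 1}}
with_dead_end = {"s": {"a": 1}, "a": {"s": 1, "t": 1, "d": 1},
                 "t": {"a": 1}, "d": {"a": 1}}
```

On both toy graphs the effective conductance is the same, matching the series-resistor intuition for two unit edges and the claim that degree-1 nodes are neutral under this measure.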
Our proximity: cycle-free effective conductance • The cycle-free escape probability, CFEP(s→t), is the probability that a random walk originating at s will reach t without visiting any node more than once • Multiplying by the degree of the source gives an absolute quantity (accounting for the number of "actually initiated" walks): • The cycle-free effective conductance between s and t: CFEC(s,t) = CFEP(s→t) * Deg(s)
(Figure panels: higher red→green cycle-free escape probability; lower red→green cycle-free escape probability.) • Properties of CFEC as a proximity measure: • Favors multiple paths • Favors short paths • Penalizes high-degree nodes • Penalizes dead-end paths • Parameter free • Has the “right” monotonicity • Accommodates edge directions • Has a natural extension to n > 2
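A brute-force sketch can make the CFEC definition concrete on small graphs. It assumes that the probability of a simple path is the product of w(u,v)/Deg(u) along its steps, which follows from the random walk description, and it enumerates every simple path by depth-first search. The graph encoding and node names are hypothetical, and this enumeration is only practical for tiny examples.

```python
def simple_path_probs(graph, s, t):
    """Yield the walk probability of every simple path from s to t."""
    deg = {u: sum(nbrs.values()) for u, nbrs in graph.items()}

    def dfs(u, visited, prob):
        if u == t:
            yield prob
            return
        for v, w in graph[u].items():
            if v not in visited:
                # One random-walk step from u to v has probability w / Deg(u).
                yield from dfs(v, visited | {v}, prob * w / deg[u])

    yield from dfs(s, {s}, 1.0)


def cfec(graph, s, t):
    """CFEC(s,t) = CFEP(s->t) * Deg(s), by exhaustive enumeration."""
    return sum(graph[s].values()) * sum(simple_path_probs(graph, s, t))


# Toy graphs (hypothetical node names, unit weights).
chain = {"s": {"a": 1}, "a": {"s": 1, "t": 1}, "t": {"a": 1}}
dead_end = {"s": {"a": 1}, "a": {"s": 1, "t": 1, "d": 1},
            "t": {"a": 1}, "d": {"a": 1}}
two_paths = {"s": {"a": 1, "b": 1}, "a": {"s": 1, "t": 1},
             "b": {"s": 1, "t": 1}, "t": {"a": 1, "b": 1}}
```

Compared with the plain path, the dead-end graph scores lower and the two-path graph scores higher, which is consistent with the listed properties "penalizes dead-end paths" and "favors multiple paths".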
Computing CFEC • Unlike previous measures, exact computation is infeasible in practice. Let SP(s,t) be the set of all simple paths from s to t: CFEP(s→t) = Σ_{P ∈ SP(s,t)} Prob(P) • Note that the probability of paths declines exponentially (e.g., the 100th path is 10^6 times less probable than the first one). • Estimate using the set of most probable paths, SP’: CFEP(s→t) ≈ Σ_{P ∈ SP’} Prob(P) • Finding the k shortest simple paths takes O(k|E|log|E|) time [Katoh, Ibaraki and Mine, 1982]
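The truncation idea can be sketched with a simple best-first search. This is not the Katoh, Ibaraki and Mine algorithm cited on the slide; it is a swapped-in technique that works because extending a partial path can only lower its probability, so complete paths leave the priority queue in non-increasing order of probability. The function names and the toy graph are hypothetical, and the queue can grow very large on real networks.

```python
import heapq
from itertools import count


def top_k_paths(graph, s, t, k):
    """Return up to k simple s-t paths as (probability, path), most probable first."""
    deg = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    tie = count()  # tie-breaker so the heap never has to compare paths
    heap = [(-1.0, next(tie), (s,))]
    found = []
    while heap and len(found) < k:
        neg_p, _, path = heapq.heappop(heap)
        u = path[-1]
        if u == t:
            found.append((-neg_p, path))
            continue
        for v, w in graph[u].items():
            if v not in path:  # keep the path simple (no repeated nodes)
                heapq.heappush(heap, (neg_p * w / deg[u], next(tie), path + (v,)))
    return found


def cfec_estimate(graph, s, t, k):
    """Approximate CFEC(s,t) using only the k most probable simple paths."""
    deg_s = sum(graph[s].values())
    return deg_s * sum(p for p, _ in top_k_paths(graph, s, t, k))


# Toy graph (hypothetical): a short path s-a-t and a longer path s-b-c-t.
toy = {"s": {"a": 1, "b": 1}, "a": {"s": 1, "t": 1},
       "b": {"s": 1, "c": 1}, "c": {"b": 1, "t": 1},
       "t": {"a": 1, "c": 1}}
```

With k = 1 the estimate uses only the short path; raising k adds the longer, less probable path and moves the estimate toward the exact value.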
Extracting proximity graphs Recall FMT’04 “connection subgraphs”, the small subgraph that best captures the connections between two nodes of the graph
Extracting proximity graphs • Achieve an efficient balance between “size” and “proximity” by maximizing a ratio of captured proximity to subgraph size, controlled by a parameter α: • Larger α emphasizes proximity and yields a larger subgraph • α = 0 returns the shortest path • α = ∞ returns all paths
Extracting proximity graphs • We already have the collection R_k of shortest paths {P1, P2, …, Pk} • Find the subset of the paths that maximizes … and combine the selected paths into a “proximity graph” • This is an NP-hard problem, but recall that we have a list of paths sorted by probability • Use a branch and bound path merging algorithm
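The path-selection step can be illustrated with an exhaustive search over a handful of paths. Two things here are assumptions rather than the authors' method: the objective is taken to be (captured proximity)^α divided by the number of nodes in the merged subgraph, a form chosen only because it reproduces the α = 0 and large-α behaviour described above, and the search tries every subset instead of using branch and bound. Names and inputs are hypothetical.

```python
from itertools import combinations


def best_path_subset(paths, deg_s, alpha):
    """Pick the subset of paths maximizing proximity**alpha / node count.

    paths is a list of (probability, node_tuple); deg_s is the degree of the
    source. Exhaustive search, so only suitable for a small number of paths.
    """
    best_score, best_subset = -1.0, ()
    for r in range(1, len(paths) + 1):
        for subset in combinations(paths, r):
            nodes = set().union(*(p for _, p in subset))
            proximity = deg_s * sum(prob for prob, _ in subset)
            score = proximity ** alpha / len(nodes)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset


# Two hypothetical paths with their walk probabilities.
paths = [(0.25, ("s", "a", "t")), (0.125, ("s", "b", "c", "t"))]
_, small = best_path_subset(paths, deg_s=2, alpha=0)
_, large = best_path_subset(paths, deg_s=2, alpha=10)
```

With α = 0 only the shortest path is kept, while α = 10 keeps both paths, since the extra proximity outweighs the two additional nodes.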
Working with large graphs • Dealing with the full graph is sometimes infeasible and usually unnecessary • Prior to running the algorithm, we construct a candidate graph in main memory (also FMT ’04). (Figure: full network, N ~ 350M; candidate graph, N ~ 10,000; proximity graph, N ~ 20.)
(Figure sequence: nodes i around S and T at Dist(S,i) = Dist(T,i) = 2, 3, 4, then 5, where a shortest path of length 10 appears.)
(Figure: S, T and a node i with Dist(S,i) = 12, Dist(T,i) = 12.) • Once we have this candidate graph, apply the CFEC algorithm to extract the proximity graph. • Stop adding nodes when path probabilities are below ε • Any path through an unscanned node is likely to be low probability
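A much simplified version of candidate graph construction can be sketched as two breadth-first searches. The talk stops expanding when path probabilities fall below ε; the sketch below substitutes a fixed hop radius around S and T, which is an assumption made to keep the example short. The function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source, max_dist):
    """Hop distances from source, exploring no further than max_dist."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the radius
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def candidate_graph(graph, s, t, radius):
    """Induced subgraph on nodes within `radius` hops of s or of t."""
    near = set(bfs_distances(graph, s, radius)) | set(bfs_distances(graph, t, radius))
    return {u: {v: w for v, w in graph[u].items() if v in near} for u in near}


# Toy line graph (hypothetical): s - a - b - t - x - y
line = {"s": {"a": 1}, "a": {"s": 1, "b": 1}, "b": {"a": 1, "t": 1},
        "t": {"b": 1, "x": 1}, "x": {"t": 1, "y": 1}, "y": {"x": 1}}
cand = candidate_graph(line, "s", "t", radius=1)
```

The resulting dict has the same shape as the input graph, so a proximity computation can run on it directly; nodes outside the radius, such as y here, are dropped along with their edges.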
Summary: Proximity Graphs • We have a measure of proximity which fulfills our desired criteria • Intuitive sense of closeness • Generalizes to n>2 • Parameter free • Using this measure of proximity we can efficiently extract the proximity graph. • Let’s apply to real data
Application: call detail • AT&T’s call detail graph is large (350M nodes, several billion edges). • To calculate proximity, we just need an adjacency list • Dynamic, efficient creation of adjacency lists for transaction graphs (Cortes, Pregibon, and Volinsky 2003) • Select a random sample of 2000 residential telephone numbers (TNs) and calculate proximity between them. • We found a path for 1808 of them • Where a path was found, we calculated proximity and rendered a proximity graph.
Application: call detail • Capturing proximity in a proximity graph… • Studying α • Low α: smaller graphs, less proximity captured. • α = 10 seems to give a good tradeoff
(Figure: number of graphs plotted against % captured proximity and against size of graph.)
Proximity as link predictor • Calculate proximities for a sample of pairs in the network that have never communicated. • Look in the future to see which of these communicate in the next time period t. • Did those that eventually communicate have closer proximities? • i.e. is proximity predictive of future communication?
Proximity as link predictor Mean log proximity: Communicators = -2.4 Non-comm. = -5.9
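The comparison on this slide amounts to averaging log proximity within each group of pairs. The sketch below shows that computation only; the proximity values are made-up placeholders rather than data from the study, and the use of base-10 logarithms is an assumption, since the slide does not state the base.

```python
import math


def mean_log_proximity(values):
    """Mean of log10(proximity) over a group of node pairs."""
    return sum(math.log10(v) for v in values) / len(values)


# Hypothetical proximities for pairs that later communicated and pairs that did not.
communicators = [0.1, 0.001]
non_communicators = [0.0001, 0.000001]

gap = mean_log_proximity(communicators) - mean_log_proximity(non_communicators)
```

A positive gap means the pairs that went on to communicate were closer on average, which is the direction reported on the slide.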
Using Visualization • Different Visualizations bring out different aspects of the proximity graph, especially for n>2.
So who’s closer? + 1.20 - 0.53 - 3.02
Using a hierarchical layout for n=2 shows different eras of movie stars
Prox webpage http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl
Summary • Proposed cycle-free effective conductance (CFEC), with a random walk interpretation, to measure “proximity” in social networks and other ad-hoc networks Extensions • Compare to other proximity measures (Katz, PageRank, and other methods compared in Liben-Nowell and Kleinberg (2003)) • Quantify proximity across different kinds of networks • Extend CFEC to directed edges (Hanghang Tong) Try it out! http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl