
What is Proximity?

Measuring and Extracting Proximity in Networks Yehuda Koren, Stephen North and Chris Volinsky KDD 2006 Philadelphia. What is Proximity?. proximity  n. adjacency, nearness, closeness, vicinity. What is the distance (closeness) between two nodes in a social network?. Why proximity?.


Presentation Transcript


  1. Measuring and Extracting Proximity in Networks. Yehuda Koren, Stephen North and Chris Volinsky. KDD 2006, Philadelphia

  2. What is Proximity? • proximity n. adjacency, nearness, closeness, vicinity • What is the distance (closeness) between two nodes in a social network?

  3. Why proximity? Proximity measures… • Social “closeness” • similarity • Information flow …and can be used for… • Missing Data • Link Prediction …in the following applications… • Fraud detection • Viral marketing • Identifying clusters

  4. Our Goals • Measure and visualize proximity between nodes. • Measurement should have the following qualities: • “Close” nodes are intuitive • Short graph distance • Multiple paths • High weights on edges • Low degree nodes in the paths • Monotonicity • Generalizes to n > 2.

  5. Our goals • Explain proximity by extracting proximity subgraphs that are readily visualized and contain a large percentage of overall proximity. • Idea comes from “connection subgraphs” (Faloutsos, McCurley and Tomkins 2004). [Figure labels: Prox = .0048; Prox = .0053]

  6. Measuring proximity • Many proposals in the literature (n.b. Liben-Nowell and Kleinberg 2003) • Graph distance: shortest path • Doesn’t account for path length, multiple paths, or high-degree nodes (A,D) • Maximum network flow • Disregards path length, high-degree nodes (A,B) (A,C) • Depends on bottlenecks (B,E) • Electrical networks, or “effective conductance” (e.g. Doyle and Snell 1984) • High-degree nodes still a problem

  7. When is the electric current analogy misleading? • Same current-flow in both cases! • Degree-1 nodes are neutral [Figure labels: Significant connection; Noise?]

  8. Sink-augmented effective conductance [Faloutsos, McCurley & Tomkins, KDD 2004] • Connect all nodes to a grounded universal sink (with 0V) • Tax each node: deliver a portion of the flow to the sink • No nodes of degree 1 (above problem solved) • Penalizes long paths • How do we set the taxing system? • Doesn’t generalize to n > 2 • No monotonicity…

  9. Universal sink and (non-)monotonicity With universal sink, no monotonicity: • For larger networks, proximity tends to zero, creating a “size bias”. • Adding s—t paths can either increase or decrease proximity! [Figure: proximity vs. network size]

  10. Electrical networks = random walks • Current-flow notions have a direct random-walk interpretation • Take a random walk starting at s, following edges of the graph with probability proportional to their weight (conductance). • Let Deg(s), the degree of s, be the number of random walks originating at s. Then: • The escape probability, EP(s→t), is the probability that a walk originating at s will reach t before visiting s again, and • The effective conductance between s and t: EC(s,t) = EP(s→t) * Deg(s)

  11. Electrical networks = random walks With the random walk perspective, you can see that the 1-degree nodes have no influence. By discouraging “backtracking”, we now can properly account for high degree nodes
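
A minimal Python sketch of the random-walk reading of effective conductance from slides 10 and 11. The graph representation (a dict mapping each node to a dict of neighbor weights), the function names, and the fixed number of relaxation sweeps are assumptions made for illustration; none of them come from the talk. The sketch computes EP(s→t) by relaxing hitting probabilities, then multiplies by Deg(s).

```python
def escape_probability(graph, s, t, sweeps=2000):
    """EP(s->t) for an undirected weighted graph given as {node: {neighbor: weight}}."""
    # h[u] = probability that a walk started at u reaches t before reaching s.
    h = {u: 0.0 for u in graph}
    h[t] = 1.0  # boundary values: h[s] = 0 and h[t] = 1 stay fixed
    for _ in range(sweeps):
        for u in graph:
            if u == s or u == t:
                continue
            deg = sum(graph[u].values())
            # Each interior value is the weighted average of its neighbors' values.
            h[u] = sum(w * h[v] for v, w in graph[u].items()) / deg
    deg_s = sum(graph[s].values())
    # Take one step out of s, then reach t before coming back to s.
    return sum(w * h[v] for v, w in graph[s].items()) / deg_s


def effective_conductance(graph, s, t):
    """EC(s,t) = EP(s->t) * Deg(s), as on slide 10."""
    return escape_probability(graph, s, t) * sum(graph[s].values())


# s - a - t, with and without a degree-1 node d hanging off a.
chain = {"s": {"a": 1.0}, "a": {"s": 1.0, "t": 1.0}, "t": {"a": 1.0}}
dead_end = {"s": {"a": 1.0}, "a": {"s": 1.0, "t": 1.0, "d": 1.0},
            "t": {"a": 1.0}, "d": {"a": 1.0}}
print(effective_conductance(chain, "s", "t"))
print(effective_conductance(dead_end, "s", "t"))
```

Both calls give the same value (up to floating-point error), which is the "degree-1 nodes are neutral" behavior the slides point out.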

  12. Our proximity: cycle-free effective conductance • The cycle-free escape probability, CFEP(s→t), is the probability that a random walk originating at s will reach t without visiting any node more than once • Multiplying by the degree of the source gives an absolute quantity (accounting for the number of "actually initiated" walks): • The cycle-free effective conductance between s and t: CFEC(s,t) = CFEP(s→t) * Deg(s)
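
The definition on slide 12 can be sketched in Python for very small graphs by enumerating every simple path from s to t and summing the path probabilities. The dict-of-dicts graph format and the function names are illustrative assumptions. The sketch also assumes that each step u→v is taken with probability w(u,v)/Deg(u), which is how slide 10 describes the walk.

```python
def weighted_degree(graph, u):
    """Sum of edge weights at u; graph is {node: {neighbor: weight}}."""
    return sum(graph[u].values())


def cfep(graph, s, t):
    """Cycle-free escape probability: total probability of all simple s->t paths."""
    total = 0.0
    stack = [(s, 1.0, frozenset([s]))]  # (current node, path probability, visited set)
    while stack:
        u, prob, visited = stack.pop()
        if u == t:
            total += prob  # the walk has reached t without repeating a node
            continue
        deg = weighted_degree(graph, u)
        for v, w in graph[u].items():
            if v not in visited:
                stack.append((v, prob * w / deg, visited | {v}))
    return total


def cfec(graph, s, t):
    """CFEC(s,t) = CFEP(s->t) * Deg(s)."""
    return weighted_degree(graph, s) * cfep(graph, s, t)


chain = {"s": {"a": 1.0}, "a": {"s": 1.0, "t": 1.0}, "t": {"a": 1.0}}
dead_end = {"s": {"a": 1.0}, "a": {"s": 1.0, "t": 1.0, "d": 1.0},
            "t": {"a": 1.0}, "d": {"a": 1.0}}
two_paths = {"s": {"a": 1.0, "b": 1.0}, "a": {"s": 1.0, "t": 1.0},
             "b": {"s": 1.0, "t": 1.0}, "t": {"a": 1.0, "b": 1.0}}
print(cfec(chain, "s", "t"), cfec(dead_end, "s", "t"), cfec(two_paths, "s", "t"))
```

On these toy graphs the dead-end node lowers the score and the second parallel path raises it. The enumeration is exponential in the worst case, so this is only a way to check intuition on small inputs.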

  13. Higherredgreen c.f. escape probability Lowerredgreen c.f. escape probability • Properties of CFEC as a proximity measure: • Favors multiple paths • Favors short paths • Penalizes high-degree nodes • Penalizes dead-end paths • Parameter free • Has the “right” monotonicity • Accommodates edge directions • Has a natural extension to n > 2

  14. Computing CFEC • Unlike previous measures, exact computation is impractical. Let SP(s,t) be the set of all simple paths from s to t: • Note that the probability of paths declines exponentially (e.g., the 100th path is 10^6 times less probable than the first one). • Estimate using the set of most probable paths, SP’: • Finding the k shortest simple paths takes O(k|E|log|E|) time [Katoh, Ibaraki and Mine, 1982]
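
A sketch of the truncated estimate, assuming (consistent with slide 12) that CFEC(s,t) is Deg(s) times the sum of path probabilities over SP(s,t), and that the estimate sums only over the k most probable paths. The path search below is a plain best-first search over partial paths using a heap. It is a stand-in chosen for brevity, not the algorithm of Katoh, Ibaraki and Mine. The function names, the `eps` cutoff parameter, and the graph format are hypothetical.

```python
import heapq
from itertools import count


def most_probable_paths(graph, s, t, k, eps=0.0):
    """Return up to k simple s->t paths as (probability, path), most probable first.

    graph is {node: {neighbor: weight}}. Extending a path can only lower its
    probability, so complete paths leave the heap in non-increasing order.
    """
    tie = count()  # tie-breaker so the heap never compares path tuples
    heap = [(-1.0, next(tie), (s,))]
    found = []
    while heap and len(found) < k:
        neg_prob, _, path = heapq.heappop(heap)
        prob = -neg_prob
        if prob < eps:
            break  # everything left in the heap is even less probable
        u = path[-1]
        if u == t:
            found.append((prob, path))
            continue
        deg = sum(graph[u].values())
        for v, w in graph[u].items():
            if v not in path:  # keep the path simple (cycle free)
                heapq.heappush(heap, (-(prob * w / deg), next(tie), path + (v,)))
    return found


def cfec_estimate(graph, s, t, k):
    """Deg(s) times the total probability of the k most probable simple paths."""
    deg_s = sum(graph[s].values())
    return deg_s * sum(p for p, _ in most_probable_paths(graph, s, t, k))


g = {"s": {"a": 1.0, "b": 1.0}, "a": {"s": 1.0, "t": 1.0},
     "b": {"s": 1.0, "c": 1.0}, "c": {"b": 1.0, "t": 1.0},
     "t": {"a": 1.0, "c": 1.0}}
print(most_probable_paths(g, "s", "t", k=2))
print(cfec_estimate(g, "s", "t", k=1), cfec_estimate(g, "s", "t", k=2))
```

Because path probabilities fall off quickly, the estimate from a modest k is expected to be close to the full sum; the example only has two simple paths, so k=2 is already exact there.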

  15. Extracting proximity graphs Recall FMT’04 “connection subgraphs”, the small subgraph that best captures the connections between two nodes of the graph

  16. Extracting proximity graphs • Achieve an efficient balance between “size” and “proximity” by maximizing the ratio: • Larger α → emphasizes proximity → larger subgraph • α = 0 → return shortest path • α = ∞ → return all paths

  17. Extracting proximity graphs • We already have the collection R_k of shortest paths {P1, P2, …, Pk} • Find the subset of the paths that maximizes … and combine the selected paths into a “proximity graph” • This is an NP-hard problem, but recall that we have a list of paths sorted by probability • Use a branch-and-bound path-merging algorithm
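
The size/proximity trade-off of slides 16 and 17 can be illustrated with a small Python sketch. Two things here are assumptions, because the transcript omits the ratio itself: the score is taken to be CFEC(H)^α / |V(H)|, and CFEC(H) is approximated as Deg(s) times the total probability of the selected paths. The selection rule is also a deliberate simplification: it only tries prefixes of the probability-sorted path list, not the branch-and-bound path merging named on the slide. The function name `select_prefix` is hypothetical.

```python
def select_prefix(paths, deg_s, alpha):
    """Pick the best-scoring prefix of a probability-sorted path list.

    paths: list of (probability, node_tuple), most probable first.
    Assumed score (not given in the transcript): (deg_s * sum of probs)**alpha / #nodes.
    Returns the chosen paths and the undirected edge set of the merged subgraph.
    """
    best_score, best_count = float("-inf"), 0
    nodes, prob_sum = set(), 0.0
    for i, (prob, path) in enumerate(paths, start=1):
        nodes.update(path)
        prob_sum += prob
        score = (deg_s * prob_sum) ** alpha / len(nodes)
        if score > best_score:
            best_score, best_count = score, i
    chosen = paths[:best_count]
    # Merge the chosen paths into one subgraph by collecting their edges.
    edges = {frozenset(e) for _, p in chosen for e in zip(p, p[1:])}
    return chosen, edges


paths = [(0.25, ("s", "a", "t")), (0.125, ("s", "b", "c", "t"))]
for alpha in (0, 1, 10):
    chosen, edges = select_prefix(paths, deg_s=2.0, alpha=alpha)
    print(alpha, len(chosen), len(edges))
```

With α = 0 the score only rewards small subgraphs, so a single path is kept; with a large α the extra proximity from the second path outweighs the added nodes. That matches the qualitative behavior listed on slide 16, under the assumed form of the ratio.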

  18. Working with large graphs • Dealing with the full graph is sometimes infeasible and usually unnecessary • Prior to running the algorithm, we construct a candidate graph in main memory (also FMT ’04). [Figure labels: full network, N ~ 350M; candidate graph, N ~ 10,000; proximity graph, N ~ 20]

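  19. Finding the candidate graph [Figure: S and T]

  20. [Figure: S and T, with Dist(S,i)=2 and Dist(T,i)=2]

  21. [Figure: S and T, with Dist(S,i)=3 and Dist(T,i)=3]

  22. [Figure: S and T, with Dist(S,i)=4 and Dist(T,i)=4]

  23. [Figure: S and T, with Dist(S,i)=5 and Dist(T,i)=5; shortest path of length 10]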

  24. [Figure: S, T and node i, with Dist(S,i)=12 and Dist(T,i)=12] • Once we have this candidate graph, apply the CFEC algorithm to extract the proximity graph. • Stop adding nodes when path probabilities are below ε • Any path through an unscanned node is likely to be low probability
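
A rough Python sketch of growing a candidate graph around S and T, in the spirit of slides 19 to 24. It is not the procedure from the talk: it assumes plain hop-count distance, a fixed radius in place of the ε stopping rule, and it keeps the union of the two neighborhoods. The function names and the dict-of-dicts graph format are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source, radius):
    """Hop distances from source to every node within `radius` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def candidate_graph(graph, s, t, radius):
    """Induced subgraph on nodes within `radius` hops of s or of t."""
    keep = set(bfs_distances(graph, s, radius)) | set(bfs_distances(graph, t, radius))
    # Edges leaving the kept set are dropped, so degrees in the result can be
    # smaller than in the full graph; path probabilities that should reflect
    # the true degrees need Deg(u) taken from the full graph.
    return {u: {v: w for v, w in graph[u].items() if v in keep} for u in keep}


full = {"s": {"a": 1.0}, "a": {"s": 1.0, "b": 1.0}, "b": {"a": 1.0, "t": 1.0},
        "t": {"b": 1.0, "x": 1.0}, "x": {"t": 1.0, "y": 1.0},
        "y": {"x": 1.0, "z": 1.0}, "z": {"y": 1.0}}
print(sorted(candidate_graph(full, "s", "t", radius=1)))
```

The point of the sketch is only that the proximity computation can then run on a small in-memory graph; how far to expand, and in what order, is where the talk's ε rule would come in.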

  25. Summary: Proximity Graphs • We have a measure of proximity which fulfills our desired criteria • Intuitive sense of closeness • Generalizes to n>2 • Parameter free • Using this measure of proximity we can efficiently extract the proximity graph. • Let’s apply to real data

  26. Application: call detail • AT&T’s call detail graph is large (350M nodes, several billion edges). • To calculate proximity, we just need an adjacency list • Dynamic, efficient creation of adjacency lists for transaction graphs (Cortes, Pregibon, and Volinsky 2003) • Select a random sample of 2000 residential TNs and calculate proximity between them. • We found a path for 1808 of them • Where we found a path, we calculated proximity and rendered a proximity graph.

  27. Distribution of proximities in phone-call network

  28. Application: call detail • Capturing proximity in a proximity graph… • Studying α • Low α: smaller graphs, less proximity captured. α = 10 seems to give a good tradeoff

  29. [Figure axes: #Graphs; %Captured Proximity; Size of graph]

  30. Proximity as link predictor • Calculate proximities for a sample of pairs in the network that have never communicated. • Look into the future to see which of these communicate in the next time period t. • Did those that eventually communicate have closer proximities? • i.e. is proximity predictive of future communication?

  31. Proximity as link predictor Mean log proximity: Communicators = -2.4 Non-comm. = -5.9

  32. Using Visualization • Different Visualizations bring out different aspects of the proximity graph, especially for n>2.

  33. So who’s closer? [Figure labels: +1.20; -0.53; -3.02]

  34. Using a hierarchical layout for n=2 shows different eras of movie stars

  35. Prox webpage http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl

  36. Summary • Proposed cycle-free effective conductance (CFEC), with a random-walk interpretation, to measure “proximity” in social networks and other ad-hoc networks. Extensions • Compare to other proximity measures (Katz, PageRank, and other methods compared in Liben-Nowell and Kleinberg (2003)) • Quantify proximity across different kinds of networks • Extend CFEC to directed edges (Hanghang Tong). Try it out! http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl

  37. Extensions http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl
