The Link Prediction Problem for Social Networks. David Liben-Nowell, MIT; Jon Kleinberg, Cornell. Saswat Mishra, sxm111131
Summary • The “Link Prediction Problem” • Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? • Based on “proximity” of nodes in a network
Introduction • Natural examples of social networks • Nodes = people/entities • Edges = interaction/collaboration
Motivation • Understanding how social networks evolve • The link prediction problem: given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to the network during the interval (t, t′)
Why? • To suggest interactions or collaborations that have not yet been utilized within an organization • To monitor terrorist networks: to deduce possible interactions between terrorists (without direct evidence) • Used in Facebook and LinkedIn to suggest friends • Open question: how does Facebook do it? (friends of friends, same school, manually…)
Motivation • Co-authorship network for scientists • Scientists who are “close” in the network will have common colleagues and circles, and so are likely to collaborate • Caveat: scientists who have never collaborated might do so in the future, which is hard to predict • Goal: make that intuitive notion precise; understand which measures of “proximity” lead to accurate predictions • [Figure: example graph with nodes A, B, C, D]
Goals • Present measures of proximity • Understand the relative effectiveness of network proximity measures (adapted from graph theory, CS, social sciences) • Show that prediction by proximity outperforms random prediction by a factor of 40 to 50 • Show that subtle measures outperform more direct measures
Data and Experimental Setup • Co-authorship network (G) built from the author lists of papers on the physics e-Print arXiv (www.arxiv.org) • Five such networks, one from each of five sections of the arXiv • Training interval [1994, 1996], κtraining = 3 • Test interval [1997, 1999], κtest = 3 • Core: set of authors who have at least 3 papers in the training interval and at least 3 papers in the test interval • G[1994, 1996] = Gcollab = (A, Eold) • Enew = new collaborations (edges) that appear in the test interval but not in the training interval
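The setup above could be sketched roughly as follows. This is a minimal illustration, not the authors' code: `build_setup` is a hypothetical helper, the `papers` list is made-up toy data, and each paper is assumed to be a (year, author list) pair.

```python
from itertools import combinations

def build_setup(papers, train=(1994, 1996), test=(1997, 1999), k_train=3, k_test=3):
    """Derive E_old, E_new and Core from a list of (year, authors) papers."""
    def edges_and_counts(lo, hi):
        edges, counts = set(), {}
        for year, authors in papers:
            if lo <= year <= hi:
                unique = sorted(set(authors))
                for a in unique:
                    counts[a] = counts.get(a, 0) + 1
                # every pair of co-authors on a paper forms an undirected edge
                edges.update(frozenset(p) for p in combinations(unique, 2))
        return edges, counts

    e_old, n_train = edges_and_counts(*train)
    e_test, n_test = edges_and_counts(*test)
    e_new = e_test - e_old  # collaborations first seen in the test interval
    core = {a for a in n_train
            if n_train[a] >= k_train and n_test.get(a, 0) >= k_test}
    return e_old, e_new, core

# Toy data (hypothetical): author names and years are invented for illustration.
papers = [
    (1994, ["A", "B"]), (1995, ["A", "C"]), (1996, ["A", "B"]),
    (1997, ["A", "D"]), (1997, ["B", "C"]), (1998, ["A", "B"]), (1999, ["A", "C"]),
]
e_old, e_new, core = build_setup(papers)
```

Here only author A has three papers in both intervals, so Core is {A}; lowering both thresholds to 2 would also admit B.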
Methods for Link Prediction • Take the input graph Gcollab from the training period • Pick a pair of nodes (x, y) • Assign a connection weight score(x, y) • Make a list of pairs in descending order of score • score is a measure of proximity • Any ideas for measures?
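The generic procedure above might look like this in Python. It is a sketch under assumptions: the graph is stored as a dict of neighbor sets, only pairs that are not yet connected are scored (since the task is to predict new edges), and the toy graph is hypothetical rather than the one drawn in the slides.

```python
from itertools import combinations

def rank_pairs(adj, score):
    """Return not-yet-connected node pairs sorted by descending score(adj, x, y)."""
    candidates = [(x, y) for x, y in combinations(sorted(adj), 2)
                  if y not in adj[x]]
    return sorted(candidates, key=lambda p: score(adj, *p), reverse=True)

# Hypothetical toy graph as a dict of neighbor sets.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}

def common_neighbors(adj, x, y):
    # placeholder proximity measure: number of shared neighbors
    return len(adj[x] & adj[y])

ranking = rank_pairs(adj, common_neighbors)
```

Any of the proximity measures on the following slides can be passed in place of `common_neighbors`, as long as it takes the same three arguments.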
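Graph distance & Common Neighbors • Graph distance: score(x, y) = (negated) length of the shortest path between x and y • Common neighbors: score(x, y) = number of neighbors that x and y share; in the example graph, A and C have 2 common neighbors and are therefore more likely to collaborate • [Figure: example graph with nodes A, B, C, D, E]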
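A minimal sketch of these two scores, assuming an unweighted, undirected graph stored as a dict of neighbor sets. The toy graph is hypothetical; it is only chosen so that A and C share two neighbors, as in the slide.

```python
from collections import deque

def graph_distance_score(adj, x, y):
    """Negated shortest-path length from x to y (BFS); -inf if y is unreachable."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if u == y:
            return -dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("-inf")

def common_neighbors_score(adj, x, y):
    """Number of nodes adjacent to both x and y."""
    return len(adj[x] & adj[y])

# Hypothetical toy graph (the edges of the slide's figure are not known).
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
```

Negating the distance keeps the convention that a higher score means closer nodes, so the same descending sort works for every measure.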
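Jaccard’s coefficient and Adamic/Adar • Γ(x) = set of neighbors of x • Jaccard’s coefficient: same as common neighbors, adjusted for degree: score(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)| • Adamic/Adar: weights rarer neighbors more heavily: score(x, y) = sum over z in Γ(x) ∩ Γ(y) of 1 / log |Γ(z)|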
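Both measures can be sketched in a few lines, using their standard definitions: Jaccard divides the number of shared neighbors by the size of the union of the two neighborhoods, and Adamic/Adar sums 1 / log(degree) over the shared neighbors. The graph representation and the toy graph are assumptions made for illustration.

```python
import math

def jaccard_score(adj, x, y):
    """|N(x) & N(y)| / |N(x) | N(y)|, or 0.0 if both neighborhoods are empty."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def adamic_adar_score(adj, x, y):
    """Sum of 1 / log(degree(z)) over common neighbors z of x and y."""
    # A common neighbor of two distinct nodes has degree >= 2, so log(...) > 0.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

# Hypothetical toy graph as a dict of neighbor sets.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
```

In this toy graph the shared neighbor B (degree 3) contributes more to the Adamic/Adar score of (A, C) than the better-connected D (degree 4).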
Preferential Attachment • Probability that a new collaboration involves x is proportional to |Γ(x)|, the current number of neighbors of x • score(x, y) := |Γ(x)| · |Γ(y)|
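As a small illustration, the usual preferential-attachment score is the product of the two nodes' degrees. The dict-of-sets graph and the toy edges below are hypothetical.

```python
def preferential_attachment_score(adj, x, y):
    """Product of the current degrees of x and y."""
    return len(adj[x]) * len(adj[y])

# Hypothetical toy graph as a dict of neighbor sets.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
```

Unlike the neighborhood-overlap measures, this score ignores whether x and y are close to each other; it depends only on how well connected each node already is.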
Considering all paths: Katz • Katz: measure that sums over the collection of all paths between x and y, exponentially damped by length (so that short paths count more heavily): score(x, y) = sum over ℓ ≥ 1 of β^ℓ · (number of paths of length ℓ from x to y) • β is chosen to be a very small value (for damping)
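A truncated version of the Katz sum can be sketched without any matrix library. This is an approximation under stated assumptions: paths are counted as walks (nodes may repeat), the infinite sum is cut off at `max_len`, and the default values of `beta` and `max_len` are arbitrary illustrative choices.

```python
def katz_score(adj, x, y, beta=0.05, max_len=10):
    """Sum over l = 1..max_len of beta**l * (number of length-l walks from x to y)."""
    walks = {x: 1}  # walks[v] = number of walks of the current length from x to v
    total = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for u, count in walks.items():
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0) + count
        walks = nxt
        total += beta ** length * walks.get(y, 0)
    return total

# Hypothetical toy graph as a dict of neighbor sets.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
```

With a small `beta` the shortest paths dominate: in the toy graph, the two length-2 paths between A and C already account for most of their score.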
Hitting time, PageRank • Hitting time H(x, y): expected number of steps for a random walk starting at x to reach y • Commute time: H(x, y) + H(y, x) • If y has a large stationary probability π(y), H(x, y) is small regardless of x. To counterbalance, we can normalize by multiplying by π(y) • PageRank: to cut down on long random walks, the walk can return to x with probability α at every step
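The PageRank variant described above (a walk rooted at x) can be approximated by power iteration. This is a sketch, assuming every node has at least one neighbor; the values of `alpha` and `iterations` are illustrative choices, not values taken from the slides.

```python
def rooted_pagerank(adj, x, alpha=0.15, iterations=100):
    """Approximate stationary distribution of a walk that jumps back to x with
    probability alpha at each step and otherwise moves to a random neighbor.
    Assumes every node in adj has at least one neighbor."""
    prob = {v: 0.0 for v in adj}
    prob[x] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in adj}
        nxt[x] += alpha  # total mass is 1, so a fraction alpha returns to x
        for u, p in prob.items():
            share = (1 - alpha) * p / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        prob = nxt
    return prob

# Hypothetical toy graph as a dict of neighbor sets.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
scores = rooted_pagerank(adj, "A")  # scores[y] plays the role of score("A", y)
```

Because the walk keeps returning to x, nodes near x accumulate more probability than distant nodes, which is what makes the result usable as a proximity score.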
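SimRank • Defined recursively: two nodes are similar to the extent that they are joined to similar neighbors • similarity(x, x) := 1 • similarity(x, y) := γ · (sum over a in Γ(x), b in Γ(y) of similarity(a, b)) / (|Γ(x)| · |Γ(y)|), for some γ in [0, 1]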
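The recursion can be evaluated by fixed-point iteration, starting from similarity 1 on the diagonal and 0 elsewhere. This is a sketch using the standard SimRank update; `gamma` and the iteration count are illustrative assumptions.

```python
def simrank(adj, gamma=0.8, iterations=10):
    """Iterate sim(x, y) = gamma * average of sim(a, b) over neighbors a of x
    and b of y, with sim(x, x) fixed at 1."""
    nodes = sorted(adj)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif adj[u] and adj[v]:
                    total = sum(sim[(a, b)] for a in adj[u] for b in adj[v])
                    new[(u, v)] = gamma * total / (len(adj[u]) * len(adj[v]))
                else:
                    new[(u, v)] = 0.0  # a node without neighbors is similar to nothing
        sim = new
    return sim

# Hypothetical toy graph as a dict of neighbor sets.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}
similarity = simrank(adj)
```

This all-pairs version costs time proportional to the square of the number of nodes times the neighborhood sizes per iteration, so it only suits small graphs.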
Low-rank approximation • Represent the graph by its adjacency matrix M • Compute the rank-k matrix Mk that best approximates M (noise reduction) • Each node x corresponds to a row r(x) of Mk; score(x, y) = inner product of rows r(x) and r(y)
Unseen bigrams and Clustering • Unseen bigrams: Derived from language modeling • Estimating frequency of unseen bigrams – pairs of words (nodes here) that co-occur in a test corpus but not in the training corpus • Clustering: deleting tenuous edges in Gcollab through a clustering procedure and running predictors on the “cleaned-up” subgraph
Results • The results are presented as: 1. Factor improvement of the proposed predictors over the random predictor, the graph distance predictor, and the common neighbors predictor 2. Relative performance vs. the above predictors 3. Common predictions
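One way such a factor improvement could be computed is sketched below. The protocol is an assumption, not taken from the slides: the predictor's top n pairs are checked against the n true new edges, and the random baseline picks candidate pairs uniformly. The function names and toy inputs are hypothetical.

```python
def precision_at_n(ranked_pairs, e_new):
    """Fraction of the top n = |e_new| ranked pairs that are true new edges."""
    n = len(e_new)
    top = {frozenset(p) for p in ranked_pairs[:n]}
    return len(top & e_new) / n if n else 0.0

def factor_over_random(precision, num_candidates, num_new):
    """Ratio of a predictor's precision to that of uniform random guessing."""
    random_precision = num_new / num_candidates
    return precision / random_precision

# Toy inputs (hypothetical): a ranked list of candidate pairs and the true new edges.
ranked = [("A", "C"), ("A", "E"), ("B", "E"), ("C", "E")]
e_new = {frozenset(("A", "C")), frozenset(("C", "E"))}

precision = precision_at_n(ranked, e_new)
factor = factor_over_random(precision, num_candidates=len(ranked), num_new=len(e_new))
```

On this tiny example the predictor gets one of its two guesses right, which is no better than random; the large factors reported for real networks arise because true new edges are a very small fraction of all candidate pairs.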
Results: vs. graph distance predictor, vs. common neighbors predictor • [Figures not reproduced]
Conclusions • No single clear winner • Many predictors outperform the random predictor => there is useful information in the network topology • Katz + clustering + low-rank approximation perform particularly well • Some simple measures, e.g. common neighbors and Adamic/Adar, perform well
Critique • Even the best predictor (Katz on gr-qc) is correct on only 16% of predictions • How good is that? • All collaborations are treated equally. Perhaps treating recent collaborations as more important than older ones would help?
References • Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, July 2003. • A. L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaboration. Physica A, 311(3–4):590–614, 2002. • Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998. • Rodrigo De Castro and Jerrold W. Grossman. Famous trails to Paul Erdős. Mathematical Intelligencer, 21(3):51–63, 1999.
Questions?