SimRank: A Measure of Structural-Context Similarity Glen Jeh & Jennifer Widom KDD 2002
Motivation • Many applications require a measure of “similarity” between objects. • Web search • Shopping recommendations • Search for “Related Works” among scientific papers • But “similarity” may be domain-dependent. • Can we define a generic model for similarity?
Common Ground • What do all these applications have in common? A data set of objects linked by a set of relations. • Then, a generic concept of similarity is structural-context similarity. • “Two objects are similar if they relate to similar objects.” • Recall automorphic equivalence: • “Two objects are equivalent if they relate to equivalent objects.”
Problem Statement • Given a Graph G = (V, E), for each pair of vertices a,b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.
Basic Graph Model • Directed graph G = (V, E) • V = set of objects • E = set of unweighted edges • Edge (u,v) exists if there is a relation u → v • I(v) = set of in-neighbors of vertex v • O(v) = set of out-neighbors of vertex v
SimRank Similarity • Recursive Model • “Two objects are similar if they are referenced by similar objects” • That is, a ~ b if • c → a and d → b, and • c ~ d • An object is equivalent to itself (score = 1) • Example • ProfA ~ ProfB because both are referenced by Univ. • StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}
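Basic SimRank Equation • s(a,b) = similarity between a and b = average similarity between in-neighbors of a and in-neighbors of b • s(a,b) is in the range [0, 1] • If a = b, then s(a,b) = 1 • If a ≠ b, $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$, where $I_i(a)$ is the i-th in-neighbor of a • C is a constant, 0 < C < 1 • If I(a) = ∅ or I(b) = ∅, then s(a,b) = 0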
Decay Factor C • Vertex x is identical to itself: s(x,x) = 1 • Since we have x → a and x → b, should s(a,b) = 1 also? • If the graph represented all the information about x, a, and b, then s(a,b) would ideally = 1. • But in reality the graph does not describe everything about them, so we expect s(a,b) < 1. • Therefore, the constant C expresses our limited confidence, or decay with distance: s(a,b) = C ∙ average similarity of (I(a), I(b))
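As a worked illustration of the decay factor, consider the smallest case described above: x is the only in-neighbor of both a and b. Assuming the usual averaged form of the SimRank equation, with $I_i(a)$ denoting the i-th in-neighbor of a, the score reduces to C itself:

```latex
\begin{align*}
s(a,b) &= \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr) \\
       &= \frac{C}{1 \cdot 1}\, s(x,x) \\
       &= C
\end{align*}
```

For example, with an assumed value C = 0.8 (chosen here only for illustration), s(a,b) = 0.8 even though a and b have identical in-neighborhoods.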
G² Paired-Vertex Perspective • Given graph G, define G² = (V², E²) where • V² = V × V. Each vertex in V² is a pair of vertices in V. • E²: (a,b) → (c,d) in G² iff a → c and b → d in G • Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex.
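The construction of the pair graph can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `build_pair_graph` and the toy edge list (a reduced version of the Univ/Prof/Student example) are assumptions made for illustration. Pairs are stored sorted so that (a,b) and (b,a) collapse into one vertex.

```python
def build_pair_graph(edges):
    """Return the edge set of the pair graph for a directed edge list.

    An edge (a,b) -> (c,d) exists iff a -> c and b -> d in the input graph.
    Each pair is stored as a sorted tuple, which merges (a,b) with (b,a).
    """
    pair_edges = set()
    for a, c in edges:
        for b, d in edges:
            source = tuple(sorted((a, b)))
            target = tuple(sorted((c, d)))
            pair_edges.add((source, target))
    return pair_edges


# Hypothetical toy graph, loosely modeled on the Univ/Prof/Student example.
toy_edges = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
]
toy_pair_edges = build_pair_graph(toy_edges)
```

In this toy graph the self-vertex (Univ, Univ) points to (ProfA, ProfB), which in turn points to (StudentA, StudentB), matching the chain of similarity in the example.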
Source and Flow of Similarity • SimRank score for a vertex (a,b) in G² = similarity between a and b in G. • The source of similarity is self-vertices, like (Univ, Univ). • Then, similarity propagates along pair-paths in G², away from the sources. • Note that values decrease away from (Univ, Univ)
SimRank in Bipartite Domains • Bipartite: 2 types of objects • Example: Buyers and Items
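Bipartite SimRank Equations • Two types of similarity: • Two buyers A ≠ B are similar if they buy similar items • Out-neighbors of buyers are relevant: $s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))$ • Two items c ≠ d are similar if they are bought by similar buyers • In-neighbors of items are relevant: $s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))$ • In general, we can use I(·) and/or O(·) for any graph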
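The two mutually dependent similarities can be sketched as a pair of alternating updates. The following is a minimal sketch under stated assumptions: the function name `bipartite_simrank`, the default decay values of 0.8, the fixed iteration count, and the shopping data are all hypothetical, and each basket is assumed to list an item at most once.

```python
def bipartite_simrank(buys, c1=0.8, c2=0.8, iterations=5):
    """Iterate buyer-buyer and item-item similarity on a bipartite graph.

    buys maps each buyer to the list of items bought (its out-neighbors).
    Returns two dicts keyed by (buyer, buyer) and (item, item) pairs.
    """
    buyers = sorted(buys)
    items = sorted({item for basket in buys.values() for item in basket})
    # In-neighbors of each item: the buyers who bought it.
    bought_by = {item: [b for b in buyers if item in buys[b]] for item in items}

    def update(keys, neighbors, other_scores, c):
        new_scores = {}
        for p in keys:
            for q in keys:
                n_p, n_q = neighbors[p], neighbors[q]
                if p == q:
                    new_scores[p, q] = 1.0
                elif not n_p or not n_q:
                    new_scores[p, q] = 0.0
                else:
                    total = sum(other_scores[x, y] for x in n_p for y in n_q)
                    new_scores[p, q] = c * total / (len(n_p) * len(n_q))
        return new_scores

    s_buyers = {(a, b): 1.0 if a == b else 0.0 for a in buyers for b in buyers}
    s_items = {(c, d): 1.0 if c == d else 0.0 for c in items for d in items}
    for _ in range(iterations):
        # Both updates read the previous round's scores.
        s_buyers, s_items = (
            update(buyers, buys, s_items, c1),
            update(items, bought_by, s_buyers, c2),
        )
    return s_buyers, s_items


# Hypothetical shopping data.
toy_buys = {"A": ["sugar", "eggs"], "B": ["sugar", "flour"]}
buyer_scores, item_scores = bipartite_simrank(toy_buys, iterations=2)
```

In this toy data, eggs and flour are never bought together, yet they receive a nonzero score after two rounds because their buyers became similar through the shared item sugar.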
MiniMax Variant • Motivation: Two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1} • SimRank compares each course of A with each course of B • But intuitively we just want the best matching pairs: s(Eng1_A, Eng1_B), s(Math1_A, Math1_B), etc. • Solution: Two steps • Max: Pair each neighbor of A with only its most similar neighbor of B: $s_A(A,B) = \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{j} s(O_i(A), O_j(B))$, where $O_i(A)$ is the i-th out-neighbor of A. Do the same in the other direction: $s_B(A,B) = \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{i} s(O_i(A), O_j(B))$ • Min: Final s(A,B) is the smaller of s_A(A,B) and s_B(A,B) [weakest link]
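The difference between plain averaging and the max-then-min idea can be sketched as follows. This is an illustrative sketch only: the function names, the decay value 0.8, and the course similarity function (identity, meaning each course is similar only to itself) are assumptions, not values from the slides.

```python
def average_score(courses_a, courses_b, sim, c=0.8):
    """Plain SimRank-style score: average over all course pairs."""
    if not courses_a or not courses_b:
        return 0.0
    total = sum(sim(x, y) for x in courses_a for y in courses_b)
    return c * total / (len(courses_a) * len(courses_b))


def minimax_score(courses_a, courses_b, sim, c=0.8):
    """Max step in each direction, then take the smaller result."""
    if not courses_a or not courses_b:
        return 0.0
    # Each course of A is matched with its most similar course of B.
    s_a = c * sum(max(sim(x, y) for y in courses_b) for x in courses_a) / len(courses_a)
    # Same in the other direction.
    s_b = c * sum(max(sim(x, y) for x in courses_a) for y in courses_b) / len(courses_b)
    return min(s_a, s_b)


def identity_sim(x, y):
    # Hypothetical course similarity: 1 for the same course, else 0.
    return 1.0 if x == y else 0.0


courses = ["Eng1", "Math1", "Chem1", "Hist1"]
plain = average_score(courses, courses, identity_sim)
best_match = minimax_score(courses, courses, identity_sim)
```

Under these assumptions, two students with identical course lists get a plain score of C/4 but a minimax score of C, which is the behavior the motivation asks for.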
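Computing SimRank • R_k(a,b) = estimate of SimRank after k iterations. • Initialization: R_0(a,b) = 1 if a = b, and 0 otherwise. • Iteration: for a ≠ b, $R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$, where $I_i(a)$ is the i-th in-neighbor of a; R_{k+1}(a,b) = 1 for a = b, and R_{k+1}(a,b) = 0 if I(a) or I(b) is empty. • R_k(a,b) is the similarity that has flowed a distance k away from the sources. R_k values are non-decreasing as k increases. • We can prove that R_k(a,b) converges to s(a,b)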
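The iteration can be sketched as a naive all-pairs loop. This is a minimal sketch rather than the authors' implementation: the function name `simrank`, the decay value 0.8, the iteration count, and the toy graph are assumptions. The graph is given as a dict from each vertex to its list of in-neighbors, and every in-neighbor is assumed to appear as a key.

```python
def simrank(in_neighbors, c=0.8, iterations=5):
    """Naive iterative SimRank over all vertex pairs.

    in_neighbors maps each vertex to the list of its in-neighbors.
    Returns a dict mapping (a, b) to the estimate after the last iteration.
    """
    nodes = sorted(in_neighbors)
    # Start from 1 on the diagonal and 0 elsewhere.
    scores = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_scores = {}
        for a in nodes:
            for b in nodes:
                in_a, in_b = in_neighbors[a], in_neighbors[b]
                if a == b:
                    new_scores[a, b] = 1.0
                elif not in_a or not in_b:
                    new_scores[a, b] = 0.0
                else:
                    total = sum(scores[i, j] for i in in_a for j in in_b)
                    new_scores[a, b] = c * total / (len(in_a) * len(in_b))
        scores = new_scores
    return scores


# Hypothetical toy graph: Univ -> ProfA, ProfB; ProfA -> StudentA; ProfB -> StudentB.
toy_in_neighbors = {
    "Univ": [],
    "ProfA": ["Univ"],
    "ProfB": ["Univ"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}
toy_scores = simrank(toy_in_neighbors, iterations=3)
```

In this toy graph the professors reach similarity C after one iteration and the students reach C² after two, which illustrates similarity flowing one step further from the source (Univ, Univ) per iteration.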
Time and Space Complexity • Space complexity: O(n²) to store R_k(a,b) • Time complexity: O(K n² d₂), where K is the number of iterations and d₂ is the average of |I(a)||I(b)| over all vertex pairs (a,b) • To improve performance, we can prune G²: • Idea: vertices that are far apart should have very low similarity, which we can approximate as 0. • Select a radius r. If vertex-pair (a,b) cannot meet in fewer than r steps, remove it from the graph G². • Space complexity: O(n d_r) • Time complexity: O(K n d_r d₂), where d_r = average number of neighbors within radius r.
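One possible way to select the vertex pairs that survive pruning is a bounded breadth-first search from every vertex. The sketch below is an assumption about how the radius could be measured (hop distance in the undirected version of the graph); the slides do not spell this out, and the function name `pairs_within_radius` and the toy edge list are hypothetical.

```python
from collections import deque


def pairs_within_radius(edges, radius):
    """Return the vertex pairs whose undirected hop distance is <= radius.

    Pairs are stored as sorted tuples so (a,b) and (b,a) coincide.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    kept = set()
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if distance[u] == radius:
                continue  # do not expand beyond the radius
            for w in adjacency[u]:
                if w not in distance:
                    distance[w] = distance[u] + 1
                    queue.append(w)
        for other in distance:
            kept.add(tuple(sorted((start, other))))
    return kept


# Hypothetical toy graph.
toy_graph_edges = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
]
near_pairs = pairs_within_radius(toy_graph_edges, radius=2)
```

Only the kept pairs would then be stored and updated in each iteration, with every other pair treated as having similarity 0.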
Random Surfer-Pairs Model • SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards • Background: Basic Forward Random Walk • Motion is in discrete steps, using edges of the graph. • Each time step, there is an equal probability of moving from your current vertex to one of your out-neighbors. • Given adjacency matrix A, the probability of walking from x to y is p_xy = a_xy / |O(x)|. • Random Walk as a Markov Process • Initial location is described by the prob. distribution vector π(0) • Prob. of being at y at time 1: $\pi_y(1) = \sum_x \pi_x(0)\, p_{xy}$
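Random Walk Transition Matrices • Given adjacency matrix A with entries a_xy: • The forward and backward transition matrices: the forward matrix has entries a_xy / |O(x)| (a step from x to out-neighbor y); the backward matrix has entries a_yx / |I(x)| (a step from x back to in-neighbor y)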
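Both matrices can be derived from the adjacency matrix by normalizing with out-degrees and in-degrees. The following is a minimal pure-Python sketch; the function names, the convention that `backward[x][y]` is the probability of stepping from x back to in-neighbor y, and the three-vertex example are assumptions made for illustration. Rows for vertices with no outgoing (or incoming) edges are left as zeros.

```python
def transition_matrices(adj):
    """Return (forward, backward) transition matrices for a 0/1 adjacency matrix.

    forward[x][y]  = adj[x][y] / out-degree(x)
    backward[x][y] = adj[y][x] / in-degree(x)
    """
    n = len(adj)
    out_degree = [sum(row) for row in adj]
    in_degree = [sum(adj[x][y] for x in range(n)) for y in range(n)]
    forward = [
        [adj[x][y] / out_degree[x] if out_degree[x] else 0.0 for y in range(n)]
        for x in range(n)
    ]
    backward = [
        [adj[y][x] / in_degree[x] if in_degree[x] else 0.0 for y in range(n)]
        for x in range(n)
    ]
    return forward, backward


def step(distribution, matrix):
    """One Markov step: new[y] = sum over x of distribution[x] * matrix[x][y]."""
    n = len(matrix)
    return [sum(distribution[x] * matrix[x][y] for x in range(n)) for y in range(n)]


# Hypothetical 3-vertex graph with edges 0 -> 1, 0 -> 2, 1 -> 2.
toy_adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
toy_forward, toy_backward = transition_matrices(toy_adj)
after_one_step = step([1.0, 0.0, 0.0], toy_forward)
```

Starting at vertex 0, one forward step splits the probability evenly between vertices 1 and 2, while a backward step from vertex 2 is equally likely to land on vertex 0 or vertex 1.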
Paired Backwards Random Walk • Probability of walking backwards from a to x in one step: p(x ← a) = a_xa / |I(a)| • Two walkers meet at x if they start at a and b, and if one goes x ← a and the other goes x ← b, respectively. s_x(a,b) = P(meeting at x) = π(a,b) p(x ← a) p(x ← b); s(a,b) = P(meeting) = Σ_x π(a,b) p(x ← a) p(x ← b) • If they start together, they have already met, so s^(0)_xy = 1 if x = y; 0 otherwise [identity matrix] • Then
Experiments: Data Sets • Two data sets • ResearchIndex (www.researchindex.com) • a corpus of scientific research papers • 688,898 cross-references among 278,628 papers • Student transcripts • 1030 undergraduate students in the School of Engineering at Stanford University • Each transcript lists all courses that the student has taken so far (average: 40 courses/student)
Performance Validation Metric • Problem: It is difficult to know the “correct” similarity between items. • Solution: Define a rough domain-specific metric σ(p,q): • For scientific papers, we have two versions: • σ_C(p,q) = fraction of q’s citations also cited by p • σ_T(p,q) = fraction of words in q’s title also in p’s title • For university courses: • σ_D(p,q) = 1 if p, q are in the same department, else 0
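The three rough metrics are simple enough to sketch directly. The function names below are hypothetical, and the tokenization of titles (lowercasing, splitting on whitespace, and counting each distinct word once) is an assumption, since the slides do not say how words are extracted.

```python
def sigma_c(citations_p, citations_q):
    """Fraction of q's citations that p also cites."""
    cited_by_q = set(citations_q)
    if not cited_by_q:
        return 0.0
    return len(cited_by_q & set(citations_p)) / len(cited_by_q)


def sigma_t(title_p, title_q):
    """Fraction of the distinct words in q's title that also appear in p's title."""
    words_p = set(title_p.lower().split())
    words_q = set(title_q.lower().split())
    if not words_q:
        return 0.0
    return len(words_q & words_p) / len(words_q)


def sigma_d(department_p, department_q):
    """1 if both courses are in the same department, else 0."""
    return 1.0 if department_p == department_q else 0.0


# Hypothetical inputs.
citation_overlap = sigma_c(["x", "y", "z"], ["y", "z", "w", "v"])
title_overlap = sigma_t("Scalable Graph Similarity", "graph similarity search")
```

Note that σ_C and σ_T are not symmetric: swapping p and q changes the denominator.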
Computing the Performance Score • Run the similarity algorithms: • SimRank (naïve, pruned, minmax) • Co-Citation • For each object p and algorithm A, form a set top_{A,N}(p) of the N objects most similar to p. • For each q ∈ top_{A,N}(p), compute σ(p,q). • Return the average σ_{A,N}(p) over all q.
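The scoring procedure for a single object p can be sketched as a sort followed by an average. This is a sketch under assumptions: the function name `performance_score`, the representation of an algorithm's output as a dict of similarity scores to p, and the course data are all hypothetical.

```python
def performance_score(p, similarity_to_p, sigma, n):
    """Average sigma(p, q) over the n objects most similar to p.

    similarity_to_p maps each candidate q (other than p) to its similarity with p,
    as produced by whichever similarity algorithm is being evaluated.
    """
    ranked = sorted(similarity_to_p, key=similarity_to_p.get, reverse=True)
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(sigma(p, q) for q in top) / len(top)


# Hypothetical course data and similarity scores.
department = {"CS101": "CS", "CS102": "CS", "MATH1": "MATH", "HIST1": "HIST"}
scores_for_cs101 = {"CS102": 0.9, "MATH1": 0.5, "HIST1": 0.1}


def same_department(p, q):
    return 1.0 if department[p] == department[q] else 0.0


score_top2 = performance_score("CS101", scores_for_cs101, same_department, n=2)
```

Here the two courses ranked most similar to CS101 are CS102 and MATH1, one of which shares its department, so the score for N = 2 is 0.5.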
Experiment: Scientific Papers • Setup • Used bipartite SimRank, only considering in-neighbors (validation uses out-neighbors) • N ∈ {5, 10, …, 45, 50} • Results • Not very sensitive to decay factors C1 and C2 • Pruning the search radius had little effect on the rank order of scores.
Experiment: Students and Courses • Setup • Bipartite domain • N ∈ {5, 10} • Results • Min-Max version of SimRank performed the best • Not very sensitive to decay factors C1 and C2
Results: Students and Courses • Co-citation scores are very poor (0.161 for N = 5 and 0.147 for N = 10), so they are not shown in the graph.
Conclusions • Defined a recursive model of structural similarity between objects in a network • Mathematically formulated SimRank based on the recursive concept • Presented a convergent algorithm to compute SimRank • Described a random-walk interpretation of SimRank equations and scores • Experimentally validated SimRank over two real data sets
Open Issues and Critique • O(n²) is large; scalability needs to be improved. • s(a,b) only includes contributions from paths where a and b are the same distance from some x. What if the distances are offset (total is odd)? • As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)! • Addressed partially by the minimax method
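The third critique can be made concrete with a short worked case. Assume a and b share exactly the same n in-neighbors x_1, …, x_n, and assume (for illustration only) that distinct in-neighbors have zero similarity to each other, for instance because they have no in-neighbors of their own. With the averaged SimRank form, only the n matching pairs contribute:

```latex
\begin{align*}
s(a,b) &= \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{n} \sum_{j=1}^{n} s(x_i, x_j) \\
       &= \frac{C}{n^2} \sum_{i=1}^{n} s(x_i, x_i) \\
       &= \frac{C}{n^2} \cdot n = \frac{C}{n}
\end{align*}
```

So with n = 1 the score is C, but with n = 4 it drops to C/4, even though the in-neighborhoods are identical in both cases.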