SimRank: A Measure of Structural-Context Similarity

SimRank: A Measure of Structural-Context Similarity Glen Jeh & Jennifer Widom KDD 2002

Motivation • Many applications require a measure of “similarity” between objects. • Web search • Shopping recommendations • Search for “Related Works” among scientific papers • But “similarity” may be domain-dependent. • Can we define a generic model for similarity?

Common Ground • What do all these applications have in common? A data set of objects linked by a set of relations. • Then, a generic concept of similarity is structural-context similarity. • “Two objects are similar if they relate to similar objects.” • Recall automorphic equivalence: • “Two objects are equivalent if they relate to equivalent objects.”

Problem Statement • Given a Graph G = (V, E), for each pair of vertices a,b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.

Basic Graph Model • Directed Graph G = (V,E) • V = set of objects • E = set of unweighted edges • Edge (u,v) exists if there is a relation u → v • I(v) = set of in-neighbors of vertex v • O(v) = set of out-neighbors of vertex v

SimRank Similarity • Recursive Model • “Two objects are similar if they are referenced by similar objects” • That is, a ~ b if • c → a and d → b, and • c ~ d • An object is equivalent to itself (score = 1) • Example • ProfA ~ ProfB because both are referenced by Univ. • StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}

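Basic SimRank Equation • s(a,b) = similarity between a and b = average similarity between in-neighbors of a and in-neighbors of b, scaled by C • s(a,b) is in the range [0, 1] • If a = b, then s(a,b) = 1 • If a ≠ b, $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$, where $I_i(a)$ is the i-th in-neighbor of a • C is a constant, 0 < C < 1 • If I(a) = ∅ or I(b) = ∅, then s(a,b) = 0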

Decay Factor C • x is identical to itself: s(x,x) = 1 • Since we have x → a and x → b, should s(a,b) = 1 also? • If the graph represented all the information about x, a, and b, then s(a,b) would ideally = 1. • But in reality the graph does not describe everything about them, so we expect s(a,b) < 1. • Therefore, the constant C expresses our limited confidence, or decay with distance: s(a,b) = C ∙ average similarity of (I(a), I(b))
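
As a worked instance of the decay factor, assume x is the only in-neighbor of both a and b (an assumption made here for illustration, since the slide's figure is not reproduced). Writing the average over in-neighbor pairs out explicitly gives:

```latex
s(a,b) \;=\; \frac{C}{|I(a)|\,|I(b)|}\; s(x,x) \;=\; \frac{C}{1 \cdot 1} \cdot 1 \;=\; C,
\qquad \text{e.g. } C = 0.8 \;\Rightarrow\; s(a,b) = 0.8 < 1 .
```

Each additional step away from a shared ancestor multiplies in another factor of C, which is the sense in which similarity decays with distance.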

G² Paired-Vertex Perspective • Given graph G, define G² = (V², E²) where • V² = V × V. Each vertex in V² is a pair of vertices in V. • E²: (a,b) → (c,d) in G² iff a → c and b → d in G • Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex.
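
The pair-graph construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pair_graph` and the small Univ/Prof/Student edge list are assumptions chosen to echo the slides' example, and unordered pairs are represented as sorted tuples so that (a,b) and (b,a) collapse into one vertex.

```python
from itertools import product

def pair_graph(edges):
    """Build the edge set of the pair graph over unordered pairs.

    There is an edge (a,b) -> (c,d) whenever a -> c and b -> d are
    edges of the original graph. Sorting each pair merges (a,b)
    with (b,a).
    """
    pair_edges = set()
    for (a, c), (b, d) in product(edges, repeat=2):
        src = tuple(sorted((a, b)))
        dst = tuple(sorted((c, d)))
        pair_edges.add((src, dst))
    return pair_edges

# Hypothetical example graph (not taken verbatim from the slides).
example_edges = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
]
example_pair_edges = pair_graph(example_edges)
```

In this example the self-pair (Univ, Univ) points to (ProfA, ProfB), which in turn points to (StudentA, StudentB), so similarity has a path to flow along from a source pair.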

Source and Flow of Similarity • SimRank score for a vertex (a,b) in G² = similarity between a and b in G. • The source of similarity is self-vertices, like (Univ, Univ). • Then, similarity propagates along pair-paths in G², away from the sources. • Note that values decrease away from (Univ, Univ).

SimRank in Bipartite Domains • Bipartite: 2 types of objects • Example: Buyers and Items

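Bipartite SimRank Equations • Two types of similarity: • Two buyers are similar if they buy similar items • Out-neighbors of buyers are relevant: $s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))$ for A ≠ B • Two items are similar if they are bought by similar buyers • In-neighbors of items are relevant: $s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))$ for c ≠ d • In general, we can use I(.) and/or O(.) for any graph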
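
A minimal sketch of the bipartite case follows, assuming buyer similarity averages item similarity over the items each buyer bought (scaled by C1) and item similarity averages buyer similarity over the buyers of each item (scaled by C2). The function name, the `purchases` dictionary, and the default constants are illustrative assumptions, not part of the original slides.

```python
def bipartite_simrank(purchases, C1=0.8, C2=0.8, K=5):
    """Iterate buyer and item similarities together for K rounds.

    purchases maps each buyer to the list of items bought
    (the buyer's out-neighbors).
    """
    buyers = list(purchases)
    items = sorted({i for bought in purchases.values() for i in bought})
    # In-neighbors of each item: the buyers who bought it.
    bought_by = {i: [b for b in buyers if i in purchases[b]] for i in items}

    sb = {(a, b): float(a == b) for a in buyers for b in buyers}
    si = {(c, d): float(c == d) for c in items for d in items}

    for _ in range(K):
        new_sb, new_si = {}, {}
        for a in buyers:
            for b in buyers:
                if a == b:
                    new_sb[(a, b)] = 1.0
                elif not purchases[a] or not purchases[b]:
                    new_sb[(a, b)] = 0.0
                else:
                    total = sum(si[(c, d)]
                                for c in purchases[a] for d in purchases[b])
                    new_sb[(a, b)] = (C1 * total
                                      / (len(purchases[a]) * len(purchases[b])))
        for c in items:
            for d in items:
                if c == d:
                    new_si[(c, d)] = 1.0
                else:
                    total = sum(sb[(a, b)]
                                for a in bought_by[c] for b in bought_by[d])
                    new_si[(c, d)] = (C2 * total
                                      / (len(bought_by[c]) * len(bought_by[d])))
        sb, si = new_sb, new_si
    return sb, si

# Hypothetical data: two buyers with identical baskets.
example_purchases = {"A": ["x", "y"], "B": ["x", "y"]}
```

Even with identical baskets the buyers do not reach a score of 1 here, because every item of A is also compared against the non-matching items of B.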

MiniMax Variant • Motivation: Two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1} • SimRank compares each course of A with each course of B • But intuitively we just want the best matching pairs: s(Eng1A, Eng1B), s(Math1A, Math1B), etc. • Solution: Two steps • Max: Pair each neighbor of A with only its most similar neighbor of B: $s_A(A,B) = \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{j} s(O_i(A), O_j(B))$. Do the same in the other direction: $s_B(A,B) = \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{i} s(O_i(A), O_j(B))$ • Min: Final s(A,B) is the smaller of $s_A(A,B)$ and $s_B(A,B)$ [weakest link]
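
The two steps can be sketched as a single evaluation of s(A,B), given some existing course-to-course similarity. This is a hedged illustration: `minimax_similarity`, `plain_average_similarity`, and the identity-style `same_course` scoring function are hypothetical names introduced here, and the course similarities are assumed to be already known rather than computed recursively.

```python
def minimax_similarity(courses_a, courses_b, course_sim, C1=0.8):
    """Max step in both directions, then take the weaker of the two."""
    if not courses_a or not courses_b:
        return 0.0
    # Each course of A is matched with its most similar course of B.
    s_a = C1 / len(courses_a) * sum(
        max(course_sim(c, d) for d in courses_b) for c in courses_a)
    # Same in the other direction.
    s_b = C1 / len(courses_b) * sum(
        max(course_sim(c, d) for c in courses_a) for d in courses_b)
    return min(s_a, s_b)

def plain_average_similarity(courses_a, courses_b, course_sim, C1=0.8):
    """All-pairs average, for comparison with the MiniMax score."""
    if not courses_a or not courses_b:
        return 0.0
    total = sum(course_sim(c, d) for c in courses_a for d in courses_b)
    return C1 * total / (len(courses_a) * len(courses_b))

def same_course(c, d):
    """Assumed toy similarity: 1 for the same course, 0 otherwise."""
    return 1.0 if c == d else 0.0

example_courses = ["Eng1", "Math1", "Chem1", "Hist1"]
```

Under these assumptions, two students with the same four courses get a MiniMax score of C1, while the all-pairs average gives only C1/4. If B took only two of A's four courses, the weakest-link minimum is set by the direction from A, where half of A's courses have no match.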

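Computing SimRank • $R_k(a,b)$ = estimate of SimRank after k iterations. • Initialization: $R_0(a,b) = 1$ if a = b, 0 otherwise • Iteration: $R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$ for a ≠ b (0 if I(a) or I(b) is empty), and $R_{k+1}(a,a) = 1$ • $R_k(a,b)$ is the similarity that has flowed a distance k away from the sources. $R_k$ values are non-decreasing as k increases. • We can prove that $R_k(a,b)$ converges to s(a,b)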
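
A minimal sketch of the naive iteration, assuming the graph is given as a dictionary from each vertex to its list of in-neighbors. Each round sets the score of a pair a ≠ b to C times the average previous-round score over all pairs of their in-neighbors, and keeps the diagonal at 1. The function name, the default C and K, and the example graph are illustrative assumptions.

```python
from itertools import product

def simrank(in_neighbors, C=0.8, K=5):
    """Return R_K as a dict keyed by ordered vertex pairs (a, b)."""
    nodes = list(in_neighbors)
    # R_0: 1 on the diagonal, 0 elsewhere.
    R = {(a, b): 1.0 if a == b else 0.0
         for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                R_next[(a, b)] = 1.0
            elif not in_neighbors[a] or not in_neighbors[b]:
                # No in-neighbors on one side: score is defined as 0.
                R_next[(a, b)] = 0.0
            else:
                total = sum(R[(i, j)]
                            for i in in_neighbors[a]
                            for j in in_neighbors[b])
                R_next[(a, b)] = (C * total
                                  / (len(in_neighbors[a]) * len(in_neighbors[b])))
        R = R_next
    return R

# Hypothetical graph: Univ -> ProfA, Univ -> ProfB,
# ProfA -> StudentA, ProfB -> StudentB.
example_in_neighbors = {
    "Univ": [],
    "ProfA": ["Univ"],
    "ProfB": ["Univ"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}
example_scores = simrank(example_in_neighbors)
```

In this small graph the professors score C and the students score C², one factor of C per step away from the shared source (Univ, Univ). The dictionary stores all n² pairs, which is the quadratic space cost the slides discuss.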

Time and Space Complexity • Space complexity: $O(n^2)$ to store $R_k(a,b)$ • Time complexity: $O(K n^2 d_2)$ for K iterations, where $d_2$ is the average of |I(a)||I(b)| over all vertex pairs (a,b) • To improve performance, we can prune G²: • Idea: vertices that are far apart should have very low similarity. We can approximate it as 0. • Select a radius r. If vertex-pair (a,b) cannot meet in less than r steps, remove it from the graph G². • Space complexity: $O(n d_r)$ • Time complexity: $O(K n d_r d_2)$, where $d_r$ = avg. number of neighbors within radius r.
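
One possible reading of the pruning step is sketched below: keep only vertex pairs whose hop distance, ignoring edge direction, is at most r, and treat every other pair as having similarity 0. Interpreting "within radius r" as undirected hop distance is an assumption made here; the slide does not spell out the exact criterion. The function name and example edges are likewise hypothetical.

```python
from collections import deque

def pairs_within_radius(edges, r):
    """Return the unordered vertex pairs at undirected distance <= r."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    kept = set()
    for start in adj:
        # Breadth-first search, cut off at depth r.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if dist[u] == r:
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for other in dist:
            kept.add(tuple(sorted((start, other))))
    return kept

# Hypothetical example graph.
radius_edges = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
]
```

Only the kept pairs would then need storage and updates, which is where the reduction from n² pairs to roughly n times the neighborhood size comes from.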

Random Surfer-Pairs Model • SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards • Background: Basic Forward Random Walk • Motion is in discrete steps, using edges of the graph. • At each time step, there is an equal probability of moving from your current vertex to each of your out-neighbors. • Given adjacency matrix A, the probability of walking from x to y is $p_{xy} = a_{xy}/|O(x)|$. • Random Walk as a Markov Process • Initial location is described by the prob. distribution vector π(0) • Prob. of being at y at time 1: $\pi_y(1) = \sum_x \pi_x(0)\, p_{xy}$

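Random Walk Transition Matrices • Given adjacency matrix A with entries $a_{xy}$: • The forward transition matrix has entries $p_{xy} = a_{xy}/|O(x)|$ • The backward transition matrix has entries $a_{yx}/|I(x)|$, the probability of stepping from x back to its in-neighbor y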
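
A small sketch of both transition matrices and of one Markov step, using plain nested lists. It assumes the uniform-walk convention: a forward step from x picks each out-neighbor with probability 1/|O(x)|, and a backward step from x picks each in-neighbor with probability 1/|I(x)|. Rows for vertices with no out-neighbors (or no in-neighbors) are left as all zeros, which is a choice made for this sketch. The 3-vertex adjacency matrix is a made-up example.

```python
def transition_matrices(A):
    """Return (forward, backward) transition matrices for adjacency A."""
    n = len(A)
    out_deg = [sum(row) for row in A]
    in_deg = [sum(A[x][y] for x in range(n)) for y in range(n)]
    # forward[x][y]: probability of stepping x -> y along an edge.
    forward = [[A[x][y] / out_deg[x] if out_deg[x] else 0.0
                for y in range(n)] for x in range(n)]
    # backward[x][y]: probability of stepping from x back to in-neighbor y.
    backward = [[A[y][x] / in_deg[x] if in_deg[x] else 0.0
                 for y in range(n)] for x in range(n)]
    return forward, backward

def step(pi, P):
    """One Markov step: new_pi[y] = sum over x of pi[x] * P[x][y]."""
    n = len(P)
    return [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]

# Hypothetical graph with edges 0 -> 1, 0 -> 2, 1 -> 2.
example_A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
example_forward, example_backward = transition_matrices(example_A)
```

Starting a forward walker at vertex 0 and calling `step([1.0, 0.0, 0.0], example_forward)` spreads the probability evenly over vertices 1 and 2.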

Paired Backwards Random Walk • Probability of walking backwards from a to x in one step: $p(x \to a) = 1/|I(a)|$ if x ∈ I(a), else 0 • Two walkers meet at x if they start at a and b, and if one goes backwards along x → a and the other along x → b, respectively. • $s_x(a,b) = P(\text{meeting at } x) = \pi(a,b)\, p(x \to a)\, p(x \to b)$ • $s(a,b) = P(\text{meeting}) = \sum_x \pi(a,b)\, p(x \to a)\, p(x \to b)$ • If they start together, they have met, so $s^{(0)}_{xy} = 1$ if x = y; 0 otherwise [identity matrix]
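
The surfer-pairs view can be checked numerically on a small graph. The sketch below assumes the formulation from the SimRank paper in which the score is the expected value of C raised to the number of steps until the two backward walkers first meet; the slide states this only informally. It tracks the probability mass of not-yet-met pairs step by step. The function name, the step cutoff `max_steps`, and the example graph are illustrative assumptions.

```python
def expected_meeting_score(in_neighbors, a, b, C=0.8, max_steps=20):
    """Sum of C**k * P(first meeting at step k) for k = 1..max_steps."""
    if a == b:
        return 1.0
    dist = {(a, b): 1.0}  # probability of each not-yet-met position pair
    score = 0.0
    for k in range(1, max_steps + 1):
        nxt = {}
        for (u, v), p in dist.items():
            iu, iv = in_neighbors[u], in_neighbors[v]
            if not iu or not iv:
                continue  # one walker is stuck, so this pair never meets
            share = p / (len(iu) * len(iv))
            for x in iu:
                for y in iv:
                    if x == y:
                        score += share * C ** k  # first meeting at step k
                    else:
                        nxt[(x, y)] = nxt.get((x, y), 0.0) + share
        dist = nxt
    return score

# Hypothetical graph: Univ -> ProfA, Univ -> ProfB,
# ProfA -> StudentA, ProfB -> StudentB.
walk_in_neighbors = {
    "Univ": [],
    "ProfA": ["Univ"],
    "ProfB": ["Univ"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}
```

On this graph the two professors always meet at Univ after one backward step and the two students after two, giving C and C² respectively.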

Experiments: Data Sets • Two data sets • ResearchIndex (www.researchindex.com) • a corpus of scientific research papers • 688,898 cross-references among 278,628 papers • Students' transcripts • 1030 undergraduate students in the School of Engineering at Stanford University • Each transcript lists all courses that the student has taken so far (average: 40 courses/student)

Performance Validation Metric • Problem: It is difficult to know the “correct” similarity between items. • Solution: Define a rough domain-specific metric σ(p,q): • For scientific papers, we have two versions: • $\sigma_C(p,q)$ = fraction of q’s citations also cited by p • $\sigma_T(p,q)$ = fraction of words in q’s title also in p’s title • For university courses: • $\sigma_D(p,q)$ = 1 if p, q are in the same department, else 0
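
The two paper metrics can be sketched as simple set overlaps. The tokenization used here (lowercasing, splitting on whitespace, counting each distinct word once) is an assumption; the slides do not say how titles were split into words. Function names and sample inputs are hypothetical.

```python
def sigma_citation(p_cites, q_cites):
    """Fraction of q's citations that p also cites."""
    q = set(q_cites)
    if not q:
        return 0.0
    return len(q & set(p_cites)) / len(q)

def sigma_title(p_title, q_title):
    """Fraction of distinct words in q's title that also occur in p's title."""
    p_words = set(p_title.lower().split())
    q_words = set(q_title.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & p_words) / len(q_words)
```

Both metrics are asymmetric: swapping p and q changes the denominator, so σ(p,q) and σ(q,p) can differ.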

Computing the Performance Score • Run the similarity algorithms: • SimRank (naïve, pruned, MiniMax) • Co-Citation • For each object p and algorithm A, form a set $\mathrm{top}_{A,N}(p)$ of the N objects most similar to p. • For each q ∈ $\mathrm{top}_{A,N}(p)$, compute σ(p,q). • Return the average $\sigma_{A,N}(p)$ over all q.
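
The scoring procedure can be sketched as follows, assuming the similarity algorithm and the validation metric are both supplied as two-argument functions. The function name `performance_score`, the department table, and the similarity values are made-up illustrations, not data from the experiments.

```python
def performance_score(p, candidates, similarity, sigma, N):
    """Average sigma(p, q) over the N candidates most similar to p."""
    ranked = sorted((q for q in candidates if q != p),
                    key=lambda q: similarity(p, q),
                    reverse=True)
    top = ranked[:N]
    if not top:
        return 0.0
    return sum(sigma(p, q) for q in top) / len(top)

# Hypothetical course data for the same-department metric.
example_dept = {"CS101": "CS", "CS102": "CS", "EE101": "EE", "EE102": "EE"}
example_sim = {"CS102": 0.6, "EE101": 0.3, "EE102": 0.1}  # similarity to CS101

def example_similarity(p, q):
    return example_sim[q]

def sigma_department(p, q):
    return 1.0 if example_dept[p] == example_dept[q] else 0.0

example_result = performance_score(
    "CS101", list(example_dept), example_similarity, sigma_department, N=2)
```

With N = 2 the top set for CS101 is {CS102, EE101}; one of the two shares a department, so the score is 0.5.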

Experiment: Scientific Papers • Setup • Used bipartite SimRank, only considering in-neighbors (validation uses out-neighbors) • N ∈ {5, 10, …, 45, 50} • Results • Not very sensitive to decay factors C1 and C2 • Pruning the search radius had little effect on the rank order of scores.

Results: Scientific Papers

Experiment: Students and Courses • Setup • Bipartite domain • N ∈ {5, 10} • Results • Min-Max version of SimRank performed the best • Not very sensitive to decay factors C1 and C2

Results: Students and Courses • Co-citation scores are very poor (0.161 for N=5 and 0.147 for N=10), so they are not shown in the graph.

Conclusions • Defined a recursive model of structural similarity between objects in a network • Mathematically formulated SimRank based on the recursive concept • Presented a convergent algorithm to compute SimRank • Described a random-walk interpretation of SimRank equations and scores • Experimentally validated SimRank over two real data sets

Open Issues and Critique • O(n²) is large; scalability needs to be improved. • s(a,b) only includes contributions from paths where a and b are the same distance from some x. What if the distances are offset (total is odd)? • As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)! • Addressed partially by the MiniMax method
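
To make the third point concrete, suppose a and b share the same n in-neighbors x_1, …, x_n, and assume for illustration that distinct in-neighbors have zero similarity to each other. Averaging over all n² in-neighbor pairs then gives:

```latex
s(a,b) \;=\; \frac{C}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} s(x_i, x_j)
       \;=\; \frac{C}{n^2} \cdot n \;=\; \frac{C}{n}
```

Under that assumption, with C = 0.8 the score falls from 0.8 at n = 1 to 0.2 at n = 4, although the two in-neighbor sets are identical in both cases.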
