SimRank: A Measure of Structural-Context Similarity

SimRank: A Measure of Structural-Context Similarity • Glen Jeh & Jennifer Widom • Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, Edmonton, Alberta, Canada

Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

Introduction • Many applications require a measure of “similarity” between objects. • “Similar Pages” in search engines (Google) • Recommender systems (amazon.com) • Citations of scientific papers (citeseer.com)

Introduction • Similarity characteristics • Various aspects of objects determine similarity • Similarity varies across domains • They found • Similarity in a domain can be modeled with graphs [nodes -> objects & edges -> relationships] • They proposed • A general approach that exploits the object-to-object relationships found in many domains • A similarity measure between nodes based on the structural context in which they appear • Their intuition: • It resembles Google's PageRank algorithm • Similar objects are related to similar objects (a recursive definition); the base case is that objects are similar to themselves.

Introduction • Example: Web Pages Graph

Related Work • Co-citation and bibliographic coupling • Co-citation: sim. btw. papers p & q is based on the number of papers that cite both p & q • Bibliographic coupling: sim. is based on the number of papers cited by both p & q • Iterative algorithms over the web graph • Google’s PageRank: a page’s authority is decided by its neighbors’ authorities • Similarity in textual content • Vector-cosine similarity, Pearson correlation in IR • Collaborative filtering • Recommender systems
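
The two counting measures above can be sketched in a few lines. This is a minimal illustration, not code from the paper; the edge list and the function names are made up for the example.

```python
# Toy citation graph as (citing paper, cited paper) pairs. All names are hypothetical.
citations = [("x", "p"), ("x", "q"), ("y", "p"), ("y", "q"),
             ("p", "r"), ("q", "r"), ("q", "s")]

def in_neighbors(edges, v):
    """Papers that cite v."""
    return {src for src, dst in edges if dst == v}

def out_neighbors(edges, v):
    """Papers that v cites."""
    return {dst for src, dst in edges if src == v}

def co_citation(edges, p, q):
    # Number of papers that cite both p and q.
    return len(in_neighbors(edges, p) & in_neighbors(edges, q))

def bibliographic_coupling(edges, p, q):
    # Number of papers cited by both p and q.
    return len(out_neighbors(edges, p) & out_neighbors(edges, q))

print(co_citation(citations, "p", "q"))             # 2 (x and y cite both)
print(bibliographic_coupling(citations, "p", "q"))  # 1 (both cite r)
```

Both measures look only at immediate neighbors, which is the limitation SimRank is designed to remove.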

Related Work • SimRank vs. Co-citation • Prof1 & Prof2 share a common parent • Detected by the co-citation metric • StudentA, StudentB & StudentC share a common grandparent • Detected by SimRank

Basic Graph Model • G = (V, E) [vertices, edges] • Nodes in V -> objects in the domain, e.g. nodes p & q • Directed edges in E -> relationships btw. objects, e.g. <p, q> • For a node v, define: • I(v): the set of in-neighbors of v • O(v): the set of out-neighbors of v • I_i(v): an individual in-neighbor, 1 ≤ i ≤ |I(v)| • O_i(v): an individual out-neighbor, 1 ≤ i ≤ |O(v)|
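
One possible in-memory representation of this model is a pair of neighbor maps built from an edge list. The sketch below is only an illustration; the node names and the helper `build_neighbors` are hypothetical.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): maps from a node to its list of in- and out-neighbors."""
    I, O = defaultdict(list), defaultdict(list)
    for p, q in edges:   # directed edge <p, q>
        O[p].append(q)   # q is an out-neighbor of p
        I[q].append(p)   # p is an in-neighbor of q
    return I, O

# Hypothetical toy graph.
edges = [("Univ", "ProfA"), ("Univ", "ProfB"), ("ProfA", "StuA"), ("ProfB", "StuB")]
I, O = build_neighbors(edges)
print(I["ProfA"])  # ['Univ']
print(O["Univ"])   # ['ProfA', 'ProfB']
print(I["Univ"])   # [] since Univ has no in-neighbors in this toy graph
```

Keeping the neighbors in lists gives each one an index, which matches the I_i(v) and O_i(v) notation.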

Basic Graph Model • Sim. can be thought of as “propagating” from pair to pair. • Given a graph G, define G^2 = (V^2, E^2), where V^2 = V x V and each node in V^2 represents a pair (a,b) of nodes in G • An edge from <a,c> to <b,d> exists in E^2 iff the edges <a,b> and <c,d> exist in G • Two types of nodes are omitted: 1. Singletons like {ProfA, ProfA} 2. Zero-similarity nodes like {ProfA, StuA}
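
The pair graph can be built directly from the definition on this slide. The sketch below treats pairs as ordered tuples, which is a simplifying assumption, and uses a made-up three-node graph.

```python
from itertools import product

def pair_graph(nodes, edges):
    """Build the pair graph: (a, c) -> (b, d) iff a -> b and c -> d in G."""
    edge_set = set(edges)
    v2 = list(product(nodes, repeat=2))  # V x V
    e2 = {((a, c), (b, d)) for (a, b) in edge_set for (c, d) in edge_set}
    return v2, e2

# Hypothetical toy graph.
nodes = ["Univ", "ProfA", "ProfB"]
edges = [("Univ", "ProfA"), ("Univ", "ProfB")]
v2, e2 = pair_graph(nodes, edges)
print(len(v2))                                      # 9 pairs
print((("Univ", "Univ"), ("ProfA", "ProfB")) in e2)  # True
```

The edge from (Univ, Univ) to (ProfA, ProfB) is the path along which similarity propagates from the singleton pair to the two professors.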

Motivation • Recursive intuition • Two objects are similar if they are referenced by similar objects • An object is maximally similar to itself (score = 1)

SimRank Equation: Homogeneous domain • Homogeneous: consisting of one type of object • Similarity btw. a & b is denoted by s(a,b) ∈ [0,1] • If a = b, s(a,b) = 1, i.e. s(a,a) = s(b,b) = 1 • Otherwise: $s(a,b) = \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$ • C is called the “confidence level” or “decay factor”, a constant btw. 0 & 1 • If |I(a)| or |I(b)| is 0, s(a,b) = 0 • Symmetric: s(a,b) = s(b,a) • Apart from the factor C, the similarity btw. a & b is the average similarity btw. in-neighbors of a and in-neighbors of b

The decay factor C • Why use C? • Page x references two pages: c & d • We know s(x,x) = 1 • So, can we confidently say s(c,d) = s(x,x) = 1? • C < 1 represents that similarity decays across edges • C is an empirical value that depends on the domain
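
To make the decay concrete, here is the arithmetic for the smallest case. It assumes, beyond what the slide states, that x is the only in-neighbor of both c and d, and it takes C = 0.8 purely as an illustrative value. It uses the paper's homogeneous SimRank equation, in which s(a,b) is C times the average similarity over pairs of in-neighbors.

```latex
s(c,d) = \frac{C}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c), I_j(d)\bigr)
       = \frac{C}{1 \cdot 1}\, s(x,x)
       = C = 0.8
```

So c and d come out similar, but strictly less similar than x is to itself.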

Extension: SimRank in Bipartite domain • Bipartite: consisting of 2 types of objects • Recommender system: Buyer and Item

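Extension: SimRank in Bipartite domain • Bipartite equation • Directed edges go from people to items, so • s(A,B) denotes the sim. btw. persons A & B, for A ≠ B: $s(A,B) = \frac{C_1}{|O(A)||O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))$ • s(c,d) denotes the sim. btw. items c & d, for c ≠ d: $s(c,d) = \frac{C_2}{|I(c)||I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))$ • Sim. btw. people A & B is the average sim. btw. the items they purchased • Sim. btw. items c & d is the average sim. btw. the people who purchased them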
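
The mutual reinforcement between people and items can be sketched as a fixed number of alternating updates. This is a minimal sketch assuming the paper's bipartite equations (people averaged over out-neighbors with constant C1, items over in-neighbors with constant C2); the function name, the basket data, and the defaults C1 = C2 = 0.8 and K = 5 are choices made for the example.

```python
def bipartite_simrank(purchases, C1=0.8, C2=0.8, K=5):
    """purchases: list of (person, item) edges, directed person -> item."""
    people = sorted({p for p, _ in purchases})
    items = sorted({i for _, i in purchases})
    bought = {p: [i for q, i in purchases if q == p] for p in people}  # O(A)
    buyers = {i: [p for p, j in purchases if j == i] for i in items}   # I(c)

    # Iteration 0: every object is similar only to itself.
    s_people = {(A, B): float(A == B) for A in people for B in people}
    s_items = {(c, d): float(c == d) for c in items for d in items}

    for _ in range(K):
        new_people, new_items = {}, {}
        for A, B in s_people:
            if A == B:
                new_people[(A, B)] = 1.0
                continue
            total = sum(s_items[(c, d)] for c in bought[A] for d in bought[B])
            new_people[(A, B)] = C1 * total / (len(bought[A]) * len(bought[B]))
        for c, d in s_items:
            if c == d:
                new_items[(c, d)] = 1.0
                continue
            total = sum(s_people[(A, B)] for A in buyers[c] for B in buyers[d])
            new_items[(c, d)] = C2 * total / (len(buyers[c]) * len(buyers[d]))
        s_people, s_items = new_people, new_items
    return s_people, s_items


# Hypothetical shopping baskets, for illustration only.
purchases = [("A", "sugar"), ("A", "eggs"), ("B", "eggs"), ("B", "flour")]
s_people, s_items = bipartite_simrank(purchases)
print(round(s_people[("A", "B")], 3))
print(round(s_items[("sugar", "flour")], 3))
```

Sugar and flour are never bought by the same person here, yet they receive a nonzero score because their buyers are similar to each other.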

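Extension: the MiniMax variation in Bipartite domain • Finding sim. btw. students and btw. courses based on the students’ course-taking history • Intermediate terms: $s_A(A,B) = \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{j} s(O_i(A), O_j(B))$ and $s_B(A,B) = \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{i} s(O_i(A), O_j(B))$ • Finally, $s(A,B) = \min(s_A(A,B), s_B(A,B))$ • Can be applied to course sim. s(c,d) too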
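
A small sketch may help show how the MiniMax variation differs from plain averaging. It assumes the form given in the paper as I understand it: each of A's courses is matched with its most similar course of B, the matches are averaged and scaled by C1, the same is done from B's side, and the smaller of the two values is taken. The course codes and the toy `course_sim` function are hypothetical stand-ins for real course similarities.

```python
def minimax_similarity(courses_a, courses_b, course_sim, C1=0.8):
    """MiniMax student similarity from two non-empty course lists."""
    # s_A: for each course of A, take its best match among B's courses.
    s_a = C1 * sum(max(course_sim(c, d) for d in courses_b)
                   for c in courses_a) / len(courses_a)
    # s_B: for each course of B, take its best match among A's courses.
    s_b = C1 * sum(max(course_sim(c, d) for c in courses_a)
                   for d in courses_b) / len(courses_b)
    return min(s_a, s_b)

def course_sim(c, d):
    """Toy course similarity: 1 if identical, 0.5 if same department prefix."""
    if c == d:
        return 1.0
    return 0.5 if c[:2] == d[:2] else 0.0

score = minimax_similarity(["CS101", "CS102"], ["CS101", "EE200"], course_sim)
print(round(score, 2))  # 0.4
```

Here A's side gives about 0.6 and B's side gives 0.4, because EE200 has no good match among A's courses; taking the minimum keeps the unmatched course from being ignored.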

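Computing SimRank • Naïve method • For a graph G, initialize: $R_0(a,b) = 1$ if $a = b$, and $0$ otherwise • Iteration: $R_{k+1}(a,b) = \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$ for $a \neq b$, and $R_{k+1}(a,b) = 1$ for $a = b$ • $R_k(*,*)$ is non-decreasing as k increases • Also, $\lim_{k \to \infty} R_k(a,b) = s(a,b)$ • In experiments, $R_k$ converged rapidly, by about K = 5 • Space complexity: $O(n^2)$ to store the results $R_k$ • Time complexity: $O(K n^2 d_2)$, where $d_2$ is the average of |I(a)||I(b)| over all node pairs (a,b)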
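
The naïve method can be sketched as a direct dictionary-based iteration. This is an illustrative implementation, not the authors' code; it assumes the standard update in which each new score is C times the average of the previous scores over all in-neighbor pairs, and the toy graph, function name, and defaults C = 0.8 and K = 5 are choices made for the example.

```python
def simrank(nodes, edges, C=0.8, K=5):
    """Naive SimRank: K rounds over all n^2 node pairs."""
    in_nb = {v: [] for v in nodes}
    for p, q in edges:
        in_nb[q].append(p)

    # R_0: 1 on the diagonal, 0 elsewhere.
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(K):
        R_next = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    R_next[(a, b)] = 1.0
                elif not in_nb[a] or not in_nb[b]:
                    R_next[(a, b)] = 0.0  # no in-neighbors: score is 0
                else:
                    total = sum(R[(i, j)] for i in in_nb[a] for j in in_nb[b])
                    R_next[(a, b)] = C * total / (len(in_nb[a]) * len(in_nb[b]))
        R = R_next
    return R

# Hypothetical toy graph.
nodes = ["Univ", "ProfA", "ProfB", "StuA", "StuB"]
edges = [("Univ", "ProfA"), ("Univ", "ProfB"), ("ProfA", "StuA"), ("ProfB", "StuB")]
R = simrank(nodes, edges)
print(round(R[("ProfA", "ProfB")], 2))  # 0.8  (common parent)
print(round(R[("StuA", "StuB")], 2))    # 0.64 (common grandparent)
```

The two students share no parent, so co-citation would score them 0, while the iteration propagates similarity down from the shared grandparent with one factor of C per step.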

Computing SimRank • Pruning the logical graph G^2 • In the naïve method, • all $n^2$ nodes of G^2 are considered, and • sim. scores are computed for every node pair • But nodes far from a node v tend to have lower sim. scores with v than nodes near v • Pruning: • set the sim. btw. two nodes that are far apart to 0, and • consider only node pairs whose nodes are within radius r of each other • Space complexity: $O(n d_r)$, where $d_r$ is the average number of nodes within radius r of a node • Time complexity: $O(K n d_r d_2)$
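
The pruning step amounts to choosing which node pairs to keep before iterating. The sketch below selects those pairs with a breadth-first search; it only builds the candidate set and does not compute scores. Measuring distance while ignoring edge direction is an assumption of this sketch, and the chain graph and function name are made up.

```python
from collections import deque

def pairs_within_radius(nodes, edges, r):
    """Return the (u, v) pairs whose undirected distance is at most r."""
    adj = {v: set() for v in nodes}
    for p, q in edges:  # ignore direction when measuring distance (assumption)
        adj[p].add(q)
        adj[q].add(p)

    keep = set()
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if dist[u] == r:  # do not expand beyond the radius
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        keep.update((start, v) for v in dist)
    return keep

# Hypothetical chain a -> b -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
keep = pairs_within_radius(nodes, edges, r=2)
print(("a", "c") in keep)  # True
print(("a", "d") in keep)  # False, distance 3 exceeds r
print(len(keep))           # 14 of the 16 possible pairs
```

On a large sparse graph the kept set is much smaller than n^2, which is where the savings in space and time come from.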

Limited-Information Problem • Unpopular documents have few in-citations but may still be important, e.g. a new paper • The co-citation scheme fails in this case since it computes similarity only from immediate neighbors in the graph structure • SimRank, in contrast, uses the entire graph structure

Limited-Information Problem • Task: find a document similar to A • A is cited only by B, which also cites A1, A2, …, Am • Under co-citation, each of A1..Am is equally similar to A • But we want to use other information from further out in the graph: • Ai is more similar to A if it is also cited by other documents that are similar to B • Am is the better match since it is also cited by B’, which is similar to B

Limited-Information Problem • Goal: neither eliminate unpopular documents nor favor popular documents for every query • If a term is eliminated from the SimRank equation • Then a document b with high popularity would have a high similarity score with any other document a • Solution: asymmetric formula • P is a constant parameter adjustable by the end user.

Random Surfer-Pairs Model • The SimRank score s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backward • Refer to the paper for details.

Experiment Results • Test data sets • ResearchIndex (www.researchindex.com) • a corpus of scientific research papers • 688,898 cross-references among 278,628 papers • Student transcripts • 1030 undergraduate students in the School of Engineering at Stanford University • Each transcript lists all courses that the student has taken so far (average: 40 courses/student)

Experiment Results • Compare SimRank with co-citation • Evaluation algorithm • Generate a set $top_{A,N}(p)$ of the top N objects most similar to p (except p itself), according to algorithm A. • For each $q \in top_{A,N}(p)$, compute $\sigma(p,q)$, where $\sigma$ is a coarse domain-specific similarity measure. Return the average of these scores, $\sigma_{A,N}(p)$. • $\sigma_{A,N}(p)$ gives the average “actual” similarity to p of the top N objects that algorithm A decides are similar to p.
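
The evaluation procedure can be sketched as a rank-then-average function. This is an illustration only: `score` stands for the algorithm under test, `sigma` for the coarse external similarity measure, and both the names and all numbers below are hypothetical.

```python
def evaluate(p, candidates, score, sigma, N):
    """Average external similarity to p of the N objects that `score` ranks highest."""
    ranked = sorted((q for q in candidates if q != p),
                    key=lambda q: score(p, q), reverse=True)
    top = ranked[:N]  # top N objects according to the algorithm under test
    return sum(sigma(p, q) for q in top) / len(top)

# Hypothetical scores from the algorithm under test and from the external measure.
algo_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
external = {"a": 1.0, "b": 0.0, "c": 1.0}

result = evaluate("p", ["p", "a", "b", "c"],
                  score=lambda p, q: algo_scores[q],
                  sigma=lambda p, q: external[q],
                  N=2)
print(result)  # 0.5, the mean of sigma over the top two objects a and b
```

A higher average means the objects the algorithm ranks as most similar are also similar under the external measure.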

Experiment: Scientific Papers • Evaluation function • P=0.5, C1=C2=0.8 • Top N from 5 to 50

Experiment: Scientific Papers

Experiment: Students and Courses • Bipartite domain • External similarity for courses only, and based on departments: • C1=C2=0.8, N=5, N=10

Experiment: Students and Courses • Co-citation scores are very poor (0.161 for N = 5 and 0.147 for N = 10), so they are not shown in the graph.

Conclusion • Main contributions • A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank • A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation • Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank

Open Issues or Future Work • Didn’t address the efficiency and scalability issues • Didn’t consider more relationships in computing similarity • Didn’t combine other domain-specific similarity measures.

Thank You!
