
SimRank: A Measure of Structural-Context Similarity



Presentation Transcript


  1. SimRank: A Measure of Structural-Context Similarity Glen Jeh & Jennifer Widom Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, Edmonton, Alberta, Canada

  2. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  3. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  4. Introduction • Many applications require a measure of “similarity” between objects. • “Similar Pages” in a search engine (Google) • Recommender systems (amazon.com) • Citations of scientific papers (citeseer.com)

  5. Introduction • Similarity Characteristics • Various aspects of objects determine similarity • Similarity varies across domains • They found • The similarity in a domain can be modeled as a graph [nodes -> objects & edges -> relationships] • They proposed • A general approach that exploits the object-to-object relationships found in many domains • A similarity measure between nodes based on the structural context in which they appear • Their intuition: • Resembles Google’s PageRank algorithm • Similar objects are related to similar objects (a recursive definition); the base case is that every object is maximally similar to itself.

  6. Introduction • Example: Web Pages Graph

  7. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  8. Related Work • Co-citation and bibliographic coupling • Co-citation: sim. btw. papers p & q is based on the number of papers that cite both p & q • Bibliographic coupling: sim. is based on the number of papers cited by both p & q • Iterative algorithms over the web graph • Google’s PageRank: a page’s authority is decided by its neighbors’ authorities • Similarity in textual content • Vector-cosine similarity, Pearson correlation in IR • Collaborative filtering • Recommender systems

  9. Related Work • SimRank vs. Co-citation • Prof1 & Prof2 share a common parent • captured by the co-citation metric • StudentA, StudentB & StudentC share a common grandparent • captured by SimRank

  10. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  11. Basic Graph Model • G = (V, E) [vertices, edges] • nodes in V -> objects in the domain, e.g. nodes p & q • directed edges in E -> relationships btw. objects, e.g. <p, q> • For a node v, define: • I(v): the set of in-neighbors of v • O(v): the set of out-neighbors of v • Ii(v): an individual in-neighbor • Oi(v): an individual out-neighbor

  12. Basic Graph Model • Sim. can be thought of as “propagating” from pair to pair. • Given graph G, define G2 = (V2, E2), where V2 = V × V; each node of G2 represents a pair (a, b) of nodes in G • An edge from (a, b) to (c, d) exists in E2 iff the edges <a, c> and <b, d> exist in G • Two types of node pairs are omitted: 1. Singletons like {ProfA, ProfA} 2. Zero-similarity pairs like {ProfA, StuA}
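In code, this pair-graph construction can be sketched as follows (a minimal illustration, not from the paper; the tiny example graph is made up):

```python
from itertools import product

def pair_graph_edges(edges):
    """Build the edge set of G^2 from a directed graph G given as edge tuples.

    An edge (a, b) -> (c, d) exists in G^2 iff a -> c and b -> d exist in G.
    """
    return {((a, b), (c, d))
            for (a, c), (b, d) in product(edges, repeat=2)}

# Tiny made-up example: ProfA -> StudentA, ProfB -> StudentB
g = {("ProfA", "StudentA"), ("ProfB", "StudentB")}
g2 = pair_graph_edges(g)
# The pair (ProfA, ProfB) points to (StudentA, StudentB) in G^2.
```

Note that this sketch keeps singleton pairs like (ProfA, ProfA); the slide’s pruning of singleton and zero-similarity pairs would be applied on top of it.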

  13. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  14. Motivation • Recursive Intuition • Two objects are similar if they are referenced by similar objects • An object is maximally similar to itself (score = 1)

  15. SimRank Equation: Homogeneous domain • Homogeneous: consisting of one type of object • Similarity btw. a & b is denoted s(a, b) • if a = b, s(a, b) = 1, so s(a, a) = s(b, b) = 1 • otherwise: s(a, b) = C / (|I(a)| |I(b)|) · Σi Σj s(Ii(a), Ij(b)) • C is called the “confidence level” or “decay factor”, a constant btw. 0 & 1 • if |I(a)| or |I(b)| is 0, s(a, b) = 0 • symmetric: s(a, b) = s(b, a) • Similarity btw. a & b is the average similarity btw. the in-neighbors of a and the in-neighbors of b
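One evaluation of this equation can be sketched directly (my own illustration, not the authors’ code; `s_prev` holds scores from a previous iteration and `in_nbrs` maps each node to its in-neighbors):

```python
def simrank_eq(s_prev, in_nbrs, a, b, C=0.8):
    """One evaluation of the SimRank equation for the pair (a, b)."""
    if a == b:
        return 1.0                      # base case: s(a, a) = 1
    Ia, Ib = in_nbrs.get(a, []), in_nbrs.get(b, [])
    if not Ia or not Ib:
        return 0.0                      # a node with no in-neighbors scores 0
    total = sum(s_prev[(i, j)] for i in Ia for j in Ib)
    return C * total / (len(Ia) * len(Ib))

# Two pages cited only by x: their score is C * s(x, x) = C.
s0 = {("x", "x"): 1.0}
in_nbrs = {"c": ["x"], "d": ["x"]}
print(simrank_eq(s0, in_nbrs, "c", "d"))   # 0.8
```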

  16. The decay factor C • Why use C? • page x references two pages: c & d • we know s(x, x) = 1 • so, can we say s(c, d) = s(x, x) = 1 confidently? • C < 1 represents that similarity decays across edges • C is an empirical value chosen per domain

  17. Extension: SimRank in Bipartite domain • Bipartite: consisting of 2 types of objects • Recommender system: Buyer and Item

  18. Extension: SimRank in Bipartite domain • Bipartite Equation • Directed edges go from persons to items, so • s(A, B) denotes the sim. btw. persons A & B: s(A, B) = C1 / (|O(A)| |O(B)|) · Σi Σj s(Oi(A), Oj(B)) • s(c, d) denotes the sim. btw. items c & d: s(c, d) = C2 / (|I(c)| |I(d)|) · Σi Σj s(Ii(c), Ij(d)) • Sim. btw. persons A & B is the average sim. btw. the items they purchased • Sim. btw. items c & d is the average sim. btw. the people who purchased them
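The person-side equation can be sketched as follows (a minimal illustration; the purchase data, item names, and prior item scores are all made up):

```python
def person_sim(s_items, out_nbrs, A, B, C1=0.8):
    """s(A, B): C1 times the average item similarity over A's and B's purchases."""
    if A == B:
        return 1.0
    Oa, Ob = out_nbrs[A], out_nbrs[B]
    total = sum(s_items[frozenset((i, j))] if i != j else 1.0
                for i in Oa for j in Ob)
    return C1 * total / (len(Oa) * len(Ob))

# Hypothetical purchases; item scores come from a prior iteration.
buys = {"A": ["sugar", "frosting"], "B": ["sugar", "eggs"]}
s_items = {frozenset(("sugar", "frosting")): 0.0,
           frozenset(("sugar", "eggs")): 0.0,
           frozenset(("frosting", "eggs")): 0.4}
print(person_sim(s_items, buys, "A", "B"))  # C1 * (1 + 0 + 0 + 0.4) / 4 ≈ 0.28
```

The item-side equation is the mirror image, averaging person scores over in-neighbors with C2.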

  19. Extension: the MiniMax variation in Bipartite domain • Finding sim. btw. students and btw. courses based on students’ course-taking history • Intermediate terms and the final MiniMax formula are given in the paper • The same variation can be applied to course sim. s(c, d) too

  20. Computing SimRank • Naïve Method • For a graph G, start from R0(a, b) = 1 if a = b, 0 otherwise • Iteration: Rk+1(a, b) = C / (|I(a)| |I(b)|) · Σi Σj Rk(Ii(a), Ij(b)) for a ≠ b, with Rk+1(a, a) = 1 • Rk(*, *) is non-decreasing as k increases • Also, Rk(a, b) converges to s(a, b) as k → ∞ • In experiments, Rk converges rapidly; K = 5 iterations suffice • Space complexity: O(n2) to store the result Rk • Time complexity: O(Kn2d2), where d2 is the average of |I(a)| |I(b)| over all node pairs (a, b)
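The naïve fixed-point iteration above can be sketched as a small in-memory routine (my own sketch, not the authors’ implementation; the toy citation graph is made up):

```python
def simrank(in_nbrs, nodes, C=0.8, K=5):
    """Naive SimRank: K synchronous iterations over all O(n^2) node pairs."""
    # R0(a, b) = 1 if a = b, 0 otherwise
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(K):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                    continue
                Ia, Ib = in_nbrs.get(a, []), in_nbrs.get(b, [])
                if not Ia or not Ib:
                    nxt[(a, b)] = 0.0
                    continue
                total = sum(R[(i, j)] for i in Ia for j in Ib)
                nxt[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = nxt
    return R

# Toy graph: u cites both c and d; v cites only d.
in_nbrs = {"c": ["u"], "d": ["u", "v"]}
nodes = ["u", "v", "c", "d"]
R = simrank(in_nbrs, nodes, C=0.8, K=5)
# s(c, d) = C * (s(u, u) + s(u, v)) / 2 = 0.8 * (1 + 0) / 2 = 0.4
```

Storing the full dict `R` is the O(n2) space cost named on the slide; each iteration touches every pair, giving the O(Kn2d2) time cost.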

  21. Computing SimRank • Pruning the logical graph G2 • In the naïve method, • all n2 nodes of G2 are considered, and • sim. scores are computed for every node pair • But nodes far from a node v have lower sim. scores with v than nodes near v • Pruning: • set the sim. btw. two nodes far apart to 0, and • consider node pairs only for nodes near each other, within radius r • space complexity: O(n·dr) • time complexity: O(K·n·dr·d2)

  22. Limited-Information Problem • Unpopular documents with few in-citations may still be important, e.g. a new paper • The co-citation scheme fails in this case since it computes sim. only from immediate neighbors in the graph structure • But SimRank uses the entire graph structure

  23. Limited-Information Problem • Task: find a document similar to A • A is cited only by B, which also cites A1, A2, …, Am • Under co-citation, every one of A1..Am is equally similar to A • But we want to use other outlying information: • Ai is more similar to A if it is also cited by other documents that are similar to B • Am is the better match since it is also cited by B’, which is similar to B

  24. Limited-Information Problem • Goal: neither eliminate unpopular documents, nor favor popular documents for every query • If the normalizing factor is eliminated from the SimRank equation • then a document b with high popularity would have a high similarity score with any other document a • Solution: an asymmetric formula (see the paper) • P is a constant parameter adjustable by the end user.

  25. Random Surfer-Pairs Model • The SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards • Refer to the paper for details.
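This interpretation can be checked numerically with a small Monte Carlo sketch (my own illustration, assuming s(a, b) is approximated by the expected value of C^t, where t is the step at which backward walks from a and b first meet):

```python
import random

def expected_meeting_score(in_nbrs, a, b, C=0.8, walks=5000, max_steps=30, seed=1):
    """Estimate E[C^t] over reverse random walks started at a and b."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        x, y = a, b
        for step in range(max_steps):
            if x == y:
                total += C ** step   # surfers met after `step` backward moves
                break
            Ix, Iy = in_nbrs.get(x, []), in_nbrs.get(y, [])
            if not Ix or not Iy:
                break                # a walk died; contributes 0
            x, y = rng.choice(Ix), rng.choice(Iy)
    return total / walks

# Both c and d are cited only by u, so the walks always meet after one step:
# the estimate is C itself.
print(expected_meeting_score({"c": ["u"], "d": ["u"]}, "c", "d"))  # ≈ 0.8
```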

  26. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  27. Experiment Results • Test data sets • ResearchIndex (www.researchindex.com) • a corpus of scientific research papers • 688,898 cross-references among 278,628 papers • Students’ transcripts • 1,030 undergraduate students in the School of Engineering at Stanford University • Each transcript lists all courses the student has taken so far (average: 40 courses/student)

  28. Experiment Results • Compare SimRank with co-citation • Evaluation algorithm • Generate a set topA,N(p) of the top N objects most similar to p (excluding p itself), according to algorithm A. • For each q in topA,N(p), compute σ(p, q), where σ is a coarse domain-specific similarity measure. Return the average of these scores. • This average gives the “actual” similarity to p of the top N objects that algorithm A decides are similar to p.
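The evaluation reduces to averaging σ over the returned list; a sketch with a made-up σ (the course names and department-prefix rule are hypothetical, not the paper’s actual measure):

```python
def avg_actual_similarity(top_n, sigma, p):
    """Average sigma(p, q) over algorithm A's top-N list for query p."""
    return sum(sigma(p, q) for q in top_n) / len(top_n)

# Hypothetical coarse measure for courses: 1 if same department prefix, else 0.
def sigma(p, q):
    return 1.0 if p.split()[0] == q.split()[0] else 0.0

top5 = ["CS 145", "CS 245", "EE 108", "CS 161", "CS 224"]
print(avg_actual_similarity(top5, sigma, "CS 345"))  # 4 of 5 match -> 0.8
```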

  29. Experiment: Scientific Papers • Evaluation function • P=0.5, C1=C2=0.8 • Top N from 5 to 50

  30. Experiment: Scientific Papers

  31. Experiment: Students and Courses • Bipartite domain • External similarity for courses only, and based on departments: • C1=C2=0.8, N=5, N=10

  32. Experiment: Students and Courses Co-citation scores are very poor (0.161 for N=5 and 0.147 for N=10), so they are not shown in the graph.

  33. Outline • Introduction • Related Work • Basic Graph Model • SimRank • Experimental Results • Conclusion

  34. Conclusion • Main contributions • A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank • A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation • Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank

  35. Open Issues or Future Work • Didn’t address efficiency and scalability issues • Didn’t consider more relationship types in computing similarity • Didn’t combine SimRank with other domain-specific similarity measures

  36. Thank You!
