
SimRank: A Measure of Structural-Context Similarity

Glen Jeh and Jennifer Widom, KDD 2002. CS 519 Class Presentation. Presenter: Anh Pham.



  1. SimRank: A Measure of Structural-Context Similarity • Glen Jeh and Jennifer Widom, KDD 2002 • CS 519 Class Presentation • Presenter: Anh Pham

  2. Outline of the talk • Introduction to Structural Context Similarity • SimRank • Computing SimRank • Naïve method • Pruning • Example • Limited information problem • Random surfer pair model • Experimental results • Strong and weak points • Quiz

  3. The problem of finding similar objects • There are many applications • Finding similar documents • Collaborative filtering: • Find similar users • Find similar items

  4. Aspects of objects for similarity • Many aspects can define similarity • Documents: common words, sentences, … • Users: common preferences

  5. Structure similarity • This paper proposes a general approach that can be applied whenever the data can be represented as a graph • Web pages • User preferences • Scientific networks

  6. Example of structure similarity • Intuition: similar objects are related to similar objects • Example: Prof. A has student A and Prof. B has student B • Prof. A and Prof. B are similar, since they are from the same university • Recursively, student A and student B are similar • If we know the similarity of Prof. A and Prof. B, we can estimate the similarity between student A and student B

  7. Some basic notation in graph models • Graph G=(V,E), where V represents the nodes and E represents the edges • For nodes p and q, <p,q> denotes the edge from p to q • I(v) denotes the in-neighbors of v • O(v) denotes the out-neighbors of v • Example (figure with nodes A, B, C, D, E): I(C)={A,B} and O(A)={C,D}
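
The notation on this slide can be sketched in Python. The edge list below contains only the edges implied by the slide's example, I(C)={A,B} and O(A)={C,D}; the figure's other edges (for instance those touching E) are not recoverable from the transcript, and the helper name `neighbor_maps` is a hypothetical choice for illustration.

```python
from collections import defaultdict

# Edges implied by the slide's example: I(C) = {A, B} and O(A) = {C, D}.
# Other edges of the original figure are unknown and deliberately omitted.
edges = [("A", "C"), ("B", "C"), ("A", "D")]

def neighbor_maps(edges):
    """Return (I, O): in-neighbor and out-neighbor sets keyed by node."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for p, q in edges:          # <p, q> is the edge from p to q
        out_nbrs[p].add(q)      # q is an out-neighbor of p
        in_nbrs[q].add(p)       # p is an in-neighbor of q
    return in_nbrs, out_nbrs

I, O = neighbor_maps(edges)
```

With this edge list, `I["C"]` is `{"A", "B"}` and `O["A"]` is `{"C", "D"}`, matching the slide.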

  8. Node-pair graph • Create a node-pair graph G² from G: each node of G² is a pair (p,q) of nodes of G • <(p,q),(a,b)> is an edge of G² if <p,a> and <q,b> are edges of G
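
A minimal sketch of the G² construction, assuming ordered pairs for simplicity (SimRank is symmetric, so an implementation could equally keep one copy of each unordered pair). The toy edge list and the function name `node_pair_graph` are hypothetical.

```python
from itertools import product

def node_pair_graph(edges):
    """Edges of G^2: ((p, q), (a, b)) whenever <p, a> and <q, b> are edges of G."""
    return {((p, q), (a, b)) for (p, a), (q, b) in product(edges, repeat=2)}

edges = [("A", "C"), ("B", "C"), ("A", "D")]  # hypothetical toy graph
g2 = node_pair_graph(edges)
```

Here the pair (A,B) points to (C,C) in G², because A→C and B→C are both edges of G.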

  9. Simrank motivation • Intuition: similar objects are related to similar objects • Univ = Univ → Sim(Univ, Univ) = 1 • Prof. A is related to Univ and Prof. B is related to Univ → Sim(Prof. A, Prof. B) = 0.414 < 1 • Student A is related to Prof. A and Student B is related to Prof. B → Sim(Student A, Student B) = 0.331 < 1

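  10. Simrank equation • Similarity between a and b: $s(a,b) = \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$ • Example (figure with nodes A, B, D, E, F): assume C=1, I(F)={A,B}, I(D)={A}, and S(B,A)=0.5 • $S(F,D) = \frac{1}{2 \cdot 1}\,[S(A,A)+S(B,A)] = \frac{1}{2}(1+0.5) = 0.75$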

  11. Simrank equation (1) • Properties of the similarity s(a,b) between a and b: • s(a,b) is symmetric: s(a,b)=s(b,a) • s(a,a)=1 • s(a,x)=0 for x≠a if x has no in-neighbors

  12. Simrank equation (2) • Similarity between a and b: • s(a,b) is normalized into [0,1] • Proof: by induction over the iterations • C<1 • each term s(I_i(a),I_j(b)) ≤ 1, so C times their average stays below 1

  13. Simrank equation (3) • Similarity between a and b: • The factor C should be <1 • C represents the confidence level (a decay factor): similarity propagated from the parent nodes is attenuated by C at each step

  14. Bipartite Simrank • Consider a recommendation system: • How can we recommend an item to a new buyer? • A and B are similar since they both buy frosting and eggs → recommend flour to A

  15. Bipartite Simrank (mutually-reinforcing rule) • Rule 1: People are similar if they purchase similar items • Rule 2: Items are similar if they are purchased by similar people • Rule 1 reinforces Rule 2, and vice versa • Example: 1. If frosting and eggs are similar, then A and B are also similar. 2. If A and B are similar, then frosting and eggs are similar. • Observation: sugar and flour come out as similar, even though they have no common customer.

  16. Bipartite Simrank (formula) • Rule 1: People are similar if they purchase similar items • Rule 2: Items are similar if they are purchased by similar people • Each rule is then written in math form
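
The two formula images did not survive in this transcript. As a hedged reconstruction following the bipartite formulation in the SimRank paper, with A, B denoting people, c, d denoting items, and C_1, C_2 the two decay constants (symbol names assumed here), the rules read:

```latex
% Rule 1: people A, B are similar if the items they purchase (out-neighbors) are similar
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A),\, O_j(B)\bigr)

% Rule 2: items c, d are similar if the people who purchase them (in-neighbors) are similar
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c),\, I_j(d)\bigr)
```

Each equation feeds the other, which is the mutual reinforcement described by the two rules.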

  17. Bipartite Simrank (homogeneous domain extension) • Previously: similarity was computed from in-links only • Why not use out-links as well? → the extension: a second score computed from out-links • Depending on the application, use either score or both (compare the HITS algorithm)

  18. Minimax extension • Example: Given CS students A and B • Both A and B take the required CS courses • As an elective, A takes sociology • As an elective, B takes English • Previously: every course of A was compared with every course of B • How can we compare only A's CS courses with B's CS courses, and A's elective courses with B's elective courses? • It is meaningless to compare A's CS courses with B's elective courses

  19. Minimax extension (Cont.) • Example: Given CS students A and B • Both A and B take the required CS courses. As electives, A takes sociology and B takes English • Compare each course of A only with the most similar course of B
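
The idea of matching each course only with its most similar counterpart can be written as a worked formula. This is a sketch following the minimax variant in the SimRank paper; the symbols s_A, s_B and the single constant C are notation assumed here, with O(A) the set of courses taken by student A:

```latex
% Match every course of A with its best counterpart among B's courses
s_A(A,B) = \frac{C}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{1 \le j \le |O(B)|} s\bigl(O_i(A),\, O_j(B)\bigr)

% Symmetric direction: match every course of B with its best counterpart among A's courses
s_B(A,B) = \frac{C}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{1 \le i \le |O(A)|} s\bigl(O_i(A),\, O_j(B)\bigr)

% Final score: the weaker of the two directions
s(A,B) = \min\bigl(s_A(A,B),\; s_B(A,B)\bigr)
```

Taking the minimum keeps the score symmetric and prevents a student with only a few courses from looking similar to everyone.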

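  20. Naïve method to compute Simrank • The naïve method is iterative • R_k(a,b) stores the similarity of a and b at the k-th iteration • Initialize R_0(a,b)=1 if a=b and R_0(a,b)=0 otherwise • Update R_{k+1} from R_k: $R_{k+1}(a,b) = \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$ for a≠b, and R_{k+1}(a,a)=1 • Repeat until convergence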
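
The iteration on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `simrank_naive`, the default C=0.8, the fixed iteration count K in place of a convergence test, and the toy graph are all assumptions.

```python
from itertools import product

def simrank_naive(nodes, edges, C=0.8, K=5):
    """Run K rounds of the naive SimRank update and return R_K as a dict."""
    in_nbrs = {v: [] for v in nodes}
    for p, q in edges:                      # edge <p, q>: p is an in-neighbor of q
        in_nbrs[q].append(p)

    # R_0(a, b) = 1 if a == b, else 0
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                R_next[(a, b)] = 1.0        # s(a, a) = 1 by definition
            elif not in_nbrs[a] or not in_nbrs[b]:
                R_next[(a, b)] = 0.0        # no in-neighbors: similarity is 0
            else:
                total = sum(R[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                R_next[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        R = R_next                          # R_{k+1} is built only from R_k
    return R

# Hypothetical toy graph (a simplified tree, not the figure from the slides)
nodes = ["Univ", "ProfA", "ProfB", "StudentA", "StudentB"]
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
R = simrank_naive(nodes, edges)
```

On this tree the professors share their only in-neighbor, so their score is C; the students inherit that score one level down, scaled by C again.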

  21. Time analysis of the naïve method • 1. Assume there are n nodes in G → O(n²) space is required to store all pairs • 2. Let d be the average of |I(a)||I(b)| over node pairs → each iteration takes O(d) per pair • 3. Let K be the number of iterations • From 1, 2 and 3, the time complexity is O(d·n²·K) • Empirical note: K≈5 in practice

  22. Pruning to reduce time complexity • Previously, we assumed the node-pair graph has n² nodes → all pairs are considered • In practice, for a given node a, a node v that is far from a has s(v,a)≈0 → it is more efficient to consider only the nodes within radius r of a • Figure: if v is within radius r of a, s_{k+1}(a,v) is updated from s_k as usual; otherwise s_{k+1}(a,v)=0, since the two nodes are far apart

  23. Time analysis of pruning • Previously, all n² pairs: O(d·n²·K) • Now, only pairs within radius r: O(d·n·r·K), where r here stands for the average number of nodes within that radius of a node
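
One possible way to select the pairs kept by pruning is a breadth-first search of depth r from every node. The sketch below assumes distance is measured while ignoring edge direction, which the slides do not state; the function name `pairs_within_radius` and the chain graph are hypothetical.

```python
from collections import deque

def pairs_within_radius(nodes, edges, r):
    """Return the ordered pairs (a, v) such that v is within r hops of a."""
    adj = {v: set() for v in nodes}
    for p, q in edges:                  # assumption: ignore edge direction
        adj[p].add(q)
        adj[q].add(p)

    pairs = set()
    for a in nodes:
        dist = {a: 0}                   # BFS distances from a
        queue = deque([a])
        while queue:
            u = queue.popleft()
            if dist[u] == r:            # do not expand beyond radius r
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        pairs.update((a, v) for v in dist)
    return pairs

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]   # hypothetical chain graph
near1 = pairs_within_radius(nodes, edges, 1)
near2 = pairs_within_radius(nodes, edges, 2)
```

Only the returned pairs would be stored and updated in each iteration; every other pair is treated as having similarity 0.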

  24. How Simrank solves the "limited information problem" (LIP) • Limited information problem: • Find papers similar to A • There is little information (only B cites A) • Among A1, A2, …, Am, which one is most similar to A? • The co-citation algorithm cannot solve LIP: • All of A1, A2, …, Am share one common in-link with A → they are all equally similar to A • Simrank can solve LIP • A is cited by B', and B' is similar to B → Am is more similar to A than the other Ai

  25. Random Surfer Pair model • The random surfer pair model provides an intuitive interpretation of SimRank • Example: SimRank(m,d) can be explained with random walks started at m and d • Case 1: high probability that m and d meet in one step

  26. Random Surfer Pair model (Cont.) • Case 1 (figure with a single in-neighbor a): high probability that m and d meet → SimRank(m,d) is high • Case 2 (figure with in-neighbors a and y): lower probability that m and d meet → SimRank(m,d) is lower

  27. Random Surfer Pair model (Cont.) • Two surfers start at m and d and walk backward along edges, one step at a time (Step 1, Step 2, …), until they meet • Expected meeting distance m(m,d): the expected number of steps until the two surfers first meet • SimRank(m,d) equals the expected-f meeting distance of (m,d), i.e. the expectation of C^l over meeting walks of length l, rather than the plain distance m(m,d)
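
The formulas on this slide were lost in extraction. As a hedged reconstruction following the random surfer-pair model in the SimRank paper, let t range over walks in the node-pair graph (taken with edges reversed) that start at (a,b) and first reach a singleton pair (x,x), let P[t] be the probability of that walk, and let l(t) be its length. The symbol s' is notation assumed here.

```latex
% Expected meeting distance: average number of steps until the two surfers meet
m(a,b) = \sum_{t:\,(a,b) \rightsquigarrow (x,x)} P[t]\; l(t)

% Expected-f meeting distance with f(z) = C^z, for 0 < C < 1
s'(a,b) = \sum_{t:\,(a,b) \rightsquigarrow (x,x)} P[t]\; C^{\,l(t)}

% The SimRank score coincides with the expected-f meeting distance
s(a,b) = s'(a,b)
```

Because C^l shrinks as l grows, pairs whose surfers tend to meet quickly get scores close to 1, and pairs that never meet get 0.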

  28. Experimental setup • Datasets: • ResearchIndex dataset: papers and their citations • Almost 700,000 cross-citations among 270,000 papers • Student and course dataset: students and their courses (bipartite graph) • 1030 students, each taking around 40 courses

  29. Experimental setup (Cont.) • Baseline method: • Co-citation: measures the number of shared objects • How the algorithms are evaluated: • Select an object p • Select the top N objects most similar to p • Average their similarity scores under a domain-specific measure
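
A minimal sketch of the co-citation baseline and the top-N selection step, assuming citations are stored as a map from each paper to the set of papers citing it. The data and the names `cocitation_scores` and `top_n` are hypothetical, and the domain-specific evaluation measure is not reproduced here.

```python
def cocitation_scores(in_nbrs, p):
    """Score every other object q by the number of in-neighbors it shares with p."""
    citers_of_p = in_nbrs.get(p, set())
    return {q: len(citers_of_p & citers) for q, citers in in_nbrs.items() if q != p}

def top_n(scores, n):
    """Return the n highest-scoring objects, ties broken alphabetically."""
    return sorted(scores, key=lambda q: (-scores[q], q))[:n]

# Hypothetical citation data: paper -> set of papers that cite it
in_nbrs = {
    "A": {"B1", "B2"},
    "X": {"B1", "B2"},
    "Y": {"B2"},
    "Z": {"B3"},
}
scores = cocitation_scores(in_nbrs, "A")
best = top_n(scores, 2)
```

In this data X shares two citing papers with A, Y shares one, and Z shares none, so the top two objects for A are X and Y.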

  30. Experimental results

  31. Trend: computing SimRank on MapReduce • Delta-SimRank Computing on MapReduce. Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine'12). • Only values greater than zero need to be sent → this saves communication cost over MapReduce

  32. Good points • The paper proposes a novel, general method to compute the similarity of objects based on the structure of the data • The paper proposes a method to compute the scores and an efficient pruning technique • The paper provides an intuition for the method • Good experimental results support the idea

  33. Weak points • Scalability: the paper should address very large graphs • It could incorporate a distributed design. Since the algorithm is a fixed-point process, how to parallelize it is a research problem in its own right

  34. Quiz • Intuitively, in which of the two graphs (figure: two graphs, each containing nodes a and b) is the SimRank of a and b higher?
