SimRank: A Measure of Structural-Context Similarity • Glen Jeh and Jennifer Widom, KDD 2002 • CS 519 Class Presentation • Presenter: Anh Pham
Outline of the talk • Introduction to Structural Context Similarity • SimRank • Computing SimRank • Naïve method • Pruning • Example • Limited information problem • Random surfer pair model • Experimental results • Strong and weak points • Quiz
The problem of finding similar objects • There are many applications • Finding similar documents • Collaborative filtering: finding similar users and finding similar items
Aspects of objects used for similarity • Many aspects can make objects similar • Documents: common words, sentence… • Users: common preferences
Structural similarity • This paper proposes a general approach that can be applied whenever the data can be represented as a graph • Web pages • User preferences • Scientific networks
Example of structural similarity • Intuition: similar objects are related to similar objects • Example: Prof. A has student A and Prof. B has student B • Prof. A and Prof. B are similar, since they are from the same university • Recursively, student A and student B are similar • If we know the similarity of Prof. A and Prof. B, we can estimate the similarity between student A and student B
Some basic notation for graph models • Graph G=(V,E), where V is the set of nodes and E is the set of edges • For nodes p and q, <p,q> denotes the edge from p to q • I(v) denotes the in-neighbors of v • O(v) denotes the out-neighbors of v • Example (graph on nodes A, B, C, D, E): I(C)={A,B} and O(A)={C,D}
Node-pair graph • Create a node-pair graph G² from G: each node of G² is a pair of nodes of G • <(p,q),(a,b)> is an edge in G² if <p,a> and <q,b> are edges in G • Example:
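The construction rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `node_pair_graph` and the toy edge list are assumptions made for the example, and pairs are treated as ordered for simplicity.

```python
from itertools import product

def node_pair_graph(edges):
    """Build the edge set of G^2 from the directed edges (p, a) of G."""
    pair_edges = set()
    for (p, a), (q, b) in product(edges, repeat=2):
        # <(p, q), (a, b)> is in G^2 when <p, a> and <q, b> are both in G.
        pair_edges.add(((p, q), (a, b)))
    return pair_edges

# Hypothetical toy graph (not the one drawn on the slide).
edges = [("Univ", "ProfA"), ("Univ", "ProfB")]
g2 = node_pair_graph(edges)
```

With two edges in G, the sketch produces four edges in G², all leaving the pair (Univ, Univ).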
SimRank motivation • Intuition: similar objects are related to similar objects • Univ = Univ, so Sim(Univ, Univ) = 1 • Prof. A and Prof. B are both related to Univ, so Sim(Prof. A, Prof. B) = 0.414 < 1 • Student A is related to Prof. A and Student B is related to Prof. B, so Sim(Student A, Student B) = 0.331 < 1
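SimRank equation • Similarity between a and b: $s(a,b) = \frac{C}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$ • Example: assume C = 1, I(F) = {A, B}, I(D) = {A}, and s(B,A) = 0.5. Then $s(F,D) = \frac{1}{2 \cdot 1}\,[s(A,A) + s(B,A)] = \frac{1}{2}(1 + 0.5) = 0.75$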
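The worked example on this slide can be checked with a short Python sketch. The helper name `simrank_step` is hypothetical, and the in-neighbor sets I(F) = {A, B} and I(D) = {A} are inferred from the arithmetic shown on the slide rather than from the (missing) figure.

```python
def simrank_step(a, b, in_nbrs, s, C):
    """Evaluate the SimRank equation once for the pair (a, b)."""
    if a == b:
        return 1.0
    Ia, Ib = in_nbrs.get(a, []), in_nbrs.get(b, [])
    if not Ia or not Ib:
        return 0.0  # no in-neighbors: similarity is defined as 0
    total = sum(s(x, y) for x in Ia for y in Ib)
    return C * total / (len(Ia) * len(Ib))

# In-neighbor sets inferred from the slide's arithmetic (assumption).
in_nbrs = {"F": ["A", "B"], "D": ["A"]}
known = {("A", "A"): 1.0, ("B", "A"): 0.5}

def s(x, y):
    return known[(x, y)]

value = simrank_step("F", "D", in_nbrs, s, C=1.0)
```

Here `value` equals (1 + 0.5) / (2 · 1) = 0.75, matching the slide.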
SimRank equation (1) • Properties of the similarity s(a,b) between a and b: • s(a,b) is symmetric: s(a,b) = s(b,a) • s(a,a) = 1 • s(a,x) = 0 for x ≠ a if x has no in-neighbors
SimRank equation (2) • The similarity s(a,b) always lies in [0,1] • Proof: by induction • C < 1 • s(I_i(a), I_j(b)) ≤ 1, so the average over all in-neighbor pairs is at most 1
SimRank equation (3) • The factor C should be < 1 • C represents the confidence level with which similarity is propagated from the parent nodes
Bipartite SimRank • Consider a recommendation system: • How can we recommend an item to a new buyer? • A and B are similar since they both buy frosting and eggs, so recommend flour to A
Bipartite SimRank (mutually-reinforcing rules) • Rule 1: People are similar if they purchase similar items • Rule 2: Items are similar if they are purchased by similar people • Rule 1 reinforces Rule 2, and vice versa • Example: • 1. If frosting and eggs are similar, then A and B are also similar • 2. If A and B are similar, then frosting and eggs are similar • Observation: sugar and flour come out as similar even though they have no common customer
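The two mutually-reinforcing rules can be sketched as an alternating iteration in Python. This is an illustrative sketch under stated assumptions: the function name `bipartite_simrank`, the constants C1 = C2 = 0.8, the fixed 5 iterations, and the purchase lists (A also buys sugar, B also buys flour) are chosen for the example and are not taken from the slide.

```python
def bipartite_simrank(buys, C1=0.8, C2=0.8, iters=5):
    """Alternate Rule 1 (people) and Rule 2 (items) for a fixed number of rounds."""
    people = sorted(buys)
    items = sorted({i for bought in buys.values() for i in bought})
    bought_by = {i: [p for p in people if i in buys[p]] for i in items}
    sp = {(p, q): float(p == q) for p in people for q in people}
    si = {(c, d): float(c == d) for c in items for d in items}
    for _ in range(iters):
        new_sp, new_si = {}, {}
        for p in people:
            for q in people:
                if p == q:
                    new_sp[p, q] = 1.0
                else:
                    # Rule 1: people are similar if they purchase similar items.
                    total = sum(si[c, d] for c in buys[p] for d in buys[q])
                    new_sp[p, q] = C1 * total / (len(buys[p]) * len(buys[q]))
        for c in items:
            for d in items:
                if c == d:
                    new_si[c, d] = 1.0
                else:
                    # Rule 2: items are similar if purchased by similar people.
                    total = sum(sp[p, q] for p in bought_by[c] for q in bought_by[d])
                    new_si[c, d] = C2 * total / (len(bought_by[c]) * len(bought_by[d]))
        sp, si = new_sp, new_si
    return sp, si

# Assumed purchase data for illustration.
buys = {"A": ["sugar", "frosting", "eggs"], "B": ["frosting", "eggs", "flour"]}
sp, si = bipartite_simrank(buys)
```

In this toy data, `si[("sugar", "flour")]` becomes positive after two rounds even though sugar and flour share no customer, because the similarity of A and B feeds back into the items.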
Bipartite SimRank (formula) • Rule 1: People are similar if they purchase similar items • Rule 2: Items are similar if they are purchased by similar people • Rule 1 (in math form) • Rule 2 (in math form)
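The formulas for this slide did not survive extraction. As a hedged reconstruction that follows the bipartite formulation in the SimRank paper, with persons A, B, items c, d, and decay constants C₁ and C₂ (symbols assumed here), the two rules can be written as:

```latex
% Rule 1: people are similar if they purchase similar items (A \neq B)
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|}
         \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A), O_j(B)\bigr)

% Rule 2: items are similar if they are purchased by similar people (c \neq d)
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|}
         \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c), I_j(d)\bigr)
```

As in the basic equation, s(A,A) = s(c,c) = 1, and a score is 0 when either neighbor set is empty.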
Bipartite SimRank (homogeneous domain extension) • Previously: similarity is computed from in-neighbors only • Extension: also use out-links, which gives a second score • Depending on the application, use either score or both (recall the HITS algorithm)
Minimax extension • Example: given CS students A and B • Both A and B take CS-required courses • For elective courses, A takes sociology • For elective courses, B takes English • Previously: every course of A is compared with every course of B • How can we compare only A's CS courses with B's CS courses, and A's elective courses with B's elective courses? • It is meaningless to compare A's CS courses with B's elective courses
Minimax extension (Cont.) • Example: given CS students A and B • Both A and B take CS-required courses. For elective courses, A takes sociology and B takes English • Solution: compare each course of A only with the most similar course of B
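The minimax formula itself is missing from the extracted slide. A hedged reconstruction, following the minimax variation described in the SimRank paper (the intermediate symbols s_A and s_B are assumed notation), is:

```latex
s_A(A,B) = \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{1 \le j \le |O(B)|} s\bigl(O_i(A), O_j(B)\bigr)

s_B(A,B) = \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{1 \le i \le |O(A)|} s\bigl(O_i(A), O_j(B)\bigr)

s(A,B) = \min\bigl(s_A(A,B),\, s_B(A,B)\bigr)
```

Each course of A is matched with its most similar course of B (and vice versa), so a CS course is no longer averaged against an unrelated elective.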
Naïve method to compute SimRank • The naïve method is iterative • R_k(a,b) stores the similarity of a and b at iteration k • Initialize R_0(a,b) = 1 if a = b, and R_0(a,b) = 0 otherwise • Update R_{k+1} from R_k using the SimRank equation • Repeat until convergence
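The iteration above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the name `naive_simrank`, the defaults C = 0.8 and K = 5, the use of a fixed iteration count instead of a convergence test, and the toy graph are all assumptions for illustration.

```python
def naive_simrank(edges, C=0.8, K=5):
    """Compute R_K by updating R_{k+1} from R_k over all node pairs."""
    nodes = sorted({v for e in edges for v in e})
    in_nbrs = {v: [] for v in nodes}
    for p, q in edges:  # edge <p, q>: p is an in-neighbor of q
        in_nbrs[q].append(p)
    # R_0(a, b) = 1 if a == b, 0 otherwise.
    R = {(a, b): float(a == b) for a in nodes for b in nodes}
    for _ in range(K):
        R_next = {}
        for a in nodes:
            for b in nodes:
                Ia, Ib = in_nbrs[a], in_nbrs[b]
                if a == b:
                    R_next[a, b] = 1.0
                elif not Ia or not Ib:
                    R_next[a, b] = 0.0
                else:
                    total = sum(R[x, y] for x in Ia for y in Ib)
                    R_next[a, b] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R

# Simplified toy graph (assumption; not the exact graph from the paper).
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
R = naive_simrank(edges)
```

On this tree-shaped toy graph the scores are easy to verify by hand: the professors share their only in-neighbor, so their score is C = 0.8, and the students inherit C · 0.8 = 0.64.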
Time analysis of the naïve method • 1. Assume there are n nodes in G: the space required to store all pairs is O(n²) • 2. Assume d is the average of |I(a)||I(b)|: each iteration takes O(d) per pair • 3. Assume K is the number of iterations • From 1, 2, and 3, the time complexity is O(dn²K) • Empirical note: K≈5 in practice
Pruning to reduce time complexity • Previously, we assumed the node-pair graph has n² nodes, that is, we considered all pairs • In practice, given a node a, a node v that is far from a is assumed to have s(v,a) = 0 • It is therefore more efficient to consider only the neighbors within radius r of a • Far pair: s_{k+1}(a,v) = 0, since a and v are far away • Near pair: s_{k+1}(a,v) = … s_k
Time analysis of pruning • Previously, all n² pairs: O(dn²K) • Now, only pairs within radius r: O(dnrK) • Near pair: s_{k+1}(a,v) = … s_k • Far pair: s_{k+1}(a,v) = 0, since a and v are far away
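The pruning idea can be sketched by enumerating only the pairs within radius r and treating every other pair as having similarity 0. In this sketch the names `within_radius` and `candidate_pairs` are hypothetical, distance is measured on the undirected version of the graph (an assumption), and the chain graph is a made-up example.

```python
from collections import deque

def within_radius(adj, source, r):
    """Nodes at undirected distance <= r from source, via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond radius r
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def candidate_pairs(edges, r):
    """Pairs (a, v) whose similarity is actually computed; all others are taken as 0."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    return {(a, v) for a in adj for v in within_radius(adj, a, r)}

# Hypothetical chain a -> b -> c -> d -> e.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
pairs = candidate_pairs(edges, r=2)
```

For this 5-node chain with r = 2, only 19 of the 25 ordered pairs are kept; pairs such as (a, e) are skipped and treated as 0.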
How SimRank solves the "limited information problem" (LIP) • Limited information problem: • Find papers similar to A • There is little information (only B cites A) • Among A1, A2, …, Am, which one is more similar to A? • The co-citation algorithm cannot solve LIP: • All of A1, A2, …, Am share one common in-link with A, so they are equally similar to A • SimRank can solve LIP: • Am is also cited by B', and B' is similar to B, so Am is more similar to A than the other Ai
Random Surfer Pair model • The random surfer pair model provides an intuitive interpretation of SimRank • Example: SimRank(m,d) can be explained with random walks • Case 1: high probability that m and d meet in one step
Random Surfer Pair model (Cont.) • Case 1: high probability that m and d meet, so SimRank(m,d) is high • Case 2: lower probability that m and d meet, so SimRank(m,d) is lower
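Random Surfer Pair model (Cont.) • How to compute m(m,d), the expected meeting distance of two surfers starting at m and d • SimRank(m,d) corresponds to the expected-f meeting distance of (m,d), a variant of m(m,d) in which longer meeting paths are discounted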
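The equations on this slide were lost in extraction. A hedged reconstruction, following the definitions in the SimRank paper (tours t in the node-pair graph G², their probability P[t] and length l(t) are notation assumed here), is:

```latex
% Expected meeting distance: tours t in G^2 from (a,b) to some singleton node (x,x)
m(a,b) = \sum_{t:(a,b) \rightsquigarrow (x,x)} P[t]\; l(t)

% Expected-f meeting distance with f(z) = c^z, 0 < c < 1
s'(a,b) = \sum_{t:(a,b) \rightsquigarrow (x,x)} P[t]\; c^{\,l(t)}
```

In the paper, the two surfers walk backward along edges, and s'(a,b) coincides with the SimRank score s(a,b) when the decay factor C equals c.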
Experimental setup • Datasets: • ResearchIndex dataset: papers and their citations • Almost 700,000 cross-citations among 270,000 papers • Student and course dataset: students and their courses (bipartite graph) • 1030 students, each taking around 40 courses
Experimental setup (Cont.) • Baseline method: • Co-citation: measures the number of shared neighbors • How to evaluate an algorithm: • Select an object p • Select the top N objects most similar to p • Average their similarity to p, based on a domain-specific measure
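The co-citation baseline and the top-N selection step can be sketched as follows. The function names `cocitation` and `top_n`, the tie-breaking by name, and the toy citation data are assumptions for illustration; the domain-specific scoring step is not modeled here.

```python
def cocitation(in_nbrs, a, b):
    """Co-citation score: number of in-neighbors shared by a and b."""
    return len(set(in_nbrs.get(a, ())) & set(in_nbrs.get(b, ())))

def top_n(in_nbrs, p, n):
    """The n objects most similar to p under co-citation (ties broken by name)."""
    others = [q for q in in_nbrs if q != p]
    return sorted(others, key=lambda q: (-cocitation(in_nbrs, p, q), q))[:n]

# Hypothetical citation data: B cites A, A1, and A2; B2 also cites A2.
in_nbrs = {"A": ["B"], "A1": ["B"], "A2": ["B", "B2"]}
ranked = top_n(in_nbrs, "A", 2)
```

Here A1 and A2 both share exactly one citing paper with A, so co-citation scores them equally and cannot use the extra citation from B2 to separate them.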
Trend: computing SimRank on MapReduce • Delta-SimRank Computing on MapReduce. Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine'12). • Only values greater than zero need to be sent, which saves communication cost on MapReduce
Good points • The paper proposes a novel, general method to compute the similarity of objects based on the structure of the data • The paper proposes a method to compute the scores and an efficient pruning technique • The paper provides an intuition for the method • The experimental results support the idea
Weak points • Scalability: the paper should discuss very large graphs • It could incorporate a distributed design. Since the algorithm is a fixed-point process, how to parallelize it is a research problem
Quiz • Intuitively, in which graph is the SimRank of a and b higher?