SimRank : A Measure of Structural-Context Similarity

Glen Jeh and Jennifer Widom, KDD 2002 • CS 519 Class Presentation • Presenter: Anh Pham

Outline of the talk • Introduction to Structural Context Similarity • SimRank • Computing SimRank • Naïve method • Pruning • Example • Limited information problem • Random surfer pair model • Experimental results • Strong and weak points • Quiz

The problem of finding similar objects • There are many applications • Finding similar documents • Collaborative filtering: finding similar users, finding similar items

Aspects of objects used for similarity • Many aspects can make objects similar • Documents: common words, sentences… • Users: common preferences

Structural similarity • This paper proposes a general approach that applies whenever the data can be represented as a graph • Examples: web pages, user preferences, scientific networks

Example of structural similarity • Intuition: similar objects are related to similar objects • Example: Prof. A has student A and Prof. B has student B. Prof. A and Prof. B are similar, since they are from the same university. Recursively, student A and student B are similar. • If we know the similarity of Prof. A and Prof. B, we can estimate the similarity between student A and student B

Some basic notation for graph models • Graph G=(V,E), where V is the set of nodes and E is the set of edges • For nodes p and q, <p,q> denotes the edge from p to q • I(v) denotes the in-neighbors of v • O(v) denotes the out-neighbors of v • Example (figure with nodes A, B, C, D, E): I(C)={A,B} and O(A)={C,D}
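
The notation above can be sketched in a few lines of Python. This is only an illustration: the edge list holds the three edges implied by I(C)={A,B} and O(A)={C,D}, and the helper name `neighbors` is hypothetical. Any other edges in the slide's figure (for example those touching E) are not recoverable and are left out.

```python
from collections import defaultdict

def neighbors(edges):
    """Build the in-neighbor map I(v) and out-neighbor map O(v) from directed edges <p,q>."""
    ins, outs = defaultdict(set), defaultdict(set)
    for p, q in edges:
        outs[p].add(q)  # q is an out-neighbor of p
        ins[q].add(p)   # p is an in-neighbor of q
    return ins, outs

# Assumed edges: only those implied by I(C)={A,B} and O(A)={C,D}.
edges = [("A", "C"), ("B", "C"), ("A", "D")]
ins, outs = neighbors(edges)
print(sorted(ins["C"]), sorted(outs["A"]))
```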

Node-pair graph • Create a node-pair graph G² from G • <(p,q),(a,b)> is an edge of G² if <p,a> and <q,b> are edges of G • Example: (figure)
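
A minimal sketch of the node-pair construction, assuming ordered pairs and a plain edge list. The function name `node_pair_graph` and the two-edge example graph are illustrative assumptions, not taken from the slides.

```python
from itertools import product

def node_pair_graph(edges):
    """Edges of G^2: (p, q) -> (a, b) whenever p -> a and q -> b are edges of G."""
    return {((p, q), (a, b)) for (p, a), (q, b) in product(edges, repeat=2)}

# Hypothetical graph: one university pointing to two professors.
g = [("Univ", "ProfA"), ("Univ", "ProfB")]
g2 = node_pair_graph(g)
for src, dst in sorted(g2):
    print(src, "->", dst)
```

Because every pair of edges of G yields one edge of G², the pair graph can have up to |E|² edges, which is why the later slides worry about the cost of considering all pairs.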

SimRank motivation • Intuition: similar objects are related to similar objects • Univ = Univ → Sim(Univ, Univ) = 1 • Prof. A is related to Univ and Prof. B is related to Univ → Sim(Prof. A, Prof. B) = .414 < 1 • Student A is related to Prof. A and Student B is related to Prof. B → Sim(Student A, Student B) = .331 < 1

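SimRank equation • Similarity between a and b: $s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|} s(I_i(a),I_j(b))$ • Example (figure with nodes A, B, D, E, F): assume C=1, I(F)={A,B}, I(D)={A}, and s(B,A)=0.5 • $s(F,D)=\frac{1}{2\cdot 1}\,[s(A,A)+s(B,A)]=\frac{1}{2}(1+0.5)=0.75$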

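SimRank equation (1) • Similarity between a and b: $s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|} s(I_i(a),I_j(b))$ • s(a,b) is symmetric: s(a,b)=s(b,a) • s(a,a)=1 • s(a,x)=0 if x≠a and x has no in-neighbors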

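SimRank equation (2) • Similarity between a and b: $s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|} s(I_i(a),I_j(b))$ • s(a,b) is normalized into [0,1] • Proof: by induction • C<1 • each term s(I_i(a),I_j(b)) ≤ 1, so their average is at most 1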

SimRank equation (2) • Similarity between a and b: $s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|} s(I_i(a),I_j(b))$ • The factor C should be < 1 • C represents the confidence level propagated from the parent nodes

Bipartite SimRank • Consider a recommendation system: • How can we recommend an item to a new buyer? • A and B are similar since they both buy frosting and eggs → recommend flour to A

Bipartite SimRank (mutually reinforcing rules) • Rule 1: People are similar if they purchase similar items • Rule 2: Items are similar if they are purchased by similar people • Rule 1 reinforces Rule 2, and vice versa • Example: 1. If frosting and eggs are similar, then A and B are also similar. 2. If A and B are similar, then frosting and eggs are similar. • Observation: sugar and flour also come out as similar, even though they have no common customer.

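Bipartite SimRank (formula) • Rule 1: People are similar if they purchase similar items. In math form, for people A ≠ B: $s(A,B)=\frac{C_1}{|O(A)||O(B)|}\sum_{i=1}^{|O(A)|}\sum_{j=1}^{|O(B)|} s(O_i(A),O_j(B))$ • Rule 2: Items are similar if they are purchased by similar people. In math form, for items c ≠ d: $s(c,d)=\frac{C_2}{|I(c)||I(d)|}\sum_{i=1}^{|I(c)|}\sum_{j=1}^{|I(d)|} s(I_i(c),I_j(d))$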
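
The two rules can be iterated together. The sketch below is one possible reading, not the presenter's code: the function name `bipartite_simrank`, the decay factors `c1 = c2 = 0.8`, the iteration count, and the two shopping baskets (A buys sugar, frosting, eggs; B buys frosting, eggs, flour) are all assumptions chosen to mirror the shopping example.

```python
def bipartite_simrank(purchases, c1=0.8, c2=0.8, iterations=10):
    """Iterate Rule 1 (people via their items) and Rule 2 (items via their buyers)."""
    people = sorted(purchases)
    items = sorted({item for basket in purchases.values() for item in basket})
    buyers = {item: [p for p in people if item in purchases[p]] for item in items}

    def identity(keys):
        # Starting scores: 1 on the diagonal, 0 elsewhere.
        return {(x, y): 1.0 if x == y else 0.0 for x in keys for y in keys}

    def step(keys, nbrs, other, c):
        # One update of one side, averaging the other side's scores over neighbor pairs.
        new = {}
        for x in keys:
            for y in keys:
                nx, ny = nbrs[x], nbrs[y]
                if x == y:
                    new[x, y] = 1.0
                elif not nx or not ny:
                    new[x, y] = 0.0
                else:
                    total = sum(other[u, v] for u in nx for v in ny)
                    new[x, y] = c * total / (len(nx) * len(ny))
        return new

    s_people, s_items = identity(people), identity(items)
    for _ in range(iterations):
        # Both sides are updated from the previous iteration's scores.
        s_people, s_items = (step(people, purchases, s_items, c1),
                             step(items, buyers, s_people, c2))
    return s_people, s_items

# Assumed baskets, mirroring the shopping example.
purchases = {"A": {"sugar", "frosting", "eggs"}, "B": {"frosting", "eggs", "flour"}}
s_people, s_items = bipartite_simrank(purchases)
print(s_people["A", "B"], s_items["sugar", "flour"])
```

In this sketch sugar and flour receive a positive score after the second iteration, purely because their buyers A and B are similar, which is the effect the observation on the previous slide describes.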

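Bipartite SimRank (homogeneous domain extension) • Previously (in-links only): $s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i}\sum_{j} s(I_i(a),I_j(b))$ • Why not use out-links as well? → the extension: $s_1(a,b)=\frac{C_1}{|O(a)||O(b)|}\sum_{i}\sum_{j} s_2(O_i(a),O_j(b))$ and $s_2(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i}\sum_{j} s_1(I_i(a),I_j(b))$ • Depending on the application, use either score or both (recall the HITS algorithm)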

Minimax extension • Example: given CS students A and B. • Both A and B take the required CS courses • As an elective, A takes sociology • As an elective, B takes English • Previously: every course of A is compared with every course of B • How can we compare only A’s CS courses with B’s CS courses, and A’s elective courses with B’s elective courses? • It is meaningless to compare A’s CS courses with B’s elective courses

Minimax extension (Cont.) • Example: given CS students A and B. • Both A and B take the required CS courses. As electives, A takes sociology and B takes English. • Only compare each course of A with the most similar course of B
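
Written out, the minimax idea can be expressed as follows. This is a sketch of the variant as described in the SimRank paper, using the bipartite notation of the earlier slides (O(A) is the set of courses of student A); the intermediate names s_A and s_B are the only additions.

```latex
\begin{align*}
s_A(A,B) &= \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{1 \le j \le |O(B)|} s\bigl(O_i(A), O_j(B)\bigr) \\
s_B(A,B) &= \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{1 \le i \le |O(A)|} s\bigl(O_i(A), O_j(B)\bigr) \\
s(A,B)   &= \min\bigl(s_A(A,B),\; s_B(A,B)\bigr)
\end{align*}
```

Each course of A is matched only with its most similar course of B (the max), and taking the min of the two directions keeps the score symmetric.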

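Naïve method to compute SimRank • The naïve method is iterative • R_k(a,b) stores the similarity of a and b at iteration k • Initialize R_0(a,b)=1 if a=b and R_0(a,b)=0 otherwise • Update R_{k+1} from R_k: $R_{k+1}(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|} R_k(I_i(a),I_j(b))$ for a≠b, and R_{k+1}(a,a)=1 • Repeat until convergence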
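
A minimal sketch of the naïve iteration described above. The function name `simrank`, the fixed iteration count, the decay factor C=0.8, and the six-edge university graph are assumptions: the edges were chosen so that the scores agree with the .414 and .331 values quoted on the motivation slide.

```python
def simrank(edges, c=0.8, iterations=20):
    """Naive SimRank: R_0 is the identity, R_{k+1} averages R_k over in-neighbor pairs."""
    nodes = sorted({v for edge in edges for v in edge})
    in_nbrs = {v: [] for v in nodes}
    for p, q in edges:
        in_nbrs[q].append(p)

    # R_0(a, b) = 1 if a == b, else 0.
    r = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[a, b] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    nxt[a, b] = 0.0  # no in-neighbors, so no evidence of similarity
                else:
                    total = sum(r[i, j] for i in in_nbrs[a] for j in in_nbrs[b])
                    nxt[a, b] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        r = nxt
    return r

# Assumed example graph (university, two professors, two students).
edges = [("Univ", "ProfA"), ("Univ", "ProfB"), ("ProfA", "StudentA"),
         ("StudentA", "Univ"), ("ProfB", "StudentB"), ("StudentB", "ProfB")]
scores = simrank(edges)
print(round(scores["ProfA", "ProfB"], 3), round(scores["StudentA", "StudentB"], 3))
```

On this assumed graph the two printed scores come out to about 0.414 and 0.331. The sketch stores all n² pairs in a dictionary, which is exactly the cost the time-analysis slide accounts for.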

Time analysis of the naïve method • 1. Assume there are n nodes in G → O(n²) space is required to store all pairs. • 2. Let d be the average of |I(a)||I(b)| over pairs → each iteration takes O(d) time per pair. • 3. Let K be the number of iterations. • Combining 1, 2, and 3 → the time complexity is O(dn²K) • Empirical note: K≈5 in practice

Pruning to reduce time complexity • Previously, we assumed the node-pair graph has n² nodes → all pairs are considered. • In practice, a node v that is far from a given node a is assumed to have s(v,a)=0 → it is enough to consider only the nodes within radius r of a • Figure: if v is far from a, s_{k+1}(a,v)=0; if v is within radius r of a, s_{k+1}(a,v) is computed from s_k

Time analysis of pruning • Previously, all n² pairs: O(dn²K) • Now, only pairs within radius r: O(dnrK), reading r in the bound as the average number of nodes within the radius of a node (d₂ in the paper)
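
One way to realize the pruning idea is sketched below. Several details are assumptions not stated on the slides: distance is measured by breadth-first search over the undirected version of the graph, pairs outside the radius are simply treated as 0, and the names `pairs_within_radius` and `pruned_simrank` as well as the example graph are hypothetical.

```python
from collections import deque

def pairs_within_radius(edges, r):
    """All ordered pairs (a, v) whose undirected distance is at most r."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    pairs = set()
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == r:
                continue  # do not expand beyond the radius
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        pairs.update((src, v) for v in dist)
    return pairs

def pruned_simrank(edges, r=2, c=0.8, iterations=5):
    """SimRank restricted to nearby pairs; every other pair is assumed to score 0."""
    in_nbrs = {}
    for p, q in edges:
        in_nbrs.setdefault(p, [])
        in_nbrs.setdefault(q, []).append(p)
    pairs = pairs_within_radius(edges, r)
    score = {(a, b): 1.0 if a == b else 0.0 for a, b in pairs}
    for _ in range(iterations):
        nxt = {}
        for a, b in pairs:
            ia, ib = in_nbrs[a], in_nbrs[b]
            if a == b:
                nxt[a, b] = 1.0
            elif not ia or not ib:
                nxt[a, b] = 0.0
            else:
                # Pairs that were pruned away contribute 0 via .get().
                total = sum(score.get((i, j), 0.0) for i in ia for j in ib)
                nxt[a, b] = c * total / (len(ia) * len(ib))
        score = nxt
    return score

# Hypothetical graph: x and y share the parent h; z3 sits at the end of a chain.
edges = [("h", "x"), ("h", "y"), ("y", "z1"), ("z1", "z2"), ("z2", "z3")]
score = pruned_simrank(edges, r=2)
print(score["x", "y"], ("x", "z3") in score)
```

Here the pair (x, y) is kept and scored, while (x, z3) is never stored at all, so both memory and per-iteration work scale with the number of nearby pairs rather than with n².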

How SimRank solves the “limited information problem” • Limited information problem (LIP): • Find papers similar to A • There is little information (only B cites A) • Among A1, A2, …, Am, which one is most similar to A? • The co-citation algorithm cannot solve the LIP: • A1, A2, …, Am each share one common in-link with A → they are all equally similar to A • SimRank can solve the LIP: • A is cited by B’, and B’ is similar to B → Am is more similar to A than the other Ai

Random Surfer Pair model • The random surfer pair model provides an intuitive interpretation of SimRank • Example: SimRank(m,d) can be explained with random walks • Case 1 (figure): high probability that m and d meet in one step

Random Surfer Pair model (Cont.) • Case 1 (figure: m and d can both step only to a): high probability that m and d meet → SimRank(m,d) is high • Case 2 (figure: m and d can each step to a or y): lower probability that m and d meet → SimRank(m,d) is lower

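Random Surfer Pair model (Cont.) • How to compute the expected meeting distance m(m,d): two surfers start at m and d and walk backward along in-links one step at a time; $m(m,d)=\sum_{t} P[t]\,l(t)$, summed over the tours t on which they first meet, where P[t] is the probability of tour t and l(t) is its length • SimRank(m,d) is the expected-f meeting distance of (m,d) with $f(z)=C^{z}$: $s(m,d)=\sum_{t} P[t]\,C^{l(t)}$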
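
The random-walk reading can be checked with a small simulation. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it estimates the score by sampling, truncates walks at `max_steps`, counts a pair that gets stuck as never meeting, and uses two hypothetical toy graphs shaped like Case 1 and Case 2.

```python
import random

def surfer_pair_estimate(edges, a, b, c=0.8, walks=10000, max_steps=20, seed=0):
    """Monte Carlo estimate of E[C^T], where T is the first meeting time of two
    surfers walking backward along in-links from a and b."""
    rng = random.Random(seed)
    in_nbrs = {}
    for p, q in edges:
        in_nbrs.setdefault(q, []).append(p)
    total = 0.0
    for _ in range(walks):
        x, y, steps = a, b, 0
        while x != y and steps < max_steps:
            if not in_nbrs.get(x) or not in_nbrs.get(y):
                break  # a surfer is stuck, so this pair never meets
            x = rng.choice(in_nbrs[x])
            y = rng.choice(in_nbrs[y])
            steps += 1
        if x == y:
            total += c ** steps
    return total / walks

# Case 1 (assumed shape): m and d share the single parent a.
case1 = [("a", "m"), ("a", "d")]
# Case 2 (assumed shape): m and d each have two parents, a and y.
case2 = [("a", "m"), ("a", "d"), ("y", "m"), ("y", "d")]
est1 = surfer_pair_estimate(case1, "m", "d")
est2 = surfer_pair_estimate(case2, "m", "d")
print(est1, est2)
```

In Case 1 the surfers always meet after one step, so the estimate equals C. In Case 2 they meet with probability 1/2, so the estimate is near C/2, matching the claim that SimRank(m,d) is lower there.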

Experimental setup • Datasets: • ResearchIndex dataset: papers and their citations • Almost 700,000 cross-citations among 270,000 papers • Student and course dataset: students and their courses (bipartite graph) • 1030 students, each taking around 40 courses

Experimental setup (Cont.) • Baseline method: • Co-citation: measures the number of shared objects • How to evaluate the algorithm: • Select an object p • Select the top N objects most similar to p • Average their similarity to p, based on a domain-specific measure

Experimental results

Trend: computing SimRank on MapReduce • Delta-SimRank Computing on MapReduce. Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine’12). • Only values greater than zero need to be sent → this saves communication cost over MapReduce

Good points • The paper proposes a novel, general method to compute the similarity of objects based on the structure of the data • The paper proposes a computation method and an efficient pruning technique • The paper provides an intuition for the method • Good experimental results support the idea

Weak points • Scalability: the paper should address very large graphs. • It could incorporate a distributed design. Since the algorithm is a fixed-point process, how to parallelize it is a research problem.

Quiz • Intuitively, in which of the two graphs (figure) is the SimRank of a and b higher?
