PageSim: A Link-based Measure of Web Page Similarity Research Group Presentation Allen Z. Lin, 8 Mar 2006
Outline • What & Why? • Existing approaches • PageSim: a new approach • Demonstrations • Conclusion and current work
What & Why? • Ranking similarity between web pages. • Applications on the Web • Finding pages related or similar to a given page, e.g. Google’s “Similar pages”. • Web page classification, e.g. YAHOO!’s Web Directory (http://dir.yahoo.com/), which has a hierarchical structure. • Key question: How to measure the similarity?
Existing approaches • Text-based • Using common features of two web pages: Jaccard’s coefficient, Adamic/Adar. • Link-based • Using the neighbors shared by two web pages: Common neighbor, Co-citation, SimRank. • Using paths between two web pages: Katz index, Hitting time.
Existing approaches (cont.) • Notations • Sim(a,b): similarity score of web pages a and b. • I(a): in-link neighbors of web page a. • O(a): out-link neighbors of web page a. • Common neighbor method • Sim(a,b) = |O(a)∩O(b)| = |{c,d}| = 2 • Co-citation method • Sim(a,b) = |I(a)∩I(b)| = |{c,d}| = 2
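The two counting measures above can be sketched in a few lines of Python. The slide's figure is not available, so the edge list below is a hypothetical graph chosen so that both measures return 2 for the pair (a, b); the helper names are illustrative only.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (in_links, out_links) dictionaries for a directed edge list."""
    in_links = defaultdict(set)
    out_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return in_links, out_links


def common_neighbor(out_links, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|: number of pages both a and b link to."""
    return len(out_links[a] & out_links[b])


def cocitation(in_links, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|: number of pages linking to both a and b."""
    return len(in_links[a] & in_links[b])


# Hypothetical graph (not the slide's figure): a and b each link to c and d,
# and c and d each link back to a and b.
edges = [
    ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
    ("c", "a"), ("c", "b"), ("d", "a"), ("d", "b"),
]
in_links, out_links = build_neighbors(edges)

cn_score = common_neighbor(out_links, "a", "b")
cc_score = cocitation(in_links, "a", "b")
print(cn_score, cc_score)
```

Both measures only count shared neighbors, so every link carries the same weight regardless of which page it comes from.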
Existing approaches (cont.) • SimRank • Two pages are similar if they are referenced (cited, or linked to) by similar pages. • Recursive definition (standard SimRank form) • 1. Sim(u,u) = 1. • 2. Sim(u,v) = 0 if |I(u)|·|I(v)| = 0. • 3. Otherwise, Sim(u,v) = C / (|I(u)|·|I(v)|) · Σ_{x∈I(u)} Σ_{y∈I(v)} Sim(x,y). • C is a constant between 0 and 1. • The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u≠v.
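A minimal sketch of the SimRank iteration, assuming the standard recursion in which Sim(u,v) is C times the average similarity over all pairs of in-neighbors. The value C = 0.8, the iteration count, and both small graphs are assumptions made for illustration, not values taken from the slides.

```python
def simrank(nodes, in_links, C=0.8, iterations=10):
    """Iteratively compute SimRank scores for every ordered pair of nodes."""
    # Start with Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                    continue
                iu = in_links.get(u, ())
                iv = in_links.get(v, ())
                if not iu or not iv:
                    # A page without in-links is similar to no other page.
                    new[u][v] = 0.0
                    continue
                total = sum(sim[x][y] for x in iu for y in iv)
                new[u][v] = C * total / (len(iu) * len(iv))
        sim = new
    return sim


# Hypothetical fan-out graph: p links to x and y.
fan = simrank(["p", "x", "y"], {"x": {"p"}, "y": {"p"}})

# Hypothetical chain: a -> b -> c -> d.
chain = simrank(["a", "b", "c", "d"], {"b": {"a"}, "c": {"b"}, "d": {"c"}})

print(fan["x"]["y"], fan["p"]["x"], chain["b"]["c"])
```

In the fan-out graph x and y share their only citing page, so they score C. In the chain every pair of distinct pages scores 0, because the recursion always bottoms out at a page with no in-links.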
PageSim: a new approach • Two problems • On the Web, not all links are equally important (a weakness of Common neighbor and Co-citation). • A similarity measure should be able to measure the similarity between any two web pages (a weakness of SimRank). • PageSim • Takes both problems into account.
PageSim: a new approach (cont.) • Co-citation • Which page is more similar to d, c or e? • Suppose page a is YAHOO!’s homepage and b is a personal web page. Authoritative pages are more important.
PageSim: a new approach (cont.) • SimRank • Are a and b similar? • SimRank says “NO” in each case. Are these answers reasonable?
PageSim: a new approach (cont.) • Page a linking to b and c means a “thinks” • b and c are kind of similar. • both b and c are kind of similar to a too. • Page a spreads similarity to its neighbors. • Authoritative pages spread more similarity.
PageSim: a new approach (cont.) • PageSim • In PageSim, the PageRank (PR) score is used to measure the authority of a web page. PR assigns global importance scores to all web pages. • Each page spreads its own similarity score (PR score) to its neighbors. • Each page also propagates other pages’ similarity scores to its neighbors. • Once similarity score propagation has finished, each page holds an array of similarity scores. • PageRank score propagation
PageSim: a new approach (cont.) • Example: similarity propagation (page a only) • PR(a)=100, PR(b)=55, PR(c)=102 • Each page propagates 80% of its similarity score, divided evenly among its neighbors.
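The propagation step can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the edge set a→b, a→c, b→c, c→a is inferred from the score vectors quoted in the deck (the figure itself is not available), scores are assumed to travel only along paths that never revisit a page, and the name `propagate` is hypothetical.

```python
def propagate(out_links, pr, decay=0.8):
    """Spread each page's PR score along simple paths.

    Returns sv, where sv[v][u] is the amount of u's score that reaches v.
    """
    nodes = list(pr)
    sv = {v: {u: 0.0 for u in nodes} for v in nodes}

    def walk(source, node, score, visited):
        targets = out_links.get(node, ())
        if not targets:
            return
        # 80% of the score is split evenly over all out-links of this page.
        share = decay * score / len(targets)
        for nxt in targets:
            if nxt in visited:
                continue  # assumption: a score never re-enters a visited page
            sv[nxt][source] += share
            walk(source, nxt, share, visited | {nxt})

    for u in nodes:
        sv[u][u] = pr[u]  # each page keeps its own full PR score
        walk(u, u, pr[u], {u})
    return sv


# Edge set inferred from the deck's numbers (an assumption).
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = {"a": 100, "b": 55, "c": 102}

sv = propagate(out_links, pr)
for page in sv:
    print(page, [round(sv[page][src]) for src in ("a", "b", "c")])
```

Page a sends 40 to each of b and c; b then forwards 80% of its 40 to c, so c ends up holding 72 of a's score. Enumerating simple paths grows quickly on large graphs, which is presumably why the deck lists propagation radius pruning as current work.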
PageSim: a new approach (cont.) • Example: similarity propagation (cont.) • PR(a)=100, PR(b)=55, PR(c)=102 • Each page contains a similarity score vector (SV). • SV(a) = (100, 35, 82) • SV(b) = (40, 55, 33) • SV(c) = (72, 44, 102) • PageSim score (PS) computation • PS(a,b) = Σ_i min( SV(a)_i, SV(b)_i ) = 40+35+33 = 108 • Two pages are more similar if they share more common similarity scores.
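The score computation above is an element-wise minimum followed by a sum. A small sketch using the slide's vectors (the function name `pagesim_score` is illustrative):

```python
def pagesim_score(sv_u, sv_v):
    """PS(u,v): sum of the element-wise minimum of two similarity vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))


# Similarity score vectors as given on the slide.
sv = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}

ps_ab = pagesim_score(sv["a"], sv["b"])
print(ps_ab)  # 40 + 35 + 33
```

Taking the minimum per component counts only the score that both pages received from the same source page.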
PageSim: a new approach (cont.) • Example: similarity propagation (cont.) • PageSim score matrix • PS_matrix = (PS(u,v))n×n (lower triangle):
a: 217
b: 108 128
c: 189 117 219
• PS_matrix is symmetric: PS(a,b) = PS(b,a). • Any web page is most similar to itself: PS(u,u) = max over v of PS(u,v).
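Both matrix properties follow from the min-sum form: min(x, y) = min(y, x) gives symmetry, and min(x, y) ≤ x for every component means no PS(u,v) can exceed PS(u,u), the sum of SV(u). A short check on the deck's rounded vectors (variable names are illustrative):

```python
from itertools import product

# Rounded similarity score vectors from the deck's three-page example.
sv = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}
pages = sorted(sv)

# Full PageSim matrix as a dict keyed by (u, v).
ps = {
    (u, v): sum(min(x, y) for x, y in zip(sv[u], sv[v]))
    for u, v in product(pages, repeat=2)
}

symmetric = all(ps[u, v] == ps[v, u] for u, v in ps)
self_is_max = all(ps[u, u] == max(ps[u, v] for v in pages) for u in pages)

for u in pages:
    print(u, [ps[u, v] for v in pages])
print(symmetric, self_is_max)
```

From these rounded vectors the diagonal entry for c evaluates to 218; the 219 on the slide presumably reflects unrounded scores.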
Demonstrations • Example 1: single link • PageSim matrix (lower triangle):
a: 100
b: 80 265
c: 64 212 469.2
d: 51.2 169.6 375.4 694.1
• PR = (100, 185, 257.2, 318.6) • SimRank matrix (lower triangle):
a: 1
b: 0 1
c: 0 0 1
d: 0 0 0 1
Demonstrations (cont.) • Example 2: loop link • PageSim matrix (lower triangle):
a: 295.2
b: 246.4 295.2
c: 230.4 246.4 295.2
d: 246.4 230.4 246.4 295.2
• PR = (100, 100, 100, 100) • SimRank matrix (lower triangle):
a: 1
b: 0 1
c: 0 0 1
d: 0 0 0 1
Demonstrations (cont.) • Example 3: more complex • PageSim matrix (lower triangle):
1: 100.0
2: 40.0 487.6
3: 50.7 159.4 397.4
4: 10.7 238.5 130.0 275.5
5: 10.7 130.0 130.0 130.0 314.9
• PR = (100, 40.0, 50.7, 10.7, 10.7) • SimRank matrix (lower triangle):
1: 1
2: 0 1
3: 0 0.25 1
4: 0 0 0.5 1
5: 0 0 0.5 1 1
• PageSim results • v3 is most similar to v1. • v4 is most similar to v2.
Conclusion and current work • Conclusion • Web page similarity measures: text-based and link-based. • PageSim: PageRank score propagation. • Current work • Propagation radius pruning. • How to compare the performance of two similarity measures, e.g., PageSim and SimRank? Text-based measures. • Thank you!