PageSim: A Link-based Measure of Web Page Similarity

PageSim: A Link-based Measure of Web Page Similarity. Research Group Presentation, Allen Z. Lin, 8 Mar 2006. Outline: What & Why? Existing approaches. PageSim: a new approach. Demonstrations. Conclusion and current work.

Presentation Transcript


  1. PageSim: A Link-based Measure of Web Page Similarity • Research Group Presentation • Allen Z. Lin, 8 Mar 2006

  2. Outline • What & Why? • Existing approaches • PageSim: a new approach • Demonstrations • Conclusion and current work

  3. What & Why? • Ranking similarity between web pages. • Applications on the Web • Finding pages related or similar to a given page, e.g. Google’s “Similar pages”. • Web page classification, e.g. YAHOO!’s Web Directory (http://dir.yahoo.com/), which has a hierarchical structure. • Key question: How to measure the similarity?

  4. Existing approaches • Text-based • Using common features of two web pages, e.g. Jaccard’s coefficient, Adamic/Adar. • Link-based • Using neighbors shared by two web pages, e.g. Common neighbor, Co-citation, SimRank. • Using paths between two web pages, e.g. Katz index, Hitting time.

  5. Existing approaches (cont.) • Notations • Sim(a,b): similarity score of web pages a and b. • I(a): in-link neighbors of web page a. • O(a): out-link neighbors of web page a. • Common neighbor method • Sim(a,b) = |O(a)∩O(b)| = |{c,d}| = 2 • Co-citation method • Sim(a,b) = |I(a)∩I(b)| = |{c,d}| = 2
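The two counting measures can be sketched in a few lines of Python. The edge list below is a hypothetical graph, not the one drawn on the slide; it is chosen only so that a and b share the out-link neighbors c and d and also the in-link neighbors c and d, which gives the value 2 for both measures. The helper names are illustrative.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Build O(.) and I(.) as dictionaries of sets from (source, target) pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links


def common_neighbor(out_links, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|, as defined on the slide."""
    return len(out_links[a] & out_links[b])


def cocitation(in_links, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|."""
    return len(in_links[a] & in_links[b])


# Hypothetical graph (an assumption): a and b link to c and d, and c and d link back.
EDGES = [
    ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
    ("c", "a"), ("c", "b"), ("d", "a"), ("d", "b"),
]
OUT_LINKS, IN_LINKS = neighbor_sets(EDGES)
print(common_neighbor(OUT_LINKS, "a", "b"), cocitation(IN_LINKS, "a", "b"))
```

Both measures only count shared neighbors, so every link carries the same weight regardless of which page it comes from.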

  6. Existing approaches (cont.) • SimRank • Two pages are similar if they are referenced (cited, or linked to) by similar pages. • Recursive definition, with two base cases: 1. Sim(u,u)=1; 2. Sim(u,v)=0 if |I(u)|·|I(v)| = 0. • C is a constant between 0 and 1. • The iteration starts with Sim(u,u)=1 and Sim(u,v)=0 if u≠v.
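For reference, the SimRank recursion is usually written as follows, here in the notation of these slides. The symbol $I_i(u)$, meaning the $i$-th in-link neighbor of $u$, is an assumption of this sketch and is not defined on the slides.

```latex
\[
\mathrm{Sim}(u,v) \;=\; \frac{C}{|I(u)|\,|I(v)|}
\sum_{i=1}^{|I(u)|} \sum_{j=1}^{|I(v)|}
\mathrm{Sim}\bigl(I_i(u),\, I_j(v)\bigr),
\qquad u \neq v,\quad |I(u)|\,|I(v)| \neq 0 .
\]
```

The two base cases on the slide cover the remaining pairs: identical pages, and pairs in which at least one page has no in-links.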

  7. PageSim: a new approach • Two problems • On the Web, not all links are equally important (a weakness of Common neighbor and Co-citation). • A similarity measure should be able to measure the similarity between any two web pages (a weakness of SimRank). • PageSim • Takes the above problems into account.

  8. PageSim: a new approach (cont.) • Co-citation • Which page is more similar to d: c or e? • Suppose page a is YAHOO!’s homepage, and b is a personal web page. Authoritative pages are more important.

  9. PageSim: a new approach (cont.) • SimRank • Are a and b similar? • SimRank answers “NO” each time. Are the answers reasonable?

  10. PageSim: a new approach (cont.) • Page a linking to b and c means a “thinks” that: • b and c are somewhat similar. • b and c are both somewhat similar to a as well. • Page a spreads similarity to its neighbors. • Authoritative pages spread more similarity.

  11. PageSim: a new approach (cont.) • PageSim • In PageSim, the PageRank (PR) score is used to measure the authority of a web page. PR assigns global importance scores to all web pages. • Each page spreads its own similarity score (PR score) to its neighbors. • Each page also propagates other pages’ similarity scores to its neighbors. • After the similarity score propagation has finished, each page holds an array of similarity scores. • PageRank score propagation

  12. PageSim: a new approach (cont.) • Example: similarity propagation (page a only) • PR(a)=100, PR(b)=55, PR(c)=102 • Each page propagates 80% of its similarity score, divided evenly among its neighbors.

  13. PageSim: a new approach (cont.) • Example: similarity propagation (cont.) • PR(a)=100, PR(b)=55, PR(c)=102 • Each page contains a similarity score vector (SV). • SV(a) = (100, 35, 82) • SV(b) = (40, 55, 33) • SV(c) = (72, 44, 102) • PageSim score (PS) computation • PS(a,b) = Σ min( SV(a), SV(b) ), summed entry by entry, = 40+35+33 = 108 • Two pages are more similar if they share more common similarity scores.
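The PS computation above can be sketched in Python using only the three vectors given on this slide. The function and variable names are illustrative, not taken from the presentation.

```python
# Similarity score vectors exactly as listed on the slide (entries ordered a, b, c).
SV = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}


def pagesim(sv_u, sv_v):
    """Sum of the entry-wise minima of two similarity score vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))


def pagesim_matrix(vectors):
    """PS score for every ordered pair of pages."""
    pages = sorted(vectors)
    return {(u, v): pagesim(vectors[u], vectors[v]) for u in pages for v in pages}


PS = pagesim_matrix(SV)
print(PS[("a", "b")])  # 40 + 35 + 33
```

With these vectors the sketch gives PS(a,b) = 108, PS(a,c) = 189 and PS(b,c) = 117. For the diagonal entry of c, the listed vector sums to 72+44+102 = 218; the 219 that the deck reports for that entry may reflect rounding of the vector entries.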

  14. PageSim: a new approach (cont.) • Example: similarity spreading (cont.) • PageSim score matrix • PS_matrix = (PS(u,v))n×n, lower triangle shown, columns in the order a, b, c:
      a: 217
      b: 108  128
      c: 189  117  219
      • PS_matrix is symmetric: PS(a,b) = PS(b,a). • Any web page is most similar to itself: PS(u,u) = max ( PS(u,v) ), for any v.
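Both properties follow from the min-sum definition of PS. A short derivation, writing $SV(u)_i$ for the $i$-th entry of $SV(u)$ (the subscript notation is an assumption of this sketch):

```latex
\begin{align*}
\mathrm{PS}(u,v) &= \sum_i \min\bigl(SV(u)_i,\, SV(v)_i\bigr)
                  = \sum_i \min\bigl(SV(v)_i,\, SV(u)_i\bigr)
                  = \mathrm{PS}(v,u), \\
\mathrm{PS}(u,v) &= \sum_i \min\bigl(SV(u)_i,\, SV(v)_i\bigr)
                  \;\le\; \sum_i SV(u)_i
                  = \mathrm{PS}(u,u).
\end{align*}
```

The first line uses the symmetry of $\min$; the second uses $\min(x,y) \le x$ term by term.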

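  15. Demonstrations • Example 1: single link • PageSim matrix (lower triangle, columns in the order a, b, c, d):
      a: 100
      b: 80     265
      c: 64     212     469.2
      d: 51.2   169.6   375.4   694.1
      • PR = (100, 185, 257.2, 318.6) • SimRank matrix (lower triangle, same order):
      a: 1
      b: 0  1
      c: 0  0  1
      d: 0  0  0  1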
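A minimal Python sketch of the score propagation, under three assumptions that the transcript does not state outright: the "single link" graph is the chain a → b → c → d, each hop passes on 80% of a score split evenly over the out-links (as in the earlier example), and a score travels along simple paths only, never revisiting a page. All names are illustrative.

```python
def similarity_vectors(out_links, pagerank, decay=0.8):
    """Return {page: {source: score}} after every page has spread its PR score."""
    vectors = {page: {} for page in pagerank}

    def spread(source, page, score, visited):
        # Record what arrives at `page`, then pass a decayed, evenly split share on.
        vectors[page][source] = vectors[page].get(source, 0.0) + score
        targets = out_links.get(page, [])
        for nxt in targets:
            if nxt not in visited:  # assumption: simple paths only
                spread(source, nxt, score * decay / len(targets), visited | {nxt})

    for source, score in pagerank.items():
        spread(source, source, score, {source})
    return vectors


def pagesim(vectors, u, v):
    """Sum over all sources of the smaller score held by u and by v."""
    sources = set(vectors[u]) | set(vectors[v])
    return sum(min(vectors[u].get(s, 0.0), vectors[v].get(s, 0.0)) for s in sources)


# Assumed graph for Example 1, with the PR values listed on the slide.
CHAIN = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
CHAIN_PR = {"a": 100, "b": 185, "c": 257.2, "d": 318.6}
CHAIN_SV = similarity_vectors(CHAIN, CHAIN_PR)
print(round(pagesim(CHAIN_SV, "b", "c"), 1))
```

Under these assumptions the sketch gives PS(a,b) = 80, PS(b,c) = 212 and PS(c,d) ≈ 375.4, in line with the PageSim matrix on the slide. The recursion enumerates every simple path, so it is only practical for small graphs like these.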

  16. Demonstrations (cont.) • Example 2: loop link • PageSim matrix (lower triangle, columns in the order a, b, c, d):
      a: 295.2
      b: 246.4  295.2
      c: 230.4  246.4  295.2
      d: 246.4  230.4  246.4  295.2
      • PR = (100, 100, 100, 100) • SimRank matrix (lower triangle, same order):
      a: 1
      b: 0  1
      c: 0  0  1
      d: 0  0  0  1
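The SimRank side of this example can be checked with a small iterative sketch. It assumes the "loop link" graph is the cycle a → b → c → d → a and picks C = 0.8 and 10 iterations arbitrarily; the slides only say that C lies between 0 and 1. Names are illustrative.

```python
def simrank(in_links, c=0.8, iterations=10):
    """Naive SimRank iteration over all page pairs; in_links maps page -> list of I(page)."""
    pages = sorted(in_links)
    # Start from Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in pages for v in pages}
    for _ in range(iterations):
        new = {}
        for u in pages:
            for v in pages:
                if u == v:
                    new[(u, v)] = 1.0
                elif not in_links[u] or not in_links[v]:
                    new[(u, v)] = 0.0  # a page without in-links is similar to nothing else
                else:
                    total = sum(sim[(x, y)] for x in in_links[u] for y in in_links[v])
                    new[(u, v)] = c * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim


# Assumed graph for Example 2: each page's only in-link comes from its predecessor.
LOOP_IN = {"a": ["d"], "b": ["a"], "c": ["b"], "d": ["c"]}
LOOP_SIM = simrank(LOOP_IN)
print(LOOP_SIM[("a", "b")], LOOP_SIM[("a", "a")])
```

In the cycle, every pair of distinct pages depends only on another pair of distinct pages, so all off-diagonal scores stay at their initial value 0 and the result is the identity matrix, as on the slide.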

  17. Demonstrations (cont.) • Example 3: more complex • PageSim matrix (lower triangle, columns in the order 1 to 5):
      1: 100.0
      2: 40.0   487.6
      3: 50.7   159.4  397.4
      4: 10.7   238.5  130.0  275.5
      5: 10.7   130.0  130.0  130.0  314.9
      • PR = (100, 40.0, 50.7, 10.7, 10.7) • SimRank matrix (lower triangle, same order):
      1: 1
      2: 0  1
      3: 0  0.25  1
      4: 0  0     0.5  1
      5: 0  0     0.5  1  1
      • PageSim results • v3 is most similar to v1. • v4 is most similar to v2.
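The two PageSim results can be read off the matrix mechanically. The sketch below uses the lower-triangular values as transcribed from this slide; the row and column split is a reconstruction of the flattened transcript, and the function names are illustrative.

```python
# Lower triangle of the Example 3 PageSim matrix; row i holds PS(v_{i+1}, v_1..v_{i+1}).
LOWER = [
    [100.0],
    [40.0, 487.6],
    [50.7, 159.4, 397.4],
    [10.7, 238.5, 130.0, 275.5],
    [10.7, 130.0, 130.0, 130.0, 314.9],
]


def score(lower, i, j):
    """PS between 0-based pages i and j, using the symmetry of the matrix."""
    return lower[i][j] if i >= j else lower[j][i]


def most_similar(lower, i):
    """Index of the page, other than i itself, with the highest PS score to page i."""
    others = [j for j in range(len(lower)) if j != i]
    return max(others, key=lambda j: score(lower, i, j))


for page in (0, 1):
    print(f"v{page + 1} is most similar to v{most_similar(LOWER, page) + 1}")
```

The diagonal is excluded from the search because every page is most similar to itself.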

  18. Conclusion and current work • Conclusion • Web page similarity measures: text-based & link-based. • PageSim: PageRank score propagation. • Current work • Propagation radius pruning. • How to compare performance of two similarity measures, e.g., PageSim and SimRank? Text-based measures. • Thank you!
