
PageSim: A Novel Link-based Measure of Web Page Similarity

PageSim is a link-based measure of web page similarity, aimed at finding related content on the large and rapidly changing Web. It uses PageRank scores as a measure of page authority and propagates them along hyperlinks, addressing limitations of cocitation, common-neighbor counting, and SimRank.



  1. PageSim: A Novel Link-based Measure of Web Page Similarity LIN Zhenjiang, 28 April 2006 zjlin@cse.cuhk.edu.hk http://www.cse.cuhk.edu.hk/~zjlin

  2. Outline • 1. Background • 2. Motivation • 3. Existing approaches • 4. PageSim: a new approach • 5. Demonstrations • 6. Conclusion and future work

  3. 1. Background I Mining the World-Wide Web I • Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996). • Web mining research integrates work from several research communities (Kosala and Blockeel, July 2000), such as: • Database (DB) • Information retrieval (IR) • Sub-areas of machine learning (ML) • Natural language processing (NLP)

  4. 1. Background II Mining the World-Wide Web II • The WWW is a huge, widely distributed, global information source for • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. • Hyperlink information • Access and usage information • Web site contents and organization

  5. 1. Background III Mining the World-Wide Web III • Growing and changing very rapidly • Broad diversity of user communities • Only a small portion of the information on the Web is truly relevant or useful to Web users • How to find high-quality Web pages on a specified topic? • WWW provides rich sources for data mining

  6. 1. Background IV Challenges on the Web • Finding Relevant Information • Creating knowledge from Information available • Personalization of the information • Learning about customers / individual users • …

  7. 1. Background V Web Mining Taxonomy • Web Content Mining: extract useful information or knowledge from web page contents, including text, images, audio, video, and metadata. • Web Structure Mining: discover useful knowledge from the structure of hyperlinks. • Web Usage Mining: discover user access patterns from Web usage logs.

  8. 1. Background VI Web Structure Mining I • Hyperlinks can be used to infer the notion of authority • The Web consists not only of pages, but also of hyperlinks pointing from one page to another • These hyperlinks contain an enormous amount of latent human annotation • A hyperlink pointing to another Web page can be considered the author's endorsement of that page.

  9. 1. Background VII Web Structure Mining II • Web pages categorization (Chakrabarti, et al., 1998) • Discovering micro-communities on the web - Example: Clever system (Chakrabarti, et al., 1999), Google (Brin and Page, 1998) • Schema Discovery in Semi-structured Environment (identify typical structuring info.)

  10. 2. Motivation I Finding related or similar web pages I • web search engines

  11. 2. Motivation II Finding related or similar web pages II • web document classification

  12. 3. Existing approaches I • Text-based • Classic IR, Jaccard’s coefficient, Adamic/Adar • Pure link-based • Single-step: cocitation, common neighbor, … • Multi-step: • Companion (Dean, Henzinger, 1998) • SimRank (Jeh, Widom, 2002) • Hybrid • Anchor text based (Haveliwala et al. 2002)

  13. 3. Existing approaches II • Notations • Sim(a,b): similarity score of web pages a and b. • I(a): in-link neighbors of web page a. • O(a): out-link neighbors of web page a. • Common neighbor method • Sim(a,b) = |O(a)∩O(b)| = |{c,d}| = 2 • Cocitation method • Sim(a,b) = |I(a)∩I(b)| = |{c,d}| = 2
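The two single-step measures on this slide are plain set intersections. The sketch below is only an illustration: the slide's figures are not part of the transcript, so the two small graphs are hypothetical, chosen so that pages c and d are the shared neighbors, and the function names are made up for this example.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (out_links, in_links) for a list of (source, target) edges."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)   # O(src) gains dst
        in_links[dst].add(src)    # I(dst) gains src
    return out_links, in_links


def common_neighbor(out_links, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|."""
    return len(out_links[a] & out_links[b])


def cocitation(in_links, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|."""
    return len(in_links[a] & in_links[b])


# Hypothetical graph 1: a and b both link to c and d.
out_links, _ = build_neighbors([("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")])
cn = common_neighbor(out_links, "a", "b")

# Hypothetical graph 2: c and d both link to a and b.
_, in_links = build_neighbors([("c", "a"), ("c", "b"), ("d", "a"), ("d", "b")])
cc = cocitation(in_links, "a", "b")

print(cn, cc)
```

Both measures count every shared neighbor as 1, regardless of how authoritative that neighbor is.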

  14. 3. Existing approaches III • SimRank: “two pages are similar if they are referenced (cited, or linked to) by similar pages” • Recursive definition, with base cases (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)|·|I(v)| = 0. • C is a constant between 0 and 1. • The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.
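The recursive formula itself did not survive in the transcript. The sketch below assumes the standard SimRank recurrence from Jeh and Widom (2002): for u ≠ v with non-empty in-link sets, Sim(u,v) = C / (|I(u)|·|I(v)|) · Σ Sim(x,y) over all x in I(u) and y in I(v). The value C = 0.8, the iteration count, and the three-page graph are assumptions made for illustration.

```python
def simrank(in_links, nodes, C=0.8, iterations=10):
    """Naive iterative SimRank; in_links maps a page to its in-link neighbors."""
    # Starting point from the slide: Sim(u,u) = 1, Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                    continue
                iu, iv = in_links.get(u, ()), in_links.get(v, ())
                if not iu or not iv:
                    # Base case (2): a page without in-links scores 0.
                    new[(u, v)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in iu for y in iv)
                new[(u, v)] = C * total / (len(iu) * len(iv))
        sim = new
    return sim


# Hypothetical graph: page a links to both b and c.
nodes = ["a", "b", "c"]
in_links = {"b": ["a"], "c": ["a"]}
scores = simrank(in_links, nodes)
print(scores[("b", "c")], scores[("a", "b")])
```

In this graph b and c are cited by the same page and get a positive score, while a, having no in-links, scores 0 against the pages it links to.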

  15. 4. PageSim: a new approach I • Two considerations • On the Web, not all links are equally important; common neighbor and cocitation treat them all alike. • A similarity measure should be able to score any pair of web pages; SimRank assigns 0 whenever either page has no in-links. • PageSim • Takes both considerations into account.

  16. 4. PageSim: a new approach II • Cocitation • Which page is more similar to d, c or e? • Suppose page a is YAHOO!’s homepage, and b is a personal web page. Authoritative pages are more important.

  17. 4. PageSim: a new approach III • SimRank • Are a and b similar? • SimRank says “NO” in each case. Are these answers reasonable?

  18. 4. PageSim: a new approach IV • Page a linking to b and c means a “thinks” • b and c are similar. • both b and c are similar to a. • Intuitions • Page a spreads similarity to its neighbors. • Authoritative pages spread more similarity.

  19. 4. PageSim: a new approach V • PageSim • In PageSim, the PageRank (PR) score is used to measure the authority of a web page. PR assigns global importance scores to all web pages. • Each page spreads its own similarity score (PR score) to its neighbors. • Each page also propagates other pages’ similarity scores to its neighbors. • After the similarity score propagation has finished, each page contains an array of similarity scores. • PageRank score propagation
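The slides do not say which PageRank variant or scaling is used. As a rough sketch only, the code below assumes an unnormalized form, PR(v) = base + damping · Σ PR(u)/|O(u)| over in-link neighbors u, with base = 100 and damping = 0.85 (both assumptions). On a four-page chain a → b → c → d this happens to agree, up to rounding, with the PR vector (100, 185, 257.2, 318.6) listed for the single-link demonstration; other examples in the deck may use a different scaling.

```python
def pagerank(out_links, nodes, base=100.0, damping=0.85, iterations=50):
    """Unnormalized PageRank sketch: PR(v) = base + damping * sum(PR(u)/|O(u)|)."""
    pr = {v: base for v in nodes}
    for _ in range(iterations):
        new = {v: base for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                # u hands an equal share of its damped score to each out-link.
                new[v] += damping * pr[u] / len(targets)
        pr = new
    return pr


# Chain a -> b -> c -> d (assumed shape of the single-link example).
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
pr = pagerank(chain, ["a", "b", "c", "d"])
print(pr)
```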

  20. 4. PageSim: a new approach VI • Example: similarity propagation (page a only) • PR(a)=100, PR(b)=55, PR(c)=102 • Each page propagates 80% of its similarity score, divided equally among its neighbors.
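The propagation rule can be sketched as a walk along out-links. The example graph is not in the transcript, so the edges below (a → b, a → c, b → c, c → a) are an inferred assumption, as is the choice to follow only simple paths, meaning a score never revisits a page on its way. Under those assumptions the result rounds to the score vectors given on the next slide.

```python
def propagate(out_links, pr, decay=0.8):
    """Return sv, where sv[v][u] is the score page u has propagated to page v."""
    sv = {v: {u: 0.0 for u in pr} for v in pr}

    def walk(source, page, score, visited):
        targets = out_links.get(page, [])
        for nxt in targets:
            if nxt in visited:          # assumption: simple paths only
                continue
            # 80% of the score, split equally over all out-links of `page`.
            share = decay * score / len(targets)
            sv[nxt][source] += share
            walk(source, nxt, share, visited | {nxt})

    for u in pr:
        sv[u][u] = pr[u]                # each page keeps its own PR score
        walk(u, u, pr[u], {u})
    return sv


# Inferred (hypothetical) graph for the three-page example.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = {"a": 100.0, "b": 55.0, "c": 102.0}
sv = propagate(out_links, pr)
for page in "abc":
    print(page, [round(sv[page][u], 2) for u in "abc"])
```

Page c, for instance, receives 40 from a directly and another 32 via b, which gives the 72 in its vector.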

  21. 4. PageSim: a new approach VII • Example: similarity propagation II • PR(a)=100, PR(b)=55, PR(c)=102 • Each page contains a similarity score vector (SV). • SV(a) = (100, 35, 82) • SV(b) = (40, 55, 33) • SV(c) = (72, 44, 102) • PageSim score (PS) computation • PS(a,b) = Σ min(SV(a), SV(b)), taken element-wise = 40+35+33 = 108 • Two pages are more similar if they share more common similarity scores.

  22. 4. PageSim: a new approach VIII • Example: similarity spreading III • PageSim score matrix • PS_matrix = (PS(u,v)), of size n×n (lower triangle shown):
        a: 217
        b: 108  128
        c: 189  117  219
      • PS_matrix is symmetric. • PS(a,b) = PS(b,a) • Any web page is most similar to itself. • PS(u,u) = max over v of PS(u,v).
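As a small check of the score computation, the sketch below rebuilds the matrix from the three rounded score vectors in the example; the function name `pagesim` is made up for this illustration. Every entry agrees with the matrix above except PS(c,c), which comes out as 218 from the rounded vectors, one less than the 219 shown, possibly because the displayed values are rounded.

```python
def pagesim(sv_u, sv_v):
    """PS(u,v): sum of element-wise minima of two similarity score vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))


# Rounded score vectors from the example, ordered (from a, from b, from c).
sv = {"a": (100, 35, 82), "b": (40, 55, 33), "c": (72, 44, 102)}

ps = {(u, v): pagesim(sv[u], sv[v]) for u in sv for v in sv}

for u in sv:
    # Print the lower triangle, one row per page.
    print(u, [ps[(u, v)] for v in sv if v <= u])
```

Symmetry follows from min being symmetric, and PS(u,u) is the largest entry in its row because min(x, y) can never exceed x.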

  23. 4. PageSim: a new approach IX • Propagation radius pruning I • The time complexity of propagating one page’s similarity score to all the others is O(kn), where k is the average number of out-links. • The similarity score propagated to distant pages is small enough to be omitted. • Reducing complexity of propagation to O(kr) by limiting the radius of propagation to r.
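Radius pruning can be sketched by stopping the walk after r hops. This is only an illustration under assumptions not stated on the slide: a decay of 0.8 per hop, equal splitting among out-links, simple paths only, and a hypothetical four-page chain as the test graph.

```python
def propagate_within_radius(out_links, source, score, radius, decay=0.8):
    """Scores received from `source` by pages at most `radius` hops away."""
    received = {}

    def walk(page, amount, depth, visited):
        if depth == radius:             # pruning: do not go past radius r
            return
        targets = out_links.get(page, [])
        for nxt in targets:
            if nxt in visited:          # assumption: simple paths only
                continue
            share = decay * amount / len(targets)
            received[nxt] = received.get(nxt, 0.0) + share
            walk(nxt, share, depth + 1, visited | {nxt})

    walk(source, score, 0, {source})
    return received


# Hypothetical chain a -> b -> c -> d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
full = propagate_within_radius(chain, "a", 100.0, radius=3)
pruned = propagate_within_radius(chain, "a", 100.0, radius=2)
print(full)
print(pruned)
```

With radius 2 the most distant page d is simply never visited, and the share it loses is the smallest one in the chain.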

  24. 4. PageSim: a new approach X • Propagation radius pruning II • Real data (CSE homepage) and synthetic data

  25. 5. Demonstrations I • Example 1: single link • PageSim matrix (lower triangle):
        a: 100
        b: 80    265
        c: 64    212    469.2
        d: 51.2  169.6  375.4  694.1
      • PR = (100, 185, 257.2, 318.6) • SimRank matrix (lower triangle):
        1
        0  1
        0  0  1
        0  0  0  1

  26. 5. Demonstrations II • Example 2: loop link • PageSim matrix (lower triangle):
        a: 295.2
        b: 246.4  295.2
        c: 230.4  246.4  295.2
        d: 246.4  230.4  246.4  295.2
      • PR = (100, 100, 100, 100) • SimRank matrix (lower triangle):
        1
        0  1
        0  0  1
        0  0  0  1

  27. 5. Demonstrations III • Example 3: more complex • PageSim matrix (lower triangle):
        1: 100.0
        2: 40.0   487.6
        3: 50.7   159.4  397.4
        4: 10.7   238.5  130.0  275.5
        5: 10.7   130.0  130.0  130.0  314.9
      • PR = (100, 40.0, 50.7, 10.7, 10.7) • SimRank matrix (lower triangle):
        1: 1
        2: 0  1
        3: 0  0.25  1
        4: 0  0     0.5  1
        5: 0  0     0.5  1  1
      • PageSim results • v3 is most similar to v1. • v4 is most similar to v2.

  28. 6. Conclusion and future work I • Conclusion • Web Mining • Web page similarity measures: text-based, link-based, and hybrid • PageSim: PageRank score propagation. • Propagation radius pruning • PageSim vs SimRank

  29. 6. Conclusion and future work II • Future work • Evaluation of PageSim • Taking the traditional text-based similarity measure TFIDF as ground truth. • Efficiency of computation • Since computing the PageSim score of two web pages is O(n), computing all n² pairs of pages is O(n³). • Storage issue • Since each page needs an array of length n to store similarity scores issued from all web pages, the storage needed by PageSim is O(n²).

  30. Q & A Thank you!
