
The Structure of Broad Topics on the Web

Soumen Chakrabarti, Mukul M. Joshi, et al. Presentation by Na Dai.


Presentation Transcript


  1. The Structure of Broad Topics on the Web • Soumen Chakrabarti, Mukul M. Joshi, et al. • Presentation by Na Dai

  2. Introduction & Contribution • Convergence of topic distribution on undirected random walks • Degree distribution restricted to topics • How topic-biased are breadth-first crawls? • Representation of topics in Web directories • Topic convergence on directed walks • Link-based vs. content-based Web communities

  3. Building Blocks • Sampling Web pages • PageRank-based random walk → Wander walk • The Bar-Yossef random walk → Sampling walk • Undirected graph • Regular • Taxonomy design & Document classification • 271,954 topics, 6 levels, 1,697,266 sample URLs • Pruned taxonomy: 482 leaf nodes, 144,859 sample URLs • Classification: Rainbow naïve Bayes classifier

  4. Convergence • Sampling method • Sampling walk • Topic distribution of a set • Soft counting • Difference measure • L1 distance
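
A minimal sketch of the two measurement steps named on this slide, assuming each sampled page already carries a class probability vector from the classifier. Soft counting is read here as averaging those vectors over the sample, and the difference measure is the L1 distance between two such averages. The function names and the two-topic toy data are illustrative assumptions, not taken from the presentation.

```python
from typing import Dict, List

TopicVector = Dict[str, float]


def soft_count(class_probs: List[TopicVector]) -> TopicVector:
    """Average per-page class probability vectors into one topic distribution."""
    if not class_probs:
        raise ValueError("need at least one page")
    totals: TopicVector = {}
    for probs in class_probs:
        for topic, p in probs.items():
            # Each page adds its probability mass to a topic, not a hard 0/1 vote.
            totals[topic] = totals.get(topic, 0.0) + p
    n = len(class_probs)
    return {topic: total / n for topic, total in totals.items()}


def l1_distance(p: TopicVector, q: TopicVector) -> float:
    """Sum of absolute differences over the union of topics."""
    topics = set(p) | set(q)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in topics)


# Hypothetical classifier outputs for two small page samples.
sample_a = [{"arts": 0.75, "sports": 0.25}, {"arts": 0.25, "sports": 0.75}]
sample_b = [{"arts": 1.0}, {"arts": 0.5, "sports": 0.5}]

dist_a = soft_count(sample_a)
dist_b = soft_count(sample_b)
gap = l1_distance(dist_a, dist_b)
```

Under this reading, convergence would show up as the L1 distance between the distributions of successive samples shrinking as the walk gets longer.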

  5. The background distribution vs. breadth-first crawls

  6. Faithful representation of topics in Web directory

  7. Topic-specific degree distributions • Power law distribution • Pr(i) = k / i^x (x > 1) • Contribution to class c • Soft counting • Δ_d p_c(d)
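
To make the power-law form concrete, the sketch below builds Pr(i) = k / i^x over a finite degree range, choosing k so the probabilities sum to 1, and then recovers x from the slope of a log-log least-squares fit. The finite cutoff `max_degree`, the function names, and the fitting method are assumptions made for illustration; the slides do not say how the exponents were estimated.

```python
import math
from typing import Dict


def power_law_pmf(x: float, max_degree: int) -> Dict[int, float]:
    """Pr(i) = k / i**x for i = 1..max_degree, with k chosen so the sum is 1."""
    weights = {i: i ** -x for i in range(1, max_degree + 1)}
    k = 1.0 / sum(weights.values())
    return {i: k * w for i, w in weights.items()}


def fit_exponent(pmf: Dict[int, float]) -> float:
    """Estimate x as the negated slope of log Pr(i) against log i."""
    points = [(math.log(i), math.log(p)) for i, p in pmf.items() if p > 0]
    n = len(points)
    mean_x = sum(a for a, _ in points) / n
    mean_y = sum(b for _, b in points) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in points)
    var = sum((a - mean_x) ** 2 for a, _ in points)
    return -cov / var


pmf = power_law_pmf(2.0, 100)
estimated_x = fit_exponent(pmf)
```

On exact power-law data the fit returns the exponent up to rounding error. On real degree histograms the tail is noisy, so a plain log-log fit is only a rough first look.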

  8. Topical locality and link-based prestige ranking • Sampling method • Wander walk • Class selection • Dmoz, well-populated • Collect all the pages at distance i (i>0)

  9. Topical locality and link-based prestige ranking

  10. Relations between topics • Topic citation matrix • Contribution to topic citation matrix C • C ← C + p(u)^T p(v) • Implications and application • Improved hypertext classification • Enhanced focused crawling • Reorganizing topic directories
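
The update C ← C + p(u)^T p(v) can be sketched as follows, assuming one update per hyperlink (u, v) and treating p(u) and p(v) as the class probability row vectors of the source and target pages, so that each link adds their outer product to C. The page names, the two-topic vectors, and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def topic_citation_matrix(
    links: List[Tuple[str, str]],
    class_probs: Dict[str, List[float]],
    num_topics: int,
) -> List[List[float]]:
    """Accumulate the outer product p(u)^T p(v) over all links (u, v)."""
    C = [[0.0] * num_topics for _ in range(num_topics)]
    for u, v in links:
        pu, pv = class_probs[u], class_probs[v]
        for i in range(num_topics):
            for j in range(num_topics):
                # Soft count: the link counts as topic i citing topic j
                # with weight p_i(u) * p_j(v).
                C[i][j] += pu[i] * pv[j]
    return C


# Hypothetical pages with class probabilities over two topics.
class_probs = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
links = [("a", "b"), ("b", "c")]

C = topic_citation_matrix(links, class_probs, num_topics=2)
```

Because each probability vector sums to 1, every link adds a total weight of 1 to C, so the entries of C sum to the number of links processed.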

  11. Concluding remarks • Characterize some important notions of topical locality on the web • Open problems • PageRank jump parameter • Topical stability of distillation algorithms • Better crawling algorithms

  12. Q & A?
