The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, et al.
Presentation by Na Dai
Introduction & Contribution
• Convergence of topic distribution on undirected random walks
• Degree distribution restricted to topics
• How topic-biased are breadth-first crawls?
• Representation of topics in Web directories
• Topic convergence on directed walks
• Link-based vs. content-based Web communities
Building Blocks
• Sampling Web pages
  • PageRank-based random walk ("Wander walk")
  • The Bar-Yossef random walk ("Sampling walk")
    • Undirected graph
    • Regular graph
• Taxonomy design & document classification
  • 271,954 topics, 6 levels, 1,697,266 sample URLs
  • Pruned taxonomy: 482 leaf nodes, 144,859 sample URLs
  • Classification: Rainbow naïve Bayes classifier
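The "Sampling walk" bullet can be illustrated with a minimal Python sketch. It assumes the usual way of making an undirected graph regular: every node is conceptually padded with self-loops up to a common degree, so the walk moves to a random neighbor with probability deg(u) / max_degree and otherwise stays put. The names `undirected_adjacency`, `sampling_walk`, and `max_degree` are illustrative, not taken from the paper.

```python
import random


def undirected_adjacency(edges):
    """Build an undirected adjacency map from directed (u, v) link pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Sorted lists keep the walk reproducible for a fixed seed.
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def sampling_walk(adj, start, steps, max_degree, rng=None):
    """Random walk on an undirected graph made regular by implicit self-loops.

    Assumption: each node behaves as if it had max_degree edges, the missing
    ones being self-loops, so the walk stays in place with probability
    1 - deg(u) / max_degree.
    """
    rng = rng or random.Random(0)
    node = start
    path = [node]
    for _ in range(steps):
        nbrs = adj[node]
        if rng.random() < len(nbrs) / max_degree:
            node = rng.choice(nbrs)
        path.append(node)
    return path


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
adj = undirected_adjacency(edges)
max_degree = max(len(nbrs) for nbrs in adj.values())
path = sampling_walk(adj, "a", 50, max_degree)
```

On a connected regular undirected graph of this kind the walk's stationary distribution is uniform over nodes, which is why it is suited to sampling pages rather than ranking them.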
Convergence
• Sampling method: Sampling walk
• Topic distribution of a set: soft counting
• Difference measure: L1 distance
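A small sketch of the two measures on this slide, assuming each page comes with a classifier output given as a dict of topic probabilities. Soft counting is taken here to mean averaging those probability vectors over the set; the function names and the example topics are hypothetical.

```python
def topic_distribution(class_probs):
    """Soft-count topic distribution: the mean of per-page probability vectors."""
    if not class_probs:
        raise ValueError("need at least one page")
    totals = {}
    for probs in class_probs:
        for topic, p in probs.items():
            totals[topic] = totals.get(topic, 0.0) + p
    n = len(class_probs)
    return {topic: total / n for topic, total in totals.items()}


def l1_distance(p, q):
    """L1 distance between two topic distributions (missing topics count as 0)."""
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))


# Hypothetical classifier outputs for two sampled pages.
sample = [
    {"arts": 0.8, "science": 0.2},
    {"arts": 0.4, "science": 0.6},
]
dist = topic_distribution(sample)
reference = {"arts": 0.5, "science": 0.5}
gap = l1_distance(dist, reference)
```

Tracking `gap` between the distribution after k steps and a reference distribution is one way to watch a walk's topic mix converge.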
Topic-specific degree distributions
• Power law distribution
  • $\Pr(i) = k / i^{x}$, $x > 1$
• Contribution to class c
  • Soft counting
  • $\Delta_d \, p_c(d)$
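To make the power-law bullet concrete, here is a rough sketch that estimates k and x from a degree histogram by least squares on the log-log form log Pr(i) = log k - x log i. This fitting procedure is an assumption for illustration, not necessarily how the paper estimates exponents; `fit_power_law` and the synthetic `freq` data are hypothetical, and the histogram weights could equally be soft counts for one class.

```python
import math


def fit_power_law(freq):
    """Fit Pr(i) = k / i**x by least squares on log-log points.

    freq maps a degree i (> 0) to its (possibly soft-counted) frequency.
    Returns (x, k).
    """
    pts = [(math.log(i), math.log(f)) for i, f in freq.items() if i > 0 and f > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive (degree, frequency) points")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -x; the intercept is log k.
    return -slope, math.exp(intercept)


# Synthetic histogram that follows an exact power law with x = 2.5, k = 2.0.
freq = {i: 2.0 * i ** -2.5 for i in range(1, 50)}
x_hat, k_hat = fit_power_law(freq)
```

On real degree data the tail is noisy, so a plain log-log fit like this is only a first look.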
Topical locality and link-based prestige ranking
• Sampling method: Wander walk
• Class selection: well-populated Dmoz classes
• Collect all the pages at distance i (i > 0)
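Collecting pages at distance i can be sketched as a breadth-first traversal from a set of seed pages. The sketch assumes that distance means shortest link distance along out-links and that the link graph is available as an in-memory dict; `pages_by_distance`, `out_links`, and the toy graph are hypothetical.

```python
from collections import deque


def pages_by_distance(out_links, seeds, max_distance):
    """Group pages by shortest link distance (1..max_distance) from the seeds."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    layers = {}
    while queue:
        u = queue.popleft()
        if dist[u] >= max_distance:
            continue  # do not expand beyond the requested radius
        for v in out_links.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                layers.setdefault(dist[v], []).append(v)
                queue.append(v)
    return layers


# Toy link graph: page -> pages it links to.
out_links = {
    "s": ["a", "b"],
    "a": ["c"],
    "b": ["c", "s"],
    "c": ["d"],
}
layers = pages_by_distance(out_links, ["s"], 2)
```

The topic distribution of each layer can then be compared with that of the seed class to see how quickly topical focus fades with distance.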
Relations between topics
• Topic citation matrix
  • Contribution of a link (u, v) to the topic citation matrix C
  • $C \leftarrow C + p(u)^{T} p(v)$
• Implications and applications
  • Improved hypertext classification
  • Enhanced focused crawling
  • Reorganizing topic directories
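The update rule for the topic citation matrix can be sketched in plain Python, treating p(u) as a row vector of class probabilities so that each link (u, v) adds the outer product of p(u) and p(v) to C. The function name, the dict-based probability format, and the example pages are assumptions for illustration.

```python
def topic_citation_matrix(links, class_probs, topics):
    """Accumulate C <- C + p(u)^T p(v) over all links (u, v).

    class_probs maps a page to a dict of topic probabilities;
    C[i][j] soft-counts links from topic i to topic j.
    """
    k = len(topics)
    C = [[0.0] * k for _ in range(k)]
    for u, v in links:
        pu, pv = class_probs[u], class_probs[v]
        for i, src in enumerate(topics):
            for j, dst in enumerate(topics):
                C[i][j] += pu.get(src, 0.0) * pv.get(dst, 0.0)
    return C


topics = ["arts", "science"]
class_probs = {
    "u": {"arts": 1.0},
    "v": {"science": 1.0},
    "w": {"arts": 0.5, "science": 0.5},
}
links = [("u", "v"), ("u", "w")]
C = topic_citation_matrix(links, class_probs, topics)
```

When each page's probabilities sum to 1, the entries of C sum to the number of links, so normalizing each row gives an estimate of which topics a given topic tends to cite.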
Concluding remarks
• Characterize some important notions of topical locality on the Web
• Open problems
  • PageRank jump parameter
  • Topical stability of distillation algorithms
  • Better crawling algorithms