Block-level Link Analysis
Presented by Lan Nie, 11/08/2005, Lehigh University
Introduction
• A web page often contains multiple semantic topics
• Different parts of a page have different importance and topics
• Links in different semantic blocks point to pages on different topics
• Page importance may be miscalculated by PageRank, and topic drift may happen in HITS
• Split each page into semantic blocks
• Apply link analysis at the block level
Vision-Based Page Segmentation
Construct a semantic tree for a page based on its layout structure
• Extract blocks from the HTML DOM tree
• Organize the blocks into a semantic tree based on separators
• Node: a block with a Degree of Coherence (DOC) value indicating how coherent the content in the block is
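Block Level Web Graph
• P: set of all the pages
• B: set of all the blocks
• X: page-to-block matrix (layout structure); the entry for page i and block j is the importance of block j within page i, given by the block importance function f
• f: block importance function; large, centered blocks score higher than small blocks at the margin
• Z: block-to-page matrix (link structure); the entries in row i are normalized by the number of pages that block i links to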
WP: Page-to-Page Graph
• A weighted adjacency matrix
• Links in blocks with high importance values get more weight than those in blocks with low importance values
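
A minimal sketch of how the page-to-page matrix could be assembled from the X and Z matrices defined earlier. It assumes WP (written `W_P` in the code) is the product of X and Z, which fits the dimensions (pages × blocks times blocks × pages), and that each nonzero entry of Z is 1 divided by the number of pages the block links to. The slides do not spell out either formula, and the toy pages, blocks, importance values, and helper names are all made up for illustration.

```python
# Toy data (hypothetical): 2 pages, 3 blocks.
# X[p][b] is the importance of block b within page p (0 if b is not in p).
X = [
    [0.8, 0.2, 0.0],  # page 0: block 0 (main content), block 1 (sidebar)
    [0.0, 0.0, 1.0],  # page 1: block 2 only
]

# links[b] is the set of pages that block b links to.
links = {0: {1}, 1: {0, 1}, 2: {0}}


def block_to_page(links, n_blocks, n_pages):
    """Assumed form of Z: Z[b][p] = 1 / (number of pages b links to), else 0."""
    Z = [[0.0] * n_pages for _ in range(n_blocks)]
    for b, targets in links.items():
        for p in targets:
            Z[b][p] = 1.0 / len(targets)
    return Z


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


Z = block_to_page(links, n_blocks=3, n_pages=2)
W_P = matmul(X, Z)  # assumed: page-to-page weights = X times Z
```

In this toy graph the link from page 0 to page 1 sits mostly in the high-importance block, so it receives far more weight than the self-link that appears only in the sidebar.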
WB: Block-to-Block Graph (not used in this paper)
• Extension: the probability of jumping from block a to block b within a page is the DOC value of the smallest block containing both block a and block b
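
The "smallest block containing both" rule amounts to finding the lowest common ancestor in the segmentation tree. The sketch below illustrates that lookup under stated assumptions: the tree, block names, and DOC values are hypothetical, and the returned value is an unnormalized weight (turning it into a true probability would need a normalization step the slides do not describe).

```python
# Hypothetical segmentation tree for one page: child -> parent.
parent = {
    "page": None,
    "header": "page",
    "body": "page",
    "nav": "header",
    "logo": "header",
    "article": "body",
    "ads": "body",
}

# Hypothetical DOC values for the non-leaf blocks.
doc = {"page": 0.1, "header": 0.6, "body": 0.4}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def smallest_common_block(a, b):
    """Lowest block in the tree that contains both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("blocks are not in the same tree")


def jump_weight(a, b):
    """DOC of the smallest block containing both a and b (unnormalized)."""
    return doc[smallest_common_block(a, b)]
```

With these toy values, `jump_weight("nav", "logo")` uses the header's DOC, while `jump_weight("nav", "ads")` falls back to the DOC of the whole page, so jumps between closely related blocks get the larger weight.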
Block Level PageRank (BLPR)
• Apply PageRank on the weighted adjacency matrix WP
• Each edge is weighted by the importance value of the block that contains the link
• Pages pointed to by advertisement hyperlinks are unlikely to be assigned a large score, since such links typically sit in less important blocks
• Block-level PageRank can reflect the semantic structure of the web
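
As a rough illustration of PageRank on a weighted adjacency matrix, here is a plain power-iteration sketch. The damping factor 0.85, the iteration count, the uniform handling of pages without out-links, and the toy matrix are assumptions for the example and are not taken from the slides.

```python
def pagerank(W, damping=0.85, iters=100):
    """Power iteration on a weighted adjacency matrix W (list of rows)."""
    n = len(W)
    # Row-normalize so each row is a probability distribution.
    # Rows with no outgoing weight jump uniformly to every page.
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    r = [1.0 / n] * n
    for _ in range(iters):
        r = [
            (1 - damping) / n + damping * sum(r[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
    return r


# Toy graph (hypothetical): page 0 links to page 1 from its main content
# block (weight 0.9) and to page 2 from an advertisement block (weight 0.1).
W = [
    [0.0, 0.9, 0.1],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = pagerank(W)
```

Page 2 is reached only through the low-weight advertisement link, so it ends up with a lower score than page 1 even though both have exactly one in-link.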
Block Level HITS (BLHITS)
• Apply HITS on the block-to-page matrix Z
• A page has only an authority score A, and a block has only a hub score H
• Because each block is its own hub, links in different parts of a page are treated differently
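
A minimal sketch of the HITS update on a block-to-page matrix, where blocks carry hub scores and pages carry authority scores. The L2 normalization, the iteration count, and the toy matrix (unweighted 0/1 entries for simplicity) are assumptions; root-set construction and content analysis are left out.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def block_hits(Z, iters=50):
    """HITS where rows of Z are blocks (hubs) and columns are pages (authorities)."""
    n_blocks, n_pages = len(Z), len(Z[0])
    hub = [1.0] * n_blocks
    auth = [1.0] * n_pages
    for _ in range(iters):
        # A page's authority is the sum of hub scores of blocks linking to it.
        auth = normalize(
            [sum(Z[b][p] * hub[b] for b in range(n_blocks)) for p in range(n_pages)]
        )
        # A block's hub score is the sum of authority scores of pages it links to.
        hub = normalize(
            [sum(Z[b][p] * auth[p] for p in range(n_pages)) for b in range(n_blocks)]
        )
    return hub, auth


# Toy graph (hypothetical): block 0 -> pages 0 and 1, block 1 -> page 1,
# block 2 -> page 2.
Z = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hub, auth = block_hits(Z)
```

Page 1 is linked from two blocks and comes out as the strongest authority, and block 0, which points at both well-linked pages, comes out as the strongest hub.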
Main differences between BLHITS and HITS
• Links go from blocks to pages rather than from pages to pages
• The root set is made up of top-ranked blocks rather than top-ranked pages
• When expanding the root set, only out-links contained in the top-ranked blocks of a page are considered, instead of all links
• Content analysis is combined at the block level instead of the page level
• Link weight: importance value of the block / maximum block importance value
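
The link-weight rule in the last bullet can be sketched as a simple normalization. The slides do not say which set of blocks the maximum is taken over; the example assumes it is whatever collection of blocks is passed in, and the block names and importance values are hypothetical.

```python
def link_weights(importance):
    """Map each block to importance / maximum importance in the given set."""
    top = max(importance.values())
    if top <= 0:
        raise ValueError("at least one block needs a positive importance value")
    return {block: value / top for block, value in importance.items()}


# Hypothetical importance values for the blocks under consideration.
weights = link_weights({"main": 0.8, "sidebar": 0.4, "ads": 0.2})
```

Links in the most important block keep a weight of 1, and links in every other block are scaled down in proportion to that block's importance.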
Experiments
• Dataset: TREC2003
• Relevance weighting: BM2500
• PR and BLPR
• HITS and BLHITS
• Size of root set: 200
• In-link parameter d: 50
• Adopting Bharat and Henzinger’s idea
• Eliminate mutually reinforcing relationships between hosts
• Combine connectivity and content analysis
Results on PR & BLPR
1. First 15 pages in the .GOV dataset
2. Results on TREC2003: combine the relevance score (using BM2500) and the importance score (using the ranking algorithm)
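
The slides do not say how the relevance and importance scores are combined. One common option, shown here purely as an assumption, is a weighted sum of min-max normalized scores; the mixing weight `alpha`, the function names, and the sample scores are all hypothetical.

```python
def min_max(scores):
    """Rescale a dict of scores to [0, 1]; all zeros if every score is equal."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def combine(relevance, importance, alpha=0.5):
    """Assumed combination: alpha * relevance + (1 - alpha) * importance."""
    rel, imp = min_max(relevance), min_max(importance)
    return {page: alpha * rel[page] + (1 - alpha) * imp[page] for page in rel}


# Hypothetical scores for three pages.
relevance = {"a": 10.0, "b": 8.0, "c": 2.0}   # e.g. from a BM2500-style ranker
importance = {"a": 0.1, "b": 0.5, "c": 0.4}   # e.g. from PR or BLPR

combined = combine(relevance, importance)
ranked = sorted(combined, key=combined.get, reverse=True)
```

Here page "b" moves ahead of page "a" because its slightly lower relevance is outweighed by a much higher importance score.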