Explore how the different semantic blocks of a web page affect its importance and topic relevance, using link analysis at the block level. This study presents a method to segment web pages into semantic blocks, construct a semantic tree, and apply the PageRank and HITS algorithms at the block level to capture the semantic structure of web pages. Experiments on the TREC2003 dataset show the effectiveness of block-level analysis in improving search relevance.
Block-level Link Analysis Presented by Lan Nie 11/08/2005, Lehigh University
Introduction • A web page often contains multiple semantics • Different parts of a page differ in importance and topic • Links in different semantic blocks point to pages on different topics • PageRank may miscalculate a page's importance, and topic drift may occur in HITS • Split each page into semantic blocks • Apply link analysis at the block level
Vision-Based Page Segmentation Construct a semantic tree for a page based on its layout structure • Extract blocks from the HTML DOM tree • Assemble the blocks into a semantic tree based on separators • Node: a block with a value (DOC) indicating how coherent the content in the block is
Block Level Web Graph • P: set of all the pages • B: set of all the blocks • X: page-to-block matrix (layout structure) • f: block importance function, which favors large, centered blocks over small blocks at the margins • Z: block-to-page matrix (link structure), defined using the number of pages that block i links to
WP: Page-to-Page Graph • A weighted adjacency matrix • Links in blocks with a high importance value receive more weight than links in blocks with a low importance value
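The formulas for these matrices are not given on the slides, so the following is only a minimal sketch under assumed definitions: X[p][b] holds the importance f of block b within page p (0 if b is not in p), Z[b][q] is 1/s_b when block b links to page q (with s_b the number of pages block b links to), and WP is taken to be the product XZ. The toy pages, blocks, importance values, and helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def block_to_page(links, n_pages):
    """Build Z (assumed form): row b spreads weight 1/s_b over the s_b pages it links to."""
    z = []
    for targets in links:
        distinct = set(targets)
        row = [0.0] * n_pages
        for q in distinct:
            row[q] = 1.0 / len(distinct)
        z.append(row)
    return z


# Hypothetical toy web: 3 pages, 4 blocks.
# Page 0 owns block 0 (main content) and block 1 (sidebar ad);
# pages 1 and 2 each consist of a single block.
X = [
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
links = [[1], [2], [0, 2], [0]]  # links[b] = pages that block b points to
Z = block_to_page(links, n_pages=3)
W_P = matmul(X, Z)  # assumed page-to-page graph: W_P = X Z

for row in W_P:
    print(row)
```

In this toy graph, page 0's link from its main block carries weight 0.8 while the link from its ad block carries only 0.2, which is the behavior the slide describes.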
WB: Block-to-Block Graph (not used in this paper) • Extension: the probability of jumping from block a to block b within a page is the DOC value of the smallest block containing both a and b
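The "smallest block containing both" lookup can be sketched as a lowest-common-ancestor search in the semantic tree. This is an illustration only: the Block class, the tree, and the DOC values (assumed here to lie in [0, 1]) are hypothetical, and any normalization needed to turn the values into a proper probability distribution is not shown.

```python
class Block:
    """A node of a page's semantic tree with a coherence (DOC) value."""

    def __init__(self, name, doc, parent=None):
        self.name = name
        self.doc = doc
        self.parent = parent


def ancestors(block):
    """Return the chain from the block up to the root, starting with the block itself."""
    chain = []
    while block is not None:
        chain.append(block)
        block = block.parent
    return chain


def smallest_common_block(a, b):
    """Find the lowest tree node that contains both a and b."""
    seen = {id(node) for node in ancestors(a)}
    for node in ancestors(b):
        if id(node) in seen:
            return node
    raise ValueError("blocks are not in the same page tree")


def jump_weight(a, b):
    """DOC value of the smallest block containing both a and b."""
    return smallest_common_block(a, b).doc


# Hypothetical page: a loosely coherent root holding a content area and a nav bar.
page = Block("page", doc=0.2)
content = Block("content", doc=0.7, parent=page)
nav = Block("nav", doc=0.9, parent=page)
para1 = Block("para1", doc=1.0, parent=content)
para2 = Block("para2", doc=1.0, parent=content)

print(jump_weight(para1, para2))  # both sit inside "content"
print(jump_weight(para1, nav))    # only the whole page contains both
```

Two paragraphs inside the same content area get a higher jump weight than a paragraph and the navigation bar, whose only common container is the whole page.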
Block Level PageRank (BLPR) • Apply PageRank to the weighted adjacency matrix WP • Each edge is weighted by its block's importance value • Pages pointed to by advertisement hyperlinks might not be assigned a large score, since such links are always in less important blocks • Block-level PageRank can reflect the semantic structure of the web
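A minimal sketch of the idea, assuming a standard power-iteration PageRank with a damping factor of 0.85 and row normalization of WP (neither is stated on the slide). The two toy matrices are hypothetical: W_P weights page 0's links by block importance, while W_flat weights the same links equally.

```python
def pagerank(w, damping=0.85, iters=100):
    """Power-iteration PageRank on a weighted adjacency matrix (list of rows)."""
    n = len(w)
    # Row-normalise so each page distributes a total weight of 1;
    # a page without out-links spreads its weight uniformly.
    rows = []
    for row in w:
        total = sum(row)
        rows.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n
                + damping * sum(rank[i] * rows[i][j] for i in range(n))
                for j in range(n)]
    return rank


# Page 0 links to page 1 from its main block (0.8) and to page 2 from an ad block (0.2).
W_P = [
    [0.0, 0.8, 0.2],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]
# Same link structure with every link on a page weighted equally.
W_flat = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]

blpr = pagerank(W_P)
pr = pagerank(W_flat)
print("BLPR:", [round(x, 3) for x in blpr])
print("PR:  ", [round(x, 3) for x in pr])
```

In this toy example, page 2, which page 0 reaches only through its ad block, scores lower under the block-weighted matrix than under the flat one.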
Block Level HITS (BLHITS) • Apply HITS to the block-to-page matrix Z • A page has only an authority score A, and a block has only a hub score H • Different parts of a page are treated as different hubs, so the links they contain are treated differently
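The hub/authority split between blocks and pages can be sketched with the usual HITS update rules applied to a rectangular matrix. This is an illustrative sketch, not the presented implementation: the unweighted toy Z, the iteration count, and the L2 normalization are assumptions.

```python
import math


def normalise(v):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def block_hits(z, iters=50):
    """Iterate HITS on a block-to-page matrix: blocks get hub scores, pages get authority scores."""
    n_blocks, n_pages = len(z), len(z[0])
    hub = [1.0] * n_blocks
    auth = [1.0] * n_pages
    for _ in range(iters):
        # A page's authority is the weighted sum of the hub scores of blocks linking to it.
        auth = normalise([sum(z[b][p] * hub[b] for b in range(n_blocks))
                          for p in range(n_pages)])
        # A block's hub score is the weighted sum of the authorities it links to.
        hub = normalise([sum(z[b][p] * auth[p] for p in range(n_pages))
                         for b in range(n_blocks)])
    return hub, auth


# Hypothetical toy graph: blocks 0 and 1 both link to pages 0 and 1;
# block 2 (say, an ad block) links only to page 2.
Z = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hub, auth = block_hits(Z)
print("hub scores (blocks):    ", [round(h, 3) for h in hub])
print("authority scores (pages):", [round(a, 3) for a in auth])
```

Because hubs are blocks rather than whole pages, the isolated ad block cannot borrow hub weight from the other blocks, and the page it points to ends up with a negligible authority score.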
Main differences between BLHITS and HITS • Links from blocks to pages vs. links from pages to pages • The root set is made up of top-ranked blocks rather than top-ranked pages • When expanding the root set, consider only the out-links contained in a page's top-ranked blocks instead of all links • Combine content analysis at the block level instead of the page level • Link weight: importance value of the block / maximum block importance value
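The link-weighting rule in the last bullet amounts to a simple rescaling. The sketch below assumes the maximum is taken over the set of blocks passed in (the slide does not say whether the maximum is per page or global); the block names and importance values are hypothetical.

```python
def link_weights(importance):
    """Scale each block's importance by the largest importance value in the set."""
    top = max(importance.values())
    if top <= 0:
        raise ValueError("at least one block needs a positive importance value")
    return {block: value / top for block, value in importance.items()}


# Hypothetical importance values for the blocks of one page.
importance = {"main": 0.6, "sidebar": 0.3, "footer_ad": 0.06}
weights = link_weights(importance)
print(weights)
```

Links in the most important block keep a weight of 1, and links in every other block are discounted in proportion to that block's importance.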
Experiments • Dataset: TREC2003 • Relevance weighting: BM2500 • PR and BLPR • HITS and BLHITS • Size of root set: 200 • In-link parameter d: 50 • Adopting Bharat and Henzinger’s idea • Eliminate mutually reinforcing relationships between hosts • Combine connectivity and content analysis
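Bharat and Henzinger's remedy for mutually reinforcing hosts is usually described as follows: if k pages on one host link to the same target document, each of those edges receives an authority weight of 1/k. The sketch below illustrates only that rule; whether the experiments here used exactly this weighting is an assumption, and the function name and URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def authority_edge_weights(edges):
    """Weight each edge 1/k, where k pages on the same host link to the same target."""
    sources = defaultdict(set)
    for src, dst in edges:
        sources[(urlparse(src).hostname, dst)].add(src)
    return {(src, dst): 1.0 / len(sources[(urlparse(src).hostname, dst)])
            for src, dst in edges}


# Hypothetical links: two pages on a.example and one page on b.example
# all point to the same target document.
edges = [
    ("http://a.example/p1", "http://t.example/x"),
    ("http://a.example/p2", "http://t.example/x"),
    ("http://b.example/q", "http://t.example/x"),
]
weights = authority_edge_weights(edges)
for edge, weight in weights.items():
    print(edge, weight)
```

The two links from a.example together count as much as the single link from b.example, so one host cannot inflate a target's authority simply by linking to it from many of its own pages.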
Results on PR & BLPR 1. First 15 pages in .GOV dataset
2. Results on TREC2003 Combine relevance score (using BM2500) and importance score (using ranking algorithm)
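The slide does not say how the two scores are combined. One common choice, assumed here purely for illustration, is a linear interpolation of a relevance score and an importance score that have both been scaled to [0, 1]. The parameter alpha, the function name, and the document scores are hypothetical.

```python
def combine(relevance, importance, alpha=0.5):
    """Linear interpolation of relevance and importance (an assumed scheme)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * relevance + (1.0 - alpha) * importance


# Hypothetical (relevance, importance) pairs, both already scaled to [0, 1].
docs = {"d1": (0.9, 0.1), "d2": (0.6, 0.8), "d3": (0.2, 0.9)}
scores = {name: combine(rel, imp) for name, (rel, imp) in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Setting alpha to 1 reduces the ranking to pure relevance, and setting it to 0 reduces it to pure link-based importance.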