Using Hyperlink structure information for web search

Using Hyperlink structure information for web search

Hyperlink structure information • Hyperlink analysis for the web by Monika R. Henzinger, Google Inc. • Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas

How are hyperlinks useful? • Assumptions a)Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A. b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic. c) Pages pointed by many pages are of higher quality than pages pointed to by fewer pages.

main uses of hyperlink analysis • crawling (collecting the pages) • ranking (rank the returned results) • Compute the geographic scope of a web page • Find mirrored host • Compute the statistics of web pages and search engine • Major search engine use hyperlink analysis but do not want to disclose the algorithms

Crawling • Collect web pages • Start with a set of pages, recursively visit the hyperlinks

Traditional IR • Vector model or Boolean model • Does not work well in the web because: Web page authors manipulate the ranking. • The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.

Connectivity-Based Ranking(rank using hyperlink analysis) • query-independent schemes, which assign a score to a page independent of a given query; • query-dependent schemes, which assign a score to a page in the context of a given query.

Model • Web pages as graph, page as node, hyperlink as edge. • Directed graph: link graph. Used for finding related pages • Undirected graph: co-citation graph. Used for categorizing related pages.

Query-independent Ranking • Major drawbacks: it does not distinguish between the quality of a page pointed by a number of low-quality pages and the quality of a page pointed to by the same number of high-quality page. • PageRank algorithm. Weight each hyperlink to the page proportionally to the quality of the page containing the hyperlink. PageRank of a page A depends on the pagerank of a page B pointing to A. Used by Google.

Query-dependent Ranking • Build query-specific graph: neighborhood graph. • Start set of documents matching the query • Augmented by the sets of the documents that either hyperlinks to or is hyper linked to by the documents in the start set. • Perform the hyperlink analysis.

Query-dependent Ranking(continued) • Indegree-based approach. (the number of documents hyper linking to a document in the start set) • Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic) • HITS algorithm to determine good hubs and good authorities. Each node has auth score and hub score.

Problems of HITS • Small additions to neighborhood graph may considerably change the scores of hub and auth. • Topic drift when the majority of pages on neighborhood graph is on a topic different from the query topic.

Structural web search using a graph-based discovery system • WebSUBDUE: SUBDUE is the engine for knowledge discovery(data mining). Support structural search, text search, synonym search, and combinations of these searches. • Data preparation: Crawler written in Perl to build the labeled graph for the web site. • Labeled graph is feed into SUDUE system. • Query can be modeled as labeled graph as well. • Search the sub graph in the graph • Make comparison with existing search engine: AltaVista

Find all pages that link to a page containing the term subdue

Jobs in computer science

Find hubs and authorities pages on “algorithm”

Conclusion • Hyperlink structure information is valuable information. • Use of hyperlink information to enhance normal web search in crawling, ranking etc. • Use of hyperlink information to support structural search, which is still missing in existing search engine.

Using Hyperlink structure information for web search

Using Hyperlink structure information for web search

Presentation Transcript

INFORMATION RETRIEVAL AND WEB SEARCH

Information Retrieval and Web Search

Information Retrieval and Web Search

Information Retrieval and Web Search

Using Hyperlink structure information for web search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Hyperlink Analysis for the Web

Part I: Web Structure Mining Chapter 2: Hyperlink Based Ranking

Information Retrieval and Web Search

Information Retrieval and Web Search

Using Web Structure for Classifying and Describing Web Pages

Web Enabled Expert Systems using Hyperlink-based Inference

Hyperlink

Information Retrieval and Web Search

Information Retrieval and Web Search

Sending Data Using a Hyperlink

Web Search and Information Retrieval

Information Retrieval and Web Search

Information Retrieval and Web Search

Information Retrieval and Web Search

Information Retrieval and Web Search

Information Retrieval and Web Search