210 likes | 341 Views
Using Hyperlink structure information for web search. Hyperlink structure information. Hyperlink analysis for the web by Monika R. Henzinger, Google Inc. Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas . How are hyperlinks useful?.
E N D
Hyperlink structure information • Hyperlink analysis for the web by Monika R. Henzinger, Google Inc. • Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas
How are hyperlinks useful? • Assumptions a)Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A. b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic. c) Pages pointed by many pages are of higher quality than pages pointed to by fewer pages.
main uses of hyperlink analysis • crawling (collecting the pages) • ranking (rank the returned results) • Compute the geographic scope of a web page • Find mirrored host • Compute the statistics of web pages and search engine • Major search engine use hyperlink analysis but do not want to disclose the algorithms
Crawling • Collect web pages • Start with a set of pages, recursively visit the hyperlinks
Traditional IR • Vector model or Boolean model • Does not work well in the web because: Web page authors manipulate the ranking. • The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.
Connectivity-Based Ranking(rank using hyperlink analysis) • query-independent schemes, which assign a score to a page independent of a given query; • query-dependent schemes, which assign a score to a page in the context of a given query.
Model • Web pages as graph, page as node, hyperlink as edge. • Directed graph: link graph. Used for finding related pages • Undirected graph: co-citation graph. Used for categorizing related pages.
Query-independent Ranking • Major drawbacks: it does not distinguish between the quality of a page pointed by a number of low-quality pages and the quality of a page pointed to by the same number of high-quality page. • PageRank algorithm. Weight each hyperlink to the page proportionally to the quality of the page containing the hyperlink. PageRank of a page A depends on the pagerank of a page B pointing to A. Used by Google.
Query-dependent Ranking • Build query-specific graph: neighborhood graph. • Start set of documents matching the query • Augmented by the sets of the documents that either hyperlinks to or is hyper linked to by the documents in the start set. • Perform the hyperlink analysis.
Query-dependent Ranking(continued) • Indegree-based approach. (the number of documents hyper linking to a document in the start set) • Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic) • HITS algorithm to determine good hubs and good authorities. Each node has auth score and hub score.
Problems of HITS • Small additions to neighborhood graph may considerably change the scores of hub and auth. • Topic drift when the majority of pages on neighborhood graph is on a topic different from the query topic.
Structural web search using a graph-based discovery system • WebSUBDUE: SUBDUE is the engine for knowledge discovery(data mining). Support structural search, text search, synonym search, and combinations of these searches. • Data preparation: Crawler written in Perl to build the labeled graph for the web site. • Labeled graph is feed into SUDUE system. • Query can be modeled as labeled graph as well. • Search the sub graph in the graph • Make comparison with existing search engine: AltaVista
Find all pages that link to a page containing the term subdue
Conclusion • Hyperlink structure information is valuable information. • Use of hyperlink information to enhance normal web search in crawling, ranking etc. • Use of hyperlink information to support structural search, which is still missing in existing search engine.