Delve into the challenges of answering queries on the web to enhance search quality, focusing on authoritative pages. Learn about query types, link structures, analyzing authority, and computing hubs and authorities to improve search accuracy.
Authoritative Sources in a Hyperlinked Environment • Jon M. Kleinberg, JACM 1999 • Presented by Raman Adaikkalavan, Feb 23, 2005 • CSE 6392, Instructor: Dr. Gautam Das
Overview • Problem – in general • Query Types • Problems of Answering Queries • Authoritative Pages – Broad-topic queries • Iterate Method/Algorithm • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Comparison – ? • Conclusion
Problem – in general • Searching on the www for discovering pages that are relevant to a given query • Improving Quality of search
Query Types • Does Netscape support the JDK 1.1 code signing API • Specific queries • Find information about the Java programming language • Broad-topic queries • Find pages ‘similar’ to java.sun.com • Similar-page queries
Problems with Answering Queries • Specific queries • Scarcity problem: Very few pages that contain required information • Difficult to determine the identity of the pages • Broad-topic queries • Abundance problem: Number of pages that could reasonably be returned as relevant is far too large for a human user to digest • Select a small set of the most “authoritative” or “definitive” ones – pages that are most relevant
Authoritative Pages – Central focus • Given a query, how to get the small set of authoritative pages corresponding to that query • How to accurately model authority in the context of a particular query topic • Text-based searching/ranking – sufficient? Many prominent pages are not sufficiently self-descriptive. • “harvard” – www.harvard.edu • “search engines” – Yahoo, AltaVista, …? • “automobile manufacturers” – Honda, Toyota, …?
Analysis of the Link Structure • Hyperlinks encode a considerable amount of latent human judgment – can it be used to infer authority? • e.g., the creator of page p, by including a link to page q, has in some measure conferred authority on q • But a large number of links are created primarily for navigational purposes (e.g., “back” links) • Links to paid advertisements • Relevance vs. popularity • Finding pages by number of in-links alone would treat highly popular pages, such as Yahoo.com, as authoritative
Conferral of Authority • Model that consistently identifies relevant, authoritative www pages for broad search topics • Based on the relationship between ‘authorities’ and ‘hubs’ • Authorities: Pages that have relevant information about a given topic • Hubs: Pages that link to many related authorities
Till Now • Authoritative pages on the WWW • Not only based on text • Using link analysis • Example: information about the Java programming language (broad-topic queries)
Can We Operate Over the Entire WWW? • The analysis is specific to a query; i.e., not predefined • Computational costs should be reduced • Analysis of the link structure: which subgraph of the WWW should be operated on? • All pages containing the query string? • May be over a million pages – too costly to compute on • Some or most of the best authorities may not belong to this set
Finding Authoritative Pages • Steps • 1: Construct a focused subgraph (S) of the www; such that • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities • 2: Compute Hubs and Authorities from the focused subgraph
Construction of Focused Subgraph • [Figure: a search engine returns the t highest-ranked pages for the topic as the root set R; R is expanded into the set S by adding forward-link pages and backward-link pages, at most d pages]
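The figure labels above suggest a simple expansion procedure, which can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the presenter's implementation: the function name `focused_subgraph`, the dictionaries `out_links` and `in_links`, and the toy page names are all hypothetical, and the cap d is assumed to apply to the backward-link pages added for each root page.

```python
def focused_subgraph(root, out_links, in_links, d):
    """Expand a root set R into the larger page set S.

    root      : page ids of the t highest-ranked search results (R)
    out_links : dict mapping a page to the pages it links to (forward links)
    in_links  : dict mapping a page to the pages that link to it (backward links)
    d         : assumed cap on backward-link pages added per root page
    """
    root = list(root)
    expanded = set(root)
    for page in root:
        # Add every page the root page points to.
        expanded.update(out_links.get(page, []))
        # Add at most d of the pages that point to the root page.
        expanded.update(in_links.get(page, [])[:d])
    return expanded


# Hypothetical toy data: two root pages, a few neighbours.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
S = focused_subgraph(["r1", "r2"], out_links, in_links, d=2)
print(sorted(S))
```

With d = 2, page h3 is left out because r1 already contributed two backward-link pages.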
Offsetting Navigational Links • G[S]: the subgraph induced on the pages in S • Types of links • Transverse: between pages with different domain names • Intrinsic: between pages within the same domain name • Delete intrinsic links from G[S], resulting in a graph G • Collusion: a large number of pages from a single domain all point to a single page p. “This site is designated to…” • Limit it with a parameter m (approx. 4–8): allow at most m pages from a single domain to point to any given page p
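The two filters on this slide can be sketched as a single pass over the edge list. This is an illustrative sketch only: the helper names `domain` and `prune_links` are hypothetical, the host part of the URL is assumed to stand in for the "domain name", and the example URLs are made up.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    """Host part of a URL, used here as the page's domain name (an assumption)."""
    return urlparse(url).netloc.lower()


def prune_links(edges, m):
    """Drop intrinsic links, then keep at most m links from any one
    source domain to any one target page."""
    kept = []
    seen = defaultdict(int)  # (source domain, target page) -> links kept so far
    for src, dst in edges:
        if domain(src) == domain(dst):
            continue  # intrinsic link: same domain on both ends
        key = (domain(src), dst)
        if seen[key] < m:
            seen[key] += 1
            kept.append((src, dst))
    return kept


# Hypothetical edges: one intrinsic link and three links from a.com to one page.
edges = [
    ("http://a.com/1", "http://a.com/2"),
    ("http://a.com/1", "http://b.org/p"),
    ("http://a.com/2", "http://b.org/p"),
    ("http://a.com/3", "http://b.org/p"),
    ("http://c.net/x", "http://b.org/p"),
]
pruned = prune_links(edges, m=2)
print(pruned)
```

Here m = 2 is chosen only to keep the example small; the slide suggests values around 4 to 8.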
Computing Hubs & Authorities • Goal: Given a query find: • Good sources of content (authorities) • Good sources of links (hubs) FROM: Monika Henzinger, Hyperlink Analysis on the Web
Intuition • Authority comes from in-edges. Being a good hub comes from out-edges. • Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities. FROM: Monika Henzinger, Hyperlink Analysis on the Web
Hubs and Authorities • An iterative algorithm • With each page p, we associate • a non-negative authority weight x<p> • a non-negative hub weight y<p> • Weights of each type are normalized so their squares sum to 1 • $\sum_{p \in S} (x^{\langle p \rangle})^2 = 1$, $\sum_{p \in S} (y^{\langle p \rangle})^2 = 1$ • Pages with larger x- and y-values are viewed as “better” authorities and hubs, respectively.
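Hubs and Authorities • If p points to many pages with large x-values, then it should receive a large y-value • If p is pointed to by many pages with large y-values, then it should receive a large x-value • Inlinks I: $x^{\langle p \rangle} \leftarrow \sum_{q : (q,p) \in E} y^{\langle q \rangle}$ • Outlinks O: $y^{\langle p \rangle} \leftarrow \sum_{q : (p,q) \in E} x^{\langle q \rangle}$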
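The I and O operations, together with the normalization from the previous slide, can be sketched as follows. This is a minimal sketch, not a reference implementation: the function names, the toy graph, and the choice of 20 rounds are assumptions made for illustration.

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()} if norm else dict(weights)


def i_operation(edges, y):
    """Authority update: x<p> = sum of y<q> over all q that link to p."""
    x = {p: 0.0 for p in y}
    for q, p in edges:
        x[p] += y[q]
    return x


def o_operation(edges, x):
    """Hub update: y<p> = sum of x<q> over all q that p links to."""
    y = {p: 0.0 for p in x}
    for p, q in edges:
        y[p] += x[q]
    return y


def iterate(edges, pages, k=20):
    """Run k rounds of I, then O, then normalization, starting from all-ones weights."""
    x = {p: 1.0 for p in pages}
    y = {p: 1.0 for p in pages}
    for _ in range(k):
        x = i_operation(edges, y)
        y = o_operation(edges, x)
        x, y = normalize(x), normalize(y)
    return x, y


# Hypothetical toy graph: h1 links to both a1 and a2, h2 links only to a1.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(edges, pages)
print(max(x, key=x.get), max(y, key=y.get))
```

In this toy graph a1 ends up with the largest authority weight (it has the most in-links from hubs) and h1 with the largest hub weight (it points to both authorities).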
Hubs and Authorities • As one applies Iterate with arbitrarily large k, the sequences {x_k} and {y_k} converge to fixed points x* and y* • Let G = (V, E), with V = {p_1, p_2, …, p_n}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (p_i, p_j) is an edge of G, and is 0 otherwise • x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$ • The convergence of Iterate is quite rapid (k = 20 is sufficient)
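A short derivation may help connect the iteration to the eigenvector statement. It is a sketch that assumes the all-ones start vector z and a strict gap between the two largest eigenvalues of $A^T A$; the symbols z and $\lambda_1, \lambda_2$ are introduced here for illustration.

```latex
% In matrix form, the two update rules are
%   I operation:  x \leftarrow A^T y
%   O operation:  y \leftarrow A x
% Starting from y_0 = z (all ones) and normalizing after each round:
\begin{align*}
  x_k &\propto A^T y_{k-1}, &
  y_k &\propto A x_k, \\
  x_k &\propto (A^T A)^{k-1} A^T z, &
  y_k &\propto (A A^T)^{k} z .
\end{align*}
% Repeated multiplication by a symmetric matrix is power iteration, so if
% \lambda_1 > \lambda_2 and the start vector is not orthogonal to the
% principal eigenvector, x_k and y_k converge to the principal eigenvectors
% of A^T A and A A^T respectively.
```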
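Mini Web (Modified) • [Figure: three-page web graph on pages X, Y, Z, with forward links (hubs) and backward links (authorities)] • Adjacency matrix, rows and columns ordered X, Y, Z: $M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$, $M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$ • Hubs: $H_i = M A_{i-1}$ • Authorities: $A_i = M^T H_{i-1}$ • Combined: $H_i = M M^T H_{i-1}$, $A_i = M^T M A_{i-1}$ • SOURCE: Vagelis H, Random Walks Presentation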
Mini Web (Modified) • $M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$, $M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}$ • $A_i = M^T M A_{i-1}$, $H_i = M M^T H_{i-1}$ • [Table: weights of X, Y, Z at iterations 1, 2, 3, …, ∞; values not preserved] • X is the best hub; with these matrices, X and Y tie as the most authoritative pages • SOURCE: Vagelis H, Random Walks Presentation
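The matrix products on this slide are easy to check by hand or in code. The sketch below assumes the adjacency matrix reads M = [[1, 1, 1], [0, 0, 1], [1, 1, 0]] with rows and columns ordered X, Y, Z; the helper names `transpose`, `matmul`, and `hub_weights` are hypothetical.

```python
import math


def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Assumed reading of the mini-web adjacency matrix (X, Y, Z order).
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
MT = transpose(M)

authority_matrix = matmul(MT, M)  # M^T M
hub_matrix = matmul(M, MT)        # M M^T
print(authority_matrix)
print(hub_matrix)


def hub_weights(rounds=20):
    """Repeat H_i = M M^T H_{i-1}, normalizing each round, from all-ones weights."""
    h = [1.0, 1.0, 1.0]
    for _ in range(rounds):
        h = [sum(a * b for a, b in zip(row, h)) for row in hub_matrix]
        norm = math.sqrt(sum(v * v for v in h))
        h = [v / norm for v in h]
    return h


h = hub_weights()
print("best hub:", "XYZ"[h.index(max(h))])
```

The hub iteration singles out X, which links to all three pages.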
Observations • Just “pure” analysis of link structure • i.e., text-based search only supplies the initial set • Pages that can legitimately be considered authoritative in the context of the WWW are found without access to a large-scale index of the WWW • i.e., global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph
Similar-Page Queries • E.g., find pages ‘similar’ to honda.com • Using link analysis to infer a notion of “similarity” among pages • Suppose we have found a page p that is of interest and is an authoritative page on a topic • What do users of the WWW consider to be related to p when they create pages and links? • If p is highly referenced, we again face the abundance problem
Similar-Page Queries • In the local region of the link structure near p, what are the strongest authorities? • These can serve as a broad-topic summary of the pages related to p • Normal search, given a query string: “Find t pages containing the query string” as R, and then get subgraph S • Similar-page search, given a page p: “Find t pages pointing to p” as R, and then get subgraph S
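The only change for a similar-page query is how the root set R is chosen. A minimal sketch, assuming the link graph is available as a list of (source, target) pairs; the function name `root_set_for_page` and the page names are hypothetical.

```python
from collections import defaultdict


def root_set_for_page(p, edges, t):
    """Root set R for a similar-page query: up to t pages that point to p."""
    pointing_to = defaultdict(list)
    for src, dst in edges:
        pointing_to[dst].append(src)
    return pointing_to[p][:t]


# Hypothetical link data.
edges = [
    ("h1", "honda.com"),
    ("h2", "honda.com"),
    ("h2", "toyota.com"),
    ("h3", "honda.com"),
]
R = root_set_for_page("honda.com", edges, t=2)
print(R)
```

From here, R is expanded into S and the hub and authority weights are computed exactly as for a broad-topic query.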
Multiple Sets of Hubs and Authorities • Broad-topic queries: the method finds the most densely linked collection of hubs and authorities • Can we find several densely linked collections of hubs and authorities among the same set S of pages? • Each collection could potentially be relevant to the query topic, but they could be well separated from one another in the graph G: • The query string may have several very different meanings. E.g. “jaguar”, “java”. • The string may arise as a term in the context of multiple technical communities. E.g. “randomized algorithms”. • The string may refer to a highly polarized issue, involving groups that are not likely to link to one another. E.g. “abortion”.
Multiple Sets of Hubs and Authorities • Relevant documents can be grouped into several clusters • For broad-topic queries: x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$ • Can we use the non-principal eigenvectors to extract additional densely linked collections of hubs and authorities? • Non-principal eigenvectors have both positive and negative entries; the most positive and the most negative ends can each define a collection
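One way to see how a non-principal eigenvector can expose a second community is to compute the leading eigenvectors of $A^T A$ on a small graph with two separate clusters. The sketch below uses power iteration with deflation, a technique swapped in here for illustration rather than taken from the slides; the graph, page numbering, and function names are hypothetical.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def unit(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def top_eigenpairs(sym, count, rounds=200):
    """Leading eigenpairs of a symmetric positive semidefinite matrix,
    found by power iteration followed by deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(count):
        v = unit([1.0 + 0.1 * i for i in range(n)])  # arbitrary start vector
        for _ in range(rounds):
            v = unit(matvec(work, v))
        lam = sum(a * b for a, b in zip(v, matvec(work, v)))
        pairs.append((lam, v))
        # Deflation: subtract the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


# Hypothetical graph with two separate communities:
# hubs 0, 1, 2 point to authorities 3 and 4; hubs 5, 6 point to authority 7.
n = 8
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (5, 7), (6, 7)]
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
AtA = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

(lam1, v1), (lam2, v2) = top_eigenpairs(AtA, 2)
top1 = max(range(n), key=lambda i: abs(v1[i]))
top2 = max(range(n), key=lambda i: abs(v2[i]))
print(top1, top2)
```

The principal eigenvector concentrates on the denser community (page 3 is its strongest authority), while the second eigenvector picks out page 7 from the other community.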
Diffusion and Generalization • Diffusion happens • if the query specifies a topic that is not sufficiently broad, there will not be enough relevant pages in G • the most relevant collection in G is then not the “densest” one • as a result, the I and O operations will converge to a diffused collection of authorities corresponding to a “broader” topic • This limits the algorithm • The broader topic that supplants the original, too-specific query very often represents a natural generalization of the query topic • It provides a simple way of abstracting a specific query topic to a broader related one.
Results – Diffusion & Generalization • The use of non-principal eigenvectors, combined with basic term-matching, can be a simple way to extract collections of authoritative pages that are more relevant to a specific query topic
Evaluation • 26 broad search topics, 37 users • For each topic, took the top 10 pages from AltaVista, the top five hubs and five authorities from Clever, and a random set of 10 pages from Yahoo • The results • For 31% of the topics, Yahoo and Clever received evaluations equivalent to each other • For 50%, Clever received a higher evaluation • For 19%, Yahoo received the higher evaluation
Summary • Answering broad-topic queries • Finding authoritative pages using good hubs and good authorities • Answering similar-page queries by starting with a different root set • Finding multiple sets of hubs and authorities using non-principal eigenvectors • Overcoming diffusion and generalization by using non-principal eigenvectors and basic term matching
PageRank vs. HITS • HITS • Computation: requires computation for each query • Query-dependent • Relatively easy to spam • Quality depends on quality of start set • Gives hubs as well as authorities • PageRank • Computation: once for all documents and queries (offline) • Query-independent – requires combination with query-dependent criteria • Hard to spam FROM: Monika Henzinger, Hyperlink Analysis on the Web
PageRank vs. HITS • PageRank • [Lempel] Not rank-stable: O(1) changes in the graph can change $O(N^2)$ order-relations • [Ng, Zheng, Jordan01] “Value”-stable: a change in k nodes (with PR values p_1, …, p_k) results in p* s.t. [bound not preserved] • HITS • Not rank-stable • “Value”-stability depends on the gap g between the largest and second-largest eigenvalue: a change of O(g) nodes results in p* s.t. [bound not preserved] FROM: Monika Henzinger, Hyperlink Analysis on the Web
References/Slide Sources • Jon M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, JACM 1999 • Monika Henzinger, “Hyperlink Analysis on the Web” • Original mini-web example: http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt • Vivek B. Tawde, presentation on “Authoritative sources in a hyperlinked environment”
Conclusion • Influential paper • CiteSeer – 457 citations • ACM – 115 citations • Same time period as Google's PageRank algorithm