Authoritative Sources in a Hyperlinked Environment • Jon M. Kleinberg, JACM 1999 • Presented by Raman Adaikkalavan, Feb 23, 2005, CSE 6392 • Instructor: Dr. Gautam Das
Overview • Problem – in general • Query Types • Problems of Answering Queries • Authoritative Pages – Broad-topic queries • Iterate Method/Algorithm • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Comparison – ? • Conclusion
Problem – in general • Searching the WWW to discover pages that are relevant to a given query • Improving the quality of search
Query Types • Specific queries: e.g., “Does Netscape support the JDK 1.1 code-signing API?” • Broad-topic queries: e.g., “Find information about the Java programming language” • Similar-page queries: e.g., “Find pages ‘similar’ to java.sun.com”
Problems with Answering Queries • Specific queries • Scarcity problem: Very few pages that contain required information • Difficult to determine the identity of the pages • Broad-topic queries • Abundance problem: Number of pages that could reasonably be returned as relevant is far too large for a human user to digest • Select a small set of the most “authoritative” or “definitive” ones – pages that are most relevant
Authoritative Pages – Central focus • Given a query, how do we get the small set of authoritative pages corresponding to that query? • How do we accurately model authority in the context of a particular query topic? • Text-based searching/ranking – sufficient? Many prominent pages are not sufficiently self-descriptive. • “harvard” – www.harvard.edu • “search engines” – Yahoo, AltaVista, …? • “automobile manufacturers” – Honda, Toyota, …?
Analysis of the Link Structure • Hyperlinks encode a considerable amount of latent human judgment – can it be used to infer authority? • e.g., the creator of page p, by including a link to page q, has in some measure conferred authority on q • Pitfalls • A large number of links are created primarily for navigational purposes (e.g., “back” links) • Links to paid advertisements • Balancing relevance and popularity • Finding pages by number of in-links alone would treat highly popular pages (e.g., Yahoo.com) as authoritative regardless of topic
Conferral of Authority • Model that consistently identifies relevant, authoritative www pages for broad search topics • Based on the relationship between ‘authorities’ and ‘hubs’ • Authorities: Pages that have relevant information about a given topic • Hubs: Pages that link to many related authorities
Till Now • Authoritative pages • Not only based on text • Using link analysis • [Figure: the WWW; example query: information about the Java PL (broad-topic query)]
Can We Operate Over Entire WWW ? • Specific to a query; i.e., not predefined • Computational costs – should be reduced • Analysis of the link structure: which subgraph of the WWW should be operated on? • All pages containing the query string? • May be over a million pages – too expensive to compute on • Some or most of the best authorities may not belong to this set
Finding Authoritative Pages • Steps • 1: Construct a focused subgraph (S) of the www; such that • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities • 2: Compute Hubs and Authorities from the focused subgraph
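Construction of Focused Subgraph • [Figure: a search engine returns the t highest-ranked pages for the topic as the root set R; R is expanded into the set S by adding the pages that R links to (forward-link pages) and at most d pages that link to each page in R (backward-link pages)]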
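The construction on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the callables `search`, `out_links` and `in_links` are hypothetical stand-ins for a text search engine and a link index, and the default values of `t` and `d` are illustrative assumptions.

```python
from typing import Callable, Iterable, Set


def focused_subgraph(
    query: str,
    search: Callable[[str, int], Iterable[str]],   # hypothetical: top pages for a query
    out_links: Callable[[str], Iterable[str]],     # hypothetical: pages that p links to
    in_links: Callable[[str], Iterable[str]],      # hypothetical: pages that link to p
    t: int = 200,                                  # illustrative root-set size
    d: int = 50,                                   # illustrative cap on backward links
) -> Set[str]:
    """Return the expanded page set S built from the root set R."""
    # Root set R: the t highest-ranked pages from the text search engine.
    root = list(search(query, t))[:t]
    pages = set(root)
    for p in root:
        # Forward links: every page that p points to joins S.
        pages.update(out_links(p))
        # Backward links: at most d pages pointing to p join S.
        backward = list(in_links(p))
        pages.update(backward[:d])
    return pages
```

Capping the backward links at d keeps S small even when a root page is linked from a very large number of pages.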
Offsetting Navigational Links • G[S]: the subgraph induced on the pages in S • Types of links • Transverse: between pages with different domain names • Intrinsic: between pages within the same domain name • Delete intrinsic links from G[S], resulting in a graph G • Collusion: a large number of pages from a single domain all point to a single page p, e.g., “This site is designated to…” • Eliminate by allowing at most m pages (m ≈ 4–8) from a single domain to point to any given page p
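A rough sketch of both filters follows. It assumes that links are given as pairs of absolute URLs and that the host name is an acceptable stand-in for the "domain name"; `prune_links` and `domain` are illustrative names, and `m=6` is just one value from the 4–8 range on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def domain(url: str) -> str:
    """Host name of an absolute URL (a simplification of 'domain name')."""
    return urlsplit(url).netloc.lower()


def prune_links(edges: Iterable[Tuple[str, str]], m: int = 6) -> List[Tuple[str, str]]:
    """Drop intrinsic links and cap links per (source domain, target page) at m."""
    kept: List[Tuple[str, str]] = []
    # Number of links already kept from a given domain to a given target page.
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for src, dst in edges:
        src_domain = domain(src)
        if src_domain == domain(dst):
            continue  # intrinsic link: same domain, treated as navigational
        if counts[(src_domain, dst)] >= m:
            continue  # collusion guard: this domain already has m links to dst
        counts[(src_domain, dst)] += 1
        kept.append((src, dst))
    return kept
```

Which m links survive from a colluding domain depends on input order here; the slide does not say how they should be chosen.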
Finding Authoritative Pages • Steps • 1: Construct a focused subgraph (S) of the www • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities • 2: Compute Hubs and Authorities from the focused subgraph
Computing Hubs & Authorities • Goal: Given a query find: • Good sources of content (authorities) • Good sources of links (hubs) FROM: Monika Henzinger, Hyperlink Analysis on the Web
Intuition • Authority comes from in-edges. Being a good hub comes from out-edges. • Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities. FROM: Monika Henzinger, Hyperlink Analysis on the Web
Hubs and Authorities • An iterative algorithm • With each page p, we associate • a non-negative authority weight x<p> • a non-negative hub weight y<p> • Weights of each type are normalized so their squares sum to 1 • $\sum_{p \in S} (x^{\langle p \rangle})^2 = 1$ and $\sum_{p \in S} (y^{\langle p \rangle})^2 = 1$ • Pages with larger x- and y-values are viewed as “better” authorities and hubs, respectively.
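Hubs and Authorities • If p points to many pages with large x-values, then it should receive a large y-value • If p is pointed to by many pages with large y-values, then it should receive a large x-value • Inlinks, operation I: $x^{\langle p \rangle} \leftarrow \sum_{q : (q,p) \in E} y^{\langle q \rangle}$ • Outlinks, operation O: $y^{\langle p \rangle} \leftarrow \sum_{q : (p,q) \in E} x^{\langle q \rangle}$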
Hubs and Authorities • As one applies Iterate with arbitrarily large k, the sequences $\{x_k\}$ and $\{y_k\}$ converge to fixed points x* and y* • Let G = (V, E), with V = {p_1, p_2, …, p_n}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (p_i, p_j) is an edge of G, and is 0 otherwise. • x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$ • The convergence of Iterate is quite rapid (k = 20 is sufficient)
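The Iterate procedure can be sketched as follows. This is a minimal sketch that assumes the graph is given as a list of pages plus a list of `(source, target)` edges whose endpoints all appear in the page list; the names `iterate` and `normalize` are illustrative.

```python
import math
from typing import Dict, Iterable, Tuple

Weights = Dict[str, float]


def normalize(w: Weights) -> Weights:
    """Scale weights so that their squares sum to 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: v / norm for p, v in w.items()} if norm > 0 else dict(w)


def iterate(
    pages: Iterable[str],
    edges: Iterable[Tuple[str, str]],
    k: int = 20,
) -> Tuple[Weights, Weights]:
    """Return (authority weights x, hub weights y) after k rounds."""
    page_list = list(pages)
    edge_list = list(edges)
    x = {p: 1.0 for p in page_list}
    y = {p: 1.0 for p in page_list}
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = dict.fromkeys(page_list, 0.0)
        for q, p in edge_list:
            new_x[p] += y[q]
        # O operation: hub weight of p = sum of (updated) authority weights it links to.
        new_y = dict.fromkeys(page_list, 0.0)
        for p, q in edge_list:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y
```

Sorting the pages by `x` (or `y`) in descending order then yields the candidate authorities (or hubs).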
Mini Web (Modified) • Pages X, Y, Z; row i of M holds the forward links of page i (hubs), column j holds the backward links of page j (authorities); rows and columns are ordered X, Y, Z
$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad M^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
• $A_i = M^T H_{i-1}$ and $H_i = M A_i$, so that $A_i = M^T M \, A_{i-1}$ and $H_i = M M^T H_{i-1}$ SOURCE: Vagelis H, Random Walks Presentation
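Mini Web (Modified) • $A_i = M^T M \, A_{i-1}$ and $H_i = M M^T H_{i-1}$
$$M^T M = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}, \qquad M M^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}$$
• X is the best hub • X and Y tie as the most authoritative pages • [Table: hub and authority scores of X, Y, Z at iterations 1, 2, 3, …, ∞; values not recoverable] SOURCE: Vagelis H, Random Walks Presentation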
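The mini-web products and the limiting scores can be recomputed with a short pure-Python sketch. The matrix `M` below is transcribed from the slide (rows and columns ordered X, Y, Z), which should be treated as an assumption; the helper names are illustrative.

```python
from typing import List

Matrix = List[List[float]]

# Adjacency matrix of the mini web as read from the slide (assumption):
# X links to X, Y, Z; Y links to Z; Z links to X, Y.
M: Matrix = [[1, 1, 1], [0, 0, 1], [1, 1, 0]]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def power_iteration(s: Matrix, steps: int = 20) -> List[float]:
    """Repeatedly apply s to an all-ones vector, renormalizing each step."""
    v = [1.0] * len(s)
    for _ in range(steps):
        v = [sum(sij * vj for sij, vj in zip(row, v)) for row in s]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v


Mt = transpose(M)
authority_matrix = matmul(Mt, M)  # M^T M, drives the authority scores
hub_matrix = matmul(M, Mt)        # M M^T, drives the hub scores
authorities = power_iteration(authority_matrix)
hubs = power_iteration(hub_matrix)

for name, a, h in zip("XYZ", authorities, hubs):
    print(f"{name}: authority={a:.3f} hub={h:.3f}")
```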
Observations • Just “pure” analysis of link structure • i.e., text-based search only supplies the initial set • Finds pages that can legitimately be considered authoritative in the context of the WWW, without access to a large-scale index of the WWW • i.e., global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph
Overview • Problem – in general • Query Types • Problems of Answering Queries • Authoritative Pages – Broad-topic queries • Iterate Method/Algorithm • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Comparison • Conclusion
Similar-Page Queries • E.g., find pages ‘similar’ to honda.com • Using link analysis to infer a notion of “similarity” among pages • We have found a page p that is of interest and is an authoritative page on a topic. • What do users of the WWW consider to be related to p when they create pages and links? • If p is highly referenced – abundance problem
Similar-Page Queries • In the local region of the link structure near p, what are the strongest authorities? • Can be a potential broad-topic summary of pages related to p • Normal search, given a query string: “Find t pages containing the query string” as R, and then get subgraph S • Similar-page search, given a page p: “Find t pages pointing to p” as R, and then get subgraph S
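Only the root set changes for a similar-page query, which a short sketch can make concrete. The callable `in_links` is a hypothetical stand-in for a link index that returns the pages pointing to a page, and the default `t` is an illustrative assumption.

```python
from typing import Callable, Iterable, List


def similar_page_root_set(
    p: str,
    in_links: Callable[[str], Iterable[str]],  # hypothetical: pages that link to p
    t: int = 200,                              # illustrative root-set size
) -> List[str]:
    """Root set R for a similar-page query: up to t distinct pages pointing to p."""
    root: List[str] = []
    for q in in_links(p):
        if q != p and q not in root:
            root.append(q)
        if len(root) == t:
            break
    return root
```

The resulting R is then expanded into S and scored for hubs and authorities exactly as in the broad-topic case.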
Multiple Sets of Hubs and Authorities • Broad-topic queries: the most densely linked collection of hubs and authorities • Can we find several densely linked collections of hubs and authorities among the same set S of pages? • Each collection could potentially be relevant to the query topic, but they could be well separated from one another in the graph G: • The query string may have several very different meanings. E.g. “jaguar”, “java”. • The string may arise as a term in the context of multiple technical communities. E.g. “randomized algorithms”. • The string may refer to a highly polarized issue, involving groups that are not likely to link to one another. E.g. “abortion”.
Multiple Sets of Hubs and Authorities • Relevant documents can be grouped into several clusters • For broad-topic queries: x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$ • Can we use the non-principal eigenvectors to extract additional densely linked collections of hubs and authorities? • Positive and negative: a non-principal eigenvector has both positive and negative entries; the pages with the largest positive coordinates and the pages with the most negative coordinates each form a candidate collection
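One way to experiment with non-principal eigenvectors without a linear-algebra library is power iteration with deflation on $A^T A$. This is a sketch under stated assumptions, not the paper's procedure: the function names are illustrative, the method assumes the leading eigenvalues are distinct, and the overall sign of each returned eigenvector is arbitrary.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def gram(a: Sequence[Sequence[float]]) -> Matrix:
    """Return A^T A, where a[i][j] = 1 if page i links to page j."""
    n = len(a)
    return [[float(sum(a[k][i] * a[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def mat_vec(s: Matrix, v: Vector) -> Vector:
    return [sum(sij * vj for sij, vj in zip(row, v)) for row in s]


def power_iteration(s: Matrix, steps: int = 200) -> Tuple[float, Vector]:
    """Approximate the dominant eigenpair of a symmetric PSD matrix."""
    v = [1.0 + 0.01 * i for i in range(len(s))]  # slightly uneven start vector
    norm = sum(c * c for c in v) ** 0.5
    v = [c / norm for c in v]
    for _ in range(steps):
        w = mat_vec(s, v)
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    eigenvalue = sum(vi * wi for vi, wi in zip(v, mat_vec(s, v)))
    return eigenvalue, v


def top_eigenvectors(s: Matrix, count: int = 2, steps: int = 200) -> List[Tuple[float, Vector]]:
    """Leading eigenpairs via deflation: subtract each found component, then repeat."""
    work = [row[:] for row in s]
    found: List[Tuple[float, Vector]] = []
    for _ in range(count):
        lam, v = power_iteration(work, steps)
        found.append((lam, v))
        for i in range(len(work)):
            for j in range(len(work)):
                work[i][j] -= lam * v[i] * v[j]
    return found


def ends(v: Vector, names: Sequence[str], c: int = 2, eps: float = 1e-9):
    """Pages at the positive end and at the negative end of an eigenvector."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    negative = [names[i] for i in order[:c] if v[i] < -eps]
    positive = [names[i] for i in reversed(order[-c:]) if v[i] > eps]
    return positive, negative
```

Applying `ends` to the second and later eigenvectors of `gram(A)` gives candidate authority collections; running the same steps on $A A^T$ would give the matching hubs.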
Diffusion and Generalization • Diffusion happens • if the query specifies a topic that is not sufficiently broad, there will not be enough relevant pages in G • the most relevant collection in G is then not the “densest” one • as a result, the I and O operations will find a diffused collection of authorities corresponding to a “broader” topic • Limits the algorithm • The broader topic that supplants the original, too-specific query very often represents a natural generalization of the query topic • It provides a simple way of abstracting a specific query topic to a broader related one.
Results – Diffusion & Generalization • The use of non-principal eigenvectors, combined with basic term-matching, can be a simple way to extract collections of authoritative pages that are more relevant to a specific query topic
Evaluation • 26 broad search topics, 37 users • For each topic, took the top 10 pages from AltaVista, the top five hubs and five authorities from Clever, and a random set of 10 pages from Yahoo • The results • For 31% of the topics, Yahoo and Clever received evaluations equivalent to each other • For 50%, Clever received a higher evaluation • For 19%, Yahoo received the higher evaluation
Summary • Answering broad-topic queries • Finding authoritative pages using the good hubs and good authorities • Answering similar-page queries by starting with a different root set • Finding multiple hubs and authorities using non-principal eigenvectors • Overcoming diffusion and generalization by using non-principal eigenvectors and basic term matching
PageRank vs. HITS • HITS • Computation: required for each query • Query-dependent • Relatively easy to spam • Quality depends on quality of start set • Gives hubs as well as authorities • PageRank • Computation: once for all documents and queries (offline) • Query-independent – requires combination with query-dependent criteria • Hard to spam FROM: Monika Henzinger, Hyperlink Analysis on the Web
PageRank vs. HITS • PageRank • [Lempel] Not rank-stable: O(1) changes in the graph can change $O(N^2)$ order relations • [Ng, Zheng, Jordan 01] “Value”-stable: a change in k nodes (with PR values p_1, …, p_k) results in p* s.t. [formula missing] • HITS • Not rank-stable • “Value”-stability depends on the gap g between the largest and second-largest eigenvalue: a change of O(g) nodes results in p* s.t. [formula missing] FROM: Monika Henzinger, Hyperlink Analysis on the Web
References/Slide Sources • Jon M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, JACM 1999 • Monika Henzinger, “Hyperlink Analysis on the Web” • Original mini-web example: http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt • Vivek B. Tawde, “Authoritative sources in a hyperlinked environment” (presentation)
Conclusion • Influential paper • CiteSeer – 457 citations • ACM – 115 citations • Same time period as the Google PageRank algorithm