Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg. Presented By: Lekhendro. Outline. Introduction Constructing focused Subgraph Computing Hubs and Authorities Conclusion. Introduction. How to improve quality of search on WWW ?
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presented By: Lekhendro
Outline • Introduction • Constructing focused Subgraph • Computing Hubs and Authorities • Conclusion
Introduction • How can the quality of search on the WWW be improved? • Judging search quality requires human evaluation because notions such as relevance are inherently subjective. • The quality of search results is orthogonal to storage concerns.
Queries and Authoritative Sources • Types of queries • Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?” • Broad-topic queries, e.g. “Find information about the Java programming language.” • Handling specific queries is difficult. • Scarcity problem: few pages contain the required information, and it is difficult to identify those pages. • For broad-topic queries, there are sometimes thousands of relevant pages. • Abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. • One needs a way to filter a small set of authoritative or definitive pages out of a huge collection of relevant pages.
Limitations of text-based analysis • Text-based ranking function • E.g. for the query “harvard”, www.harvard.edu is the proper authoritative page, but many other web pages may contain “harvard” more often. • The most popular pages are often not sufficiently self-descriptive. • The term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista, and Excite. • The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.
Analysis of link structure • Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority. • Creating a link is a concrete indication of the following type of judgment: • The creator of page p, by including a link to page q, has in some measure conferred authority on q. • This gives the user an opportunity to find potential authorities purely through the pages that point to them. • The paper proposes a link-based model for the conferral of authority. • It shows that the proposed method consistently identifies relevant authoritative web pages for broad search topics. • However, this idea has pitfalls. • Many links are created for navigational purposes. • It is difficult to strike a balance between relevance and popularity.
Authorities and Hubs • Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. • Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). • In-degree: the number of pointers to a page; it is one simple measure of authority. • Out-degree: the number of pointers from a page to other pages.
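As a small illustration of the two degree measures, the sketch below counts in-degree and out-degree from a list of (source, target) links. The function name `degrees` and the toy link list are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of (source, target) links."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1  # one more pointer leaving src
        in_deg[dst] += 1   # one more pointer entering dst
    return dict(in_deg), dict(out_deg)


# Hypothetical toy link list: pages "a" and "b" both point to "c", and "c" points to "d".
edges = [("a", "c"), ("b", "c"), ("c", "d")]
in_deg, out_deg = degrees(edges)
print(in_deg)
print(out_deg)
```

Here page "c" has the highest in-degree, so by this simple measure alone it would be the strongest authority candidate.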
Overview • Discover authoritative WWW sources globally. • Determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. • Given a keyword query, assign a hub score and an authority score to each page. • Pages with high authority scores are the results of the query.
Hubs & Authorities • Mutually reinforcing relationship: • Hubs point to lots of authorities. • Authorities are pointed to by lots of hubs • Good hub: page that points to many good authorities. • Good authority: page pointed to by many good hubs.
Constructing a focused subgraph of WWW • Terms: • A collection of hyperlinked pages can be viewed as a directed graph G = (V, E); nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q. • Given a query string, determine a subgraph of the WWW on which to run the analysis. • The graph could include all the pages containing the query string. • This approach has the following drawbacks. • The set may contain millions of pages. • The best authorities may not belong to this set. • Instead, focus on a set S of pages with the following properties: • 1. S is very small. • 2. S is rich in relevant pages. • 3. S contains most of the strongest authorities.
Hubs and Authorities • Together they tend to form a bipartite graph. • (Figure: bipartite graph with hubs on one side and authorities on the other.)
Root Set and Base Set • Collect a root set R of the top-ranked pages for the query, using a text-based search engine (AltaVista). • R satisfies properties 1 and 2 (it is small and rich in relevant pages) but may not satisfy property 3 (containing most of the strongest authorities). • Every page in R contains the query string, so R is a subset of the set Q of all pages containing the query. • A strong authority for the query topic may not be in the root set, but it is quite likely to be pointed to by at least one page in the root set. • The number of authorities can therefore be increased by expanding the root set along the links that enter and leave it.
Root Set and Base Set (Cont’d) • Expand the root set into a base set by including (up to a designated size cut-off) • all pages linked to by pages in the root set • all pages that link to a page in the root set • A typical base set contains roughly 1000-5000 pages. • (Figure: the root set drawn inside the larger base set.)
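The expansion step can be sketched as follows. This is a minimal illustration, assuming the link structure is already available as two dictionaries; the names `build_base_set`, `out_links`, `in_links` and the default cut-off `d=50` are assumptions for the example, since the slides only say "a designated size cut-off".

```python
def build_base_set(root, out_links, in_links, d=50):
    """Expand a root set into a base set.

    root:      iterable of page ids returned by the text-based search engine.
    out_links: dict mapping a page to the pages it links to.
    in_links:  dict mapping a page to the pages that link to it.
    d:         assumed cut-off on how many in-linking pages are added per root page.
    """
    root = list(root)
    base = set(root)
    for p in root:
        # All pages that p points to join the base set.
        base.update(out_links.get(p, ()))
        # At most d of the pages pointing to p join the base set
        # (sorted only to make the choice deterministic in this sketch).
        base.update(sorted(in_links.get(p, ()))[:d])
    return base


# Hypothetical toy data.
root = ["r1", "r2"]
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["x", "y", "z"], "r2": []}
base = build_base_set(root, out_links, in_links, d=2)
print(sorted(base))
```

The cut-off is applied only to in-links here because a popular root page can be pointed to by a very large number of pages, while its out-links are bounded by the page itself.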
Heuristic • Two types of links: • Transverse: the link is between pages with different domain names. • Intrinsic: the link is between pages with the same domain name. • Delete all intrinsic links. • Most of them exist for navigation purposes. • They are less informative or repeat information. • Alternatively, keep links from only up to m (4 to 8) pages of the same domain.
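A minimal sketch of the intrinsic-link filter is shown below. It assumes that the "domain name" of a page can be approximated by the host part of its URL, which is a simplification (for example, it treats www.example.com and example.com as different domains). The function names are hypothetical.

```python
from urllib.parse import urlparse


def is_transverse(src_url, dst_url):
    """True if the two URLs have different host names (assumed proxy for domain name)."""
    return urlparse(src_url).hostname != urlparse(dst_url).hostname


def drop_intrinsic(links):
    """Keep only transverse links from a list of (source_url, target_url) pairs."""
    return [(s, t) for s, t in links if is_transverse(s, t)]


# Hypothetical toy links.
links = [
    ("http://example.com/index.html", "http://example.com/about.html"),  # intrinsic
    ("http://example.com/index.html", "http://example.org/java.html"),   # transverse
]
kept = drop_intrinsic(links)
print(kept)
```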
Iterative Algorithm • For each page p ∈ S maintain: • Authority score: $a_p$ (vector $a$) • Hub score: $h_p$ (vector $h$) • Initialize all $a_p = h_p = 1$ • Maintain normalized scores: $\sum_{p \in S} a_p^2 = 1$ and $\sum_{p \in S} h_p^2 = 1$
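Computing Hubs and Authorities • (Figure: two panels, each showing a page p with neighbors v1, v2, v3; one panel is labeled with the hub scores h(v1), h(v2), h(v3), the other with the authority scores a(v1), a(v2), a(v3).)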
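Hubs and authorities computation (contd) • Authorities are pointed to by lots of good hubs: $a_p = \sum_{q:\, q \to p} h_q$ • Hubs point to lots of good authorities: $h_p = \sum_{q:\, p \to q} a_q$
Iterative Algorithm
Initialize for all p ∈ S: $a_p = h_p = 1$
For i = 1 to k:
    For all p ∈ S: $a_p = \sum_{q:\, q \to p} h_q$ (update auth. scores)
    For all p ∈ S: $h_p = \sum_{q:\, p \to q} a_q$ (update hub scores)
    For all p ∈ S: $a_p = a_p / c$, with c chosen so that $\sum_{p \in S} (a_p / c)^2 = 1$ (normalize a)
    For all p ∈ S: $h_p = h_p / c$, with c chosen so that $\sum_{p \in S} (h_p / c)^2 = 1$ (normalize h)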
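The iteration above can be sketched in plain Python as follows. This is an illustrative sketch, not the author's implementation: the function names, the edge-list representation, and the toy graph are assumptions, and the default of 20 iterations is taken from the convergence remark later in the slides.

```python
import math


def normalize(scores):
    """Scale a score dict so that the sum of squared scores is 1."""
    c = math.sqrt(sum(v * v for v in scores.values()))
    if c == 0:
        return dict(scores)  # nothing to scale (graph without edges)
    return {p: v / c for p, v in scores.items()}


def hits(pages, edges, k=20):
    """Run k rounds of the hub/authority updates on a directed edge list."""
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update: sum the hub scores of the pages pointing to p.
        new_a = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_a[dst] += h[src]
        a = new_a
        # Hub update: sum the (just updated) authority scores of the pages p points to.
        new_h = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_h[src] += a[dst]
        h = new_h
        a = normalize(a)
        h = normalize(h)
    return a, h


# Hypothetical toy graph: h1 links to a1 and a2, h2 links to a1 only.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
a, h = hits(pages, edges)
print(max(a, key=a.get), max(h, key=h.get))
```

In this toy graph a1 is pointed to by both hubs and h1 points to both authorities, so they receive the highest authority and hub scores respectively.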
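Example: Mini Web • Pages X, Y, Z with link matrix (rows and columns in the order X, Y, Z; entry (i, j) is 1 if page i links to page j): $M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$ • $A_i = M^T H_{i-1}$ • $H_i = M A_i$ • Hence $H_i = M M^T H_{i-1}$ and $A_i = M^T M A_{i-1}$ • (Figure: link graph on the pages X, Y, Z.)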
Example (contd) • (Table: scores of X, Y, Z at iterations 0, 1, 2, 3, …, ∞.) • X is the best hub • Z is most authoritative
Results • Authorities for query: “Java” • java.sun.com • comp.lang.java FAQ • Authorities for query “search engine” • Yahoo.com • Excite.com • Lycos.com • Altavista.com • Authorities for query “Gates” • Microsoft.com • roadahead.com
Conclusions • A link-analysis technique for locating high-quality information related to a broad search topic. • It is performed on the set of web pages retrieved for each query. • It computes authorities and hubs. • No indexing is needed; only an interface to different search engines is required. • IBM expanded HITS into CLEVER, but it was not seen as a viable search engine because the computation is hard to execute in real time.
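Basic knowledge of matrices • M: a symmetric n×n matrix; $\omega$: a vector; $\lambda$: a number. • If $M\omega = \lambda\omega$ for some vector $\omega$, we say that $\omega$ is an eigenvector of M with eigenvalue $\lambda$. • The set of all such $\omega$ is a subspace of $R^n$, the eigenspace associated with $\lambda$. • $\lambda_1(M), \lambda_2(M), \dots$ are the eigenvalues, while $\omega_1(M), \omega_2(M), \dots$ are the eigenvectors; $\omega_i(M)$ belongs to the eigenspace of $\lambda_i(M)$. • If we assume $|\lambda_1(M)| > |\lambda_2(M)|$, we refer to $\omega_1(M)$ as the principal eigenvector, and to all other $\omega_i(M)$ as non-principal eigenvectors.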
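Convergence Proof of the Iterative Procedure • Theorem 1. The sequences x1, x2, x3, … and y1, y2, y3, … converge to x* and y* respectively. • Proof: Let G = (V, E) with V = {p1, p2, …, pn}, and let A be the adjacency matrix of G: $A_{ij} = 1$ if $(p_i, p_j)$ is an edge of G, and 0 otherwise. • The I and O operations can be written as $x \leftarrow A^T y$ and $y \leftarrow A x$. • Starting from the all-ones vector z, k loops give $x^{(1)} = A^T z$ and $x^{(k)} = A^T A\, x^{(k-1)}$ (up to normalization), so $x^{(k)}$ is the unit vector in the direction of $(A^T A)^{k-1} A^T z$ and $y^{(k)}$ is the unit vector in the direction of $(A A^T)^k z$. • Convergence to x* and y* then follows from the fact: “if $\nu$ is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, the unit vector in the direction of $M^k \nu$ converges to $\omega_1(M)$ as k increases without bound.”
Convergence Proof of the Iterative Procedure (cont.) • A is called an orthogonal matrix if $A A^T = A^T A = E$, where E is the identity matrix. • Theorem 2: x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$. • Experiments show that k = 20 is sufficient for the vectors to converge.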
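The convergence claim can be checked numerically on a small example. The sketch below repeatedly applies $A^T A$ to $A^T z$, normalizes, and then verifies that the result x satisfies $A^T A x \approx \lambda x$. The adjacency matrix and all helper names are hypothetical and chosen only for illustration; the choice k = 20 follows the slide.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def unit(v):
    """Return the unit vector in the direction of v (v must be nonzero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def authority_limit(adj, k=20):
    """Unit vector in the direction of (A^T A)^(k-1) A^T z, with z the all-ones vector."""
    at = transpose(adj)
    z = [1.0] * len(adj)
    x = unit(matvec(at, z))                   # x(1) is proportional to A^T z
    for _ in range(k - 1):
        x = unit(matvec(at, matvec(adj, x)))  # x <- A^T A x, then normalize
    return x


# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
x = authority_limit(adj)
ata_x = matvec(transpose(adj), matvec(adj, x))
lam = sum(p * q for p, q in zip(x, ata_x))  # Rayleigh quotient estimate of the eigenvalue
residual = max(abs(u - lam * v) for u, v in zip(ata_x, x))
print(lam, residual)
```

For this matrix the nonzero block of $A^T A$ is $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, whose largest eigenvalue is $(3 + \sqrt{5})/2$, so the printed estimate should be close to that value with a residual near zero.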