
Authoritative Sources in a Hyperlinked Environment, Jon M. Kleinberg


Presentation Transcript


  1. Authoritative Sources in a Hyperlinked Environment • Jon M. Kleinberg • Presented by: Lekhendro

  2. Outline • Introduction • Constructing a Focused Subgraph • Computing Hubs and Authorities • Conclusion

  3. Introduction • How can the quality of search on the WWW be improved? • Judging the quality of search requires human evaluation, because notions such as relevance are inherently subjective. • Improving search quality is largely orthogonal to concerns of algorithmic efficiency and storage.

  4. Queries and Authoritative Sources • Types of queries: • Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?” • Broad-topic queries, e.g. “Find information about the Java programming language.” • Handling specific queries is difficult. • Scarcity problem: very few pages contain the required information, and it is often difficult to determine the identity of those pages. • For broad-topic queries, there are often thousands of relevant pages. • Abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. • One needs a way to filter a small set of the most authoritative or definitive pages out of a huge collection of relevant pages.

  5. Limitations of Text-Based Analysis • A text-based ranking function is not sufficient on its own. • E.g. for the query “harvard”, www.harvard.edu is the natural authoritative page, but many other web pages contain the term “harvard” more often. • Many of the most popular pages are not sufficiently self-descriptive. • The term “search engine” usually does not appear on the home pages of search engines such as Yahoo, AltaVista, and Excite. • The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.

  6. Analysis of Link Structure • Hyperlinks encode latent human judgment, which can be used to formulate a notion of authority. • The creation of a link is a concrete indication of the following type of judgment: • The creator of page p, by including a link to page q, has in some measure conferred authority on q. • Links give the user an opportunity to find potential authorities purely through the pages that point to them. • The paper proposes a link-based model for the conferral of authority. • The proposed method is shown to consistently identify relevant, authoritative web pages for broad search topics. • However, this notion of authority has pitfalls. • Many links are created for navigational purposes and do not confer authority. • It is difficult to strike the right balance between relevance and popularity.

  7. Authorities and Hubs • Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. • Hubs are index pages that provide many useful links to relevant content pages (topic authorities). • In-degree: the number of links pointing to a page; it is one simple measure of authority. • Out-degree: the number of links from a page to other pages.
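
As a small illustration of the in-degree and out-degree definitions above, the following Python sketch counts both from a list of links. The function name `degree_counts` and the toy page names are hypothetical and are not taken from the slides.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) counters for (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1  # one more link leaving `source`
        in_degree[target] += 1   # one more link pointing to `target`
    return in_degree, out_degree


# Hypothetical toy link graph: each pair (p, q) means page p links to page q.
toy_edges = [
    ("hub1", "authA"),
    ("hub1", "authB"),
    ("hub2", "authA"),
    ("authA", "authB"),
]
in_deg, out_deg = degree_counts(toy_edges)
```

Ranking pages by `in_deg` alone gives the simple authority measure mentioned on the slide; the hub and authority scores introduced later refine it by weighting each incoming link by the quality of the page it comes from.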

  8. Overview • Goal: discover authoritative WWW sources. • Determine hubs and authorities on a particular topic by analyzing a relevant subgraph of the web. • Given a keyword query, assign a hub score and an authority score to each page. • Pages with high authority scores are returned as the results of the query.

  9. Hubs & Authorities • Mutually reinforcing relationship: • Hubs point to lots of authorities. • Authorities are pointed to by lots of hubs. • Good hub: a page that points to many good authorities. • Good authority: a page pointed to by many good hubs.

  10. Constructing a Focused Subgraph of the WWW • Terms: • A collection of hyperlinked pages can be viewed as a directed graph $G = (V, E)$; nodes correspond to pages, and a directed edge $(p, q) \in E$ indicates the presence of a link from p to q. • Given a query string $\sigma$, determine the subgraph of the WWW on which to run the algorithm. • One option is the set Q of all pages containing the query string. • This approach has the following drawbacks: • The set may contain millions of pages. • The best authorities may not belong to this set. • Instead, the focus is on a set S of pages with the following properties: • (1) S is relatively small. • (2) S is rich in relevant pages. • (3) S contains most of the strongest authorities.

  11. Hubs and Authorities • Together, hubs and authorities tend to form a bipartite graph. [Figure: bipartite graph of hubs and authorities]

  12. Root Set and Base Set • Collect a root set R of the top-ranked pages for the query, using a text-based search engine (AltaVista). • R satisfies properties 1 and 2 of the focused set S (it is small and rich in relevant pages) but may not satisfy property 3 (containing most of the strongest authorities). • Every page in R contains the query string, so R is a subset of Q, the set of all pages containing the query string. • A strong authority for the query topic may not be in the root set, but it is quite likely to be pointed to by at least one page in the root set. • The number of strong authorities can therefore be increased by expanding the root set along the links that enter and leave it. [Figure: root set]

  13. Root Set and Base Set (cont'd) • Expand the root set into the base set by including (up to a designated size cut-off): • all pages linked to by pages in the root set • all pages that link to a page in the root set • A typical base set contains roughly 1000-5000 pages. [Figure: root set contained in the base set]

  14. Subgraph construction algorithm
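
The expansion from root set to base set described on the preceding slides can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the presenter's code: the link maps `out_links` and `in_links` are hypothetical in-memory dictionaries (in practice they would come from a crawler or a search engine's link queries), and the size cut-off is applied as a cap `d` on the number of in-linking pages added per root page, which is one way to read "designated size cut-off".

```python
def build_base_set(root_set, out_links, in_links, d=50):
    """Expand a root set into a base set.

    root_set:  iterable of page identifiers returned by a text search.
    out_links: dict mapping a page to the pages it links to (assumed input).
    in_links:  dict mapping a page to the pages that link to it (assumed input).
    d:         assumed cut-off on in-linking pages added per root page.
    """
    base = set(root_set)
    for page in root_set:
        # Add every page that a root page points to.
        base.update(out_links.get(page, []))
        # Add at most d pages that point to the root page.
        base.update(list(in_links.get(page, []))[:d])
    return base


# Hypothetical toy data.
toy_root = ["r1", "r2"]
toy_out = {"r1": ["a", "b"], "r2": ["b"]}
toy_in = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
toy_base = build_base_set(toy_root, toy_out, toy_in, d=2)
```

With `d=2`, only two of the three pages linking to `r1` are added, which keeps a single very popular root page from inflating the base set.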

  15. Heuristic • Two types of links: • Transverse: a link between pages with different domain names. • Intrinsic: a link between pages with the same domain name. • Delete all intrinsic links: • Most of them exist for navigation purposes. • They are less informative or merely repeat information. • In addition, allow at most m (typically 4 to 8) pages from a single domain to point to any given page.
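
A minimal Python sketch of the intrinsic-link filter follows. It assumes absolute URLs and treats the host name returned by `urllib.parse.urlsplit` as the "domain name", which is a simplification; the helper names `is_intrinsic` and `drop_intrinsic_links` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def is_intrinsic(source_url, target_url):
    """True if both URLs share the same host name (assumes absolute URLs)."""
    return urlsplit(source_url).hostname == urlsplit(target_url).hostname


def drop_intrinsic_links(edges):
    """Keep only transverse links, i.e. links between different hosts."""
    return [(s, t) for s, t in edges if not is_intrinsic(s, t)]


# Hypothetical example links.
toy_links = [
    ("http://example.com/index.html", "http://example.com/about.html"),
    ("http://example.com/index.html", "http://example.org/paper.html"),
    ("http://EXAMPLE.org/a", "http://example.org/b"),
]
transverse = drop_intrinsic_links(toy_links)
```

Because `hostname` is lower-cased by `urlsplit`, the third link is recognized as intrinsic despite the different capitalization.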

  16. Iterative Algorithm • For each page $p \in S$ maintain: • Authority score: $a_p$ (vector $a$) • Hub score: $h_p$ (vector $h$) • Initialize all $a_p = h_p = 1$. • Maintain normalized scores: $\sum_{p \in S} a_p^2 = 1$ and $\sum_{p \in S} h_p^2 = 1$.

  17. Computing Hubs and Authorities [Figure: a page p and its neighboring pages v1, v2, v3, labeled with hub scores h(v1), h(v2), h(v3) and authority scores a(v1), a(v2), a(v3)]

  18. Hubs and Authorities Computation (cont'd) • Authorities are pointed to by lots of good hubs: $a_p = \sum_{q:\, q \to p} h_q$ • Hubs point to lots of good authorities: $h_p = \sum_{q:\, p \to q} a_q$

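  19. Iterative Algorithm
      Initialize for all $p \in S$: $a_p = h_p = 1$
      For $i = 1$ to $k$:
          For all $p \in S$: $a_p = \sum_{q:\, q \to p} h_q$ (update authority scores)
          For all $p \in S$: $h_p = \sum_{q:\, p \to q} a_q$ (update hub scores)
          For all $p \in S$: $a_p = a_p / c$, with $c$ chosen so that $\sum_{p \in S} (a_p / c)^2 = 1$ (normalize $a$)
          For all $p \in S$: $h_p = h_p / c$, with $c$ chosen so that $\sum_{p \in S} (h_p / c)^2 = 1$ (normalize $h$)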
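
The iteration on this slide can be sketched in plain Python as shown below. This is an illustrative sketch, not the presenter's implementation: the function names `hits` and `normalize`, the edge-list representation, and the toy graph are assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)  # no links at all: leave the zeros untouched
    return {page: v / norm for page, v in scores.items()}


def hits(pages, edges, k=20):
    """Run k rounds of the hub/authority updates on a link graph.

    pages: iterable of page identifiers (the set S).
    edges: list of (p, q) pairs meaning page p links to page q.
    """
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update: sum the hub scores of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the (new) authority scores of pages p points to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: h1 and h2 act as hubs, a1 and a2 as authorities.
toy_pages = ["h1", "h2", "a1", "a2"]
toy_graph = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
auth_scores, hub_scores = hits(toy_pages, toy_graph)
```

In this toy graph `a1` is pointed to by both hubs and `h1` points to both authorities, so they end up with the highest authority and hub scores respectively.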

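  20. Example: Mini Web • Adjacency matrix, with rows and columns ordered X, Y, Z: $M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$ • Update rules in matrix form: $A_i = M^T H_{i-1}$ and $H_i = M A_i$ • Hence $A_i = M^T M \, A_{i-1}$ and $H_i = M M^T H_{i-1}$ [Figure: link graph of the mini web on pages X, Y, Z]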

  21. Example (cont'd) [Table: scores for X, Y, Z at iterations 0, 1, 2, 3, …, ∞] • X is the best hub. • Z is the most authoritative.

  22. Results • Authorities for the query “Java”: java.sun.com; comp.lang.java FAQ • Authorities for the query “search engine”: Yahoo.com; Excite.com; Lycos.com; Altavista.com • Authorities for the query “Gates”: Microsoft.com; roadahead.com

  23. Conclusions • The paper presents a link-analysis technique for locating high-quality information related to a broad search topic. • It is performed on the set of web pages retrieved for each query. • It computes authorities and hubs. • No index of its own is needed; only an interface to existing search engines is required. • IBM extended HITS into the CLEVER project, but it was not seen as a viable search engine, because computing the scores at query time is expensive.

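  24. Basic Knowledge of Matrices • $M$: a symmetric $n \times n$ matrix; $\omega$: a vector; $\lambda$: a number. • If $M\omega = \lambda\omega$ for some vector $\omega$, we say that $\lambda$ is an eigenvalue of $M$ and $\omega$ is a corresponding eigenvector. • The set of all such $\omega$ is a subspace of $\mathbb{R}^n$, the eigenspace associated with $\lambda$. • $\lambda_1(M), \lambda_2(M), \ldots$ are the eigenvalues, while $\omega_1(M), \omega_2(M), \ldots$ are eigenvectors; $\omega_i(M)$ belongs to the eigenspace of $\lambda_i(M)$. • If we assume $|\lambda_1(M)| > |\lambda_2(M)|$, we refer to $\omega_1(M)$ as the principal eigenvector and to all other $\omega_i(M)$ as non-principal eigenvectors.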

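  25. Convergence Proof of the Iterative Procedure • Theorem 1. The sequences $x_1, x_2, x_3, \ldots$ and $y_1, y_2, y_3, \ldots$ converge to $x^*$ and $y^*$ respectively. • Proof: Let $G = (V, E)$ with $V = \{p_1, p_2, \ldots, p_n\}$, and let $A$ be the adjacency matrix of $G$: $A_{ij} = 1$ if $(p_i, p_j)$ is an edge of $G$, and $0$ otherwise. • The I and O operations can be written as $x \leftarrow A^T y$ and $y \leftarrow A x$. • Starting from the all-ones vector $z$: $x_1 \propto A^T z$ and $x_k \propto A^T A \, x_{k-1}$, so $x_k$ is the unit vector in the direction of $(A^T A)^{k-1} A^T z$; likewise, $y_k$ is the unit vector in the direction of $(A A^T)^k z$. • Convergence then follows from the fact: “if $\nu$ is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, the unit vector in the direction of $M^k \nu$ converges to $\omega_1(M)$ as $k$ increases without bound.”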

  26. Convergence Proof of the Iterative Procedure (cont'd) • $A$ is called an orthogonal matrix if $A A^T = A^T A = I$, where $I$ is the identity matrix. • Theorem 2: $x^*$ is the principal eigenvector of $A^T A$, and $y^*$ is the principal eigenvector of $A A^T$. • Experiments show that $k = 20$ iterations are sufficient for the vectors to converge.
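
Theorem 2 can be checked numerically on a small case. The Python sketch below uses a hypothetical 3-page adjacency matrix (not the mini web from the slides) and hand-written helper functions; it runs 20 steps of the iteration $x \leftarrow A^T A x$ with normalization and then measures how close the result is to an eigenvector of $A^T A$.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Hypothetical adjacency matrix: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
AtA = matmul(transpose(A), A)

# x_1 is the unit vector in the direction of A^T z, with z the all-ones vector.
z = [1.0, 1.0, 1.0]
x = unit(matvec(transpose(A), z))
for _ in range(20):
    x = unit(matvec(AtA, x))

# If x is an eigenvector of A^T A, then (A^T A) x = eigenvalue * x.
Mx = matvec(AtA, x)
eigenvalue = sum(a * b for a, b in zip(x, Mx))
residual = math.sqrt(sum((a - eigenvalue * b) ** 2 for a, b in zip(Mx, x)))
```

For this matrix the non-zero block of $A^T A$ is $\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$, whose largest eigenvalue is $(3 + \sqrt{5})/2$, so `eigenvalue` should approach that value and `residual` should be close to zero. How fast this happens in general depends on the gap between the two largest eigenvalues.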

