Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn
Searching the Web • Goal: find pages relevant to a query. • The basic text-based search algorithms retrieve pages that contain the query keywords. • Improved searching algorithms can examine the link structure of the web to learn about the contents of web pages. • This paper introduces an algorithm for identifying authoritative pages and hub pages.
Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Wrap-up
Types of Queries • Specific queries: information about the topic is scarce. • Broad-topic queries: information about the topic is overabundant. We want to return the most ‘authoritative’ pages. • Similar-page queries: find pages that are ‘like’ a given page. This paper examines broad-topic queries.
Complications with Text-based Search • An authoritative page for a query may not contain the query terms. • Example: www.uh.edu contains neither ‘University’ nor ‘Houston’, and contains ‘UH’ only six times. • Text may be in the form of images or Flash animations. • A page might not be self-descriptive. • Example: Honda does not describe itself as an automobile manufacturer, and Google does not describe itself as a search engine.
Examining Link Structure • The creator of a page p, by including a link to a page q, confers authority in some way to page q. • How can we exploit this latent human judgment information? • Pitfall: Many links, such as navigational links and advertisement links, do not confer authority.
Exploiting Link Structure 1 • An authoritative page must be popular. • So, of all pages that contain the query terms, return those with the highest in-degree. • Pitfall: Still misses authoritative pages that do not contain the query terms. • Pitfall: Universally popular pages (like www.yahoo.com) will be considered highly authoritative for any query terms they contain.
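The in-degree heuristic and its two pitfalls can be seen on a toy example. This is only a sketch: the page names, the `links` and `text` dictionaries, and the function `rank_by_in_degree` are all hypothetical and not from the paper.

```python
# Hypothetical toy web: page -> set of pages it links to.
links = {
    "a": {"portal", "honda"},
    "b": {"portal", "honda"},
    "c": {"portal"},
    "portal": set(),
    "honda": set(),
}
# Hypothetical text index: page -> words that appear on it.
text = {
    "a": {"cars"},
    "b": {"cars"},
    "c": {"news"},
    "portal": {"cars", "news", "mail"},
    "honda": {"honda"},
}


def rank_by_in_degree(query, links, text):
    """Return pages containing `query`, sorted by in-degree (highest first)."""
    in_degree = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1
    matches = [page for page in text if query in text[page]]
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(matches, key=lambda page: (-in_degree.get(page, 0), page))


ranking = rank_by_in_degree("cars", links, text)
print(ranking)
```

In this toy data the universally popular "portal" page ranks first for "cars", while "honda" never appears because it does not contain the query term.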
Exploiting Link Structure 2 • Authoritative sources often do not link to other authoritative sources. • Examples: Toyota does not link to Honda, and Google does not link to Teoma. • Other pages, which we call hub pages, link to multiple authoritative sources. • Example: Auto enthusiast websites linking to multiple manufacturers’ websites. • The authoritative pages for a query share many hub pages.
Algorithm Overview • For a given query, start with a text-based search to generate an initial root set R. • Enlarge the root set to a base set S. • Identify authoritative pages and hub pages in S. • Return the most authoritative pages in S.
Desiderata for S S should: 1. Be relatively small. 2. Be rich in relevant pages. 3. Contain most (or many) of the strongest authorities. R will satisfy 1 and 2, but not 3. Even the set of all pages that contain the query terms may not satisfy 3.
Enlarging R to S • Pages in R may not be authoritative, but most authoritative pages are probably pointed to by at least one member of R. • Pages in R may not point to each other. • Let S = R, plus all pages pointed to by pages of R, plus some pages that point to pages of R. • Use a heuristic to avoid navigation links. In Kleinberg’s experiments, R had 200 pages and S had 1000 to 5000 pages.
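The expansion from R to S can be sketched as follows. The function names, the `max_in_per_page` cap and its default, the example URLs, and the same-host test used as the navigation heuristic are all assumptions made for illustration; the slides do not say which heuristic or cap to use.

```python
from urllib.parse import urlsplit


def is_navigational(source_url, target_url):
    """Assumed heuristic: a link that stays on one host is navigation."""
    return urlsplit(source_url).hostname == urlsplit(target_url).hostname


def build_base_set(root, out_links, in_links, max_in_per_page=50):
    """Expand the root set R into the base set S.

    out_links[p]: pages that p points to.
    in_links[p]:  pages that point to p.
    max_in_per_page: assumed cap on in-linking pages added per root page,
        so that one very popular root page cannot blow up S.
    """
    base = set(root)
    for page in root:
        # Add every page that a root page points to.
        for target in out_links.get(page, ()):
            if not is_navigational(page, target):
                base.add(target)
        # Add only some of the pages that point to a root page.
        sources = [s for s in sorted(in_links.get(page, ()))
                   if not is_navigational(s, page)]
        base.update(sources[:max_in_per_page])
    return base


# Hypothetical data for a two-page root set.
root = {"http://r1.example/a", "http://r2.example/b"}
out_links = {
    "http://r1.example/a": {"http://r1.example/about", "http://auth1.example/"},
    "http://r2.example/b": {"http://auth1.example/", "http://auth2.example/"},
}
in_links = {
    "http://r1.example/a": {"http://hub1.example/", "http://hub2.example/",
                            "http://hub3.example/"},
    "http://r2.example/b": {"http://hub1.example/"},
}
base = build_base_set(root, out_links, in_links, max_in_per_page=2)
print(sorted(base))
```

Capping the in-linking pages is what keeps S "relatively small" while still pulling in likely hubs and authorities.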
Identifying Hubs and Authorities • Our set S still has the problem of non-authoritative pages of high in-degree. • The authoritative pages are the popular pages that have a large overlap in the sets of pages that point to them. • The hub pages are the pages that point to many of the authoritative pages.
Hubs and Authorities Picture [Figure: diagram with the labels ‘hubs’, ‘authorities’, and ‘unrelated page of large in-degree’.]
Mutually Reinforcing Relationship • Good hubs point to many good authorities. • Good authorities are pointed to by many good hubs. • Because each definition depends on the other, the weights are computed with an iterative algorithm.
Iterative Algorithm 1 • For each page p, we associate a non-negative authority weight x(p) and a non-negative hub weight y(p). • Values are normalized. • Larger values indicate better pages.
Iterative Algorithm 2 • If p points to many pages with large x-values, then p receives a large y-value: $y(p) = \sum_{q : p \to q} x(q)$ • If p is pointed to by many pages with large y-values, then p receives a large x-value: $x(p) = \sum_{q : q \to p} y(q)$
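One round of the two update rules can be sketched in Python. The function `update_weights`, the page names, and the choice to start every hub weight at 1 are assumptions made for illustration; normalization is left out here to keep the arithmetic visible.

```python
def update_weights(links, hub):
    """Apply one round of the two update rules (no normalization).

    links[p]: set of pages that p points to.
    hub[p]:   current hub weight y(p) for every page.
    Returns (authority, new_hub) as dictionaries keyed by page.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    # x(p) = sum of y(q) over all pages q that point to p.
    authority = {p: 0.0 for p in pages}
    for q, targets in links.items():
        for p in targets:
            authority[p] += hub[q]
    # y(p) = sum of x(q) over all pages q that p points to.
    new_hub = {p: sum(authority[q] for q in links.get(p, ())) for p in pages}
    return authority, new_hub


# Hypothetical graph: two hub-like pages pointing at two authority-like pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}}
hub = {page: 1.0 for page in ("h1", "h2", "a1", "a2")}
authority, hub = update_weights(links, hub)
print(authority)
print(hub)
```

After one round, "a1" (pointed to by both hubs) has a larger authority weight than "a2", and "h1" (pointing to both authorities) has a larger hub weight than "h2".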
Iterative Algorithm 3 • We iterate and renormalize until values converge. • Therefore, we need to prove convergence. • The algorithm is a discrete-time evolution and can be written as multiplications of matrices and vectors. • A result of linear algebra guarantees convergence of X and Y to the principal eigenvectors of $M^T M$ and $M M^T$, respectively.
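The matrix form of the iteration can be sketched in plain Python. Several things here are assumptions: the convention that `M[i][j]` is 1 when page i links to page j, normalizing so that the squared weights sum to 1, the tolerance and iteration cap, and the four-page graph itself. The sketch also assumes the graph has at least one link, since otherwise the normalization would divide by zero.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    """Scale the vector so that its squared entries sum to 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def hits(M, tol=1e-10, max_iter=1000):
    """Iterate x = M^T y, y = M x with renormalization until the weights settle.

    M[i][j] is 1 if page i links to page j (assumed convention).
    Returns (authority weights x, hub weights y, iterations used).
    """
    Mt = transpose(M)
    n = len(M)
    x = [0.0] * n
    y = normalize([1.0] * n)
    for iteration in range(1, max_iter + 1):
        new_x = normalize(mat_vec(Mt, y))
        new_y = normalize(mat_vec(M, new_x))
        change = max(abs(a - b) for a, b in zip(new_x + new_y, x + y))
        x, y = new_x, new_y
        if change < tol:
            break
    return x, y, iteration


# Hypothetical graph: pages 0 and 1 act as hubs, pages 2 and 3 as authorities.
M = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
x, y, iterations = hits(M)
print("authority:", x)
print("hub:", y)
print("iterations:", iterations)
```

On this small graph the weights settle after a handful of iterations, and the resulting x satisfies $M^T M x = \lambda x$ up to rounding, which is the eigenvector property the convergence result refers to.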
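Example: Mini Web [Figure: link graph on three pages X, Y, and Z.] The matrix M, with rows and columns ordered X, Y, Z, is
$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
With hub vector $H$ and authority vector $A$, each iteration computes
$$A_i = M^T H_{i-1}, \qquad H_i = M A_i,$$
which is equivalent to
$$A_i = M^T M A_{i-1}, \qquad H_i = M M^T H_{i-1}.$$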
Example [Table: weights of X, Y, and Z at iterations 0, 1, 2, 3, …, ∞.] X is the best hub. Z is most authoritative.
Notes to Consider • In general, we don’t need to iterate to convergence. • The paper lists good results for various queries. • After the initial text-based search, the text is ignored in favor of the link structure.
Related Areas • Similar-page queries. • Connections with: • Social networks • Bibliometrics (citations) • Stand-alone hypertext environments • Clustering of link structures • Multiple sets of hubs and authorities • Diffusion and Generalization
Conclusion • Influential paper with many citations. • Published around the same time as Google’s PageRank algorithm. • HITS: Hyperlink-Induced Topic Search • Clever (IBM) • Basis of the Teoma search engine algorithm.
References Kleinberg, Jon. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604-632. The mini-web example comes from http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt