Authoritative Sources in a Hyperlinked Environment

Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn

Searching the Web • Goal: find pages relevant to a query. • The basic text-based search algorithms retrieve pages that contain the query keywords. • Improved searching algorithms can examine the link structure of the web to learn about the contents of web pages. • This paper introduces an algorithm for identifying authoritative pages and hub pages.

Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Wrap-up

Types of Queries • Specific queries: information about the topic is scarce. • Broad-topic queries: information about the topic is overabundant. We want to return the most ‘authoritative’ pages. • Similar-page queries: find pages that are ‘like’ a given page. This paper examines broad-topic queries.

Complications with Text-based Search • An authoritative page for a query may not contain the query terms. • Example: www.uh.edu contains neither ‘University’ nor ‘Houston’, and contains ‘UH’ only six times. • Text may be in the form of images or Flash animations. • A page might not be self-descriptive. • Example: Honda does not describe itself as an automobile manufacturer, and Google does not describe itself as a search engine.

Examining Link Structure • The creator of a page p, by including a link to a page q, confers authority in some way on page q. • How can we exploit this latent human judgment? • Pitfall: Many links, such as navigational links and advertisement links, do not confer authority.

Exploiting Link Structure 1 • An authoritative page must be popular. • So, of all pages that contain the query terms, return those with the highest in-degree. • Pitfall: Still misses authoritative pages that do not contain the query terms. • Pitfall: Universally popular pages (like www.yahoo.com) will be considered highly authoritative for any query terms they contain.
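
The in-degree heuristic on this slide can be sketched in a few lines of Python. This is only an illustration: the link graph, the page names, and the helper `rank_by_in_degree` are hypothetical, and the set of matching pages is assumed to come from some text-based search.

```python
from collections import Counter

def rank_by_in_degree(links, matching_pages, top_k=3):
    """Rank pages that matched the text query by the number of pages linking to them."""
    in_degree = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each linking page once
            if target != source and target in matching_pages:
                in_degree[target] += 1
    # Highest in-degree first; ties broken alphabetically for a stable order.
    ranked = sorted(matching_pages, key=lambda p: (-in_degree[p], p))
    return ranked[:top_k]

# Hypothetical toy link graph: page -> pages it links to.
links = {
    "fan-site": ["honda", "toyota", "portal"],
    "blog": ["honda", "portal"],
    "news": ["portal"],
    "portal": ["news"],
}
# Hypothetical result of a text-based search for the query terms.
matching_pages = {"honda", "toyota", "portal"}

top = rank_by_in_degree(links, matching_pages)
print(top)
```

In this toy graph the general-purpose "portal" page outranks both manufacturers, which is the second pitfall listed on the slide.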

Exploiting Link Structure 2 • Authoritative sources often do not link to other authoritative sources. • Examples: Toyota does not link to Honda, and Google does not link to Teoma. • Other pages, which we call hub pages, link to multiple authoritative sources. • Example: Auto enthusiast websites link to multiple manufacturers’ websites. • The authoritative pages for a query share many hub pages.

Algorithm Overview • For a query, start with a text-based search to generate an initial root set R. • Enlarge the root set to a base set S. • Identify authoritative pages and hub pages in S. • Return the most authoritative pages in S.

Desiderata for S • S should: (1) be relatively small; (2) be rich in relevant pages; (3) contain most (or many) of the strongest authorities. • R will satisfy 1 and 2, but not 3. • Even the set of all pages that contain the query terms may not satisfy 3.

Enlarging R to S • Pages in R may not be authoritative, but most authoritative pages are probably pointed to by at least one member of R. • Pages in R may not point to each other. • Let S = R + all pages pointed to by pages of R + some pages that point to pages of R. • Use a heuristic to avoid navigation links. Kleinberg’s experiments had R  200 and S  1000 to 5000.

Identifying Hubs and Authorities • Our set S still has the problem of non-authoritative pages of high in-degree. • The authoritative pages are the popular pages that have a large overlap in the sets of pages that point to them. • The hub pages are the pages that point to many of the authoritative pages.

Hubs and Authorities Picture • [Figure: hubs pointing to authorities, together with an unrelated page of large in-degree.]

Mutually Reinforcing Relationship • Good hubs point to many good authorities. • Good authorities are pointed to by many good hubs. • This circular definition calls for an iterative algorithm.

Iterative Algorithm 1 • With each page p, we associate a non-negative authority weight x(p) and a non-negative hub weight y(p). • Values are normalized so that the squares of the weights of each type sum to 1. • Larger values indicate better pages.

Iterative Algorithm 2 • If p points to many pages with large x-values, then p receives a large y-value: $y(p) = \sum_{q \,:\, p \to q} x(q)$ • If p is pointed to by many pages with large y-values, then p receives a large x-value: $x(p) = \sum_{q \,:\, q \to p} y(q)$
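
The two update rules can be written directly as sums over links. The sketch below is a minimal illustration on a hypothetical four-page graph; the function names and the page names are assumptions, and no normalization is applied yet.

```python
def authority_update(links, y):
    """x(p) = sum of y(q) over all pages q that point to p."""
    x = {p: 0.0 for p in links}
    for q, targets in links.items():
        for p in targets:
            x[p] += y[q]
    return x

def hub_update(links, x):
    """y(p) = sum of x(q) over all pages q that p points to."""
    return {p: sum(x[q] for q in targets) for p, targets in links.items()}

# Hypothetical toy graph: every page appears as a key, page -> pages it links to.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}

y0 = {p: 1.0 for p in links}        # start with equal hub weights
x1 = authority_update(links, y0)    # a1 is pointed to by two pages, a2 by one
y1 = hub_update(links, x1)          # h1 points to both authorities
print(x1, y1)
```

After a single round, the page with more in-links has the larger x-value, and the page that points to both of them has the larger y-value.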

Iterative Algorithm 3 • We iterate and renormalize until values converge. • Therefore, we need to prove convergence. • The algorithm is a discrete-time evolution and can be written as multiplications of matrices and vectors, where M is the adjacency matrix of the link graph. • A result of linear algebra guarantees convergence of the authority weights x and the hub weights y to the principal eigenvectors of $M^{T}M$ and $MM^{T}$, respectively.
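
A minimal sketch of the full iterate-and-renormalize loop follows. It assumes a hypothetical toy graph with at least one link, a fixed number of rounds instead of a convergence test, and normalization so that the squares of the weights sum to 1. The names `hits` and `normalize` are chosen here for illustration.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1 (assumes a non-zero vector)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()}

def hits(links, iterations=20):
    """Return (authority weights x, hub weights y) after a fixed number of rounds."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    x = {p: 1.0 for p in pages}
    y = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages pointing to q.
        new_x = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_x[q] += y[p]
        # Hub update: sum the fresh authority weights of the pages p points to.
        new_y = {p: sum(new_x[q] for q in links.get(p, ())) for p in pages}
        x, y = normalize(new_x), normalize(new_y)
    return x, y

# Hypothetical toy graph: page -> pages it links to.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
x, y = hits(links)
best_authority = max(x, key=x.get)
best_hub = max(y, key=y.get)
print(best_authority, best_hub)
```

On this graph the weights settle quickly, so a fixed, modest number of rounds is enough to read off the ordering of the pages.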

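Example: Mini Web • Pages X, Y, and Z, with adjacency matrix M (rows and columns in the order X, Y, Z):

$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$

• $H_i = M A_{i-1}$ • $A_i = M^{T} H_{i-1}$ • $H_i = M M^{T} H_{i-1}$ • $A_i = M^{T} M A_{i-1}$ • [Figure: link diagram of pages X, Y, and Z.]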

Example • [Table: weights of X, Y, and Z at iterations 0, 1, 2, 3, …, ∞.] • X is the best hub. • Z is most authoritative.

Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Example • Wrap-up

Notes to Consider • In general, we don’t need to iterate to convergence. • Paper contains a list of good results for various queries. • After initial text-based search, the text was ignored in favor of the link structure.

Related Areas • Similar-page queries. • Connections with: • Social networks • Bibliometrics (citations) • Stand-alone hypertext environments • Clustering of link structures • Multiple sets of hubs and authorities • Diffusion and Generalization

Conclusion • Influential paper with many citations. • Published at around the same time as Google's PageRank algorithm. • HITS: Hyperlink-Induced Topic Search. • Clever (IBM). • Basis of the Teoma search engine algorithm.

References Kleinberg, Jon. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604-632. The mini-web example comes from http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt

The End
