Authoritative Sources in a Hyperlinked Environment. By: Jon M. Kleinberg. Presented by: Yemin Shi, CS-572, June 30, 2011.
Ranking for search results
• Modern search engines may return millions of pages for a single query. No human user can preview that many, so a method is needed to filter out a small set of the most authoritative results.
• A ranking method processes the query results and puts the most useful information at the top of the list.
• Link-based methods focus on the way pages reference one another, which provides an efficient way to filter for authoritative results.
• Queries:
  • Specific queries, e.g. “What does Dr. Chris Mattmann think of the presentations between 3:30 and 5:00 PM PDT, June 30, 2011?”: very few pages contain the answer, and it is difficult to identify them.
  • Broad-topic queries, e.g. “java”: too many pages match, and a traditional text-based search engine has difficulty finding the authoritative ones.
  • Similar-page queries, e.g. “find pages similar to java”: similar to broad-topic queries.
Related to class material
• HITS stands for Hypertext Induced Topic Search.
• HITS pioneered link-based ranking and is one of the major web ranking models mentioned in class.
• This presentation goes into the details of how to calculate “authority” and “hub” pages, which were mentioned in class.
• We will compare HITS with the other link-based algorithm, PageRank.
• We will evaluate the pros and cons of the paper.
Outline
• Link-based algorithms
• HITS algorithm
  • Constructing a Focused Subgraph of the WWW
  • Computing Hubs and Authorities
• Comparison with PageRank
• Expansions
  • Similar-Page Queries (modification)
  • Social Network/Scientific Citation
  • Multiple Sets of Hubs and Authorities
  • Diffusion and generalization
• Evaluation
  • Pros and Cons of the paper
Link-based ranking algorithm
• Challenges of text-based ranking
  • www.harvard.edu is the most authoritative page for the query “harvard”, yet other pages may contain the keyword “harvard” more often.
  • Pages are not sufficiently self-descriptive: e.g. for the query “search engine”, Google does not use the term on its own pages.
  • The number of matching pages is too large to preview.
Link-based ranking algorithm
• Links encode latent human judgment
  • By including a link to page q, the creator of page p has in some measure conferred authority on q. Page q does not need to be self-descriptive.
  • The authority criterion must balance relevance and popularity (e.g. for automobiles, the VW, Benz, and BMW webpages are on topic, whereas www.yahoo.com has a very large in-degree but lacks thematic unity).
Link-based ranking algorithm
• Authority: an authority is a page with many in-links.
  • The page may have good or authoritative content on some topic, so many people trust it and link to it.
• Hub: a hub is a page with many out-links.
  • The page serves as an organizer of the information on a particular topic and points to many good authority pages on that topic.
Link-based ranking algorithm
• PageRank (Brin & Page 1998):
  • Computed for all webpages before any query arrives (query independent)
  • Computes authority only
  • Fast to compute
• HITS:
  • Performed on the set of webpages retrieved for each query (query dependent)
  • Computes both authorities and hubs
  • Needs more computation, so it is slow for real-time queries
HITS Algorithm
• Step 1: Constructing a Focused Subgraph of the WWW.
Requirements:
  Sq (the collection of pages with respect to query q) is small
  Sq is rich in relevant pages
  Sq contains most of the strongest authorities
Subgraph(q, E, t, d)
  q: a query string
  E: a text-based search engine   /* Narrow down: results from AltaVista */
  Let Rq denote the top t results of E on q.
  Set Sq := Rq
  For each page p in Rq:   /* Expanding */
    Add all pages that p points to into Sq;
    Add all pages that point to p into Sq. (If there are more than d such pages, randomly select d of them and add only those to Sq.)
    /* Limit: a single page can bring in at most d of the pages pointing to it. Otherwise it could involve hundreds of thousands of extra pages. */
  /* Remove intrinsic links (used for website navigation). Anti-collusion: allow at most m pages from a single domain to point to any given page. */
  Return Sq
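The expansion step of Subgraph above could be sketched in Python roughly as follows. This is a minimal sketch, not the paper's implementation: the `out_links` and `in_links` dictionaries are a hypothetical stand-in for a real link index, the root set is assumed to be already retrieved by the text engine, and the removal of intrinsic links and the per-domain limit m are left out.

```python
import random


def build_subgraph(root_set, out_links, in_links, d, seed=0):
    """Expand a root set Rq into the base set Sq.

    root_set  : list of page ids (top-t results of the text engine)
    out_links : dict page -> set of pages it points to (hypothetical index)
    in_links  : dict page -> set of pages pointing to it (hypothetical index)
    d         : cap on how many in-linking pages one root page may bring in
    """
    rng = random.Random(seed)
    base = set(root_set)
    for p in root_set:
        # Every page that p points to joins the base set.
        base |= out_links.get(p, set())
        # Pages pointing to p join too, but at most d of them.
        parents = sorted(in_links.get(p, set()))
        if len(parents) > d:
            parents = rng.sample(parents, d)
        base.update(parents)
    return base
```

For a root page "a" that links to "b" and is linked from "c", "d", and "e", calling `build_subgraph(["a"], {"a": {"b"}}, {"a": {"c", "d", "e"}}, d=2)` returns "a", "b", and two of the three in-linking pages.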
HITS Algorithm
• Step 2: Computing Hubs and Authorities
Rules:
  A good hub points to many good authorities.
  A good authority is pointed to by many good hubs.
  Authorities and hubs have a mutually reinforcing relationship.
Let the authority score of page i be x(i) and the hub score of page i be y(i). The mutually reinforcing relationship:
  I step: $x(i) = \sum_{j : j \to i} y(j)$ (sum over the pages j that point to i)
  O step: $y(i) = \sum_{j : i \to j} x(j)$ (sum over the pages j that i points to)
HITS Algorithm
[Figure: pages 2, 3, and 4 point to page 1; page 1 points to pages 5, 6, and 7.]
  x(1) = y(2) + y(3) + y(4)
  y(1) = x(5) + x(6) + x(7)
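The two equations for page 1 can be checked with a short Python sketch. The edge list is an assumption read off those equations (pages 2, 3, 4 point to page 1, which points to pages 5, 6, 7), and the function names `i_step` and `o_step` are illustrative only.

```python
# Edges inferred from the example: 2, 3, 4 point to 1; 1 points to 5, 6, 7.
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
pages = range(1, 8)


def i_step(edges, y):
    """Authority update: x(i) = sum of y(j) over pages j that point to i."""
    x = {p: 0.0 for p in y}
    for src, dst in edges:
        x[dst] += y[src]
    return x


def o_step(edges, x):
    """Hub update: y(i) = sum of x(j) over pages j that i points to."""
    y = {p: 0.0 for p in x}
    for src, dst in edges:
        y[src] += x[dst]
    return y


y0 = {p: 1.0 for p in pages}  # start with every hub score equal to 1
x1 = i_step(edges, y0)        # x1[1] = y0[2] + y0[3] + y0[4]
y1 = o_step(edges, x1)        # y1[1] = x1[5] + x1[6] + x1[7]
```

No normalization is applied here; the sketch only shows one round of the two sums.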
HITS Algorithm
• Recap:
  • If A is a square matrix, a non-zero vector v is an eigenvector of A if there is a scalar λ such that Av = λv.
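The eigenvector recap matters because the I and O steps can be written in matrix form. As a sketch, assume A is the adjacency matrix of the subgraph (A_ij = 1 if page i points to page j, a symbol not defined on the slides) and z is the all-ones starting vector. Then, up to normalization:

```latex
x \leftarrow A^{\mathsf{T}} y, \qquad y \leftarrow A x
\quad\Longrightarrow\quad
x^{(k)} \propto (A^{\mathsf{T}} A)^{k-1} A^{\mathsf{T}} z, \qquad
y^{(k)} \propto (A A^{\mathsf{T}})^{k} z
```

Repeated multiplication by a symmetric matrix is power iteration, so under mild conditions the normalized authority scores approach the principal eigenvector of A^T A and the normalized hub scores approach the principal eigenvector of A A^T.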
HITS Algorithm
• The Iterate(G, k) procedure can be applied to select the top c authorities and the top c hubs.
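A minimal pure-Python sketch of an Iterate-style loop followed by a top-c selection is given below. It assumes the graph is an edge list of (source, target) pairs and normalizes the scores so their squares sum to 1; the helper names `normalize` and `top_c` are illustrative and do not come from the slides.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def iterate(edges, pages, k):
    """Run k rounds of the I step and O step, normalizing after each round."""
    x = {p: 1.0 for p in pages}  # authority scores
    y = {p: 1.0 for p in pages}  # hub scores
    for _ in range(k):
        new_x = {p: 0.0 for p in pages}
        for src, dst in edges:   # I step uses the previous hub scores
            new_x[dst] += y[src]
        new_y = {p: 0.0 for p in pages}
        for src, dst in edges:   # O step uses the freshly updated authorities
            new_y[src] += new_x[dst]
        x = normalize(new_x)
        y = normalize(new_y)
    return x, y


def top_c(scores, c):
    """Return the c pages with the largest scores (ties broken by page id)."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:c]


# Toy graph: hubs h1 and h2 both point to a1; only h1 also points to a2.
toy_edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
toy_pages = ["h1", "h2", "a1", "a2"]
auth, hub = iterate(toy_edges, toy_pages, k=20)
best_authority = top_c(auth, 1)
best_hub = top_c(hub, 1)
```

On the toy graph, a1 collects links from both hubs and h1 points to both authorities, so they come out on top of their respective lists.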
HITS Results
• www.roadahead.com is ranked 123rd by AltaVista.
• Text-based search ignores the authorities.
• Text-based search plus link analysis works: the authorities it finds do not contain many occurrences of the query string “Gates”.
Related work
• Similar-page queries:
  • Text query: find t pages containing the string q.
  • Similar-page query: find t pages pointing to p.
  • E.g. pages similar to Honda: Ford, Toyota, etc.
• Social networks
  • Measure of standing by path counting (Katz)
• Scientific citations
• Multiple sets of hubs and authorities
  • The same query string can correspond to different meanings.
Highlights of the method
• Developed a set of algorithmic tools for extracting information from the link structure of hyperlinked environments.
• Formulated the notion of authority based on the relationship between a set of “authority” pages and a set of “hub” pages.
• Proposed a heuristic algorithm to find these pages.
• Surveyed variants and applications.
Evaluation: HITS vs PageRank
• Eigengap
  • The difference between the largest and second-largest eigenvalues of the matrix M.
• Ng et al. (2001) compared the stability of convergence of the two algorithms.
Idea: The Cora database is a collection containing the citation (similar to link) information from several thousand papers in AI. If a site or article is truly authoritative or influential, then surely the addition of a few links or a few citations should not change our minds about its having been very influential. Based on this idea, Ng et al. constructed a set of five perturbed databases, in each of which 30% of the papers from the base set were randomly deleted.
Evaluation: HITS vs PageRank
• HITS
• PageRank
Evaluation: HITS vs PageRank
• The eigenvectors of the matrices are indicated by the directions of the principal axes of the ellipses.
• A small perturbation can rotate the principal eigenvector by 45 degrees when the eigengap is small, and causes almost no change when the eigengap is large.
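The 45-degree effect can be reproduced on a toy case. The sketch below is not the experiment of Ng et al.; it assumes a hypothetical 2×2 symmetric matrix [[1 + gap, eps], [eps, 1]], whose unperturbed form (eps = 0) has eigengap `gap`, and uses the standard closed form for the principal-axis angle of a symmetric 2×2 matrix.

```python
import math


def principal_axis_angle(gap, eps):
    """Angle in degrees between the first coordinate axis and the principal
    eigenvector of the symmetric matrix [[1 + gap, eps], [eps, 1]].

    For a symmetric 2x2 matrix [[a, b], [b, d]] the principal axis satisfies
    tan(2 * theta) = 2 * b / (a - d); here a - d = gap and b = eps.
    """
    return math.degrees(0.5 * math.atan2(2.0 * eps, gap))


eps = 1e-6  # a tiny symmetric perturbation
angle_small_gap = principal_axis_angle(0.0, eps)  # eigengap of zero
angle_large_gap = principal_axis_angle(1.0, eps)  # eigengap of one
```

With no eigengap, even the tiny perturbation pins the principal eigenvector to the diagonal direction, while with an eigengap of 1 the same perturbation barely moves it.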
Evaluation: Pros
• Creative idea of splitting the authority concept into “Authority” and “Hub”, especially for 1998
• Efficient heuristic algorithm to solve for the authority weights and hub weights
• Query-driven dynamic ranking
• Solid theoretical background
• Abundant variants and applications
Evaluation: Cons
• The convergence is not as robust as PageRank's when there are perturbations.
• Topic drift
• Inefficiency at run time
• User behavior information is not integrated.
References
• J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46 (1999). Also appears as IBM Research Report RJ 10076, May 1997.
• A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. Proceedings of the 24th International Conference on Research and Development in Information Retrieval (SIGIR), New York, NY: ACM Press, 2001.
• Wikipedia: www.wikipedia.org
Questions?
• Thanks for your time!