Authoritative Sources in a Hyperlinked Environment
• Jon M. Kleinberg
• ACM-SIAM Symposium, 1998
• Krishna Venkateswaran
Basic Idea
• R is the root set; it contains the top t results of a text-based search.
• R is grown into a base set S so that S contains a rich collection of authoritative pages.
• Any page that is pointed to by a page in R is added to S.
• S is the base set generated by the algorithm; it is used to determine the hubs and authorities.
Algorithm
• Get a set of results for the query string from a text-based search engine.
• Take the top t results and put them in a set R.
• Initialize S to R. For every page p in R:
  • Add to S all the pages that p points to.
  • Add to S at most d of the pages that point to p.
• Result returned: the base set S, from which the top authorities and hubs are computed.
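The growth step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `build_base_set` and the two dictionaries `out_links` and `in_links` are assumptions made for the example, and "at most d pages" is approximated by taking the first d in-linking pages.

```python
def build_base_set(root, out_links, in_links, d):
    """Grow the root set R into the base set S.

    root      -- list of page ids (the top-t search results, i.e. R)
    out_links -- dict: page -> list of pages it points to (hypothetical input)
    in_links  -- dict: page -> list of pages pointing to it (hypothetical input)
    d         -- maximum number of in-linking pages added per root page
    """
    base = set(root)  # S starts out equal to R
    for page in root:
        # Add every page that this root page points to.
        base.update(out_links.get(page, ()))
        # Add at most d pages that point to this root page.
        # Assumption: "the first d" stands in for an arbitrary choice of d pages.
        pointing = list(in_links.get(page, ()))
        base.update(pointing[:d])
    return base
```

The cap d keeps a single very popular root page from flooding S with its in-links.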
Heuristics
To determine what pages to add to the set S.
• Heuristic 1: Avoiding navigational links.
  • Transverse links: links between pages with different domain names.
  • Intrinsic links (navigational links): links between pages within the same domain.
  • Delete all intrinsic links.
• Heuristic 2: Avoiding mass endorsements.
  • Mass endorsement: a large number of pages in one domain pointing to a single page.
  • Example: “This site is designed by …” followed by a link.
  • Eliminate this by setting a parameter m and allowing at most m pages from a single domain to point to any given page.
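Both heuristics amount to a filter over the link list. The sketch below is illustrative only: `filter_links` and `domain` are hypothetical helper names, and a page's "domain" is assumed here to be the full host name of its URL, which may be coarser or finer than what the original work used.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    # Assumption: the domain of a page is the host part of its URL.
    return urlparse(url).netloc.lower()


def filter_links(links, m):
    """Apply both heuristics to an iterable of (source_url, target_url) pairs."""
    kept = []
    # Number of links already kept from a given source domain to a given target page.
    count = defaultdict(int)
    for src, dst in links:
        # Heuristic 1: drop intrinsic (same-domain) links.
        if domain(src) == domain(dst):
            continue
        # Heuristic 2: keep at most m links from one domain to one page.
        key = (domain(src), dst)
        if count[key] >= m:
            continue
        count[key] += 1
        kept.append((src, dst))
    return kept
```

With m = 2, a "designed by" footer repeated on hundreds of pages of one site contributes only two links to the designer's page.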
Computing Hubs and Authorities
• Authorities are extracted from the overall collection of pages through an analysis of the link structure of G.
• A good hub points to many good authorities, and a good authority is pointed to by many good hubs.
[Figure: hubs pointing to authorities, with an unrelated page of large in-degree]
Basic Idea
• Each page p has a non-negative authority weight and a non-negative hub weight.
• If p points to pages with large authority weights, then p has a large hub weight.
• If p is pointed to by pages with large hub weights, then p has a large authority weight.
• Pages with higher weights are better authorities and hubs.
Basic Operations
• I operation: authority weight of a page = sum of the hub weights of all pages pointing to it.
• O operation: hub weight of a page = sum of the authority weights of all pages it points to.
• I and O reinforce each other.
• Normalization: the authority weights and the hub weights are each divided by a constant so that the sum of their squares equals 1.
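Written as formulas, the two operations and the normalization look as follows. The notation is an assumption for this sketch: x[p] is the authority weight of page p, y[p] its hub weight, and E the set of links (q, p) meaning "q points to p"; the normalization is taken as sum of squares equal to 1, as in Kleinberg's paper.

```latex
\begin{align*}
\text{I operation:} \quad & x[p] \leftarrow \sum_{q \,:\, (q,p) \in E} y[q] \\
\text{O operation:} \quad & y[p] \leftarrow \sum_{q \,:\, (p,q) \in E} x[q] \\
\text{Normalization:} \quad & \sum_{p} x[p]^2 = 1, \qquad \sum_{p} y[p]^2 = 1
\end{align*}
```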
Contd...
(x = authority weight, y = hub weight)
• Operation I: pages q1, q2, q3 point to page p; x[p] = sum of all y[q].
• Operation O: page p points to pages q1, q2, q3; y[p] = sum of all x[q].
Deciding when to stop the reinforcing process:
• Apply the I and O operations alternately until a fixed point is reached, or
• Choose a parameter k and run the process for exactly k iterations.
Algorithm
• Given the set of pages in the form of a graph, choose an integer parameter k, the number of iterations.
• Repeat the following process k times:
  • Apply the I operation to every page and update its authority weight.
  • Apply the O operation to every page and update its hub weight.
  • Normalize both the authority weights and the hub weights.
• Return the graph with the new authority weight and hub weight for each page.
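A minimal Python sketch of this iteration is given below. The names `iterate`, `normalize`, and `top` are chosen for the example, all weights are assumed to start at 1, and normalization is assumed to scale each weight vector so that its squares sum to 1.

```python
import math


def normalize(weights):
    """Scale the weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)  # graph without usable links: nothing to scale
    return {p: v / norm for p, v in weights.items()}


def iterate(pages, links, k):
    """Run k rounds of the I and O operations.

    pages -- iterable of page ids
    links -- iterable of (source, target) pairs meaning "source points to target"
    Returns (x, y): dicts of authority weights and hub weights.
    """
    pages = list(pages)
    x = {p: 1.0 for p in pages}  # authority weights (assumed initial value 1)
    y = {p: 1.0 for p in pages}  # hub weights (assumed initial value 1)
    links = [(s, t) for s, t in links if s in x and t in x]
    for _ in range(k):
        # I operation: authority weight = sum of hub weights of in-linking pages.
        new_x = {p: 0.0 for p in pages}
        for s, t in links:
            new_x[t] += y[s]
        # O operation: hub weight = sum of (updated) authority weights of linked pages.
        new_y = {p: 0.0 for p in pages}
        for s, t in links:
            new_y[s] += new_x[t]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def top(weights, c):
    """Return the c pages with the largest weights."""
    return sorted(weights, key=weights.get, reverse=True)[:c]
```

For example, with links h1→a, h2→a, and h1→b, `top(x, 1)` picks a as the best authority and `top(y, 1)` picks h1 as the best hub.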
Observations
• The top authorities and hubs are the pages with the c largest x values and the c largest y values in the graph returned by the Iterate algorithm.
• As k increases arbitrarily, the Iterate procedure converges to fixed points x* and y*.
  • Proved using principal eigenvectors.
• The Iterate algorithm yields a densely linked collection of pages that is rich in relevant pages.
• The most relevant collection of pages corresponds to the densest part of the graph.
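The eigenvector argument can be sketched in matrix form. The notation here is assumed for illustration: A is the adjacency matrix of the graph (A_{ij} = 1 if page i points to page j), z is the all-ones starting vector, and x_k, y_k are the weight vectors after k iterations, up to normalization.

```latex
\begin{align*}
x \leftarrow A^{T} y, \qquad & y \leftarrow A x \\
x_k \propto (A^{T}A)^{k-1} A^{T} z, \qquad & y_k \propto (A A^{T})^{k} z
\end{align*}
```

Repeated multiplication by a symmetric matrix followed by normalization is power iteration, so x* is the principal eigenvector of A^T A and y* is the principal eigenvector of A A^T, provided the largest eigenvalue is strictly larger in magnitude than the next one.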
Results
(java) Authorities
  .328  http://www.gamelan.com/  Gamelan
  .251  http://java.sun.com/  JavaSoft Home Page
  .190  http://www.digitalfocus.com/digitalfocus/faq/howdoi.html  The Java Developer: HowDoI
  .190  http://lightyear.ncsa.uiuc.edu/srp/java/javabooks.html  The Java Book
("search engines") Authorities
  .346  http://www.yahoo.com/  Yahoo!
  .291  http://www.excite.com/  Excite
  .231  http://www.lycos.com/  Lycos Home Page
  .231  http://www.altavista.digital.com/  AltaVista: Main Page
(Gates) Authorities
  .643  http://www.roadahead.com/  Bill Gates: The Road Ahead
  .458  http://www.microsoft.com/  Welcome to Microsoft
  .440  http://www.microsoft.com/corpinfo/bill-g.htm
• It was observed that www.roadahead.com was the only one of these sites present in R initially.
• This supports the algorithm, because many of the authoritative pages do not contain the query string themselves.