Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm. - Murtuza Shareef. Plan. Broad Picture of the talk Introduce Foundations (Terminology) The Problem to be solved Motivation behind HITS (Why Link Analysis) Construction of Sub-graph
Authoritative Sources in a Hyperlinked Environment • More specifically “Link Analysis” using HITS Algorithm - Murtuza Shareef
Plan • Broad Picture of the talk • Introduce Foundations (Terminology) • The Problem to be solved • Motivation behind HITS (Why Link Analysis) • Construction of Sub-graph • Some basics of matrices • Design of Algorithm (Meat of this paper) • Application of HITS • Conclusion
Broad Picture of the talk • Goal of Search Engine is to provide quality search results - Relevance • Ways to achieve this goal - Linked structure of the web • The Algorithm ranks pages based on the relationship between hubs and authorities. • What are Hubs and Authorities? - Later
You need relevance – Start filtering • [Flow diagram: User Query → Text based search engine → Pages Containing Query String → Filter → Set of ‘t’ Highest Ranked Pages (Root Set) → Heuristics → Base Set (Sub graph) → HITS Algorithm → Hubs and Authorities]
Terminology • Authority: A valuable and informative webpage, usually pointed to by a large number of hyperlinks • Hub: A webpage that points to many authority pages is itself a resource and is called a hub • Authorities and hubs reinforce one another • A good authority is pointed to by many good hubs • A good hub points to many good authorities
Problem to be solved • Relevant terms may not appear on the pages of authoritative websites. • Many prominent pages are not self-descriptive • Car manufacturers may not use the term “automobile manufacturers” on their home page. • The term “search engine” is not used by any of the natural authorities like Yahoo, Google, AltaVista etc.
Link based Analysis • Limitations of text based analysis • Text-based ranking function • E.g. could www.harvard.edu be recognized as one of the most authoritative pages for “harvard” when many other webpages contain the term more often? • Pages are not sufficiently self-descriptive • Usually the term “search engine” doesn’t appear on search engine web pages
Motivation behind HITS • The creator of page p, by including a link to page q, has in some measure conferred authority on q • Links afford us the opportunity to find potential authorities purely through the pages that point to them • What is the problem here? • Some links are just navigational “Click here to return to the main menu” • Some links are advertisements • Difficulty in finding balance between relevance and popularity • Solution: Based on relationship between the authorities for a topic and those pages that link to many related authorities - HUBS
HITS • Algorithm developed by Kleinberg in 1998. • Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. • Based on mutually recursive facts: • Hubs point to lots of authorities. • Authorities are pointed to by lots of hubs.
HITS Algorithm • Computes hubs and authorities for a particular topic specified by a normal query. • First determines a set of relevant pages for the query called the base set S. • Analyze the link structure of the web subgraph defined by S to find authority and hub pages in this set.
Construction of focused subgraph • We have a set created by a text-based search engine. • Why do we need a subset? • The set may contain too many pages and entail a considerable computational cost • Most of the best authorities may not belong to this set • Subset properties: • Relatively small • Rich in relevant pages • Contains most (or many) of the strongest authorities
Subset Construction
Subgraph(σ, E, t, d)
  σ: a query string.
  E: a text-based search engine.
  t, d: natural numbers.
  Let Rσ denote the top t results of E on σ.
  Set Sσ := Rσ
  For each page p ∈ Rσ
    Let Γ+(p) denote the set of all pages p points to.
    Let Γ-(p) denote the set of all pages pointing to p.
    Add all pages in Γ+(p) to Sσ.
    If |Γ-(p)| ≤ d then
      Add all pages in Γ-(p) to Sσ.
    Else
      Add an arbitrary set of d pages from Γ-(p) to Sσ.
  End
  Return Sσ
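The expansion step of the Subgraph procedure can be sketched in Python roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `build_base_set`, the dictionaries `out_links` and `in_links`, and the seeded random choice of the "arbitrary" d pages are all assumptions made for the example, and the root set is taken as already computed by the search engine.

```python
import random

def build_base_set(root_set, out_links, in_links, d, rng=None):
    """Expand the root set R into the base set S.

    root_set  -- top-t results of the text search engine (assumed given)
    out_links -- dict: page -> set of pages it points to   (Gamma+)
    in_links  -- dict: page -> set of pages pointing to it (Gamma-)
    d         -- cap on in-linking pages added per root page
    """
    rng = rng or random.Random(0)   # hypothetical choice: seeded for repeatability
    base = set(root_set)
    for p in root_set:
        base |= out_links.get(p, set())        # add all of Gamma+(p)
        preds = in_links.get(p, set())
        if len(preds) <= d:
            base |= preds                      # add all of Gamma-(p)
        else:
            # "arbitrary set of d pages": here, a random sample of size d
            base |= set(rng.sample(sorted(preds), d))
    return base

# Toy link data (made up for illustration).
out_links = {"a": {"x", "y"}, "b": {"y"}}
in_links = {"a": {"h1", "h2", "h3"}, "b": {"h1"}}
S = build_base_set(["a", "b"], out_links, in_links, d=2)
```

The cap d keeps a single very popular root page from flooding the base set with thousands of in-linking pages.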
For a specific query Q, let the set of documents returned by a standard search engine be called the root set R. • Initialize S to R. • Add to S all pages pointed to by any page in R. • Add to S all pages that point to any page in R.
Subgraph reduction • Offset the effect of links that serve purely a navigational function • Remove all intrinsic edges (links between pages with the same domain name) from the graph, keeping only the edges corresponding to transverse links (links between pages with different domain names) • Allow at most m pages (m ≈ 4-8) from a single domain to point to any given page
Handling “spam” links Should all links be equally treated? Two considerations: • Some links may be more meaningful/important than other links. • Web site creators may trick the system to make their pages more authoritative by adding dummy pages pointing to their cover pages (spamming).
Handling “spam” links (contd) • Transverse link: links between pages with different domain names. • Domain name: the first level of the URL of a page. • Intrinsic link: links between pages with the same domain name. Transverse links are more important than intrinsic links. Two ways to incorporate this: • Use only transverse links and discard intrinsic links. • Give lower weights to intrinsic links.
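One way to separate transverse from intrinsic links is sketched below. It assumes that the "domain name" of a page can be approximated by the host part of its URL, which is a simplification chosen for this example; the helper names `domain`, `is_transverse` and `keep_transverse` are hypothetical.

```python
from urllib.parse import urlparse

def domain(url):
    """Host part of a URL, lower-cased (assumed stand-in for the domain name)."""
    return (urlparse(url).hostname or "").lower()

def is_transverse(src, dst):
    """True if the link src -> dst crosses two different domains."""
    return domain(src) != domain(dst)

def keep_transverse(edges):
    """Drop intrinsic links, keeping only transverse ones."""
    return [(s, t) for s, t in edges if is_transverse(s, t)]

# Made-up example links.
edges = [
    ("http://a.example/index.html", "http://a.example/menu.html"),  # intrinsic
    ("http://a.example/index.html", "http://b.example/"),           # transverse
]
kept = keep_transverse(edges)
```

This corresponds to the first option above (discarding intrinsic links entirely); the weighting option keeps them with a reduced weight instead.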
Handling “spam” links (contd) How to give lower weights to intrinsic links? In adjacency matrix A, entry (p, q) should be assigned as follows: • If p has a transverse link to q, the entry is 1. • If p has an intrinsic link to q, the entry is c, where 0 < c < 1. • If p has no link to q, the entry is 0.
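The weighting rule can be sketched as a small matrix builder. The names `weighted_adjacency` and `same_domain` and the default c = 0.5 are assumptions for illustration; the slides only require 0 < c < 1.

```python
def weighted_adjacency(pages, links, same_domain, c=0.5):
    """Build A with 1 for transverse links, c for intrinsic links, 0 otherwise.

    pages       -- list of page identifiers (fixes the row/column order)
    links       -- iterable of (p, q) pairs meaning "p links to q"
    same_domain -- predicate (p, q) -> bool, True for intrinsic links
    c           -- weight of an intrinsic link (hypothetical default)
    """
    if not 0 < c < 1:
        raise ValueError("c must satisfy 0 < c < 1")
    index = {p: i for i, p in enumerate(pages)}
    A = [[0.0] * len(pages) for _ in pages]
    for p, q in links:
        A[index[p]][index[q]] = c if same_domain(p, q) else 1.0
    return A

# Toy example: the first character of each page id plays the role of its domain.
pages = ["a1", "a2", "b1"]
links = [("a1", "a2"), ("a1", "b1")]
A = weighted_adjacency(pages, links, lambda p, q: p[0] == q[0])
```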
Basics of matrices • The adjacency matrix of a directed graph G is the matrix A such that $A_{ij} = 1$ if $(i, j) \in E(G)$ and $A_{ij} = 0$ if $(i, j) \notin E(G)$. • An eigenvalue of A is a scalar $\lambda$ with the property that there exists a non-zero vector x such that $Ax = \lambda x$. The vector x is called an eigenvector of A. • The normalized eigenvector corresponding to the largest eigenvalue is called the principal eigenvector. • If M is a symmetric n x n matrix and v is a vector not orthogonal to the principal eigenvector $\omega_1(M)$, then the unit vector in the direction of $M^k v$ converges to $\omega_1(M)$ as k increases without bound.
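The last fact is the power-iteration idea, and a small numeric check may make it concrete. The sketch below uses plain Python lists; the function names and the 2 x 2 example matrix are made up for illustration.

```python
import math

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def power_iteration(M, v, k=50):
    """Return the unit vector in the direction of M^k v."""
    for _ in range(k):
        v = normalize(mat_vec(M, v))
    return v

# Symmetric example with eigenvalues 3 and 1; the principal
# eigenvector is (1, 1) / sqrt(2).
M = [[2.0, 1.0], [1.0, 2.0]]
v = power_iteration(M, [1.0, 0.0])   # start vector is not orthogonal to (1, 1)
```

Normalizing at every step does not change the direction of the result; it only keeps the numbers from growing without bound.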
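Iterative Algorithm • Each page p is assigned two non-negative weights, an authority weight $x^{\langle p \rangle}$ and a hub weight $y^{\langle p \rangle}$. • Update the weights x and y: • Authority weight (I operation): $x^{\langle p \rangle} \leftarrow \sum_{q \to p} y^{\langle q \rangle}$, where the sum runs over all pages q that link to p • Hub weight (O operation): $y^{\langle p \rangle} \leftarrow \sum_{p \to q} x^{\langle q \rangle}$, where the sum runs over all pages q that p links to • These operations add the weights of hubs into the authority weight and add the authority weights into the hub weight, respectively. Alternating these two operations eventually results in an equilibrium value, or weight, for each page.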
Iterative Algorithm • The algorithm states: for each iteration, apply the I and O operations and normalize the authority and hub scores.
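The top c authorities and top c hubs may be found using a simple procedure: run the iteration for k rounds, then report the c pages with the largest authority weights as authorities and the c pages with the largest hub weights as hubs.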
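The iteration and the top-c selection can be sketched together in a few lines of Python. This is an illustrative sketch under assumptions, not the paper's code: the function names, the all-ones starting weights, the sum-of-squares normalization and the default k = 20 are choices made for the example.

```python
import math

def _normalize(w):
    """Scale weights so their squares sum to 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {p: v / norm for p, v in w.items()}

def hits(pages, edges, k=20):
    """Run k rounds of the I and O operations; return (authority, hub) weights."""
    x = {p: 1.0 for p in pages}   # authority weights
    y = {p: 1.0 for p in pages}   # hub weights
    for _ in range(k):
        # I operation: authority weight = sum of hub weights of in-linking pages
        new_x = {p: 0.0 for p in pages}
        for p, q in edges:
            new_x[q] += y[p]
        # O operation: hub weight = sum of the updated authority weights
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = _normalize(new_x), _normalize(new_y)
    return x, y

def top_c(weights, c):
    """Pages with the c largest weights, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:c]

# Toy graph: h1 links to a1 and a2, h2 links to a1 only.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = hits(pages, edges)
best_authorities = top_c(x, 2)
best_hubs = top_c(y, 2)
```

In this toy graph a1 comes out as the strongest authority because both hubs point to it, and h1 as the strongest hub because it points to both authorities.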
Convergence • The iterative algorithm converges as k increases. That is, the weight vectors converge. • Let G = (V, E) with V = {p1, p2, …, pn}. Let A be the adjacency matrix of G. • Let $x_i$ be the vector of authority scores after i iterations. Let $y_i$ be the vector of hub scores after i iterations. • The I and O operations can be written as: Operation I: $x_i = A^T y_{i-1}$ • Operation O: $y_i = A x_i$ (each followed by normalization)
From the basics of matrices, the vectors $x_k$ and $y_k$ converge to $x^*$ and $y^*$ respectively, where $x^*$ and $y^*$ are the principal eigenvectors of $A^T A$ and $A A^T$ • Kleinberg reports that in practice about 20 iterations are sufficient for the top-ranked pages to become stable • The principal eigenvector represents the densest cluster in the focused subgraph • The non-principal eigenvectors represent less dense areas in the subgraph
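The step from the I and O operations to these eigenvectors can be written out by unrolling the iteration. In the sketch below, z denotes the all-ones starting vector for the hub weights, which is an assumption taken from Kleinberg's paper rather than from the slides, and normalization is left out because it only rescales the vectors.

```latex
\begin{align*}
x_1 &= A^T z,                     & y_1 &= A x_1 = (A A^T) z, \\
x_2 &= A^T y_1 = (A^T A) A^T z,   & y_2 &= A x_2 = (A A^T)^2 z, \\
    &\;\;\vdots                   &     &\;\;\vdots \\
x_k &= (A^T A)^{k-1} A^T z,       & y_k &= (A A^T)^k z.
\end{align*}
```

Both $A^T A$ and $A A^T$ are symmetric, so the fact about symmetric matrices applies with $M = A^T A$, $v = A^T z$ for the authority weights and with $M = A A^T$, $v = z$ for the hub weights, provided these starting vectors are not orthogonal to the respective principal eigenvectors.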
Application - Finding Similar Pages Using Link Structure • Given a page, P, let R (the root set) be t (e.g. 200) pages that point to P. • Grow a base set S from R. • Run HITS on S. • Return the best authorities in S as the best similar-pages for P. • Finds authorities in the “link neighbor-hood” of P as its similar pages.
Application - HITS for Clustering • An ambiguous query can result in the principal eigenvector only covering one of the possible meanings. • Non-principal eigenvectors may contain hubs & authorities for other meanings. • Example: “jaguar”: • Atari video game (principal eigenvector) • NFL Football team (2nd non-principal eigenvector) • Automobile (3rd non-principal eigenvector) • This is clustering!
Multiple sets of Hubs and Authorities • Why? • The query string may have several very different meanings. E.g. “java” • The string may arise as a term in the context of multiple technical communities. E.g. “randomized algorithms” • The string may refer to a highly polarized issue, involving groups that are not likely to link to one another. E.g. “abortion” • Idea: • The non-principal eigenvectors of $A^T A$ and $A A^T$ provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set S.
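To illustrate how a non-principal eigenvector can surface a second community, the sketch below runs power iteration on $A^T A$, removes the principal component by deflation, and runs it again. Deflation is one standard technique picked for this example; the slides do not say which eigenvector method is used, and the toy graph and all function names are made up.

```python
import math

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenpair(M, k=200):
    """Power iteration from a generic positive start vector."""
    v = unit([1.0 + 0.1 * i for i in range(len(M))])
    for _ in range(k):
        v = unit(mat_vec(M, v))
    lam = sum(a * b for a, b in zip(v, mat_vec(M, v)))  # Rayleigh quotient
    return lam, v

def deflate(M, lam, v):
    """Subtract lam * v v^T so the next power iteration finds the next eigenvector."""
    n = len(M)
    return [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]

def authority_matrix(A):
    """Compute A^T A for a square adjacency matrix given as nested lists."""
    n = len(A)
    return [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]

def strongest(v):
    """Index of the coordinate with the largest absolute value."""
    return max(range(len(v)), key=lambda i: abs(v[i]))

# Toy graph with two unlinked communities:
# pages 0, 1, 2 all link to page 3; pages 4, 5 both link to page 6.
n = 7
A = [[0.0] * n for _ in range(n)]
for hub, auth in [(0, 3), (1, 3), (2, 3), (4, 6), (5, 6)]:
    A[hub][auth] = 1.0

M = authority_matrix(A)
lam1, v1 = top_eigenpair(M)                      # denser community
lam2, v2 = top_eigenpair(deflate(M, lam1, v1))   # next community
```

Here the principal eigenvector concentrates on page 3, the authority of the denser community, and the next eigenvector on page 6, mirroring how separate meanings of an ambiguous query can land on different eigenvectors.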
Multiple sets of Hubs and Authorities: Experimental result 1 • For the query “jaguar”, the strongest collections of authoritative sources concerned the Atari Jaguar product, the NFL football team from Jacksonville, and the automobile.
Multiple sets of Hubs and Authorities: Experimental result 2 • For the query “randomized algorithms”, none of the strongest collections of hubs and authorities are precisely on the query topic. They include home pages of theoretical computer scientists, compendia of mathematical software and pages on wavelets.
Conclusion • A technique for locating high-quality information related to a broad search topic on the www, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic. • Related work. • Standing, influence in social networks, scientific citations • Hypertext and WWW rankings