Authoritative Sources in a Hyperlinked Environment (HITS)
Paper by: Jon M. Kleinberg
Presented by: Anirudh Ranganath, ranganaa@usc.edu, http://anirudh.co.in
CSCI 572, Spring 2013, Prof. Chris Mattman, USC

HITS
• Extracts information from the link structure of a hyperlinked environment; used to rank pages
• Often described as a precursor to PageRank, although both were released in the same year
• Similarity to PageRank: iterative and based on link structure
• Where HITS differs:
  1) Query dependent
  2) Two scores per page: hub and authority
  3) Runs on a small subset of relevant documents, chosen per query

HITS in a nutshell
• Topic covered in class at a high level, along with PageRank
• Basic idea
  • Create a focused subgraph of the web
  • Compute hub and authority scores
  • Filter out the top hubs and authorities
• Extended ideas
  • Similar-page queries
  • Non-principal eigenvectors

Common Problems
• General difficulties of searching the WWW:
  • Complexity and scale (increasing)
  • Unorganized content
  • Subjectivity of quality and relevance: lack of objective functions
  • No purely endogenous measure of authority
• Query structure:
  • Specific queries: “Does Netscape support the JDK 1.1 code-signing API?”
    • Scarcity Problem: very few pages contain the required information, and it is difficult to determine the identity of these pages
  • Broad-topic queries: “Find information about the Java programming language.”
    • Abundance Problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest
  • Similar-page queries: “Find pages similar to java.sun.com.”

Analysis of the Link Structure
• Links encode a considerable amount of latent human judgement
• The author of page p, by linking to page q, confers authority on q
• (Figure: page p links to page q. Image credit: kolberg.co.uk)
• Pitfalls:
  • Links can be created for reasons that have nothing to do with conferring authority, for example navigation
  • It is difficult to find an appropriate balance between the criteria of relevance and popularity, each of which contributes to our intuitive notion of authority (e.g. popular directories)

Constructing a Focused Subgraph of the WWW
• Start with a root set generated using text search (t = 200 pages)
• Expanding the root set into a base set: for each page in the root set, add the pages it links to and the pages that link to it
• Problem: some pages have too many pages pointing to them
  Solution: add at most d (= 50) of the pages that link to the page under consideration
• Problem: many links exist purely for navigation
  Solution: remove intrinsic links (between pages in the same domain) and retain transverse links (between pages in different domains)
• Problem: some links confer advertisement or endorsement rather than authority
  Solution: use a parameter m (= 4 to 8) that limits the number of links to a page from within a single domain (this was not employed in Kleinberg’s experiments)
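
The expansion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `build_base_set`, `transverse_edges` and `domain` are hypothetical, the link maps are plain dictionaries rather than a real crawler or search API, and when a page has more than d in-linking pages the sketch keeps the first d in sorted order just to stay deterministic.

```python
from typing import Dict, List, Set, Tuple

Links = Dict[str, Set[str]]


def domain(url: str) -> str:
    """Hypothetical helper: everything before the first '/' counts as the domain."""
    return url.split("/", 1)[0]


def build_base_set(root: List[str], out_links: Links, in_links: Links,
                   d: int = 50) -> Set[str]:
    """Expand the root set with out-neighbours and at most d in-neighbours per page."""
    base = set(root)
    for page in root:
        # Add every page that this root page links to.
        base |= out_links.get(page, set())
        # Add at most d pages that link to this root page
        # (assumption: pick the first d in sorted order).
        base |= set(sorted(in_links.get(page, set()))[:d])
    return base


def transverse_edges(base: Set[str], out_links: Links) -> Set[Tuple[str, str]]:
    """Keep only links inside the base set that cross domains (drop intrinsic links)."""
    return {
        (p, q)
        for p in base
        for q in out_links.get(p, set())
        if q in base and domain(p) != domain(q)
    }


# Toy link data, invented for illustration.
out_links = {
    "a.com/1": {"a.com/2", "b.com/1"},
    "c.com/1": {"a.com/1"},
    "d.com/1": {"a.com/1"},
    "e.com/1": {"a.com/1"},
}
in_links = {
    "a.com/1": {"c.com/1", "d.com/1", "e.com/1"},
    "a.com/2": {"a.com/1"},
    "b.com/1": {"a.com/1"},
}

base = build_base_set(["a.com/1"], out_links, in_links, d=2)
edges = transverse_edges(base, out_links)
```

With d = 2 the page e.com/1 is left out of the base set, and the same-domain link from a.com/1 to a.com/2 is dropped as intrinsic.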

Computing Hubs and Authorities
• Problem: universally popular pages with large in-degree
  Solution: authorities must have considerable overlap in the sets of pages (hubs) that point to them
• Hub pages: pages that link to multiple relevant authoritative pages
• It is these hub pages that “pull together” authorities on a common topic and allow us to throw out unrelated pages of large in-degree
• An iterative algorithm:
  • Each page p has a non-negative authority weight x(p) and a non-negative hub weight y(p). Initialize all of these to 1.
  • If p points to many pages with large x-values, it should receive a large y-value. If many pages with large y-values point to p, it should receive a large x-value.
  • The I operation updates the x-weights: $x(p) \leftarrow \sum_{q : (q,p) \in E} y(q)$, where E is the set of links in the focused subgraph
  • The O operation updates the y-weights: $y(p) \leftarrow \sum_{q : (p,q) \in E} x(q)$
  • Thus I and O are the basic means by which hubs and authorities reinforce one another
  • Normalize so that the sum of squares is 1 (separately for the x-weights and the y-weights)
  • Apply the I and O operations alternately until convergence (usually about 20 iterations)
  • Take the pages with the top authority scores as the top results
• Result discussion: comparison with AltaVista
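
A minimal sketch of the I and O operations, assuming the focused subgraph is given as a list of (source, target) link pairs. The function names `hits` and `normalize` and the toy page names are made up for illustration. Each pass applies I, then O on the fresh x-weights, then normalizes both weight vectors.

```python
import math
from typing import Dict, List, Tuple


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so that their squares sum to 1 (no-op for all zeros)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: v / norm for page, v in weights.items()}


def hits(edges: List[Tuple[str, str]],
         iterations: int = 20) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (authority weights x, hub weights y) after a fixed number of passes."""
    pages = sorted({page for edge in edges for page in edge})
    x = {page: 1.0 for page in pages}  # authority weights
    y = {page: 1.0 for page in pages}  # hub weights
    for _ in range(iterations):
        # I operation: x(p) is the sum of y(q) over all pages q that link to p.
        new_x = {page: 0.0 for page in pages}
        for source, target in edges:
            new_x[target] += y[source]
        # O operation: y(p) is the sum of the new x(q) over all pages q that p links to.
        new_y = {page: 0.0 for page in pages}
        for source, target in edges:
            new_y[source] += new_x[target]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Toy graph: three hub-like pages pointing at two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
x, y = hits(edges)
top_authorities = sorted(x, key=x.get, reverse=True)[:2]
```

In this toy graph a2 ends up with the largest authority weight because all three hubs point to it, and h1 and h2 outrank h3 as hubs because they point to both authorities.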

Similar-Page Queries
• How do we find pages similar to a given page p?
• Root set initialization:
  • Normal search with query string σ: “Find t pages containing the string σ”
  • Similar-page search: “Find t pages pointing to p”
• The rest of the algorithm is the same!
• Problem: if p is a highly referenced page, there is an abundance problem
  Solution: restrict to the local region around the page in question and find the strongest authorities
• Result discussion: the localized algorithm gives better results than ranking by in-degree alone
• Con: the results are biased towards competitive rather than co-operative relationships
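
The only difference between the two query types is how the root set is seeded, which a short sketch can make concrete. Everything here is an assumption for illustration: the function names are hypothetical, the "text search" is a plain case-insensitive substring match over an in-memory dictionary, and ties are broken by sorted page name.

```python
from typing import Dict, List, Set


def root_set_for_query(sigma: str, page_text: Dict[str, str], t: int = 200) -> List[str]:
    """Normal search: up to t pages whose text contains the string sigma."""
    matches = [url for url, text in sorted(page_text.items())
               if sigma.lower() in text.lower()]
    return matches[:t]


def root_set_for_similar_page(p: str, in_links: Dict[str, Set[str]],
                              t: int = 200) -> List[str]:
    """Similar-page search: up to t pages that point to p."""
    return sorted(in_links.get(p, set()))[:t]


# Toy data, invented for illustration.
page_text = {"p1": "Java tutorial", "p2": "coffee brewing", "p3": "java news"}
in_links = {"java.sun.com": {"p3", "p1", "p9"}}

query_root = root_set_for_query("java", page_text, t=2)
similar_root = root_set_for_similar_page("java.sun.com", in_links, t=2)
```

Either root set can then be expanded into a base set and scored exactly as for an ordinary query.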

Connections with Related Work
• Standing, Impact and Influence
  • Social networks: standing
  • Scientific citations and bibliometrics: impact
• Hypertext and WWW Rankings
  • Index node: out-degree is higher than the average out-degree
  • Reference node: in-degree is higher than the average in-degree
• Other Link-Based Approaches to WWW Search
• Clustering of Link Structures

Multiple sets of hubs and authorities
• Problem:
  • The query string may have several very different meanings: “jaguar”
  • The string may arise as a term in the context of multiple technical communities: “randomized algorithms”
  • The string may refer to a highly polarized issue, involving groups that are not likely to link to one another: “abortion”
• Clusters bring in abundance problems
• Scatter/Gather

Related Math (contd)
• Non-principal eigenvectors:
  • Each eigenvector corresponds to a densely linked collection of hubs and authorities in the subgraph
  • The principal eigenvector is what the iterative algorithm (the I and O operations) computes
  • However, it may not contain all the information the search needs
  • Example: “jaguar” produces several strong eigenvectors because the word has different meanings:
    • Jaguar, the car
    • Jaguar, the cat
    • Atari Jaguar
    • The Jacksonville Jaguars NFL team
  • The initial pages decide which of these is returned as the principal eigenvector
  • Hence we may miss relevant eigenvectors if we don’t know the context of the search (research topic)
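
One way to see how a second community hides behind the principal eigenvector is a small pure-Python sketch. It uses power iteration with deflation on the matrix A^T A, where A is the adjacency matrix of the subgraph. Deflation is a technique swapped in here for illustration, not something the slides prescribe, and the toy graph and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def unit(v: Vector) -> Vector:
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def authority_matrix(adj: Matrix) -> Matrix:
    """A^T A: entry (i, j) counts the pages that link to both i and j."""
    n = len(adj)
    return [[sum(adj[k][i] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def power_iteration(m: Matrix, iterations: int = 100) -> Tuple[float, Vector]:
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    v = unit([1.0] * len(m))
    for _ in range(iterations):
        w = matvec(m, v)
        if not any(w):
            break
        v = unit(w)
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v


def deflate(m: Matrix, eigenvalue: float, v: Vector) -> Matrix:
    """Remove the found eigenpair so the next power iteration finds the next one."""
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]


# Toy graph with two unconnected communities (indices are page ids):
# hubs 0, 1, 2 link to authorities 3 and 4; hubs 5, 6 link to authority 7.
n = 8
adj = [[0.0] * n for _ in range(n)]
for hub in (0, 1, 2):
    adj[hub][3] = adj[hub][4] = 1.0
for hub in (5, 6):
    adj[hub][7] = 1.0

m = authority_matrix(adj)
lam1, v1 = power_iteration(m)                     # larger community: pages 3 and 4
lam2, v2 = power_iteration(deflate(m, lam1, v1))  # smaller community: page 7
```

The plain iteration only ever reports the larger community (pages 3 and 4). The smaller community around page 7 shows up only in the second eigenvector, which mirrors the “jaguar” situation described above.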
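
Related Math: Matrices
• A: adjacency matrix of the focused subgraph, with $A_{ij} = 1$ if page i links to page j and 0 otherwise
• Update operations: $x \leftarrow A^T y$ (I operation) and $y \leftarrow A x$ (O operation)
• Initial weights of the nodes: $z = (1, 1, \ldots, 1)$
• After k steps (up to normalization): $x_k \propto (A^T A)^{k-1} A^T z$ and $y_k \propto (A A^T)^k z$
• Theorems proved:
  1) The iteration converges
  2) The limits are the principal eigenvectors: of $A^T A$ for the authority weights and of $A A^T$ for the hub weights
• Image credits: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html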
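
As a sanity check on the matrix view, the sketch below compares k rounds of the update operations with the closed forms from Kleinberg's paper, namely that x after k steps is proportional to (A^T A)^(k-1) A^T z and y is proportional to (A A^T)^k z, where z is the all-ones vector. The helper names and the toy adjacency matrix are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def unit(v: Vector) -> Vector:
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def iterate(adj: Matrix, k: int) -> Tuple[Vector, Vector]:
    """k rounds of x <- A^T y followed by y <- A x, normalizing after each update."""
    adj_t = transpose(adj)
    x = [1.0] * len(adj)
    y = [1.0] * len(adj)
    for _ in range(k):
        x = unit(matvec(adj_t, y))
        y = unit(matvec(adj, x))
    return x, y


def closed_form(adj: Matrix, k: int) -> Tuple[Vector, Vector]:
    """Normalized (A^T A)^(k-1) A^T z and (A A^T)^k z for z = all ones, k >= 1."""
    adj_t = transpose(adj)
    z = [1.0] * len(adj)
    x = matvec(adj_t, z)
    for _ in range(k - 1):
        x = matvec(adj_t, matvec(adj, x))  # multiply by A^T A
    y = z
    for _ in range(k):
        y = matvec(adj, matvec(adj_t, y))  # multiply by A A^T
    return unit(x), unit(y)


# Toy graph: pages 0, 1, 2 act as hubs; pages 3, 4 act as authorities.
adj = [
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

x_iter, y_iter = iterate(adj, 5)
x_closed, y_closed = closed_form(adj, 5)
```

The two results agree up to floating-point error, which is why the iteration can be read as power iteration on A^T A and A A^T.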

Diffusion and Generalization
• Diffusion: authoritative pages corresponding to competing, “broader” topics win out over the pages relevant to the query and are returned by the algorithm
• Pro: the broader topic that supplants the original, too-specific query very often represents a natural generalization of it, which provides a simple way of abstracting a specific query topic to a broader, related one

Evaluation
• Evaluating the measure “authority” is inherently based on human judgment
• Testing with more heuristics
• Anchor text to weight individual links differentially
• CLEVER vs Yahoo vs AltaVista: user survey
• Subjective definition of relevance and quality (again!)

Conclusion
• Going beyond relevance and clustering: building authoritative sources
• The quality of results is judged in the context of what is available on the WWW globally
• A global notion of structure is inferred without directly maintaining an index of the WWW or its link structure
• Starting from the goal of discovering authoritative pages, a more complex pattern of social organization on the WWW is discovered, in which hub pages link densely to a set of thematically related authorities
• Examples of extensions

Opinion
• Good results, but “topic drift” occurs:
  • The hub weights of some sites, such as yahoo.com or eBay.com, cause irrelevant clusters to be identified as major eigenvectors
• Kleinberg’s algorithm uses only the top authority scores, but there may be useful pages that rank strongly as hubs
• No way to incorporate user feedback
• Using SVM to generate appropriate results: Greg Nilsen’s experiment, University of Pittsburgh