Authoritative Sources in a Hyperlinked Environment Presented by: Lokesh Chikkakempanna
Agenda • Introduction • Central Issue • Queries • Constructing a focused subgraph • Computing hubs and authorities • Extracting authorities and hubs • Similar page queries • Conclusion
Introduction • The process of discovering pages that are relevant to a particular query. • A hyperlinked environment can be a rich source of information. • We analyze the link structure of the WWW environment. • The WWW is a hypertext corpus of enormous complexity, and it continues to expand at a very fast rate. • High-level structure can only emerge through analysis of the WWW environment as a whole.
Central Issue • Distillation of broad search topics through the discovery of "authoritative" information sources. • Link analysis for discovering "authoritative" pages. • Improving the quality of search methods on the WWW is a rich and interesting problem, because a solution must be efficient in both computation and storage. • What does a typical search tool compute in the extra time it takes to produce results of greater value to the user? • There is no concretely defined objective function that corresponds to human notions of quality.
Queries • Types of queries: • Specific queries: lead to the scarcity problem. • Broad-topic queries: lead to the abundance problem. • For broad topics, the task is to filter a huge set of relevant pages down to a small set of the most "authoritative" or "definitive" ones.
Problems in identifying authorities • Example: "harvard" • There are over a million pages on the web that use the term "harvard". • Recall "TF" (term frequency): www.harvard.edu is not the page that uses the term most often. • How do we circumvent this problem?
Link analysis • Human judgement is needed to formulate the notion of authority. • If a person includes a link to page q in page p, he has, in some measure, conferred authority on q. • What are the problems with this?
Links may be created for various reasons, for example: • For navigational purposes. • As paid advertisements. • Someone may create a bot that keeps adding links to pages. • Solution?
Link-based model for the Conferral of Authority • Identifies relevant, authoritative WWW pages for broad search topics. • Based on the relationship between authorities and hubs. • Exploits the equilibrium between authorities and hubs to develop an algorithm that identifies both types of pages simultaneously.
The algorithm operates on a focused subgraph produced with the help of a text-based search engine (e.g., AltaVista). • It produces a small collection of pages likely to contain the most authoritative pages for a given topic.
Constructing a focused subgraph of the WWW • We can view any collection V of hyperlinked pages as a directed graph G = (V, E). • The nodes correspond to the pages. • An edge (p, q) indicates the presence of a link from p to q. • We construct a subgraph of the WWW on which the algorithm operates, as sketched below.
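A minimal Python sketch (not from the paper) of this graph view, with a made-up set of pages and edges; pages are nodes and each hyperlink is a directed edge (p, q):

# Pages as nodes, hyperlinks as directed edges (p, q): page p links to page q.
pages = {"p1", "p2", "p3"}
links = {("p1", "p2"), ("p1", "p3"), ("p2", "p3")}

# Out-links and in-links derived from the edge set.
out_links = {p: {q for (a, q) in links if a == p} for p in pages}
in_links = {p: {a for (a, q) in links if q == p} for p in pages}

print(out_links["p1"])  # {'p2', 'p3'}
print(in_links["p3"])   # {'p1', 'p2'}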
The goal is to focus the computational effort on relevant pages, by constructing a collection S(sigma) such that: • (i) S(sigma) is relatively small. • (ii) S(sigma) is rich in relevant pages. • (iii) S(sigma) contains most (or many) of the strongest authorities. • How do we find such a collection of pages?
Take the t highest-ranked pages for the query sigma from a text-based search engine. • These t pages are referred to as the root set R(sigma). • The root set satisfies conditions (i) and (ii). • It is far from satisfying (iii). Why?
There are often extremely few links between pages in R(sigma), rendering it essentially structureless. • Example: the root set for the query "java" contained only 15 links between pages in different domains, out of a possible 200 × 199 links (t = 200).
We can use the root set R(sigma) to produce a set S(sigma) that satisfies all three conditions. • A strong authority may not be in R(sigma), but it is likely to be pointed to by at least one page in R(sigma). • Subgraph(sigma, E, t, d) • sigma: a query string; E: a text-based search engine; t and d: natural numbers.
S(sigma) is obtained by growing R(sigma) to include any page pointed to by a page in R(sigma) and any page that points to a page in R(sigma). • A single page in R(sigma) brings at most d pages into S(sigma), as in the sketch below. • Does this S(sigma) contain authorities?
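A rough Python sketch of this growth step, assuming the root set and link structure are already available as dictionaries (grow_root_set, out_links, and in_links are illustrative names, not from the paper):

def grow_root_set(root_set, out_links, in_links, d):
    # Grow the root set R(sigma) into the base set S(sigma):
    # add every page a root page points to, and at most d pages
    # pointing to each root page.
    base_set = set(root_set)
    for page in root_set:
        base_set |= out_links.get(page, set())        # pages pointed to by page
        pointers = sorted(in_links.get(page, set()))  # pages pointing to page
        base_set |= set(pointers[:d])                 # bring in at most d of them
    return base_set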
Heuristics to reduce S(sigma) • Two types of links: • Transverse: between pages with different domain names. • Intrinsic: between pages with the same domain name. • Remove all the intrinsic links to get a graph G(sigma)
A large number of pages from a single domain may all point to the same page p, often because of advertisements or other collusive arrangements. • Allow only m ≈ 4-8 pages from a single domain to point to any given page p. • G(sigma) now contains many relevant pages and strong authorities; a sketch of both heuristics follows.
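A possible Python sketch of the two heuristics, under the crude assumption that a page's "domain" is the host part of its URL (prune_links and the default m = 6 are illustrative choices, not the paper's exact procedure):

from collections import defaultdict
from urllib.parse import urlparse

def domain(url):
    # Crude stand-in for "domain name": the host part of the URL.
    return urlparse(url).netloc

def prune_links(base_set, links, m=6):
    # 1. Keep only transverse links (different domains) between pages in S(sigma).
    # 2. Allow at most m pages from one domain to point to any given page.
    transverse = [(p, q) for (p, q) in links
                  if p in base_set and q in base_set and domain(p) != domain(q)]
    kept, per_target_domain = [], defaultdict(int)
    for p, q in transverse:
        key = (q, domain(p))                 # (target page, source domain)
        if per_target_domain[key] < m:
            per_target_domain[key] += 1
            kept.append((p, q))
    return kept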
Computing hubs and authorities • Extracting authorities based on maximum in-degree alone does not work. • Example: for the query "java", the largest in-degree pages included www.gamelan.com and java.sun.com, together with advertising pages and the home page of Amazon. • While the first two are good answers, the others are not relevant.
Authoritative pages relevant to the initial query should not only have large in-degree; since they are all authorities on a common topic, there should also be considerable overlap in the sets of pages that point to them. • Thus, in addition to authorities, we should find what are called hub pages. • Hub pages: pages that have links to multiple relevant authoritative pages.
Hub pages allow us to throw away unrelated pages with high in-degree. • Mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. • This circularity must be resolved in order to identify hubs and authorities. • How?
An iterative algorithm • Maintains and updates numerical weights for each page. • Each page p is associated with a non-negative authority weight x^p and a non-negative hub weight y^p. • Each type of weight is normalized so that its squares sum to 1. • Pages with larger x and y values are considered better authorities and hubs, respectively. • Two operations (I and O) update the weights.
The set of authority weights is represented as a vector x with a coordinate for each page in G(sigma); similarly, the set of hub weights is represented as a vector y.
I operation: x^p := sum of y^q over all pages q such that (q, p) is in E.
O operation: y^p := sum of x^q over all pages q such that (p, q) is in E.

Iterate(G, k)
G: a collection of n linked pages
k: a natural number
Let z denote the vector (1, 1, ..., 1) in R^n.
Set x_0 := z. Set y_0 := z.
For i = 1, 2, ..., k:
  Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x_i'.
  Apply the O operation to (x_i', y_{i-1}), obtaining new y-weights y_i'.
  Normalize x_i', obtaining x_i.
  Normalize y_i', obtaining y_i.
End
Return (x_k, y_k).
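A small Python sketch of Iterate following the update rules above; dictionaries stand in for the weight vectors, and the function name and data layout are illustrative assumptions:

import math

def iterate(pages, links, k):
    # x: authority weights, y: hub weights, both initialized to 1.
    x = {p: 1.0 for p in pages}
    y = {p: 1.0 for p in pages}
    for _ in range(k):
        # I operation: new authority weight of p sums hub weights of pages linking to p.
        new_x = {p: sum(y[q] for (q, r) in links if r == p and q in y) for p in pages}
        # O operation: new hub weight of p sums the just-updated authority weights of pages p links to.
        new_y = {p: sum(new_x[r] for (q, r) in links if q == p and r in new_x) for p in pages}
        # Normalize so that the squares of each type of weight sum to 1.
        x_norm = math.sqrt(sum(v * v for v in new_x.values())) or 1.0
        y_norm = math.sqrt(sum(v * v for v in new_y.values())) or 1.0
        x = {p: v / x_norm for p, v in new_x.items()}
        y = {p: v / y_norm for p, v in new_y.items()}
    return x, y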
Filter out the top c authorities and top c hubs.

Filter(G, k, c)
G: a collection of n linked pages
k, c: natural numbers
(x_k, y_k) := Iterate(G, k).
Report the pages with the c largest coordinates in x_k as authorities.
Report the pages with the c largest coordinates in y_k as hubs.
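A matching sketch of Filter on top of the iterate function above (again an illustration, not the paper's code):

def filter_top(pages, links, k=20, c=5):
    # Run Iterate, then report the c largest authority and hub scores.
    x, y = iterate(pages, links, k)
    authorities = sorted(x, key=x.get, reverse=True)[:c]
    hubs = sorted(y, key=y.get, reverse=True)[:c]
    return authorities, hubs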
Filter is applied with G set equal to G(sigma) and c ≈ 5-10. • With arbitrarily large values of k, the sequences of vectors {x_k} and {y_k} converge to fixed points x* and y*. • What is R^n in the Iterate algorithm? It is the space of weight vectors, one coordinate per page, and the fixed points lie in the eigenspace associated with an eigenvalue λ. • λ is an eigenvalue of an n × n matrix M if Mω = λω for some non-zero vector ω.
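The eigenvector connection can be checked numerically: with A as the adjacency matrix of G(sigma), the x-weights evolve under A^T A and the y-weights under A A^T, so x* lines up with the principal eigenvector of M = A^T A. A small NumPy sketch with a made-up 3-page graph (an illustration, assuming NumPy is available):

import numpy as np

# A[i, j] = 1 iff page i links to page j (made-up 3-page example).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

x = np.ones(3)
y = np.ones(3)
for _ in range(50):                       # the Iterate loop in matrix form
    x = A.T @ y; x /= np.linalg.norm(x)   # I operation + normalization
    y = A @ x;  y /= np.linalg.norm(y)    # O operation + normalization

# Compare x with the principal eigenvector of M = A^T A.
w, V = np.linalg.eigh(A.T @ A)
principal = np.abs(V[:, np.argmax(w)])
print(np.allclose(x, principal, atol=1e-6))  # expected: True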
Similar-Page Queries • The algorithm discussed can be applied to another type of problem: using the link structure to infer a notion of "similarity" among pages. • We begin with a page p and pose the request "Find t pages pointing to p"; these pages form the root set, and the rest of the pipeline is unchanged, as sketched below.
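A hedged sketch of how a similar-page query might plug into the same pipeline, reusing the hypothetical helpers sketched earlier (grow_root_set, prune_links, filter_top); the in-link lookup stands in for the request "Find t pages pointing to p":

def similar_pages(p, in_links, out_links, links, t=200, d=50, k=20, c=5):
    # Root set: up to t pages pointing to p (instead of a text-query result).
    root_set = sorted(in_links.get(p, set()))[:t]
    # Same pipeline as for broad-topic queries.
    base_set = grow_root_set(root_set, out_links, in_links, d)
    pruned = prune_links(base_set, links)
    pruned_pages = base_set | {page for edge in pruned for page in edge}
    return filter_top(pruned_pages, pruned, k=k, c=c)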
Conclusion • The approach developed here might be integrated into a study of traffic patterns on the WWW. • Future work could extend the method beyond broad-topic queries. • It would be interesting to understand the eigenvector-based heuristics more completely in the context of the algorithms presented here.