
Authoritative Sources in a Hyperlinked Environment - Jon M. Kleinberg


Presentation Transcript


  1. Authoritative Sources in a Hyperlinked Environment, Jon M. Kleinberg. Presented By: Vikrant Khosla, Sridhar Kameswara Nemani

  2. Outline • Search on WWW – Problem in general • Overview of the authoritative approach proposed by this paper • Constructing a Focused Subgraph • Computing Hubs and Authorities • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Conclusion

  3. General Problem • How to improve the quality of search on the WWW? • Quality of search requires human evaluation due to the subjectivity inherent in notions such as relevance. • The WWW is a hypertext corpus of enormous size and complexity. • This paper aims to create a link-based model that consistently identifies relevant, authoritative WWW pages for broad search topics.

  4. Understand Query Types • There is more than one type of query, and handling each may require different techniques. • Types of queries: • Specific queries, e.g. “Does Netscape support the JDK 1.1 code-signing API?” • Broad-topic queries, e.g. “Find information about the Java programming language.” • Similar-page queries, e.g. find pages ‘similar’ to honda.com

  5. Difficulty in Handling Queries • Specific queries: • Scarcity problem - there are very few pages containing the required information, and it is difficult to determine the identity of those pages. • Broad-topic queries: • Abundance problem - the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. • Goal: select a small set of the most “authoritative” or “definitive” pages from the huge collection of relevant ones.

  6. Authoritative Pages • Given a particular page, how do we tell whether it is authoritative? • The problem is related to the limitations of text-based analysis and text-based ranking functions: • E.g. for the query “harvard”, www.harvard.edu is the natural authority, but many other pages contain the term “harvard” far more often. • The most popular pages are not sufficiently self-descriptive: • The term “search engine” rarely appears on the home pages of search engines such as Yahoo, AltaVista, or Excite. • The Honda and Toyota home pages hardly contain the term “automobile manufacturer”.

  7. Analysis of Link Structure • Hyperlinks encode a latent human judgment, which can be used to formulate a notion of authority. • Creating a link represents a concrete indication of the following type of judgment: • The creator of page p, by including a link to page q, has in some measure conferred authority on q. • This gives the user an opportunity to find potential authorities purely through the pages that point to them. • Potential pitfalls of this idea: • Many links are created purely for navigational purposes (e.g. main menus, paid ads). • It is difficult to balance appropriate relevance against sheer popularity (e.g. Yahoo).

  8. Authorities and Hubs • Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. • Hubs are index pages that provide many useful links to relevant content pages (topic authorities). • In-degree - the number of pointers to a page; one simple measure of authority. • Out-degree - the number of pointers from a page to other pages.

  9. Can we operate over the entire WWW? • Local approaches - deal with an intranet, where the amount of data is much smaller than the WWW as a whole. • Clustering approaches - dissect a heterogeneous population into subpopulations that are in some way more cohesive, but the underlying problem of filtering a vast number of pages remains the same. • Authoritative approach - global in nature: • Perform a search on a text-based WWW search engine. • Distil the broad topic from the returned pages via the discovery of authorities.

  10. Overview of search steps

  11. Overview • Search string • Text Search Engine • Authoritative Approach • Constructing focused subgraph • Computing Hub & Authorities • Better quality search result

  12. Constructing the Subgraph • The collection V of hyperlinked pages can be viewed as a directed graph G = (V, E): nodes correspond to pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q. • Construct a focused subgraph S of the WWW with the following properties: • S is relatively small (so that computation is affordable). • S is rich in relevant pages (so that it is easier to find good authorities). • S contains most (or many) of the strongest authorities.

  13. How to find S • Q - the set of all pages containing the query string. • Root set R - the t highest-ranked pages for the query, obtained from a text-based search engine. R satisfies properties 1 and 2. • Problems with R: • R is a subset of the collection Q and often does not satisfy property 3. • There are extremely few links between pages in R, rendering it essentially “structureless”. • However, a strong authority for the query is quite likely to be pointed to by at least one page in R. • Construct the base set S by extending the root set R to include: • All pages linked to by pages in R. • For each page in R, up to d pages that link to it.
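
A minimal sketch of this root-set / base-set construction, using the paper's typical parameter values (t around 200, d = 50) and assuming hypothetical helpers search_engine_top(query, t), outlinks(page), and inlinks(page, limit) that return page identifiers; none of these helper names come from the paper:

```python
def build_base_set(query, t=200, d=50,
                   search_engine_top=None, outlinks=None, inlinks=None):
    """Grow the root set R (top-t text-search results) into the base set S.

    search_engine_top(query, t) -> the t highest-ranked pages for the query
    outlinks(p)                 -> pages that p links to
    inlinks(p, limit)           -> up to `limit` pages that link to p
    All three callables are assumed interfaces, not part of the paper.
    """
    root_set = set(search_engine_top(query, t))
    base_set = set(root_set)
    for p in root_set:
        base_set.update(outlinks(p))           # every page that p links to
        base_set.update(inlinks(p, limit=d))   # at most d pages linking to p
    return root_set, base_set
```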

  14. Subgraph algorithm

  15. Observations & Heuristics • Heuristic 1: delete all intrinsic links and keep all transverse links. • Intrinsic links: links between pages with the same domain name. • Generally exist for navigation purposes. • Less informative and often repetitive. • Transverse links: links between pages with different domain names. • Heuristic 2: limit collusion - allow only a small number m (about 4 to 8) of pages from any single domain to count as pointing to a given page p. • A large number of pages from a single domain all pointing to a single page p generally indicates mass endorsement, advertisement, etc.
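
A rough sketch of both heuristics over a list of (source_url, target_url) links; urlparse is used to read the domain, and the cap m counts links from one domain to one target page, which approximates the per-domain page limit described above:

```python
from urllib.parse import urlparse
from collections import defaultdict

def filter_links(links, m=8):
    """Drop intrinsic (same-domain) links and cap same-domain endorsements at m."""
    per_domain_target = defaultdict(int)
    kept = []
    for src, dst in links:
        src_dom, dst_dom = urlparse(src).netloc, urlparse(dst).netloc
        if src_dom == dst_dom:
            continue                              # intrinsic link: navigational, drop it
        per_domain_target[(src_dom, dst)] += 1
        if per_domain_target[(src_dom, dst)] > m:
            continue                              # likely mass endorsement / advertising
        kept.append((src, dst))                   # transverse link within the cap
    return kept
```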

  16. Overview • Search string • Text Search Engine • Authoritative Approach • Constructing focused subgraph • Computing Hub & Authorities • Better quality search result

  17. Computing Hubs & Authorities • The simplest approach would be to order pages by in-degree. • Problem: the nodes with the highest in-degree in the base set • might not be authorities and may lack any thematic unity; • might simply be universally popular pages like Yahoo, Google, etc.

  18. Computing Hubs & Authorities • Observation: • Good sources of content (authorities). • Good sources of links (hubs). • True authority pages are pointed to by a number of good hubs. • Mutually reinforcing relationship: • Hubs point to lots of authorities. • Authorities are pointed to by lots of hubs. • We use an iterative algorithm to break this circularity. • Terms: • Good hub: a page that points to many good authorities. • Good authority: a page pointed to by many good hubs.

  19. Overview of Algorithm

  20. Iterative Algorithm • An iterative algorithm: • With each page p, we associate • a non-negative authority weight x<p> • a non-negative hub weight y<p> • Weights of each type are normalized so their squares sum to 1. • Pages with larger x and y values are “better” authorities and hubs, respectively.

  21. Iterative Algorithm • If p points to many pages with large x-values, then it should receive a large y-value. • If p is pointed to by many pages with large y-values, then it should receive a large x-value. • In-links operation I: x<p> ← the sum of y<q> over all pages q that link to p, i.e. over all q with (q, p) ∈ E. • Out-links operation O: y<p> ← the sum of x<q> over all pages q that p links to, i.e. over all q with (p, q) ∈ E.

  22. Algorithm
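
A compact sketch of the Iterate procedure built from the I and O operations above, written over an adjacency matrix; numpy is assumed for the linear algebra:

```python
import numpy as np

def hits(adj, k=20):
    """Run k rounds of the I and O operations on adjacency matrix adj
    (adj[i, j] == 1 iff page i links to page j).
    Returns (x, y): authority weights and hub weights, each of unit length."""
    n = adj.shape[0]
    x = np.ones(n)                  # authority weights
    y = np.ones(n)                  # hub weights
    for _ in range(k):
        x = adj.T @ y               # operation I: sum hub weights of pages linking to p
        y = adj @ x                 # operation O: sum authority weights of pages p links to
        x = x / np.linalg.norm(x)   # normalize so the squares sum to 1
        y = y / np.linalg.norm(y)
    return x, y
```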

  23. Matrix Basics

  24. Observations • As one applies Iterate with an arbitrarily large k, the vectors x and y converge to fixed points x* and y*. • Let G = (V, E), with V = {p1, p2, …, pn}, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if (pi, pj) is an edge of G, and is 0 otherwise. • x* is the principal eigenvector of AᵀA, and y* is the principal eigenvector of AAᵀ. • The convergence of Iterate is quite rapid (k = 20 is sufficient).
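
As a check on this claim, the same weights can be read off the principal eigenvectors directly; a minimal numpy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def hits_by_eigenvectors(adj):
    """Authority and hub weights as principal eigenvectors of A^T A and A A^T."""
    def principal(M):
        vals, vecs = np.linalg.eigh(M)      # M is symmetric, so eigh applies
        v = vecs[:, np.argmax(vals)]        # eigenvector of the largest eigenvalue
        return np.abs(v)                    # fix the arbitrary sign; entries are weights
    x_star = principal(adj.T @ adj)         # authority weights
    y_star = principal(adj @ adj.T)         # hub weights
    return x_star, y_star
```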

  25. Observations • Any eigenvector algorithm could be used to compute the fixed points x* and y*. • Iterating I and O instead emphasizes the underlying motivation of the approach. • We do not need to iterate I and O to convergence: • We can start from initial vectors x0 and y0 and compute using a fixed bound on the number of I and O operations.

  26. Example: Mini Web
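
As a purely illustrative stand-in for the mini web on this slide, here is a hypothetical four-page graph run through the hits() sketch from slide 22; the graph itself is an assumption, not the one pictured on the slide:

```python
import numpy as np

# Hypothetical mini web: pages 0 and 1 act as hubs, pages 2 and 3 as authorities.
A = np.array([
    [0, 0, 1, 1],   # page 0 links to pages 2 and 3
    [0, 0, 1, 1],   # page 1 links to pages 2 and 3
    [0, 0, 0, 1],   # page 2 links to page 3
    [0, 0, 0, 0],   # page 3 links to nothing
])

x, y = hits(A, k=20)
print("authority weights:", np.round(x, 3))   # pages 2 and 3 come out on top
print("hub weights:      ", np.round(y, 3))   # pages 0 and 1 come out on top
```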

  27. Example: Mini Web (Cont..)

  28. Basic Results

  29. Observations • This is a “pure” analysis of link structure: • We ignored the text while searching for authoritative pages. • i.e., text-based search only supplies the initial set. • Pages can legitimately be considered authoritative in the context of the WWW without access to a large-scale index of the WWW. • i.e., global analysis of the full WWW link structure can be replaced by a local method over a small focused subgraph. • This approach can also replace the local approaches used in intranets.

  30. Similar page queries • Example: find pages ‘similar’ to honda.com • Using link structure to infer a notion of “similarity” among pages. • Suppose we have found a page p that is of interest and is an authoritative page on a topic. • Can this help in finding similar pages? • What do users of the WWW consider to be related to p when they create pages and links?

  31. Similar page queries • Previously our request to the search engine was: • “Find t pages containing the query string.” • Now our request to the search engine is: • “Find t pages pointing to p.” • Rp - root set • Sp - base set • Gp - focused subgraph • The strongest authorities in the local region of the link structure near p are potential broad-topic summaries of the pages related to p.
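
A minimal variant of the earlier base-set sketch for this case, again assuming the hypothetical inlinks/outlinks helpers from slide 13:

```python
def build_base_set_for_page(p, t=200, d=50, inlinks=None, outlinks=None):
    """Root set R_p: up to t pages pointing to p; grow it into the base set S_p."""
    root_set = set(inlinks(p, limit=t))
    base_set = set(root_set) | {p}
    for q in root_set:
        base_set.update(outlinks(q))           # every page that q links to
        base_set.update(inlinks(q, limit=d))   # at most d pages linking to q
    return root_set, base_set
```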

  32. Results- Similar page queries

  33. Multiple Sets of Hubs & Authorities • There may be several densely linked collections of hubs and authorities within the same set. • Examples: • “jaguar” - has several different meanings. • “randomized algorithms” - arises in multiple technical communities. • “abortion” - involves groups that may not be linked to each other. • Clustering in the presence of the abundance problem is needed.

  34. Multiple Sets of Hubs & Authorities • The non-principal eigenvectors provide a way to extract additional densely linked collections of hubs and authorities. • Non-principal eigenvectors have both positive and negative entries. • Often the highly positive entries correspond to one cluster of pages and the negative entries to a different cluster. • Typically the two clusters are not tightly intertwined.
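
A sketch of pulling these communities out of the eigenvectors of AᵀA, splitting each vector into its positive and negative ends; numpy is assumed and the function name is ours:

```python
import numpy as np

def authority_communities(adj, num_vectors=4, top=5):
    """For the eigenvectors of A^T A taken in decreasing order of eigenvalue
    (the first is the principal one), report the pages at the most positive
    and the most negative ends of each vector."""
    vals, vecs = np.linalg.eigh(adj.T @ adj)
    order = np.argsort(vals)[::-1]                  # largest eigenvalues first
    communities = []
    for idx in order[:num_vectors]:
        v = vecs[:, idx]
        positive_end = np.argsort(v)[::-1][:top]    # pages with the largest entries
        negative_end = np.argsort(v)[:top]          # pages with the smallest entries
        communities.append((positive_end, negative_end))
    return communities
```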

  35. Jaguar Example • The principal authority eigenvector is primarily about the Atari product. • The positive end of the 2nd non-principal eigenvector is primarily about the Jacksonville Jaguars. • The positive end of the 3rd non-principal eigenvector is primarily about the car.

  36. Randomized Algorithms Example • The positive end of the first non-principal eigenvector returned home pages of theoretical computer scientists. • The negative end of the first non-principal eigenvector returned compendia of mathematical software. • The negative end of the fourth non-principal eigenvector contained pages primarily about wavelets.

  37. Diffusion and Generalization • The query may not be sufficiently “broad.” • In this case there will not be enough highly relevant pages in the base set to extract a sufficiently dense subgraph of relevant hubs and authorities. • When this occurs, the collection will often represent a broader topic, and the results will reflect a diffused version of the initial query. • Example: “WWW conferences” -> WWW resource pages.

  38. Evaluation • In studies conducted in 1998 over 26 queries with 37 volunteers, Clever reported better authorities than Yahoo!, which in turn was better than AltaVista.

  39. Conclusion • We need a way to distill a broad topic, for which there may be millions of relevant pages. • The approach provides high-quality results in the context of what is available on the WWW globally. • It operates without maintaining an index of the WWW or its link structure. • It identifies complex patterns of social organization on the WWW.

  40. References • http://crystal.uta.edu/~gdas/Courses/websitepages/spring07DBIR.htm • An Introduction to Information Retrieval - Manning and Raghavan • Information Retrieval: Data Structures and Algorithms - William B. Frakes • Random Walks in Ranking Query Results in Semistructured Databases - slides by Vagelis Hristidis

  41. Thank You
