Explore the decentralized, hyperlink-based, and unstructured nature of the web's link structure, and learn about challenges faced by search engines, strategies for searching with hyperlinks, and constructing focused subgraphs for efficient searching. Delve into the computation of hubs and authorities, iterative algorithms for mining, and comparative results with popular search engines. Discover applications in taxonomy construction, web trawling, and structured information mining.
Mining Web’s Link Structure Sushanth Rai University of Texas at Arlington http://cseweb.uta.edu/~rai
Structure of WWW • Highly Decentralized • Unstructured • Hyperlink Based • Disorganized Presentation
Searching the WWW • Searching: the process of discovering high-quality, relevant pages in response to a specific information need
Challenges in Search Engines • Index-based search engines may return anywhere from one result to millions • Ranking heuristics rely on the frequency of occurrence of words • Spamming can mislead index-based search engines • Human language exhibits synonymy and polysemy • Web pages are not self-descriptive
Searching with Hyperlinks • Features • Hyperlinks represent latent human judgment • Hyperlinks provide an opportunity to find potential authorities • Pitfalls • Many links are created for purposes other than conferring authority, such as navigation or advertising • Popularity must be balanced against relevance
Focused Subgraph of WWW • Authority: a page that is pointed to by many good hubs • Hub: a page that points to many good authorities • Authorities and hubs are extracted from a focused subgraph, a set of pages that • Is relatively small • Is rich in content related to the query • Contains the strongest authorities
[Figure: the root set and the base set]
Construction of Subgraph
• Subgraph($\sigma$, $\mathcal{E}$, $t$, $d$)
• $\sigma$: a query string
• $\mathcal{E}$: a text-based search engine
• $t$, $d$: natural numbers
• Let $R_\sigma$ denote the top $t$ results of $\mathcal{E}$ on $\sigma$
• Set $S_\sigma := R_\sigma$
• For each page $p \in R_\sigma$
  • Let $\Gamma^+(p)$ denote the set of all pages $p$ points to
  • Let $\Gamma^-(p)$ denote the set of all pages pointing to $p$
  • Add all pages in $\Gamma^+(p)$ to $S_\sigma$
  • If $|\Gamma^-(p)| \le d$ then
    • Add all pages in $\Gamma^-(p)$ to $S_\sigma$
  • Else
    • Add an arbitrary set of $d$ pages from $\Gamma^-(p)$ to $S_\sigma$
• End
• Return $S_\sigma$
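The construction above can be sketched in Python. This is a minimal sketch, assuming the link structure is already available as in-memory dictionaries; the names `build_base_set`, `search`, `out_links`, and `in_links` are hypothetical, and the toy graph is invented for illustration. The "arbitrary set of d pages" is taken here as the first d pages in sorted order, purely so that the result is reproducible.

```python
from typing import Callable, Dict, List, Set


def build_base_set(
    query: str,
    search: Callable[[str, int], List[str]],  # hypothetical text-based search engine
    out_links: Dict[str, Set[str]],           # page -> pages it points to
    in_links: Dict[str, Set[str]],            # page -> pages pointing to it
    t: int,
    d: int,
) -> Set[str]:
    """Expand the top-t results for `query` into a base set of pages."""
    root = search(query, t)[:t]               # root set: top t results
    base = set(root)
    for p in root:
        # Add every page that p points to.
        base |= out_links.get(p, set())
        # Add at most d of the pages that point to p.
        preds = in_links.get(p, set())
        if len(preds) <= d:
            base |= preds
        else:
            # Assumption: "arbitrary" is implemented as the first d in sorted order.
            base |= set(sorted(preds)[:d])
    return base


# Toy data, invented for illustration only.
toy_out = {"a": {"b", "c"}, "x": {"a"}, "y": {"a"}, "z": {"a"}}
toy_in = {"a": {"x", "y", "z"}}


def toy_search(query: str, t: int) -> List[str]:
    return ["a"][:t]


toy_base = build_base_set("java", toy_search, toy_out, toy_in, t=1, d=2)
```

The cap d keeps a single very popular root page from flooding the base set with all of its in-links.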
Pruning the Subgraph • In the graph G[S] induced by the set S • Classify each link as transverse (between pages on different domains) or intrinsic (between pages on the same domain) • Delete all intrinsic links and retain only the transverse links
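The pruning step might look like the following sketch. It assumes that pages are identified by absolute URLs and that "same domain" means "same hostname", which is one possible reading of the slide; `prune_intrinsic` and the example edges are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlparse

Edge = Tuple[str, str]  # (source URL, destination URL)


def prune_intrinsic(edges: List[Edge]) -> List[Edge]:
    """Keep only transverse links, i.e. links whose endpoints differ in hostname."""
    kept = []
    for src, dst in edges:
        # Assumption: two pages are on the same domain when their hostnames match.
        if urlparse(src).hostname != urlparse(dst).hostname:
            kept.append((src, dst))
    return kept


# Example edges, invented for illustration.
example_edges = [
    ("http://java.sun.com/a.html", "http://java.sun.com/b.html"),  # intrinsic
    ("http://www.gamelan.com/", "http://java.sun.com/"),           # transverse
]
transverse = prune_intrinsic(example_edges)
```

Intrinsic links are often navigational (for example, a link back to a site's home page), so dropping them keeps one site from inflating its own weights.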
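Computing Hubs and Authorities • Associate a non-negative authority weight $x^{\langle p \rangle}$ and a non-negative hub weight $y^{\langle p \rangle}$ with each page $p$ • Weights of each type are normalized so that their squares sum to 1 • Use the I and O operations iteratively to update the weights • I: $x^{\langle p \rangle} \leftarrow \sum_{q : (q,p) \in E} y^{\langle q \rangle}$ • O: $y^{\langle p \rangle} \leftarrow \sum_{q : (p,q) \in E} x^{\langle q \rangle}$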
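A single application of each operation can be sketched as follows. The edge-list representation, the function names, and the four-page graph are assumptions made for illustration. In words, the I operation sets a page's authority weight to the sum of the hub weights of the pages pointing to it, and the O operation sets a page's hub weight to the sum of the authority weights of the pages it points to.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (p, q) means page p links to page q


def i_operation(edges: List[Edge], y: Dict[str, float]) -> Dict[str, float]:
    """Authority update: x[p] = sum of y[q] over all links (q, p)."""
    x = {page: 0.0 for page in y}
    for q, p in edges:
        x[p] += y[q]
    return x


def o_operation(edges: List[Edge], x: Dict[str, float]) -> Dict[str, float]:
    """Hub update: y[p] = sum of x[q] over all links (p, q)."""
    y = {page: 0.0 for page in x}
    for p, q in edges:
        y[p] += x[q]
    return y


def normalize(w: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so that their squares sum to 1 (no-op on all zeros)."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return dict(w) if norm == 0 else {p: v / norm for p, v in w.items()}


# Toy graph, invented for illustration: two hubs and two authorities.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
ones = {"h1": 1.0, "h2": 1.0, "a1": 1.0, "a2": 1.0}

x_step = i_operation(toy_edges, ones)    # a1 collects 2.0, a2 collects 1.0
y_step = o_operation(toy_edges, x_step)  # h1 collects 3.0, h2 collects 2.0
```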
[Figure: hubs, authorities, and an unrelated page of large in-degree]
Iterative Algorithm
• Iterate($G$, $k$)
• $G$: a collection of $n$ linked pages
• $k$: a natural number
• Let $z$ denote the vector $(1, 1, \ldots, 1) \in \mathbb{R}^n$
• Set $x_0 := z$
• Set $y_0 := z$
• For $j = 1, 2, \ldots, k$
  • Apply the I operation to $(x_{j-1}, y_{j-1})$, obtaining new x-weights $x'_j$
  • Apply the O operation to $(x'_j, y_{j-1})$, obtaining new y-weights $y'_j$
  • Normalize $x'_j$, obtaining $x_j$
  • Normalize $y'_j$, obtaining $y_j$
• End
• Return $(x_k, y_k)$
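The loop above could be written in Python roughly as follows. This is a self-contained sketch under assumed names (`iterate`, `_unit`) and an invented toy graph, with the graph given as a list of pages plus a list of directed links.

```python
import math
from typing import Dict, List, Tuple


def _unit(w: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so their squares sum to 1; leave an all-zero vector alone."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return dict(w) if norm == 0 else {p: v / norm for p, v in w.items()}


def iterate(
    pages: List[str], edges: List[Tuple[str, str]], k: int
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Run k rounds of the I and O operations; return (authority, hub) weights."""
    x = {p: 1.0 for p in pages}  # x_0 = z
    y = {p: 1.0 for p in pages}  # y_0 = z
    for _ in range(k):
        # I operation: authority weight = sum of hub weights of in-neighbours.
        new_x = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_x[dst] += y[src]
        # O operation, applied to the fresh x-weights as in the pseudocode.
        new_y = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_y[src] += new_x[dst]
        x, y = _unit(new_x), _unit(new_y)
    return x, y


# Toy graph, invented for illustration: h1 links to both authorities, h2 to one.
toy_pages = ["h1", "h2", "a1", "a2"]
toy_links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
authority, hub = iterate(toy_pages, toy_links, k=20)
```

On this toy graph `a1` ends up with the larger authority weight because both hubs point to it, and `h1` ends up with the larger hub weight because it points to both authorities.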
Results
(java) Authorities
.328 http://www.gamelan.com
.251 http://java.sun.com
.190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html
.183 http://sunsite.unc.edu/javafaq/javafaq.html
(Gates) Authorities
.643 http://www.roadahead.com
.458 http://www.microsoft.com
.440 http://www.microsoft.com/corpinfo/bill-g.htm
Results (Contd…) • Comparison of Altavista, Yahoo, and Clever on 26 broad search topics, with results rated as “bad”, “fair”, “good”, or “fantastic” • For 31% of the topics, Yahoo and Clever received equivalent evaluations • For 50%, Clever received the higher evaluation • For 19%, Yahoo received the higher evaluation • Altavista did not receive the higher evaluation on any of the 26 topics
Applications • Constructing Taxonomies semiautomatically • Trawling the web for Emerging Cybercommunities • Mining structured information that succumbs to database techniques
Web Resources • Clever - http://www.almaden.ibm.com/cs/k53/clever.html • Google - http://www.google.com • WebL - http://www.research.compaq.com/SRC/WebL