
Lecture 6: (Week 6) Web (Social) Structure Mining

This lecture covers the structure of the web, including hyperlink structures, the relevance of closely linked web pages, and the physical structure of the Internet. It also explores the features of HTTP and HTML in the context of web publishing. Interesting websites and the concept of directed virtual links are discussed.



  1. Lecture 6: (Week 6) Web (Social) Structure Mining • Hypertext Transfer Protocol • Hyperlink structures of the web • Relevance of closely linked web pages • In-degree as a measure of popularity • Physical structure of the Internet • Interesting websites: • http://prj61/GoogleTest3/GoogleSearch.aspx • http://www.google.com.hk/

  2. Hyperlinks • The Hypertext Transfer Protocol (HTTP/1.0) • An application-level protocol • with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. • A generic, stateless, object-oriented protocol • which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). • An important feature: the typing of data representations allows systems to be built independently of the data being transferred • Based on the request/response paradigm • In use by the WWW global information initiative since 1990. • http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt
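
To make the request/response paradigm concrete, here is a minimal sketch (not from the original slides) that issues an HTTP/1.0 request over a raw TCP socket in Python; the host example.com is only a placeholder.

```python
# Minimal illustration of the HTTP/1.0 request/response paradigm over a raw
# TCP socket. The host "example.com" is only a placeholder.
import socket

HOST, PORT = "example.com", 80

request = (
    "GET / HTTP/1.0\r\n"          # request line: method, resource, protocol version
    f"Host: {HOST}\r\n"           # Host header (optional in HTTP/1.0)
    "\r\n"                        # an empty line ends the request headers
)

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):           # the server closes the connection when done
        response += chunk

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("iso-8859-1"))           # status line and headers, e.g. "HTTP/1.0 200 OK"
```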

  3. HTML (Hypertext Markup Language) • HTML - the Hypertext Markup Language - is the lingua franca for publishing on the World Wide Web. Having gone through several stages of evolution, today's HTML has a wide range of features reflecting the needs of a very diverse and international community wishing to make information available on the Web. • see: http://www.w3.org/MarkUp/
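
Since the hyperlink structure studied in the rest of this lecture is carried by HTML's <a href> anchors, the following illustrative Python sketch extracts them with the standard html.parser module; the sample markup is invented.

```python
# Extract the hyperlinks (<a href="..."> anchors) from an HTML document using
# the standard html.parser module. The sample page is invented for illustration.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                          # anchors carry the hyperlink structure
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

sample_page = """
<html><body>
  <p>See <a href="http://www.w3.org/MarkUp/">the HTML spec</a> and
     <a href="http://www.google.com.hk/">a search engine</a>.</p>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_page)
print(parser.links)   # ['http://www.w3.org/MarkUp/', 'http://www.google.com.hk/']
```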

  4. Web Structure • Directed Virtual Links • Special Features: Huge, Unknown, Dynamic • Directed Graph Functions: backlink, shortest path, cliques

  5. An Unknown Digraph • The hyperlinks pointing from each web page to other web pages form a virtual directed graph • Hyperlinks are added and deleted at will by individual web page authors • A web page may not know its incoming hyperlinks • The digraph is dynamic: • Central control of the hyperlinks is not possible

  6. The digraph is dynamic • Search engines can only map a fraction of the whole web space • Even if we can manage the size of the digraph, its dynamic nature requires constant updating of the map • There are web pages that are not indexed by any search engine.

  7. The structure of the Digraph • Nodes: web pages (URLs) • Directed edges: hyperlinks from one web page to another • Content of a node: the content of its associated web page • Dynamic nature of the digraph: • For some nodes, there are outgoing edges we do not know yet • Nodes that have not yet been processed • New edges (hyperlinks) may have been added to some nodes • For all nodes, there may be incoming edges we do not yet know

  8. Useful Functions of the Digraph • Backlink (the_url): • find all the URLs that point to the_url • Shortest_path (url1, url2): • return the shortest path from url1 to url2 • Maximal_clique (url): • return a maximal clique that contains url
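
A minimal sketch of such a graph class in Python, supporting the backlink and shortest-path functions above (the maximal-clique function is omitted; clique finding is hard, as a later slide notes). The class and method names are illustrative, not from the lecture.

```python
# A small digraph of URLs supporting the backlink and shortest-path
# functions listed above. Names (WebDigraph, etc.) are illustrative only.
from collections import defaultdict, deque

class WebDigraph:
    def __init__(self):
        self.out_links = defaultdict(set)   # url -> set of urls it points to
        self.in_links = defaultdict(set)    # url -> set of urls pointing to it

    def add_edge(self, src, dst):
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)

    def backlink(self, the_url):
        """All URLs that point to the_url."""
        return set(self.in_links[the_url])

    def shortest_path(self, url1, url2):
        """Shortest directed path from url1 to url2 (breadth-first search), or None."""
        parent = {url1: None}
        queue = deque([url1])
        while queue:
            u = queue.popleft()
            if u == url2:                        # rebuild the path back to url1
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]
            for v in self.out_links[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None

# Tiny usage example with made-up URLs:
g = WebDigraph()
g.add_edge("a.html", "b.html")
g.add_edge("b.html", "c.html")
g.add_edge("a.html", "c.html")
print(g.backlink("c.html"))                  # {'a.html', 'b.html'} (set order may vary)
print(g.shortest_path("a.html", "c.html"))   # ['a.html', 'c.html']
```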

  9. [Figure] An ordinary digraph H with 7 vertices V0–V6 and 12 edges

  10. [Figure] A partially unknown digraph H with 7 vertices and 12 edges, but node V5 is not yet explored. We do not know the outgoing edges from V5, though we know it exists (by its URL).

  11. Map of the hyperlink space • To construct it, one needs • a spider to automatically collect URLs • a graph class to store information for nodes (URLs) and links (hyperlinks) • The whole digraph (URLs, HYPERLINKs) is huge: • 162,128,493 hosts (July 2002 data from http://www.isc.org/ds/WWW-200207/index.html ) • One may need graph algorithms that work with secondary memory

  12. Spiders • Automatically retrieve web pages • Start with a URL • retrieve the associated web page • find all URLs on the web page • recursively retrieve not-yet-searched URLs • Algorithmic issues • How to choose the next URL? • Avoid overloading sub-networks
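
A hedged sketch of this crawl loop using only the Python standard library; the seed URL, page limit, and politeness delay are illustrative choices, and real spiders need far more care (robots.txt, per-host rate limits, URL normalisation).

```python
# A minimal breadth-first spider: fetch a page, extract its hyperlinks, and
# enqueue URLs not yet seen. Seed URL, page limit and delay are illustrative.
import re
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)   # crude absolute-link extractor

def crawl(seed, max_pages=10, delay=1.0):
    seen, frontier = {seed}, deque([seed])
    link_graph = {}                            # url -> list of out-links found on that page
    while frontier and len(link_graph) < max_pages:
        url = frontier.popleft()               # which URL to visit next is a policy choice
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                           # unreachable, deleted, or unreadable page
        out = [urljoin(url, h) for h in HREF_RE.findall(html)]
        link_graph[url] = out
        for u in out:
            if u not in seen:                  # recurse only on not-yet-searched URLs
                seen.add(u)
                frontier.append(u)
        time.sleep(delay)                      # crude politeness: avoid overloading hosts
    return link_graph

# Example (commented out so the script does not need network access):
# graph = crawl("http://www.google.com.hk/", max_pages=3)
```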

  13. [Figure] Spider Architecture • several url_spider processes (the spiders) share one URL pool • each spider gets a URL from the pool, sends an HTTP request into the web space, and receives the HTTP response • newly found URLs are added to the shared URL pool • retrieved content is stored through a database interface into a database
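
The architecture above can be sketched with worker threads sharing one URL pool; the fetch step is stubbed out, and the names (url_spider, database, etc.) follow the diagram labels, but the code itself is only an assumption of how such a pool might be wired up.

```python
# Sketch of the spider architecture above: several url_spider workers share one
# URL pool (a thread-safe queue) and write results to a shared store. The fetch
# step is stubbed out; a real spider would issue the HTTP request and parse the
# response there. All of this is an illustrative assumption, not the lecture's code.
import queue
import threading

url_pool = queue.Queue()          # the shared URL pool
database = {}                     # stands in for the database behind the database interface
db_lock = threading.Lock()

def fetch(url):
    """Placeholder for the HTTP request / HTTP response step."""
    return f"<html>page at {url}</html>", []             # (page content, URLs found on it)

def url_spider():
    while True:
        url = url_pool.get()                             # "Get an URL" from the pool
        if url is None:                                   # sentinel: shut this spider down
            url_pool.task_done()
            return
        content, new_urls = fetch(url)
        with db_lock:
            database[url] = content                       # store via the database interface
        for u in new_urls:
            url_pool.put(u)                               # "Add a new URL" to the pool
        url_pool.task_done()

spiders = [threading.Thread(target=url_spider) for _ in range(5)]
for s in spiders:
    s.start()
for seed in ["http://example.org/a", "http://example.org/b"]:
    url_pool.put(seed)
url_pool.join()                                           # wait until the pool is drained
for _ in spiders:
    url_pool.put(None)                                    # stop the workers
for s in spiders:
    s.join()
print(len(database), "pages stored")
```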

  14. The spider • does this automatically (without clicking on a link or typing a URL) • It is an automated program that searches the web: • read a web page • store/index the relevant information on the page • follow all the links on the page (and repeat the above for each link)

  15. [Figure] Internet Growth Charts

  16. [Figure] Internet Growth Charts

  17. Partial Map • A partial map may be enough for some purposes • e.g., we are often interested in only a small portion of the whole Internet • A partial map may be constructed within the memory of an ordinary PC, so it may allow fast performance • However, we may not be able to collect all the information that is necessary for our purpose • e.g., back links to a URL

  18. Back link • Hyperlinks on the web are forward links. • A web page may not know the hyperlinks that point to it • authors of web pages can freely link to other documents in cyberspace without consent from those documents' authors • Back links may be of value • in the scientific literature, the SCI (Science Citation Index) is an important index for judging the academic value of an article • www.isinet.com • It is not easy to find all the back links
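
Within a crawled portion of the web, back links can at least be recovered by inverting the forward-link map, as in this small sketch (the link data is invented):

```python
# Back links are not recorded on the web itself, but within a crawled collection
# they can be recovered by inverting the forward-link map. The data is invented.
from collections import defaultdict

forward_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

back_links = defaultdict(set)
for src, targets in forward_links.items():
    for dst in targets:
        back_links[dst].add(src)              # dst is pointed to by src

print(sorted(back_links["c.html"]))           # ['a.html', 'b.html']
```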

  19. Discovery of Back links • Provided in the advanced search features of several search engines • A search on Google is done by typing • link:url • as your keyword • Example: 1. go to www.google.com and 2. type link:http://www.cs.cityu.edu.hk/~lwang • Homework: find out how to retrieve back links from other search engines • http://decweb.ethz.ch/WWW7/1938/com1938.htm • Surfing the Web Backwards • www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html

  20. Web structure mining • Some information is embedded in the digraph • Usually, the hyperlinks from a web page to other web pages are chosen because, in the view of the web page's author, they are important and contain useful related information • e.g., fans of football may all have links pointing to their favorite football teams • Some basic technology tools for web structure mining: • a spider • a graph class • some relevant graph algorithms • a back link retrieval function

  21. Intellectual Content of Hyperlink Structure • Fans • Page Rank • Densely Connected Subgraphs as Communities

  22. FAN: • Fans of a web page often put a link toward the web page. • This is usually done manually after a user has accessed the web page and looked at its content.

  23. [Figure] Fans of a web page (fan pages, each pointing to the web page)

  24. FANs as an indicator of popularity: • The more fans a web page has, the more popular it is. • The SCI (Science Citation Index), for example, is a well-established method for rating the importance of a research paper published in international journals. • It is somewhat controversial as a measure of importance, since some important work may not be popular • But it is a good indicator of influence

  25. An objection to FANs as an indicator of importance: • Some of the most popular web pages are so well known that people may not put links to them in their web pages • On the assumption that some web pages are more important than others, how do we compare • a web page linked to by important web pages with • another linked to by less important web pages?

  26. PageRank • Two factors influence the rank of a web page: • The rank of the web pages pointing to it • the higher the better • The number of outgoing links in the web pages pointing to it • the fewer the better

  27. Definition • Web pages are ranked according to their PageRank, calculated as follows: • Assume page A has pages T1...Tn which point to it (i.e., back links or citations). • Choose a damping factor d between 0 and 1 (usually d = 0.85). • C(A) is defined as the number of links going out of page A. The PageRank of page A is then: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  28. Calculation of PageRank • Notice that the definition of PR(A) is cyclic: • i.e., the ranks of web pages are used to calculate the ranks of other web pages • However, PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. • It is reported that the PageRank of 26 million web pages can be computed in a few hours on a medium-size workstation.

  29. [Figure] An example: a three-page digraph with pages a, b, and c (a links to b and c, b links to c, c links to a)

  30. PageRank of the example graph • Start with PR(a)=1, PR(b)=1, PR(c)=1 • Apply PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • For simplicity, set d=1, and recall that C(·) is the out-degree • After the first iteration, we have • PR(a)=1, PR(b)=1/2, PR(c)=3/2 • After the second iteration, we have • PR(a)=3/2, PR(b)=1/2, PR(c)=1 • Subsequent iterations: • a:1 b:3/4 c:5/4 • a:5/4 b:1/2 c:5/4 • in the limit • PR(a)=6/5, PR(b)=3/5, PR(c)=6/5

  31. An example (first update) • Before: PR(a)=1, PR(b)=1, PR(c)=1, with C(a)=2, C(b)=1, C(c)=1 • Update: PR(a) = PR(c)/C(c) = 1 • PR(b) = PR(a)/C(a) = 1/2 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/2

  32. An example (second update) • Before: PR(a)=1, PR(b)=1/2, PR(c)=3/2, with C(a)=2, C(b)=1, C(c)=1 • Update: PR(a) = PR(c)/C(c) = 3/2 • PR(b) = PR(a)/C(a) = 1/2 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1/2 = 1

  33. An example (third update) • Before: PR(a)=3/2, PR(b)=1/2, PR(c)=1, with C(a)=2, C(b)=1, C(c)=1 • Update: PR(a) = PR(c)/C(c) = 1 • PR(b) = PR(a)/C(a) = 3/4 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/4 + 1/2 = 5/4
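
The iterations on slides 30-33 can be reproduced with a short script; this is a sketch, assuming d = 1 as the lecture does for simplicity, and updating every page from the previous iteration's values.

```python
# Iterative PageRank for the example (a -> b, a -> c, b -> c, c -> a), using
# PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) over the pages T
# that link to A. d = 1 here only to match the lecture; Google uses d ~ 0.85.

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(out_links)
C = {p: len(out_links[p]) for p in pages}                     # out-degrees
backlinks = {p: [q for q in pages if p in out_links[q]] for p in pages}

def iterate(pr, d=1.0):
    # update every page from the previous iteration's values
    return {p: (1 - d) + d * sum(pr[t] / C[t] for t in backlinks[p]) for p in pages}

pr = {p: 1.0 for p in pages}                                  # start with PR = 1 everywhere
for i in range(20):
    pr = iterate(pr)
    print(i + 1, {p: round(pr[p], 3) for p in pages})
# The first iterations give (1, 1/2, 3/2), (3/2, 1/2, 1), (1, 3/4, 5/4), (5/4, 1/2, 5/4),
# approaching the limit PR(a) = 6/5, PR(b) = 3/5, PR(c) = 6/5 as on slide 30.
```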

  34. Bringing Order to the Web • Used maps containing as many as 518 million of these hyperlinks. • These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. • For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). • For the type of full text searches in the main Google system, PageRank also helps a great deal. • As reported in "The Anatomy of a Large-Scale Hypertextual Web Search Engine", by Sergey Brin and Lawrence Page

  35. Do Densely Connected Sub-graphs Represent Web Sub-communities? • Inferring Web Communities from Link Topology • http://citeseer.nj.nec.com/36254.html • Efficient Identification of Web Communities • http://citeseer.nj.nec.com/flake00efficient.html • Friends and Neighbors on the Web • http://citeseer.nj.nec.com/adamic01friends.html

  36. An idea: Complete sub-graphs • There is a group of URLs such that • each URL has a link to every other URL in the group • This is evidence that the author of each web page is interested in every other web page in the group

  37. Another idea: Complete bipartite sub-graphs • Complete bipartite graph: • two groups of nodes, U and V • for each node u in U and each node v in V • there is an edge from u to v • References • D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998. • T. Murata. Finding related web pages based on connectivity information from a search engine. The 10th WWW Conference
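
A small sketch of checking the complete-bipartite condition (every page in U links to every page in V) on a toy link map; the data is invented.

```python
# Check the complete-bipartite condition from this slide: every page u in U has
# a hyperlink to every page v in V. The toy link map is invented.
def is_complete_bipartite(out_links, U, V):
    return all(v in out_links.get(u, ()) for u in U for v in V)

out_links = {
    "fan1": {"team_a", "team_b"},
    "fan2": {"team_a", "team_b", "news"},
    "fan3": {"team_a"},
}
print(is_complete_bipartite(out_links, {"fan1", "fan2"}, {"team_a", "team_b"}))  # True
print(is_complete_bipartite(out_links, {"fan1", "fan3"}, {"team_a", "team_b"}))  # False
```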

  38. Problem Description • Suppose one is familiar with some web pages on a specific topic, such as sports • Problem: to find more pages about the same topic • Web community: an entity of related web pages (the centers)

  39. Search for fans using a search engine • Use the input URLs as the initial centers • Search for URLs referring to all the centers by backlink search from the centers • A fixed number of high-ranking URLs are selected as fans

  40. Adding a new URL to the centers • Acquire the fans' HTML files over the Internet • Extract the hyperlinks in the HTML files • Sort the hyperlinks in order of frequency • Add the top-ranking hyperlink to the centers • Delete fans not referring to all the centers

  41. Web Community • Repeat the previous steps until few fans are left • The acquired centers are regarded as a WEB COMMUNITY
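
A sketch of the whole procedure from slides 38-41, run on a tiny invented link dataset; backlink_search() and fetch_links() stand in for the search-engine backlink query and the HTML download/parsing steps.

```python
# Sketch of the center/fan procedure from slides 38-41 on a toy link map.
# backlink_search() and fetch_links() stand in for the search-engine backlink
# query and the HTML download + link extraction steps; the data is invented.
from collections import Counter

out_links = {
    "fan1": {"team_a", "team_b"},
    "fan2": {"team_a", "team_b"},
    "fan3": {"team_a", "team_b"},
    "other": {"news"},
}

def backlink_search(url):
    """URLs linking to `url` (a real system would query a search engine)."""
    return {p for p, links in out_links.items() if url in links}

def fetch_links(url):
    """Hyperlinks on the page at `url` (a real system would fetch and parse HTML)."""
    return out_links.get(url, set())

def find_community(seed_centers, min_fans=2, max_centers=5):
    centers = set(seed_centers)
    while len(centers) < max_centers:
        # fans: pages referring to ALL current centers (slide 39); recomputing this
        # every round also drops fans that do not refer to a newly added center (slide 40)
        fans = set.intersection(*(backlink_search(c) for c in centers))
        if len(fans) < min_fans:
            break                                    # few fans left: stop (slide 41)
        # count hyperlinks in the fans' pages and promote the most frequent new one
        counts = Counter(l for f in fans for l in fetch_links(f) if l not in centers)
        if not counts:
            break
        centers.add(counts.most_common(1)[0][0])
    return centers                                   # the acquired centers = the community

print(sorted(find_community({"team_a"})))            # ['team_a', 'team_b']
```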

  42. [Figure] Web community: fans point to the centers • Centers: many web pages go there

  43. Drawbacks • A maximum clique is difficult to find (an NP-hard problem) • The rough idea of closely linked URLs is right, but a completely connected subgraph may be too strict: it is often the case that some links are missing • fans may not have hyperlinks to centers that were created after their own web pages were created

  44. Minimum Cut Paradigm • Given the drawbacks of the clique approach above, relax complete subgraphs to a cut-based notion of closeness: • A minimum cut of a digraph (V,A) is a partition of the node set V into two subsets U and W such that the number of edges from U to W is minimized. • It captures the notion that U and W are NOT closely linked. • Therefore, nodes in U are more closely related to each other than to nodes in W.

  45. General approach • Find a minimum cut using a maximum-flow algorithm • if the minimum cut is sufficiently large, keep it and report the nodes as a web community • else • remove the edges in the minimum cut to split the digraph into two connected components • repeat on each of the two components
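
A hedged sketch of one step of this approach using the networkx package's max-flow based minimum_cut; choosing the source/sink pair and the edge capacities is an implementation detail not fixed by the slide, and the toy graph is invented.

```python
# One step of the minimum-cut paradigm using the networkx package (an external
# dependency). The toy digraph is invented: two densely linked groups joined by
# a single weak link; giving within-group edges capacity 2 makes the min cut unique.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from(
    [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),    # dense group {a, b, c}
     ("d", "e"), ("e", "d"), ("d", "f"), ("f", "e")],   # dense group {d, e, f}
    capacity=2,
)
G.add_edge("c", "d", capacity=1)                        # the one weak link across

# max-flow based min cut between a chosen source and sink
cut_value, (U, W) = nx.minimum_cut(G, "a", "f")
print(cut_value, sorted(U), sorted(W))                  # 1 ['a', 'b', 'c'] ['d', 'e', 'f']

threshold = 2                                           # an illustrative choice
if cut_value >= threshold:
    print("cut is large enough: report the nodes as one web community")
else:
    print("cut is small: remove the cut edges and recurse on", sorted(U), "and", sorted(W))
```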

  46. [Figure] An example digraph with nodes a, b, c, d, e, f, g, h, i, j

  47. [Figure] The example digraph with nodes a, b, c, d, e, f, g, h, j

  48. [Figure] The example digraph with nodes a, b, c, d, e, f, g, j

  49. Efficient Identification of Web Communities • A heuristic implementation of the minimum cut paradigm for web communities • Gary William Flake, Steve Lawrence, and C. Lee Giles • Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), pp. 150-160, August 2000, Boston, USA

  50. Problem Description • Given some web pages, • Problem: find a community of related pages. • Community: a set of web pages that link (in either direction) to more web pages in the community than to pages outside the community
