This lecture covers the structure of the web, including hyperlink structures, the relevance of closely linked web pages, and the physical structure of the Internet. It also explores the features of HTTP and HTML in the context of web publishing. Interesting websites and the concept of directed virtual links are discussed.
Lecture 6 (Week 6): Web (Social) Structure Mining • Hypertext Transfer Protocol • Hyperlink structures of the web • Relevance of closely linked web pages • In-degree as a measure of popularity • Physical structure of the Internet • Interesting websites: • http://prj61/GoogleTest3/GoogleSearch.aspx • http://www.google.com.hk/
Hyperlinks • The Hypertext Transfer Protocol (HTTP/1.0) • An application-level protocol • with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. • A generic, stateless, object-oriented protocol • which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). • An important feature: the typing of data representations allows systems to be built independently of the data being transferred • Based on the request/response paradigm • In use by the WWW global information initiative since 1990. • http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt
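To make the request/response paradigm and statelessness concrete, here is a minimal Python sketch of an HTTP/1.0 GET issued over a raw TCP socket. It is illustrative only: the function name is invented, and example.com is a placeholder host.

```python
import socket

def http_get(host, path="/", port=80):
    """Issue one stateless HTTP/1.0 request and return the raw response."""
    with socket.create_connection((host, port)) as sock:
        # Request line plus headers; a blank line ends the request.
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        response = b""
        # An HTTP/1.0 server closes the connection after responding,
        # so read to end-of-stream; no state survives the request.
        while chunk := sock.recv(4096):
            response += chunk
    return response.decode("iso-8859-1")

print(http_get("example.com")[:200])  # status line and headers come first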
HTML (Hypertext Markup Language) • HTML - the Hypertext Markup Language - is the lingua franca for publishing on the World Wide Web. Having gone through several stages of evolution, today's HTML has a wide range of features reflecting the needs of a very diverse and international community wishing to make information available on the Web. • see: http://www.w3.org/MarkUp/
Web Structure • Directed Virtual Links • Special Features: Huge, Unknown, Dynamic • Directed Graph Functions: backlink, shortest path, cliques
An Unknown Digraph • Hyperlinks pointing from each web page to other web pages form a virtual directed graph • Hyperlinks are added and deleted at will by individual web page authors • Web pages may not know their incoming hyperlinks • The digraph is dynamic: • Central control of the hyperlinks is not possible
The digraph is dynamic • Search engines can only map a fraction of the whole web space • Even if we could manage the size of the digraph, its dynamic nature requires constant updating of the map • There are web pages that are not documented by any search engine.
The structure of the Digraph • Nodes: web pages (URLs) • Directed edges: hyperlinks from one web page to another • Content of a node: the content contained in its associated web page • The dynamic nature of the digraph: • For some nodes, there are outgoing edges we do not yet know: • some nodes have not yet been processed • new edges (hyperlinks) may have been added to some nodes • For all nodes, there are some incoming edges which we do not yet know.
Useful Functions of the Digraph • backlink(the_url): • find all the URLs that point to the_url • shortest_path(url1, url2): • return the shortest path from url1 to url2 • maximal_clique(url): • return a maximal clique that contains url
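As a rough illustration (not from the original slides), the first two functions can be sketched over a simple adjacency-map digraph. Maximal cliques are omitted here because clique-finding is far more expensive, a point the later "Drawbacks" slide returns to.

```python
from collections import deque

class WebDigraph:
    """A toy web digraph: each URL maps to the set of URLs it links to."""

    def __init__(self):
        self.out_links = {}

    def add_link(self, src, dst):
        self.out_links.setdefault(src, set()).add(dst)
        self.out_links.setdefault(dst, set())

    def backlinks(self, the_url):
        # Scan all nodes for an outgoing edge to the_url.
        return [u for u, targets in self.out_links.items() if the_url in targets]

    def shortest_path(self, url1, url2):
        # Breadth-first search, recording each node's parent.
        parent, queue = {url1: None}, deque([url1])
        while queue:
            u = queue.popleft()
            if u == url2:                    # reconstruct the path backwards
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]
            for v in self.out_links.get(u, ()):
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None                          # url2 not reachable from url1
```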
[Figure: an ordinary digraph H with 7 vertices (v0-v6) and 12 edges]
[Figure: a partially unknown weighted digraph H with 7 vertices and 12 edges; node v5 is not yet explored, so we do not know its outgoing edges even though we know of its existence (by its URL)]
Map of the hyperlink space • To construct it, one needs • a spider to automatically collect URLs • a graph class to store information for nodes (URLs) and links (hyperlinks) • The whole digraph (URLs, hyperlinks) is huge: • 162,128,493 hosts (July 2002 data from http://www.isc.org/ds/WWW-200207/index.html) • One may need graph algorithms that work with secondary memory
Spiders • Automatically retrieve web pages: • start with a URL • retrieve the associated web page • find all URLs on the web page • recursively retrieve not-yet-searched URLs • Algorithmic issues: • How to choose the next URL? • Avoid overloading sub-networks
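A toy breadth-first spider sketching this loop follows. It is an assumption-laden illustration: a FIFO queue is one simple answer to "which URL next?", the crude regex only catches absolute href links, and it ignores robots.txt and the per-host rate limiting a real spider needs to avoid overloading sub-networks.

```python
import re
import urllib.request
from collections import deque

def crawl(start_url, max_pages=50):
    """Breadth-first crawl starting from one URL; returns all URLs seen."""
    seen, queue, fetched = {start_url}, deque([start_url]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                        # skip pages that fail to load
        fetched += 1
        # Enqueue every not-yet-searched URL found on this page.
        for link in re.findall(r'href="(http[^"]+)"', page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```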
Spider Architecture • [Figure: several url_spider processes share a URL pool; each spider gets a URL from the pool, issues an HTTP request to the web space, receives the HTTP response, adds newly discovered URLs back to the pool, and stores results in a database through a database interface]
The spider • Does this automatically (without clicking on a link or typing a URL) • It is an automated program that searches the web: • read a web page • store/index the relevant information on the page • follow all the links on the page (and repeat the above for each link)
Internet Growth Charts • [Figures: charts of Internet growth over time]
Partial Map • A partial map may be enough for some purposes • e.g., we are often interested in only a small portion of the whole Internet • A partial map may be constructed within the memory of an ordinary PC, and may therefore allow fast performance. • However, we may not be able to collect all the information necessary for our purpose • e.g., back links to a URL
Back link • Hyperlinks on the web are forward links. • One web page may not know the hyperlinks that point to itself • authors of web pages can freely link to other documents in cyberspace without consent from their authors • Back links may be of value • in scientific literature, the SCI (Science Citation Index) is an important index for judging the academic value of an article • www.isinet.com • It is not easy to find all the back links
Discovery of Back links • Provided in the advanced search features of several search engines • Searching with Google is done by typing • link:url • as your keyword • Example: 1. go to www.google.com and 2. type link:http://www.cs.cityu.edu.hk/~lwang • Homework: find out how to retrieve back links from other search engines • http://decweb.ethz.ch/WWW7/1938/com1938.htm • Surfing the Web Backwards • www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html
Web structure mining • Some information is embedded in the digraph • Usually, the hyperlinks from a web page to other web pages are chosen because, in the view of the web page author, they are important and contain useful related information • e.g., fans of football may all have links pointing to their favorite football teams • Some basic technology tools for web structure mining: • a spider • a graph class • some relevant graph algorithms • a back link retrieval function
Intellectual Content of Hyperlink Structure • Fans • PageRank • Densely Connected Subgraphs as Communities
FAN: • Fans of a web page often put a link toward the web page. • This is usually done manually, after a user has accessed the web page and looked at its content.
[Figure: fans of a web page — many fan pages, each with a hyperlink pointing to the web page]
FANs as an indicator of popularity: • The more fans a web page has, the more popular it is. • The SCI (Science Citation Index), for example, is a well-established method to rate the importance of a research paper published in international journals. • It is somewhat controversial as a measure of importance, since some important work may not be popular • But it is a good indicator of influence
An objection to FANs as an indicator of importance: • Some of the most popular web pages are so well known that people may not put links to them in their web pages • On the assumption that some web pages are more important than others, how do we compare • a web page linked to by important web pages • with another linked to by less important web pages?
PageRank • Two factors influence the rank of a web page: • The rank of the web pages pointing to it • the higher the better • The number of outgoing links on those pages • the fewer the better
Definition • Web pages are ranked according to their PageRanks, calculated as follows: • Assume page A has pages T1...Tn which point to it (i.e., back links or citations). • Choose a parameter d, a damping factor, which can be set between 0 and 1 (usually d = 0.85) • C(A) is defined as the number of links going out of page A. The PageRank of page A is then: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Calculation of PageRank • Notice that the definition of PR(A) is cyclic: • ranks of web pages are used to calculate the ranks of web pages. • However, PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. • It is reported that PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation.
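A direct implementation of this iterative calculation — a small sketch, not Google's production algorithm. The graph is a plain dict mapping each page to the list of pages it links to, so C(T) is just the length of T's list.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) with synchronous updates.

    Assumes every page has at least one outgoing link; dangling pages
    would need special handling.
    """
    ranks = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_ranks = {}
        for page in out_links:
            # Sum PR(T)/C(T) over all pages T that link to this page.
            incoming = sum(ranks[t] / len(out_links[t])
                           for t in out_links
                           if page in out_links[t])
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# The three-page example from the next slides: a -> b, a -> c, b -> c, c -> a.
# With d = 1 this converges to the limit quoted there:
# PR(a) = PR(c) = 6/5 and PR(b) = 3/5.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph, d=1.0))
```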
An example • [Figure: a three-page digraph with nodes a, b, c and edges a→b, a→c, b→c, c→a]
PageRank of the example graph • Start with PR(a) = 1, PR(b) = 1, PR(c) = 1 • Apply PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • For simplicity, set d = 1, and recall that C() is the out-degree • After the first iteration, we have • PR(a) = 1, PR(b) = 1/2, PR(c) = 3/2 • After the second iteration, we have • PR(a) = 3/2, PR(b) = 1/2, PR(c) = 1 • Subsequent iterations: • a: 1, b: 3/4, c: 5/4 • a: 5/4, b: 1/2, c: 5/4 • In the limit: • PR(a) = 6/5, PR(b) = 3/5, PR(c) = 6/5
An example (iteration 1) • with C(a) = 2, C(b) = 1, C(c) = 1 and PR(a) = PR(b) = PR(c) = 1 • UPDATE: PR(a) = PR(c)/C(c) = 1 • PR(b) = PR(a)/C(a) = 1/2 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/2
An example (iteration 2) • with PR(a) = 1, PR(b) = 1/2, PR(c) = 3/2 • UPDATE: PR(a) = PR(c)/C(c) = 3/2 • PR(b) = PR(a)/C(a) = 1/2 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1/2 = 1
An example (iteration 3) • with PR(a) = 3/2, PR(b) = 1/2, PR(c) = 1 • UPDATE: PR(a) = PR(c)/C(c) = 1 • PR(b) = PR(a)/C(a) = 3/4 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/4 + 1/2 = 5/4
Bringing Order to the Web • Used maps containing as many as 518 million of these hyperlinks. • These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. • For most popular subjects, a simple text-matching search restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). • For the full-text searches in the main Google system, PageRank also helps a great deal. • As reported in "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page
Do Densely Connected Sub-graphs Represent Web Sub-communities? • Inferring Web Communities from Link Topology • http://citeseer.nj.nec.com/36254.html • Efficient Identification of Web Communities • http://citeseer.nj.nec.com/flake00efficient.html • Friends and Neighbors on the Web • http://citeseer.nj.nec.com/adamic01friends.html
An idea: Complete sub-graphs • there is a group of URLs such that • each URL has a link to every other URL in the group • This is evidence that the author of each web page is interested in every other web page in the sub-group
Another idea: Complete bipartite sub-graphs • Complete bipartite graph: • two groups of nodes, U and V • for each node u in U and each node v in V • there is an edge from u to v • References • D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998. • T. Murata. Finding related web pages based on connectivity information from a search engine. In Proc. 10th WWW Conference.
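As a small illustration (an added sketch, reusing the adjacency-map representation from the earlier digraph example), checking whether two groups U and V form a complete bipartite sub-graph is direct:

```python
def is_complete_bipartite(out_links, U, V):
    """True iff every node u in U has an edge to every node v in V."""
    return all(v in out_links.get(u, set()) for u in U for v in V)
```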
Problem Description • Suppose one is familiar with some web pages on a specific topic, such as sports • Problem: find more pages about the same topic • Web community: a set of related web pages (the centers)
Search for fans using a search engine • Use the input URLs as initial centers • Search for URLs referring to all the centers by backlink search from the centers • A fixed number of high-ranking URLs are selected as fans
Adding a new URL to the centers • Acquire the fans' HTML files over the Internet • Extract the hyperlinks in the HTML files • Sort the hyperlinks in order of frequency • Add the top-ranking hyperlink to the centers • Delete fans not referring to all the centers
Web Community • Repeat the previous steps until few fans are left • The acquired centers are regarded as a WEB COMMUNITY
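The steps on the last three slides can be summarized in one loop. The sketch below is an interpretation of that procedure, not code from the cited paper: backlink_search() and extract_links() are hypothetical stand-ins for a search engine's backlink query and an HTML link extractor.

```python
from collections import Counter

def grow_community(initial_centers, max_fans=20, min_fans=3):
    """Grow a set of centers by repeatedly promoting the fans' top link."""
    centers = set(initial_centers)
    while True:
        # Fans: a fixed number of high-ranking URLs referring to ALL
        # current centers (re-querying each round also drops fans that
        # no longer refer to every center).
        fans = backlink_search(centers, limit=max_fans)  # hypothetical
        if len(fans) < min_fans:
            break                              # few fans left: stop
        # Count hyperlinks across the fans' HTML files by frequency.
        counts = Counter(link
                         for fan in fans
                         for link in extract_links(fan)  # hypothetical
                         if link not in centers)
        if not counts:
            break
        # Promote the top-ranking hyperlink to a new center.
        centers.add(counts.most_common(1)[0][0])
    return centers                             # the acquired web community
```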
Web Community • [Figure: fan pages on the left all link to the center pages on the right; the centers are pages that many web pages point to]
Drawbacks • A maximum clique is difficult to find (an NP-hard problem) • The rough idea of closely linked URLs is right, but a completely connected subgraph may be too strict: it is often the case that some links are missing • e.g., fans may not have hyperlinks to centers that were created after their own web pages
Minimum Cut Paradigm • A minimum cut of a digraph (V, A) is a partition of the node set V into two subsets U and W such that the number of edges from U to W is minimized. • It captures the notion that U and W are NOT closely linked. • Therefore, nodes in U are more closely related to each other than to nodes in W.
General approach • Find a minimum cut using a maximum flow algorithm • if the minimum cut is sufficiently large, keep the node set and report it as a web community • else • remove the edges of the minimum cut to split the digraph into two connected components • repeat on each of the two connected components
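A sketch of one splitting step using NetworkX's max-flow-based minimum cut. The unit capacities, the choice of seed pages s and t, and the size threshold are all assumptions added for illustration; the Flake et al. paper cited on a later slide uses a more careful construction.

```python
import networkx as nx

def split_once(edges, s, t, threshold=3):
    """Split a digraph along an s-t minimum cut unless the cut is large."""
    G = nx.DiGraph()
    G.add_edges_from(edges, capacity=1)    # unit capacity per hyperlink
    cut_value, (U, W) = nx.minimum_cut(G, s, t)
    if cut_value >= threshold:
        return [set(G)]       # sufficiently large cut: one community
    # Otherwise split along the cut; a fuller version would recurse
    # on each side with fresh seed nodes.
    return [U, W]
```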
[Figures: an example digraph on nodes a-j; successive minimum cuts remove edges and split off nodes (first i, then h) until the remaining subgraphs are densely connected]
Efficient Identification of Web Communities • A heuristic implementation of the minimum cut paradigm for web communities • Gary William Flake, Steve Lawrence, and C. Lee Giles • Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), pp. 150-160, August 2000, Boston, USA
Problem Description • Given some web pages, • Problem: find a community of related pages • Community: a set of web pages, each of which links (in either direction) to more web pages in the community than to pages outside the community
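This definition can be checked literally. The sketch below is an added illustration using the adjacency-map representation from the earlier digraph example; community is a set of pages.

```python
def is_community(out_links, community):
    """Each member must have more neighbours inside the set than outside."""
    for page in community:
        # Neighbours in either direction: pages this page links to,
        # plus pages that link to it.
        neighbours = set(out_links.get(page, ())) | {
            u for u, targets in out_links.items() if page in targets}
        if len(neighbours & community) <= len(neighbours - community):
            return False
    return True
```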