
Lecture 6: (Week 6) Web (Social) Structure Mining

This lecture covers the structure of the web, including hyperlink structures, the relevance of closely linked web pages, and the physical structure of the Internet. It also explores the features of HTTP and HTML in the context of web publishing. Interesting websites and the concept of directed virtual links are discussed.



  1. Lecture 6: (Week 6) Web (Social) Structure Mining • Hypertext Transfer Protocol • Hyperlink structures of the web • Relevance of closely linked web pages • In-degree as a measure of popularity • Physical structure of the Internet • Interesting websites: • http://prj61/GoogleTest3/GoogleSearch.aspx • http://www.google.com.hk/

  2. Hyperlinks • The Hypertext Transfer Protocol (HTTP/1.0) • An application-level protocol • with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. • A generic, stateless, object-oriented protocol • which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). • An important feature: the typing of data representations allows systems to be built independently of the data being transferred • Based on the request/response paradigm • In use by the WWW global information initiative since 1990. • http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt
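
To make the request/response paradigm concrete, here is a minimal sketch (not from the original slides) that issues an HTTP/1.0 request over a raw TCP socket in Python; the host example.com is only a placeholder.

```python
# Minimal illustration of the HTTP/1.0 request/response paradigm over a raw
# TCP socket. The host "example.com" is only a placeholder.
import socket

HOST, PORT = "example.com", 80

request = (
    "GET / HTTP/1.0\r\n"          # request line: method, resource, protocol version
    f"Host: {HOST}\r\n"           # Host header (optional in HTTP/1.0)
    "\r\n"                        # an empty line ends the request headers
)

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):           # the server closes the connection when done
        response += chunk

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("iso-8859-1"))           # status line and headers, e.g. "HTTP/1.0 200 OK"
```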

  3. HTML (Hypertext Markup Language) • HTML - the Hypertext Markup Language - is the lingua franca for publishing on the World Wide Web. Having gone through several stages of evolution, today's HTML has a wide range of features reflecting the needs of a very diverse and international community wishing to make information available on the Web. • see: http://www.w3.org/MarkUp/
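
Since the hyperlink structure studied in the rest of this lecture is carried by HTML's <a href> anchors, the following illustrative Python sketch extracts them with the standard html.parser module; the sample markup is invented.

```python
# Extract the hyperlinks (<a href="..."> anchors) from an HTML document using
# the standard html.parser module. The sample page is invented for illustration.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                          # anchors carry the hyperlink structure
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

sample_page = """
<html><body>
  <p>See <a href="http://www.w3.org/MarkUp/">the HTML spec</a> and
     <a href="http://www.google.com.hk/">a search engine</a>.</p>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_page)
print(parser.links)   # ['http://www.w3.org/MarkUp/', 'http://www.google.com.hk/']
```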

  4. Web Structure • Directed Virtual Links • Special Features: Huge, Unknown, Dynamic • Directed Graph Functions: backlink, shortest path, cliques

  5. An Unknown Digraph • The hyperlinks pointing from each web page to other web pages form a virtual directed graph • Hyperlinks are added and deleted at will by individual web page authors • A web page may not know its incoming hyperlinks • The digraph is dynamic: • Central control of the hyperlinks is not possible

  6. The digraph is dynamic • Search engines can only map a fraction of the whole web space • Even if we can manage the size of the digraph, its dynamic nature requires constant updating of the map • There are web pages that are not indexed by any search engine.

  7. The structure of the Digraph • Nodes: web pages (URLs) • Directed edges: hyperlinks from one web page to another • Content of a node: the content of its associated web page • Dynamic nature of the digraph: • For some nodes, there are outgoing edges we do not know yet • Nodes that have not yet been processed • New edges (hyperlinks) may have been added to some nodes • For all nodes, there may be incoming edges we do not yet know

  8. Useful Functions of the Digraph • Backlink (the_url): • find all the URLs that point to the_url • Shortest_path (url1, url2): • return the shortest path from url1 to url2 • Maximal_clique (url): • return a maximal clique that contains url
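
A minimal sketch of such a graph class in Python, supporting the backlink and shortest-path functions above (the maximal-clique function is omitted; clique finding is hard, as a later slide notes). The class and method names are illustrative, not from the lecture.

```python
# A small digraph of URLs supporting the backlink and shortest-path
# functions listed above. Names (WebDigraph, etc.) are illustrative only.
from collections import defaultdict, deque

class WebDigraph:
    def __init__(self):
        self.out_links = defaultdict(set)   # url -> set of urls it points to
        self.in_links = defaultdict(set)    # url -> set of urls pointing to it

    def add_edge(self, src, dst):
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)

    def backlink(self, the_url):
        """All URLs that point to the_url."""
        return set(self.in_links[the_url])

    def shortest_path(self, url1, url2):
        """Shortest directed path from url1 to url2 (breadth-first search), or None."""
        parent = {url1: None}
        queue = deque([url1])
        while queue:
            u = queue.popleft()
            if u == url2:                        # rebuild the path back to url1
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]
            for v in self.out_links[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None

# Tiny usage example with made-up URLs:
g = WebDigraph()
g.add_edge("a.html", "b.html")
g.add_edge("b.html", "c.html")
g.add_edge("a.html", "c.html")
print(g.backlink("c.html"))                  # {'a.html', 'b.html'} (set order may vary)
print(g.shortest_path("a.html", "c.html"))   # ['a.html', 'c.html']
```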

  9. [Figure] An ordinary digraph H with 7 vertices V0–V6 and 12 edges

  10. [Figure] A partially unknown digraph H with 7 vertices and 12 edges, but node V5 is not yet explored. We do not know the outgoing edges from V5, though we know it exists (by its URL).

  11. Map of the hyperlink space • To construct it, one needs • a spider to automatically collect URLs • a graph class to store information for nodes (URLs) and links (hyperlinks) • The whole digraph (URLs, HYPERLINKs) is huge: • 162,128,493 hosts (July 2002 data from http://www.isc.org/ds/WWW-200207/index.html ) • One may need graph algorithms that work with secondary memory

  12. Spiders • Automatically retrieve web pages • Start with a URL • retrieve the associated web page • find all URLs on the web page • recursively retrieve not-yet-searched URLs • Algorithmic issues • How to choose the next URL? • Avoid overloading sub-networks
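
A hedged sketch of this crawl loop using only the Python standard library; the seed URL, page limit, and politeness delay are illustrative choices, and real spiders need far more care (robots.txt, per-host rate limits, URL normalisation).

```python
# A minimal breadth-first spider: fetch a page, extract its hyperlinks, and
# enqueue URLs not yet seen. Seed URL, page limit and delay are illustrative.
import re
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)   # crude absolute-link extractor

def crawl(seed, max_pages=10, delay=1.0):
    seen, frontier = {seed}, deque([seed])
    link_graph = {}                            # url -> list of out-links found on that page
    while frontier and len(link_graph) < max_pages:
        url = frontier.popleft()               # which URL to visit next is a policy choice
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                           # unreachable, deleted, or unreadable page
        out = [urljoin(url, h) for h in HREF_RE.findall(html)]
        link_graph[url] = out
        for u in out:
            if u not in seen:                  # recurse only on not-yet-searched URLs
                seen.add(u)
                frontier.append(u)
        time.sleep(delay)                      # crude politeness: avoid overloading hosts
    return link_graph

# Example (commented out so the script does not need network access):
# graph = crawl("http://www.google.com.hk/", max_pages=3)
```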

  13. [Figure] Spider Architecture • several url_spider processes (the spiders) share one URL pool • each spider gets a URL from the pool, sends an HTTP request into the web space, and receives the HTTP response • newly found URLs are added to the shared URL pool • retrieved content is stored through a database interface into a database
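
The architecture above can be sketched with worker threads sharing one URL pool; the fetch step is stubbed out, and the names (url_spider, database, etc.) follow the diagram labels, but the code itself is only an assumption of how such a pool might be wired up.

```python
# Sketch of the spider architecture above: several url_spider workers share one
# URL pool (a thread-safe queue) and write results to a shared store. The fetch
# step is stubbed out; a real spider would issue the HTTP request and parse the
# response there. All of this is an illustrative assumption, not the lecture's code.
import queue
import threading

url_pool = queue.Queue()          # the shared URL pool
database = {}                     # stands in for the database behind the database interface
db_lock = threading.Lock()

def fetch(url):
    """Placeholder for the HTTP request / HTTP response step."""
    return f"<html>page at {url}</html>", []             # (page content, URLs found on it)

def url_spider():
    while True:
        url = url_pool.get()                             # "Get an URL" from the pool
        if url is None:                                   # sentinel: shut this spider down
            url_pool.task_done()
            return
        content, new_urls = fetch(url)
        with db_lock:
            database[url] = content                       # store via the database interface
        for u in new_urls:
            url_pool.put(u)                               # "Add a new URL" to the pool
        url_pool.task_done()

spiders = [threading.Thread(target=url_spider) for _ in range(5)]
for s in spiders:
    s.start()
for seed in ["http://example.org/a", "http://example.org/b"]:
    url_pool.put(seed)
url_pool.join()                                           # wait until the pool is drained
for _ in spiders:
    url_pool.put(None)                                    # stop the workers
for s in spiders:
    s.join()
print(len(database), "pages stored")
```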

  14. The spider • does this automatically (without clicking on a link or typing a URL) • It is an automated program that searches the web: • read a web page • store/index the relevant information on the page • follow all the links on the page (and repeat the above for each link)

  15. [Figure] Internet Growth Charts

  16. [Figure] Internet Growth Charts

  17. Partial Map • A partial map may be enough for some purposes • e.g., we are often interested in only a small portion of the whole Internet • A partial map may be constructed within the memory of an ordinary PC, so it may allow fast performance • However, we may not be able to collect all the information that is necessary for our purpose • e.g., back links to a URL

  18. Back link • Hyperlinks on the web are forward links. • A web page may not know the hyperlinks that point to it • authors of web pages can freely link to other documents in cyberspace without consent from those documents' authors • Back links may be of value • in the scientific literature, the SCI (Science Citation Index) is an important index for judging the academic value of an article • www.isinet.com • It is not easy to find all the back links
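
Within a crawled portion of the web, back links can at least be recovered by inverting the forward-link map, as in this small sketch (the link data is invented):

```python
# Back links are not recorded on the web itself, but within a crawled collection
# they can be recovered by inverting the forward-link map. The data is invented.
from collections import defaultdict

forward_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

back_links = defaultdict(set)
for src, targets in forward_links.items():
    for dst in targets:
        back_links[dst].add(src)              # dst is pointed to by src

print(sorted(back_links["c.html"]))           # ['a.html', 'b.html']
```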

  19. Discovery of Back links • Provided in the advanced search features of several search engines • A search on Google is done by typing • link:url • as your keyword • Example: 1. go to www.google.com and 2. type link:http://www.cs.cityu.edu.hk/~lwang • Homework: find out how to retrieve back links from other search engines • http://decweb.ethz.ch/WWW7/1938/com1938.htm • Surfing the Web Backwards • www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html

  20. Web structure mining • Some information is embedded in the digraph • Usually, the hyperlinks from a web page to other web pages are chosen because, in the view of the web page's author, they are important and contain useful related information • e.g., fans of football may all have links pointing to their favorite football teams • Some basic technology tools for web structure mining: • a spider • a graph class • some relevant graph algorithms • a back link retrieval function

  21. Intellectual Content of Hyperlink Structure • Fans • Page Rank • Densely Connected Subgraphs as Communities

  22. FAN: • Fans of a web page often put a link toward the web page. • This is usually done manually after a user has accessed the web page and looked at its content.

  23. [Figure] Fans of a web page (fan pages, each pointing to the web page)

  24. FANs as an indicator of popularity: • The more fans a web page has, the more popular it is. • The SCI (Science Citation Index), for example, is a well-established method for rating the importance of a research paper published in international journals. • It is somewhat controversial as a measure of importance, since some important work may not be popular • But it is a good indicator of influence

  25. An objection to FANs as an indicator of importance: • Some of the most popular web pages are so well known that people may not put links to them in their web pages • On the assumption that some web pages are more important than others, how do we compare • a web page linked to by important web pages with • another linked to by less important web pages?

  26. PageRank • Two factors influence the rank of a web page: • The rank of the web pages pointing to it • the higher the better • The number of outgoing links in the web pages pointing to it • the fewer the better

  27. Definition • Web pages are ranked according to their PageRank, calculated as follows: • Assume page A has pages T1...Tn which point to it (i.e., back links or citations). • Choose a damping factor d between 0 and 1 (usually d = 0.85). • C(A) is defined as the number of links going out of page A. The PageRank of page A is then: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  28. Calculation of PageRank • Notice that the definition of PR(A) is cyclic: • i.e., the ranks of web pages are used to calculate the ranks of other web pages • However, PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. • It is reported that the PageRank of 26 million web pages can be computed in a few hours on a medium-size workstation.

  29. [Figure] An example: a three-page digraph with pages a, b, and c (a links to b and c, b links to c, c links to a)

  30. PageRank of the example graph • Start with PR(a)=1, PR(b)=1, PR(c)=1 • Apply PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • For simplicity, set d=1, and recall that C(·) is the out-degree • After the first iteration, we have • PR(a)=1, PR(b)=1/2, PR(c)=3/2 • After the second iteration, we have • PR(a)=3/2, PR(b)=1/2, PR(c)=1 • Subsequent iterations: • a:1 b:3/4 c:5/4 • a:5/4 b:1/2 c:5/4 • in the limit • PR(a)=6/5, PR(b)=3/5, PR(c)=6/5

  31. An example (first update) • Before: PR(a)=1, PR(b)=1, PR(c)=1, with C(a)=2, C(b)=1, C(c)=1 • Update: PR(a) = PR(c)/C(c) = 1 • PR(b) = PR(a)/C(a) = 1/2 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/2

  32. An example (second update) • Before: PR(a)=1, PR(b)=1/2, PR(c)=3/2, with C(a)=2, C(b)=1, C(c)=1 • Update: PR(a) = PR(c)/C(c) = 3/2 • PR(b) = PR(a)/C(a) = 1/2 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1/2 = 1

  33. An example (third update) • Before: PR(a)=3/2, PR(b)=1/2, PR(c)=1, with C(a)=2, C(b)=1, C(c)=1 • Update: PR(a) = PR(c)/C(c) = 1 • PR(b) = PR(a)/C(a) = 3/4 • PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/4 + 1/2 = 5/4
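
The iterations on slides 30-33 can be reproduced with a short script; this is a sketch, assuming d = 1 as the lecture does for simplicity, and updating every page from the previous iteration's values.

```python
# Iterative PageRank for the example (a -> b, a -> c, b -> c, c -> a), using
# PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) over the pages T
# that link to A. d = 1 here only to match the lecture; Google uses d ~ 0.85.

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(out_links)
C = {p: len(out_links[p]) for p in pages}                     # out-degrees
backlinks = {p: [q for q in pages if p in out_links[q]] for p in pages}

def iterate(pr, d=1.0):
    # update every page from the previous iteration's values
    return {p: (1 - d) + d * sum(pr[t] / C[t] for t in backlinks[p]) for p in pages}

pr = {p: 1.0 for p in pages}                                  # start with PR = 1 everywhere
for i in range(20):
    pr = iterate(pr)
    print(i + 1, {p: round(pr[p], 3) for p in pages})
# The first iterations give (1, 1/2, 3/2), (3/2, 1/2, 1), (1, 3/4, 5/4), (5/4, 1/2, 5/4),
# approaching the limit PR(a) = 6/5, PR(b) = 3/5, PR(c) = 6/5 as on slide 30.
```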

  34. Bringing Order to the Web • Used maps containing as many as 518 million of these hyperlinks. • These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. • For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). • For the type of full text searches in the main Google system, PageRank also helps a great deal. • As reported in "The Anatomy of a Large-Scale Hypertextual Web Search Engine", by Sergey Brin and Lawrence Page

  35. Do Densely Connected Sub-graphs Represent Web Sub-communities? • Inferring Web Communities from Link Topology • http://citeseer.nj.nec.com/36254.html • Efficient Identification of Web Communities • http://citeseer.nj.nec.com/flake00efficient.html • Friends and Neighbors on the Web • http://citeseer.nj.nec.com/adamic01friends.html

  36. An idea: Complete sub-graphs • There is a group of URLs such that • each URL has a link to every other URL in the group • This is evidence that the author of each web page is interested in every other web page in the group

  37. Another idea: Complete bipartite sub-graphs • Complete bipartite graph: • two groups of nodes, U and V • for each node u in U and each node v in V • there is an edge from u to v • References • D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998. • T. Murata. Finding related web pages based on connectivity information from a search engine. The 10th WWW Conference
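
A small sketch of checking the complete-bipartite condition (every page in U links to every page in V) on a toy link map; the data is invented.

```python
# Check the complete-bipartite condition from this slide: every page u in U has
# a hyperlink to every page v in V. The toy link map is invented.
def is_complete_bipartite(out_links, U, V):
    return all(v in out_links.get(u, ()) for u in U for v in V)

out_links = {
    "fan1": {"team_a", "team_b"},
    "fan2": {"team_a", "team_b", "news"},
    "fan3": {"team_a"},
}
print(is_complete_bipartite(out_links, {"fan1", "fan2"}, {"team_a", "team_b"}))  # True
print(is_complete_bipartite(out_links, {"fan1", "fan3"}, {"team_a", "team_b"}))  # False
```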

  38. Problem Description • Suppose one is familiar with some web pages on a specific topic, such as sports • Problem: to find more pages about the same topic • Web community: an entity of related web pages (the centers)

  39. Search for fans using a search engine • Use the input URLs as the initial centers • Search for URLs referring to all the centers by backlink search from the centers • A fixed number of high-ranking URLs are selected as fans

  40. Adding a new URL to the centers • Acquire the fans' HTML files over the Internet • Extract the hyperlinks in the HTML files • Sort the hyperlinks in order of frequency • Add the top-ranking hyperlink to the centers • Delete fans not referring to all the centers

  41. Web Community • Repeat the previous steps until few fans are left • The acquired centers are regarded as a WEB COMMUNITY
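
A sketch of the whole procedure from slides 38-41, run on a tiny invented link dataset; backlink_search() and fetch_links() stand in for the search-engine backlink query and the HTML download/parsing steps.

```python
# Sketch of the center/fan procedure from slides 38-41 on a toy link map.
# backlink_search() and fetch_links() stand in for the search-engine backlink
# query and the HTML download + link extraction steps; the data is invented.
from collections import Counter

out_links = {
    "fan1": {"team_a", "team_b"},
    "fan2": {"team_a", "team_b"},
    "fan3": {"team_a", "team_b"},
    "other": {"news"},
}

def backlink_search(url):
    """URLs linking to `url` (a real system would query a search engine)."""
    return {p for p, links in out_links.items() if url in links}

def fetch_links(url):
    """Hyperlinks on the page at `url` (a real system would fetch and parse HTML)."""
    return out_links.get(url, set())

def find_community(seed_centers, min_fans=2, max_centers=5):
    centers = set(seed_centers)
    while len(centers) < max_centers:
        # fans: pages referring to ALL current centers (slide 39); recomputing this
        # every round also drops fans that do not refer to a newly added center (slide 40)
        fans = set.intersection(*(backlink_search(c) for c in centers))
        if len(fans) < min_fans:
            break                                    # few fans left: stop (slide 41)
        # count hyperlinks in the fans' pages and promote the most frequent new one
        counts = Counter(l for f in fans for l in fetch_links(f) if l not in centers)
        if not counts:
            break
        centers.add(counts.most_common(1)[0][0])
    return centers                                   # the acquired centers = the community

print(sorted(find_community({"team_a"})))            # ['team_a', 'team_b']
```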

  42. [Figure] Web community: fans point to the centers • Centers: many web pages go there

  43. Drawbacks • A maximum clique is difficult to find (an NP-hard problem) • The rough idea of closely linked URLs is right, but a completely connected subgraph may be too strict: it is often the case that some links are missing • fans may not have hyperlinks to centers that were created after their own web pages were created

  44. Minimum Cut Paradigm • Given the drawbacks of the clique approach above, relax complete subgraphs to a cut-based notion of closeness: • A minimum cut of a digraph (V,A) is a partition of the node set V into two subsets U and W such that the number of edges from U to W is minimized. • It captures the notion that U and W are NOT closely linked. • Therefore, nodes in U are more closely related to each other than to nodes in W.

  45. General approach • Find a minimum cut using a maximum-flow algorithm • if the minimum cut is sufficiently large, keep it and report the nodes as a web community • else • remove the edges in the minimum cut to split the digraph into two connected components • repeat on each of the two components
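
A hedged sketch of one step of this approach using the networkx package's max-flow based minimum_cut; choosing the source/sink pair and the edge capacities is an implementation detail not fixed by the slide, and the toy graph is invented.

```python
# One step of the minimum-cut paradigm using the networkx package (an external
# dependency). The toy digraph is invented: two densely linked groups joined by
# a single weak link; giving within-group edges capacity 2 makes the min cut unique.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from(
    [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),    # dense group {a, b, c}
     ("d", "e"), ("e", "d"), ("d", "f"), ("f", "e")],   # dense group {d, e, f}
    capacity=2,
)
G.add_edge("c", "d", capacity=1)                        # the one weak link across

# max-flow based min cut between a chosen source and sink
cut_value, (U, W) = nx.minimum_cut(G, "a", "f")
print(cut_value, sorted(U), sorted(W))                  # 1 ['a', 'b', 'c'] ['d', 'e', 'f']

threshold = 2                                           # an illustrative choice
if cut_value >= threshold:
    print("cut is large enough: report the nodes as one web community")
else:
    print("cut is small: remove the cut edges and recurse on", sorted(U), "and", sorted(W))
```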

  46. [Figure] An example digraph with nodes a, b, c, d, e, f, g, h, i, j

  47. [Figure] The example digraph with nodes a, b, c, d, e, f, g, h, j

  48. [Figure] The example digraph with nodes a, b, c, d, e, f, g, j

  49. Efficient Identification of Web Communities • A heuristic implementation of the minimum cut paradigm for web communities • Gary William Flake, Steve Lawrence, and C. Lee Giles • Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), pp. 150-160, August 2000, Boston, USA

  50. Problem Description • Given some web pages, • Problem: find a community of related pages. • Community: a set of web pages that link (in either direction) to more web pages in the community than to pages outside the community
