
Web Communities




  1. Web Communities Prasanna Desikan (06/13/2002)

  2. Definition Web community: • Groups of individuals who share common interests, together with the web pages most popular among them. • Web page collections with a shared topic.

  3. Types of Communities • Explicitly defined. • Communities that manifest themselves as newsgroups or as resource collections in directories such as Yahoo! • Implicitly defined. • Communities that result from the nature of content creation on the web.

  4. Terms and Definitions • Directed Bipartite Graph: A graph whose node set can be partitioned into two sets F and C such that every directed edge goes from a node u in F to a node v in C.

  5. Terms and Definitions • Complete Bipartite Graph: A bipartite graph that contains every possible edge from a vertex of F to a vertex of C. • Core: A complete bipartite sub-graph with at least i nodes from F and at least j nodes from C. • In the web world, the i pages that contain the links are referred to as ‘fans’ and the j pages that are referenced are referred to as ‘centers’.
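The core definition above can be checked mechanically. The following is a minimal sketch, assuming links are stored as a set of (fan, center) pairs; the function name `is_core` and the sample page names are illustrative and do not come from the slides.

```python
def is_core(edges, fans, centers, i, j):
    """Return True if fans x centers forms a complete bipartite (i, j) core.

    edges is a set of (source, target) pairs, read as "source links to target".
    """
    fans, centers = set(fans), set(centers)
    # The two sides of a bipartite graph must not share nodes.
    if fans & centers:
        return False
    # Size requirement: at least i fans and at least j centers.
    if len(fans) < i or len(centers) < j:
        return False
    # Completeness: every fan must link to every center.
    return all((f, c) in edges for f in fans for c in centers)


links = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")}
print(is_core(links, {"f1", "f2"}, {"c1", "c2"}, i=2, j=2))        # True
print(is_core(links, {"f1", "f2", "f3"}, {"c1", "c2"}, i=3, j=2))  # False: f3 does not link to c2
```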

  6. Inferring Web Communities From Link Topology • A community is a core of central authoritative pages linked together by hub pages. • Identify communities corresponding to the principal and non-principal eigenvectors discovered by HITS. • For communities on broad topics, the grouping of pages discovered is relatively independent of the exact choice of root set.

  7. Inferring Web Communities From Link Topology Findings on Structure of Communities. • Robustness: For broad topics, HITS produces stable, robust communities. • Topic Generalization: HITS tends to generalize topics that are not broad. • “Michael Jordan” produces links to pages on MJ and his team. • “Dennis Ritchie” produces links that reference the “C Programming Language.”

  8. Inferring Web Communities From Link Topology • Other Generalization: HITS tends to converge on topics with greater density of linkage. • E.g., for a query on “linguistics”, the top authorities focus on the sub-topic “computational linguistics” because of its greater density of linkage on the web. • Temporal Issues: To obtain the long-term “core” of a topic, we can superimpose the results of HITS runs on the same topic taken several months apart.
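The slides rely on HITS without spelling it out. The following is a minimal sketch of the standard hub/authority iteration, which converges toward the principal eigenvector mentioned above; the non-principal eigenvectors used to find further communities are not computed here. The function name `hits`, the fixed iteration count, and the toy graph are assumptions for illustration.

```python
import math


def hits(edges, iterations=50):
    """Run the basic HITS iteration on a list of (source, target) links.

    Returns (hub, auth): two dicts mapping each page to its score.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


toy_links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(toy_links)
print(max(auth, key=auth.get))  # a2, the page with the most hub support
```

Pages that are pointed to by many good hubs end up with high authority, which is why densely linked sub-topics can dominate the result.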

  9. Trawling the Web for Emerging Web Communities • Trawling: Systematic enumeration of emerging communities from a web crawl. • Scan through a web crawl and identify all instances of graph structures that are indicative signatures of communities.

  10. Trawling the Web for Emerging Web Communities Data Source: A copy of the web from Alexa. Pre-processing data. • Identify potential fan pages (a page that has links to at least six different websites): out of 200 million pages, around 24 million were extracted. • Eliminate mirrors (this removed around 60% of the 24 million pages).
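The fan-page filter above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a "website" is identified by the URL host name and that links back to the page's own site do not count, neither of which is stated on the slide. The function name and URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_potential_fan(page_url, out_links, min_sites=6):
    """Return True if the page links to at least min_sites distinct other sites."""
    own_site = urlparse(page_url).netloc.lower()
    sites = {urlparse(link).netloc.lower() for link in out_links}
    sites.discard(own_site)  # assumption: self-links do not count
    sites.discard("")        # ignore relative links with no host part
    return len(sites) >= min_sites


sample_links = [f"http://site{n}.example/page" for n in range(1, 7)]
print(is_potential_fan("http://home.example/links", sample_links))  # True
```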

  11. Trawling the Web for Emerging Web Communities • Prune by in-degree. • Eliminate all pages that have an in-degree greater than a threshold value k; k is set to 50 in the experiments. • Iterative pruning. • When looking for (i, j) cores, any potential fan with out-degree smaller than j can be pruned and the corresponding edges deleted from the graph.
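Both pruning steps can be sketched on an edge set of (fan, center) pairs. The fan rule follows the slide; the symmetric rule for centers (drop a center whose in-degree falls below i) is not stated here and is included as an assumption. Function names and the sample data are illustrative.

```python
from collections import Counter


def prune_by_in_degree(edges, k=50):
    """Drop links into pages whose in-degree exceeds k (very popular pages)."""
    in_deg = Counter(c for _, c in edges)
    return {(f, c) for f, c in edges if in_deg[c] <= k}


def iterative_prune(edges, i, j):
    """Repeatedly remove fans with out-degree < j and centers with in-degree < i."""
    edges = set(edges)
    while True:
        out_deg = Counter(f for f, _ in edges)
        in_deg = Counter(c for _, c in edges)
        kept = {(f, c) for f, c in edges if out_deg[f] >= j and in_deg[c] >= i}
        if kept == edges:  # fixed point: nothing more to prune
            return kept
        edges = kept


sample = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"),
          ("f3", "c1"), ("f3", "c3")}
pruned = iterative_prune(sample, i=2, j=2)
print(sorted(pruned))  # only the four links among f1, f2, c1, c2 remain
```

Removing one page can push a neighbor below its threshold, which is why the pruning is repeated until nothing changes.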

  12. Trawling the Web for Emerging Web Communities • Inclusion-exclusion pruning. • Let {c1, c2, …, cj} be the centers adjacent to a fan x. • N(ct) = neighborhood of ct, the set of fans that point to ct. • x is part of a core if and only if the intersection of the sets N(ct) has size at least i. • Filter nepotistic cores.
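The intersection test above can be sketched as follows, assuming the fan x has exactly j out-links so that its centers are fully determined. The function name `core_for_fan` and the sample links are hypothetical.

```python
def core_for_fan(edges, x, i):
    """Return (fans, centers) if fan x belongs to a core with at least i fans, else None."""
    centers = {c for f, c in edges if f == x}
    if not centers:
        return None
    # N(ct): the set of fans pointing to center ct.
    neighborhoods = [{f for f, c in edges if c == ct} for ct in centers]
    common_fans = set.intersection(*neighborhoods)
    if len(common_fans) >= i:
        return common_fans, centers
    return None


links = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")}
print(core_for_fan(links, "f1", i=2))  # f1 and f2 share both centers, so a core is reported
print(core_for_fan(links, "f1", i=3))  # None
```

Either outcome lets x be removed from further consideration: a core is reported, or x is shown to belong to none.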

  13. Trawling the Web for Emerging Web Communities Evaluation of Communities. • Fossilization: 30% of communities were fossilized. • A fossil is a community core not all of whose fans exist on the web today. • Reliability: Only 4% of the trawled cores were coincidental, i.e., a collection of fan pages without any cogent theme unifying them.

  14. Trawling the Web for Emerging Web Communities • Quality: 56% of the communities were not in Yahoo as constructed from the crawl, and 29% were not in Yahoo at the time of the paper. • This indicates that trawling identifies emerging communities.

  15. Self Organization and Identification of Web Communities • Web community is defined as a collection of web pages such that each member page has more hyperlinks (in either direction) within the community than outside of the community. • Approach: Maximal Flow – Minimal Cut framework. • Benefits: Focused crawling, automatic population of portal categories.
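The community definition above translates directly into a membership test. This is a minimal sketch that counts each hyperlink once regardless of direction and ignores self-links; the function name and the two-triangle sample graph are assumptions for illustration.

```python
def is_community(edges, members):
    """Check that every member has more links inside the set than outside it."""
    members = set(members)
    for page in members:
        inside = outside = 0
        for u, v in edges:
            if u == v or page not in (u, v):
                continue
            other = v if u == page else u
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Two triangles joined by a single bridge link c-d.
graph = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
print(is_community(graph, {"a", "b", "c"}))  # True
print(is_community(graph, {"a", "b"}))       # False: a has one link inside and one outside
```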

  16. A Simple Community Identification Example Figure: Maximum flow methods will separate the two subgraphs with any choice of s and t that has s in the left subgraph and t in the right subgraph, removing the three dashed links.

  17. Approximate Flow Community

  18. Exact Flow Community

  19. Exact Flow Community • An artificial source ‘s’ is added, with infinite-capacity edges routed to all seed vertices in S. • Each pre-existing edge is made bi-directional and rescaled to a constant value k.

  20. Exact Flow Community • All vertices except the source, sink, and seed vertices are routed to the artificial sink with unit capacity. • A residual flow graph is produced by a maximum flow procedure. • All vertices accessible from s through non-zero positive edges form the desired result and satisfy our definition of a community.
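The construction on the two slides above can be sketched end to end. This is an illustrative implementation, not the authors' code: it uses a plain Edmonds-Karp (breadth-first augmenting path) maximum flow, and the names `flow_community`, `_source` and `_sink`, the value k = 2, and the sample graph are all assumptions.

```python
from collections import deque
from itertools import combinations


def _add_edge(res, u, v, capacity):
    """Add capacity on u->v and make sure the reverse residual entry exists."""
    res.setdefault(u, {})
    res.setdefault(v, {})
    res[u][v] = res[u].get(v, 0) + capacity
    res[v].setdefault(u, 0)


def _bfs(res, start):
    """Return a parent map for every vertex reachable through positive residual edges."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, capacity in res[u].items():
            if capacity > 0 and v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def flow_community(edges, seeds, k):
    s, t = "_source", "_sink"  # artificial vertices (names are placeholders)
    seeds = set(seeds)
    res = {s: {}, t: {}}
    undirected = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    vertices = {v for e in undirected for v in e} | seeds
    for u, v in undirected:  # every link becomes bi-directional with capacity k
        _add_edge(res, u, v, k)
        _add_edge(res, v, u, k)
    for seed in seeds:       # infinite capacity from the source to each seed
        _add_edge(res, s, seed, float("inf"))
    for v in vertices - seeds:  # unit capacity from each non-seed vertex to the sink
        _add_edge(res, v, t, 1)
    while True:
        parent = _bfs(res, s)
        if t not in parent:  # no augmenting path left: the flow is maximal
            return set(parent) - {s}
        bottleneck, v = float("inf"), t
        while parent[v] is not None:  # smallest residual capacity on the path
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:  # push the flow along the path
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]


# Two 4-cliques joined by one bridge link d-e; page "a" is the only seed.
sample = list(combinations("abcd", 2)) + list(combinations("efgh", 2)) + [("d", "e")]
community = flow_community(sample, seeds={"a"}, k=2)
print(sorted(community))  # ['a', 'b', 'c', 'd']
```

The final breadth-first search doubles as the last step on the slide: the vertices still reachable from the source in the residual graph are returned as the community.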

  21. Sample Results From Community Identification The scores are the total number of inbound and outbound links that a web page has to other pages that are also in the community.

  22. Characterization of Communities Table 3: The fifteen most significant text features for each community, sorted in descending order of the Kullback-Leibler metric.

  23. Discovering Seeds of New Interest Spread From Premature Pages. • A method for discovering topics that stimulate communities of people into earnest communication about the topics’ meaning and that grow into a trend of popular interest. • A community is a group of people sharing some value.

  24. Agora Method on Links • Archive page: The page of highest rank according to Google in a community. • Agora pages: Pages that are linked from multiple archive-pages but are not in any community themselves are taken as novel topics attracting multiple communities, and are called agora-topic pages.

  25. Agora Method on Links • Step 1: A query representing the user’s interest domain is entered into a search engine (Google here, obtaining 10^5 to 10^6 pages). • Step 2: Communities are extracted from the pages obtained in Step 1, and archive-pages are selected from those communities.

  26. Agora Method on Links • Step 3: Pages that are not in the communities but are linked from multiple archive-pages are obtained as agora-pages. With all results obtained at this point, the archive-pages (black nodes), the agora-pages (red nodes) and the links between them are visualized as in Fig. 1.
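Steps 2 and 3 can be sketched as set operations once the communities are known. The sketch assumes a `rank` dictionary standing in for the Google ranking and takes "multiple" to mean at least two archive-pages; the function names and sample data are hypothetical.

```python
from collections import defaultdict


def select_archive_pages(communities, rank):
    """Pick the highest-ranked page of each community as its archive-page."""
    return {max(community, key=lambda page: rank.get(page, 0.0))
            for community in communities}


def find_agora_pages(links, communities, archive_pages, min_archives=2):
    """Pages outside every community that are linked from several archive-pages."""
    in_community = set().union(*communities)
    citing = defaultdict(set)
    for src, dst in links:
        if src in archive_pages and dst not in in_community:
            citing[dst].add(src)
    return {page for page, sources in citing.items() if len(sources) >= min_archives}


communities = [{"a1", "a2"}, {"b1", "b2"}]
rank = {"a1": 0.9, "a2": 0.1, "b1": 0.2, "b2": 0.8}  # stand-in for search-engine rank
links = [("a1", "x"), ("b2", "x"), ("a1", "y"), ("a1", "b1"), ("a2", "x")]

archives = select_archive_pages(communities, rank)
agora = find_agora_pages(links, communities, archives)
print(sorted(archives), sorted(agora))  # ['a1', 'b2'] ['x']
```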

  27. Fig: The output of Agora on Links, for domain query “Human Genome”

  28. Evaluation • Stage 1. An interest domain is fixed, a group of people relevant to the domain is gathered, and the domain name is input as a query (e.g. “information retrieval”). • Stage 2. The output graph, with real and fake red nodes mixed as if all had really been obtained as agora-pages, is shown to the subjects. That is, some red nodes that were not really obtained were added with red links to black archive-nodes. Subjects reported individual impressions and exchanged ideas in the group.

  29. Sample Results • Institutes in ‘red’ were the ones that hold data sources on human or mouse genomes, which are useful for researchers at other institutes to consult. • 8 of the 12 ‘red’ nodes were described as “interesting for thinking of future work” by the subjects.

  30. References • [1] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia. • [2] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Trawling the web for emerging cyber-communities. In Proc. 8th Int. World Wide Web Conf., 1999. • [3] Gary William Flake, Steve Lawrence, C. Lee Giles. Efficient Identification of Web Communities. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. • [4] Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee. Self-Organization and Identification of Web Communities. IEEE Computer, 35(3), 66–71, 2002.

  31. References • [5] Naohiro Matsumura, Yukio Ohsawa, Mitsuru Ishizuka. Discovering Seeds of New Interest Spread from Premature Pages Cited by Multiple Communities. 2001 International Conference on Web Intelligence.

  32. Kullback-Leibler Metric • Let p and q be probability distributions with support X and Y respectively. The relative entropy or Kullback-Leibler distance between two probability distributions p and q is defined as $D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$.
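For discrete distributions, the standard relative entropy sums p(x) log(p(x)/q(x)) over the support of p. The following is a minimal sketch of that computation, assuming distributions are given as dictionaries from outcome to probability; how the slides turn this into per-feature scores for the text-feature table is not specified and is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats for two discrete distributions."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


p = {"gene": 0.5, "web": 0.5}
q = {"gene": 0.25, "web": 0.75}
print(kl_divergence(p, p))                          # 0.0
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: the measure is not symmetric
```

Because the value changes when p and q are swapped, it is not a distance in the metric sense, even though it is often called one.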
