Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW 2006
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Social Network Analysis (SNA) • SNA is an established field in sociology • The goal of SNA • Discovering interpersonal relationships based on various modes of information carriers, such as emails and the Web • The community graph structure • How social actors gather into groups that are internally close-knit but only loosely connected to one another • An important characteristic of all SNs
Discovering Community from Email Corpora • Typically the SN is constructed by measuring the intensity of contacts between email users • An edge indicates that the communication between two users exceeds a certain frequency threshold (see the sketch below) • Problematic in some scenarios • A spammer who sends out many messages inflates edge weights and creates spurious links • The resulting graph lacks semantic interpretation
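A minimal sketch of this frequency-threshold construction (names, data, and threshold are illustrative assumptions, not from the paper):

from collections import Counter

# Hypothetical (sender, recipient) pairs standing in for an email log.
emails = [("alice", "bob"), ("alice", "bob"), ("alice", "bob"),
          ("bob", "carol"),
          ("spammer", "bob"), ("spammer", "bob"),
          ("spammer", "carol"), ("spammer", "carol")]
THRESHOLD = 2  # minimum number of contacts for an edge

# Count contacts per unordered user pair; keep pairs meeting the threshold.
counts = Counter(frozenset(pair) for pair in emails)
edges = [tuple(sorted(pair)) for pair, n in counts.items() if n >= THRESHOLD]
print(edges)  # the spammer's bulk mail clears the threshold too,
              # producing the spurious edges the slide warns about

Note how the purely structural criterion cannot distinguish the spammer's edges from genuine communication, which motivates bringing message content into the model.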
Proposed Method • The inner community property within SNs is examined by analyzing semantic information, such as email content • A generative Bayesian network is used to model the generation of communication in an SN • Similarity among social actors is modeled as a hidden layer in the proposed probabilistic model
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Related Work: Document Content Characterization • Several factors, either observable or latent, are modeled as variables in the generative Bayesian network • Topic-Word model • Documents are considered as a mixture of topics • Each topic corresponds to a multinomial distribution over words • Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]
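For reference, the standard topic-word generative process (as in LDA; textbook material, not specific to this paper) is

\[
\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
\phi_z \sim \mathrm{Dirichlet}(\beta), \quad
z_i \sim \mathrm{Multinomial}(\theta_d), \quad
w_i \sim \mathrm{Multinomial}(\phi_{z_i}),
\]

so that each word's probability mixes over topics:

\[
P(w_i \mid d) \;=\; \sum_{z} P(w_i \mid z)\, P(z \mid d).
\]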
Related Work (2) • Author-Word model • The author x is chosen uniformly at random from the document's author set a_d [A. McCallum, 1999] • Author-Topic model • Involves both the author and the topic • Performs well for document content characterization [M. Steyvers et al., 2004]
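Concretely, for a document with author set a_d, the Author-Topic model generates each word by picking an author x uniformly from a_d, a topic z from that author's topic distribution, and the word from the topic, giving

\[
P(w \mid a_d) \;=\; \frac{1}{|a_d|} \sum_{x \in a_d} \sum_{z} P(w \mid z)\, P(z \mid x).
\]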
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Community-User-Topic Models (CUT) • Communication document • A document that serves as the carrier of a communication, e.g., an email • Basic idea • The issuing of a communication document both reflects and is conditioned on the community structure within an SN • The community is introduced as an extra latent variable in the Bayesian network, in addition to the author and topic variables
CUT1: Modeling Community with Users (1) • Assume an SN community is no more than a group of users • Similar to the assumption made in topology-based methods • Treat each community as a multinomial distribution over users
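Under this assumption the generative chain runs community → user → topic → word. As a sketch in my own notation (consistent with the slide's description; see the paper for the exact graphical model), the joint would factorize as

\[
P(c, u, z, w) \;=\; P(c)\, P(u \mid c)\, P(z \mid u)\, P(w \mid z),
\]

with each community a multinomial over users, each user a multinomial over topics, and each topic a multinomial over words.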
CUT1: Modeling Community with Users (2) • Compute the posterior probability P(c, u, z|w) by computing the joint P(c, u, z, w) • A possible side-effect of CUT1 is that it relaxes the community's impact on the generated topics
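The posterior follows by normalizing the joint over the latent variables (standard Bayes rule, shown here for completeness):

\[
P(c, u, z \mid w) \;=\; \frac{P(c, u, z, w)}{\sum_{c'} \sum_{u'} \sum_{z'} P(c', u', z', w)}.
\]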
CUT2: Modeling Community with Topics (1) • An SN community consists of a set of topics • CUT2 differs from CUT1 in strengthening the relation between community and topic
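By analogy, the chain now starts community → topic. One factorization consistent with this description (an illustrative assumption; the paper's graphical model fixes the exact dependencies) is

\[
P(c, u, z, w) \;=\; P(c)\, P(z \mid c)\, P(u \mid z)\, P(w \mid z),
\]

which couples communities directly to topics while users are reached only through topics; this matches the looser community-user ties noted on the next slide.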
CUT2: Modeling Community with Topics (2) • Similarly, compute P(c, u, z|w) by computing P(c, u, z, w) • A possible side-effect of CUT2 is that it might lead to loose ties between community and users
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Practical Algorithm: Gibbs Sampling • Gibbs sampling approximates the joint distribution of multiple variables by drawing a sequence of samples, resampling one variable at a time from its conditional distribution given the current values of all the others • Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm and usually applies when the conditional probability distribution of each variable can be evaluated (a generic sketch follows below)
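As a generic illustration of the algorithm (a toy bivariate-normal sampler, not the paper's model): each step resamples one variable from its exact conditional given the other, and the chain's samples approximate the joint distribution.

import math
import random

rho = 0.8  # target correlation; X|Y=y ~ N(rho*y, 1 - rho^2), and symmetrically
x, y = 0.0, 0.0  # arbitrary initial state
samples = []
for step in range(20000):
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    if step >= 2000:  # discard burn-in before the chain mixes
        samples.append((x, y))

# The empirical E[XY] of the retained samples should approach rho.
print(sum(px * py for px, py in samples) / len(samples))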
Estimation of the Conditional Probability • Estimating P(c_i, u_i, z_i | w_i) for CUT1 and CUT2 [the per-model update equations appeared as images on the original slide; a sketch of the CUT1 form follows below]
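As a hedged sketch of the usual collapsed-Gibbs, count-ratio form for CUT1 (my notation, assuming symmetric Dirichlet priors γ, α, β on the community-user, user-topic, and topic-word multinomials; consult the paper for the exact equations):

\[
P(c_i, u_i, z_i \mid \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{z}_{-i}, \mathbf{w})
\;\propto\;
\frac{n^{-i}_{c_i u_i} + \gamma}{n^{-i}_{c_i} + U\gamma}\;
\frac{n^{-i}_{u_i z_i} + \alpha}{n^{-i}_{u_i} + T\alpha}\;
\frac{n^{-i}_{z_i w_i} + \beta}{n^{-i}_{z_i} + W\beta},
\]

where the n^{-i} are co-occurrence counts excluding the current token i, and U, T, W are the numbers of users, topics, and distinct words. The CUT2 update would take the analogous form with the community-topic pairing in place of community-user.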
EnF-Gibbs: Gibbs Sampling with Entropy Filtering • Non-informative words are filtered out after A iterations, based on the entropy of their accumulated topic assignments (a sketch follows below)
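A hedged sketch of the entropy-filtering idea (the function, toy counts, and cutoff are illustrative assumptions, not the paper's code): after a number of iterations, words whose accumulated topic assignments are spread near-uniformly (high entropy) are dropped as non-informative.

import math
from collections import Counter

def entropy(counts):
    # Shannon entropy (in nats) of a word's topic-assignment histogram.
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# topic_assignments: word -> Counter mapping topic id to assignment count,
# as accumulated during the first A Gibbs iterations (toy values here).
topic_assignments = {
    "the":    Counter({0: 5, 1: 5, 2: 5, 3: 5}),   # near-uniform: high entropy
    "energy": Counter({2: 18, 0: 1, 1: 1}),        # concentrated: low entropy
}
ENTROPY_CUTOFF = 1.0  # illustrative threshold
kept = [w for w, c in topic_assignments.items() if entropy(c) <= ENTROPY_CUTOFF]
print(kept)  # ['energy']: the stop-word-like 'the' is filtered out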
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Experiment Setup • Data: Enron email dataset • Made public by the Federal Energy Regulatory Commission • The number of communities C is fixed at 6 and the number of topics T at 20 • The smoothing hyper-parameters α, β and γ are set to 5/T, 0.01 and 0.1, respectively
Experiment Result-1 Table 1: Topics discovered by CUT1 Table 2: Abbreviations
Experiment Result-2 Fig: Communities/topics of an employee
Experiment Result-3 Fig: A community discovered by CUT2
Experiment Result-4 • D..steffes = vice president of Enron in charge of government affairs • Cara.semperger = a senior analyst • Mike.grigsby = a marketing manager • Rick.buy = chief risk management officer
Experiment Result-5 • Similarity between two clustering results: Fig: Community similarity comparisons
Experiment Result-6 Fig: Efficiency of EnF-Gibbs
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Conclusion and Future Work • Two versions of the Community-User-Topic model are presented for community discovery in SNs • EnF-Gibbs sampling is introduced, extending Gibbs sampling with entropy filtering • Experiments show that the proposed method effectively tags communities with topic semantics • It would be interesting to explore the predictive performance of these models on new communications between previously unacquainted social actors in SNs
Illustration of Dirichlet Distribution Several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α=(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
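For reference, the Dirichlet density over the probability simplex with parameter vector α = (α_1, ..., α_K) takes the standard form

\[
p(x_1, \dots, x_K \mid \alpha)
\;=\;
\frac{\Gamma\!\bigl(\sum_{i=1}^{K} \alpha_i\bigr)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
\prod_{i=1}^{K} x_i^{\alpha_i - 1},
\qquad x_i \ge 0,\; \sum_{i=1}^{K} x_i = 1.
\]

Larger α_i pull the mean toward the i-th corner of the simplex, which is what the four panels described above illustrate for K = 3.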