Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW 2006
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Social Network Analysis (SNA) • SNA is an established field in sociology • The goal of SNA • Discovering interpersonal relationships based on various modes of information carriers, such as emails and the Web • The community graph structure • How social actors gather into groups that are internally close-knit but only loosely connected to one another • An important characteristic of all SNs
Discovering Community from Email Corpora • Typically the SN is constructed by measuring the intensity of contacts between email users • An edge indicates that the communication between two users exceeds a certain frequency threshold (see the sketch below) • Problematic in some scenarios • A spammer who sends out many messages inflates edge weights and creates spurious links • The resulting graph lacks semantic interpretation
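A minimal sketch of this frequency-threshold construction (names, data, and threshold are illustrative assumptions, not from the paper):

from collections import Counter

# Hypothetical (sender, recipient) pairs standing in for an email log.
emails = [("alice", "bob"), ("alice", "bob"), ("alice", "bob"),
          ("bob", "carol"),
          ("spammer", "bob"), ("spammer", "bob"),
          ("spammer", "carol"), ("spammer", "carol")]
THRESHOLD = 2  # minimum number of contacts for an edge

# Count contacts per unordered user pair; keep pairs meeting the threshold.
counts = Counter(frozenset(pair) for pair in emails)
edges = [tuple(sorted(pair)) for pair, n in counts.items() if n >= THRESHOLD]
print(edges)  # the spammer's bulk mail clears the threshold too,
              # producing the spurious edges the slide warns about

Note how the purely structural criterion cannot distinguish the spammer's edges from genuine communication, which motivates bringing message content into the model.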
Proposed Method • The inner community property within SNs is examined by analyzing semantic information, such as email content • A generative Bayesian network is used to model the generation of communication in an SN • Similarity among social actors is modeled as a hidden layer in the proposed probabilistic model
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Related Work: Document Content Characterization • Several factors, either observable or latent, are modeled as variables in the generative Bayesian network • Topic-Word model • Documents are considered as a mixture of topics • Each topic corresponds to a multinomial distribution over words • Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]
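For reference, the standard topic-word generative process (as in LDA; textbook material, not specific to this paper) is

\[
\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
\phi_z \sim \mathrm{Dirichlet}(\beta), \quad
z_i \sim \mathrm{Multinomial}(\theta_d), \quad
w_i \sim \mathrm{Multinomial}(\phi_{z_i}),
\]

so that each word's probability mixes over topics:

\[
P(w_i \mid d) \;=\; \sum_{z} P(w_i \mid z)\, P(z \mid d).
\]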
Related Work (2) • Author-Word model • The author x is chosen uniformly at random from the document's author set a_d [A. McCallum, 1999] • Author-Topic model • Involves both the author and the topic • Performs well for document content characterization [M. Steyvers et al., 2004]
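Concretely, for a document with author set a_d, the Author-Topic model generates each word by picking an author x uniformly from a_d, a topic z from that author's topic distribution, and the word from the topic, giving

\[
P(w \mid a_d) \;=\; \frac{1}{|a_d|} \sum_{x \in a_d} \sum_{z} P(w \mid z)\, P(z \mid x).
\]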
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Community-User-Topic Models (CUT) • Communication document • A document that serves as the carrier of a communication, e.g., an email • Basic idea • The issuing of a communication document both reflects and is conditioned on the community structure within an SN • The community is introduced as an extra latent variable in the Bayesian network, in addition to the author and topic variables
CUT1: Modeling Community with Users (1) • Assume an SN community is no more than a group of users • Similar to the assumption made in topology-based methods • Treat each community as a multinomial distribution over users
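Under this assumption the generative chain runs community → user → topic → word. As a sketch in my own notation (consistent with the slide's description; see the paper for the exact graphical model), the joint would factorize as

\[
P(c, u, z, w) \;=\; P(c)\, P(u \mid c)\, P(z \mid u)\, P(w \mid z),
\]

with each community a multinomial over users, each user a multinomial over topics, and each topic a multinomial over words.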
CUT1: Modeling Community with Users (2) • Compute the posterior probability P(c, u, z|w) by computing the joint P(c, u, z, w) • A possible side-effect of CUT1 is that it relaxes the community's impact on the generated topics
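The posterior follows by normalizing the joint over the latent variables (standard Bayes rule, shown here for completeness):

\[
P(c, u, z \mid w) \;=\; \frac{P(c, u, z, w)}{\sum_{c'} \sum_{u'} \sum_{z'} P(c', u', z', w)}.
\]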
CUT2: Modeling Community with Topics (1) • An SN community consists of a set of topics • CUT2 differs from CUT1 in strengthening the relation between community and topic
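By analogy, the chain now starts community → topic. One factorization consistent with this description (an illustrative assumption; the paper's graphical model fixes the exact dependencies) is

\[
P(c, u, z, w) \;=\; P(c)\, P(z \mid c)\, P(u \mid z)\, P(w \mid z),
\]

which couples communities directly to topics while users are reached only through topics; this matches the looser community-user ties noted on the next slide.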
CUT2: Modeling Community with Topics (2) • Similarly, compute P(c, u, z|w) by computing P(c, u, z, w) • A possible side-effect of CUT2 is that it might lead to loose ties between community and users
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Practical Algorithm: Gibbs Sampling • Gibbs sampling approximates the joint distribution of multiple variables by drawing a sequence of samples, resampling one variable at a time from its conditional distribution given the current values of all the others • Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm and usually applies when the conditional probability distribution of each variable can be evaluated (a generic sketch follows below)
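As a generic illustration of the algorithm (a toy bivariate-normal sampler, not the paper's model): each step resamples one variable from its exact conditional given the other, and the chain's samples approximate the joint distribution.

import math
import random

rho = 0.8  # target correlation; X|Y=y ~ N(rho*y, 1 - rho^2), and symmetrically
x, y = 0.0, 0.0  # arbitrary initial state
samples = []
for step in range(20000):
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    if step >= 2000:  # discard burn-in before the chain mixes
        samples.append((x, y))

# The empirical E[XY] of the retained samples should approach rho.
print(sum(px * py for px, py in samples) / len(samples))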
Estimation of the Conditional Probability • Estimating P(c_i, u_i, z_i | w_i) for CUT1 and CUT2 [the per-model update equations appeared as images on the original slide; a sketch of the CUT1 form follows below]
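As a hedged sketch of the usual collapsed-Gibbs, count-ratio form for CUT1 (my notation, assuming symmetric Dirichlet priors γ, α, β on the community-user, user-topic, and topic-word multinomials; consult the paper for the exact equations):

\[
P(c_i, u_i, z_i \mid \mathbf{c}_{-i}, \mathbf{u}_{-i}, \mathbf{z}_{-i}, \mathbf{w})
\;\propto\;
\frac{n^{-i}_{c_i u_i} + \gamma}{n^{-i}_{c_i} + U\gamma}\;
\frac{n^{-i}_{u_i z_i} + \alpha}{n^{-i}_{u_i} + T\alpha}\;
\frac{n^{-i}_{z_i w_i} + \beta}{n^{-i}_{z_i} + W\beta},
\]

where the n^{-i} are co-occurrence counts excluding the current token i, and U, T, W are the numbers of users, topics, and distinct words. The CUT2 update would take the analogous form with the community-topic pairing in place of community-user.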
EnF-Gibbs: Gibbs Sampling with Entropy Filtering • Non-informative words are filtered out after A iterations, based on the entropy of their accumulated topic assignments (a sketch follows below)
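A hedged sketch of the entropy-filtering idea (the function, toy counts, and cutoff are illustrative assumptions, not the paper's code): after a number of iterations, words whose accumulated topic assignments are spread near-uniformly (high entropy) are dropped as non-informative.

import math
from collections import Counter

def entropy(counts):
    # Shannon entropy (in nats) of a word's topic-assignment histogram.
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# topic_assignments: word -> Counter mapping topic id to assignment count,
# as accumulated during the first A Gibbs iterations (toy values here).
topic_assignments = {
    "the":    Counter({0: 5, 1: 5, 2: 5, 3: 5}),   # near-uniform: high entropy
    "energy": Counter({2: 18, 0: 1, 1: 1}),        # concentrated: low entropy
}
ENTROPY_CUTOFF = 1.0  # illustrative threshold
kept = [w for w, c in topic_assignments.items() if entropy(c) <= ENTROPY_CUTOFF]
print(kept)  # ['energy']: the stop-word-like 'the' is filtered out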
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Experiment Setup • Data: Enron email dataset • Made public by the Federal Energy Regulatory Commission • The number of communities C is fixed at 6 and the number of topics T at 20 • The smoothing hyper-parameters α, β and γ are set to 5/T, 0.01 and 0.1, respectively
Experiment Result-1 Table 1: Topics discovered by CUT1 Table 2: Abbreviations
Experiment Result-2 Fig: Communities/topics of an employee
Experiment Result-3 Fig: A community discovered by CUT2
Experiment Result-4 • D..steffes = vice president of Enron in charge of government affairs • Cara.semperger = a senior analyst • Mike.grigsby = a marketing manager • Rick.buy = chief risk management officer
Experiment Result-5 • Similarity between two clustering results: Fig: Community similarity comparisons
Experiment Result-6 Fig: Efficiency of EnF-Gibbs
Outline • Introduction • Related Work • Community-User-Topic Models • Semantic Community Discovery • Experiments • Conclusion
Conclusion and Future Work • Two versions of the Community-User-Topic model are presented for community discovery in SNs • EnF-Gibbs sampling is introduced, extending Gibbs sampling with entropy filtering • Experiments show that the proposed method effectively tags communities with topic semantics • It would be interesting to explore the predictive performance of these models on new communications between previously unacquainted social actors in SNs
Illustration of Dirichlet Distribution Several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α=(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
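For reference, the Dirichlet density over the probability simplex with parameter vector α = (α_1, ..., α_K) takes the standard form

\[
p(x_1, \dots, x_K \mid \alpha)
\;=\;
\frac{\Gamma\!\bigl(\sum_{i=1}^{K} \alpha_i\bigr)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
\prod_{i=1}^{K} x_i^{\alpha_i - 1},
\qquad x_i \ge 0,\; \sum_{i=1}^{K} x_i = 1.
\]

Larger α_i pull the mean toward the i-th corner of the simplex, which is what the four panels described above illustrate for K = 3.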