
Community Detection with Edge Content in Social Media Networks



Presentation Transcript


  1. Community Detection with Edge Content in Social Media Networks Paper presented by Konstantinos Giannakopoulos

  2. Outline • Definitions • Social Networks & Big Data • Community Detection • The framework of Matrix Factorization algorithms. • Steps, Goals, Solution • The PCA approach • The EIMF algorithm • Description, Performance Metrics, Evaluation • Other Approaches • Algorithms, Models, Metrics

  3. From Social Networks to Big Data • Network → Social Network → Big Data

  4. Social Networks • Users act (conversations, likes, shares) • Users are connected

  5. Community Detection • Density of links • Links and content: some of the less strongly linked vertices may belong to the same community if they share similar content

  6. General Methodology of MF models • Decomposition of a matrix into a product of two low-rank matrices: M ≈ A × B, where M: [m x n] is a matrix representation of the social network, A: [m x k], and B: [k x n]. • Each node is mapped to a k-dimensional latent feature vector.
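The M ≈ A·B step above can be sketched with a truncated SVD, which gives the best rank-k approximation in the Frobenius norm. The toy matrix, variable names, and numpy solver here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of the M ≈ A·B decomposition via truncated SVD.
rng = np.random.default_rng(0)
M = (rng.random((6, 6)) < 0.4).astype(float)  # toy [m x n] network matrix
k = 2                                         # latent dimension

U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :k] * s[:k]        # [m x k] latent factors
B = Vt[:k, :]               # [k x n] latent factors
M_approx = A @ B            # best rank-k approximation in Frobenius norm

err = np.linalg.norm(M - M_approx, "fro")
print(round(err, 4))
```

By the Eckart–Young theorem, this error equals the norm of the discarded singular values, which is why SVD is a natural baseline for MF models.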

  7. What’s next? • Two sub-models: • Link matrix factorization FL • Content matrix factorization FC. Each factor matrix contains k-dimensional feature vectors. F = min ||M − P||2F + regularization term • P: product of matrices that approximates M • Content is incorporated in FC using: • cosine similarity • the normalized Laplacian matrix. • The regularization term • improves robustness • prevents overfitting.

  8. Goal & Solution • Goal: To find an optimal representation of the latent vectors — an optimization problem. • The Frobenius norm ||.||2F measures the discrepancy between matrices. • When FL and FC are convex functions, the minimization problem is solved using • conjugate gradient or quasi-Newton methods. • Then, FL and FC are incorporated into one objective function, which is usually convex too. • To obtain high-quality communities: • use traditional classifiers like k-means or SVMs.

  9. The PCA Algorithm • State-of-the-art method for this model: • PCA (similar to LSI). Optimization problem: min ||M − ZUT||2F + γ||U|| • Z: [n × l] matrix with each row being the l-dim feature vector of a node, • U: [n × l] matrix, and • ||.||2F: the Frobenius norm. Goal: Approximate M by ZUT, a product of two low-rank matrices, with a regularization on U.
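One way to sketch the regularized objective min ||M − ZUᵀ||²F + γ||U||²F is alternating least squares; this is my choice of solver (with a squared-norm penalty), not necessarily the one used in the paper, and all sizes are toy assumptions:

```python
import numpy as np

# Hedged sketch: alternate exact least-squares updates for Z and ridge
# updates for U; each step cannot increase the objective.
rng = np.random.default_rng(1)
M = rng.random((8, 8))          # toy [n x n] network matrix
l, gamma = 3, 0.1               # latent dimension, regularization weight
Z = rng.random((8, l))
U = rng.random((8, l))

def objective(M, Z, U, gamma):
    return (np.linalg.norm(M - Z @ U.T, "fro") ** 2
            + gamma * np.linalg.norm(U, "fro") ** 2)

history = [objective(M, Z, U, gamma)]
for _ in range(20):
    Z = M @ U @ np.linalg.inv(U.T @ U)                        # solve for Z
    U = M.T @ Z @ np.linalg.inv(Z.T @ Z + gamma * np.eye(l))  # ridge solve for U
    history.append(objective(M, Z, U, gamma))
```

Because each half-step solves its subproblem exactly, the recorded objective values are non-increasing.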

  10. Edge-Induced Matrix-Factorization (EIMF) • Partitions the edge set into k communities based on both linkage and content. • Edges: a latent vector space based on link structure. • Content is incorporated into edges, so that the latent vectors of edges with similar content are clustered together. • Two objective functions: • Linkage-based connectivity/density, captured by Ol • Content-based similarity among the messages, Oc

  11. Ol: link structure for any vertex and its incident edges • Approximate the link matrix Γ: [m x n] • Ol(E) = ||ETV − Γ||2F or Ol(E) = ||ETE∆ − Γ||2F, where E: [k x m], V: [k x n]

  12. Oc: incorporating edge content • Each edge has content associated with it. • Each document is represented with a d-dim feature vector; C: [d x m]. • Cosine similarity: similarity measure of two corresponding feature vectors. • Normalized Laplacian matrix L: used to minimize the content-based objective function Oc(E) = minE tr(ET·L·E)

  13. To Sum Up • Two objective functions: • Linkage-based connectivity/density; the link structure for any vertex and its incident edges: Ol(E) = ||ETV − Γ||2F • Content-based similarity among text documents: Oc(E) = minE tr(ET·L·E) • Goal • Minimize the combined objective function O(E) = Ol(E) + λ·Oc(E) • Solution • Convex functions => no local minimum => gradient-based optimization • Apply k-means for the detection of the final communities
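The gradient step above can be sketched as plain gradient descent on O(E) = ||EᵀV − Γ||²F + λ·tr(E L Eᵀ). The shapes are assumptions: E [k x m] over edges, V [k x n] over vertices, Γ [m x n], and L an [m x m] symmetric content Laplacian (the slide writes tr(Eᵀ·L·E), which matches this up to the row/column convention for E); the data is a random toy instance:

```python
import numpy as np

# Sketch: gradient of ||E^T V - Γ||_F^2 is 2·V·(E^T V - Γ)^T, and of
# λ·tr(E L E^T) is 2·λ·E·L (L symmetric); a small fixed step decreases O.
rng = np.random.default_rng(2)
k, m, n = 2, 5, 4
V = rng.random((k, n))
Gamma = rng.random((m, n))
A = rng.random((m, m))
L = A + A.T + m * np.eye(m)     # symmetric positive-definite stand-in for L
E = rng.random((k, m))
lam, step = 0.5, 1e-3

def O(E):
    return (np.linalg.norm(E.T @ V - Gamma, "fro") ** 2
            + lam * np.trace(E @ L @ E.T))

before = O(E)
for _ in range(200):
    grad = 2 * V @ (E.T @ V - Gamma).T + 2 * lam * E @ L
    E -= step * grad
after = O(E)
```

The final communities would then come from running k-means on the columns of E, as the slide describes.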

  14. Experiments • Characteristics of the Datasets • Enron Email Dataset • # of messages: 200,399 • # of users: 158 • # of communities: 53 • Flickr Social Network Dataset • # of users: 4,703 • # of communities: 15 • # of images: 26,920

  15. Performance Metrics • Supervised • Precision: • The fraction of retrieved docs that are relevant. • e.g. high precision: every result on the first page is relevant. • Recall: • The fraction of relevant docs that are retrieved. • e.g. retrieve all the relevant results. • Pairwise F-measure: • A higher value suggests that the clustering is of good quality.
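A minimal sketch of the pairwise variant of these metrics, under the usual convention that a pair of items is "retrieved" if the clustering puts both in the same cluster and "relevant" if the ground truth does (the function name is mine):

```python
from itertools import combinations

# Pairwise precision/recall/F-measure over same-cluster item pairs.
def pairwise_prf(pred, truth):
    idx = range(len(pred))
    retrieved = {(i, j) for i, j in combinations(idx, 2) if pred[i] == pred[j]}
    relevant = {(i, j) for i, j in combinations(idx, 2) if truth[i] == truth[j]}
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# a perfect clustering (up to label renaming) scores 1.0 on all three
print(pairwise_prf([0, 0, 1, 1], [1, 1, 0, 0]))  # → (1.0, 1.0, 1.0)
```

Note that pairwise metrics are invariant to cluster label permutations, which is what makes them suitable for unsupervised community detection.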

  16. Performance Metrics • Average Cluster Purity (ACP) • The average percentage of the dominant community in the different clusters.
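The ACP definition above can be sketched directly: per cluster, take the fraction occupied by its dominant ground-truth community, then average over clusters (function name and toy labels are illustrative):

```python
from collections import Counter, defaultdict

# Average Cluster Purity: mean dominant-community fraction over clusters.
def average_cluster_purity(pred, truth):
    clusters = defaultdict(list)
    for p, t in zip(pred, truth):
        clusters[p].append(t)
    purities = [Counter(members).most_common(1)[0][1] / len(members)
                for members in clusters.values()]
    return sum(purities) / len(purities)

# cluster 0 = {0,0,1} (purity 2/3), cluster 1 = {1,1} (purity 1)
print(round(average_cluster_purity([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]), 4))  # → 0.8333
```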

  17. Evaluation • Four sets of experiments with other algorithms • Link only • Newman • LDA-Link • Content • LDA-Word • NCUT-Content • Link + node content • LDA-Link-Word • NCUT-Link-Content • Link + edge content • EIMF-Lap • EIMF-LP • Tuning the balancing parameter λ

  18. Strong/Weak points • Strong points • Incorporation of message content into link connectivity. • Detection of overlapping communities. • Weak points • Tested mainly on email datasets (directed communication) and on a dataset with tags, not on a broadcast-style social network. • The experiments do not evaluate it as a unified model.

  19. More Link-Based Algorithms • Modularity: measures the strength of division of a network into modules. High modularity => dense connections within modules & sparse connections between modules.

  20. Example: k1 = k2 = 1, k3 = k4 = k5 = 2, M = 2|E| = 8, Pij = (ki·kj) / M
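Using the null model Pij = ki·kj / M from the slide, Newman modularity for a labelled partition can be sketched as Q = (1/M)·Σij (Aij − ki·kj/M)·δ(ci, cj); the function and the two-triangle test graph are my illustrative choices:

```python
import numpy as np

# Newman modularity of an undirected graph given an adjacency matrix A
# and a community label per node; M = 2|E| as on the slide.
def modularity(A, communities):
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                     # node degrees
    M = A.sum()                           # 2|E|
    c = np.asarray(communities)
    same = c[:, None] == c[None, :]       # δ(c_i, c_j)
    return ((A - np.outer(k, k) / M) * same).sum() / M

# two disjoint triangles, each taken as its own community
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1
print(round(modularity(A, [0, 0, 0, 1, 1, 1]), 3))  # → 0.5
```

Putting all six nodes into one community gives Q = 0, illustrating that modularity rewards divisions denser than the degree-based expectation.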

  21. Even More Link-based Algorithms • Betweenness: measures a node’s centrality in a network — the number of shortest paths from all vertices to all others that pass through that node. • Normalized Cut (Spectral Clustering): uses the eigenvalues of the similarity matrix S of the data points to perform dimensionality reduction before clustering in fewer dimensions. It partitions the points into two sets based on the eigenvector corresponding to the second-smallest eigenvalue of the normalized Laplacian matrix of S, where D is the diagonal degree matrix. • Clique-based • PHITS
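The spectral-bisection step described above can be sketched as follows, assuming the common form L = I − D^(−1/2)·S·D^(−1/2) of the normalized Laplacian and splitting on the sign of the second eigenvector (function name and test graph are mine):

```python
import numpy as np

# Normalized-cut style bisection: label nodes by the sign of the
# eigenvector for the second-smallest eigenvalue of the normalized
# Laplacian of the similarity matrix S.
def spectral_bisect(S):
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)                          # degrees (diagonal of D)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)       # ascending eigenvalues
    fiedler = eigvecs[:, 1]                    # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# two triangles joined by a single bridge edge (2, 3)
S = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    S[i, j] = S[j, i] = 1
labels = spectral_bisect(S)
```

On this graph the minimum normalized cut severs the bridge, so the two triangles land in different halves.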

  22. Other Algorithms • Node-Content • PLSA: probabilistic model. Data is observations arising from a generative probabilistic process that includes hidden variables; posterior inference infers the hidden structure. • LDA: each document is a mixture of various topics. • SVM (on content and/or on links+content): a vector-based method that finds a decision boundary between two classes. • Combined Link-Structure and Node-Content Analysis • NCUT-Link-Content • LDA-Link-Content

  23. Other Community Detection Models • Discriminative • Given a value vector c that the model aims to predict, and a vector x that contains the values of the input features, the goal is to find the conditional distribution p(c | x). • p(c | x) is described by a parametric model. • Maximum Likelihood Estimation is used to find optimal values of the model parameters. • State-of-the-art approach: PLSI

  24. Generative • Given some hidden parameters, it randomly generates data. The goal is to find the joint probability distribution p(x, c). • The conditional probability p(c | x) can be estimated through the joint distribution p(x, c), e.g. P(c, u, z, ω) = P(ω|u)·P(u|z)·P(z|c)·P(c) • State-of-the-art approach: LDA

  25. Bayesian Models • Estimate prior distributions for model parameters. (e.g. Dirichlet distribution with Gamma function, Beta distribution). • Estimate the Joint probability of the complete data. • A Bayesian inference framework is used to maximize the posterior probability. The problem is intractable, thus optimization is necessary. • Apply Gibbs sampling approach for parameter estimation, to compute the conditional probability. • Compute statistics with initial assignments. • For each iteration and for each node: • Estimate the objective function. • Sample the community assignment for node i according to the above distribution. • Update the statistics.
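The Gibbs-sampling loop above (initialize, then per iteration resample each node's community from a conditional and update statistics) can be illustrated with a deliberately simplified toy: here the conditional is merely proportional to the number of neighbours already in each community plus a Dirichlet-style smoothing term alpha — an assumption of mine, not the paper's model:

```python
import random

# Toy illustration of the Gibbs loop: resample each node's community
# from a hypothetical conditional (neighbour counts + alpha smoothing).
def gibbs_communities(adj, n_comms, n_iters=50, alpha=0.1, seed=3):
    rng = random.Random(seed)
    assign = [rng.randrange(n_comms) for _ in adj]   # initial assignments
    for _ in range(n_iters):
        for i, neighbours in enumerate(adj):
            # statistics: neighbours of i currently in each community
            weights = [sum(assign[j] == c for j in neighbours) + alpha
                       for c in range(n_comms)]
            # sample the community assignment for node i
            assign[i] = rng.choices(range(n_comms), weights=weights)[0]
    return assign

# adjacency lists: two triangles joined by one bridge edge
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
labels = gibbs_communities(adj, n_comms=2)
```

A real model would compute the full conditional from the joint distribution and run many more iterations with burn-in; this sketch only shows the control flow.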

  26. Additional Evaluation Metrics • Normalized Mutual Information (NMI) • Measures the agreement between the detected communities and the ground-truth communities, normalized by the entropies of the two partitions. • Modularity • NCUT
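A minimal sketch of NMI between two partitions, using the common 2·I(X;Y)/(H(X)+H(Y)) normalization (one of several variants in the literature; the function name is mine):

```python
from math import log
from collections import Counter

# NMI = 2·I(pred; truth) / (H(pred) + H(truth)).
def nmi(pred, truth):
    n = len(pred)
    px = Counter(pred)
    py = Counter(truth)
    pxy = Counter(zip(pred, truth))
    hx = -sum(c / n * log(c / n) for c in px.values())
    hy = -sum(c / n * log(c / n) for c in py.values())
    mi = sum(c / n * log((c / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), c in pxy.items())
    return 2 * mi / (hx + hy) if hx + hy else 1.0

# identical partitions up to label renaming score 1.0
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # → 1.0
```

Like the pairwise F-measure, NMI is invariant to relabelling the clusters, so it compares partition structure rather than label names.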

  27. Additional Evaluation Metrics • Perplexity • A metric for evaluating language models (topic models). • A higher value of perplexity implies a lower model likelihood and hence less generative power of the model.
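Perplexity is the exponential of the average negative log-likelihood per token, which the following sketch computes from a list of per-token probabilities (an illustrative interface of my choosing):

```python
from math import exp, log

# Perplexity = exp(-(1/N) · Σ log p(token)); lower is better.
def perplexity(token_probs):
    return exp(-sum(log(p) for p in token_probs) / len(token_probs))

# a model assigning probability 1/4 to every token has perplexity 4
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0
```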

  28. Comparative Analysis • Three models • MF, Discriminative (D), Generative (G) • Parameter Estimation • Objective function minimization (MF) • Frobenius norm, cosine similarity, Laplacian norm, quasi-Newton. • EM & MLE (D) • Gibbs Sampling (Entropy-based, Blocked) (G) • Metrics • PWF, ACP (MF) • NMI, PWF, Modularity (D) • NMI, Modularity, Perplexity, Running Time, # of iterations (G)
