Communities and Clustering in some Social Networks
Guido Caldarelli, SMC CNR-INFM Rome
INTRODUCTION Summary
1 Introduction on basic notions of graphs and clustering
2 Introduction on clustering methods based on similarity/centrality
3 Introduction on clustering methods based on spectral analysis
4 The case study of the word association network
5 The case study of Wikipedia
6 Conclusions and advertisements
Guido Caldarelli, Communities and Clustering in Some Social Networks, NetSci 2007, New York, May 20th 2007
INTRODUCTION 1.0 Basic matrix notation
[Figure: example graph with numbered vertices and its matrix representation; not recovered]
INTRODUCTION 1.1 Clusters and Communities
Generally a cluster corresponds to a community.
Some communities are hard to detect with clustering analysis.
INTRODUCTION 1.2 Small graphs
In order to detect communities, clustering is a good clue.
• Clustering Coefficient
• Motifs
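
As an illustration of the clustering coefficient named above, here is a minimal Python sketch. It assumes the usual local definition (the fraction of pairs of neighbours of a vertex that are themselves linked); the function name and the toy graph are hypothetical and do not come from the slides.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves linked."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here: fewer than two neighbours counts as 0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficients = {v: local_clustering(adj, v) for v in adj}
```

Vertices inside the triangle score high, while the vertex that also bridges to the pendant vertex scores lower, which is the sense in which clustering can hint at community membership.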
INTRODUCTION 1.2 Hubs and Authorities
Sometimes vertices differ from each other according to their function.
• HITS
• Hubs are those web pages that point to a large number of authorities (i.e. they have a large number of outgoing edges).
• Authorities are those web pages pointed to by a large number of hubs (i.e. they have a large number of incoming edges).
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604–632.
INTRODUCTION 1.3 Hubs and Authorities
If every page i, j has authority U_i, U_j and hubness H_i, H_j [equations not recovered]
We can divide the pages according to their value of U or H. These values are given by the principal eigenvectors of the matrices A^T A and A A^T, respectively.
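
The authority and hub scores can be sketched with a simple power iteration. This is a minimal Python sketch, assuming the convention A[i][j] = 1 when page i links to page j; the function names, the iteration count and the toy link matrix are hypothetical. Repeatedly applying A^T and then A drives the authority vector towards the principal eigenvector of A^T A and the hub vector towards that of A A^T.

```python
import math


def normalise(v):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def hits(A, iterations=100):
    """Power iteration for authority and hub scores; A[i][j] = 1 if i links to j."""
    n = len(A)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of the hub scores of the pages pointing to j (A^T hub).
        auth = normalise([sum(A[i][j] * hub[i] for i in range(n)) for j in range(n)])
        # Hubness of i: sum of the authority scores of the pages i points to (A auth).
        hub = normalise([sum(A[i][j] * auth[j] for j in range(n)) for i in range(n)])
    return auth, hub


# Hypothetical toy web: page 0 links to pages 2 and 3, page 1 links to page 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
auth, hub = hits(A)
```

In this toy example page 2 ends up as the strongest authority and page 0 as the strongest hub.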
TOPOLOGICAL ANALYSIS 2.1 Agglomerative Methods
One way to cluster vertices is to find similarities between them. One “topological” way is to consider their neighbours. One can then define a distance x given by [equation not recovered]
Brun, et al. (2003). Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5, R6 1–13.
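
The distance formula on this slide was not recovered. As one possible neighbour-based choice, the Python sketch below uses the Czekanowski-Dice distance, with each vertex counted among its own neighbours. Whether this matches the formula shown on the slide is an assumption, and the function name and toy graph are hypothetical.

```python
def czekanowski_dice(adj, i, j):
    """Neighbour-based distance: 0 for identical neighbourhoods, 1 for disjoint ones."""
    a = adj[i] | {i}  # neighbourhood of i, including i itself (assumed convention)
    b = adj[j] | {j}
    return len(a ^ b) / (len(a | b) + len(a & b))


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
d_01 = czekanowski_dice(adj, 0, 1)  # same neighbourhood
d_03 = czekanowski_dice(adj, 0, 3)  # mostly different neighbourhoods
```

An agglomerative method would then merge, step by step, the vertices or groups at the smallest distance.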
TOPOLOGICAL ANALYSIS 2.2 Divisive Methods: betweenness
The Girvan-Newman algorithm recursively removes the edge with the largest betweenness in the graph.
The betweenness is a measure of the centrality of a vertex/edge in a graph.
Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. of Science (USA), 99, 7821–7826.
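
One divisive step can be sketched as follows. This is a minimal Python sketch for an undirected, unweighted graph, assuming shortest-path edge betweenness computed with a Brandes-style accumulation; the function names and the toy graph are hypothetical, and it is not the authors' implementation.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness of an undirected, unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest vertices back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair of endpoints was visited twice (once from each end).
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v in part:
                continue
            part.add(v)
            stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def split_once(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break  # no edges left to remove
        u, v = tuple(max(scores, key=scores.get))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(adj)[frozenset((2, 3))]
parts = split_once(adj)
```

Calling `split_once` again on each resulting part would continue the recursion; recording the order of the splits gives a hierarchical tree of communities.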
TOPOLOGICAL ANALYSIS 2.3 Examples
The procedure on a more complicated network produces a dendrogram of the community structure.
[Figure] (a) Friendship network from Zachary’s karate club study. Nodes associated with the club administrator’s faction are drawn as circles, those associated with the instructor’s faction are drawn as squares. (b) Hierarchical tree showing the complete community structure. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.
TOPOLOGICAL ANALYSIS 2.3 Examples
One typical example is that of the e-mail network. Below is the case study of the University of Tarragona (Spain). Different colors correspond to different departments.
Guimerà, R., Danon, L., Diaz-Guilera, A., Giralt, F., and Arenas, A. (2002). Self-similar community structure in organisations. Physical Review E, 68, 065103.
TOPOLOGICAL ANALYSIS 2.4 Random walks and communities
Random walks on graphs are at the basis of the PageRank algorithm (Google). This means that the larger the probability of passing through a certain page, the greater its interest.
Random walks can also be used to detect clusters in graphs. The idea is that the more closed a subgraph is, the longer a random walker needs to escape from it.
One of the heuristic algorithms based on random walks is the Markov Cluster (MCL) algorithm. You can find the complete description and code at http://micans.org/mcl
• Start from the Normal Matrix.
• Through matrix manipulation (power), one obtains a matrix for n-step connections.
• Enhance intracluster passages by raising the elements to a certain power and then normalizing.
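
The three steps above can be sketched in Python as follows. This is a simplified illustration, not the reference implementation distributed at the address above: it assumes a column-stochastic transition matrix (the transpose of a row-normalised one), adds self-loops before normalising (a common practice in MCL), and uses hypothetical parameter values (inflation power 2, 50 rounds, threshold 1e-6) and function names.

```python
def normalise_columns(M):
    """Rescale every column so that it sums to 1."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(M):
    """Expansion: matrix square, i.e. two-step random-walk probabilities."""
    n = len(M)
    return [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(M, r):
    """Inflation: element-wise power followed by column normalisation."""
    return normalise_columns([[x ** r for x in row] for row in M])


def mcl(A, r=2.0, iterations=50, tol=1e-6):
    """Alternate expansion and inflation, then read clusters off the rows."""
    n = len(A)
    # Self-loops are added before normalising (assumption, see lead-in).
    M = normalise_columns(
        [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iterations):
        M = inflate(expand(M), r)
    clusters = set()
    for row in M:
        # A row with surviving entries belongs to an attractor; its non-zero
        # columns are the vertices that flow into it.
        members = frozenset(j for j, x in enumerate(row) if x > tol)
        if members:
            clusters.add(members)
    return clusters


# Hypothetical toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(A)
```

A larger inflation power tends to produce more and smaller clusters, so it acts as the resolution parameter of the method.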
SPECTRAL ANALYSIS 3.1 The functions of the adjacency matrix
[Figure: example graph with numbered vertices, with its Normal Matrix and Laplacian Matrix; not recovered]
SPECTRAL ANALYSIS 3.1 The functions of the adjacency matrix
If f’ = Lf [remainder of the equations not recovered]
The elements of matrix N give the probability with which one field f passes from a vertex i to the neighbours.
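
The matrix definitions on these slides were not recovered. The Python sketch below assumes the standard forms N = K^{-1} A and L = K - A, where K is the diagonal matrix of the degrees, and assumes a graph with no isolated vertices; the function name and the toy graph are hypothetical.

```python
def normal_and_laplacian(A):
    """Build N = K^{-1} A and L = K - A from a symmetric 0/1 adjacency matrix."""
    n = len(A)
    degree = [sum(row) for row in A]  # assumes every vertex has at least one edge
    N = [[A[i][j] / degree[i] for j in range(n)] for i in range(n)]
    L = [[(degree[i] if i == j else 0) - A[i][j] for j in range(n)]
         for i in range(n)]
    return N, L


# Hypothetical toy graph: the path 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
N, L = normal_and_laplacian(A)
```

Each row of N sums to 1, which is what allows its elements to be read as transition probabilities, while each row of L sums to 0.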
SPECTRAL ANALYSIS 3.2 The block properties in clustered graphs
In a very clustered graph, the adjacency matrix can be put in a block form.
SPECTRAL ANALYSIS 3.2 The block properties in clustered graphs
• Given this probabilistic explanation for the matrix N
• We have a series of results, for example
• One eigenvalue is equal to one and
• The related eigenvector is constant.
Consider the case of disconnected subclusters: the matrix N is made of blocks, and a general eigenvector is obtained by combining the eigenvectors of the individual blocks (the constant can differ from block to block!)
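
The statement about disconnected subclusters can be checked numerically. The Python sketch below assumes N = K^{-1} A, so that applying N to a vector replaces each entry by the average over the neighbours of that vertex; the function name, the toy graph and the chosen constants are hypothetical.

```python
def apply_normal_matrix(adj, x):
    """Return N x, where (N x)_i is the average of x over the neighbours of i."""
    return {i: sum(x[j] for j in adj[i]) / len(adj[i]) for i in adj}


# Hypothetical toy graph: two disconnected triangles, (0, 1, 2) and (3, 4, 5).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}

# A vector that is constant on each block, with a different constant per block.
x = {0: 2.0, 1: 2.0, 2: 2.0, 3: -5.0, 4: -5.0, 5: -5.0}
y = apply_normal_matrix(adj, x)  # equals x: eigenvalue 1
```

Any vector that is constant on each disconnected block is left unchanged by N, so the eigenvalue 1 appears once per block, and the plateaus of such eigenvectors label the blocks.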
SPECTRAL ANALYSIS 3.3 Eigenvalues and Communities
It is possible to express the eigenvector problem as the search for a minimum under constraint.
• Define a fictitious quantity x for the sites of the graph
• Define a suitable function z on these x’s (a “distance”)
• Define a suitable constraint on these x’s (to avoid having all equal or all 0)
For example [equation not recovered], where the x_i are values assigned to nodes, with some constraint expressed by (A) [equation not recovered].
Stationary points of z(x) + constraint (A) → Lagrange multiplier
SPECTRAL ANALYSIS 3.3 Eigenvalues and Communities
[Equations not recovered]
• Lagrange Multiplier = Normal Eigenvalue problem
• Lagrange Multiplier = Laplacian Eigenvalue problem
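
The formulas on these slides were not recovered, so the following derivation is only a sketch under stated assumptions: a symmetric adjacency matrix A with degrees k_i (K is the diagonal degree matrix), a quadratic "distance" z, and a constraint with hypothetical vertex weights m_i. With these choices the Lagrange multiplier μ becomes an eigenvalue of either the Laplacian matrix K - A or the Normal matrix K^{-1} A.

```latex
\begin{aligned}
z(x) = \tfrac12 \sum_{i,j} A_{ij}\,(x_i - x_j)^2,
  &\qquad \sum_i m_i x_i^2 = 1 \\
\frac{\partial}{\partial x_l}\Big[z(x) - \mu\Big(\sum_i m_i x_i^2 - 1\Big)\Big] = 0
  &\;\Longrightarrow\; 2\Big(k_l x_l - \sum_j A_{lj} x_j\Big) = 2\mu\, m_l x_l \\
m_l = 1:
  &\quad (K - A)\,x = \mu\,x
  \quad\text{(Laplacian eigenvalue problem)} \\
m_l = k_l:
  &\quad K^{-1}A\,x = (1 - \mu)\,x
  \quad\text{(Normal eigenvalue problem)}
\end{aligned}
```

The constraint rules out the trivial solution with all x_i equal to 0; which weights m_i are used decides which of the two matrices appears.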
WORD ASSOCIATION NETWORK 4.1 The experimental data
The data are collected through a psychological experiment: participants (about 100) are given a single word as a stimulus, e.g. “House”. They must answer with the first word that comes to mind, e.g. “Family”. Answers are later given as new stimuli, so that a network of average associations forms.
[Figure: a path from “Volcano” to “Ache”]
Steyvers, M. and Tenenbaum, J.B. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.
WORD ASSOCIATION NETWORK 4.1 The experimental data
The number of connections (i.e. the degree of nodes) is power-law distributed.
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. (2005). Detecting communities in large networks. Physica A, 352, 669–676.
WORD ASSOCIATION NETWORK 4.2 The community structure
Therefore we expect similar words to be on the same plateau. We can measure the correlation between the values of various vertices averaged over 10 different eigenvectors.
WIKIPEDIA 5.1 Introduction
http://www.wikipedia.org
WIKIPEDIA 5.1 Introduction
A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopedia Britannica.
• Among 42 entries tested, the difference in accuracy was not particularly great:
• the average science entry in Wikipedia contained around four inaccuracies;
• the one in Britannica, about three.
• On the other hand, the articles on Wikipedia are longer on average than those of Britannica, so the rate of errors relative to article length is lower in Wikipedia.
WIKIPEDIA 5.2 The network properties
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13, 2004. We are not using the current data due to disk space restrictions. The English dataset of June 2005 is more than 36 GB compressed, that is, about 200 GB uncompressed.
WIKIPEDIA 5.2 The network properties
The degree distribution shows fat tails that can be approximated by a power-law function of the kind P(k) ~ k^(-γ), where the exponent is the same for both in-degree and out-degree.
In the case of the WWW, 2 ≤ γ_in ≤ 2.1
[Figure] In-degree (empty) and out-degree (filled) occurrence distributions for the wikigraph in English (o) and Portuguese ().
Capocci, A., et al. (2006). Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74, 036116
WIKIPEDIA 5.2 The network properties
As regards the assortativity (as measured by the average degree of the neighbours of a vertex with degree k), there is no evidence of any assortative behaviour.
[Figure] The average neighbors’ in-degree, computed along incoming edges, as a function of the in-degree for the English (o) and Portuguese () wikigraphs.
WIKIPEDIA 5.3 The growth of Wikipedia
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices (whose degree is k) acquiring new connections at time t. This quantity is weighted by the factor N(t)/n(k,t).
We find preferential attachment for in- and out-degree.
[Figure] English (o) and Portuguese (). White = in-degree, filled = out-degree.
WIKIPEDIA 5.4 The communities in Wikipedia
Taxonomy: the categorization provided imposes a taxonomy on the pages.
WIKIPEDIA 5.4 The Communities in Wikipedia
Given different wikigraphs, one can compute the frequency of the category sizes in the various systems.
WIKIPEDIA 5.4 The Communities in Wikipedia
Similarly, the cluster size frequency distribution (computed with the MCL algorithm) can also be considered.
Qualitatively, rather good agreement. But are they the same?
WIKIPEDIA 5.4 The Communities in Wikipedia
NOT REALLY! The power-law shape is probably a very common feature of any categorization.
SUMMARY
• Communities represent an important categorization of graphs.
• Methods to detect them vary according to the specific case study:
  • SMALL GRAPHS (motifs, clustering coefficient)
  • LARGE GRAPHS
    • FUNCTION OF VERTICES (HITS, Vertex Similarity)
    • CENTRALITY (Girvan-Newman algorithm)
    • DIFFUSION ON THE GRAPH
      • MCL Algorithm
      • Spectral analysis of the stochastic matrices associated with the graph
SHAMELESS ADVERTISEMENT
http://www.complexnetworks.net