ELSEVIER Computational Statistics & Data Analysis, February 2007. Genetic clustering of social networks using random walks. Aykut Firat, Sangit Chatterjee, Mustafa Yilmaz. College of Business Administration, Northeastern University, Boston, MA 02115, USA. Presented by Oleg Kolgushev.
ELSEVIER • Computational Statistics & Data Analysis • February 2007 • Genetic clustering of social networks using random walks • Aykut Firat, Sangit Chatterjee, Mustafa Yilmaz • College of Business Administration, Northeastern University, Boston, MA 02115, USA • Presented by Oleg Kolgushev
Presentation: Genetic clustering of social networks using random walks
Computational Epidemiology Research Lab (CERL) - Department of Computer Science and Engineering - University of North Texas - 2011/03/21
Contents
• Introduction to clustering in networks
• Random walk based distance measure
• Genetic representation
• Experiments
• Synthetic data creation
• Network clustering experiments
• Spatial data experiments
• Conclusion
Introduction
• Social networks are increasingly popular.
• An exact mathematical model is a dream; heuristic techniques are used instead.
• Clustering is an NP-hard problem.
• The approach is a genetic algorithm with a medoid-based representation.
• The random walk measure is superior to Euclidean distance.
Background
• The network is represented by a weighted graph (V, E, w), where w is a measure of similarity between vertices.
• The objective is to find a decomposition into k clusters (non-overlapping sub-graphs of highly connected vertices).
• A random walker is likely to stay inside a cluster until most of its vertices have been visited.
• Calculating “escape probabilities”.
• The GA fitness function classifies a node based on the sum of its edges inside a cluster versus the sum of its edges leading to different sets.
Random walk based distance
• Average first-passage time m(i,j)
• Average commute time (ACT)
• In matrix-vector form the ACT is expressed through the following quantities: u_i = [0 1 0 0 … 0] (the indicator vector of node i), L = D - A (with D the diagonal degree matrix), A the similarity matrix (w_ij), and e the column vector [1 1 1 1 … 1].
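The quantities named on this slide fit the standard expressions for first-passage and commute times in the random-walk literature. The block below is a sketch of those standard forms, not a copy of the slide's equations. The transition probability p_{ik}, the graph volume V_G, the pseudoinverse L^{+}, and the notation n(i,j) for the ACT are assumptions introduced here; N is the number of nodes.

```latex
\begin{align}
  p_{ik} &= \frac{w_{ik}}{\sum_{l} w_{il}}, \\
  m(i,j) &= 1 + \sum_{k \neq j} p_{ik}\, m(k,j), \qquad m(j,j) = 0, \\
  n(i,j) &= m(i,j) + m(j,i), \\
  n(i,j) &= V_G\, (u_i - u_j)^{\mathsf T} L^{+} (u_i - u_j), \qquad V_G = \sum_{i} d_{ii}, \\
  L^{+}  &= \Bigl(L - \tfrac{1}{N}\, e e^{\mathsf T}\Bigr)^{-1} + \tfrac{1}{N}\, e e^{\mathsf T}.
\end{align}
```

The last line holds for a connected graph, where the only zero eigenvalue of L belongs to the direction of e.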
Random walk based distance
• This measure is appealing for social networks because clustered nodes are connected by many short paths, and clusters are neither of similar size nor spherical in shape.
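As an illustration of the ACT, here is a minimal NumPy sketch, assuming a connected, undirected graph given as a symmetric similarity matrix. It uses the standard identity ACT(i,j) = V_G (l+_ii + l+_jj - 2 l+_ij), where L+ is the Moore-Penrose pseudoinverse of L = D - A and V_G is the sum of all degrees. The function name `average_commute_time` is hypothetical and not taken from the paper.

```python
import numpy as np

def average_commute_time(A):
    """Return the matrix of average commute times for similarity matrix A.

    Assumes A is symmetric, non-negative, and describes a connected graph.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A            # graph Laplacian L = D - A
    L_plus = np.linalg.pinv(L)          # Moore-Penrose pseudoinverse of L
    volume = degrees.sum()              # V_G: sum of all node degrees
    diag = np.diag(L_plus)
    # ACT(i, j) = V_G * (l+_ii + l+_jj - 2 * l+_ij)
    return volume * (diag[:, None] + diag[None, :] - 2.0 * L_plus)

# Example: an unweighted path graph 0 - 1 - 2
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
act = average_commute_time(path)
```

The pseudoinverse costs O(n^3) on a dense matrix, so this sketch is only practical for small and medium networks.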
Genetic Representation
• A GA is a computer simulation of evolutionary processes (inheritance, mutation, selection, and crossover).
• The choice of representation is key. Options:
• An array of size N (nodes in the graph) with elements restricted to k (clusters)
• k bins with elements restricted to N (nodes)
• k-medoids: each cluster is represented by one node, and every other node is assigned to the nearest cluster
• A possible gene is [3,7] with assignment [{1,2,3,4},{5,6,7,8}].
• Small genome, tight clustering.
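The medoid-based decoding described above can be sketched as follows. This is a minimal illustration, assuming a precomputed distance matrix (for example, commute times) and 0-based node indices, so the slide's gene [3,7] becomes [2,6]. The function name `decode_medoids` and the toy positions are hypothetical.

```python
import numpy as np

def decode_medoids(dist, medoids):
    """Turn a medoid gene into clusters: each node joins its nearest medoid.

    dist is an (n, n) distance matrix; medoids is a list of node indices.
    Ties go to the medoid listed first (an arbitrary choice made here).
    """
    dist = np.asarray(dist, dtype=float)
    nearest = np.argmin(dist[:, list(medoids)], axis=1)
    return [np.flatnonzero(nearest == c).tolist() for c in range(len(medoids))]

# Toy data: eight nodes on a line, forming two well-separated groups
positions = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
dist = np.abs(positions[:, None] - positions[None, :])
clusters = decode_medoids(dist, [2, 6])
```

The genome stores only k integers, which is the "small genome" advantage noted on the slide; the full assignment is recomputed from the distance matrix whenever it is needed.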
Medoid-based representation with exception bins
• An exception bin contains nodes that do not obey the representation by the medoid.
• A possible gene [3,7] suggests the allocation [{1,2,3,4},{5,6,7,8}]; with exceptions the gene becomes [3,7,{5,6},{2}].
• Crossover is defined by randomly interchanging genes.
• Mutation creates exceptions based on proximity.
• Fitness functions used: the inverse of the sum of the distances to the medoids; the inverse of the sum of all pairwise distances within a group; the minimum sum of all pairwise distances between nodes.
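A hedged sketch of how exception bins and the first fitness function might fit together. The slide does not spell out the semantics of a bin, so the code assumes that bin c lists nodes forced into cluster c regardless of which medoid is nearest. Indices are 0-based, so the slide's gene [3,7,{5,6},{2}] becomes medoids [2,6] with bins [[4,5],[1]]. All function names are hypothetical.

```python
import numpy as np

def assign_with_exceptions(dist, medoids, exception_bins):
    """Label each node with a cluster index, honouring exception bins.

    Assumption: exception_bins[c] holds nodes forced into cluster c.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.argmin(dist[:, list(medoids)], axis=1)  # nearest-medoid rule
    for cluster, bin_nodes in enumerate(exception_bins):
        for node in bin_nodes:
            labels[node] = cluster                      # override for exceptions
    return labels

def medoid_fitness(dist, medoids, labels):
    """Inverse of the summed distance from each node to its cluster's medoid."""
    dist = np.asarray(dist, dtype=float)
    medoids = np.asarray(medoids)
    total = dist[np.arange(len(labels)), medoids[labels]].sum()
    return 1.0 / total  # undefined when total is 0 (every node is a medoid)

# Toy data: eight nodes on a line, forming two well-separated groups
positions = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
dist = np.abs(positions[:, None] - positions[None, :])

plain = assign_with_exceptions(dist, [2, 6], [[], []])
forced = assign_with_exceptions(dist, [2, 6], [[4, 5], [1]])
fit_plain = medoid_fitness(dist, [2, 6], plain)
fit_forced = medoid_fitness(dist, [2, 6], forced)
```

On this toy data the exceptions lower the fitness, because the two groups really are compact around their medoids. Exceptions would only pay off where the nearest-medoid rule misassigns nodes.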
Experiments
• How accurate are the clustering results compared to Euclidean-distance clustering?
• How efficient is this approach, and what is the algorithm's complexity?
• Synthetic data creation:
Network clustering experiments
• Example of a 50-node network with 6 clusters shown.
Spatial data experiments
• Results of the transformation and clustering of 150 iris specimens, 50 from each of three species (Fisher’s Iris data).
Conclusion
• The O(n^3) complexity limits the applicability of random walk distances to large networks.
• Excellent results when the number of clusters is known. What k is right?
• Superior results compared to Euclidean distances, regardless of the clustering algorithm used.
• Exceptionally good clustering results when spatial data are represented as a network and the optimum number of nearest neighbors is used.
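The last point depends on turning spatial data into a network. One common way to do that is a k-nearest-neighbor graph, sketched below under stated assumptions: edges are unweighted, and the graph is symmetrized by keeping an edge if either endpoint lists the other as a neighbor. The slides do not say which construction or weighting the paper used, and the function name `knn_similarity_graph` is hypothetical.

```python
import numpy as np

def knn_similarity_graph(points, k):
    """Build a symmetric 0/1 adjacency matrix linking each point to its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))     # pairwise Euclidean distances
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist[i])
        neighbours = [j for j in order if j != i][:k]
        A[i, neighbours] = 1.0
    return np.maximum(A, A.T)                   # keep an edge if either side chose it

# Toy data: four points on a line with no tied distances
points = [[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0]]
A = knn_similarity_graph(points, k=1)
```

The resulting matrix can serve as the similarity matrix A for the random walk distance. If k is too small the graph can become disconnected, which is one reason the number of neighbors matters.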