DEMON: A Local-first Discovery Method for Overlapping Communities
Giulio Rossetti 2,1, Michele Coscia 3, Fosca Giannotti 2, Dino Pedreschi 2,1
1 Computer Science Dep., University of Pisa, Italy {rossetti,pedre}@di.unipi.it
2 ISTI - CNR KDDLab, Pisa, Italy {fosca.giannotti, giulio.rossetti}@isti.cnr.it
3 Harvard Kennedy School, Cambridge, MA, US michele_coscia@hks.harvard.edu
April 23rd, 2013
Outline
• Problem Definition
• What is a community?
• Community Discovery
• Communities and complex (social) networks
• A matter of perspective
• DEMON Algorithm(s)
• Properties
• Experiments
• Extension
• Conclusions
What is a community?
Unfortunately, there is no universally shared definition of what a community is. A general idea is that a community represents:
"A set of entities where each entity is closer, in the network sense, to the other entities within the community than to the entities outside it."
or
"A set of nodes more tightly connected with each other than with nodes belonging to other sets."
Community Discovery
The aim of CD algorithms is to identify communities hidden in the structure of complex networks.
Why Community Discovery?
• "Cluster" homogeneous nodes relying on topological information
• (Clustering networked entities)
Major Problems:
• Each algorithm models different properties of real-world communities
• Comparison and evaluation of different methodologies is not trivial
• Finding an acceptable compromise between the number of communities and their sizes
• Context dependency
Community Discovery Approaches
Given the complexity of the problem, a number of different types of approaches were proposed, analyzing:
• Directed/Undirected edges
• Weighted/Unweighted edges
• Top-Down/Bottom-Up partitioning
• Multidimensionality
• Overlap among communities
• Hierarchical communities
• …
DEMON: Undirected, Bottom-Up, Overlapping (with Directed, Weighted, Hierarchical extensions)
Communities in (Social) Networks
• Communities can be seen as the basic bricks of a (social) network
• In simple, small networks it is easy to identify them by looking at the structure...
…but real-world networks are not "simple"
• We can't easily identify different communities
• Too many nodes and edges
A Matter of Perspective
• The only difference is in the scale
• Locally, for each node, the structure makes sense
• Globally, we are tangled in complex overlaps
Idea: a bottom-up approach!
Reducing the complexity
Real networks are complex objects. Can we make them "simpler"?
Ego-Networks (networks built upon a focal node, the "ego", and the nodes to which the ego is directly connected, plus the ties, if any, among the alters)
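The ego-network definition above can be sketched in a few lines. This is only an illustration: the adjacency-dict representation, the function name `ego_network` and the `include_ego` flag are assumptions, not part of the DEMON code base.

```python
def ego_network(adj, ego, include_ego=True):
    """Return the ego network of `ego` as an adjacency dict.

    `adj` maps each node to the set of its neighbors (undirected graph).
    With include_ego=False the focal node is dropped and only the alters
    and the ties among them remain.
    """
    nodes = set(adj[ego])          # the alters
    if include_ego:
        nodes.add(ego)
    # Keep only the edges whose endpoints are both inside the ego network.
    return {u: adj[u] & nodes for u in nodes}


# Toy graph: a-b-c form a triangle, d hangs off a, e hangs off d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
with_ego = ego_network(adj, "a")
without_ego = ego_network(adj, "a", include_ego=False)
```

Dropping the ego leaves `b` and `c` tied together and `d` isolated, which is the kind of local structure a per-ego clustering step can work on.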
DEMON Algorithm
• For each node n:
  • Extract the ego network of n
  • Remove n from the ego network
  • Perform a Label Propagation¹
  • Insert n in each community found
  • Update the raw community set C
• For each raw community c in C:
  • Merge it with "similar" ones in the set, given a threshold (i.e. merge iff at most ε% of the smaller one is not included in the bigger one)
¹ Usha N. Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E
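The merge rule in the last step can be sketched as follows. This is a minimal, greedy reading of the rule, not the authors' implementation: the function name `merge_communities` is hypothetical, and ε is assumed to be expressed as a fraction in [0, 1] rather than a percentage.

```python
def merge_communities(raw, epsilon):
    """Greedily merge raw communities (sets of nodes).

    Two communities are merged iff at most a fraction `epsilon` of the
    smaller one is not included in the bigger one.
    """
    merged = []
    for community in raw:
        current = set(community)
        changed = True
        while changed:              # re-scan: a merge may enable further merges
            changed = False
            for other in merged:
                small, big = sorted((current, other), key=len)
                if len(small - big) <= epsilon * len(small):
                    current |= other
                    merged.remove(other)
                    changed = True
                    break           # restart the scan on the modified list
        merged.append(current)
    return merged


raw = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
strict = merge_communities(raw, 0.0)    # only exact inclusion merges
loose = merge_communities(raw, 0.34)    # one node out of three may differ
```

With ε = 0 the three raw communities stay separate; with ε = 0.34 the second one is absorbed by the first because only one of its three nodes falls outside it.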
Label Propagation – The idea
• Each node has a unique label (i.e. its id)
• In the first (setup) iteration each node, with probability α, changes its label to one of the labels of its neighbors
• At each subsequent iteration each node adopts as its label the one shared (at the end of the previous iteration) by the majority of its neighbors
• We iterate until consensus is reached
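A minimal sketch of the idea, in the spirit of Raghavan et al. Several choices here are assumptions rather than the variant on the slide: updates are asynchronous in random order, ties are broken at random, the α-controlled setup step is omitted, and the names `label_propagation`, `max_iter` and `seed` are hypothetical.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Cluster an undirected graph given as {node: set_of_neighbors}."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}   # every node starts with its own id
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)                  # random visiting order
        changed = False
        for node in nodes:
            if not adj[node]:
                continue                    # isolated nodes keep their label
            counts = Counter(labels[nb] for nb in adj[node])
            top = max(counts.values())
            best = [lab for lab, c in counts.items() if c == top]
            # Switch only if the current label is not already a majority label.
            if labels[node] not in best:
                labels[node] = rng.choice(best)
                changed = True
        if not changed:                     # consensus reached
            break
    communities = {}
    for node, lab in labels.items():
        communities.setdefault(lab, set()).add(node)
    return list(communities.values())


# Two disconnected triangles: each one ends up with a single label.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
result = label_propagation(adj)
```

Keeping the current label when it is tied for the majority is what lets the loop stop; without that rule, ties could flip back and forth indefinitely.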
Label Propagation – Discussion
• Why Label Propagation?
  • Quasi-linear algorithm
  • Shares our idea of what a community is
• Problem:
  • Ping-pong effect (the algorithm is non-deterministic)
• Solution:
  • Multiple labels allowed (we need overlapping communities after all…)
DEMON - Two nice properties
• Incrementality: Given a graph G, an initial set of communities C and an incremental update ∆G consisting of new nodes and new edges added to G, where ∆G contains the entire ego networks of all new nodes and of all the preexisting nodes reached by new links, then
  DEMON(∆G ∪ G, C) = DEMON(∆G, DEMON(G, C))
• Compositionality: Consider any partition of a graph G into two subgraphs G1, G2 such that, for any node v of G, the entire ego network of v in G is fully contained in either G1 or G2. Then, given an initial set of communities C:
  DEMON(G1 ∪ G2, C) = Max(DEMON(G1, C), DEMON(G2, C))
These properties make the algorithm highly parallelizable: it can run independently on different fragments of the overall network, with a relatively small amount of work needed to combine the results.
Experiments
• Networks (with metadata):
  • Congress (nodes: US politicians, connected if they co-sponsor the same bills)
  • IMDb (nodes: actors, connected if they play in the same movies)
  • Amazon (nodes: products, connected if they were purchased together)
• Compared algorithms:
  • Infomap, non-overlapping state of the art
    • Rosvall and Bergstrom, "Maps of random walks on complex networks reveal community structure", PNAS, 2008
  • HLC, overlapping state of the art
    • Ahn, Bagrow and Lehmann, "Link communities reveal multiscale complexity in networks", Nature, 2010
Quality Evaluation – Community size
• Number of communities
• Average community size
[Figure: results on the Amazon network]
Quality Evaluation - Label Prediction
• Multilabel classifier (BRL, Binary Relevance Learner)
• The community memberships of a node are the known attributes; its real-world labels (qualitative attributes) are the target to be predicted
[Figures: results on the Congress and IMDb networks]
Quality Evaluation - Community Cohesion
• How good is our community partition at describing real-world knowledge about the clustered entities?
• "Similar nodes share more qualitative attributes than dissimilar nodes"
Iff CQ(P) > 1, we are grouping together similar nodes
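The slide does not give the formula for CQ(P). One plausible reading, sketched below purely as an assumption, is the ratio between the average attribute similarity of node pairs that share a community and the average over all node pairs, with Jaccard similarity on the attribute sets. The function names and the all-pairs baseline are hypothetical choices.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two attribute sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def community_quality(communities, attributes):
    """Assumed CQ(P): mean similarity inside communities / mean over all pairs.

    `communities` is a list of node sets, `attributes` maps node -> set of
    qualitative attributes. Raises ZeroDivisionError if there are no pairs
    or if no two nodes share any attribute.
    """
    within = set()
    for community in communities:
        # A pair counts once even if the two nodes share several communities.
        within.update(combinations(sorted(community), 2))
    all_pairs = list(combinations(sorted(attributes), 2))
    avg_within = sum(jaccard(attributes[u], attributes[v])
                     for u, v in within) / len(within)
    avg_all = sum(jaccard(attributes[u], attributes[v])
                  for u, v in all_pairs) / len(all_pairs)
    return avg_within / avg_all


attributes = {1: {"x"}, 2: {"x"}, 3: {"y"}, 4: {"y"}}
good = community_quality([{1, 2}, {3, 4}], attributes)   # groups like with like
bad = community_quality([{1, 3}, {2, 4}], attributes)    # groups unlike nodes
```

Under this reading, a value above 1 means that nodes placed together are more alike than a randomly chosen pair, matching the criterion stated on the slide.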
HDemon – Hierarchical merge
Why hierarchical merge?
• The classic DEMON merge function did not scale well
  • Complexity issue (~O(|C|²))
  • Bottleneck for huge networks (such as social graphs)
• We need to find the right granularity for the communities
• Extensions needed for the Label Propagation algorithm:
  • Weighted networks
  • Directed networks (not yet used)
HDemon – Hierarchical merge
HDemon(Graph G)
  Cc ← |connectedComponent(G)|
  C ← ExtractCommunities(G)
  while (|C| > Cc)
    For c in C:
      N ← N ∪ make_node(c)
    For (n,m) in N:
      If n shares nodes with m:
        E ← E ∪ (n,m)
    C ← ExtractCommunities(new Graph(N,E))
ExtractCommunities(Graph G)
  Egos ← EgoNetworks(G)
  for e in Egos:
    C ← C ∪ LabelPropagation(e)
  return C
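One level of the hierarchy, i.e. turning the current communities into the nodes of a new graph, could look like the sketch below. The function name `community_graph` is hypothetical, and weighting each edge by the number of shared members is an assumption suggested by the need for a weighted Label Propagation, not something the slides specify.

```python
from itertools import combinations


def community_graph(communities):
    """Build the next-level graph from a list of communities (sets of nodes).

    Each community becomes a node, identified by its index in the list.
    Two community-nodes are linked iff they share at least one member;
    the edge weight is the number of shared members.
    """
    nodes = list(range(len(communities)))
    edges = {}
    for i, j in combinations(nodes, 2):
        shared = len(communities[i] & communities[j])
        if shared:
            edges[(i, j)] = shared
    return nodes, edges


communities = [{1, 2, 3}, {3, 4}, {5, 6}]
nodes, edges = community_graph(communities)
```

Feeding this graph back into the community extraction step and repeating until the number of communities reaches the number of connected components gives the loop in the pseudocode above.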
Future works – Framework structure
Different scenarios may require alternative "definitions" of community.
• Framework for bottom-up (and overlapping) CD
• Regular vs. Hierarchical
HFDemon(Graph G)
  Cc ← |connectedComponent(G)|
  C ← ExtractCommunities(G)
  while (|C| > Cc)
    For c in C:
      N ← N ∪ make_node(c)
    For (n,m) in N:
      If n shares nodes with m:
        E ← E ∪ (n,m)
    C ← ExtractCommunities(new Graph(N,E))
ExtractCommunities(Graph G)
  C ← <OverlappingCD(G)>
  return C
FDemon(Graph G)
  C ← <OverlappingCD(e)>
  Forall c in C(v)
    C ← Merge(C, c, merging_function)
  return C
Future Works – Social Community Evolution
Thesis proposal: "Evolution in Social Networks"
Idea:
• Social networks are not static objects
• Nodes and edges can appear and disappear
• The same interaction could occur multiple times
• Communities change as a consequence
Major Problems:
• Size and granularity of the communities heavily influence evolutionary models
  • Hierarchical merging?
• Which nodes are prone to leave or join a community?
  • Role identification
• How "strong" is a community?
  • Community strength measure
  • Community life-cycle
Conclusions
• DEMON approaches the community discovery problem through the analysis of simple network sub-structures (ego-networks)
• Overlapping and Hierarchical algorithms are guided by a social perspective
• In the experiments presented, DEMON outperforms the state-of-the-art methods it was compared with (Infomap, HLC)
• Possible parallel implementation: high scalability
Thanks! Questions? Code available @ http://kdd.isti.cnr.it/~giulio/demon/ (extensions coming soon!)