1 / 34

V4 Matrix algorithms and graph partitioning

V4 Matrix algorithms and graph partitioning. Community detection Simple modularity maximization Spectral modularity maximization Division into more than two groups Other algorithms for community detection. Community detection. Basic goal of community detection :

tatum
Download Presentation

V4 Matrix algorithms and graph partitioning

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. V4 Matrix algorithmsandgraphpartitioning • Community detection • Simple modularitymaximization • Spectralmodularitymaximization • Division intomorethantwogroups • Other algorithmsforcommunitydetection Mathematics of Biological Networks

  2. Community detection Basic goalofcommunitydetection: Wewantto separate thenetworkintogroupsofverticesthathavefewconnectionsbetweenthem. The importantdifferencetographpartitioningisthat thenumberorsizeofthegroupsisnot fixedanymore. Simplesttask: dividegraphinto 2 groupsorcommunities but withoutanyconstraint on thesizeofthegroups. First idea: choosedivisionwithminimumcutsize. BUT this will not work. An optimal solutionwouldbetosimplyput all vertices intoonegroupandnoneintotheotherone. ThenthecutsizeR iszero. Mathematics of Biological Networks

  3. Community detection Another waywouldbetoimposesomelooseconstraint on thesizesofthegroupsn1andn2. Oneexampleofthiskindisratiocutpartitioningwhere oneminimizestheratio. The denominatorn1 n2hasitslargestvaluewhenbothgroupsareequalsize. 1  9 = 9 This biasestheoptimizationalwaystosolutions 2  8 = 16 wherethegroupshaveroughlyequalsize. 3  7 = 21 4  6 = 24 However, thereis not particularreason 5  5 = 25 behind such an optimizationprinciple. 6  4 = 24 … Mathematics of Biological Networks

  4. Community detection What different measurescouldbeusedtoquantifythequalityof a divisionbesidesthe simple cutsizeoritsvariants? A gooddivisionisonewheretherearefewerthanexpectededgesbetweengroups. → applythemodularitymeasureusedforassortativemixing(see V2). Mathematics of Biological Networks

  5. Review (V2): Quantify assortative mixing Find thefractionofedgesthatrunbetweenverticesofthe same type andsubtractfromthisthefractionofedgeswewouldexpectifedges werepositioned at randomwithoutconsideringthevertex type. ci : classor type ofvertexi , ci [1 … nc] nc: total numberofclasses The total numberofedgesbetweenverticesofthe same type is Here (m,n) istheKroneckerdelta ( is 1 ifm = nand 0 otherwise). The factor ½ accountsforthefactthateveryvertex pair i,jiscounted twice in thesum. Mathematics of Biological Networks

  6. (Review V2): Quantify assortative mixing As expectednumberofedgesbetween all pairsofverticesofthe same type wederived wherethefactor ½ avoids double-countingvertexpairs. Takingthedifferencebetweentheactualandexpectednumberofedgesgives = Typicallyonedoes not calculatethenumberof such edges but thefraction, whichisobtainedbydividingthisbym This quantity Q iscalledthemodularity. Mathematics of Biological Networks

  7. ModularitymaximizationbyKernighan-Lin algorithm We will design an analog oftheKernighan-Lin algorithmwhereweare not requiredtoswappairsofvertices at everystep. Insteadwenowswapsinglevertices. • At eachstep, weselectthevertexwhosemovementwouldmostincrease, or least decrease, themodularity. In a fullcycle, eachvertexismovedexactlyonce. • Thengo back overthestatesthroughwhichthenetworkhaspassedandselecttheonewiththehighestmodularity. • Usethisstateasstartingconditionforanotherroundofthealgorithm. • Repeat thisuntilthemodularitynolongerimproves. The complexityisnowlower, O(mn), becausewhenmovingsingleverticesweonlyhavetoconsiderO(m) possiblemoves at eachstep, in contrasttoO(m2)pairs. Mathematics of Biological Networks

  8. Example: „karateclub“ networkof Zachary ExampleforapplicationofKernighan-Lin modularitymaximization. Pattern offriendshipbetween 34 membersof a karateclub at a US university. At a laterpoint in time, adisputearosebetweenthemembersoftheclubwhethertheclub‘sfeesshouldberaised. This ledto a splittingupoftheclubintotwopartswith 16 and 18 memberseach. The „true“ groups after thedivisionarecoloredblackandwhite in thefiguure. The communitiesdetectedbymodularitymaximizationcorrespondalmostperfectlytotheformedgroups. Mathematics of Biological Networks

  9. Spectralmodularitymaximization There is also an analog ofthespectralgraphpartitioningalgorithmforcommunitydetection. The modularityof a divisioncanberewritteninto: with whereBijhastheproperty Sinceeveryedge in an undirectedgraphhas 2 ends, m edgeshave 2mends. Mathematics of Biological Networks

  10. Spectralmodularitymaximization We will startagainbydividingthenetworkinto 2 partsanddefine Then Substitutingthisintothepreviousequationgives wherewehaveused In matrixnotation, wecanwritethisas B is also calledthemodularitymatrix. Mathematics of Biological Networks

  11. Spectralmodularitymaximization This equationissimilartotheformerexpression in V3 forthecutsizeof a network in termsofthegraphLaplacian. → wecanderive a spectralalgorithmforcommunitydetectionthatiscloselyanalogoustothespectralpartioningmethod. Wewishto find a divisions ofthenetworkthatmaximizesthemodularity Q for a givemodularitymatrixB. The elementsofsareconstrainedtotakevaluesof +1 and -1. Contrarytobefore, thenumberofelementswith +1 and -1 valuesisnow not fixed. As before, we will relax theconstraintthats must pointintothecornersof a hypercubeandallowittopoint in anydirectionsubjecttotheconstraintthatitslengthiskeptthe same: Mathematics of Biological Networks

  12. Spectralmodularitymaximization To find themaximumofwedifferentiateandimposetheconstraintwith a singleLagrangianmultiplier This givesor in matrixnotationB s =  s In otherwords, sisoneoftheeigenvectorsofthemodularitymatrix. Withthissolution, themodularitythenbecomes Formaximummodularity, weshouldchoosestobetheeigenvectoru1correspondingtothelargesteigenvalueofthemodularitymatrix. Mathematics of Biological Networks

  13. Spectralmodularitymaximization But asbefore, wetypicallycannotchooses = u1sincetheelementsofsaresubjecttotheconstraintsi = 1. Insteadwe will do thebestwecanandchoosesasclosetou1aspossiblebymaximizingtheirscalarproduct: where [u1]iisthei-thelementofu1. The maximumisachievedwheneachterm in thesumis non-negative, i.e. when If a vectorelementisexactlyzero, eithervalueofsiisequallygood. Mathematics of Biological Networks

  14. Spectralmodularitymaximization This yieldsthefollowing simple algorithm: • Calculatetheeigenvectorofthemodularitymatrixcorrespondingtothelargest (most positive) eigenvalue • Thenassignverticestocommunitiesaccordingtothesignsofthevectorelements, positive signs in onegroupand negative signs in theother. In practice, thisworksverywell. E.g. thekarateclubisperfectlyclassified. In practicalapplications, itisworthwhiletousespectralmodularitymaximizationas a firststep, followedbytheKernighan-Lin methodtogetsomesmallfurtherimprovements. Mathematics of Biological Networks

  15. Division intomorethan 2 groups In general, networkshave an arbitrarynumberofcommunities. Modularityissupposedtobelargestforthebestdivisionofthenetwork. As firstmethod, wecouldstartbydividingthenetworkinto 2 parts, andthenfurthersubdividethosepartsintosmallerones, and so forth. However, themodularityofthecompletenetworkdoes not break up (asthecutsizedoes) intoindependentcontributionsfromthe separate communities. The individual maximizationofthemodularitiesofthesecommunitiestreatedas separate networks will generally not producethemaximummodularityforthenetworkas a whole. Mathematics of Biological Networks

  16. Division intomorethan 2 groups We must considerexplicitlythechangeQ in themodularityoftheentirenetwork upon furtherbisecting a communitycofsizenc. Bycomparingthemodularity after andbeforethedivision, we find (withoutderivation) B(c)isthenc  ncmatrixwithelements Mathematics of Biological Networks

  17. Division intomorethan 2 groups The equationhasthe same form asbefore. Thus, wecanapplyourspectralapproachtothisgeneralizedmodularitymatrixtomaximizeQ, find theleadingeigenvectoranddividethenetworkaccordingtothesignsofitselements. In repeatedlysubdividing a network in thisway, an importantquestioniswhentostop. The answerissimplywhenweareunableto find a (further) divisionwith a positive changeQ in themodularity. In thatcase, thebisectionalgorithm will put all vertices in oneofits 2 groupsandnone in theother, effectively „refusing“ tofurthersubdividethenetwork. At thispointweshouldstop. Mathematics of Biological Networks

  18. Division intomorethan 2 groups The repeatedbisectionmethodworkswell in manysituations, but itisbynomeansperfect. A particularproblemisthatthereisnoguaranteethatthebestdivisionof a network in, say 3 parts, canbefoundbyfirstfindingthebestdivisioninto 2 partsandthensubdividingoneofthem. • showsthebestsubdivisionofthis linear graphwith 8 verticesand 7 edgesintotwogroupswith 4 verticeseach. • Shows thebestsubdivisioninto 3 groupswith 3, 2, 3 verticeseach. A repeatedbisectionalgorithmwouldnever find solution (b). Mathematics of Biological Networks

  19. Other algorithmsforcommunitydetection An alternative wayoffindingcommunitiesofvertices in a networkistoloookfortheedgesthatliebetweencommunities. Ifwecan find andremovetheseedges, we will beleftwithisolatedcommunities. Onecommonwaytodefine „betweenness“ istousebetweennesscentrality(V1). Mathematics of Biological Networks

  20. Review (V2): Betweenness Centrality Letusassume an undirectednetworkforsimplicity. Letnstibe 1 ifvertexi lies on thegeodesicpathfromstot and 0 ifitdoes not orifthereisno such path. Thenthebetweennesscentralityxiisgivenby This definitioncountsseparatelythegeodesicpaths in eitherdirectionbetween eachvertex pair. The equation also includespathsfromeachvertextoitself. Excludingthosewould not changetherankingofthevertices in termsofbetweenness. Also, weassumethatverticess andt belongtopathsbetweensandt. Iftherearetwogeodesicpathsofthe same lengthbetween 2 vertices, eachpathgets a weightequaltothe inverse ofthenumberofpaths. Vertices A and B are connectedby 2 geodesicpaths. Vertex C lies on both paths. Mathematics of Biological Networks

  21. Review (V2): Betweenness Centrality Wemayredefinenstitobethenumberofgeodesicpathsfromstotthat pass through vertexi, anddefinegsttobethe total numberofgeodesicpathsfromstot. Then, thebetweennesscentralityofxiis whereweadopttheconventionthatnsti/ gst = 0 ifbothnstiandgstarezero. In thissketchof a network, vertex A lies on a bridgejoiningtwogroupsofothervertices. All pathsbetweenthegroups must pass through A, so ithas a high betweennesseventhoughitsdegreeislow. Mathematics of Biological Networks

  22. Useedgebetweennessforcommunitydetection Defineedgebetweennessasthenumberofgeodesicpathsthatrunalongparticularedges. Weexpectthatedgesthatliebetweencommunities will have high valuesofthisedgebetweenness. In thisexample, twoedgesrunbetween thevertices in thetwodashedcircles. All shortestpathsbetweenverticesof thetwogroups will runalongoneof thesetwoedges. Edge betweennessiscomputedbydeterminingthegeodesicpathsbetweenevery pair ofvertices in thenetworkandcounthowmany such pathsgoalongeachedge. This takesO(n(m + n)). Mathematics of Biological Networks

  23. Betweennessalgorithm • Calculatebetweeness score of all edges • Find theedgewiththehighest score andremove it. Becauseremovingthisedge will changethebetweennessscoresofsomeedges, anyshortestpathsthatpreviouslytraversedtheremovededge will nowhavetobererouted. Thus wehavetogo back tostep 1. The progressofthealgorithmcanberepresentedusing a tree. The progressive fragmentationofthe networkasedgedareremovedonebyone isrepresentedbythesuccessivebranching ofthetree. Ifwestop at thedashedline, the networkissplitinto 4 groupsof 6, 1, 2, and 3 vertices. Mathematics of Biological Networks

  24. Hierarchicalclustering Network divisionbyedgebetweennessproduced a dendrogramthatisremiscentofclusteringmethods. Hierarchicalclusteringis an entireclassofalgorithms. Itis an agglomerativetechniquewherewestartwiththe individual verticesof a networkandjointhemtogetherto form groups. Weneed a measureofvertexsimilarity. Forthis, wecanusethemeasuresintroduced in V1 and V2, e.g. • cosinesimilarity, • correlationcoefficientsbetweenrowsoftheadjacencymatrix, • orthe so-calledEuclidiandistance. Havingmanychoicesforsimilaritymeasuresisboth a strengthand a weaknessofhierarchicalclusteringmethods. Itgivesthemethodflexibility, but theresults will differfromoneanother. Mathematics of Biological Networks

  25. Hierarchicalclustering Once a similarityischosenwecalculatethesimilarityof all pairsofvertices in thenetwork. Thenwewould like toconnectthe mostsimilarvertices. However, theremaybeconflicting situations. Should A and C be in the same groupor not? The basicstrategyofhierarchicalclusteringistostartbyjoiningthosepairsofverticeswiththehighestsimilarities. These then form groupsofsize 2. This stepinvolvesnoambiguity. Thenwefurtherjointogetherthegroupsthataremostsimilarto form larger groups, andso on. Mathematics of Biological Networks

  26. Hierarchicalclustering Duringthisprocesswenowrequire a measureforthesimilarityofgroups. Thereare 3 commonwaysofcombiningvertexsimilaritiestogivesimilarityscoresforgroups: - single-linkage - complete-linkage - average-linkageclustering. Consider 2 groupsofvertices, group 1 andgroup 2, withn1andn2vertices, respectively. Thentherearen1n2pairsofvertices such thatonevertexis in group 1 andtheother in group 2. In thesingle-linkageclusteringmethod, thesimilaritybetweenthe 2 groupsisdefinedasthesimilarityofthemostsimilarofthesen1n2pairsofvertices. Mathematics of Biological Networks

  27. Hierarchicalclustering In thesingle-linkageclusteringmethod, thesimilaritybetweenthe 2 groupsisdefinedasthesimilarityofthemostsimilarofthesen1n2pairsofvertices. As theother extreme, complete-linkageclusteringdefinesthesimilaritybetweenthe 2 groupsasthesimilarityoftheleast similarpair ofvertices. In betweenthesetwo extremes istheaverage-linkageclusteringwherethesimilarityoftwogroupsisdefinedtobethemeansimilarityof all pairsofvertices. Mathematics of Biological Networks

  28. Hierarchicalclustering This ishowhierarchicalclusteringisdone: • Choose a similaritymeasureandevaluateitfor all vertexpairs. • Assigneachvertexto a groupofitsown, consistingof just thatonevertex. The initial similaritiesofthegroupsaresimplythesimilaritiesofthevertices • Find the pair ofgroupswiththehighestsimilarityandjointhemtogetherinto a singlegroup. • Calculatethesimilaritybetweenthenewcompositegroupand all othersusingoneofthe 3 methodsabove (single, complete, average-linkage) • Repeat fromstep 3 until all verticeshavebeenjoinedinto a singlegroup. Mathematics of Biological Networks

  29. Hierarchicalclusteringappliedtothekarateclub Partitioningofthekarateclubnetworkbyaveragelinkagehierarchicalclusteringusingcosinesimilarityasourmeasureofvertexsimilarity. Forthisexample, hierarchicalclusteringfoundtheperfectdivisionoftheclub. However, hierarchicalclusteringdoes not alwaysworkaswellashere. Mathematics of Biological Networks

  30. Comparisonofmodularitymaximizationmethods A large numberofapproacheshavebeendevelopedtomaximizemodularityfordivisionsintoanynumberofcommunitiesofanysizes. Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks

  31. Comparisonofmodularitymaximizationmethods One way to test the sensitivity of these methods is to see how well a particular method performs when applied to ad hoc networks with a well known, fixed community structure. Such networks are typically generated with n = 128 nodes, split into 4 communities containing 32 nodes each. Pairs of nodes belonging to the same community are linked with probability pin whereas pairs belonging to different communities are joined with probability pout. The value of pout is taken so that the average number of links a node has to members of any other community, zout, can be controlled. While pout(and therefore zout) is varied freely, the value of pin is chosen to keep the total average node degree, k constant, and set to 16. Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks

  32. Comparisonofmodularitymaximizationmethods As zout increases, the communities become more and more diffuse and harder to identify, (see figure). Since the “real” community structure is well known in this case, it is possible to measure the number of nodes correctly classified by the method of community identification. Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks

  33. Other modularitymaximizationmethods Oneofthemostsuccessfulapproachesissimulatedannealing. The process begins with any initial partition of the nodes into communities. At each step, a node is chosen at random and moved to a different community, also chosen at random. If the change improves the modularity it is always accepted, otherwise it is accepted with a probability exp(Q/kT). The simulation will start at high temperature T and is then slowly cooled down. Several improvements have been tested. Firstly, the algorithm is stopped periodically, or quenched, and Q is calculated for moving each node to every community that is not its own. Finally, the move corresponding to the largest value of Q is accepted. Mathematics of Biological Networks

  34. Comparisonofmodularitymaximizationmethods Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks

More Related