340 likes | 500 Views
V4 Matrix algorithms and graph partitioning. Community detection Simple modularity maximization Spectral modularity maximization Division into more than two groups Other algorithms for community detection. Community detection. Basic goal of community detection :
E N D
V4 Matrix algorithmsandgraphpartitioning • Community detection • Simple modularitymaximization • Spectralmodularitymaximization • Division intomorethantwogroups • Other algorithmsforcommunitydetection Mathematics of Biological Networks
Community detection Basic goalofcommunitydetection: Wewantto separate thenetworkintogroupsofverticesthathavefewconnectionsbetweenthem. The importantdifferencetographpartitioningisthat thenumberorsizeofthegroupsisnot fixedanymore. Simplesttask: dividegraphinto 2 groupsorcommunities but withoutanyconstraint on thesizeofthegroups. First idea: choosedivisionwithminimumcutsize. BUT this will not work. An optimal solutionwouldbetosimplyput all vertices intoonegroupandnoneintotheotherone. ThenthecutsizeR iszero. Mathematics of Biological Networks
Community detection Another waywouldbetoimposesomelooseconstraint on thesizesofthegroupsn1andn2. Oneexampleofthiskindisratiocutpartitioningwhere oneminimizestheratio. The denominatorn1 n2hasitslargestvaluewhenbothgroupsareequalsize. 1 9 = 9 This biasestheoptimizationalwaystosolutions 2 8 = 16 wherethegroupshaveroughlyequalsize. 3 7 = 21 4 6 = 24 However, thereis not particularreason 5 5 = 25 behind such an optimizationprinciple. 6 4 = 24 … Mathematics of Biological Networks
Community detection What different measurescouldbeusedtoquantifythequalityof a divisionbesidesthe simple cutsizeoritsvariants? A gooddivisionisonewheretherearefewerthanexpectededgesbetweengroups. → applythemodularitymeasureusedforassortativemixing(see V2). Mathematics of Biological Networks
Review (V2): Quantify assortative mixing Find thefractionofedgesthatrunbetweenverticesofthe same type andsubtractfromthisthefractionofedgeswewouldexpectifedges werepositioned at randomwithoutconsideringthevertex type. ci : classor type ofvertexi , ci [1 … nc] nc: total numberofclasses The total numberofedgesbetweenverticesofthe same type is Here (m,n) istheKroneckerdelta ( is 1 ifm = nand 0 otherwise). The factor ½ accountsforthefactthateveryvertex pair i,jiscounted twice in thesum. Mathematics of Biological Networks
(Review V2): Quantify assortative mixing As expectednumberofedgesbetween all pairsofverticesofthe same type wederived wherethefactor ½ avoids double-countingvertexpairs. Takingthedifferencebetweentheactualandexpectednumberofedgesgives = Typicallyonedoes not calculatethenumberof such edges but thefraction, whichisobtainedbydividingthisbym This quantity Q iscalledthemodularity. Mathematics of Biological Networks
ModularitymaximizationbyKernighan-Lin algorithm We will design an analog oftheKernighan-Lin algorithmwhereweare not requiredtoswappairsofvertices at everystep. Insteadwenowswapsinglevertices. • At eachstep, weselectthevertexwhosemovementwouldmostincrease, or least decrease, themodularity. In a fullcycle, eachvertexismovedexactlyonce. • Thengo back overthestatesthroughwhichthenetworkhaspassedandselecttheonewiththehighestmodularity. • Usethisstateasstartingconditionforanotherroundofthealgorithm. • Repeat thisuntilthemodularitynolongerimproves. The complexityisnowlower, O(mn), becausewhenmovingsingleverticesweonlyhavetoconsiderO(m) possiblemoves at eachstep, in contrasttoO(m2)pairs. Mathematics of Biological Networks
Example: „karateclub“ networkof Zachary ExampleforapplicationofKernighan-Lin modularitymaximization. Pattern offriendshipbetween 34 membersof a karateclub at a US university. At a laterpoint in time, adisputearosebetweenthemembersoftheclubwhethertheclub‘sfeesshouldberaised. This ledto a splittingupoftheclubintotwopartswith 16 and 18 memberseach. The „true“ groups after thedivisionarecoloredblackandwhite in thefiguure. The communitiesdetectedbymodularitymaximizationcorrespondalmostperfectlytotheformedgroups. Mathematics of Biological Networks
Spectralmodularitymaximization There is also an analog ofthespectralgraphpartitioningalgorithmforcommunitydetection. The modularityof a divisioncanberewritteninto: with whereBijhastheproperty Sinceeveryedge in an undirectedgraphhas 2 ends, m edgeshave 2mends. Mathematics of Biological Networks
Spectralmodularitymaximization We will startagainbydividingthenetworkinto 2 partsanddefine Then Substitutingthisintothepreviousequationgives wherewehaveused In matrixnotation, wecanwritethisas B is also calledthemodularitymatrix. Mathematics of Biological Networks
Spectralmodularitymaximization This equationissimilartotheformerexpression in V3 forthecutsizeof a network in termsofthegraphLaplacian. → wecanderive a spectralalgorithmforcommunitydetectionthatiscloselyanalogoustothespectralpartioningmethod. Wewishto find a divisions ofthenetworkthatmaximizesthemodularity Q for a givemodularitymatrixB. The elementsofsareconstrainedtotakevaluesof +1 and -1. Contrarytobefore, thenumberofelementswith +1 and -1 valuesisnow not fixed. As before, we will relax theconstraintthats must pointintothecornersof a hypercubeandallowittopoint in anydirectionsubjecttotheconstraintthatitslengthiskeptthe same: Mathematics of Biological Networks
Spectralmodularitymaximization To find themaximumofwedifferentiateandimposetheconstraintwith a singleLagrangianmultiplier This givesor in matrixnotationB s = s In otherwords, sisoneoftheeigenvectorsofthemodularitymatrix. Withthissolution, themodularitythenbecomes Formaximummodularity, weshouldchoosestobetheeigenvectoru1correspondingtothelargesteigenvalueofthemodularitymatrix. Mathematics of Biological Networks
Spectralmodularitymaximization But asbefore, wetypicallycannotchooses = u1sincetheelementsofsaresubjecttotheconstraintsi = 1. Insteadwe will do thebestwecanandchoosesasclosetou1aspossiblebymaximizingtheirscalarproduct: where [u1]iisthei-thelementofu1. The maximumisachievedwheneachterm in thesumis non-negative, i.e. when If a vectorelementisexactlyzero, eithervalueofsiisequallygood. Mathematics of Biological Networks
Spectralmodularitymaximization This yieldsthefollowing simple algorithm: • Calculatetheeigenvectorofthemodularitymatrixcorrespondingtothelargest (most positive) eigenvalue • Thenassignverticestocommunitiesaccordingtothesignsofthevectorelements, positive signs in onegroupand negative signs in theother. In practice, thisworksverywell. E.g. thekarateclubisperfectlyclassified. In practicalapplications, itisworthwhiletousespectralmodularitymaximizationas a firststep, followedbytheKernighan-Lin methodtogetsomesmallfurtherimprovements. Mathematics of Biological Networks
Division intomorethan 2 groups In general, networkshave an arbitrarynumberofcommunities. Modularityissupposedtobelargestforthebestdivisionofthenetwork. As firstmethod, wecouldstartbydividingthenetworkinto 2 parts, andthenfurthersubdividethosepartsintosmallerones, and so forth. However, themodularityofthecompletenetworkdoes not break up (asthecutsizedoes) intoindependentcontributionsfromthe separate communities. The individual maximizationofthemodularitiesofthesecommunitiestreatedas separate networks will generally not producethemaximummodularityforthenetworkas a whole. Mathematics of Biological Networks
Division intomorethan 2 groups We must considerexplicitlythechangeQ in themodularityoftheentirenetwork upon furtherbisecting a communitycofsizenc. Bycomparingthemodularity after andbeforethedivision, we find (withoutderivation) B(c)isthenc ncmatrixwithelements Mathematics of Biological Networks
Division intomorethan 2 groups The equationhasthe same form asbefore. Thus, wecanapplyourspectralapproachtothisgeneralizedmodularitymatrixtomaximizeQ, find theleadingeigenvectoranddividethenetworkaccordingtothesignsofitselements. In repeatedlysubdividing a network in thisway, an importantquestioniswhentostop. The answerissimplywhenweareunableto find a (further) divisionwith a positive changeQ in themodularity. In thatcase, thebisectionalgorithm will put all vertices in oneofits 2 groupsandnone in theother, effectively „refusing“ tofurthersubdividethenetwork. At thispointweshouldstop. Mathematics of Biological Networks
Division intomorethan 2 groups The repeatedbisectionmethodworkswell in manysituations, but itisbynomeansperfect. A particularproblemisthatthereisnoguaranteethatthebestdivisionof a network in, say 3 parts, canbefoundbyfirstfindingthebestdivisioninto 2 partsandthensubdividingoneofthem. • showsthebestsubdivisionofthis linear graphwith 8 verticesand 7 edgesintotwogroupswith 4 verticeseach. • Shows thebestsubdivisioninto 3 groupswith 3, 2, 3 verticeseach. A repeatedbisectionalgorithmwouldnever find solution (b). Mathematics of Biological Networks
Other algorithmsforcommunitydetection An alternative wayoffindingcommunitiesofvertices in a networkistoloookfortheedgesthatliebetweencommunities. Ifwecan find andremovetheseedges, we will beleftwithisolatedcommunities. Onecommonwaytodefine „betweenness“ istousebetweennesscentrality(V1). Mathematics of Biological Networks
Review (V2): Betweenness Centrality Letusassume an undirectednetworkforsimplicity. Letnstibe 1 ifvertexi lies on thegeodesicpathfromstot and 0 ifitdoes not orifthereisno such path. Thenthebetweennesscentralityxiisgivenby This definitioncountsseparatelythegeodesicpaths in eitherdirectionbetween eachvertex pair. The equation also includespathsfromeachvertextoitself. Excludingthosewould not changetherankingofthevertices in termsofbetweenness. Also, weassumethatverticess andt belongtopathsbetweensandt. Iftherearetwogeodesicpathsofthe same lengthbetween 2 vertices, eachpathgets a weightequaltothe inverse ofthenumberofpaths. Vertices A and B are connectedby 2 geodesicpaths. Vertex C lies on both paths. Mathematics of Biological Networks
Review (V2): Betweenness Centrality Wemayredefinenstitobethenumberofgeodesicpathsfromstotthat pass through vertexi, anddefinegsttobethe total numberofgeodesicpathsfromstot. Then, thebetweennesscentralityofxiis whereweadopttheconventionthatnsti/ gst = 0 ifbothnstiandgstarezero. In thissketchof a network, vertex A lies on a bridgejoiningtwogroupsofothervertices. All pathsbetweenthegroups must pass through A, so ithas a high betweennesseventhoughitsdegreeislow. Mathematics of Biological Networks
Useedgebetweennessforcommunitydetection Defineedgebetweennessasthenumberofgeodesicpathsthatrunalongparticularedges. Weexpectthatedgesthatliebetweencommunities will have high valuesofthisedgebetweenness. In thisexample, twoedgesrunbetween thevertices in thetwodashedcircles. All shortestpathsbetweenverticesof thetwogroups will runalongoneof thesetwoedges. Edge betweennessiscomputedbydeterminingthegeodesicpathsbetweenevery pair ofvertices in thenetworkandcounthowmany such pathsgoalongeachedge. This takesO(n(m + n)). Mathematics of Biological Networks
Betweennessalgorithm • Calculatebetweeness score of all edges • Find theedgewiththehighest score andremove it. Becauseremovingthisedge will changethebetweennessscoresofsomeedges, anyshortestpathsthatpreviouslytraversedtheremovededge will nowhavetobererouted. Thus wehavetogo back tostep 1. The progressofthealgorithmcanberepresentedusing a tree. The progressive fragmentationofthe networkasedgedareremovedonebyone isrepresentedbythesuccessivebranching ofthetree. Ifwestop at thedashedline, the networkissplitinto 4 groupsof 6, 1, 2, and 3 vertices. Mathematics of Biological Networks
Hierarchicalclustering Network divisionbyedgebetweennessproduced a dendrogramthatisremiscentofclusteringmethods. Hierarchicalclusteringis an entireclassofalgorithms. Itis an agglomerativetechniquewherewestartwiththe individual verticesof a networkandjointhemtogetherto form groups. Weneed a measureofvertexsimilarity. Forthis, wecanusethemeasuresintroduced in V1 and V2, e.g. • cosinesimilarity, • correlationcoefficientsbetweenrowsoftheadjacencymatrix, • orthe so-calledEuclidiandistance. Havingmanychoicesforsimilaritymeasuresisboth a strengthand a weaknessofhierarchicalclusteringmethods. Itgivesthemethodflexibility, but theresults will differfromoneanother. Mathematics of Biological Networks
Hierarchicalclustering Once a similarityischosenwecalculatethesimilarityof all pairsofvertices in thenetwork. Thenwewould like toconnectthe mostsimilarvertices. However, theremaybeconflicting situations. Should A and C be in the same groupor not? The basicstrategyofhierarchicalclusteringistostartbyjoiningthosepairsofverticeswiththehighestsimilarities. These then form groupsofsize 2. This stepinvolvesnoambiguity. Thenwefurtherjointogetherthegroupsthataremostsimilarto form larger groups, andso on. Mathematics of Biological Networks
Hierarchicalclustering Duringthisprocesswenowrequire a measureforthesimilarityofgroups. Thereare 3 commonwaysofcombiningvertexsimilaritiestogivesimilarityscoresforgroups: - single-linkage - complete-linkage - average-linkageclustering. Consider 2 groupsofvertices, group 1 andgroup 2, withn1andn2vertices, respectively. Thentherearen1n2pairsofvertices such thatonevertexis in group 1 andtheother in group 2. In thesingle-linkageclusteringmethod, thesimilaritybetweenthe 2 groupsisdefinedasthesimilarityofthemostsimilarofthesen1n2pairsofvertices. Mathematics of Biological Networks
Hierarchicalclustering In thesingle-linkageclusteringmethod, thesimilaritybetweenthe 2 groupsisdefinedasthesimilarityofthemostsimilarofthesen1n2pairsofvertices. As theother extreme, complete-linkageclusteringdefinesthesimilaritybetweenthe 2 groupsasthesimilarityoftheleast similarpair ofvertices. In betweenthesetwo extremes istheaverage-linkageclusteringwherethesimilarityoftwogroupsisdefinedtobethemeansimilarityof all pairsofvertices. Mathematics of Biological Networks
Hierarchicalclustering This ishowhierarchicalclusteringisdone: • Choose a similaritymeasureandevaluateitfor all vertexpairs. • Assigneachvertexto a groupofitsown, consistingof just thatonevertex. The initial similaritiesofthegroupsaresimplythesimilaritiesofthevertices • Find the pair ofgroupswiththehighestsimilarityandjointhemtogetherinto a singlegroup. • Calculatethesimilaritybetweenthenewcompositegroupand all othersusingoneofthe 3 methodsabove (single, complete, average-linkage) • Repeat fromstep 3 until all verticeshavebeenjoinedinto a singlegroup. Mathematics of Biological Networks
Hierarchicalclusteringappliedtothekarateclub Partitioningofthekarateclubnetworkbyaveragelinkagehierarchicalclusteringusingcosinesimilarityasourmeasureofvertexsimilarity. Forthisexample, hierarchicalclusteringfoundtheperfectdivisionoftheclub. However, hierarchicalclusteringdoes not alwaysworkaswellashere. Mathematics of Biological Networks
Comparisonofmodularitymaximizationmethods A large numberofapproacheshavebeendevelopedtomaximizemodularityfordivisionsintoanynumberofcommunitiesofanysizes. Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks
Comparisonofmodularitymaximizationmethods One way to test the sensitivity of these methods is to see how well a particular method performs when applied to ad hoc networks with a well known, fixed community structure. Such networks are typically generated with n = 128 nodes, split into 4 communities containing 32 nodes each. Pairs of nodes belonging to the same community are linked with probability pin whereas pairs belonging to different communities are joined with probability pout. The value of pout is taken so that the average number of links a node has to members of any other community, zout, can be controlled. While pout(and therefore zout) is varied freely, the value of pin is chosen to keep the total average node degree, k constant, and set to 16. Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks
Comparisonofmodularitymaximizationmethods As zout increases, the communities become more and more diffuse and harder to identify, (see figure). Since the “real” community structure is well known in this case, it is possible to measure the number of nodes correctly classified by the method of community identification. Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks
Other modularitymaximizationmethods Oneofthemostsuccessfulapproachesissimulatedannealing. The process begins with any initial partition of the nodes into communities. At each step, a node is chosen at random and moved to a different community, also chosen at random. If the change improves the modularity it is always accepted, otherwise it is accepted with a probability exp(Q/kT). The simulation will start at high temperature T and is then slowly cooled down. Several improvements have been tested. Firstly, the algorithm is stopped periodically, or quenched, and Q is calculated for moving each node to every community that is not its own. Finally, the move corresponding to the largest value of Q is accepted. Mathematics of Biological Networks
Comparisonofmodularitymaximizationmethods Danon, Duch, Diaz-Guilera, Arenas, J. Stat. Mech. P09008 (2005) Mathematics of Biological Networks