V3 Matrix algorithms and graph partitioning
- Dividing networks into clusters
- Graph partitioning
- The Kernighan-Lin algorithm
- Spectral partitioning
Mathematics of Biological Networks
Motivation: Cellular networks are modular!
(Figure: Adam Arkin / UC Berkeley)
Modularity is one way to reconcile the seemingly incompatible objectives of complexity and evolvability.
Modularity has been shown to underlie biological function, e.g. at the levels of transcription and embryonic development.
Over half of all functional modules (in the form of transcriptional modules, protein complexes, and metabolic pathways) have coevolving components.
Modularity is hierarchical.
Singh … Arkin, PNAS (2008) 105, 7500-7505
Evolutionary modules in chemotaxis
Orthologs of 61 B. subtilis chemotaxis genes from 207 microbial species. The resulting gene content matrix was hierarchically clustered along both genes (rows) and species (columns). Genes were then colored according to which dynamic-control role they occupy in the network.
The clustering reveals that genes group into 5 statistically significant evolutionary modules (A–E):
(i) flagellar genes (flg, fli, flh) are conserved among motile bacteria but not among motile Archaea;
(ii) the full complement of signal transducers (mcp, tlp) and regulators (che) is absent in many intracellular pathogens.
Singh … Arkin, PNAS (2008) 105, 7500-7505
FBA-optimized network on glutamate-rich substrate
High-flux backbone (HFB) for the FBA-optimized metabolic network of E. coli on a glutamate-rich substrate.
Metabolites (vertices) coloured blue have at least one neighbour in common in glutamate- and succinate-rich substrates, and those coloured red have none. Reactions (lines) are coloured blue if they are identical in glutamate- and succinate-rich substrates, green if a different reaction connects the same neighbour pair, and red if this is a new neighbour pair.
Black dotted lines indicate where the disconnected pathways, for example folate biosynthesis, would connect to the cluster through a link that is not part of the HFB. Thus, the red nodes and links highlight the predicted changes in the HFB when shifting E. coli from glutamate- to succinate-rich media. Dashed lines indicate links to the biomass growth reaction.
Figure legend (pathway numbers):
(1) Pentose Phosphate
(2) Purine Biosynthesis
(3) Aromatic Amino Acids
(4) Folate Biosynthesis
(5) Serine Biosynthesis
(6) Cysteine Biosynthesis
(7) Riboflavin Biosynthesis
(8) Vitamin B6 Biosynthesis
(9) Coenzyme A Biosynthesis
(10) TCA Cycle
(11) Respiration
(12) Glutamate Biosynthesis
(13) NAD Biosynthesis
(14) Threonine, Lysine and Methionine Biosynthesis
(15) Branched Chain Amino Acid Biosynthesis
(16) Spermidine Biosynthesis
(17) Salvage Pathways
(18) Murein Biosynthesis
(19) Cell Envelope Biosynthesis
(20) Histidine Biosynthesis
(21) Pyrimidine Biosynthesis
(22) Membrane Lipid Biosynthesis
(23) Arginine Biosynthesis
(24) Pyruvate Metabolism
(25) Glycolysis
Almaas et al., Nature 427, 839 (2004)
RNA polymerases I, II and III
Again: a modular decomposition is easier to comprehend than the full graph.
Gagneur et al., Genome Biology 5, R57 (2004)
Dividing networks into clusters
We would like to divide the vertices of a graph so that vertices in one group have many edges to other vertices inside the same group and only a few edges to vertices in other groups.
Example: network of co-authorships in a university department. Vertices are scientists and edges link pairs of scientists who have co-authored scientific publications. The network has clear clusters or "communities" that likely reflect divisions of interests and research groups.
Graph partitioning
Graph partitioning and community detection are distinguished from one another by whether the number and size of the groups is fixed by the experimenter or whether it is unspecified.
Graph partitioning is a classic problem in computer science, studied since the 1960s. It is the problem of dividing the vertices of a network into a given number of non-overlapping groups of given sizes such that the number of edges between groups is minimized.
One important application is the optimal distribution of vertices onto the cores of a parallel computer in order to minimize the amount of communication required when solving, e.g., a system of coupled equations.
Community detection
In community detection, the number and size of the groups into which the network is divided are not specified by the experimenter but by the network itself.
The goal of community detection is to find the natural fault lines along which a network separates.
Algorithms for graph partitioning
Why is partitioning hard?
The simplest graph partitioning problem is the division of a network into just 2 parts. This is sometimes called graph bisection. If we can divide a network into 2 parts, we can also divide it further by dividing one or both of these parts …
The graph bisection problem is the problem of dividing the vertices of a network into 2 non-overlapping groups of given sizes such that the number of edges running between vertices in different groups is minimized. The number of edges between groups is called the cut size.
Thus, one could simply look through all possible divisions of the network into 2 parts and choose the one with the smallest cut size.
Algorithms for graph partitioning
But this exhaustive search is prohibitively expensive!
Given a network of $n$ vertices, there are
$$\frac{n!}{n_1!\, n_2!}$$
different ways of dividing it into 2 groups of $n_1$ and $n_2$ vertices.
The amount of time needed to look through all these divisions goes up roughly exponentially with the size of the system. Only values of up to about $n = 30$ are feasible with current computers.
As is common for hard optimization problems in computer science, an algorithm can either be clever and run quickly but fail to find the optimal answer in some (and perhaps most) cases, or it always finds the optimal answer but takes an impractical length of time to do so.
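The exhaustive search described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the toy network (two triangles joined by one edge) are hypothetical and not part of the lecture.

```python
from itertools import combinations
from math import comb


def count_divisions(n, n1):
    """Number of ways to choose which n1 of n vertices form group 1: n!/(n1! n2!)."""
    return comb(n, n1)


def cut_size(edges, group1):
    """Count the edges that have exactly one endpoint in group1."""
    return sum((u in group1) != (v in group1) for u, v in edges)


def brute_force_bisection(n, edges, n1):
    """Try every possible group of n1 vertices; return (smallest cut size, group 1)."""
    best = None
    for group in combinations(range(n), n1):
        r = cut_size(edges, set(group))
        if best is None or r < best[0]:
            best = (r, set(group))
    return best


# Hypothetical toy network: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_cut, best_group = brute_force_bisection(6, edges, 3)
print(best_cut, sorted(best_group))   # 1 [0, 1, 2]
print(count_divisions(30, 15))        # 155117520 divisions already for n = 30
```

For 6 vertices only 20 divisions have to be checked, but the count grows so quickly with $n$ that this approach is useful only as a reference for very small networks.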
The Kernighan-Lin algorithm
This algorithm, proposed by Brian Kernighan and Shen Lin in 1970, is one of the simplest and best known heuristic algorithms for the graph bisection problem. (Kernighan is also known as co-author, with Dennis Ritchie, of the book "The C Programming Language".)
Figure: (a) The algorithm starts with any division of the vertices of a network into two groups (shaded) and then searches for pairs of vertices, such as the pair highlighted here, whose interchange would reduce the cut size between the groups. (b) The same network after interchange of the 2 vertices.
The Kernighan-Lin algorithm
• Divide the vertices of a given network into 2 groups of the required sizes (e.g. randomly).
• For each pair (i, j) of vertices, where i belongs to the first group and j to the second group, calculate how much the cut size between the groups would change if i and j were interchanged between the groups.
• Find the pair that reduces the cut size by the largest amount and swap it. If no pair reduces it, find the pair that increases it by the smallest amount and swap that pair instead.
• Repeat this process, but with the important restriction that each vertex in the network can only be moved once.
• Stop when there is no pair of vertices left that can be swapped.
The Kernighan-Lin algorithm (II)
• Go back through every state that the network passed through during the swapping procedure and choose among them the state in which the cut size takes its smallest value.
• Perform this entire process repeatedly, starting each time with the best division of the network found in the last round.
• Stop when no improvement of the cut size occurs.
Note that if the initial assignment of vertices to groups is done randomly, the Kernighan-Lin algorithm may give different answers when it is run twice on the same network.
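The steps of the Kernighan-Lin algorithm can be sketched as follows. This is a deliberately simple, unoptimized illustration: it recomputes the cut size from scratch for every candidate swap, and the function names, the `seed` parameter and the toy network are assumptions made for this example only.

```python
import random


def cut_size(adj, side):
    """Number of edges whose two endpoints lie in different groups."""
    return sum(side[u] != side[v] for u in adj for v in adj[u]) // 2


def kl_round(adj, side):
    """One round: swap pairs until every vertex has moved once; return the best state seen."""
    side = dict(side)
    best_cut, best_side = cut_size(adj, side), dict(side)
    free = set(adj)                        # vertices not yet moved in this round
    while True:
        group1 = [v for v in free if side[v] == 0]
        group2 = [v for v in free if side[v] == 1]
        if not group1 or not group2:
            break                          # no swappable pair is left
        best_pair, best_new = None, None
        for i in group1:
            for j in group2:
                side[i], side[j] = 1, 0    # tentatively swap i and j
                new = cut_size(adj, side)
                side[i], side[j] = 0, 1    # undo the tentative swap
                if best_new is None or new < best_new:
                    best_pair, best_new = (i, j), new
        i, j = best_pair                   # largest reduction or smallest increase
        side[i], side[j] = 1, 0
        free -= {i, j}
        if best_new < best_cut:            # remember the best state passed through
            best_cut, best_side = best_new, dict(side)
    return best_cut, best_side


def kernighan_lin(adj, n1, seed=0):
    """Repeat rounds, each starting from the best division so far, until no improvement."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    rng.shuffle(nodes)                     # random initial division
    side = {v: (0 if k < n1 else 1) for k, v in enumerate(nodes)}
    best = cut_size(adj, side)
    while True:
        new_cut, new_side = kl_round(adj, side)
        if new_cut >= best:
            return best, side
        best, side = new_cut, new_side


# Hypothetical toy network: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
cut, side = kernighan_lin(adj, n1=3)
print(cut, side)
```

On this small example every starting division reaches the cut size 1 that separates the two triangles. On larger networks the result may depend on the random start, as noted above.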
The Kernighan-Lin algorithm (II)
Figure: (a) A mesh network of 547 vertices of the kind commonly used in finite element analysis. (b) The best division found by the Kernighan-Lin algorithm when the task is to split the network into 2 groups of almost equal size. This division involves cutting 40 edges in this mesh network and gives parts of 273 and 274 vertices. (c) The best division found by spectral partitioning (second half of V3).
Runtime of the Kernighan-Lin algorithm
The number of swaps performed during one round of the algorithm is equal to the smaller of the sizes of the two groups, which lies in the range $[0, n/2]$.
→ In the worst case, there are $O(n)$ swaps.
For each swap, we have to examine all pairs of vertices in different groups to determine how the cut size would be affected if the pair was swapped.
In the worst case, there are $\frac{n}{2} \times \frac{n}{2} = \frac{n^2}{4}$ such pairs, which is $O(n^2)$.
Runtime of the Kernighan-Lin algorithm (ii)
When a vertex $i$ moves from one group to the other group, any edges connecting it to vertices in its current group become edges between groups after the swap. Let us suppose that there are $k_i^{\text{same}}$ such edges.
Similarly, any edges that $i$ has to vertices in the other group (say $k_i^{\text{other}}$ of them) become within-group edges after the swap.
There is one exception. If $i$ is being swapped with vertex $j$ and they are connected by an edge, then that edge is still between the groups after the swap.
→ The reduction in the cut size due to the movement of $i$ is
$$k_i^{\text{other}} - k_i^{\text{same}} - A_{ij}$$
A similar expression applies for vertex $j$.
→ The total reduction in cut size due to the swap is
$$k_i^{\text{other}} - k_i^{\text{same}} + k_j^{\text{other}} - k_j^{\text{same}} - 2A_{ij}$$
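As a sanity check, the expression for the change in cut size can be compared with a direct recount of the cut size before and after a swap. The sketch below treats the expression as the amount by which the cut size decreases; the function names and the toy network are hypothetical.

```python
def cut_size(adj, side):
    """Number of edges whose two endpoints lie in different groups."""
    return sum(side[u] != side[v] for u in adj for v in adj[u]) // 2


def swap_reduction(adj, side, i, j):
    """k_i^other - k_i^same + k_j^other - k_j^same - 2 A_ij for i, j in different groups."""
    k_i_same = sum(side[v] == side[i] for v in adj[i])
    k_i_other = len(adj[i]) - k_i_same
    k_j_same = sum(side[v] == side[j] for v in adj[j])
    k_j_other = len(adj[j]) - k_j_same
    a_ij = 1 if j in adj[i] else 0
    return k_i_other - k_i_same + k_j_other - k_j_same - 2 * a_ij


# Hypothetical toy network: two triangles joined by the edge 2-3, badly divided.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
side = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
before = cut_size(adj, side)

checks = []
for i in [v for v in adj if side[v] == 0]:
    for j in [v for v in adj if side[v] == 1]:
        swapped = dict(side)
        swapped[i], swapped[j] = side[j], side[i]
        direct = before - cut_size(adj, swapped)      # recount from scratch
        checks.append(direct == swap_reduction(adj, side, i, j))

print(all(checks))                       # True: formula agrees with the recount
print(swap_reduction(adj, side, 3, 2))   # 4: the cut size drops from 5 to 1
```

Only the neighbors of $i$ and $j$ are inspected by `swap_reduction`, which is why evaluating the expression is much cheaper than recounting all edges.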
Runtime of the Kernighan-Lin algorithm (iii)
For a network stored in adjacency list form, the evaluation of this expression involves running through all the neighbors of $i$ and $j$ in turn, and hence takes time on the order of the average degree in the network, or $O(m/n)$ with $m$ edges in the network.
→ The total running time of one round is $O(n \times n^2 \times m/n) = O(mn^2)$.
On a sparse network with $m \propto n$, this is $O(n^3)$.
On a dense network (with $m \propto n^2$), this is $O(n^4)$.
This time still needs to be multiplied by the number of rounds the algorithm is run before the cut size stops decreasing. For networks of up to a few thousand vertices, this number may be between 5 and 10.
Spectral partitioning
This method was presented by Fiedler (1973) and makes use of the matrix properties of the graph Laplacian. Again we will apply this algorithm to the graph bisection problem.
Given a network of $n$ vertices and $m$ edges and a division into group 1 and group 2, we can write the cut size for the division as
$$R = \frac{1}{2} \sum_{i,j \text{ in different groups}} A_{ij}$$
The factor ½ compensates for counting each edge twice in the sum.
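Spectral partitioning
Let us define a set of quantities $s_i$, one for each vertex $i$, which represent the division of the network thus:
$$s_i = \begin{cases} +1 & \text{if vertex } i \text{ belongs to group 1} \\ -1 & \text{if vertex } i \text{ belongs to group 2} \end{cases}$$
Then
$$\frac{1}{2}(1 - s_i s_j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are in different groups} \\ 0 & \text{if } i \text{ and } j \text{ are in the same group} \end{cases}$$
This allows us to rewrite the cut size as
$$R = \frac{1}{4} \sum_{ij} A_{ij} (1 - s_i s_j)$$
where the sum now runs over all values of $i$ and $j$.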
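Spectral partitioning
The first term in the sum is
$$\sum_{ij} A_{ij} = \sum_i k_i = \sum_i k_i s_i^2 = \sum_{ij} k_i \delta_{ij} s_i s_j$$
where $k_i$ is the degree of vertex $i$ and $\delta_{ij}$ is the Kronecker symbol. We have also used the facts that $\sum_j A_{ij} = k_i$ and $s_i^2 = 1$.
Substituting this back into the above equation gives
$$R = \frac{1}{4} \sum_{ij} (k_i \delta_{ij} - A_{ij}) s_i s_j = \frac{1}{4} \sum_{ij} L_{ij} s_i s_j$$
or in matrix form
$$R = \frac{1}{4} \mathbf{s}^T \mathbf{L} \mathbf{s}$$
where $\mathbf{s}$ is the vector with elements $s_i$ and $L_{ij} = k_i \delta_{ij} - A_{ij}$ is the $ij$-th element of the graph Laplacian matrix.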
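The identity between the cut size counted edge by edge and the quadratic form of the graph Laplacian can be checked numerically. The sketch below uses plain Python lists; the helper names and the example division are assumptions made for illustration.

```python
def laplacian(n, edges):
    """Graph Laplacian L_ij = k_i * delta_ij - A_ij as a list of lists."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1          # each edge adds 1 to the degree of both endpoints
        L[v][v] += 1
        L[u][v] -= 1          # off-diagonal entries are -A_ij
        L[v][u] -= 1
    return L


def cut_size_direct(edges, s):
    """Count the edges whose endpoints carry different signs s_i."""
    return sum(s[u] != s[v] for u, v in edges)


def cut_size_quadratic(L, s):
    """Evaluate R = (1/4) * sum_ij L_ij s_i s_j."""
    n = len(s)
    return sum(L[i][j] * s[i] * s[j] for i in range(n) for j in range(n)) / 4


# Hypothetical toy network (two triangles joined by the edge 2-3) and an arbitrary division.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
s = [+1, +1, -1, -1, +1, -1]
L = laplacian(6, edges)
print(cut_size_direct(edges, s), cut_size_quadratic(L, s))   # 4 4.0
```

Both routes give the same value because $\mathbf{s}^T \mathbf{L} \mathbf{s}$ equals the sum of $(s_u - s_v)^2$ over all edges, and each cut edge contributes exactly 4 to that sum.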
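Insert: graph Laplacian
The graph Laplacian is defined in analogy to the diffusion process. Diffusion processes are normally treated by the diffusion equation
$$\frac{\partial \psi}{\partial t} = C \nabla^2 \psi$$
But one can also consider diffusion processes that take place on networks. There, the rate at which the amount of substance $\psi_i$ at vertex $i$ changes is determined by the flow from and to other vertices $j$ that are connected to $i$. This gives
$$\frac{d\psi_i}{dt} = C \sum_j A_{ij} (\psi_j - \psi_i)$$
where $C$ is the diffusion constant. By splitting the 2 terms we can write
$$\frac{d\psi_i}{dt} = C \sum_j A_{ij} \psi_j - C \psi_i \sum_j A_{ij} = C \sum_j A_{ij} \psi_j - C \psi_i k_i = C \sum_j (A_{ij} - \delta_{ij} k_i) \psi_j$$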
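Insert: graph Laplacian
We can write this equation in matrix form
$$\frac{d\boldsymbol{\psi}}{dt} = C (\mathbf{A} - \mathbf{D}) \boldsymbol{\psi}$$
where $\boldsymbol{\psi}$ is the vector with components $\psi_i$, $\mathbf{A}$ is the adjacency matrix and $\mathbf{D}$ is a diagonal matrix with the vertex degrees on the diagonal.
It is common to define a new matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$ so that the above equation takes on the form of the ordinary diffusion equation
$$\frac{d\boldsymbol{\psi}}{dt} + C \mathbf{L} \boldsymbol{\psi} = 0$$
Here, the Laplacian operator $\nabla^2$ of the second spatial derivatives is replaced (up to a sign) by $\mathbf{L}$.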
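To see the network diffusion equation at work, one can integrate it with simple explicit Euler steps. This is a minimal sketch under assumed values (path network of 4 vertices, $C = 1$, time step 0.1); none of these choices come from the lecture.

```python
def laplacian(n, edges):
    """Graph Laplacian L = D - A as a list of lists."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L


def diffuse(L, psi, c=1.0, dt=0.1, steps=2000):
    """Explicit Euler integration of d(psi)/dt = -C L psi."""
    n = len(psi)
    psi = list(psi)
    for _ in range(steps):
        flow = [sum(L[i][j] * psi[j] for j in range(n)) for i in range(n)]
        psi = [psi[i] - c * dt * flow[i] for i in range(n)]
    return psi


# Hypothetical example: a path 0-1-2-3 with all substance initially on vertex 0.
edges = [(0, 1), (1, 2), (2, 3)]
psi_final = diffuse(laplacian(4, edges), [4.0, 0.0, 0.0, 0.0])
print([round(p, 6) for p in psi_final])   # the substance spreads evenly over the vertices
print(round(sum(psi_final), 6))           # the total amount is conserved
```

The total amount of substance is conserved because every column of $\mathbf{L}$ sums to zero, and on a connected network the distribution approaches the uniform one.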
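Insert: eigenvectors of the graph Laplacian
Consider an undirected network with $n$ vertices and $m$ edges. Let us designate one end of each edge to be end 1 and the other to be end 2.
Now let us define the $m \times n$ incidence matrix $\mathbf{B}$ with elements as follows:
$$B_{ij} = \begin{cases} +1 & \text{if end 1 of edge } i \text{ is attached to vertex } j \\ -1 & \text{if end 2 of edge } i \text{ is attached to vertex } j \\ 0 & \text{otherwise} \end{cases}$$
Thus, each row of $\mathbf{B}$ has exactly one +1 and one -1 element.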
Insert: eigenvectors of the graph Laplacian
Let us now consider the sum $\sum_k B_{ki} B_{kj}$ that runs over all edges $k$ for the vertex pair $i$ and $j$.
If $i \neq j$, the only non-zero terms in this sum occur if both $B_{ki}$ and $B_{kj}$ are non-zero. In that case, edge $k$ connects vertices $i$ and $j$, and the product will be -1. For a simple network, there is at most one edge between any pair of vertices. Thus, the entire sum will be -1 if there is an edge between $i$ and $j$ and 0 otherwise.
If $i = j$, then the sum is $\sum_k B_{ki}^2$, which contributes a value of +1 for every edge connected to vertex $i$, so the whole sum is just equal to the degree $k_i$ of vertex $i$.
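Insert: eigenvectors of the graph Laplacian
Thus the sum is precisely equal to an element of the Laplacian:
$$\sum_k B_{ki} B_{kj} = L_{ij}$$
The diagonal elements $L_{ii}$ are equal to the degrees $k_i$ and the off-diagonal terms $L_{ij}$ are -1 if there is an edge $(i,j)$ and zero otherwise.
In matrix form, we can write $\mathbf{L} = \mathbf{B}^T \mathbf{B}$.
Now let $\mathbf{v}_i$ be an eigenvector of $\mathbf{L}$ with eigenvalue $\lambda_i$. Then
$$\mathbf{v}_i^T \mathbf{B}^T \mathbf{B} \mathbf{v}_i = \mathbf{v}_i^T \mathbf{L} \mathbf{v}_i = \lambda_i \mathbf{v}_i^T \mathbf{v}_i = \lambda_i$$
where we assumed that the eigenvector $\mathbf{v}_i$ is normalized so that its inner product with itself is 1.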
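The factorization of the Laplacian into the incidence matrix and its transpose is easy to verify on a small example. In the sketch below, the first vertex listed for each edge is taken as end 1 and the second as end 2; this convention, the helper names and the toy network are assumptions for illustration.

```python
def incidence_matrix(n, edges):
    """m x n incidence matrix: +1 at end 1 and -1 at end 2 of each edge."""
    B = [[0] * n for _ in edges]
    for k, (u, v) in enumerate(edges):
        B[k][u] = +1          # end 1 of edge k
        B[k][v] = -1          # end 2 of edge k
    return B


def transpose_times_self(B, n):
    """Compute (B^T B)_ij = sum_k B_ki B_kj."""
    return [[sum(row[i] * row[j] for row in B) for j in range(n)] for i in range(n)]


def laplacian(n, edges):
    """Graph Laplacian built directly from degrees and adjacency."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L


# Hypothetical toy network: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
B = incidence_matrix(6, edges)
print(transpose_times_self(B, 6) == laplacian(6, edges))   # True
```

Which end of an edge is called end 1 does not matter: flipping the choice changes the sign of a whole row of $\mathbf{B}$, which leaves $\mathbf{B}^T \mathbf{B}$ unchanged.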
Insert: eigenvectors of the graph Laplacian
Thus any eigenvalue $\lambda_i$ of the Laplacian is equal to $(\mathbf{v}_i^T \mathbf{B}^T)(\mathbf{B} \mathbf{v}_i)$.
This quantity is just the product of a real vector with itself. It is the sum of the squares of the real elements of that vector and hence cannot be negative. It follows that all eigenvalues of the graph Laplacian are non-negative.
The smallest possible eigenvalue is 0. Let us consider the vector $\mathbf{1} = (1,1,1,\ldots)$. If we multiply this vector by the Laplacian, the $i$-th element has the value
$$\sum_j L_{ij} \times 1 = \sum_j (\delta_{ij} k_i - A_{ij}) = k_i - \sum_j A_{ij} = k_i - k_i = 0$$
Thus the vector $\mathbf{1}$ is always an eigenvector of $\mathbf{L}$ with the smallest possible eigenvalue 0.
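Both statements, non-negative eigenvalues and the eigenvector $\mathbf{1}$ with eigenvalue 0, can be checked numerically. This sketch assumes NumPy is available and uses its symmetric eigenvalue routine `numpy.linalg.eigvalsh`; the toy network is hypothetical.

```python
import numpy as np


def laplacian(n, edges):
    """Graph Laplacian L = D - A as a NumPy array."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1
    return np.diag(A.sum(axis=1)) - A


# Hypothetical toy network: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
L = laplacian(6, edges)

eigenvalues = np.linalg.eigvalsh(L)          # returned in ascending order
print(np.allclose(L @ np.ones(6), 0))        # L 1 = 0
print(np.round(eigenvalues, 4))              # all values >= 0 up to rounding error
```

Because floating-point arithmetic is used, the smallest eigenvalue comes out as a number very close to 0 rather than exactly 0, so comparisons should use a small tolerance.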
Back to spectral partitioning
$R$ was the cut size for the division, i.e., the number of edges running between the 2 groups:
$$R = \frac{1}{4} \mathbf{s}^T \mathbf{L} \mathbf{s}$$
where $\mathbf{s}$ is the vector with elements $s_i$ and $L_{ij} = k_i \delta_{ij} - A_{ij}$ is the $ij$-th element of the graph Laplacian matrix.
The matrix $\mathbf{L}$ specifies the structure of the network and the vector $\mathbf{s}$ defines a division of that network into groups.
Our goal is to find the vector $\mathbf{s}$ that minimizes the cut size for given $\mathbf{L}$.
In general, this minimization problem is not easy to solve because the values $s_i$ are restricted to +1 and -1.
Relaxation method
If the values $s_i$ could take on any real value, we could compute the derivative, set it to zero, and find the minimum.
We will solve this problem with the relaxation method, in which we relax these constraints. This is one of the standard methods for finding approximate solutions of vector optimization problems.
The allowed values of $s_i$ are actually subject to 2 constraints.
First, each individual element $s_i$ can only have the values +1 and -1. If we regard $\mathbf{s}$ as a vector in Euclidean space, then this constraint means that the vector always points to one of the $2^n$ corners of an $n$-dimensional hypercube centered on the origin. It always has the same length, $\sqrt{n}$.
Back to spectral partitioning
Let us now relax the constraint on the vector's direction, so that it can point in any direction in its $n$-dimensional space. We will, however, still keep its length the same.
So $\mathbf{s}$ will be allowed to take any value, subject to the constraint $|\mathbf{s}| = \sqrt{n}$ or
$$\sum_i s_i^2 = n$$
The second constraint on the $s_i$ is that the numbers of them that are equal to +1 and -1, respectively, must be equal to the desired sizes of the 2 groups. If these 2 sizes are $n_1$ and $n_2$, this second constraint can be written as
$$\sum_i s_i = n_1 - n_2$$
or in vector notation
$$\mathbf{1}^T \mathbf{s} = n_1 - n_2$$
where $\mathbf{1}^T = (1,1,1,\ldots)$. We keep this second constraint unchanged.
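Minimize cut size subject to constraints
Our task is therefore to minimize the cut size
$$R = \frac{1}{4} \mathbf{s}^T \mathbf{L} \mathbf{s}$$
subject to the 2 constraints
$$\sum_i s_i^2 = n, \qquad \sum_i s_i = n_1 - n_2$$
We differentiate $R$ with respect to the elements $s_i$ and enforce the constraints using two Lagrange multipliers, which we denote $\lambda$ and $2\mu$:
$$\frac{\partial}{\partial s_i} \left[ \sum_{jk} L_{jk} s_j s_k + \lambda \left( n - \sum_j s_j^2 \right) + 2\mu \left( (n_1 - n_2) - \sum_j s_j \right) \right] = 0$$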
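Back to spectral partitioning
Performing the derivatives, we then find that
$$\sum_j L_{ij} s_j = \lambda s_i + \mu$$
or in matrix notation
$$\mathbf{L} \mathbf{s} = \lambda \mathbf{s} + \mu \mathbf{1}$$
We can calculate the value of $\mu$ by recalling that $\mathbf{1}$ is an eigenvector of the Laplacian with eigenvalue 0, so that $\mathbf{L} \mathbf{1} = 0$ and, because $\mathbf{L}$ is symmetric, also $\mathbf{1}^T \mathbf{L} = 0$.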
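Relationship to eigenvectors of the Laplacian
Multiplying $\mathbf{L} \mathbf{s} = \lambda \mathbf{s} + \mu \mathbf{1}$ from the left by $\mathbf{1}^T$ gives
$$\mathbf{1}^T \mathbf{L} \mathbf{s} = \lambda \mathbf{1}^T \mathbf{s} + \mu \mathbf{1}^T \mathbf{1}$$
With $\mathbf{1}^T \mathbf{L} = 0$, $\mathbf{1}^T \mathbf{s} = n_1 - n_2$ and $\mathbf{1}^T \mathbf{1} = n$ we get
$$\lambda (n_1 - n_2) + \mu n = 0$$
or
$$\mu = -\frac{n_1 - n_2}{n} \lambda$$
If we define the new vector
$$\mathbf{x} = \mathbf{s} + \frac{\mu}{\lambda} \mathbf{1} = \mathbf{s} - \frac{n_1 - n_2}{n} \mathbf{1}$$
then
$$\mathbf{L} \mathbf{x} = \mathbf{L} \left( \mathbf{s} + \frac{\mu}{\lambda} \mathbf{1} \right) = \mathbf{L} \mathbf{s} = \lambda \mathbf{s} + \mu \mathbf{1} = \lambda \mathbf{x}$$
In other words, $\mathbf{x}$ is an eigenvector of the Laplacian with eigenvalue $\lambda$.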
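Which eigenvector should we choose?
We are still free to choose which eigenvector it is. We should choose the one that gives the smallest value of the cut size $R$.
Notice that
$$\mathbf{1}^T \mathbf{x} = \mathbf{1}^T \mathbf{s} - \frac{n_1 - n_2}{n} \mathbf{1}^T \mathbf{1} = (n_1 - n_2) - (n_1 - n_2) = 0$$
Thus, $\mathbf{x}$ is orthogonal to $\mathbf{1}$.
While $\mathbf{x}$ should be an eigenvector of $\mathbf{L}$, it cannot be the eigenvector $(1,1,1,\ldots)$ that has eigenvalue 0.
Which eigenvector should we choose instead?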
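Choose the eigenvector with the smallest eigenvalue
We also have
$$\mathbf{x}^T \mathbf{x} = \mathbf{s}^T \mathbf{s} - 2 \frac{n_1 - n_2}{n} \mathbf{1}^T \mathbf{s} + \frac{(n_1 - n_2)^2}{n^2} \mathbf{1}^T \mathbf{1} = n - \frac{(n_1 - n_2)^2}{n} = \frac{4 n_1 n_2}{n}$$
and hence, using $\mathbf{L} \mathbf{1} = 0$,
$$R = \frac{1}{4} \mathbf{s}^T \mathbf{L} \mathbf{s} = \frac{1}{4} \mathbf{x}^T \mathbf{L} \mathbf{x} = \frac{1}{4} \lambda \mathbf{x}^T \mathbf{x} = \frac{n_1 n_2}{n} \lambda$$
Thus, the cut size is proportional to the eigenvalue $\lambda$.
Since our goal is to minimize $R$, we should choose $\mathbf{x}$ to be the eigenvector corresponding to the smallest allowed eigenvalue of the Laplacian.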
Optimal network division
We have already shown that $\mathbf{x}$ must be orthogonal to the eigenvector $\mathbf{1}$ with the smallest eigenvalue 0.
The best thing we can do is to choose $\mathbf{x}$ proportional to the eigenvector $\mathbf{v}_2$ corresponding to the second lowest eigenvalue.
Finally we recover the best value for the network division $\mathbf{s}$:
$$\mathbf{s} = \mathbf{x} + \frac{n_1 - n_2}{n} \mathbf{1}$$
or equivalently
$$s_i = x_i + \frac{n_1 - n_2}{n}$$
This gives us the optimal relaxed value of $\mathbf{s}$.
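Network partitioning with constraints
There is, however, an additional constraint that the elements of $\mathbf{s}$ should have the values +1 and -1. Moreover, exactly $n_1$ of them should be +1 and $n_2$ should be -1.
Thus, the elements of $\mathbf{s}$ cannot exactly take on the values
$$s_i = x_i + \frac{n_1 - n_2}{n}$$
Let us do the best we can and choose $\mathbf{s}$ as close as possible to these ideal values subject to its constraints, by making the product
$$\mathbf{s}^T \left( \mathbf{x} + \frac{n_1 - n_2}{n} \mathbf{1} \right) = \sum_i s_i \left( x_i + \frac{n_1 - n_2}{n} \right)$$
as large as possible.
This is achieved by assigning $s_i = +1$ to the $n_1$ vertices with the largest (i.e. most positive) values of $x_i + \frac{n_1 - n_2}{n}$, and thus the largest values of $x_i$, and $s_i = -1$ to the remaining ones.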
Algorithm for spectral partitioning
There is a further subtlety. It is arbitrary which group we call group 1 and which we call group 2. Thus, if the sizes of the two groups are different, there are two different ways of making the split. We solve this in a pragmatic way by trying both, see below.
Thus our final algorithm is as follows:
• Calculate the eigenvector $\mathbf{v}_2$ corresponding to the second smallest eigenvalue of the graph Laplacian.
• Sort the elements of the eigenvector in order from largest to smallest.
• Put the vertices corresponding to the $n_1$ largest elements in group 1, the rest in group 2, and calculate the cut size.
• Then put the vertices corresponding to the $n_1$ smallest elements in group 1, the rest in group 2, and calculate the cut size.
• Between these 2 divisions of the network, choose the one that gives the smaller cut size.
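The five steps can be sketched as follows. This is an illustrative version that assumes NumPy is available and uses the dense solver `numpy.linalg.eigh`, which is practical only for small networks; the function name and the toy network are hypothetical.

```python
import numpy as np


def spectral_bisection(n, edges, n1):
    """Split n vertices into groups of sizes n1 and n - n1 using the eigenvector v2."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L = D - A

    _, vectors = np.linalg.eigh(L)            # eigenvalues come in ascending order
    v2 = vectors[:, 1]                        # eigenvector of the second smallest one
    order = np.argsort(-v2)                   # vertices sorted from largest to smallest

    def cut(group1):
        g = {int(v) for v in group1}
        return sum((u in g) != (v in g) for u, v in edges)

    # Candidate 1: the n1 largest elements form group 1.
    # Candidate 2: the n1 smallest elements form group 1.
    candidates = [order[:n1], order[::-1][:n1]]
    best = min(candidates, key=cut)
    return cut(best), sorted(int(v) for v in best)


# Hypothetical toy network: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
cut_value, group1 = spectral_bisection(6, edges, 3)
print(cut_value, group1)
```

The overall sign of an eigenvector is arbitrary, so which of the two triangles is reported as group 1 may differ between runs or library versions; the cut size of 1 is the same either way. For large sparse networks one would use an iterative sparse eigensolver instead of a dense decomposition.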
Results and complexity of spectral partitioning
For the mesh network of 547 vertices shown earlier, the cut size found by the Kernighan-Lin algorithm is 40, whereas spectral partitioning cuts 46 edges.
Spectral partitioning tends to find divisions of a network that have "the right general shape".
Speed: The time-consuming part of spectral partitioning is the computation of the eigenvector $\mathbf{v}_2$, which takes time $O(mn)$, or $O(n^2)$ on a sparse network. This is a factor of $n$ better than the Kernighan-Lin algorithm, which takes $O(n^3)$ on a sparse network.