MIS 644 Social Network Analysis 2017/2018 Spring

Chapter 5: Graph Partitioning and Community Detection

Outline • Introduction • Graph Partitioning • Community Detection • Simple Modularity Maximization • Spectral Modularity Maximization

Introduction • Graph partitioning and community detection: • division of the vertices of a network into • groups, clusters, or communities • according to the pattern of edges • the groups formed are • tightly knit, with many edges inside the groups • and few edges between the groups

Network of coauthorships in a university department

vertices: scientists • edges: coauthorship (two scientists have written a paper together) • coauthors tend to be in the same research group or to have similar interests • the ability to find such groups or clusters • reveals the structure and organization of networks

Partitioning or Community detection: • Graph Partitioning: • dividing the vertices of a network into a given number of non-overlapping groups of given sizes • such that the number of edges between groups is minimized • the number and sizes of the groups are fixed • arises in a variety of circumstances: • computer science, pure and applied mathematics, physics, and the study of networks • e.g., numerical solution of network processes on a parallel computer

Partitioning or Community detection: • Community detection: • the number and sizes of the groups are not specified • they are determined by the network itself • goal: to find • the natural fault lines along which the network separates • use: as a tool for • analysis and understanding of • network data • e.g., clusters of nodes in a web graph • correspond to groups of related web pages

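In Summary • difference between • graph partitioning and community detection: • the number and size of the groups • into which the network is divided • are specified in graph partitioning, unspecified in community detection • goals • graph partitioning: divide the network into manageable pieces, e.g., for numerical processing • community detection: understand the structure of a network • reveal large-scale patterns • that are not otherwise visible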

Graph Partitioning • The Kernighan-Lin algorithm • Spectral partitioning • Why is partitioning hard? • graph bisection: • division into two parts • repeated bisection: • partitioning into an arbitrary number of parts • exhaustive search: looking through • all possible divisions into two parts • is costly in computation time

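The number of ways of dividing a network of n vertices into two parts of n1 and n2 vertices: $n!/(n_1!\,n_2!)$ • approximating by Stirling's formula, $n! \simeq \sqrt{2\pi n}\,(n/e)^n$, • and using the fact that $n = n_1 + n_2$: $\frac{n!}{n_1!\,n_2!} \simeq \frac{1}{\sqrt{2\pi}}\,\frac{n^{n+1/2}}{n_1^{n_1+1/2}\,n_2^{n_2+1/2}}$ • for two parts of equal size, $n_1 = n_2 = n/2$, this is $2^{n+1}/\sqrt{2\pi n}$ • the time to look through all divisions • grows roughly exponentially with n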
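As a rough numerical check of the estimate above, the following Python sketch compares the exact number of equal-size divisions with the Stirling approximation. The function names are illustrative, and the approximation is written with the constant factor $1/\sqrt{2\pi}$ kept, which is an assumption about how the formula is normalized.

```python
import math

def exact_bisections(n):
    """Exact number of ways to pick n/2 of n vertices for the first group."""
    return math.comb(n, n // 2)

def stirling_bisections(n):
    """Stirling estimate 2^(n+1) / sqrt(2*pi*n) for an equal split (n even)."""
    return 2 ** (n + 1) / math.sqrt(2 * math.pi * n)

# The count roughly doubles with every vertex added, so exhaustive search
# quickly becomes impractical.
for n in (10, 20, 40, 80):
    exact = exact_bisections(n)
    approx = stirling_bisections(n)
    print(n, exact, f"{approx:.3e}", f"ratio={approx / exact:.4f}")
```

The ratio column should approach 1 as n grows, while the counts themselves grow exponentially.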

Partition of a network into two groups of equal sizes.

An algorithm can either • be clever and run quickly but fail to find the optimal solution • or • find the optimal solution but take an impractical amount of time • in practice we accept that we may fail to find the very best division • and settle for a pretty good one that is good enough • approximate but acceptable solutions • heuristic algorithms, or heuristics

The Kernighan-Lin Algorithm • given n, n1, and n2, • divide the vertices into two groups arbitrarily, e.g., randomly • for each pair i, j of vertices • with i in one group and j in the other • calculate how much the cut size would change if i and j were interchanged • find the pair i, j that • reduces the cut size the most or, if no pair reduces it, • increases it by the smallest amount • swap that pair of vertices • the process is repeated with the restriction that • each vertex can be moved only once

Figure 11.2 of N-N The Kernighan-Lin algorithm. (a) The Kernighan-Lin algorithm starts with any division of the vertices of a network into two groups (shaded) and then searches for pairs of vertices, such as the pair highlighted here, whose interchange would reduce the cut size between the groups. (b) The same network after interchange of the two vertices.

the algorithm proceeds by swapping, on each step, • the pair that • most decreases or least increases • the number of edges between the groups • until no pairs remain to be swapped • when all swaps are completed • go back through every state the network passed through • and choose the state with the smallest cut size

Finally, this entire process is performed repeatedly • starting each time with the best division found in the last round • until no improvement in the cut size occurs • Returns • the division found on the last round • as the best division • First round: random initial division • so repeat the entire algorithm many times with different starting divisions • and choose the best division: the one with the smallest cut size of all

The Algorithm • Start with a random partitioning into groups of sizes n1 and n2 • Repeat • a round • until there is no improvement in cut size • a round: • start with the best division from the previous round • repeat min(n1, n2) times: • perform the best swap • best swap: • for all pairs i, j not yet swapped • try the swap and record the change in cut size

Graph partitioning applied to a small mesh network. (a) A mesh network of 547 vertices of the kind commonly used in finite element analysis (b) The edges removed indicate the best division of the network into parts of 273 and 274 vertices found by the Kernighan-Lin algorithm (c) The best division found by spectral partitioning

The Kernighan-Lin algorithm ends up with a cut size of 40

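Disadvantage • quite slow • number of swaps in one round: • the smaller of the sizes of the two groups, between 0 and n/2, so $O(n)$ • for each swap, examine all pairs: • $(n/2) \times (n/2) = n^2/4$, so $O(n^2)$ • reduction in the cut size when i and j are interchanged: $\Delta = k_i^{\text{other}} - k_i^{\text{same}} + k_j^{\text{other}} - k_j^{\text{same}} - 2A_{ij}$, • evaluating it means running through all neighbors of i and j: average degree, $O(m/n)$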

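For each round • $O(n \times n^2 \times m/n) = O(mn^2)$: • for sparse networks, $m \propto n$: $O(n^3)$ • for dense networks, $m \propto n^2$: $O(n^4)$ • How many rounds? • Improvement for each round: • store the within-group and between-group degrees $k_i^{\text{same}}$ and $k_i^{\text{other}}$ of every vertex • only update them at each swap • calculate $\Delta$ in $O(1)$ • running time $O(n^3)$ for both sparse and dense graphs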

More than two pieces • Once the network is divided into two pieces, • divide it into more than two by • repeating the process • E.g., into three: • first into two parts with n1 = n/3 and n2 = 2n/3 • then divide the n2 part into two equal parts

Spectral Partitioning • n vertices, m edges, divided into group 1 and group 2 • cut size R: number of edges running between the two groups

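with $s_i = +1$ if vertex i is in group 1 and $s_i = -1$ if it is in group 2, the cut size is $R = \frac{1}{4}\sum_{ij} A_{ij}(1 - s_i s_j)$ • using $\sum_j A_{ij} = k_i$ and $s_i^2 = 1$: $R = \frac{1}{4}\sum_{ij} L_{ij} s_i s_j$, • where $L_{ij} = k_i\delta_{ij} - A_{ij}$ is the ijth element of the graph Laplacian matrix • in matrix notation: $R = \frac{1}{4}\mathbf{s}^T\mathbf{L}\mathbf{s}$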
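A small check of the relation between the cut size and the graph Laplacian, assuming the usual convention that $s_i = +1$ for vertices in group 1 and $s_i = -1$ for vertices in group 2, and that the quadratic form carries a factor of 1/4. The adjacency matrix is a plain list of lists, and the function names are illustrative.

```python
def laplacian(A):
    """Graph Laplacian L_ij = k_i * delta_ij - A_ij for a 0/1 adjacency matrix."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]

def cut_size_direct(A, s):
    """Count edges whose endpoints carry different labels."""
    n = len(A)
    return sum(A[i][j] for i in range(n) for j in range(i + 1, n) if s[i] != s[j])

def cut_size_quadratic(A, s):
    """Evaluate (1/4) * sum_ij L_ij s_i s_j."""
    L = laplacian(A)
    n = len(A)
    return sum(L[i][j] * s[i] * s[j] for i in range(n) for j in range(n)) / 4

# Path graph 0-1-2-3 (an example chosen for illustration).
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
for s in ([1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, -1]):
    print(s, cut_size_direct(A, s), cut_size_quadratic(A, s))
```

Both columns should agree for every choice of labels, which is what makes it possible to study the cut size through the Laplacian.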

hard problem: • $s_i$ is restricted to ±1 • relaxation method: • allow $s_i$ to take any real value • subject to a set of constraints • and find the value that minimizes R • first constraint: the length of the vector s is $\sqrt{n}$, i.e., $\sum_i s_i^2 = n$

The relaxation of the constraint allows s to point to any position on a hypersphere circumscribing the original hypercube, rather than just the corners of the hypercube

second constraint: • the numbers of +1 and −1 elements equal the group sizes n1 and n2: $\sum_i s_i = n_1 - n_2$ • in vector form: $\mathbf{1}^T\mathbf{s} = n_1 - n_2$, • where 1 is the vector (1, 1, 1, . . . ) • The problem: • minimize the cut size • subject to the two constraints

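introducing Lagrange multipliers λ and 2μ for the two constraints • taking derivatives: $\frac{\partial}{\partial s_i}\Big[\sum_{jk} L_{jk}s_js_k + \lambda\big(n - \sum_j s_j^2\big) + 2\mu\big((n_1 - n_2) - \sum_j s_j\big)\Big] = 0$, which gives $\sum_j L_{ij}s_j = \lambda s_i + \mu$ • in matrix notation: $\mathbf{L}\mathbf{s} = \lambda\mathbf{s} + \mu\mathbf{1}$ • 1 is an eigenvector of the Laplacian with eigenvalue zero: $\mathbf{L}\cdot\mathbf{1} = 0$ • multiplying on the left by $\mathbf{1}^T$: $\lambda(n_1 - n_2) + \mu n = 0$, so $\mu = -\frac{n_1 - n_2}{n}\lambda$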

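defining a new vector $\mathbf{x} = \mathbf{s} + \frac{\mu}{\lambda}\mathbf{1} = \mathbf{s} - \frac{n_1 - n_2}{n}\mathbf{1}$ gives $\mathbf{L}\mathbf{x} = \lambda\mathbf{x}$: • x is an eigenvector of the Laplacian with eigenvalue λ • multiplying with $\mathbf{1}^T$: $\mathbf{1}^T\mathbf{x} = (n_1 - n_2) - (n_1 - n_2) = 0$ • x is orthogonal to 1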

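the cut size is $R = \frac{1}{4}\mathbf{x}^T\mathbf{L}\mathbf{x} = \frac{n_1 n_2}{n}\lambda$, so x should be the eigenvector with the smallest allowed eigenvalue • the zero eigenvalue has eigenvector (1, 1, 1, . . . ) • but $\mathbf{1}^T\mathbf{x} = 0$: x is orthogonal to this lowest eigenvector • so x is proportional to $\mathbf{v}_2$, the eigenvector with the second-lowest eigenvalue $\lambda_2$ • Finally, we recover the corresponding value of s from Eq. (11.30) thus: $\mathbf{s} = \mathbf{x} + \frac{n_1 - n_2}{n}\mathbf{1}$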

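s must have elements ±1, with n1 elements equal to +1 and n2 elements equal to −1 • choose s to be as close as possible to our ideal value subject to its constraints, • i.e., make the product $\mathbf{s}^T\big(\mathbf{x} + \frac{n_1 - n_2}{n}\mathbf{1}\big) = \sum_i s_i\big(x_i + \frac{n_1 - n_2}{n}\big)$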

as large as possible. • The maximum is achieved by setting $s_i = +1$ for the n1 vertices with the largest $x_i + (n_1 - n_2)/n$ and $s_i = -1$ for the remainder. • Since x is proportional to the eigenvector $\mathbf{v}_2$: • calculate the eigenvector $\mathbf{v}_2$, which has n elements, one for each vertex in the network, and place the n1 vertices with the most positive elements in group 1 and the rest in group 2.

Final algorithm • 1. Calculate the eigenvector $\mathbf{v}_2$ corresponding to the second smallest eigenvalue $\lambda_2$ of the graph Laplacian. • 2. Sort the elements of the eigenvector in order from largest to smallest. • 3. Put the vertices corresponding to the n1 largest elements in group 1, the rest in group 2, and calculate the cut size. • 4. Then put the vertices corresponding to the n1 smallest elements in group 1, the rest in group 2, and recalculate the cut size. • 5. Between these two divisions of the network, choose the one that gives the smaller cut size.
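The five steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the method's canonical implementation: the graph is assumed connected and stored as a dict of neighbor sets, the eigenvector $\mathbf{v}_2$ is approximated by simple power iteration on $c\mathbf{I} - \mathbf{L}$ (a stand-in for a proper eigensolver such as the Lanczos method), and the names `fiedler_vector` and `spectral_bisection` are hypothetical.

```python
import random

def fiedler_vector(adj, iters=2000, seed=0):
    """Approximate v2 by power iteration on (c*I - L), kept orthogonal to (1, 1, ..., 1)."""
    nodes = sorted(adj)
    n = len(nodes)
    c = 2 * max(len(adj[u]) for u in nodes)   # no Laplacian eigenvalue exceeds twice the max degree
    rng = random.Random(seed)
    x = {u: rng.uniform(-1.0, 1.0) for u in nodes}
    for _ in range(iters):
        mean = sum(x.values()) / n
        x = {u: x[u] - mean for u in nodes}   # remove the component along the zero-eigenvalue vector
        # y = (c*I - L) x, using (L x)_u = k_u * x_u - sum of x over the neighbors of u
        y = {u: c * x[u] - (len(adj[u]) * x[u] - sum(x[v] for v in adj[u]))
             for u in nodes}
        norm = sum(val * val for val in y.values()) ** 0.5
        x = {u: y[u] / norm for u in nodes}
    return x

def group_cut_size(adj, group1):
    """Number of edges leaving group1."""
    return sum(1 for u in group1 for v in adj[u] if v not in group1)

def spectral_bisection(adj, n1):
    v2 = fiedler_vector(adj)                                   # step 1
    order = sorted(adj, key=lambda u: v2[u], reverse=True)     # step 2
    candidates = [set(order[:n1]), set(order[-n1:])]           # steps 3 and 4
    best = min(candidates, key=lambda g: group_cut_size(adj, g))  # step 5
    return best, group_cut_size(adj, best)

# Example graph (an assumption for illustration): two 4-cliques joined by edge (3, 4).
cliques = ([0, 1, 2, 3], [4, 5, 6, 7])
edges = [(a, b) for g in cliques for a in g for b in g if a < b] + [(3, 4)]
adj = {u: set() for u in range(8)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
print(spectral_bisection(adj, n1=4))
```

Power iteration converges slowly when $\lambda_2$ and $\lambda_3$ are close, so for large networks a sparse eigensolver would be the more practical choice.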

Fig. 11.3c • the spectral method finds • a division with a cut size of 46 edges • Kernighan-Lin: 40 • the spectral method tends to find divisions of a network • that have the right general shape, • but are perhaps not quite as good as • those returned by other methods.

advantage • speed • calculation of the eigenvector $\mathbf{v}_2$ takes $O(mn)$, or $O(n^2)$ on a sparse network having $m \propto n$. This is one factor of n better than the $O(n^3)$ of the Kernighan-Lin algorithm, • which makes it feasible for much larger networks: • hundreds of thousands of vertices, whereas the Kernighan-Lin algorithm • is restricted to networks of a few thousand vertices at most

Community Detection • the search for naturally occurring groups • regardless of their number and size • a tool for discovering and understanding • the structure of large-scale networks • separate the network into groups of vertices • with few connections between them • the number and sizes of the groups are not fixed • simplest case: a graph bisection problem • dividing into two non-overlapping groups • without any constraint on their sizes

find the division with minimum cut size • without any constraint on the sizes of the groups? • Not useful: the optimum puts all vertices in one group • giving a cut size of zero • ratio cut partitioning: • minimize $R/(n_1 n_2)$; the denominator is largest when $n_1 = n_2 = n/2$ • so it is biased towards equally sized groups • and there is no principled rationale behind its definition

a good division: fewer edges between groups than expected • where the expectation is for edges placed at random • or, alternatively, more edges within groups than expected • the two approaches are equivalent • given the total number of edges • assortative mixing: • vertices with similar characteristics tend to be connected • measured by modularity • look for divisions with high modularity scores

Simple Modularity Maximization • Analog of the Kernighan-Lin algorithm • divides the network into two communities • starting from an initial division into equally sized groups • for each vertex • calculate how much the modularity would change • if the vertex moved to the other group • choose the vertex • whose movement most increases or least decreases the modularity • repeat the process • once a vertex is moved it cannot be selected again

when all vertices have been moved exactly once • go back over all the states passed through • select the one with maximum modularity • use that state as the starting point of the next round • repeat rounds • until the modularity no longer improves
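The vertex-moving procedure described above can be sketched as follows. The code is a deliberately simple illustration: it assumes a dict-of-sets graph, uses hypothetical function names, and recomputes the modularity from scratch for every candidate move instead of evaluating the change incrementally. Modularity for two groups is computed as the sum over groups of (fraction of edges inside the group) minus (fraction of edge ends attached to the group) squared, which is a standard equivalent form.

```python
def modularity(adj, side):
    """Two-group modularity: sum over groups of e_g/m - (d_g/2m)^2."""
    m = sum(len(nb) for nb in adj.values()) / 2
    q = 0.0
    for g in (0, 1):
        members = [u for u in adj if side[u] == g]
        inside = sum(1 for u in members for v in adj[u] if side[v] == g) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        q += inside / m - (degree_sum / (2 * m)) ** 2
    return q

def moved(side, u):
    """Copy of the division with vertex u moved to the other group."""
    new = dict(side)
    new[u] = 1 - new[u]
    return new

def modularity_round(adj, side):
    """Move every vertex exactly once, best move first; return the best state seen."""
    side = dict(side)
    best, best_q = dict(side), modularity(adj, side)
    free = set(adj)                           # vertices not yet moved in this round
    while free:
        u = max(sorted(free), key=lambda w: modularity(adj, moved(side, w)))
        side = moved(side, u)                 # move even if the modularity decreases
        free.remove(u)
        q = modularity(adj, side)
        if q > best_q:
            best, best_q = dict(side), q
    return best, best_q

def maximize_modularity(adj, side):
    """Repeat rounds until the modularity no longer improves."""
    q = modularity(adj, side)
    while True:
        new_side, new_q = modularity_round(adj, side)
        if new_q <= q + 1e-12:
            return side, q
        side, q = new_side, new_q

# Example graph (an assumption for illustration): two 4-cliques joined by edge (3, 4),
# starting from a deliberately poor division into odd and even vertices.
cliques = ([0, 1, 2, 3], [4, 5, 6, 7])
edges = [(a, b) for g in cliques for a in g for b in g if a < b] + [(3, 4)]
adj = {u: set() for u in range(8)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
side, q = maximize_modularity(adj, {u: u % 2 for u in adj})
print(round(q, 4), side)
```

Unlike the Kernighan-Lin sketch for partitioning, single vertices are moved rather than pairs swapped, so the group sizes are free to change during a round.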

Karate club network

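efficiency • at each step, evaluate the modularity change for each of the $O(n)$ vertices • each evaluation takes $O(m/n)$ • so each step takes $O(m)$ • each round has n steps: $O(mn)$ • compared with $O(mn^2)$ for the Kernighan-Lin algorithm • because moving steps consider $O(n)$ vertices • while the swap steps of Kernighan-Lin consider $O(n^2)$ pairs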

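Spectral Modularity Maximization • analog of spectral graph partitioning • modularity: $Q = \frac{1}{2m}\sum_{ij}\Big(A_{ij} - \frac{k_i k_j}{2m}\Big)\delta(c_i, c_j) = \frac{1}{2m}\sum_{ij} B_{ij}\,\delta(c_i, c_j)$ • where • $c_i$ is the group to which vertex i belongs, • δ(m, n) is the Kronecker delta, • and $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$ is the modularity matrix, with the property $\sum_j B_{ij} = 0$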

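division into two parts • $s_i = +1$ if vertex i belongs to group 1 and $s_i = -1$ if it belongs to group 2 • $\frac{1}{2}(s_i s_j + 1)$ is 1 if i and j are in the same group and 0 otherwise • Kronecker delta: $\delta(c_i, c_j) = \frac{1}{2}(s_i s_j + 1)$ • then $Q = \frac{1}{4m}\sum_{ij} B_{ij}(s_i s_j + 1) = \frac{1}{4m}\sum_{ij} B_{ij} s_i s_j$ • in matrix terms: $Q = \frac{1}{4m}\mathbf{s}^T\mathbf{B}\mathbf{s}$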

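where • s is the vector with elements $s_i$, • B is the n × n matrix with elements $B_{ij}$, the modularity matrix • relax $s_i$ to real values with the length constraint $\mathbf{s}^T\mathbf{s} = \sum_i s_i^2 = n$ • maximize the modularity Q subject to this constraint • taking the derivatives with a Lagrange multiplier β: $\frac{\partial}{\partial s_i}\Big[\sum_{jk} B_{jk} s_j s_k + \beta\big(n - \sum_j s_j^2\big)\Big] = 0$, which gives $\sum_j B_{ij} s_j = \beta s_i$ • in matrix form: $\mathbf{B}\mathbf{s} = \beta\mathbf{s}$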

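s is one of the eigenvectors of the modularity matrix • the modularity is then $Q = \frac{1}{4m}\beta\,\mathbf{s}^T\mathbf{s} = \frac{n}{4m}\beta$, • so s should be proportional to $\mathbf{u}_1$, the eigenvector corresponding to the largest (most positive) eigenvalue • since $s_i$ must be ±1, make s as close to $\mathbf{u}_1$ as possible by maximizing $\mathbf{s}^T\mathbf{u}_1 = \sum_i s_i[\mathbf{u}_1]_i$, • where $[\mathbf{u}_1]_i$ is the ith element of $\mathbf{u}_1$. The maximum is achieved when each term in the sum is non-negative, i.e., when $s_i = +1$ if $[\mathbf{u}_1]_i > 0$ and $s_i = -1$ if $[\mathbf{u}_1]_i < 0$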

If a vector element is exactly zero, either value of $s_i$ is equally good and we can choose whichever we prefer. • And so we are led to the following very simple algorithm. We calculate the eigenvector of the modularity matrix corresponding to the largest (most positive) eigenvalue and then assign vertices to communities according to the signs of the vector elements, positive signs in one group and negative signs in the other. • In practice this method works very well. For example, when applied
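The sign-based algorithm described above can be sketched in plain Python. This is a hedged illustration, not a definitive implementation: the leading eigenvector of the modularity matrix is approximated by power iteration on $\mathbf{B} + c\mathbf{I}$, where the shift $c$ (twice the maximum degree) is an assumption chosen so that the most positive eigenvalue of $\mathbf{B}$ dominates; the graph is a dict of neighbor sets; all function names are hypothetical; and a zero element is arbitrarily assigned to the +1 group.

```python
import random

def leading_modularity_vector(adj, iters=2000, seed=0):
    """Approximate the eigenvector of B = A - k k^T / 2m with the most positive eigenvalue."""
    nodes = sorted(adj)
    k = {u: len(adj[u]) for u in nodes}
    two_m = sum(k.values())
    c = 2 * max(k.values())                   # shift so that every eigenvalue of B + c*I is >= 0
    rng = random.Random(seed)
    x = {u: rng.uniform(-1.0, 1.0) for u in nodes}
    for _ in range(iters):
        kx = sum(k[u] * x[u] for u in nodes)
        # y = (B + c*I) x, with (B x)_u = sum of x over neighbors of u - k_u * (k . x) / 2m
        y = {u: sum(x[v] for v in adj[u]) - k[u] * kx / two_m + c * x[u]
             for u in nodes}
        norm = sum(val * val for val in y.values()) ** 0.5
        x = {u: y[u] / norm for u in nodes}
    return x

def spectral_communities(adj):
    """Assign s_i = +1 or -1 according to the sign of the leading eigenvector element."""
    u1 = leading_modularity_vector(adj)
    return {u: (1 if u1[u] >= 0 else -1) for u in adj}

def two_group_modularity(adj, s):
    """Q = (1/4m) * sum_ij B_ij s_i s_j."""
    k = {u: len(adj[u]) for u in adj}
    two_m = sum(k.values())
    total = 0.0
    for i in adj:
        for j in adj:
            b_ij = (1 if j in adj[i] else 0) - k[i] * k[j] / two_m
            total += b_ij * s[i] * s[j]
    return total / (2 * two_m)

# Example graph (an assumption for illustration): two 4-cliques joined by edge (3, 4).
cliques = ([0, 1, 2, 3], [4, 5, 6, 7])
edges = [(a, b) for g in cliques for a in g for b in g if a < b] + [(3, 4)]
adj = {u: set() for u in range(8)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
s = spectral_communities(adj)
print(s, round(two_group_modularity(adj, s), 4))
```

Which community receives the label +1 is arbitrary, since an eigenvector is only defined up to an overall sign.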
