
Network clustering

Explore network clustering approaches for analyzing biological data, including hierarchical, clique-based, and center-based methods. Understand the concepts, applications, and implications of network clustering in biological research.


Presentation Transcript


  1. Network clustering Presented by Wooyoung Kim 2/6/2009 CSc 8910 Analysis of Biological Network, Spring 2009 Dr. Yi Pan

  2. Outline Introduction Definitions and Basic Concepts Network Clustering Problem Hierarchical clustering Clique-based clustering Center-based clustering Conclusion

  3. Introduction • Clustering is… • Loosely defined as the process of grouping objects into sets called clusters, so that each cluster consists of elements that are similar in some way • Examples: • Distance-based clustering: objects are grouped because they are close under a given distance metric • Conceptual clustering: objects are grouped based on shared descriptive concepts

  4. Introduction • Clustering is… • used for multiple purposes, including • Finding “natural” clusters (modules) and describing their properties • Classifying the data • Detecting unusual data (outliers) • Data reduction by treating a cluster or one of its elements as a single representative unit

  5. Introduction • Network clustering … • deals with clustering data represented as a network or a graph • Link analysis • Data points are represented by vertices • An edge exists if two data points are similar or related in a certain way • Similarity criteria: • pairwise relations define the network model • cohesiveness defines similarity within a cluster

  6. Introduction • Network clustering approaches are used to perform … • Distance-based clustering: • Vertices are data points, and edges connect close points • Distances can also be used to weight the edges of a complete graph

  7. Introduction • Conceptual clustering • Generating a concept description for each generated cluster • In database networks, a matching field is designed, and vertices are connected by an edge if their matching fields are “close” • Examples: • Protein interaction networks: proteins are vertices and a pair is connected by an edge if they are known to interact • Gene co-expression networks: genes are vertices and an edge indicates that the pair of genes (its end points) are co-expressed above some cut-off value, based on microarray experiments.

  8. Introduction • Application of network clustering • Understand the structure and function of proteins based on protein interaction maps of organisms • Clustering protein interaction networks (PINs) using cliques to decompose the network into functional modules and protein complexes • Use of cliques and other high-density subgraphs to identify protein complexes (splicing machinery, transcription factors, etc.) and functional modules (signalling cascades, cell cycle regulation)

  9. Introduction • Application of network clustering • Protein complexes: groups of proteins that interact with each other at the same time and place. • Functional modules : groups of proteins that are known to have pairwise interactions by binding with each other to participate in different cellular processes at different times

  10. Definition and basic concepts • G=(V,E) is a simple, undirected graph • n=|V|, e=|E| • Ḡ=(V,Ē) is the complement graph of G, where Ē is the complement set of edges • G[S] is the subgraph of G induced by a subset S of V • N(v) is the set of neighbours of a vertex v in G (excluding v) • Degree deg(v)=|N(v)| • N[v]=N(v) ∪ {v} • d(u,v) is the distance between u and v • Length of the shortest path from u to v • The diameter of the graph is diam(G)=max d(i,j) over all vertex pairs i and j
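These definitions translate directly into code. The following short Python sketch (an illustration, not part of the slides; function names and the adjacency-dictionary representation are assumptions) stores a simple undirected graph as a dictionary mapping each vertex to the set of its neighbours, and computes N(v), deg(v), d(u,v) by breadth-first search, and the diameter.

    from collections import deque

    def neighbours(adj, v):
        # N(v): the neighbours of v, excluding v itself
        return adj[v]

    def degree(adj, v):
        # deg(v) = |N(v)|
        return len(adj[v])

    def distance(adj, u, v):
        # d(u, v): length of a shortest path from u to v, via breadth-first search
        seen, queue = {u: 0}, deque([u])
        while queue:
            x = queue.popleft()
            if x == v:
                return seen[x]
            for y in adj[x]:
                if y not in seen:
                    seen[y] = seen[x] + 1
                    queue.append(y)
        return float("inf")  # u and v lie in different components

    def diameter(adj):
        # diam(G) = max d(i, j) over all vertex pairs
        vs = list(adj)
        return max(distance(adj, i, j) for i in vs for j in vs if i != j)

For instance, with the hypothetical triangle adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}, distance(adj, 1, 3) is 1 and diameter(adj) is 1.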

  11. Definition and basic concepts • Edge connectivity k’(G) of a graph: the minimum number of edges that must be removed to disconnect the graph • Vertex connectivity (or connectivity) k(G) of a graph: the minimum number of vertices whose removal disconnects the graph (or results in a trivial graph) • trivial graph: one vertex, no edges • connected graph: every pair of vertices is connected by a path

  12. Definition and basic concepts • Example • (Vertex) connectivity k(G)=2: removal of vertices 3 and 5 would disconnect the graph [figure: example graph on vertices 1-12]

  13. Definition and basic concepts • Example • Edge connectivity k’(G)=2: removal of edges (9,11) and (10,12) would disconnect the graph [figure: example graph on vertices 1-12]

  14. Definition and basic concepts • A clique C is a subset of vertices such that an edge exists between every pair of vertices in C • The induced subgraph G[C] is a complete graph • A clique is maximal if it is not a subset of any larger clique • A clique is maximum if there are no larger cliques in the graph • A subset of vertices I is called an independent set (also called a stable set) if for every pair of vertices in I, (i, j) is not an edge • The induced subgraph G[I] is edgeless • An independent set is maximal if it is not a subset of any larger independent set • An independent set is maximum if there are no larger independent sets in the graph
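Both definitions are easy to verify programmatically. The following sketch (illustrative, not from the slides) checks whether a candidate vertex set is a clique or an independent set in the same adjacency-set representation used above.

    from itertools import combinations

    def is_clique(adj, C):
        # every pair of vertices in C must be joined by an edge
        return all(v in adj[u] for u, v in combinations(C, 2))

    def is_independent_set(adj, I):
        # no pair of vertices in I may be joined by an edge
        return all(v not in adj[u] for u, v in combinations(I, 2))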

  15. Definition and basic concepts • Example: maximal clique • {1,2,3} is a maximal clique [figure: example graph on vertices 1-12]

  16. Definition and basic concepts • Example: maximum clique • {7,8,9,10} is the maximum clique [figure: example graph on vertices 1-12]

  17. Definition and basic concepts • Example: maximal independent set • I={3,7,11} is a maximal independent set, as there is no larger independent set containing it [figure: example graph on vertices 1-12]

  18. Definition and basic concepts • Example: maximum independent set • The set {1,4,5,10,11} is a maximum independent set, i.e. a set of the largest cardinality in the graph [figure: example graph on vertices 1-12]

  19. Definition and basic concepts • C is a clique in G if and only if C is an independent set in the complement graph Ḡ • The clique number ω(G) is the cardinality of a maximum clique • The independence number α(G) is the cardinality of a maximum independent set

  20. Definition and basic concepts Algorithm for maximal independent set
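The pseudocode on this slide is only available as a figure in the original deck, so the following is a hedged sketch of one common greedy heuristic for building a maximal independent set: scan the vertices (here in increasing degree order, which tends to yield larger sets) and keep a vertex whenever none of its neighbours has already been chosen.

    def greedy_maximal_independent_set(adj):
        order = sorted(adj, key=lambda v: len(adj[v]))   # low-degree vertices first
        independent = set()
        for v in order:
            if not (adj[v] & independent):               # no chosen neighbour yet
                independent.add(v)
        return independent

The result is maximal by construction: any vertex left out already has a neighbour in the set, so no vertex can be added without breaking independence.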

  21. Definition and basic concepts • Algorithm for maximal independent set: worked example [figure: small example graph on vertices 1-7]

  22. Definition and basic concepts • A dominating set is a set of vertices such that every vertex in the graph is either in this set or has a neighbour in this set • A dominating set is minimal if it contains no proper subset which is dominating • A dominating set is a minimum dominating set if it is of the smallest cardinality • The cardinality of a minimum dominating set is called the domination number γ(G) of the graph
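The domination property is likewise a one-line check in code. The sketch below (illustrative, not from the slides) tests whether a vertex set D dominates an adjacency-set graph.

    def is_dominating_set(adj, D):
        # D is a set of vertices; every vertex must be in D or adjacent to some vertex of D
        return all(v in D or adj[v] & D for v in adj)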

  23. Definition and basic concepts • D = {7, 11, 3} is a minimal and minimum dominating set [figure: example graph on vertices 1-12]

  24. Definition and basic concepts • A connected dominating set is one in which the subgraph induced by the dominating set is connected • An independent dominating set is one in which the dominating set is also independent

  25. Network clustering problem • Given a graph G=(V,E), find subsets (not necessarily disjoint) {V1,...,Vr} of V such that V = V1 ∪ ... ∪ Vr, where • Each subset is a cluster modelled by structures such as cliques or other distance- and diameter-based models • The model used as a cluster represents the cohesiveness required of the cluster

  26. Network clustering problem • Clustering models can be classified by • The constraints on relations between clusters (clusters may be disjoint or overlapping) • The objective function used to achieve the goal of clustering (minimizing the number of clusters or maximizing the cohesiveness) • When clusters are required to be disjoint • {V1,...,Vr} is a cluster partition → exclusive clustering • When clusters are allowed to overlap • {V1,...,Vr} is a cluster cover → overlapping clustering

  27. Network clustering problem • Assume that there is a measure of cohesiveness of the cluster that can be varied for a graph G → this defines two types of optimization problems: • Type I: Minimize the number of clusters while ensuring that every cluster formed has cohesiveness over a prescribed threshold • Example: the problem of clustering an incomplete graph with cliques used as clusters and the objective of minimizing the number of clusters

  28. Network clustering problem • Type II: Maximize the cohesiveness of each cluster formed, while the number of clusters is K (this last requirement may be relaxed by letting K be infinite) • Example: assume that G has non-negative edge weights w; for a cluster Vi let Ei denote the edges in the subgraph induced by Vi • Use w as a dissimilarity measure (distance) • For example, w(Ei) = ∑ w(e) over all e ∈ Ei is a meaningful measure of cohesiveness • It can be used to formulate a Type II clustering problem • We will refer to problems as Type I and Type II based on their objective

  29. Hierarchical clustering • After performing clustering, we can abstract the graph G0 to a graph G1 = (V1, E1) as follows: • There exists a vertex vi1 in V1 for every subset (cluster) Vi0 of G0 • There exists an edge between vi1 and vj1 if and only if there exist a vertex x in cluster Vi0 and a vertex y in cluster Vj0 that are joined by an edge in G0 • In other words: if any two vertices from different clusters have an edge between them in the original graph G0, the clusters containing them are made adjacent in G1

  30. Hierarchical clustering • We can recursively cluster the abstracted graph G1 in a similar fashion to obtain a multilevel hierarchy • This process is called hierarchical clustering • Example • The following subsets form clusters in this graph: C1={7,8,9,10}, C2={1,2,3}, C3={4,6}, C4={11,12}, C5={5} • Given the clusters of the example graph G, we can construct an abstracted graph G’
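A compact sketch of this abstraction step follows (an illustration under the assumption that the graph is given as an edge list and the clusters as vertex sets; the cluster labels C1..C5 mirror those on the slide): each cluster collapses to a single vertex, and two cluster-vertices are joined whenever any edge of G0 crosses between their clusters.

    def abstract_graph(edges, clusters):
        # edges: iterable of pairs (u, v); clusters: dict cluster_id -> set of vertices
        owner = {v: cid for cid, members in clusters.items() for v in members}
        abstract_edges = set()
        for u, v in edges:
            cu, cv = owner[u], owner[v]
            if cu != cv:                                   # edge crosses two clusters
                abstract_edges.add(tuple(sorted((cu, cv))))
        return set(clusters), abstract_edges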

  31. Clique-based clustering • A clique is a natural choice for a highly cohesive cluster • Cliques have • Minimum possible diameter • Maximum connectivity • Maximum possible degree for each vertex • Given an arbitrary graph • A Type I approach tries to partition it into (or cover it using) a minimum number of cliques • Type II approaches usually work with a weighted complete graph, and hence every partition of the vertex set is a clique partition

  32. Clique-based clustering • Minimum clique partitioning • The Type I clique partitioning and clique covering problems are both NP-hard [Garey79] • Heuristic approaches are preferred for large graphs • Note that the clique-partitioning and clique-covering problems are closely related • The minimum number of clusters produced in clique covering and in clique partitioning is the same

  33. Clique-based clustering • Clique-partitioning and clique-covering problems • p: optimal number of cliques in a covering • c: optimal number of cliques in a partitioning • p ≤ c, since every clique partition is also a cover. • c ≤ p, since any vertex v present in multiple clusters (causing overlaps) can be removed from all but one of those clusters, leaving one less overlap. Repeating this yields a clique partition with the same number of clusters as the cover, hence c ≤ p.

  34. Clique-based clustering Simple heuristic for clique partitioning
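The heuristic itself appears only as a figure in the original slides, so the following is a hedged sketch of one simple greedy strategy consistent with the surrounding discussion: repeatedly grow a clique from an unassigned vertex by adding unassigned vertices adjacent to everything already in the clique, then set the finished clique aside as a cluster.

    def greedy_clique_partition(adj):
        unassigned = set(adj)
        cliques = []
        while unassigned:
            # seed with the vertex of highest remaining degree
            v = max(unassigned, key=lambda u: len(adj[u] & unassigned))
            clique = {v}
            for u in sorted(unassigned - {v}):
                if clique <= adj[u]:          # u is adjacent to every clique member
                    clique.add(u)
            cliques.append(clique)
            unassigned -= clique
        return cliques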

  35. Clique-based clustering • Example [figure: example graph on vertices 1-12]

  36. Clique-based clustering • Example

  37. Clique-based clustering • Example

  38. Clique-based clustering • Min-Max k-Clustering • A Type II clique partitioning problem with a min-max objective • Consider a weighted complete graph G=(V,E) with weights w(e1) ≤ w(e2) ≤ … ≤ w(em), where m=n(n−1)/2 • Partition the graph into no more than k cliques s.t. the maximum weight of an edge between two vertices inside a clique is minimized • In other words, if V1,...,Vk is the clique partition, then we wish to minimize the maximum of w(u,v) over all clusters Vi (i=1,…,k) and all pairs u,v ∈ Vi • The weight w(i,j) can be thought of as a measure of dissimilarity • A larger w(i,j) means i and j are more dissimilar • The problem tries to cluster the graph into at most k cliques such that the maximum dissimilarity between any two vertices inside a clique is minimized

  39. Clique-based clustering • Min-Max k-Clustering • Given any graph G=(V,E), the required edge-weighted complete graph can be obtained in different ways using meaningful measures of dissimilarity • The weight w(i,j) could be d(i,j), the shortest-path distance between i and j in G • The weight could also be based on k(i,j) or k’(i,j), the minimum number of vertices or edges that need to be removed from G to disconnect i and j • Since these are measures of similarity, we obtain the required dissimilarity weights as w(i,j)=|V|−k(i,j) or w(i,j)=|E|−k’(i,j)

  40. Clique-based clustering • Bottleneck graph • The bottleneck graph of a weighted graph G=(V,E) is defined for a given number c as follows • G(c)=(V, Ec) where Ec = {e ∈ E | w(e) ≤ c} • The bottleneck graph G(c) contains only those edges with weight at most c • Example: a complete weighted graph G and its bottleneck graphs G(1) and G(2) for weights 1 and 2, respectively
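Constructing a bottleneck graph is a one-line filter. The sketch below (illustrative; it assumes the weighted graph is given as a dictionary mapping each edge (u, v) to its weight) keeps exactly the edges of weight at most c.

    def bottleneck_graph(vertices, weights, c):
        # weights: dict mapping an edge (u, v) to its weight w(u, v)
        return set(vertices), {e for e, w in weights.items() if w <= c}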

  41. Clique-based clustering • Bottleneck heuristic for the min-max k-clustering problem

  42. Clique-based clustering • Procedure bottleneck(…) returns the bottleneck graph G(…) • MIS(…) is an arbitrary procedure for finding a maximal independent set (MIS) in G • This algorithm would be optimal if we managed to find a maximum independent set (one of largest size) in every iteration • That problem is NP-hard • To obtain a polynomial-time algorithm, we therefore restrict ourselves to finding a MIS using heuristic approaches, such as the greedy approach described earlier
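The pseudocode for the bottleneck heuristic is likewise only available as a figure in the original deck. The sketch below is a hedged reconstruction of the loop described on this slide, under the assumption that the threshold is raised through the sorted edge weights until a maximal independent set of the bottleneck graph has at most k vertices, whose members then serve as cluster centres. It reuses the bottleneck_graph and greedy_maximal_independent_set sketches from earlier and illustrates the idea rather than reproducing the exact algorithm of the slides.

    def bottleneck_min_max_k_clustering(vertices, weights, k):
        # weights: dict mapping an edge (u, v) to a dissimilarity weight
        for c in sorted(set(weights.values())):
            _, edges_c = bottleneck_graph(vertices, weights, c)
            adj = {v: set() for v in vertices}
            for u, v in edges_c:
                adj[u].add(v)
                adj[v].add(u)
            centres = greedy_maximal_independent_set(adj)     # the MIS(...) step
            if len(centres) <= k:
                clusters = {h: {h} for h in centres}
                for v in vertices:
                    if v in centres:
                        continue
                    # maximality of the MIS guarantees v has a neighbour among the centres
                    for h in centres:
                        if h in adj[v]:
                            clusters[h].add(v)
                            break
                return clusters
        return {v: {v} for v in vertices}                     # fallback: singleton clusters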

  43. Clique-based clustering • Example • Clustering output of the bottleneck min-max k-clustering algorithm with k=2 for the following graph

  44. Clique-based clustering Result

  45. Center-based clustering • In center-based clustering models, the elements of a cluster are determined based on their similarity with the cluster’s center (or cluster-head) • Center-based clustering algorithms usually consist of two steps • First, an optimization procedure is used to determine the cluster-heads • Second, the cluster-heads are then used to form clusters around them

  46. Center-based clustering • Clustering with dominating sets • The minimum dominating set and related problems provide a modelling tool for centre-based clustering of Type I • The minimum dominating set problem is NP-hard → heuristic approaches and approximation algorithms are used to find a small dominating set • If D denotes a dominating set, then for each vertex v in D the closed neighbourhood N[v] forms a cluster • By the definition of domination, every vertex not in the dominating set has a neighbour in it and hence is assigned to some cluster

  47. Center-based clustering • Each v in D is called a cluster-head, and the number of clusters that result is exactly the size of the dominating set • Minimizing the size of the dominating set thus minimizes the number of clusters produced, resulting in a Type I clustering problem • This approach results in a cluster cover, since the resulting clusters need not be disjoint
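A minimal sketch of this clustering step (an illustration, assuming an adjacency-set graph and an already computed dominating set D): each cluster is the closed neighbourhood N[v] of a cluster-head v in D, so the number of clusters equals |D| and, because closed neighbourhoods may overlap, the result is a cluster cover.

    def dominating_set_clusters(adj, D):
        # one cluster per cluster-head: N[head] = N(head) ∪ {head}
        return {head: adj[head] | {head} for head in D}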

  48. Center-based clustering • Each cluster has diameter at most two, as every vertex in the cluster is adjacent to its cluster-head, and the cluster-head is “similar” to all the other vertices in its cluster • However, neighbours of the cluster-head may be poorly connected among themselves • Some post-processing may be required, as a cluster formed in this fashion from an arbitrary dominating set could completely contain another cluster • Clustering with dominating sets is especially suited for clustering protein interaction networks, since it can reveal groups of proteins that interact through a central protein, which is identified as the cluster-head in this method

  49. Center-based clustering • Independent Dominating Sets • Finding a maximal independent set also yields a minimal independent dominating set • This can be used to cluster the graph • Here, no cluster formed can completely contain another cluster, as the cluster-heads are independent and therefore distinct

  50. Center-based clustering • Greedy algorithm for minimal independent dominating sets • Proceeds by adding a maximum degree vertex to the current independent set and then deleting that vertex along with its neighbours • Greedy because it adds a maximum degree vertex so that a larger number of vertices are removed in each iteration, yielding a small independent dominating set • This is repeated until no more vertices exist
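A hedged sketch of this greedy procedure (illustrative, using the same adjacency-set representation as the earlier sketches): pick a maximum-degree vertex of the remaining graph as a cluster-head, delete it together with its neighbours, and repeat until no vertices remain; the chosen vertices form an independent dominating set.

    def greedy_independent_dominating_set(adj):
        remaining = {v: set(adj[v]) for v in adj}                 # working copy of the graph
        heads = set()
        while remaining:
            v = max(remaining, key=lambda u: len(remaining[u]))   # maximum remaining degree
            heads.add(v)
            removed = remaining[v] | {v}                          # v together with its neighbours
            for u in removed:
                remaining.pop(u, None)
            for u in remaining:
                remaining[u] -= removed
        return heads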
