Explore network clustering approaches for analyzing biological data, including hierarchical, clique-based, and center-based methods. Understand the concepts, applications, and implications of network clustering in biological research.
Network clustering Presented by Wooyoung Kim 2/6/2009 CSc 8910 Analysis of Biological Network, Spring 2009 Dr. Yi Pan
Outline Introduction Definitions and Basic Concepts Network Clustering Problem Hierarchical clustering Clique-based clustering Centre-based clustering Conclusion
Introduction • Clustering is… • Loosely defined as the process of grouping objects into sets called clusters so that each cluster consists of elements that are similar in some way. • Example: • Distance-based clustering: objects are grouped when they are close under a given distance metric • Conceptual clustering: objects are grouped based on descriptive concepts
Introduction • Clustering is… • used for multiple purposes, including • Finding “natural” clusters (modules) and describing their properties • Classifying the data • Detecting unusual data (outliers) • Data reduction by treating a cluster or one of its elements as a single representative unit
Introduction • Network clustering … • deals with clustering the data represented as a network or a graph • Link analysis • Data points are represented by vertices • An edge exists if two data points are similar or related in a certain way • Similarity criterion : • Pairwise relations - for network model • cohesiveness – for cluster similarity
Introduction • Network clustering approaches are used to perform … • Distance-based clustering: • Vertices are data points, and edges join points that are close • Alternatively, distances are used to weight the edges of a complete graph
Introduction • Conceptual clustering • Generates a concept description for each generated cluster • In database networks, a matching field is designated, and vertices are connected by an edge if their matching fields are “close” • Example: • In protein interaction networks, proteins are vertices and a pair is connected by an edge if the two proteins are known to interact • In gene co-expression networks, genes are vertices and an edge indicates that the pair of genes (end points) is co-expressed over some cut-off value, based on microarray experiments.
Introduction • Application of network clustering • Understand the structure and function of proteins based on protein interaction maps of organisms • Clustering protein interaction networks (PINs) using cliques to decompose the Protein interaction network into functional modules and protein complexes • Use of cliques and other high density subgraphs to identify protein complexes (splicing machinery, transcription factors, etc.) and functional modules (signalling cascades, cell cycle regulation)
Introduction • Application of network clustering • Protein complexes: groups of proteins that interact with each other at the same time and place. • Functional modules : groups of proteins that are known to have pairwise interactions by binding with each other to participate in different cellular processes at different times
Definition and basic concepts • G=(V,E) is a simple, undirected graph • n=|V|, e=|E| • Ḡ=(V,Ē) is the complement graph of G • Ē is the complement set of edges • G[S] is the induced subgraph of G (induced by a subset S of V) • N(v) is the set of neighbours of a vertex v in G (excluding v) • Degree deg(v)=|N(v)| • N[v]=N(v) ∪ {v} • d(u,v) is the distance between u and v • Length of the shortest path from u to v • The diameter of a graph is the maximum of d(i,j) over all vertex pairs i and j
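The definitions above can be made concrete with a small sketch. The 5-vertex graph and all function names here are hypothetical, chosen only for illustration; the graph is stored as a dictionary of adjacency sets, and distances are computed by breadth-first search.

```python
from collections import deque

# Hypothetical example graph (not the one from the slides): a triangle 1-2-3
# with a tail 3-4-5, stored as adjacency sets.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def open_neighbourhood(G, v):
    """N(v): the neighbours of v, excluding v itself."""
    return set(G[v])

def closed_neighbourhood(G, v):
    """N[v]: N(v) together with v."""
    return set(G[v]) | {v}

def distances_from(G, source):
    """Breadth-first search: shortest-path length from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in G[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter(G):
    """Largest distance over all vertex pairs; assumes G is connected."""
    return max(max(distances_from(G, v).values()) for v in G)

print(len(open_neighbourhood(G, 3)))  # deg(3)
print(distances_from(G, 1)[5])        # d(1, 5)
print(diameter(G))
```

For this graph, deg(3) is 3 and both d(1,5) and the diameter are 3.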
Definition and basic concepts • Edge connectivity k’(G) of a graph: the minimum number of edges that must be removed to disconnect the graph • Vertex connectivity (or connectivity) k(G) of a graph: the minimum number of vertices whose removal disconnects the graph or leaves a trivial graph • Trivial graph: one vertex, no edges • Connected graph: every pair of vertices is joined by a path.
Definition and basic concepts • Example • (Vertex) connectivity k(G)=2: removal of vertices 3 and 5 would disconnect the graph [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • Example • Edge connectivity k’(G)=2: removal of edges (9,11) and (10,12) would disconnect the graph [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • A clique C is a subset of vertices such that an edge exists between every pair of vertices in C • The induced subgraph G[C] is a complete graph • A clique is maximal if it is not a subset of any larger clique • A clique is maximum if there are no larger cliques in the graph • A subset of vertices I is called an independent set (also called a stable set) if for every pair of vertices i, j in I, (i, j) is not an edge • The induced subgraph G[I] is edgeless • An independent set is maximal if it is not a subset of any larger independent set • An independent set is maximum if there are no larger independent sets in the graph
Definition and basic concepts • Example: maximal clique • {1,2,3} is a maximal clique [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • Example: maximum clique • {7,8,9,10} is the maximum clique [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • Example: maximal independent set • I={3,7,11} is a maximal independent set as there is no larger independent set containing it [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • Example: maximum independent set • The set {1,4,5,10,11} is a maximum independent set, one of largest cardinality in the graph [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • C is a clique in G if and only if C is an independent set in the complement graph • Clique number ω(G) is the cardinality of a maximum clique • Independence number α(G) is the cardinality of a maximum independent set
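The clique and independent-set correspondence can be checked directly on a small graph. This is a minimal sketch on a hypothetical 4-vertex graph; the helper names are illustrative and not taken from the slides.

```python
from itertools import combinations

def complement(G):
    """Complement graph: same vertices, an edge exactly where G has none."""
    V = set(G)
    return {v: V - G[v] - {v} for v in G}

def is_clique(G, C):
    """True if every pair of vertices in C is joined by an edge of G."""
    return all(v in G[u] for u, v in combinations(C, 2))

def is_independent(G, I):
    """True if no pair of vertices in I is joined by an edge of G."""
    return not any(v in G[u] for u, v in combinations(I, 2))

# Hypothetical graph: triangle 1-2-3 plus the pendant edge 3-4.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
C = {1, 2, 3}
print(is_clique(G, C), is_independent(complement(G), C))
```

Both checks agree for any vertex subset, which is why a maximal independent set routine run on the complement graph also yields maximal cliques.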
Definition and basic concepts Algorithm for maximal independent set
Definition and basic concepts • Algorithm for maximal independent set [Figure: example graph on vertices 1 to 7]
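The pseudocode for this algorithm did not survive in the transcript. As a stand-in, here is a sketch of one common greedy variant, which is an assumption rather than the presenter's exact procedure: repeatedly pick a minimum-degree vertex of the remaining graph, add it to the set, and delete it together with its neighbours. The example graph is hypothetical.

```python
def greedy_maximal_independent_set(G):
    """Greedy maximal independent set (assumed min-degree variant)."""
    remaining = set(G)
    I = set()
    while remaining:
        # Degree is counted inside the remaining induced subgraph;
        # ties are broken by the smaller vertex label.
        v = min(remaining, key=lambda u: (len(G[u] & remaining), u))
        I.add(v)
        remaining -= G[v] | {v}  # delete v and all of its neighbours
    return I

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(greedy_maximal_independent_set(G))
```

The result is always maximal, because a vertex is deleted only when it or one of its neighbours has been chosen, but it is not guaranteed to be maximum.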
Definition and basic concepts • A dominating set is a set of vertices such that every vertex in the graph is either in this set or has a neighbour in this set • Dominating set is minimal if it contains no proper subset which is dominating • Dominating set is a minimum dominating set if it is of the smallest cardinality • Cardinality of a minimum dominating set is called the domination number γ(G) of a graph
Definition and basic concepts • D = {7, 11, 3} is a minimal and minimum dominating set [Figure: example graph on vertices 1 to 12]
Definition and basic concepts • A connected dominating set is one in which the subgraph induced by the dominating set is connected • An independent dominating set is one in which the dominating set is also independent
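The domination definitions translate into short membership tests. The sketch below uses a hypothetical 5-vertex graph and illustrative function names; minimality is tested by removing one vertex at a time, which suffices because every superset of a dominating set is dominating.

```python
from itertools import combinations

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def is_dominating(G, D):
    """Every vertex is in D or has at least one neighbour in D."""
    return all(v in D or bool(G[v] & D) for v in G)

def is_minimal_dominating(G, D):
    """D dominates, and dropping any single vertex breaks domination."""
    return is_dominating(G, D) and not any(is_dominating(G, D - {v}) for v in D)

def is_independent(G, S):
    """No two vertices of S are adjacent."""
    return not any(v in G[u] for u, v in combinations(S, 2))

print(is_dominating(G, {3, 4}), is_minimal_dominating(G, {3, 4}))
print(is_independent(G, {3, 4}), is_independent(G, {3, 5}))
```

Here {3, 4} is a connected dominating set (3 and 4 are adjacent), while {3, 5} is an independent dominating set.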
Network clustering problem • Given a graph G=(V,E), find subsets (not necessarily disjoint) {V1,...,Vr} of V such that V = V1 ∪ … ∪ Vr, where • Each subset is a cluster modelled by structures such as cliques or other distance and diameter-based models • The model used as a cluster represents the cohesiveness required of the cluster
Network clustering problem • The clustering models can be classified • By the constraints on relations between clusters (clusters may be disjoint or overlapping) • By the objective function used to achieve the goal of clustering (minimizing the number of clusters or maximizing the cohesiveness) • When clusters are required to be disjoint • {V1,...,Vr} is a cluster-partition (exclusive clustering) • When clusters are allowed to overlap • {V1,...,Vr} is a cluster-cover (overlapping clustering)
Network clustering problem • Assume that there is a measure of cohesiveness of a cluster that can be varied for a graph G; we define two types of optimization problems: • Type I: Minimize the number of clusters while ensuring that every cluster formed has cohesiveness over a prescribed threshold • Example: The problem of clustering an incomplete graph with cliques used as clusters and the objective of minimizing the number of clusters
Network clustering problem • Type II: Maximize the cohesiveness of each cluster formed, while the number of clusters is K (the last requirement may be relaxed by setting K to infinity) • Example: assume that G has non-negative edge weights w; for a cluster Vi, let Ei denote the edges in the subgraph induced by Vi • Use w as a dissimilarity measure (distance) • For example, w(Ei) = Σ_{e ∈ Ei} w(e) is a meaningful measure of cohesiveness • It can be used to formulate a Type II clustering problem • We will refer to problems as Type I and Type II based on their objective
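The cohesiveness measure w(Ei) is simply the total weight of the edges inside a cluster. A minimal sketch, assuming edge weights are kept in a dictionary keyed by vertex pairs (the data and names are hypothetical):

```python
def cluster_weight(weights, cluster):
    """w(E_i): total weight of the edges with both endpoints inside the cluster."""
    return sum(w for (u, v), w in weights.items() if u in cluster and v in cluster)

# Hypothetical non-negative edge weights, read as dissimilarities.
weights = {(1, 2): 1, (1, 3): 2, (2, 3): 3, (3, 4): 4}

print(cluster_weight(weights, {1, 2, 3}))
print(cluster_weight(weights, {3, 4}))
```

Because w is read as a dissimilarity, a smaller total indicates a more cohesive cluster under this measure.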
Hierarchical clustering • After performing clustering, we can abstract the graph G0=(V0,E0) to a graph G1=(V1,E1) as follows: • There exists a vertex v_i in V1 for every subset (cluster) V_i of G0 • There exists an edge between v_i and v_j in G1 if and only if there exist a vertex x in the cluster V_i and a vertex y in the cluster V_j such that (x,y) is an edge of G0 • In other words: if any two vertices from different clusters have an edge between them in the original graph G0, the clusters containing them are made adjacent in G1
Hierarchical clustering • We can recursively cluster the abstracted graph G1 in a similar fashion to obtain a multilevel hierarchy • This process is called hierarchical clustering • Example • The following subsets form clusters in this graph: C1={7,8,9,10}, C2={1,2,3}, C3={4,6}, C4={11,12}, C5={5} • Given the clusters of the example graph G, we can construct an abstracted graph G’
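One level of abstraction can be sketched as follows. The edges of the slide's example graph are not recoverable from the transcript, so the graph and clusters below are hypothetical; the sketch also assumes the clusters are disjoint.

```python
def abstract_graph(G, clusters):
    """One vertex per cluster; two clusters are adjacent when some edge of G
    joins a vertex of one to a vertex of the other."""
    # Assumes disjoint clusters, so each vertex belongs to exactly one index.
    owner = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    G1 = {i: set() for i in range(len(clusters))}
    for u in G:
        for v in G[u]:
            if owner[u] != owner[v]:
                G1[owner[u]].add(owner[v])
                G1[owner[v]].add(owner[u])
    return G1

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
clusters = [{1, 2, 3}, {4}, {5}]
print(abstract_graph(G, clusters))
```

Applying a clustering routine to the returned graph and abstracting again gives the next level of the hierarchy.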
Clique-based clustering • Natural choice for a highly cohesive cluster • Cliques have • Minimum possible diameter • Maximum connectivity • Maximum possible degree for each vertex • Given an arbitrary graph • Type I approach tries to partition it into (or cover it using) minimum number of cliques • Type II approaches usually work with a weighted complete graph and hence every partition of the vertex set is a clique partition
Clique-based clustering • Minimum clique partitioning • Type I clique partitioning and clique covering problems are both NP-hard [Garey79] • Heuristic approaches are preferred for large graphs • Note that clique-partitioning and clique-covering problems are closely related • Minimum number of clusters produced in clique covering and partitioning are the same
Clique-based clustering • Clique-partitioning and clique-covering problems • p: optimal (minimum) number of cliques in a cover • c: optimal (minimum) number of cliques in a partition • p ≤ c, since every clique partition is also a cover. • c ≤ p, since any vertex v present in multiple clusters can be removed from all but one of them, leaving one less overlap; repeating this until no overlaps remain turns an optimal cover into a clique partition with no more than p clusters. • Hence p = c.
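The overlap-removal argument can be sketched in a few lines. This is an illustrative helper with a hypothetical name and input: each vertex is kept only in the first clique that contains it, and since every subset of a clique is a clique, the result is a clique partition.

```python
def cover_to_partition(cover):
    """Turn a clique cover into a clique partition with no more parts."""
    seen = set()
    partition = []
    for clique in cover:
        part = set(clique) - seen  # drop vertices already assigned elsewhere
        seen |= part
        if part:  # a clique that loses all its vertices is discarded
            partition.append(part)
    return partition

# Hypothetical clique cover with overlaps at vertices 3 and 4.
cover = [{1, 2, 3}, {3, 4}, {4, 5}]
print(cover_to_partition(cover))
```

The number of parts never exceeds the number of cliques in the cover, which is the inequality used in the argument.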
Clique-based clustering Simple heuristic for clique partitioning
Clique-based clustering • Example [Figure: example graph on vertices 1 to 12]
Clique-based clustering • Example
Clique-based clustering • Example
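The heuristic's pseudocode and worked example were figures and are not in the transcript. A sketch of one simple greedy heuristic of this kind, offered as an assumption rather than the slide's exact procedure: repeatedly grow a maximal clique in the remaining graph, record it as a cluster, and remove its vertices. The graph is hypothetical and vertex labels are assumed to be integers (used only for tie-breaking).

```python
def greedy_maximal_clique(G, vertices):
    """Grow a clique inside `vertices`, always adding the candidate with
    the most neighbours among the remaining candidates."""
    clique = set()
    candidates = set(vertices)
    while candidates:
        v = max(candidates, key=lambda u: (len(G[u] & candidates), -u))
        clique.add(v)
        candidates &= G[v]  # keep only vertices adjacent to every chosen vertex
    return clique

def greedy_clique_partition(G):
    """Peel off maximal cliques until no vertices remain."""
    remaining = set(G)
    partition = []
    while remaining:
        C = greedy_maximal_clique(G, remaining)
        partition.append(C)
        remaining -= C
    return partition

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(greedy_clique_partition(G))
```

Each extracted clique is maximal in the remaining graph, but the number of cliques produced is not guaranteed to be minimum.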
Clique-based clustering • Min-Max k-Clustering • A Type II clique partitioning problem with min-max objective • Consider a weighted complete graph G=(V,E) with weights w(e1)≤w(e2)≤…≤w(em), m=n(n−1)/2 • Partition the graph into no more than k cliques s.t. the maximum weight of an edge between two vertices inside a clique is minimized • In other words, if V1,...,Vk is the clique partition, then we wish to minimize max_{i=1,…,k} max_{u,v ∈ V_i} w(u,v) • The weight w(i,j) can be thought of as a measure of dissimilarity • Larger w(i,j) means i and j are more dissimilar • The problem tries to cluster the graph into at most k cliques such that the maximum dissimilarity between any two vertices inside a clique is minimized
Clique-based clustering • Min-Max k-Clustering • Given any graph G=(V,E), the required edge-weighted complete graph can be obtained in different ways using meaningful measures of dissimilarity • The weight w(i,j) could be d(i,j), the shortest-path distance between i and j in G • The weight could also be based on k(i,j) or k’(i,j) (the minimum number of vertices or edges, respectively, that need to be removed from G to disconnect i and j) • Since these are measures of similarity, we could obtain the required weights as w(i,j)=|V|−k(i,j) or w(i,j)=|E|−k’(i,j)
Clique-based clustering • Bottleneck graph • Bottleneck graph of a weighted graph G=(V,E) is defined for a given number c as follows • G(c)=(V, Ec) where Ec={e ∈ E | w(e)≤c} • Bottleneck graph G(c) contains only those edges with weight at most c • Example: Complete weighted graph G and its bottleneck graphs G(1) and G(2) for weights 1 and 2, respectively
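Building G(c) only requires filtering edges by weight. A minimal sketch, assuming weights are stored in a dictionary keyed by vertex pairs; the 4-vertex weighted complete graph is hypothetical, not the one from the slide's figure.

```python
def bottleneck(vertices, weights, c):
    """G(c): keep every vertex, but only the edges of weight at most c."""
    Gc = {v: set() for v in vertices}
    for (u, v), w in weights.items():
        if w <= c:
            Gc[u].add(v)
            Gc[v].add(u)
    return Gc

# Hypothetical complete graph on 4 vertices with weights 1, 2 and 3.
V = [1, 2, 3, 4]
W = {(1, 2): 1, (3, 4): 1, (1, 3): 2, (2, 4): 2, (1, 4): 3, (2, 3): 3}

print(bottleneck(V, W, 1))
print(bottleneck(V, W, 2))
```

Raising c only adds edges, so G(1) is a subgraph of G(2), and G(3) is the full complete graph here.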
Clique-based clustering • Bottleneck heuristic for the min-max k-clustering problem
Clique-based clustering • Procedure bottleneck(…) returns the bottleneck graph G(…) • MIS(…) is an arbitrary procedure for finding a maximal independent set (MIS) in G • This algorithm will be optimal if we manage to find a maximum independent set (one of largest size) in every iteration • However, finding a maximum independent set is NP-hard • To obtain a polynomial-time algorithm, we restrict ourselves to finding a maximal independent set using heuristics such as the greedy approach described earlier
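The heuristic's pseudocode was a figure and is missing from the transcript. The sketch below is one plausible reading, stated as an assumption: try edge weights in increasing order, build the bottleneck graph, find a maximal independent set, and stop at the first threshold where that set has at most k vertices. Its members become cluster-heads, and every other vertex joins an adjacent head. The min-degree greedy MIS, the tie-breaking rules, and the example data are all illustrative choices.

```python
def bottleneck(vertices, weights, c):
    """G(c): keep only the edges of weight at most c."""
    Gc = {v: set() for v in vertices}
    for (u, v), w in weights.items():
        if w <= c:
            Gc[u].add(v)
            Gc[v].add(u)
    return Gc

def maximal_independent_set(G):
    """Greedy stand-in for MIS(...): minimum-degree vertex first."""
    remaining, I = set(G), set()
    while remaining:
        v = min(remaining, key=lambda u: (len(G[u] & remaining), u))
        I.add(v)
        remaining -= G[v] | {v}
    return I

def bottleneck_k_clustering(vertices, weights, k):
    """Return (threshold, clusters) for the first threshold whose bottleneck
    graph has a maximal independent set of size at most k."""
    for c in sorted(set(weights.values())):
        Gc = bottleneck(vertices, weights, c)
        heads = maximal_independent_set(Gc)
        if len(heads) <= k:
            # A maximal independent set dominates Gc, so every non-head
            # has at least one adjacent head; attach it to the smallest one.
            clusters = {h: {h} for h in heads}
            for v in vertices:
                if v not in heads:
                    clusters[min(Gc[v] & heads)].add(v)
            return c, clusters
    raise ValueError("no threshold yields at most k cluster-heads")

# Hypothetical complete graph on 4 vertices.
V = [1, 2, 3, 4]
W = {(1, 2): 1, (3, 4): 1, (1, 3): 2, (2, 4): 2, (1, 4): 3, (2, 3): 3}
print(bottleneck_k_clustering(V, W, 2))
```

In this sketch every vertex is within weight c of its head, but two non-head vertices of the same cluster may be further apart than c, so the returned threshold is not in general the exact min-max value.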
Clique-based clustering • Example • Clustering output of the bottleneck min-max k-clustering algorithm with k=2 for the following graph
Clique-based clustering Result
Center-based clustering • In center-based clustering models, the elements of a cluster are determined based on their similarity with the cluster’s center (or cluster-head) • Center-based clustering algorithms usually consist of two steps • First, an optimization procedure is used to determine the cluster-heads • Second, the cluster-heads are then used to form clusters around them
Center-based clustering • Clustering with dominating sets • Minimum dominating set and related problems provide a modelling tool for centre-based clustering of Type I • The minimum dominating set problem is NP-hard, so heuristic approaches and approximation algorithms are used to find a small dominating set • If D denotes a dominating set, then for each vertex v in D the closed neighbourhood N[v] forms a cluster • By the definition of domination, every vertex not in the dominating set has a neighbour in it and hence is assigned to some cluster
Center-based clustering • Each v in D is called a cluster-head and the number of clusters that result is exactly the size of the dominating set • Minimizing the size of the dominating set minimizes the number of clusters produced, resulting in a Type I clustering problem • This approach results in a cluster cover since the resulting clusters need not be disjoint
Center-based clustering • Each cluster has diameter at most two as every vertex in the cluster is adjacent to its cluster-head and the cluster-head is “similar” to all the other vertices in its cluster • However, neighbours of the cluster-head may be poorly connected among themselves • Some post-processing may be required as a cluster formed in this fashion from an arbitrary dominating set could completely contain another cluster • Clustering with dominating sets is especially suited for clustering protein interaction networks • To reveal groups of proteins that interact through a central protein which could be identified as a cluster-head in this method
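Dominating-set clustering can be sketched in two steps: find a small dominating set, then take closed neighbourhoods as clusters. The greedy rule used here (pick the vertex whose closed neighbourhood covers the most undominated vertices) is a standard set-cover style heuristic chosen for illustration, not one named in the slides; the graph is hypothetical and labels are assumed to be integers for tie-breaking.

```python
def greedy_dominating_set(G):
    """Repeatedly pick the vertex whose closed neighbourhood N[v] covers
    the most not-yet-dominated vertices."""
    undominated = set(G)
    D = set()
    while undominated:
        v = max(G, key=lambda u: (len((G[u] | {u}) & undominated), -u))
        D.add(v)
        undominated -= G[v] | {v}
    return D

def clusters_from_dominating_set(G, D):
    """Each cluster-head v in D yields the cluster N[v]; clusters may overlap."""
    return {v: G[v] | {v} for v in D}

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
D = greedy_dominating_set(G)
print(D)
print(clusters_from_dominating_set(G, D))
```

On this graph the two clusters share vertices 3 and 4, which shows why the result is a cluster cover rather than a partition.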
Center-based clustering • Independent Dominating Sets • Finding a maximal independent set results also in a minimal independent dominating set • Can be used in clustering the graph • Here, no cluster formed can contain another cluster completely, as the cluster-heads are independent and different
Center-based clustering • Greedy algorithm for minimal independent dominating sets • Proceeds by adding a maximum degree vertex to the current independent set and then deleting that vertex along with its neighbours • Greedy because it adds a maximum degree vertex so that a larger number of vertices are removed in each iteration, yielding a small independent dominating set • This is repeated until no more vertices exist
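The greedy procedure described above can be sketched as follows. Two details are assumptions: degree is measured in the remaining graph after deletions, and ties are broken by the smaller integer label. The example graph is hypothetical.

```python
def greedy_independent_dominating_set(G):
    """Add a maximum-degree vertex of the remaining graph to the set, delete
    it together with its neighbours, and repeat until no vertices remain."""
    remaining = set(G)
    heads = set()
    while remaining:
        v = max(remaining, key=lambda u: (len(G[u] & remaining), -u))
        heads.add(v)
        remaining -= G[v] | {v}
    return heads

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
heads = greedy_independent_dominating_set(G)
clusters = {h: G[h] | {h} for h in heads}  # cluster of head h is N[h]
print(heads)
print(clusters)
```

The heads are pairwise non-adjacent, so no cluster-head lies inside another head's cluster and no cluster can contain another completely.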