Mining Coherent Dense Subgraphs across Multiple Biological Networks Vahid Mirjalili CSE 891
Motivation: • Finding patterns across multiple networks, to identify biological modules and predict gene function • Current algorithms are too costly • Developed a novel algorithm: CODENSE • Scalable in the number and size of networks • Adjustable for exact or approximate pattern mining
Clustering can detect meaningful biological modules • e.g. a dense protein interaction sub-network may correspond to a protein complex • Dense co-expression sub-network may represent a co-expression cluster • Biological modules are expected to be active across multiple conditions • One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network • Risk of false positive detection
Aggregated graph: false positive in the aggregated graph • Adding six graphs together and deleting the edges that occur fewer than 3 times gives the resulting summary graph
Solution to the false positives of the summary graph • Frequent sub-graphs • Mine the dense sub-graphs directly in each original network • A sub-graph is frequent if it occurs multiple times in a set of graphs • In biological networks, each gene occurs only once in a graph → no isomorphism problem
Frequent dense sub-graph • A frequent dense sub-graph doesn’t show accurate information • Some edges in the frequent sub-graph shown above do not occur in the original set • It is more meaningful to divide it into two sub-graphs
Coherent Dense Sub-graphs • All edges in a coherent sub-graph should have correlated occurrences in the original graph set • CODENSE reduces the networks to 2 meta-graphs and performs clustering on these two graphs only (instead of on the individual networks) • CODENSE can distinguish the two modules • Good scalability • Discovery of overlapping clusters
Overlapping Sub-graphs • Partition-based clustering algorithms fail to identify overlapping sub-graphs • Mining Overlapping Dense Sub-graphs (MODES)
Application • Identify frequent co-expression clusters across multiple microarray datasets • Microarray dataset: • Un-weighted, undirected graph • Each gene represents a node • Two genes are connected by an edge if they show high expression correlation • A densely connected sub-graph → a tight co-expression cluster • Clusters from a single microarray dataset include spurious links, and may not be homogeneous in function and regulation
Problem Formulation • A relation graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ contains n simple graphs $G_i = (V, E_i)$ • A common vertex set V is shared by all the graphs • Support(G): the number of graphs in the relation graph dataset D that contain G as a sub-graph • A graph is frequent if support(G) > threshold • Summary graph: an un-weighted graph extracted from D, in which an edge exists only if it occurs in more than k graphs in D
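The support and summary-graph definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `summary_graph` and `support`, the edge-set representation, and the toy graphs are assumptions, not part of the original slides.

```python
from collections import Counter

def norm(edge):
    """Store an undirected edge as a sorted tuple so (a, b) equals (b, a)."""
    return tuple(sorted(edge))

def summary_graph(graphs, k):
    """Return the edges that occur in more than k of the input graphs."""
    counts = Counter(e for g in graphs for e in {norm(x) for x in g})
    return {e for e, c in counts.items() if c > k}

def support(subgraph_edges, graphs):
    """Count the graphs that contain every edge of the given sub-graph."""
    wanted = {norm(e) for e in subgraph_edges}
    return sum(wanted <= {norm(x) for x in g} for g in graphs)

# Toy relation graph dataset: three graphs over the shared vertex set {a, b, c, d}.
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
    {("b", "a"), ("b", "c"), ("c", "d")},
]

print(summary_graph(graphs, k=2))                  # edges seen in more than 2 graphs
print(support([("a", "b"), ("b", "c")], graphs))   # graphs containing both edges
```

Because every gene appears once per graph, containment reduces to a set-inclusion check on edges; no sub-graph isomorphism search is needed.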
Problem Formulation • Edge Support Vector: $w(e) = [w_1(e), w_2(e), \ldots, w_n(e)]$, where $w_i(e)$ is the weight of edge e in graph i (for an un-weighted graph it is 0 or 1)
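For un-weighted graphs, the edge support vector is simply a 0/1 presence indicator per graph. A minimal sketch, assuming graphs are stored as sets of edges (the name `edge_support_vector` and the toy data are hypothetical):

```python
def edge_support_vector(edge, graphs):
    """Return one entry per graph: 1 if the edge is present, else 0."""
    e = frozenset(edge)  # undirected: ("a", "b") and ("b", "a") are the same edge
    return [1 if e in {frozenset(x) for x in g} else 0 for g in graphs]

# Toy dataset of three un-weighted graphs (illustrative only).
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
    {("b", "a"), ("b", "c"), ("c", "d")},
]

print(edge_support_vector(("a", "b"), graphs))
print(edge_support_vector(("b", "c"), graphs))
```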
Second-Order Graph S: each node represents an edge from the relation graph dataset (D), and an edge between nodes u and v exists if their edge support vectors w(u) and w(v) are highly correlated • For efficiency, construct S only for a sub-graph of the summary graph
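A rough sketch of the second-order graph construction follows. It assumes the support vectors are already computed and uses a Euclidean-distance cutoff as the similarity test; the function name `second_order_graph`, the `max_dist` parameter, and the example vectors are assumptions made for illustration.

```python
from itertools import combinations
from math import dist

def second_order_graph(support_vectors, max_dist):
    """Link two first-order edges when their support vectors are close.

    support_vectors maps each first-order edge to its support vector.
    Returns the set of second-order edges as pairs of first-order edges.
    """
    return {
        (u, v)
        for u, v in combinations(sorted(support_vectors), 2)
        if dist(support_vectors[u], support_vectors[v]) <= max_dist
    }

# Two edges that always occur together, and one that occurs in the other graphs.
vectors = {
    ("a", "b"): [1, 1, 1, 0, 0, 0],
    ("b", "c"): [1, 1, 1, 0, 0, 0],
    ("c", "d"): [0, 0, 0, 1, 1, 1],
}

print(second_order_graph(vectors, max_dist=1.0))
```

Here only the two co-occurring edges are linked, which is what lets the second-order graph separate modules that the summary graph would merge.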
Coherent Graph: a sub-graph extracted from the summary graph is coherent if • All its edges have support > k • Its second-order graph is dense • Graph Density: $d = \frac{2m}{n(n-1)}$, where m is the number of edges and n is the number of nodes
Two facts: • If a frequent sub-graph is dense, then it must be dense in the summary graph as well, but the converse does not always hold • If a sub-graph is coherent (its edges have high correlation across the dataset), then its second-order sub-graph is dense
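As a small worked example of graph density, the sketch below assumes the standard definition for simple undirected graphs, 2m / (n(n-1)), with m edges and n nodes. The function name `density` and the sample graphs are illustrative.

```python
def density(edges):
    """Density of a simple undirected graph given as an iterable of edges."""
    unique_edges = {frozenset(e) for e in edges}       # ignore edge direction
    nodes = {v for e in unique_edges for v in e}
    n, m = len(nodes), len(unique_edges)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

triangle = [("a", "b"), ("b", "c"), ("a", "c")]   # complete graph on 3 nodes
path = [("a", "b"), ("b", "c")]                   # 2 of the 3 possible edges

print(density(triangle))
print(density(path))
```

Note that this version counts only nodes that appear in at least one edge, so isolated vertices would need to be passed in separately.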
Aggregate the graphs into a summary graph • Eliminate infrequent edges
MODES: Mining Overlapping DEnse Subgraphs • Developed based on HCS (Highly Connected Sub-graphs) • Can efficiently identify dense sub-graphs • Can mine overlapping sub-graphs • Two approaches: • Minimum cut • Normalized cut (Shi, Malik 2000) • Apply the normalized cut in the initial steps of the HCS algorithm, then switch to the minimum cut once the partitions become small
CODENSE analysis • Reduces the identification of coherent dense sub-graphs across n graphs to mining two special graphs: the summary graph and the second-order graph • Can mine network modules • Can mine both exact and approximate patterns (by modifying the similarity threshold) • Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)
Experimental Study: co-expression network • 39 yeast microarray datasets • 6661 genes • Calculate the Pearson correlation (r) between the expression levels • Construct the relation graph (connectivity of two genes is determined by the Pearson correlation) • n: number of measurements
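The step from expression profiles to a co-expression graph can be sketched as below. The slides do not give the exact significance cutoff, so the fixed `cutoff` on r, the function names, and the toy expression values are all assumptions for illustration only.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def coexpression_graph(expr, cutoff):
    """Connect two genes when their correlation is at least the cutoff."""
    return {
        (g, h)
        for g, h in combinations(sorted(expr), 2)
        if pearson(expr[g], expr[h]) >= cutoff
    }

# Toy expression profiles over n = 4 measurements (made-up numbers).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}

print(coexpression_graph(expr, cutoff=0.9))
```

Running this once per microarray dataset would give one un-weighted graph per dataset, i.e. the relation graph dataset used in the later steps.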
Create the summary graph, removing edges that occur fewer than 6 times across the 39 graphs • Apply MODES to the summary graph to identify dense sub-graphs with cutoff density d1 • For each dense sub-graph, construct the second-order graph S • Apply MODES to S to identify sub-graphs with density > d2 • Map the vertices of each dense second-order sub-graph back to edges of the first-order graph, and apply MODES again to identify the dense sub-graphs with density > d3
Functional Module Discovery: MODES vs CODENSE • A cluster is considered functionally homogeneous if: • Its functional homogeneity, modeled by the hypergeometric distribution, is significant at α=0.01 • At least 40% of its member genes belong to a specific Gene Ontology (GO) functional category • MODES identified 366 clusters, but only 151 were functionally homogeneous (42%) • CODENSE identified 770 clusters, 76% of which were homogeneous • The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across all networks
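The two homogeneity criteria can be checked with a short calculation. The sketch below assumes a one-sided hypergeometric tail test (probability of seeing at least x annotated genes in a cluster of size n); the function names, the category size of 200, and the cluster counts are hypothetical. Only the 6661-gene population, α=0.01, and the 40% rule come from the slides.

```python
from math import comb

def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) when drawing n genes from N, of which K carry the annotation."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)

def is_homogeneous(N, K, n, x, alpha=0.01, min_fraction=0.4):
    """Apply both criteria: significant enrichment and at least 40% membership."""
    return hypergeom_pvalue(N, K, n, x) <= alpha and x / n >= min_fraction

# Hypothetical cluster: 10 genes, 5 of them in a category with 200 genes overall.
print(hypergeom_pvalue(6661, 200, 10, 5))
print(is_homogeneous(6661, 200, 10, 5))
```

Exact integer binomials via `math.comb` avoid floating-point underflow in the tail sum; the only rounding happens in the final division.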
Example of a MODES false positive: MODES identified a cluster of 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) that are not functionally homogeneous • Functions shown in the figure: protein biosynthesis, replicative cell aging, mitochondrial electron transfer
Functional prediction: • CODENSE identified this 6-node sub-graph • 5 genes belong to the “protein biosynthesis” category • Prediction: ASC1 is likely involved in protein biosynthesis as well • Test with 448 known genes: 50% accuracy
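The prediction idea above (assign an unannotated gene the dominant category of its cluster) can be illustrated with a minimal sketch. The function name `predict_function`, the `min_fraction` cutoff, and the placeholder gene names `geneA` to `geneE` are assumptions; the slides only name ASC1 and the category.

```python
from collections import Counter

def predict_function(cluster, annotations, min_fraction=0.4):
    """Return the dominant annotated category of a cluster, or None.

    A category is returned only if it covers at least min_fraction of
    all genes in the cluster (annotated or not).
    """
    known = [annotations[g] for g in cluster if g in annotations]
    if not known:
        return None
    category, count = Counter(known).most_common(1)[0]
    return category if count / len(cluster) >= min_fraction else None

# Six-gene cluster: five annotated placeholders plus the unannotated ASC1.
cluster = ["ASC1", "geneA", "geneB", "geneC", "geneD", "geneE"]
annotations = {g: "protein biosynthesis" for g in cluster if g != "ASC1"}

print(predict_function(cluster, annotations))
```

The returned category would then be proposed for the unannotated members of the cluster, here ASC1.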