Graph Layout in Cellular Networks

Presentation Transcript


  1. Graph Layout in Cellular Networks www.cytoscape.org Bioinformatics III

  2. Task: visualize cellular interaction data, e.g. protein interaction data (undirected): nodes are proteins, edges are interactions; metabolic pathways (directed): nodes are substances, edges are reactions; regulatory networks (directed): nodes are transcription factors and regulated proteins, edges are regulatory interactions; co-localization (undirected): nodes are proteins, edges carry co-localization information; homology (undirected/directed): nodes are proteins, edges are sequence similarities (BLAST score).

  3. Visualisation: intuitive approach to understand graphs. Graph-like structures are pervasive: - route maps of airline companies - infrastructure of computer networks - relationships between people who work in the same company - cellular interactions ... One way to understand the information encoded in these graphs is to draw graphical representations of them. Since drawing by hand is tedious and error-prone, it is natural to expect computers to draw graphs automatically, assigning spatial coordinates to nodes and connecting them with edges. Graphs such as flight route maps are not hard to draw, since the precise locations of the nodes (cities) are already given. For other graphs, no such information is available, and the computer must determine where to plot the nodes and how to draw the edges that connect them. http://www.it.usyd.edu.au/~aquigley/3dfade/

  4. Force-directed algorithm for graph layout. Various graph layout algorithms have been developed to solve this visualisation task. In 1984, Peter Eades proposed a graph layout heuristic [A heuristic for graph drawing. Congressus Numerantium, 42:149-160, 1984] known as the ``Spring Embedder'' algorithm. Edges are replaced by springs and vertices are replaced by rings that connect the springs. A layout is found by simulating the dynamics of this physical system. This method, and other methods that use similar simulations to compute the layout, are called ``Force Directed'' algorithms. http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

  5. Force-directed algorithm. The edges can also be modeled as a gravitational (or electrostatic) attraction, while all nodes repel each other like equal electric charges. The system may also simulate unnatural forces that have no direct physical analogy, for example by using a logarithmic distance measure rather than the Euclidean one. http://www.it.usyd.edu.au/~aquigley/3dfade/
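
The force model described above can be sketched in a few lines. This is a minimal illustration, not the Spring Embedder as published: the function name `spring_layout`, the Hooke-type spring with a rest length, the inverse-square repulsion, and all parameter values are assumptions chosen for readability.

```python
import math
import random

def spring_layout(nodes, edges, iterations=200, k_attract=1.0, rest_length=1.0,
                  k_repel=1.0, step=0.1, max_move=0.1, seed=0):
    """Toy 2D force-directed layout (hypothetical parameters, O(V^2) per step)."""
    rng = random.Random(seed)
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Inverse-square repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k_repel / (d * d)
                force[u][0] += f * dx / d
                force[u][1] += f * dy / d
                force[v][0] -= f * dx / d
                force[v][1] -= f * dy / d
        # Springs along edges pull (or push) nodes toward the rest length.
        for u, v in edges:
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k_attract * (d - rest_length)
            force[u][0] += f * dx / d
            force[u][1] += f * dy / d
            force[v][0] -= f * dx / d
            force[v][1] -= f * dy / d
        # Move each node along its net force, capping the displacement.
        for v in nodes:
            fx, fy = force[v]
            mag = math.hypot(fx, fy)
            if mag > 0.0:
                scale = min(step * mag, max_move) / mag
                pos[v][0] += fx * scale
                pos[v][1] += fy * scale
    return pos
```

For two connected nodes the layout settles where the spring force balances the repulsion; `nodes` must be a list so that pairs can be enumerated by slicing.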

  6. Force-directed algorithm. Because of the underlying analogy to a physical system, force-directed graph layout methods tend to meet various aesthetic standards, such as - efficient space filling, - uniform edge length (when equal weights and repulsions are used), - symmetry, and - the capability of rendering the layout process with smooth animation (visual continuity). With these features, force-directed layout has become the ``work horse'' of layout algorithms. It has been successfully adapted to many domains in various implementations. http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html

  7. Scaling. Force-directed layout methods commonly have computational scaling problems. When the graph has more than a few thousand vertices, the running time of the layout computation can become unacceptable. This is because, in each step of the simulation, the repulsive force between each pair of unconnected vertices must be computed, at a cost of $O(0.5\,V^2 - E)$ per step. Here V is the number of vertices and E is the number of edges in the graph. This complexity is hard to escape for general graphs without hierarchical structure. http://www.hpc.unm.edu/~sunls/research/treelayout/node1.html
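
As a quick check of the scaling argument, the number of unconnected vertex pairs can be counted exactly as V(V-1)/2 - E, which grows like the 0.5 V² - E quoted above. The function name and the example graph size below are hypothetical.

```python
def repulsive_pairs(num_vertices, num_edges):
    """Number of unconnected vertex pairs in a simple undirected graph."""
    return num_vertices * (num_vertices - 1) // 2 - num_edges

# Hypothetical sparse graph: 5,000 vertices and 10,000 edges.
pairs_per_step = repulsive_pairs(5000, 10000)
print(pairs_per_step)  # 12487500 pair forces per simulation step
```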

  8. Protein interaction graphs. Most protein interaction data have the following characteristics: (1) when visualized as a graph, the data yield a disconnected graph with many connected components; (2) the data yield a nonplanar graph with a large number of edge crossings that cannot be removed in a 2D drawing; (3) the number of interactions per protein varies widely within the same data set (degree distribution p(k)); (4) the data often contain protein interactions corresponding to self loops. This demands a robust layout algorithm. Ju et al. Bioinformatics 19, 317 (2003)

  9. InterViewer: Example of a force-directed layout algorithm. InterViewer does not place the nodes at random initial positions, but on the surface of a sphere. It runs a fixed number of iterations. The original algorithm has complexity $O(N^2)$ per time step, where N is the number of nodes. Multipole methods can reduce this to $O(N \log N)$. Time may also be saved by introducing a cut-off, e.g. computing interactions only with nodes in neighboring cells, and by updating the neighbor list infrequently. Ju et al. Bioinformatics 19, 317 (2003)

  10. Application for protein interaction graphs. Visualisation of the MIPS interaction data. In 3D, this graph contains no edge crossings. Ju et al. Bioinformatics 19, 317 (2003)

  11. Aim: analyze and visualize homologies across the protein universe. 50 genomes → 145,579 proteins → $21 \times 10^9$ BLASTP pairwise sequence comparisons. Expectation: fusion proteins ("Rosetta Stone proteins") will link proteins of related function. This requires visualizing an extremely large network, so a stepwise scheme is developed.

  12. LGL. (1) Separate the original network into connected sets. (2) Generate coordinates for each node in each connected set (using a force-directed layout algorithm and a recipe for the sequential layout of nodes guided by a minimum spanning tree of the network). (3) Integrate the connected sets into one coordinate system via a funnel process: the connected sets are sorted in descending order by number of vertices. The first connected set is placed at the bottom of a potential funnel; the other sets are placed one at a time on the rim of the funnel and allowed to fall towards the bottom, where they are frozen in space upon collision with the previously placed sets. The following slides concentrate on step (2). Adai et al. J. Mol. Biol. 340, 179 (2004)
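
Step (1) and the sorting used in step (3) can be sketched as follows. This is an illustrative breadth-first search, not LGL's own code; the function name `connected_sets` is an assumption.

```python
from collections import defaultdict, deque

def connected_sets(nodes, edges):
    """Split an undirected graph into connected sets, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    # Descending size: the order in which sets would enter the funnel.
    components.sort(key=len, reverse=True)
    return components
```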

  13. Minimum Spanning Tree. Given: an undirected graph G = (V, E) in which each edge $(u,v) \in E$ has a weight w(u,v) specifying the cost of connecting u and v. Find an acyclic subset $T \subseteq E$ that connects all of the nodes and whose total weight is minimized. Popular algorithms are those of Kruskal and Prim. Both are greedy algorithms that make the best choice available at each step. Greedy strategies do not in general guarantee a globally optimal solution, but for the minimum-spanning-tree problem both algorithms provably find one. [Cormen]

  14. Kruskal’s algorithm. Consider edges in sorted order by weight. The arrow points to the edge under consideration at each step. [Cormen]

  15. Kruskal’s algorithm (II). Running time: $O(E \log V)$ [Cormen]
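
A compact sketch of Kruskal's algorithm follows: edges are visited in order of increasing weight, and an edge is kept only if it joins two different trees. The union-find helper with path halving and the `(weight, u, v)` edge format are implementation choices made here, not taken from the slides.

```python
def kruskal(num_nodes, weighted_edges):
    """Return MST edges of a connected graph; edges are (weight, u, v) tuples
    with nodes numbered 0 .. num_nodes - 1."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two different trees
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree
```

Sorting dominates the cost, which gives the O(E log V) bound quoted above.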

  16. Intuitive description of LGL. Successive iterations of the layout. The MST determines the order of placement of the nodes. The root node can be chosen randomly or based on its centrality in the network (e.g. by minimizing the sum of distances to all other nodes). All other nodes are assigned a level according to their edge-based distance from the root node in the MST. Level-one vertices (red circles) are placed randomly on a sphere around the root node (black circle). The system is allowed to iterate through time, satisfying attractive and repulsive forces, until it is at rest. Level-two nodes (blue circles) are placed randomly on spheres directed away from the current layout. Again, the system is allowed to evolve through time until it is at rest. This process is iterated for the entire graph. Adai et al. J. Mol. Biol. 340, 179 (2004)
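
The level assignment can be illustrated with a breadth-first traversal of the MST from the chosen root. The function names below are hypothetical, and the sketch covers only the ordering of nodes, not the placement on spheres or the force relaxation.

```python
from collections import defaultdict, deque

def mst_levels(tree_edges, root):
    """Edge-based distance of every node from the root within the tree."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level

def placement_order(level):
    """Group nodes into batches that would be laid out one level at a time."""
    batches = defaultdict(list)
    for node, lvl in level.items():
        batches[lvl].append(node)
    return [batches[lvl] for lvl in sorted(batches)]
```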

  17. What is the role of fusion proteins? A protein homology map summarizes the results of billions of sequence comparisons by modeling the proteins as vertices in a network, and the statistically significant sequence similarities as edges connecting the relevant proteins. In this manner, proteins within a sequence family (such as A, A′, A″, and AB; or B, B′ and AB) are all or mostly connected to each other, forming a cluster in the map. Fusion proteins (such as AB) serve to connect their component proteins' families. The structure of the resulting map reflects historic genetic events, such as gene fusions, fissions, and duplications, which are responsible for producing the modern-day genes. The map simultaneously represents homology relationships (edges), remote homologies (proteins not directly connected but in the same cluster), and non-homologous functional relationships (adjacent clusters and clusters linked by fusion proteins). Adai et al. J. Mol. Biol. 340, 179 (2004)

  18. LGL Algorithm for very large biological networks. The complete protein homology map. A layout of the entire protein homology map; a total of 11,516 connected sets containing 111,604 proteins (vertices) with 1,912,684 edges. The largest connected set is shown more clearly in the inset and is enlarged further on the right side. Adai et al. J. Mol. Biol. 340, 179 (2004)

  19. Map of gene function emerges from ~21 billion gene sequence comparisons. Proteins are drawn as points, with lines connecting proteins with similar sequences, and are arranged so that homologous proteins are adjacent in the Figure. The size of each cluster is proportional to the number of proteins in that sequence family. Fusion proteins force their component proteins' respective families to be close together in the Figure, and thereby serve to organize the proteins in the map according to their functions. The resulting broad trends of protein function are labeled, as are several of the most extensive sequence families. A–C indicate specific regions that are magnified later. Only the greatest connected network component is drawn, containing 30,727 proteins (vertices) and 1,206,654 significant sequence similarities (edges), and representing ~4 billion sequence comparisons. Adai et al. J. Mol. Biol. 340, 179 (2004)

  20. Functionally related gene families form adjacent clusters. Three examples illustrate the spatial localization of protein function in the map: A, the linkage of the tryptophan synthase α family to the functionally coupled but non-homologous β family by the yeast tryptophan synthase αβ fusion protein; B, protein subunits of the pyruvate synthase and alpha-ketoglutarate ferredoxin oxidoreductase complexes; C, metabolic enzymes, particularly those of acetyl CoA and amino acid metabolism. Adai et al. J. Mol. Biol. 340, 179 (2004)

  21. Colocalization. Neighboring proteins tend to be in the same cellular system. The tendency for proteins to operate in the same cellular system, defined as the percentage of matching assignments into the 18 COG database pathways, is plotted against the spatial separation in multiples of a typical cluster size. The functional similarity decays exponentially with distance, proportional to $e^{-0.26\,d}$, where d is the separation measured in typical cluster diameters. Adai et al. J. Mol. Biol. 340, 179 (2004)

  22. Comparison with other layout maps. A comparison of LGL with map layouts produced by other algorithms. The layout of the protein homology map by LGL (A) is contrasted with the layout of the same network by the spring-force algorithm only, lacking the minimal spanning tree calculation and iterative layout procedure (B), and with the layout by the approach of InterViewer (C). InterViewer collapses equivalent nodes into single nodes, thereby simplifying the graph, and is one of the few available graph layout programs that scale to such large networks. The layout from LGL reveals more of the internal graph structure than the other approaches tested. Adai et al. J. Mol. Biol. 340, 179 (2004)

  23. Modularity in molecular networks? A functional module is, by definition, a discrete entity whose function is separable from those of other modules. This separation depends on chemical isolation, which can originate from spatial localization or from chemical specificity. E.g. a ribosome concentrates the reactions involved in making a polypeptide into a single particle, thus spatially isolating its function. A signal transduction system is an extended module that achieves its isolation through the specificity of the initial binding of the chemical signal to receptor proteins, and of the interactions between signalling proteins within the cell. Hartwell et al. Nature 402, C47 (1999)

  24. Modularity in molecular networks. Modules can be insulated from or connected to each other. Insulation allows the cell to carry out many diverse reactions without cross-talk that would harm the cell. Connectivity allows one function to influence another. The higher-level properties of cells, such as their ability to integrate information from multiple sources, will be described by the pattern of connections among their functional modules. Hartwell et al. Nature 402, C47 (1999)

  25. Organization of large-scale molecular networks • Organization of molecular networks revealed by large-scale experiments: • power-law (or similar) distribution of the node degree k (i.e. the number of edges of a node), $P(k) \sim k^{-\gamma}$ • small-world property (i.e. a high clustering coefficient and a small shortest-path length between pairs of nodes) • anticorrelation in the node degree of connected nodes (i.e. highly interacting nodes tend to be connected to low-interacting ones) • These properties become evident when hundreds or thousands of molecules and their interactions are studied together. • At the other end of the spectrum: recently discovered motifs that consist of 3-4 nodes.

  26. Mesoscale properties of networks. Most relevant processes in biological networks correspond to the mesoscale (5-25 genes or proteins), not to the entire network. However, studying the mesoscale properties of biological networks is computationally very expensive: e.g. a network of 1000 nodes contains on the order of $10^{23}$ possible 10-node sets. Spirin & Mirny analyzed a combined network of protein interactions with data from CELLZOME, MIPS, and BIND: 6500 interactions.
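
The order of magnitude of this count can be checked directly with a binomial coefficient. This is only an order-of-magnitude check of the number of 10-node subsets; it assumes Python 3.8 or later for `math.comb`.

```python
import math

# Number of distinct 10-node subsets of a 1000-node network.
ten_node_sets = math.comb(1000, 10)
print(f"{ten_node_sets:.2e}")
```

Exhaustive enumeration at this scale is out of reach, which motivates the search heuristics described on the following slides.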

  27. Identify connected subgraphs. The network of protein interactions is typically presented as an undirected graph with proteins as nodes and protein interactions as undirected edges. Aim: identify highly connected subgraphs (clusters) that have more interactions within themselves and fewer with the rest of the graph. A fully connected subgraph, or clique, that is not part of any other clique is an example of such a cluster. In general, clusters need not be fully connected. Measure the density of connections by $Q = 2m / (n(n-1))$, where n is the number of proteins in the cluster and m is the number of interactions between them; Q = 1 for a clique. Spirin, Mirny, PNAS 100, 12123 (2003)
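
A small helper makes the density measure concrete. It assumes the definition Q = 2m / (n(n-1)), i.e. the fraction of possible pairs that actually interact, and an edge list in which each undirected interaction appears once; the function name is hypothetical.

```python
def cluster_density(cluster, edges):
    """Q = 2m / (n(n-1)) for the subgraph induced by `cluster`."""
    members = set(cluster)
    n = len(members)
    if n < 2:
        return 0.0
    # Count interactions with both partners inside the cluster (no self loops).
    m = sum(1 for u, v in edges if u != v and u in members and v in members)
    return 2 * m / (n * (n - 1))
```

A triangle gives Q = 1, while a three-node path gives Q = 2/3.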

  28. (method I) Identify all fully connected subgraphs (cliques). Generally, finding all cliques of a graph is an NP-hard problem. Because the protein interaction graph is so far very sparse (the number of interactions (edges) is similar to the number of proteins (nodes)), this can be done quickly. To find cliques of size n one needs to enumerate only the cliques of size n-1. The search for cliques starts with n = 4: pick all (known) pairs of edges (6500 × 6500 protein interactions) successively. For every pair A-B and C-D, check whether there are edges between A and C, A and D, B and C, and B and D. If these edges are present, ABCD is a clique. For every clique ABCD identified, pick all known proteins successively. For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known, then ABCDE is a clique of size 5. Continue for n = 6, 7, ... The largest clique found in the protein-interaction network has size 14. Spirin, Mirny, PNAS 100, 12123 (2003)
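
The extension step (growing cliques of size n-1 into cliques of size n by testing every protein) can be sketched as follows. As a simplification, this sketch starts from single edges (cliques of size 2) rather than from pairs of edges, and all function names are hypothetical.

```python
def build_adjacency(edges):
    """Neighbor sets for an undirected edge list; self loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def extend_cliques(cliques, adj):
    """Grow each clique by every node that interacts with all its members."""
    bigger = set()
    for clique in cliques:
        for node in adj:
            if node not in clique and clique <= adj[node]:
                bigger.add(clique | {node})
    return bigger

def enumerate_cliques(edges, max_size):
    """Map clique size -> set of cliques (frozensets), up to max_size."""
    adj = build_adjacency(edges)
    by_size = {2: {frozenset((u, v)) for u in adj for v in adj[u]}}
    size = 2
    while by_size[size] and size < max_size:
        by_size[size + 1] = extend_cliques(by_size[size], adj)
        size += 1
    return by_size
```

This is only practical because the interaction graph is sparse; on dense graphs the number of intermediate cliques grows very quickly.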

  29. (I) Identify all fully connected subgraphs (cliques). These results, however, include many redundant cliques. For example, the clique of size 14 contains 14 cliques of size 13. To find all nonredundant subgraphs, mark all proteins belonging to the clique of size 14, and out of all subgraphs of size 13 pick those that contain at least one unmarked protein. After all redundant cliques of size 13 are removed, proceed to remove redundant cliques of size 12, etc. In total, only 41 nonredundant cliques with sizes 4 - 14 were found. Spirin, Mirny, PNAS 100, 12123 (2003)
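
One way to sketch the redundancy filter is to process cliques from largest to smallest and drop any clique contained in one that has already been kept. Note that this subset test is an interpretation of "redundant" assumed here; it is not necessarily identical to the marking procedure described above.

```python
def nonredundant_cliques(cliques):
    """Keep only cliques that are not contained in a larger kept clique."""
    kept = []
    for clique in sorted(cliques, key=len, reverse=True):
        if not any(clique <= bigger for bigger in kept):
            kept.append(clique)
    return kept
```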

  30. (method II) Superparamagnetic Clustering (SPC). SPC uses an analogy to the physical properties of an inhomogeneous ferromagnetic model to find tightly connected clusters on a large graph. Every node of the graph is assigned a Potts spin variable $S_i = 1, 2, ..., q$. The value of this spin variable $S_i$ undergoes thermal fluctuations, which are determined by the temperature T and the spin values on the neighboring nodes. Energetically, two nodes connected by an edge are favored to have the same spin value. Therefore, the spin at each node tends to align itself with the majority of its neighbors. When such a Potts spin system reaches equilibrium for a given temperature T, a high correlation between the fluctuating $S_i$ and $S_j$ at nodes i and j indicates that nodes i and j belong to the same cluster. Spirin, Mirny, PNAS 100, 12123 (2003)

  31. (II) Superparamagnetic Clustering (SPC). The protein-interaction network is represented by a graph in which every pair of interacting proteins is an edge of length 1. The simulations are run for temperatures ranging from 0 to 1 in units of the coupling strength. The network splits into monomers at temperatures between 0.7 and 0.8, whereas larger clusters only exist at temperatures between 0.1 and 0.7. Clusters are recorded at all temperature values. Overlapping clusters are then merged and redundant ones are removed. Spirin, Mirny, PNAS 100, 12123 (2003)

  32. (method III) Monte Carlo Simulation. Use MC to find a tight subgraph with a predetermined number of nodes M. At time t = 0, a random set of M nodes is selected. For each pair of nodes i, j from this set, the shortest path $L_{ij}$ between i and j on the graph is calculated. Denote the sum of all shortest paths $L_{ij}$ within this set as $L_0$. At every time step, one of the M nodes is picked at random, and one node is picked at random out of all its neighbors. The new sum of all shortest paths, $L_1$, is calculated for the case that the original node is replaced by this neighbor. If $L_1 < L_0$, accept the replacement with probability 1. If $L_1 > L_0$, accept the replacement with probability $\exp(-(L_1 - L_0)/T)$, where T is the effective temperature. Spirin, Mirny, PNAS 100, 12123 (2003)

  33. (III) Monte Carlo Simulation. Every tenth time step an attempt is made to replace one of the nodes from the current set with a node that has no edges to the current set to avoid getting caught in an isolated disconnected subgraph. This process is repeated (i) until the original set converges to a complete subgraph, or (ii) for a predetermined number of steps, after which the tightest subgraph (the subgraph corresponding to the smallest L0) is recorded. The recorded clusters are merged and redundant clusters are removed. Spirin, Mirny, PNAS 100, 12123 (2003)
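
The Monte Carlo search can be sketched as below. This is a simplified illustration under several assumptions: uphill moves are accepted with the Metropolis probability exp(-(L1 - L0)/T); only neighbors outside the current set are proposed, so the set size stays at M; unreachable pairs are penalized with a distance equal to the number of nodes; node labels are sortable; and the escape move attempted every tenth step is omitted. All names are hypothetical.

```python
import math
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_sum(adj, nodes, unreachable):
    """Sum of shortest paths over all pairs in `nodes`."""
    nodes = list(nodes)
    total = 0
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        for v in nodes[i + 1:]:
            total += dist.get(v, unreachable)
    return total

def mc_tight_subgraph(adj, m, temperature, steps=2000, seed=0):
    """Search for an m-node set with a small sum of pairwise shortest paths."""
    rng = random.Random(seed)
    unreachable = len(adj)  # assumed penalty for disconnected pairs
    current = set(rng.sample(sorted(adj), m))
    l0 = path_sum(adj, current, unreachable)
    best, best_l = set(current), l0
    for _ in range(steps):
        node = rng.choice(sorted(current))
        candidates = [w for w in sorted(adj[node]) if w not in current]
        if not candidates:
            continue
        proposal = (current - {node}) | {rng.choice(candidates)}
        l1 = path_sum(adj, proposal, unreachable)
        # Downhill moves are always accepted, uphill moves with Metropolis odds.
        if l1 <= l0 or rng.random() < math.exp(-(l1 - l0) / temperature):
            current, l0 = proposal, l1
            if l0 < best_l:
                best, best_l = set(current), l0
    return best, best_l
```

For a set of M nodes, the smallest possible sum is M(M-1)/2, reached exactly when the set is a clique, so that value can serve as a stopping criterion.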

  34. Optimal temperature in MC simulation. For every cluster size there is an optimal temperature that gives the fastest convergence to the tightest subgraph. Figure: time to find a clique of size 7, in MC steps per site, as a function of temperature T. The region with optimal temperature is shown in the inset. The required time increases sharply as the temperature goes to 0, but has a relatively wide plateau in the region 3 < T < 7. Simulations suggest that the choice of temperature $T \approx M$ would be safe for any cluster size M. Spirin, Mirny, PNAS 100, 12123 (2003)

  35. Comparison of SPC and Monte Carlo methods. Comparison of clusters found with SPC (blue) and MC simulation (red). There is reasonable overlap (about one third of all clusters are found by both methods), but the two methods appear to be complementary. Spirin, Mirny, PNAS 100, 12123 (2003)

  36. Comparison of SPC and Monte Carlo methods. The SPC method is best at detecting clusters with high Q values and relatively few links to the outside world. An example is the TRAPP complex, a fully connected clique of size 10 with just 7 links to outside proteins. This cluster was perfectly detected by SPC, whereas the MC simulation found only smaller pieces of this cluster separately rather than the whole cluster. By contrast, MC simulations are better suited for finding very "outgoing" cliques. The Lsm complex, a clique of size 11, includes 3 proteins with more interactions outside the complex than inside. This complex was easily found by MC, but was not detected as a stand-alone cluster by SPC. Spirin, Mirny, PNAS 100, 12123 (2003)

  37. Merging Overlapping Clusters. A simple statistical test shows that nodes which have only one link to a cluster are statistically insignificant. Remove such statistically insignificant members first. Then merge overlapping clusters: for every cluster $A_i$, find all clusters $A_k$ that overlap with this cluster by at least one protein. For every such cluster, calculate the Q value of a possible merged cluster $A_i \cup A_k$. Record the cluster $A_{best(i)}$ which gives the highest Q value if merged with $A_i$. After the best match is found for every cluster, every cluster $A_i$ is replaced by the merged cluster $A_i \cup A_{best(i)}$, unless the Q value of $A_i \cup A_{best(i)}$ is below a certain threshold value $Q_C$. This process continues until there are no more overlapping clusters or until merging any of the remaining clusters would produce a cluster with a Q value lower than $Q_C$. Spirin, Mirny, PNAS 100, 12123 (2003)
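
A simplified variant of the merging loop is sketched below. Instead of finding the best partner for every cluster simultaneously, it repeatedly merges the single overlapping pair whose union has the highest Q, as long as that Q is at least the threshold. This greedy pairwise scheme, the density definition Q = 2m / (n(n-1)), and all names are assumptions made for illustration; the removal of insignificant members is not included.

```python
def density(members, edges):
    """Q = 2m / (n(n-1)) for the subgraph induced by `members`."""
    n = len(members)
    if n < 2:
        return 0.0
    m = sum(1 for u, v in edges if u != v and u in members and v in members)
    return 2 * m / (n * (n - 1))

def merge_overlapping(clusters, edges, q_c):
    """Greedily merge overlapping clusters while the union keeps Q >= q_c."""
    clusters = [set(c) for c in clusters]
    while True:
        best = None
        for i in range(len(clusters)):
            for k in range(i + 1, len(clusters)):
                if clusters[i] & clusters[k]:
                    q = density(clusters[i] | clusters[k], edges)
                    if q >= q_c and (best is None or q > best[0]):
                        best = (q, i, k)
        if best is None:
            return clusters
        _, i, k = best
        merged = clusters[i] | clusters[k]
        clusters = [c for j, c in enumerate(clusters) if j not in (i, k)]
        clusters.append(merged)
```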

  38. Statistical significance of complexes and modules. Number of complete cliques (Q = 1) as a function of clique size, enumerated in the network of protein interactions (red) and in randomly rewired graphs (blue, averaged over >1,000 graphs in which the number of interactions for each protein is preserved). Inset shows the same plot in log-normal scale. Note the dramatic enrichment in the number of cliques in the protein-interaction graph compared with the random graphs. Most of these cliques are parts of bigger complexes and modules. Spirin, Mirny, PNAS 100, 12123 (2003)

  39. Statistical significance of complexes and modules. Distribution of Q of clusters found by the MC search method. Red bars: original network of protein interactions. Blue curves: randomly rewired graphs. Clusters in the protein network have many more interactions than their counterparts in the random graphs. Spirin, Mirny, PNAS 100, 12123 (2003)

  40. Architecture of protein network. Fragment of the protein network. Nodes and interactions in discovered clusters are shown in bold. Nodes are colored by functional categories in MIPS: red, transcription regulation; blue, cell-cycle/cell-fate control; green, RNA processing; and yellow, protein transport. Complexes shown are the SAGA/TFIID complex (red), the anaphase-promoting complex (blue), and the TRAPP complex (yellow). Spirin, Mirny, PNAS 100, 12123 (2003)

  41. Discovered functional modules. Examples of discovered functional modules. (A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29). Although they have many interactions, these proteins are not present in the cell at the same time. (B) Pheromone signal transduction pathway in the network of protein–protein interactions. This module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-activated protein kinase kinase) kinases, as well as other proteins involved in signal transduction. These proteins do not form a single complex; rather, they interact in a specific order. Spirin, Mirny, PNAS 100, 12123 (2003)

  42. Architecture of protein network. Comparison of discovered complexes and modules with complexes derived experimentally (BIND and Cellzome) and complexes catalogued in MIPS. Discovered complexes are sorted by the overlap with the best-matching experimental complex. The overlap is defined as the number of common proteins divided by the number of proteins in the best-matching experimental complex. The first 31 complexes match exactly, and another 11 have overlap above 65%. Inset shows the overlap as a function of the size of the discovered complex. Note that discovered complexes of all sizes match very well with known experimental complexes. Discovered complexes that do not match with experimental ones constitute our predictions. Spirin, Mirny, PNAS 100, 12123 (2003)
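
The overlap measure can be written as a short function. Taking the "best-matching" experimental complex to be the one that maximizes the overlap ratio is an assumption of this sketch, as are the function name and the example protein labels.

```python
def best_overlap(discovered, experimental_complexes):
    """Largest (common proteins / experimental complex size) over all complexes."""
    discovered = set(discovered)
    return max(
        len(discovered & set(complex_)) / len(complex_)
        for complex_ in experimental_complexes
        if complex_
    )

# Hypothetical example: three of the four proteins of one known complex are found.
score = best_overlap({"p1", "p2", "p3"}, [{"p1", "p2", "p3", "p4"}, {"p1", "x"}])
print(score)  # 0.75
```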

  43. Robustness of clusters found. To model the effect of false positives in the experimental data, randomly reconnect, remove or add 10-50% of the interactions in the network. Figure: cluster recovery probability as a function of the fraction of altered links. Black curves correspond to the case in which a fraction of links is rewired; red, removed; green, added. Circles represent the probability of recovering 75% of the original cluster; triangles represent the probability of recovering 50%. Noise in the form of removal or addition of links has a less deteriorating effect than random rewiring. About 75% of clusters can still be found when 10% of links are rewired. Spirin, Mirny, PNAS 100, 12123 (2003)

  44. Summary. The analysis of meso-scale properties demonstrated the presence of highly connected clusters of proteins in a network of protein interactions. This strongly supports the suggested modular architecture of biological networks. Two types of clusters are distinguished: protein complexes and dynamic functional modules. Both complexes and modules have more interactions among their members than with the rest of the network. Dynamic modules are elusive to experimental purification because they are not assembled as a complex at any single point in time. Computational analysis allows the detection of such modules by integrating pairwise molecular interactions that occur at different times and places. However, computational analysis alone does not make it possible to distinguish between complexes and modules or between transient and simultaneous interactions.

  45. Summary. Most of the discovered complexes and modules come from traditional studies rather than from large-scale experiments. This suggests that although large-scale proteomic studies provide a wealth of protein interaction data, the scarcity of the data (and its contamination with false positives) makes such studies less valuable for the identification of functional modules.