1 Scale-free networks: mathematical properties
Random graphs: a classical field of graph theory, well studied both analytically and numerically. Scale-free networks: quite new; so far their properties have mostly been studied numerically and heuristically. A good review (today's suggested reading): Graph Theory Approaches to Protein Interaction Data Analysis, Nataša Pržulj. Bioinformatics III
Erdős-Rényi model
n nodes (vertices) are joined by edges that are chosen and placed between pairs of nodes uniformly at random. $G_{n,p}$: each possible edge in the graph on n nodes is present with probability p and absent with probability 1 - p. Average number of edges in $G_{n,p}$: $\langle m \rangle = p \, n(n-1)/2$. Each edge connects two vertices, so the average degree of a vertex is $\langle k \rangle = 2 \langle m \rangle / n = p\,(n-1) \approx pn$.
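The two averages can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names and the values n = 200, p = 0.05 are assumptions chosen for illustration. It draws one sample of G(n,p) and compares the observed edge count and mean degree with the expected values p n(n-1)/2 and p (n-1).

```python
import random
from itertools import combinations


def gnp_edges(n, p, seed=None):
    """Sample G(n,p): keep each of the n(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def expected_edges(n, p):
    """Expected number of edges in G(n,p)."""
    return p * n * (n - 1) / 2


def expected_degree(n, p):
    """Expected degree of a vertex in G(n,p)."""
    return p * (n - 1)


n, p = 200, 0.05  # illustrative values, not from the slides
edges = gnp_edges(n, p, seed=1)
mean_degree = 2 * len(edges) / n  # every edge contributes to two vertex degrees

print("edges:", len(edges), "expected:", expected_edges(n, p))
print("mean degree:", mean_degree, "expected:", expected_degree(n, p))
```

A single sample fluctuates around the expected values; averaging over several seeds brings the observed numbers closer to them.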
Erdős-Rényi model: components
Erdős and Rényi studied how the expected topology of a random graph with n nodes changes as a function of the number of edges m. When m is small, the graph is likely fragmented into many small connected components having vertex sets of size at most O(log n). As m increases, the components grow at first by linking to isolated nodes, and later by fusing with other components. A transition happens at m = n/2, when many clusters cross-link spontaneously to form a unique largest component called the giant component. Its vertex set is much larger than the vertex sets of any other components: it contains O(n) nodes, while the second largest component contains O(log n) nodes. In statistical physics, this phenomenon is called percolation.
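The transition at m = n/2 can be observed in a small simulation. This is a hedged sketch rather than anything from the original slides: it adds m distinct random edges to n isolated nodes and tracks components with a union-find structure. The function name, the graph size n = 2000 and the chosen ratios m/n are illustrative assumptions.

```python
import random


def largest_component_fraction(n, m, seed=None):
    """Add m distinct random edges to n isolated nodes and return the
    size of the largest connected component as a fraction of n."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)  # two distinct endpoints
        edge = (min(u, v), max(u, v))
        if edge in edges:
            continue  # skip duplicate edges
        edges.add(edge)
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru  # merge the smaller component into the larger
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n


n = 2000  # illustrative size
for ratio in (0.25, 0.5, 1.0):
    m = int(ratio * n)
    print("m/n =", ratio, "largest component fraction =",
          largest_component_fraction(n, m, seed=3))
```

Below m = n/2 the largest component should cover only a small fraction of the nodes, while well above it a single component should hold most of them.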
Erdős-Rényi model: shortest path length
• The shortest path length between any pair of nodes in the giant component grows like log n.
• Therefore, these graphs are called "small worlds".
• The properties of random graphs have been studied very extensively.
• Literature: B. Bollobás. Random Graphs. Academic, London, 1985.
• However, random graphs are not adequate models for real-world networks because
(1) real networks appear to have a power-law degree distribution (while random graphs have a Poisson distribution), and
(2) real networks have strong clustering, while the clustering coefficient of a random graph is C = p, independent of whether two vertices have a common neighbor.
Generalized Random Graphs
• Aim: allow a power-law degree distribution in a graph while leaving all other aspects as in the random graph model.
• Given a degree sequence (e.g. drawn from a power-law distribution), one can generate a random graph by assigning to each vertex i a degree $k_i$ from the given degree sequence. Then choose pairs of vertices uniformly at random to make edges so that the assigned degrees are preserved. When all degrees have been used up to make edges, the resulting graph is a random member of the set of graphs with the desired degree distribution.
Problem: the method does not allow a clustering coefficient to be specified. On the other hand, this property makes it possible to determine exactly many properties of these graphs in the limit of large n.
• E.g. almost all random graphs with a fixed degree distribution and no nodes of degree smaller than 2 have a unique giant component.
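One common way to realize this construction is "stub matching": every vertex i receives $k_i$ half-edges, and the half-edges are paired at random. The sketch below is an assumption about how the procedure could be implemented, not the slides' own code; the function name and the toy degree sequence are hypothetical. Note that this simple variant may produce self-loops and multiple edges, which a stricter implementation would reject or resample.

```python
import random
from collections import Counter


def configuration_model(degrees, seed=None):
    """Pair up half-edges ("stubs") uniformly at random.

    degrees[i] is the prescribed degree of vertex i. Returns a list of
    edges; self-loops and multi-edges are possible in this simple sketch.
    """
    if sum(degrees) % 2:
        raise ValueError("sum of degrees must be even")
    rng = random.Random(seed)
    # Vertex i appears degrees[i] times in the stub list.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list form an edge.
    return list(zip(stubs[0::2], stubs[1::2]))


degrees = [3, 3, 2, 2, 1, 1]  # hypothetical toy degree sequence
edges = configuration_model(degrees, seed=0)

# Check that every vertex ends up with its prescribed degree
# (a self-loop counts twice for its vertex).
realized = Counter()
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
print(edges)
print([realized[v] for v in range(len(degrees))])
```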
Barabási scale-free model
Input: $(n_0, m, t)$, where $n_0$ is the initial number of vertices, m ($m \le n_0$) is the number of edges added every time one new vertex is added to the graph, and t is the number of iterations.
Algorithm
a) Start with $n_0$ isolated nodes.
b) Every time we add one new node v, m edges are linked from v to existing nodes with the preferential attachment probability $\Pi(k_i) = k_i / \sum_j k_j$, where $k_i$ is the number of links at the i-th node.
Eventually, the graph has $(n_0 + t)$ nodes and $(mt)$ edges.
Problem of "pure" mathematicians with this algorithm: how to start from $n_0 = 0$?
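The algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' reference implementation. Because all degrees are zero at the start, the attachment probability k_i / sum_j k_j is undefined for the first new node, so this sketch picks its m targets uniformly (an assumption). Later targets are drawn in proportion to degree, and duplicates are rejected so that each new node gets m distinct neighbors. The function name and parameter values are hypothetical.

```python
import random


def barabasi_albert(n0, m, t, seed=None):
    """Grow a graph from n0 isolated nodes by adding t nodes with m edges each."""
    if not 1 <= m <= n0:
        raise ValueError("need 1 <= m <= n0")
    rng = random.Random(seed)
    edges = []
    pool = []  # node i appears k_i times, so rng.choice(pool) is degree-proportional
    for step in range(t):
        v = n0 + step  # label of the new node
        if not pool:
            # Assumption: all degrees are zero, so choose targets uniformly.
            targets = set(rng.sample(range(n0), m))
        else:
            targets = set()
            while len(targets) < m:  # reject repeats to keep targets distinct
                targets.add(rng.choice(pool))
        for u in sorted(targets):
            edges.append((v, u))
            pool.extend((v, u))  # both endpoints gain one link
    return edges


edges = barabasi_albert(n0=5, m=3, t=100, seed=42)  # illustrative parameters
print("nodes:", 5 + 100, "edges:", len(edges))
```

With this starting rule, initial nodes that are not chosen in the first step keep degree zero and can never be selected later, which is one face of the initialization problem mentioned on the slide.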
Properties of Barabási-Albert scale-free model
$P(k) \sim k^{-\gamma}$ with $\gamma = 3$. Real networks often show $\gamma \approx 2.1 - 2.4$.
Observation: if either growth or preferential attachment is eliminated, the resulting network does not exhibit scale-free properties.
The average path length in the BA-model is proportional to $\ln n / \ln \ln n$, which is shorter than in random graphs → scale-free networks are ultrasmall worlds.
Observation: non-trivial correlations (clustering) between the degrees of connected nodes. Numerical result for the BA-model: $C \sim n^{-0.75}$. No analytical predictions of C so far.
Properties of scale-free models
Scale-free networks are resistant to random failures ("robustness") because a few high-degree hubs dominate their topology: an arbitrary node that fails probably has a small degree, and thus does not severely affect the rest of the network. However, scale-free networks are quite vulnerable to attacks on the hubs. See the example from the last lecture about the lethality of gene deletions in yeast. These properties have been confirmed numerically and analytically by studying the average path length and the size of the giant component.
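The contrast between random failures and hub attacks can be illustrated numerically. The sketch below is not from the original slides. It grows a small preferential-attachment graph (2000 nodes, 2 links per new node, both assumed values), removes 10% of the nodes either at random or in order of decreasing degree, and reports the largest surviving component as a fraction of the original node count. All names and parameters are illustrative.

```python
import random
from collections import deque


def ba_graph(n, m, rng):
    """Small preferential-attachment graph as an adjacency dict (sketch)."""
    adj = {i: set() for i in range(n)}
    pool = []  # node i appears once per incident edge
    for v in range(m, n):
        if not pool:
            targets = set(range(m))  # first new node links to all m seed nodes
        else:
            targets = set()
            while len(targets) < m:
                targets.add(rng.choice(pool))
        for u in sorted(targets):
            adj[v].add(u)
            adj[u].add(v)
            pool.extend((u, v))
    return adj


def giant_fraction(adj, removed):
    """Largest component among surviving nodes, as a fraction of all nodes."""
    seen = set(removed)  # removed nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(adj)


rng = random.Random(7)
adj = ba_graph(2000, 2, rng)
n_remove = len(adj) // 10  # remove 10% of the nodes

random_failure = set(rng.sample(sorted(adj), n_remove))
hub_attack = set(sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:n_remove])

after_failure = giant_fraction(adj, random_failure)
after_attack = giant_fraction(adj, hub_attack)
print("random failure:", after_failure, "hub attack:", after_attack)
```

In runs of this kind the giant component is expected to shrink only slightly under random failure and much more strongly under the hub attack.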
Properties of Barabási-Albert scale-free model
• The BA-model is a minimal model that captures the mechanisms responsible for the power-law degree distribution observed in real networks.
• A discrepancy is the fixed exponent of the predicted power-law distribution ($\gamma = 3$).
• Does the BA-model describe the "true" biological evolution of networks?
Recent efforts:
• study variants with cleaner mathematical properties (Bollobás, LCD-model),
• include effects of adding or re-wiring edges,
• allow nodes to age so that they can no longer accept new edges,
• or vary the form of preferential attachment.
These models also predict exponential and truncated power-law degree distributions in some parameter regimes.
2 Scale-free behavior in protein domain networks
• "Domains" are fundamental units of protein structure.
• Most proteins contain only a single domain.
• Some sequences appear as multidomain proteins. On average, they have 2-3 domains, but they can have up to 130 domains!
• Most new sequences show homologies to parts of known protein sequences → most proteins may have descended from relatively few ancestral types.
The sequences of large proteins often seem to have evolved by joining preexisting domains in new combinations ("domain shuffling"): domain duplication or domain insertion.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Protein domain database SMART
http://smart.embl-heidelberg.de/ contains
• 153 signalling domains
• 176 nuclear domains, e.g. HLH domains
• 225 extracellular domains
• 115 "other" domains
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Protein Domain databases
• Prosite (http://expasy.proteome.org.au/prosite/): contains 1400 biologically significant motifs and profiles.
• Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml): collection of multiple-sequence alignments of protein families and profile HMMs. Curated documentation on 2500 families.
• ProDom (http://www.toulouse.inra.fr/prodom.html): contains all 160,000 protein domain families that can be automatically generated from the SwissProt and TrEMBL databases. Here, only families passing a size cutoff of 10 members are considered → 6000 ProDom families.
• InterPro Proteome Analysis of 41 nonredundant proteomes of genomes of archaea, bacteria, and eukaryotes (http://www.ebi.ac.uk/proteome): yields domains which appear along with other domains in a protein sequence → vertices + links.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
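The "vertices + links" construction can be made concrete with a small sketch: domains are vertices, and two domains are linked if they occur together in at least one protein. The protein names and domain architectures below are hypothetical toy data, not taken from any of the databases above, and the function name is an assumption.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy architectures: protein -> ordered list of its domains.
proteins = {
    "protA": ["SH3", "SH2", "kinase"],
    "protB": ["PH", "SH3"],
    "protC": ["kinase"],
    "protD": ["PH", "kinase", "PH"],
}


def domain_network(architectures):
    """Link two domains if they co-occur in at least one protein."""
    links = defaultdict(set)
    for domains in architectures.values():
        unique = set(domains)  # repeated domains in one protein count once
        for d in unique:
            links[d]  # register the vertex even if it has no links
        for a, b in combinations(sorted(unique), 2):
            links[a].add(b)
            links[b].add(a)
    return links


network = domain_network(proteins)
degree = {d: len(neighbors) for d, neighbors in network.items()}
for d in sorted(degree):
    print(d, degree[d], sorted(network[d]))
```

The degree of a domain in this network is the "number of links to other domains" whose distribution is analysed in the study.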
[Figure: degree distribution of the domain networks. Axes: P(number of links to other domains) versus number of links to other domains.] Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Which are highly connected domains?
The majority of highly connected InterPro domains appear in signalling pathways.
[Table: list of the 10 best linked domains in various species.]
The number of links increases. The number of signalling domains (PH, SH3), their ligands (proline-rich extensions), and receptors (GPCR/RHODOPSIN) increases. → The evolutionary trend toward compartmentalization of the cell and multicellularity demands a higher degree of organization.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Evolutionary Aspects
• The BA-model of scale-free networks is constructed by preferential attachment of newly added vertices to already well connected ones.
• Fell and Wagner (2000) argued that vertices with many connections in the metabolic network were metabolites originating very early in the course of evolution, where they shaped a core metabolism.
• Analogously, highly connected domains could also have originated very early. Is this true?
No. The majority of highly connected domains in Methanococcus and in E. coli are concerned with the maintenance of metabolism. None of the highly connected domains of higher organisms is found here. On the other hand, helicase C has roughly similar degrees of connection in all organisms.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Conclusions
• Expansion of protein families in multicellular vertebrates coincides with higher connectivity of the respective domains.
• Extensive shuffling of domains to increase combinatorial diversity might provide protein sets which are sufficient to preserve cellular procedures without dramatically expanding the absolute size of the protein complement.
• The greater proteome complexity of higher eukaryotes is not simply a consequence of genome size, but must also be a consequence of innovations in domain arrangements.
• Highly linked domains represent functional centers in various cellular aspects. They could be treated as "evolutionary hubs" which help to organize the domain space.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
3 How did complexity evolve II?
At the molecular level, biological complexity involves networks of ligand-protein, protein-protein, and protein-nucleic acid interactions in metabolism, signal transduction, gene regulation, protein synthesis etc. As organismal complexity increases, more control is required for the positive and negative regulation of genes. Complexity correlates with an increase in both the ratio and the absolute number of transcription factors.
Amoutzias et al. EMBO reports 5, 1 (2004)
Evolution of complex genetic networks
Duplication of genes is the predominant factor in the generation of new members of a protein family. A duplicated gene creates redundancy if multiple proteins have the same or overlapping function. Alternatively, due to reduced selective pressure, one of the gene copies can become nonfunctional or acquire a new function.
What is more important: duplication of single genes or duplication of large gene clusters ("building blocks")?
Here: empirical evidence for scale-free protein networks emerging through single-gene duplication.
Amoutzias et al. EMBO reports 5, 1 (2004)
bHLH protein family
bHLH protein family: an ancient class of eukaryotic transcription factors found in fungi, plants, and animals. bHLHs may form homo- and heterodimers; they form a complex protein-protein interaction network. They share a very conserved 60-residue "basic region – helix – loop – helix" motif. bHLHs dimerize into a 4-helix bundle and recognize DNA with their basic regions. Additional regions are responsible for activation or repression of target gene activity.
http://www.biochem.ucl.ac.uk/bsm/pdbsum/
Two possible patterns of network evolution
[Figure. A: evolution of a heterodimerization network by single-gene duplication. B: evolution of a heterodimerization network by large-scale gene duplication.]
Amoutzias et al. EMBO reports 5, 1 (2004)
bHLH heterodimerization network
[Figure. A: phylogenetic analysis of human bHLH proteins. B: domain architecture.]
Amoutzias et al. EMBO reports 5, 1 (2004)
Topology of bHLH heterodimerization network
Topology based on protein-protein interaction data (from the literature). Networks with hubs: E2A, Arnt, Max. The E2A and Arnt sub-networks are connected; Max is distinct. Hubs are shown as circles.
Amoutzias et al. EMBO reports 5, 1 (2004)
Topology of bHLH heterodimerization network
P(k) follows a scale-free behavior! The relative connectivity of hubs is higher ($\gamma \sim 1$) than reported for most other networks. This high connectivity appears to result from gene duplication generating new, peripheral proteins that interact preferentially with the hub.
Amoutzias et al. EMBO reports 5, 1 (2004)
Topology of bHLH heterodimerization network
The hub proteins are usually widely expressed in different tissues and organs. They heterodimerize with peripheral proteins that have a more limited expression pattern → specific effects. The topology of the Max network parallels that of the E2A/Arnt superfamily: 2 hubs connected by the Mad family (repressors) vs. 2 families connected by the HES family (repressors). Hubs are shown as circles.
Amoutzias et al. EMBO reports 5, 1 (2004)
Phylogenetic relationships
• The parallelism between the E2A/Arnt and Max families is also reflected in the phylogenetic relationships:
• in the "Max" (and E2A/Arnt) network, the 2 hubs (families) are not clustered together,
• the "bridge" linking the 2 hubs is a family of repressor proteins that are phylogenetically quite distant from the hubs.
• The 2 bHLH networks show evolutionary convergence.
Amoutzias et al. EMBO reports 5, 1 (2004)
Conclusions
(1) For the evolution of networks based on one kind of binding domain, a model of single-gene duplication, followed by domain rearrangements, point mutations and ongoing gene duplication, is sufficient to generate quite complex interaction patterns, which mediate activation and repression. This seems to be the first example in which a hub-based network with scale-free properties is based on real-data phylogenies.
(2) Compelling symmetry between the 2 networks E2A/Arnt and Max.
Amoutzias et al. EMBO reports 5, 1 (2004)
Watts-Strogatz model
Input: (n, k, p), where n is the number of vertices, k is the distance within which each vertex is initially connected to its neighbors by undirected edges, and p ($0 \le p \le 1$) is the probability of rewiring each edge.
Algorithm
a) Start with a ring lattice with n nodes, each connected to its k nearest neighbors on either side. Thus, the degree of each vertex is 2k and the ring has (nk) edges.
b) Replace the original edges by random ones, each with probability p.
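The two steps can be sketched as follows. This is a minimal illustration under assumptions the slide does not spell out: each lattice edge (u, u+d) is rewired with probability p by keeping u and choosing a new partner uniformly among nodes that would create neither a self-loop nor a duplicate edge. The function name and the example parameters are hypothetical.

```python
import random


def watts_strogatz(n, k, p, seed=None):
    """Ring lattice with k neighbors on each side, then random rewiring."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    if not (k >= 1 and 2 * k < n):
        raise ValueError("need 1 <= k and 2k < n")
    rng = random.Random(seed)

    def key(a, b):
        return (min(a, b), max(a, b))  # undirected edge as a sorted pair

    # a) Ring lattice: every node is linked to its k nearest neighbors on each side.
    edges = {key(u, (u + d) % n) for u in range(n) for d in range(1, k + 1)}

    # b) Visit every lattice edge once and rewire it with probability p.
    for u in range(n):
        for d in range(1, k + 1):
            if rng.random() >= p:
                continue
            v = (u + d) % n
            # Assumption: keep endpoint u, avoid self-loops and duplicate edges.
            candidates = [w for w in range(n)
                          if w != u and key(u, w) not in edges]
            if not candidates:
                continue
            edges.remove(key(u, v))
            edges.add(key(u, rng.choice(candidates)))
    return edges


lattice = watts_strogatz(20, 2, 0.0, seed=1)   # p = 0: pure ring lattice
rewired = watts_strogatz(20, 2, 0.2, seed=1)   # illustrative rewiring probability
print(len(lattice), len(rewired), len(lattice & rewired))
```

Rewiring moves edges but never creates or deletes them, so the edge count stays at nk for every p.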