341: Introduction to Bioinformatics

341: Introduction to Bioinformatics Dr. Nataša Pržulj Department of Computing Imperial College London natasha@imperial.ac.uk Winter 2011

Topics • Introduction to biology (cell, DNA, RNA, genes, proteins) • Sequencing and genomics (sequencing technology, sequence alignment algorithms) • Functional genomics and microarray analysis (array technology, statistics, clustering and classification) • Introduction to biological networks • Introduction to graph theory • Network properties • Network/node centralities • Network motifs • Network models • Network/node clustering • Network comparison/alignment • Software tools for network analysis • Interplay between topology and biology

Network Comparisons: Properties of Large Networks • Comparing large networks is computationally hard because the underlying subgraph isomorphism problem is NP-complete: • Given two graphs G and H as input, determine whether G contains a subgraph that is isomorphic to H. • Thus, network comparisons rely on easily computable heuristics (approximate solutions), called “network properties.” • Network properties can roughly, and historically, be divided into two categories: • Global network properties: give an overall view of the network, but might not be detailed enough to capture complex topological characteristics of large networks. • Local network properties: more detailed network descriptors that usually encompass a larger number of constraints, thus reducing the degrees of freedom in which the networks being compared can vary.

1. Global Network Properties Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber • Global Network Properties: • Degree distribution • Average clustering coefficient • Clustering spectrum • Average diameter • Spectrum of shortest path lengths • Centralities

1. Global Network Properties 1) Degree Distribution Definitions: • The degree of a node is the number of edges incident to the node. • Average degree of a network: the average of the degrees over all nodes in the network. However, the average degree might not be representative, since the distribution of degrees might be skewed. Figure: a node x with deg(x) = 5.

1. Global Network Properties 1) Degree Distribution • Degree distribution: • Let P(k) be the fraction of nodes of degree k in the network. The degree distribution is the distribution of P(k) over all k. • P(k) can be understood as the probability that a randomly chosen node has degree k.
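The definition of P(k) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is undirected and simple, it is given as an edge list, and the function name `degree_distribution` and the example graph are hypothetical, not part of the course material.

```python
from collections import Counter

def degree_distribution(edges, nodes=None):
    """Return {k: P(k)} for an undirected simple graph given as an edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Isolated nodes never appear in an edge, so they must be listed explicitly.
    all_nodes = set(degree) | set(nodes or [])
    # degree[x] is 0 for a node that has no incident edge.
    counts = Counter(degree[x] for x in all_nodes)
    n = len(all_nodes)
    return {k: counts[k] / n for k in sorted(counts)}

# Hypothetical example: a path a-b-c-d plus an isolated node e.
example_edges = [("a", "b"), ("b", "c"), ("c", "d")]
P = degree_distribution(example_edges, nodes=["e"])
# Degrees are a:1, b:2, c:2, d:1, e:0, so P(0) = 1/5, P(1) = 2/5, P(2) = 2/5.
```

The values of P(k) sum to 1, which is what allows P(k) to be read as a probability.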

1. Global Network Properties 1) Degree Distribution • Example: (log-log plot) • Here P(k) ~ k^(-γ), where often 2 ≤ γ < 3. This is a power-law, heavy-tailed distribution. • Networks with power-law degree distributions are called scale-free networks. In them, most of the nodes are of low degree, but there is a small number of highly linked nodes (nodes of high degree) called “hubs.”

1. Global Network Properties 1) Degree Distribution • Another example, in which the average degree is meaningful: here P(k) is a Poisson distribution.

1. Global Network Properties 1) Degree Distribution • However: degree distribution (and global properties in general) are weak predictors of network structure. • Illustration: G and H are of the same size (i.e., |G| = |H|: they have the same number of nodes and edges) and they have the same degree distribution, but G and H have very different topologies (i.e., graph structure).

Examples: G

Research debates… • Assortative vs. disassortative mixing of degrees: • Do high-degree nodes interact with high-degree nodes? • Measured by: • Pearson correlation coefficient between the degrees of adjacent vertices • Average neighbor degree, then averaged over all nodes of degree k • Structural robustness and attack tolerance: • “Robust, yet fragile” • Scale-free degree distribution: • “Party” vs. “date” hubs • J.D. Han et al., Nature, 430:88-93, 2004 • Bias in the data collection (sampling)? • M. Stumpf et al., PNAS, 102:4221-4224, 2005 • J. Han et al., Nature Biotechnology, 23:839-844, 2005 • High-degree nodes: • Essential genes • H. Jeong et al., Nature 411, 2001. • Disease/cancer genes • Jonsson and Bates, Bioinformatics, 22(18), 2006 • Goh et al., PNAS, 104(21), 2007
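The first of the two degree-mixing measures listed above, the Pearson correlation coefficient between the degrees of adjacent vertices, can be sketched as follows. This is a minimal illustration, assuming an undirected simple graph given as an edge list; the function name `degree_assortativity` and the star-graph example are hypothetical.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count every undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    m = len(xs)
    mean = sum(xs) / m  # xs and ys hold the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    if var == 0:
        raise ValueError("all edge endpoints have the same degree; correlation undefined")
    return cov / var

# Hypothetical example: a star with one hub and four leaves.
star = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
r = degree_assortativity(star)
# Every edge joins the degree-4 hub to a degree-1 leaf, so r is -1 (fully disassortative).
```

A positive value indicates assortative mixing (high-degree nodes tend to attach to high-degree nodes), a negative value indicates disassortative mixing.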

1. Global Network Properties 2) Average Clustering Coefficient • Definition: the clustering coefficient C_v of a node v is C_v = |E(N(v))| / (maximum possible number of edges in N(v)), where N(v) is the neighborhood of v, i.e., the set of all nodes adjacent to v, and E(N(v)) is the set of edges between nodes in N(v). C_v can be viewed as the probability that two neighbors of v are connected; thus 0 ≤ C_v ≤ 1. By definition, C_v = 0 for a vertex v of degree 0 or 1.

1. Global Network Properties 2) Average Clustering Coefficient • Example: • |N(v)| = 4, since there are 4 nodes in N(v), i.e., N(v) = {1, 2, 3, 4} • |E(N(v))| = 3, since there are 3 edges between nodes in N(v) • The maximum possible number of edges between nodes in N(v) is choose(4,2) = 6. • Therefore C_v = 3/6 = 1/2

1. Global Network Properties 2) Average Clustering Coefficient • Definition: the average clustering coefficient, C, of a network is the average of C_v over all the nodes v ∈ V.
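The two definitions above translate directly into code. The sketch below assumes an undirected simple graph stored as a dictionary of neighbor sets; the helper names and the example graph are hypothetical. The example graph is one possible graph consistent with the worked example (v has neighbors 1 to 4 with 3 edges among them), since the original figure is not available.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build {node: set of neighbors} from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, v):
    """C_v = (edges among neighbors of v) / (max possible such edges)."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # by definition for nodes of degree 0 or 1
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Average of C_v over all nodes v."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Assumed graph: v is adjacent to 1, 2, 3, 4; the neighbors form the path 1-2-3-4.
edges = [("v", 1), ("v", 2), ("v", 3), ("v", 4), (1, 2), (2, 3), (3, 4)]
adj = build_adjacency(edges)
c_v = clustering_coefficient(adj, "v")  # 3 of the 6 possible edges exist, so 0.5
c_avg = average_clustering(adj)
```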

1. Global Network Properties 3) Clustering Spectrum • Definition: clustering spectrum, C(k), is the distribution of the average clustering coefficients of all nodes of degree k in the network, over all k. Example:

2) and 3) Clustering Coefficient and Spectrum • C_v: clustering coefficient of node v • C_A = 1/1 = 1 • C_B = 1/3 ≈ 0.33 • C_C = 0 • C_D = 2/10 = 0.2 • … • C: average clustering coefficient of the whole network = avg {C_v over all nodes v of G} • C(k): average clustering coefficient of all nodes of degree k • E.g., C(2) = (C_A + C_C)/2 = (1+0)/2 = 0.5 • => Clustering spectrum (figure: example spectrum, not for G) • Need to evaluate whether the value of C (or any other property) is statistically significant.
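Grouping the per-node clustering coefficients by degree gives the clustering spectrum C(k). The following is a minimal sketch, assuming an undirected simple graph stored as a dictionary of neighbor sets; the function name and the small example graph are hypothetical and are not the graph G from the slide.

```python
from collections import defaultdict
from itertools import combinations

def clustering_spectrum(adj):
    """Return {k: C(k)}, the mean clustering coefficient of the nodes of degree k."""
    by_degree = defaultdict(list)
    for v, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            c = 0.0  # by definition for degree 0 or 1
        else:
            links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
            c = links / (k * (k - 1) / 2)
        by_degree[k].append(c)
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_degree.items())}

# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
spectrum = clustering_spectrum(adj)
# Degree 1: d (C = 0); degree 2: a and b (C = 1 each); degree 3: c (C = 1/3).
```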

1. Global Network Properties 4) Average Diameter • Definition: the distance between two nodes is the smallest number of links that have to be traversed to get from one node to the other. • Definition: the shortest path is the path that achieves that distance. • Definition: the average network diameter is the average of shortest path lengths over all pairs of nodes in a network.

1. Global Network Properties 5) Spectrum of Shortest Path Lengths • Definition: Let S(d) be the percentage of node pairs that are at distance d. The spectrum of shortest path lengths is the distribution of S(d) over d. Example:

4) and 5) Average Diameter and Spectrum of Shortest Path Lengths • Distance between a pair of nodes u and v: • D_{u,v} = min {length of all paths between u and v} = min {3, 4, 3, 2} = 2 = dist(u,v) • Average diameter of the whole network: • D = avg {D_{u,v} for all pairs of nodes {u,v} in G} • Spectrum of the shortest path lengths (figure: example spectrum, not for G)
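In an unweighted graph, all distances from one node can be found with breadth-first search, and the average diameter and the spectrum S(d) follow from the resulting distances. The sketch below makes two assumptions that the slides do not state: the graph is undirected and unweighted, and node pairs with no connecting path are simply left out of the averages. All names and the example graph are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest path lengths (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_length_statistics(adj):
    """Return (average diameter, {d: S(d)}) over all connected node pairs."""
    all_dist = {v: bfs_distances(adj, v) for v in adj}
    # Assumption: unreachable pairs are skipped rather than counted as infinite.
    lengths = [all_dist[u][v] for u, v in combinations(adj, 2) if v in all_dist[u]]
    counts = Counter(lengths)
    total = len(lengths)
    spectrum = {d: counts[d] / total for d in sorted(counts)}
    return sum(lengths) / total, spectrum

# Hypothetical example: the path a-b-c-d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg_diameter, S = path_length_statistics(adj)
# Pair distances are 1, 2, 3, 1, 2, 1, so the average is 10/6
# and S(1) = 3/6, S(2) = 2/6, S(3) = 1/6.
```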

1. Global Network Properties 6) Node Centralities (Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber) • Rank nodes according to their “topological importance” • Definition: • Centrality quantifies the topological importance of a node (or edge) in a network. • There are many different types of centralities: • Degree centrality • Closeness centrality • Eccentricity centrality • Betweenness centrality • Subgraph centrality • Eigenvector centrality • Software tools: Visone (social nets) and CentiBiN (biological nets)

1. Global Network Properties 6) Node Centralities • Definitions: 1. Degree centrality, C_d(v): nodes with a large number of neighbors (i.e., edges) have high centrality. Therefore, C_d(v) = deg(v). Example of a use of degree centrality: in PPI networks, nodes with high degree centrality are considered to be “biologically important.” We will learn later in the course what this means. 2. Closeness centrality, C_c(v): nodes with short paths to all other nodes in the network have high closeness centrality. In its standard form, C_c(v) = 1 / Σ_{u ∈ V} dist(v, u).

1. Global Network Properties 6) Node Centralities • Definitions: 3. Betweenness centrality, C_b(v): nodes (or edges) that occur on many of the shortest paths have high betweenness centrality. C_b(v) = (Σ_{s≠v≠t} σ_st(v)) / (Σ_{s≠v≠t} σ_st), where the summation applies both to the top and to the bottom of the fraction. σ_st(v) = the number of shortest paths from s to t that pass through v. σ_st = the number of all shortest paths from s to t (they may or may not pass through node v).

1. Global Network Properties 6) Node Centralities • Definitions: 4. Eccentricity centrality, C_e(v): nodes with short paths to every other node have high eccentricity centrality. The eccentricity of a node v is defined as ecc(v) = max_{u ∈ V} dist(v, u), i.e., the maximum shortest path length from node v to any other node u in V. Eccentricity centrality of a node v: C_e(v) = 1 / ecc(v). Thus, central nodes have higher C_e since they have lower ecc. There exist many other definitions of node centralities.
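Degree, closeness, and eccentricity centrality all follow from a single breadth-first search per node. The sketch below assumes the common textbook forms C_d(v) = deg(v), C_c(v) = 1 / Σ_u dist(v, u), and C_e(v) = 1 / ecc(v) with ecc(v) = max_u dist(v, u); it also assumes a connected, undirected, unweighted graph with at least two nodes. The function names and the example graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def centralities(adj, v):
    """Degree, closeness, and eccentricity centrality of node v (connected graph assumed)."""
    dist = bfs_distances(adj, v)
    others = [d for u, d in dist.items() if u != v]
    return {
        "degree": len(adj[v]),
        "closeness": 1 / sum(others),      # assumed form: 1 / sum of distances
        "eccentricity": 1 / max(others),   # assumed form: 1 / ecc(v)
    }

# Hypothetical example: the path a-b-c-d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
end_node = centralities(adj, "a")     # distances 1, 2, 3
inner_node = centralities(adj, "b")   # distances 1, 1, 2
```

On this path, the inner node b scores higher than the end node a on all three measures, which matches the intuition that it is more central.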

1. Global Network Properties 6) Node Centralities • Example:

1. Global Network Properties 6) Node Centralities • You need to know how to compute these centralities (and all other network properties) by hand on small networks. • For large real-world networks, you could use software, e.g., CentiBiN. • http://centibin.ipk-gatersleben.de/

Network Properties 2. Local Network Properties (Chapter 5 of the course textbook “Analysis of Biological Networks” by Junker and Schreiber) • They encompass a larger number of constraints, thus reducing degrees of freedom in which networks being compared can vary • How do we show that two networks are different? • How do we show that they are the same? • How do we quantify the level of similarity?

Network Properties 2. Local Network Properties (Chapter 5 of the course textbook “Analysis of Biological Networks” by Junker and Schreiber) • Network motifs • Graphlets Two network comparison measures based on graphlets: 2.1) Relative Graphlet Frequency Distance between two networks 2.2) Graphlet Degree Distribution Agreement between two networks

2. Local Network Properties 1) Network Motifs (Uri Alon’s group, 2002-2004) • Definition: a network motif is a small, over-represented partial subgraph of a real network. Here, over-represented means that it occurs more often than in networks coming from a random graph model. Problem: what is expected at random, i.e., which network “null model” should be used to identify motifs?

2. Local Network Properties 1) Network Motifs Example of a random graph model: • Erdős-Rényi (ER) random graphs. Definition: • A graph on n nodes (for some positive integer n) • Edges are added between pairs of nodes uniformly at random, each with the same probability p • ER graphs usually have a small number of dense (in terms of number of edges) subgraphs • There will be no regions in the network that have a large density of edges. Why?
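The ER definition above can be sketched directly: every one of the n(n-1)/2 possible node pairs is considered once and joined with probability p, independently of all other pairs. Because each pair inside any subset of nodes is also joined with the same probability p, the expected edge density of every subset equals p, which is one way to see why dense regions are unlikely when p is small. The function names, the `seed` parameter, and the example values below are hypothetical.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Return the edge list of an ER random graph on nodes 0..n-1."""
    rng = random.Random(seed)  # private generator so results are reproducible
    # Each unordered pair becomes an edge independently with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def edge_density(n, edges):
    """Fraction of the n(n-1)/2 possible edges that are present."""
    return len(edges) / (n * (n - 1) / 2)

# Hypothetical example: 200 nodes, edge probability 0.05.
sample = erdos_renyi(200, 0.05, seed=1)
density = edge_density(200, sample)  # expected to be close to p = 0.05
```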

2. Local Network Properties 1) Network Motifs Example: If motifs are identified by comparing the data with ER model networks, every dense subgraph would come up as a motif, because dense subgraphs do not exist in our ER model networks.


1) Network motifs (Uri Alon’s group, ’02-’04) • Small subgraphs that are over-represented in a network when compared to randomized networks • Network motifs: • Reflect the underlying evolutionary processes that generated the network • Carry functional information • Define superfamilies of networks • Z_i is the statistical significance (Z-score) of subgraph i; SP is the vector of Z_i values normalized to unit length, so its entries lie between −1 and 1 • But: • Functionally important but not statistically significant patterns could be missed • The choice of the appropriate null model is crucial, especially across “families” • Random graphs with the same in- and out-degree distribution as the data might not be the best network null model • Motifs are partial subgraphs, while we use induced ones to understand network structure
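The quantities Z_i and SP_i can be illustrated with a short computation. The sketch assumes the usual forms from the superfamilies work of Alon's group: Z_i = (Nreal_i − mean(Nrand_i)) / std(Nrand_i), where Nreal_i is the count of subgraph i in the data and Nrand_i are its counts in randomized networks, and SP_i = Z_i / sqrt(Σ_j Z_j²). The use of the population standard deviation, the function names, and all counts below are assumptions made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def z_scores(real_counts, random_counts):
    """Z-score of each subgraph count against its counts in randomized networks.

    real_counts[i]   : count of subgraph i in the real network
    random_counts[i] : list of counts of subgraph i, one per randomized network
    """
    # Assumption: population standard deviation; it must be nonzero.
    return [(real - mean(rand)) / pstdev(rand)
            for real, rand in zip(real_counts, random_counts)]

def significance_profile(z):
    """Normalize the Z-score vector to unit length."""
    norm = sqrt(sum(x * x for x in z))
    return [x / norm for x in z]

# Made-up counts for two subgraph types over three randomized networks.
real = [30, 2]
randomized = [[10, 12, 8], [5, 7, 3]]
z = z_scores(real, randomized)
sp = significance_profile(z)
# Subgraph 0 is over-represented (positive entry), subgraph 1 under-represented (negative entry).
```

Normalizing to unit length is what makes profiles of networks of very different sizes comparable.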

2. Local Network Properties 1) Network Motifs Example: Feed-forward loop Shen-Orr, Milo, Mangan, and Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, 2002

1) Network motifs (Uri Alon’s group, ’02-’04) http://www.weizmann.ac.il/mcb/UriAlon/ Also, see Pajek, MAVisto, and FANMOD

2) Graphlets (Przulj group, ’04-’10) • Different from network motifs: • Induced subgraphs • Of any frequency (don’t need to be over-represented) N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.


2.1) Relative Graphlet Frequency (RGF) distance between networks G and H: N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.

2.2) Graphlet Degree Distributions Generalize node degree

N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007.

Network structure vs. biological function & disease Graphlet Degree (GD) vectors, or “node signatures” T. Milenković and N. Pržulj, “Uncovering Biological Network Function via Graphlet Degree Signatures,” Cancer Informatics, vol. 6, pg. 257-273, 2008.

Similarity measure between “node signature” vectors

Signature Similarity Measure between nodes u and v


Figure: signature similarity example with node labels SMD1, YBR095C, and PMA1 (40%).


Figure: signature similarity example with node labels SMD1, RPO26, and SMB1 (90%*). *Statistically significant threshold at ~85%.

Later we will see how to use this and other techniques to link network structure with biological function

Generalize Degree Distribution of a network • The degree distribution measures: • the number of nodes “touching” k edges for each value of k N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” Bioinformatics, vol. 23, pg. e177-e183, 2007.
