CS 6293 Advanced Topics: Translational Bioinformatics

Biological networks: Theory and applications

Lecture outline • Basic terminology and concepts in networks • Some interesting results between network properties and biological functions • Network clustering / community discovery • Applications of network clustering methods

Network • A network refers to a graph • A useful concept for analyzing the interactions among the components of a system

Biological networks • An abstraction of the complex relationships among molecules in the cell • Many types: • Protein-protein interaction networks • Protein-DNA (RNA) interaction networks • Genetic interaction networks • Metabolic networks • Signal transduction networks • (Real) neural networks • Many others • In some networks, edges have a precise meaning; in others, the meaning of an edge is unclear

Protein-protein interaction networks • Yeast PPI network • Nodes – proteins • Edges – interactions • The color of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown)

Obtaining biological networks • Direct experimental methods • Protein-protein interaction networks • Yeast two-hybrid • Tandem affinity purification • Co-immunoprecipitation • Protein-DNA interaction • Chromatin immunoprecipitation followed by microarray (ChIP-chip) or sequencing (ChIP-seq) • High level of noise (false positives and false negatives) • Computational prediction methods • Often cannot differentiate direct from indirect interactions

Why networks? • Studying genes/proteins at the network level allows us to: • Assess the role of individual genes/proteins in the overall pathway • Evaluate redundancy of network components • Identify candidate genes involved in genetic diseases • Set up a framework for mathematical models • For complex systems, the actual output may not be predictable from the individual components alone: the whole is greater than the sum of its parts

Graphs • A graph G = (V, E) • V = set of vertices • E = set of edges = subset of V × V • Thus |E| = O(|V|^2) • Example: Vertices: {1, 2, 3, 4}; Edges: {(1, 2), (2, 3), (1, 3), (4, 3)}

Graph Variations (1) • Directed / undirected: • In an undirected graph: • Edge (u, v) ∈ E implies edge (v, u) ∈ E • Road networks between cities • In a directed graph: • Edge (u, v): u → v does not imply v → u • Street networks in downtown • Degree of vertex v: • The number of edges incident to v • For a directed graph, there are in-degree and out-degree

[Figure: a four-vertex graph drawn as directed (in-degree = 3, out-degree = 0) and as undirected (degree = 3)]
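The degree definitions can be checked on the four-vertex example. This is a minimal sketch, assuming the edge list {(1, 2), (2, 3), (1, 3), (4, 3)} is read as directed edges u → v; the function names are illustrative, not from the slides.

```python
from collections import Counter

# Example edge list from the slides, read here as directed edges (u -> v).
edges = [(1, 2), (2, 3), (1, 3), (4, 3)]

def directed_degrees(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1  # edge leaves u
        in_deg[v] += 1   # edge enters v
    return in_deg, out_deg

def undirected_degrees(edges):
    """Return a Counter of degrees, treating every edge as undirected."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

in_deg, out_deg = directed_degrees(edges)
deg = undirected_degrees(edges)
print(in_deg[3], out_deg[3], deg[3])  # 3 0 3
```

Under this reading, vertex 3 reproduces the values shown in the figure: in-degree 3 and out-degree 0 when directed, degree 3 when undirected.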

Graph Variations (2) • Weighted / unweighted: • In a weighted graph, each edge or vertex has an associated weight (numerical value) • E.g., a road map: edges might be weighted with distance [Figure: the four-vertex example drawn weighted (edge weights 0.3, 1.2, 0.4, 1.9) and unweighted]

Graph Variations (3) • Connected / disconnected: • A connected graph has a path from every vertex to every other • A directed graph is strongly connected if there is a directed path between any two vertices [Figure: directed graph on vertices 1–4 that is connected but not strongly connected]

Graph Variations (4) • Dense / sparse: • Graphs are sparse when the number of edges is linear in the number of vertices • |E| ∈ O(|V|) • Graphs are dense when the number of edges is quadratic in the number of vertices • |E| ∈ Θ(|V|^2) • Most graphs of interest are sparse • If you know whether you are dealing with dense or sparse graphs, different data structures may make sense

Representing Graphs • Assume V = {1, 2, …, n} • An adjacency matrix represents the graph as an n × n matrix A: • A[i, j] = 1 if edge (i, j) ∈ E, and 0 if edge (i, j) ∉ E • For a weighted graph: • A[i, j] = w_ij if edge (i, j) ∈ E, and 0 if edge (i, j) ∉ E • For an undirected graph: • Matrix is symmetric: A[i, j] = A[j, i]


Graphs: Adjacency Matrix • Example: [Figure: example graph on vertices 1, 2, 3, 4] • How much storage does the adjacency matrix require? A: O(|V|^2)

Graphs: Adjacency Matrix • Example: undirected graph on vertices 1–4
A   1 2 3 4
1   0 1 1 0
2   1 0 1 0
3   1 1 0 1
4   0 0 1 0

Graphs: Adjacency Matrix • Example: weighted graph on vertices 1–4 (edge weights: 1–2 = 5, 1–3 = 6, 2–3 = 9, 3–4 = 4)
A   1 2 3 4
1   0 5 6 0
2   5 0 9 0
3   6 9 0 4
4   0 0 4 0
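The weighted example can be rebuilt from its edge list. The sketch below assumes vertices are numbered 1..n and that a weight of 0 means "no edge", as on the slides; `adjacency_matrix` is an illustrative name.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n matrix for vertices 1..n from (u, v, weight) triples.

    Assumption: 0 encodes a missing edge, so real edge weights must be nonzero.
    """
    A = [[0] * n for _ in range(n)]
    for u, v, w in edges:
        A[u - 1][v - 1] = w          # shift to 0-based indices
        if not directed:
            A[v - 1][u - 1] = w      # undirected graphs give a symmetric matrix
    return A

# Weighted example from the slides: edges 1-2 (5), 1-3 (6), 2-3 (9), 3-4 (4).
weighted_edges = [(1, 2, 5), (1, 3, 6), (2, 3, 9), (3, 4, 4)]
A = adjacency_matrix(4, weighted_edges)
for row in A:
    print(row)
```

For an unweighted graph, passing a weight of 1 on every edge gives the 0/1 matrix. The matrix always holds n × n entries, however few edges there are.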

Graphs: Adjacency Matrix • Time to answer whether there is an edge between vertices u and v: Θ(1) • Memory required: Θ(n^2) regardless of |E| • Usually too much storage for large graphs • But can be very efficient for small graphs • Most large interesting graphs are sparse • E.g., road networks (due to the limit on junctions) • For this reason the adjacency list is often a more appropriate representation

Graphs: Adjacency List • Adjacency list: for each vertex v ∈ V, store a list of vertices adjacent to v • Example: • Adj[1] = {2, 3} • Adj[2] = {3} • Adj[3] = {} • Adj[4] = {3} • Variation: can also keep a list of edges coming into each vertex [Figure: directed graph on vertices 1–4]

Graph representations • Adjacency list: 1 → 2, 3 • 2 → 3 • 3 → (empty) • 4 → 3 • How much storage does the adjacency list require? A: O(|V| + |E|)

Graph representations • Undirected graph
Adjacency matrix:
A   1 2 3 4
1   0 1 1 0
2   1 0 1 0
3   1 1 0 1
4   0 0 1 0
Adjacency list: Adj[1] = {2, 3} • Adj[2] = {1, 3} • Adj[3] = {1, 2, 4} • Adj[4] = {3}

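Graph representations • Weighted graph
Adjacency matrix:
A   1 2 3 4
1   0 5 6 0
2   5 0 9 0
3   6 9 0 4
4   0 0 4 0
Adjacency list, entries as (neighbor, weight): Adj[1] = {(2, 5), (3, 6)} • Adj[2] = {(1, 5), (3, 9)} • Adj[3] = {(1, 6), (2, 9), (4, 4)} • Adj[4] = {(3, 4)}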
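The same weighted graph can be held as an adjacency list. This is a sketch under the assumption that each list entry is a (neighbor, weight) pair; `adjacency_list` and `has_edge` are illustrative names.

```python
from collections import defaultdict

def adjacency_list(edges, directed=False):
    """Map each vertex to a list of (neighbor, weight) pairs."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        if not directed:
            adj[v].append((u, w))  # store the edge in both directions
    return adj

def has_edge(adj, u, v):
    """Edge test by scanning u's list: O(deg(u)), versus Θ(1) for a matrix."""
    return any(nb == v for nb, _ in adj.get(u, []))

adj = adjacency_list([(1, 2, 5), (1, 3, 6), (2, 3, 9), (3, 4, 4)])
print(adj[3])                              # [(1, 6), (2, 9), (4, 4)]
print(sum(len(nb) for nb in adj.values())) # 8 entries = 2 * |E| for an undirected graph
```

Storage grows with |V| + |E| rather than |V|^2, at the cost of a slower edge lookup.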

Tradeoffs between the two representations • |V| = n, |E| = m • Both representations are very useful and have different properties, although adjacency lists are probably better for most problems

Structural properties of networks • Degree distribution • Average shortest path length • Clustering coefficient • Community structure • Degree correlation • Motivation to study structural properties: • Structure determines function • Functional structural properties may be shared by different types of real networks (bio or non-bio)

Degree distribution P(k) • The probability that a selected node has exactly (or approximately) k links • P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, … links and dividing by the total number of nodes N: P(k) = N(k) / N
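As a small illustration of P(k) = N(k) / N, the sketch below counts degrees in the undirected four-vertex example. The input format (a dict mapping each node to its neighbors) is an assumption made for this example.

```python
from collections import Counter

def degree_distribution(adj):
    """adj: dict mapping node -> collection of neighbors. Returns {k: P(k)}."""
    n = len(adj)
    counts = Counter(len(neighbors) for neighbors in adj.values())  # N(k)
    return {k: c / n for k, c in sorted(counts.items())}

# Undirected example graph: degrees are 2, 2, 3, 1.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
P = degree_distribution(adj)
print(P)  # {1: 0.25, 2: 0.5, 3: 0.25}
```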

Erdős–Rényi model • Each pair of nodes forms an edge with probability p • Most nodes have about the same number of connections • Degree distribution is binomial (approximately Poisson for large, sparse networks)
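A random graph of this kind can be generated directly from the definition. This is a minimal sketch using only the standard library; the function name, the 0-based node labels, and the `seed` parameter are assumptions made for illustration.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Undirected random graph on nodes 0..n-1; each pair is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):  # every unordered pair exactly once
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

adj = erdos_renyi(200, 0.05, seed=0)
mean_degree = sum(len(nb) for nb in adj.values()) / len(adj)
print(mean_degree)  # expected to be close to p * (n - 1) = 9.95
```

Because every node draws its links independently with the same probability, degrees cluster around the mean p(n - 1) and very high-degree nodes are rare.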

Real networks: scale-free • Heavy-tailed distribution • Power-law distribution • P(k) ∝ k^(-r)

Comparing Random and Scale-free distribution • In the random network, the five nodes with the most links (in red) are connected to only 27% of all nodes (green). In the scale-free network, the five most connected nodes (red) are connected to 60% of all nodes (green) (source: Nature)

Robust yet fragile nature of networks

Shortest and mean path length • Distance in networks is measured by path length • As there are many alternative paths between two nodes, the shortest path between the selected nodes has a special role • In directed networks: • The distance A → B is often different from the distance B → A • Often there is no directed path between two nodes • The average path length between all pairs of nodes offers a measure of a network’s overall navigability • Most pairs of vertices in a biological network seem to be connected by a short path – the small-world property
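For unweighted networks, shortest path lengths can be found with breadth-first search. The sketch below makes one assumption the slides leave open: the average is taken over ordered pairs (u, v) where v is reachable from u, so unreachable pairs are skipped rather than counted as infinite.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:        # first visit is the shortest route in BFS
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """Mean shortest-path length over ordered pairs (u, v), u != v, v reachable from u."""
    total, pairs = 0, 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

undirected = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
directed = {1: [2, 3], 2: [3], 3: [], 4: [3]}
print(average_path_length(undirected))
print(bfs_distances(directed, 1))  # vertex 4 is missing: no directed path from 1 to 4
```

Running one search per node costs O(|V| (|V| + |E|)) on an adjacency list, which is practical for sparse networks.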

Clustering coefficient • Your clustering coefficient: the probability that two of your friends are also friends • You have m friends • Among your m friends, there are n pairs of friends • The maximum is m * (m-1) / 2 • C = 2 n / (m^2-m) • Clustering coefficient of a network: the average clustering coefficient of all individuals

Clustering Coefficient • The i-th node has k_i neighbors • E_i is the actual number of links between the k_i neighbors • The maximal number of links between the k_i neighbors is k_i(k_i-1)/2 • C_i = 2E_i / (k_i(k_i-1)), which equals 2/9 in the slide's example • Interpretation: the probability that two of your friends are also friends • Clustering coefficient of a network: the average clustering coefficient of all nodes
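The formula C_i = 2E_i / (k_i(k_i-1)) translates directly into code. This sketch assumes an undirected graph stored as a dict of neighbor sets, and assigns C_i = 0 to nodes with fewer than two neighbors, a convention the slides do not specify.

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i = 2 * E_i / (k_i * (k_i - 1)) for node i of an undirected graph."""
    neighbors = list(adj[i])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined cases count as 0
    # E_i: number of links among the neighbors of i
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Network clustering coefficient: mean of C_i over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print([local_clustering(adj, i) for i in adj])
print(average_clustering(adj))
```

In this example node 3 has three neighbors with one link among them (1–2), so C_3 = 2·1 / (3·2) = 1/3, and the network average is (1 + 1 + 1/3 + 0) / 4 = 7/12.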

Degree correlation • Do rich people tend to hang together with rich people (rich-club)? • Or do they tend to interact with less wealthy people? • Do high degree nodes tend to connect to low degree nodes or high degree ones?
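One common way to put a number on this question, not one the slides name, is the Pearson correlation between the degrees found at the two ends of each edge (often called degree assortativity). The sketch below assumes an undirected graph stored as a dict of neighbor collections and counts every edge in both directions.

```python
from math import sqrt

def degree_assortativity(adj):
    """Pearson correlation of end-point degrees over all edges (both orientations).

    Positive: high-degree nodes tend to link to high-degree nodes.
    Negative: high-degree nodes tend to link to low-degree nodes.
    """
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    if n == 0:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # all nodes have the same degree: correlation undefined
    return cov / sqrt(var_x * var_y)

# Star graph: one hub linked to three leaves, so hubs meet only low-degree nodes.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(degree_assortativity(star))  # -1.0
```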

Some interesting findings from biological networks • Jeong et al., Lethality and centrality in protein networks. Nature 411, 41-42 (3 May 2001) • Guimerà and Amaral, Functional cartography of complex metabolic networks. Nature 433, 895-900 (24 February 2005) • Han et al., Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430, 88-93 (1 July 2004)

Connectivity vs essentiality • [Figure: % of essential proteins plotted against number of connections] • Jeong et al., Nature 2001

Community role vs essentiality • The effect of a perturbation cannot depend on the node’s degree alone! • Many hub genes are not essential • Some non-hub genes are essential • Maybe a gene’s role in its community is also important • Local leader? Global leader? Ambassador? • Guimerà and Amaral, Nature 433, 2005

Community structure

• Roles 1, 2, 3: non-hubs with increasing participation indices • Roles 5, 6: hubs with increasing participation indices
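The slides do not define the participation index. The sketch below assumes the participation coefficient as commonly defined following Guimerà and Amaral (2005), P_i = 1 - Σ_s (k_is / k_i)^2, where k_is is the number of links from node i to nodes in module s and k_i is the degree of i. The `module` mapping and the toy graph are hypothetical inputs made up for illustration.

```python
from collections import Counter

def participation_coefficient(adj, module, i):
    """P_i = 1 - sum_s (k_is / k_i)^2 (assumed definition, see lead-in).

    adj: dict node -> collection of neighbors
    module: dict node -> module label (hypothetical community assignment)
    """
    k = len(adj[i])
    if k == 0:
        return 0.0  # assumed convention for isolated nodes
    links_per_module = Counter(module[j] for j in adj[i])  # k_is for each module s
    return 1.0 - sum((c / k) ** 2 for c in links_per_module.values())

# Toy graph: node "h" splits its links evenly across two modules,
# node "a" links only inside module "A".
adj = {"h": ["a", "b", "c", "d"], "a": ["h", "b"], "b": ["h", "a"],
       "c": ["h", "d"], "d": ["h", "c"]}
module = {"h": "A", "a": "A", "b": "A", "c": "B", "d": "B"}
print(participation_coefficient(adj, module, "h"))  # 0.5
print(participation_coefficient(adj, module, "a"))  # 0.0
```

Values near 0 mean a node's links stay inside one module; values approaching 1 mean they are spread evenly over many modules.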

Dynamically organized modularity in the yeast PPI network • Protein interaction networks are static • Two proteins cannot interact if one is not expressed • We should look at the gene expression level • Han et al., Nature 430, 2004

Obtaining Data

Distinguish party hubs from date hubs • Red curve – hubs • Cyan curve – nonhubs • Black curve – randomized • Partners of date hubs are significantly more diverse in spatial distribution than partners of party hubs

Effect of removal of nodes on average geodesic distance • [Figure panels: original network; on removal of date hubs; on removal of party hubs] • Green – nonhub nodes • Brown – hubs • Red – date hubs • Blue – party hubs • The ‘breakdown point’ is the threshold after which the main component of the network starts disintegrating

Dynamically organized modularity • Red circles – date hubs • Blue squares – modules

Han-Yu Chuang, Eunjung Lee, Yu-Tseung Liu, Doheon Lee, Trey Ideker, Network-based classification of breast cancer metastasis, Mol Syst Biol. 2007; 3: 140.

Challenge: Predict Metastasis • If metastasis is likely => aggressive adjuvant therapy • How to decide the likelihood? • Traditional predictive factors are not good

Recently: Gene Marker Sets • Examine genome-wide expression profiles • Score individual genes for how well they discriminate between different classes of disease • Establish a gene expression signature • Problem: # genes >> # patients
