Speaker: Chuang Chieh Lin Advisor: Professor R. C. T. Lee National Chi-Nan University

How to Reconstruct a Large Genetic Network from n Gene Perturbations in fewer than n^2 Easy Steps
Andreas Wagner, Bioinformatics, vol. 17, no. 12, 2001, pp. 1183-1187.
CSIE in National Chi-Nan University

Outline
• Introduction and basic definitions
• Graph theoretical framework
• Parsimonious network
• Algorithm and complexity
• Cycles in genetic networks
• Conclusions
• References

Introduction and basic definitions
• Gene activity: whether a gene is expressed or not, e.g., as mRNA or as protein.
• Gene network: in this paper, we define a genetic network as a group of genes in which individual genes can influence the activity of other genes.
• The core task of reconstructing genetic networks is to identify the causal structure of a gene network.

• To reconstruct a genetic network is to identify, for each gene in the network, which other genes' activity it influences directly.
• Now, let’s see an illustration of a genetic network.

[Figure: pathway diagram showing Gene 1 to Gene 5 on the DNA and their products (two transcription factors, a protein kinase and a protein phosphatase), with inactive and active (phosphorylated, P) protein states.]
This is a hypothetical biochemical pathway involving two transcription factors, a protein kinase and a protein phosphatase, as well as the genes encoding them.

Genetic perturbation: an experimental manipulation of gene activity by manipulating either a gene itself or its product. It includes point mutations, gene deletions, or other interference with the activity of the product.

[Figure: the same hypothetical pathway as on the previous slide, with Gene 1 to Gene 5 and their products.]
Genetic perturbation: gene deletion. Aspect of gene activity: mRNA expression.
G1: G2, G5
G2: G5
G3: G5
G4: G5
G5:
Genetic perturbation: gene deletion. Aspect of gene activity: phosphorylation state.
G1: G3, G4
G2: G3, G4
G3: G4
G4:
G5:

Graph theoretical framework
• As the previous example indicated, we are concerned with qualitative information on gene interaction.
• We use a “digraph”, a graph representation of genetic networks, to capture this qualitative information.
• A digraph is a directed graph consisting of nodes and directed edges.
• Let’s see an example.

We use a → b to mean that gene a influences the activity of gene b directly. For brevity, genes will be labeled by numbers from now on.
[Figure: example digraph G with nodes 0 to 20.]

• Adjacency list: for each gene i, it shows which genes’ activity state gene i influences directly.
• We denote by Adj(G) the adjacency list of graph G and by Adj(i) the set of nodes (genes) adjacent to (directly influenced by) node i.

Adjacency list of G:
0: 16
1:
2:
3: 2 5 8
4:
5: 12
6: 5 12
7: 2 17
8:
9: 10 15
10: 1 20
11: 20
12: 14
13: 8 17
14: 0
15: 0
16: 2
17: 8
18:
19: 8
20: 6 18
[Figure: the digraph G with nodes 0 to 20.]

• Accessibility list: the list of perturbation effects, or the list of regulatory effects. It shows all nodes (genes) that can be accessed (influenced in their activity state) from a given gene by paths of direct interactions.
• We denote by Acc(G) the accessibility list of the graph G and by Acc(i) the set of nodes that can be reached (influenced) from node (gene) i.

Accessibility list of G:
0: 2 16
1:
2:
3: 0 2 5 8 12 14 16
4:
5: 0 2 12 14 16
6: 0 2 5 12 14 16
7: 2 8 17
8:
9: 0 1 2 5 6 10 12 14 15 16 18 20
10: 0 1 2 5 6 12 14 16 18 20
11: 0 2 5 6 12 14 16 18 20
12: 0 2 14 16
13: 8 17
14: 0 2 16
15: 0 2 16
16: 2
17: 8
18:
19: 8
20: 0 2 5 6 12 14 16 18
[Figure: the digraph G with nodes 0 to 20.]
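The accessibility list of G follows from its adjacency list by following paths of direct interactions. A minimal Python sketch of that derivation, assuming a plain depth-first search (the names `ADJ` and `accessibility_list` are illustrative and not from the paper):

```python
# Adjacency list of the example graph G, copied from the slide.
ADJ = {
    0: [16], 1: [], 2: [], 3: [2, 5, 8], 4: [], 5: [12], 6: [5, 12],
    7: [2, 17], 8: [], 9: [10, 15], 10: [1, 20], 11: [20], 12: [14],
    13: [8, 17], 14: [0], 15: [0], 16: [2], 17: [8], 18: [], 19: [8],
    20: [6, 18],
}

def accessibility_list(adj):
    """For each node, return the sorted nodes reachable by a path of length >= 1."""
    acc = {}
    for start in adj:
        seen = set()
        stack = list(adj[start])       # begin with the direct targets
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node])  # follow further direct interactions
        acc[start] = sorted(seen)
    return acc

ACC = accessibility_list(ADJ)
print(ACC[9])  # [0, 1, 2, 5, 6, 10, 12, 14, 15, 16, 18, 20]
```

The result for gene 9 matches the corresponding row of the accessibility list on the slide.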

Before proceeding with the algorithm, we need some concepts and theorems.

The most parsimonious network
• An acyclic digraph determines its accessibility list, but an accessibility list may correspond to more than one acyclic digraph.
• Let’s see an example first.

[Figure: panel (a) is the accessibility list Acc below; panels (b), (c) and (d) are digraphs on nodes 0 to 5.]
(a) Acc:
0: 1 2 3 4 5
1: 2 3 4 5
2: 3 4 5
3:
4: 5
5:
(d) is the most parsimonious network of Acc, i.e., of (a).

• An accessibility list Acc and a digraph G are compatible if G has Acc as its accessibility list. Acc is then the accessibility list induced by G.
• A compatible digraph with the fewest edges, denoted Gpars, is called the most parsimonious network compatible with Acc.

Why do we prefer the most parsimonious network?
• Among all compatible gene networks, we prefer the simplest, i.e., the most parsimonious, one.
• For any accessibility list Acc of a digraph G, there exists a most parsimonious network Gpars (this follows from a theorem). Gpars is therefore the core of all the corresponding digraphs.
• More complicated digraphs are harder to interpret.

Theorem 1
• Let Acc be the accessibility list of an acyclic digraph. Then there exists exactly one graph Gpars that has Acc as its accessibility list and that has fewer edges than any other graph G with Acc as its accessibility list.
• Before starting the proof, we need to introduce some terminology.

Range and shortcut
• Consider two nodes i and j of a digraph that are connected by an edge e. The range r of the edge e is the length of the shortest path between i and j in the absence of e. If there is no other path connecting i and j, then r := ∞.
• An edge e with range r ≥ 2 but r ≠ ∞ is called a shortcut.
• Let’s see an example.

[Figure: an edge e from i to j, together with a path from i to j through the nodes z1, z2, …, zk.]
e is a shortcut. When e is eliminated, i and j are still connected by a path of length k + 1, so r(e) = k + 1.
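The range of an edge can be computed with a breadth-first search that ignores the edge itself. The following Python sketch assumes directed paths from i to j and an adjacency list stored as a dictionary; the function names and the small example graph are hypothetical, chosen only to mirror the figure with k = 2:

```python
from collections import deque
import math

def edge_range(adj, i, j):
    """Length of the shortest path from i to j when the edge i -> j is ignored.

    Returns math.inf if no other path exists.
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if node == i and nxt == j:
                continue  # skip the edge e itself
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == j:
                    return dist[nxt]
                queue.append(nxt)
    return math.inf

def is_shortcut(adj, i, j):
    """An edge is a shortcut if its range is at least 2 but finite."""
    r = edge_range(adj, i, j)
    return 2 <= r < math.inf

# Hypothetical example: edge i -> j plus the path i -> z1 -> z2 -> j (k = 2).
example = {"i": ["j", "z1"], "z1": ["z2"], "z2": ["j"], "j": []}
print(edge_range(example, "i", "j"))   # 3
print(is_shortcut(example, "i", "j"))  # True
```

With k = 2 intermediate nodes the range is k + 1 = 3, as stated on the slide.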

Lemma 1 • For any accessibility list Acc of a digraph, there exists a compatible graph Gpars that is free of shortcuts.

Proof of Lemma 1
• Assume that there is no such graph Gpars.
• If there exists a shortcut ei between xi and yi, delete ei. By the definition of a shortcut, xi and yi are still connected via a path Pi whose length is greater than 1, so deleting ei does not change the accessibility list.
[Figure: xi and yi before and after deleting ei; they remain connected via Pi, whose length is greater than 1.]

• Suppose that there are n such pairs (xi, yi), i.e., (x1, y1), …, (xn, yn).
• After repeating the deletion for every pair (xi, yi), i = 1, …, n, we obtain a shortcut-free graph compatible with the accessibility list. This contradicts the assumption made at the beginning of the proof.

Lemma 2
• Assume that Acc is the accessibility list of a digraph G. For each node x, the adjacency list Adj(x) of a shortcut-free graph Gpars compatible with Acc is a subset of the adjacency list Adj(x) of any graph compatible with Acc.

Proof of Lemma 2
• Assume that Lemma 2 is false.
• Without loss of generality, suppose that a shortcut-free graph Gpars and some other graph G both induce Acc.
• By assumption, Gpars contains at least one node x such that Adj(x) of Gpars contains at least one node y that is not in Adj(x) of G.

• Because G and Gpars have the same accessibility list Acc, and y is not adjacent to x in G, there must exist some path x → z1 → z2 → … → zk → y (k ≥ 1) from x to y in G. For the same reason, z1 is accessible from x in Gpars, z2 from z1 in Gpars, …, and y from zk in Gpars.
• Therefore we can find two paths from x to y in Gpars: (1) the edge e between x and y; (2) a path from x through z1, z2, …, zk to y, of length at least 2.
• This contradicts the assumption that Gpars is shortcut-free, because e is a shortcut. Let’s see an example!

[Figure: graph G and graph Gpars on the nodes x, z1, z2 and y; in Gpars one edge is marked "A shortcut!".]
Acc:
x: z1 z2 y
z1: z2 y
z2: y
Adj(Gpars):
x: z1 y
z1: z2
z2: y
Adj(G):
x: z1 z2
z1: z2
z2: y

Corollary 1
• The shortcut-free graph Gpars compatible with Acc is the unique graph with the fewest edges among all graphs G compatible with Acc.
• This corollary follows immediately from Lemma 2.

Now, we can proceed to the algorithm.

A recursive pruning algorithm to reconstruct the most parsimonious graph from an accessibility list.
1: for all nodes i of G
2:   Adj(i) = Acc(i)
3: for all nodes i of G
4:   if node i hasn’t been visited
5:     call PRUNE_ACC(i)
6:   end if
7: PRUNE_ACC(i)
8:   for all nodes j ∈ Acc(i)
9:     if Acc(j) = ∅
10:      declare j as visited
11:    else
12:      call PRUNE_ACC(j)
13:    end if
14:  for all nodes j ∈ Acc(i)
15:    for all nodes k ∈ Adj(j)
16:      if k ∈ Acc(i)
17:        delete k from Adj(i)
18:      end if
19:  declare node i as visited
20: end PRUNE_ACC(i)
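The pseudocode can be turned into a short runnable sketch. The Python version below is one possible reading, not the author's implementation: it assumes the accessibility list is a dictionary from node to list of nodes, that the graph is acyclic, and it skips the recursive call for nodes already marked as visited (an assumption made here to avoid repeated work). The name `prune_acc` is illustrative.

```python
def prune_acc(acc):
    """Return the adjacency list of the most parsimonious graph for an acyclic Acc."""
    adj = {i: set(targets) for i, targets in acc.items()}  # lines 1-2: Adj(i) = Acc(i)
    visited = set()

    def prune(i):                      # PRUNE_ACC(i)
        for j in acc[i]:               # lines 8-13: prune everything reachable first
            if j not in visited:       # assumption: skip nodes already pruned
                prune(j)
        for j in acc[i]:               # lines 14-18: drop entries reached via j
            for k in adj[j]:
                adj[i].discard(k)      # no effect if k is not in Adj(i)
        visited.add(i)                 # line 19

    for i in acc:                      # lines 3-6
        if i not in visited:
            prune(i)
    return {i: sorted(adj[i]) for i in acc}

# Accessibility list of the six-node example used in these slides.
ACC_SMALL = {0: [1, 2, 3, 4, 5], 1: [2, 3, 4, 5], 2: [3, 4, 5], 3: [], 4: [5], 5: []}
print(prune_acc(ACC_SMALL))
# {0: [1], 1: [2], 2: [3, 4], 3: [], 4: [5], 5: []}
```

Because the sketch is recursive, a very long chain of genes could exceed Python's default recursion limit; raising it with `sys.setrecursionlimit` or rewriting the recursion with an explicit stack would be needed in that case.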

This algorithm is based on the following theorem.

Theorem 2
• Let Acc(G) be the accessibility list of an acyclic digraph, Gpars its most parsimonious graph, and V(Gpars) the set of all nodes of Gpars. Then the following identity holds: $\forall i \in V(G_{pars}):\ Adj(i) = Acc(i) \setminus \bigcup_{j \in Acc(i)} Acc(j)$
• Instead of proving the theorem, we give an example later.

[Figure: three digraphs on nodes 0 to 5; the first is labeled "A possible corresponding G".]
Original Acc(G):
0: 1 2 3 4 5
1: 2 3 4 5
2: 3 4 5
3:
4: 5
5:
After the step "0 via 1, 2, 3, 4, 5":
0: 1
1: 2 3 4 5
2: 3 4 5
3:
4: 5
5:
After the step "1 via 2, 3, 4, 5":
0: 1
1: 2
2: 3 4 5
3:
4: 5
5:

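[Figure: three digraphs on nodes 0 to 5; the last is labeled "The most parsimonious network".]
State carried over from the previous slide:
0: 1
1: 2
2: 3 4 5
3:
4: 5
5:
After the step "2 via 3, 4, 5":
0: 1
1: 2
2: 3 4
3:
4: 5
5:
After the step "4 via 5" (the most parsimonious network):
0: 1
1: 2
2: 3 4
3:
4: 5
5: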
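The end result of this example can be cross-checked without recursion. The sketch below assumes that the identity of Theorem 2 reads Adj(i) = Acc(i) minus the union of Acc(j) over all j in Acc(i), and evaluates it directly for every node; the function name is illustrative and the approach is a plain set computation, not the pruning algorithm itself.

```python
def adjacency_from_identity(acc):
    """Evaluate Adj(i) = Acc(i) minus the union of Acc(j) for j in Acc(i)."""
    adj = {}
    for i, reachable in acc.items():
        indirect = set()
        for j in reachable:
            indirect.update(acc[j])    # everything reachable through some j
        adj[i] = sorted(set(reachable) - indirect)
    return adj

# Original Acc(G) of the six-node example.
ACC_EXAMPLE = {0: [1, 2, 3, 4, 5], 1: [2, 3, 4, 5], 2: [3, 4, 5], 3: [], 4: [5], 5: []}
print(adjacency_from_identity(ACC_EXAMPLE))
# {0: [1], 1: [2], 2: [3, 4], 3: [], 4: [5], 5: []}
```

The output agrees with the most parsimonious network reached at the end of the example. This direct evaluation touches each Acc(j) in full, so it is simpler but generally does more work than the recursive pruning.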

• The preceding example also illustrates how our algorithm works.
• From this theorem, we can derive Corollary 2.

Corollary 2
• Let i, j and k be any three pairwise different nodes of an acyclic directed shortcut-free graph G. If j is accessible from i, then no node k accessible from j is adjacent to i.
[Figure: nodes i, j and k, with one edge marked "A shortcut!!".]

Computational complexity
• Assume that there are n genes, that is, the accessibility list has n entries, one per gene.
• Let k < n − 1 be the average number of entries in a node’s accessibility list.

• During execution, each node accessible from a node j induces one recursive call of PRUNE_ACC, after which the node accessed from j is declared as visited. Thus each entry of the accessibility list of a node is explored no more than once.
• Line 15 of the algorithm loops over all nodes adjacent to a node j. Let a denote the average number of entries in Adj(j).
• The overall computational complexity is O(nka).

In practice, large-scale experimental gene perturbations in the yeast Saccharomyces cerevisiae (n ≈ 6300) suggest that k < 50 ([HMJRS2000]) and a ≤ 1 ([W2001a]), and thus nka << n^2.
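To make the comparison concrete, the bounds quoted for yeast can be plugged in directly. This is only a back-of-the-envelope check using the upper bounds k = 50 and a = 1 as stated on the slide:

```latex
nka < 6300 \times 50 \times 1 = 3.15 \times 10^{5},
\qquad
n^{2} = 6300^{2} \approx 3.97 \times 10^{7},
\qquad
\frac{n^{2}}{nka} = \frac{n}{ka} > \frac{6300}{50} = 126 .
```

Under these figures the pruning bound is more than two orders of magnitude below n^2.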

Storage complexity
• The algorithm stores two copies of the accessibility list, as well as a list of the nodes that have been visited.
• Because the graph is acyclic, the recursion depth can be no greater than n − 1.
• Note that k < n − 1 is the average number of entries in a node’s accessibility list.
• The overall storage requirements are O(nk).

Outline
• Introduction and basic definitions
• Graph theoretical framework
• Parsimonious network
• Algorithm and complexity
• Cycles in genetic networks
• Conclusions

Dealing with cycles
• Everything discussed so far is restricted to acyclic graphs.
• Now let us look at the problems caused by cyclic graphs.

Problems that single gene perturbation can’t solve
[Figure: two cyclic networks on nodes 0 to 4.]
• They have the same accessibility list:
0: 1 2 3 4
1: 0 2 3 4
2: 0 1 3 4
3: 0 1 2 4
4: 0 1 2 3
• Therefore, we cannot reconstruct the gene network uniquely.

[Figure: the same two cyclic networks on nodes 0 to 4.]
Adjacency list of one network:
0: 3
1: 4
2: 1
3: 2
4: 0
Adjacency list of the other network:
0: 1
1: 2
2: 3
3: 4
4: 0
Note that the order of direct regulatory interactions in these two networks is different, as reflected in the adjacency lists.
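That the two cycles are indistinguishable from perturbation data can be checked mechanically. The sketch below takes the two adjacency lists from the slide and computes, for every gene, the set of other genes it can reach; the helper name `reachable_others` is illustrative, and excluding a gene from its own list is an assumption made to match the way the accessibility list is written on the previous slide.

```python
def reachable_others(adj):
    """For each node, the sorted list of *other* nodes reachable along directed edges."""
    acc = {}
    for start in adj:
        seen = set()
        stack = list(adj[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node])
        seen.discard(start)  # in a cycle a node reaches itself; the slides omit it
        acc[start] = sorted(seen)
    return acc

CYCLE_A = {0: [3], 1: [4], 2: [1], 3: [2], 4: [0]}  # 0 -> 3 -> 2 -> 1 -> 4 -> 0
CYCLE_B = {0: [1], 1: [2], 2: [3], 3: [4], 4: [0]}  # 0 -> 1 -> 2 -> 3 -> 4 -> 0

print(CYCLE_A == CYCLE_B)                                      # False
print(reachable_others(CYCLE_A) == reachable_others(CYCLE_B))  # True
```

Different adjacency lists, identical accessibility lists: single-gene perturbations alone cannot tell the two networks apart.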

• Instead of solving this problem, we collapse the nodes that form a cycle into a single group of nodes whose order of regulatory interactions is indistinguishable.
• Such a group is also called a strongly connected component, or strong component, of a directed graph G. Any two nodes in a strong component are mutually accessible.
• Let us see an example.
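Since any two nodes of a strong component are mutually accessible, the components can be read off an accessibility list directly. The following Python sketch groups nodes by mutual accessibility; it assumes the list is transitively closed, and both the function name and the small four-gene input are hypothetical illustrations rather than data from the paper.

```python
def strong_components(acc):
    """Group nodes that are mutually accessible according to the accessibility list."""
    components = []
    assigned = set()
    for i in acc:
        if i in assigned:
            continue
        # j belongs with i if each can reach the other
        group = {i} | {j for j in acc[i] if i in acc[j]}
        components.append(sorted(group))
        assigned |= group
    return components

# Hypothetical example: genes 0, 1, 2 form a cycle that also influences gene 3.
ACC_CYCLIC = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: []}
print(strong_components(ACC_CYCLIC))  # [[0, 1, 2], [3]]
```

Each resulting group can then be treated as a single collapsed node, which leaves an acyclic graph of components.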
