
A hub-attachment based method to detect functional modules from confidence-scored protein interactions and expression profiles
Authors: Chia-Hao Chin (1,4), Shu-Hwa Chen (1), Chin-Wen Ho (4), Ming-Tat Ko (1,5), Chung-Yen Lin (1,2,3,5)
1. Institute of Information Science, Academia Sinica, Taiwan
2. Division of Biostatistics and Bioinformatics, National Health Research Institutes, Taiwan
3. Institute of Fishery Science, College of Life Science, National Taiwan University, Taiwan
4. Department of Computer Science and Information Engineering, National Central University, Taiwan
5. Research Center of Information Technology Innovation, Academia Sinica, Taiwan

Outline • Goal • Method • Experiment results

Detecting functional modules
• Identify functional modules by partitioning Protein-Protein Interaction (PPI) networks into densely connected regions.

A more reliable PPI
• [Figure: a PPI network on vertices V1, V2, V3 combined with gene expression data; Pearson correlation threshold = 0.6.]

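The overview of HUNTER
• Module seed generation: produces module seeds
• Module seed growth: produces grown modules
• Module amalgamation: produces final modules
• [Figure: an example showing module seeds, grown modules, and final modules.]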

Module seed generation • Four cases for this stage

Module seed generation (1/4)
• Case 1: the input data is an unweighted graph.
• Take the subgraph induced by v's neighbors. In the example, it is composed of three connected components.
• Find a maximum connected component of this induced subgraph.
• The union of the vertex set of a maximum connected component and vertex v is a module seed.
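The Case 1 procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the HUNTER implementation: the function name `module_seed_unweighted`, the adjacency-dictionary representation, and the toy graph are assumptions made for the example.

```python
from collections import deque

def module_seed_unweighted(adj, v):
    """Return v plus a largest connected component of the subgraph
    induced by v's neighbors. `adj` maps each vertex to a set of neighbors."""
    unvisited = set(adj[v])          # vertices of the induced subgraph
    best = set()
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:                 # breadth-first search inside the neighborhood
            u = queue.popleft()
            for w in adj[u]:
                if w in unvisited:   # only neighbors of v that are not yet reached
                    unvisited.remove(w)
                    component.add(w)
                    queue.append(w)
        if len(component) > len(best):
            best = component
    return best | {v}

# Hypothetical toy graph: v's neighborhood splits into {a, b, c}, {d, e}, {f}.
adj = {
    "v": {"a", "b", "c", "d", "e", "f"},
    "a": {"v", "b"}, "b": {"v", "a", "c"}, "c": {"v", "b"},
    "d": {"v", "e"}, "e": {"v", "d"},
    "f": {"v"},
}
print(sorted(module_seed_unweighted(adj, "v")))  # ['a', 'b', 'c', 'v']
```

When two components tie for the largest size, this sketch keeps whichever it meets first; the slides do not say how ties are broken.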

A q-connected module
• A vertex set U ⊆ V is q-connected if, for every nonempty proper subset W ⊂ U, the probability that at least one edge connects W with U \ W is at least q [Ulitsky et al. 2009].
• Example: a triangle on vertices a, b, c with edge confidences a-c = 0.8, a-b = 0.6, b-c = 0.7.
  p( {a}, {b, c} ) = 1 - (1-0.8)*(1-0.6) = 0.92
  p( {a, b}, {c} ) = 1 - (1-0.8)*(1-0.7) = 0.94
  p( {a, c}, {b} ) = 1 - (1-0.6)*(1-0.7) = 0.88
• If q = 0.9, then this graph is not q-connected, because the cut that separates b has probability 0.88 < 0.9.
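A brute-force check of the definition reproduces the three cut probabilities of the triangle example. This is only a sketch: it enumerates every bipartition, so it is exponential in the number of vertices, and it assumes edge confidences are independent. The names `cut_probability` and `is_q_connected` are hypothetical, and the assignment of weights to edges is inferred from the three products in the example.

```python
from itertools import combinations

def cut_probability(weights, side, other):
    """Probability that at least one edge between `side` and `other` exists,
    treating edge confidences as independent probabilities."""
    p_none = 1.0
    for (u, w), conf in weights.items():
        if (u in side and w in other) or (u in other and w in side):
            p_none *= 1.0 - conf
    return 1.0 - p_none

def is_q_connected(vertices, weights, q):
    """Check every bipartition of a nonempty vertex list (exponential time)."""
    vertices = list(vertices)
    first, rest = vertices[0], vertices[1:]
    # Fixing `first` on one side visits each bipartition exactly once;
    # sizes stop before len(rest) so the other side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side = {first, *extra}
            other = set(vertices) - side
            if cut_probability(weights, side, other) < q:
                return False
    return True

# Edge confidences of the triangle example.
weights = {("a", "c"): 0.8, ("a", "b"): 0.6, ("b", "c"): 0.7}
print(is_q_connected(["a", "b", "c"], weights, 0.9))   # False
print(is_q_connected(["a", "b", "c"], weights, 0.85))  # True
```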

Module seed generation (2/4)
• Case 2: the input data is a weighted graph.
• Find a maximum q-connected component of the subgraph induced by v's neighbors.
• Each candidate subgraph is tested for q-connectivity. With a threshold q = 0.9, the example shows induced subgraphs that are not q-connected and one that is.
• The vertex set of the q-connected subgraph is a module seed.
• [Figure: weighted neighborhood of v, with edge confidence scores between 0.1 and 1.0.]

Module seed generation (3/4)
• Case 3: the input data is composed of an unweighted graph and gene expression data.
• Find a maximum connected component of the subgraph induced by v's neighbors in which the Pearson correlation of every pair of vertices is greater than a threshold t.
• Check each candidate subgraph against the gene expression data. A subgraph in which every pair of vertices has a Pearson correlation greater than t yields a module seed (its vertex set).
• [Figure: neighborhood of v. A blue dashed line marks a pair whose Pearson correlation is less than the threshold t = 0.6; a green dashed line marks a pair whose correlation is larger than t = 0.6.]
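The expression check in Case 3 amounts to testing every pair of vertices in a candidate subgraph. A minimal sketch follows, with hypothetical names (`pearson`, `all_pairs_coexpressed`) and made-up expression profiles. It assumes no profile is constant, since a constant profile would make the correlation undefined.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def all_pairs_coexpressed(vertices, expression, t=0.6):
    """True if every pair of vertices has Pearson correlation greater than t."""
    return all(
        pearson(expression[u], expression[w]) > t
        for u, w in combinations(vertices, 2)
    )

# Illustrative profiles only; not data from the study.
expression = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 9],
    "c": [4, 3, 2, 1],   # anti-correlated with a
}
print(all_pairs_coexpressed(["a", "b"], expression))       # True
print(all_pairs_coexpressed(["a", "b", "c"], expression))  # False
```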

Module seed generation (4/4)
• Case 4: the input data is composed of a weighted graph and gene expression data.
• Find a maximum q-connected component of the subgraph induced by v's neighbors in which the Pearson correlation of every pair of vertices is greater than a threshold t.
• We check each candidate subgraph by using the gene expression data, and then check whether it is q-connected.
• In the example, one induced subgraph is not q-connected; another is q-connected, and its vertex set is a module seed.
• [Figure: weighted neighborhood of v, with edge confidence scores between 0.1 and 1.0. A blue dashed line marks a pair whose Pearson correlation is less than the threshold t = 0.6; a green dashed line marks a pair whose correlation is larger than t = 0.6.]

Module growth
• After creating a module seed, we join the neighbors of the module seed if most of their adjacent nodes also belong to the module seed.
• [Figure: a module seed containing vertices v and w, and the resulting grown module.]
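One way to read the growth rule is sketched below. The slide says only "most", so the cutoff `min_fraction=0.5` is an assumption, as are the single pass over the seed's neighbors, the function name `grow_module`, and the toy graph.

```python
def grow_module(adj, seed, min_fraction=0.5):
    """Add each neighbor of the seed whose share of adjacent nodes inside
    the seed exceeds `min_fraction` (assumed meaning of "most").
    `adj` maps each vertex to a set of neighbors and is assumed symmetric."""
    seed = set(seed)
    frontier = {w for u in seed for w in adj[u]} - seed
    grown = set(seed)
    for w in frontier:
        inside = len(adj[w] & seed)   # adjacent nodes that belong to the seed
        if inside / len(adj[w]) > min_fraction:
            grown.add(w)
    return grown

# Hypothetical graph: w has 2 of 3 neighbors in the seed, y has 1 of 3.
adj = {
    "a": {"b", "c", "w"}, "b": {"a", "c", "w"}, "c": {"a", "b", "y"},
    "w": {"a", "b", "x"}, "x": {"w", "y"},
    "y": {"c", "x", "z"}, "z": {"y"},
}
print(sorted(grow_module(adj, {"a", "b", "c"})))  # ['a', 'b', 'c', 'w']
```

Membership is always tested against the original seed, so the result does not depend on the order in which neighbors are visited.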

Module amalgamation
• We merge any two modules if they have too many common proteins.
• [Figure: grown module 1 and grown module 2 merged into a final module.]
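The merging step might look like the sketch below. The slide does not define "too many", so the overlap measure (shared proteins divided by the size of the smaller module) and the cutoff `max_overlap=0.5` are assumptions chosen for illustration; `amalgamate` is a hypothetical name.

```python
def amalgamate(modules, max_overlap=0.5):
    """Repeatedly merge any two non-empty modules whose overlap exceeds
    `max_overlap`, until no such pair remains."""
    modules = [set(m) for m in modules]
    merged = True
    while merged:
        merged = False
        for i in range(len(modules)):
            for j in range(i + 1, len(modules)):
                a, b = modules[i], modules[j]
                # Assumed overlap measure: shared proteins / smaller module size.
                overlap = len(a & b) / min(len(a), len(b))
                if overlap > max_overlap:
                    modules[i] = a | b
                    del modules[j]
                    merged = True
                    break
            if merged:
                break                # restart the scan after every merge
    return modules

grown = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
print(amalgamate(grown))  # [{1, 2, 3, 4, 5}, {7, 8}]
```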

Functional Group Verification Using Gene Ontology
• Gene Ontology
  • Three separate ontologies:
    • Biological Process
    • Molecular Function
    • Cellular Component
  • Organized as a DAG describing gene products (proteins and functional RNA)
• GO Annotation
  • A GO term is associated with a gene or gene product to form a GO annotation.
http://www.yeastgenome.org/help/GO.html

p-value
• Given a gene ontology and a term t, the p-value is the probability of observing x or more proteins annotated to t in the cluster c.
• N: the number of proteins annotated to a term of the GO ontology.
• M: the number of proteins annotated to the GO term t.
• n: the number of proteins of the cluster c.
• x: the number of proteins of the cluster c which are annotated to the GO term t.
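The formula itself is not present in the slide text. A common choice for this kind of enrichment test is the hypergeometric upper tail, p = sum over i from x to min(n, M) of C(M, i) * C(N - M, n - i) / C(N, n), and the sketch below assumes that this is what is meant. The function name `enrichment_p_value` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def enrichment_p_value(N, M, n, x):
    """Hypergeometric upper tail: probability that a random cluster of n
    proteins drawn from N contains at least x of the M annotated proteins."""
    total = comb(N, n)
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return tail / total

# Illustrative numbers only: 10 proteins, 4 annotated to t, cluster of size 3.
print(enrichment_p_value(10, 4, 3, 3))  # all 3 cluster proteins annotated: 4/120
print(enrichment_p_value(10, 4, 3, 0))  # x = 0 always holds, so the value is 1.0
```

The same function applies unchanged when the term t is replaced by a reference complex, because only the four counts enter the computation.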

F-measure
• For each method, we measured
• Sensitivity: the fraction of annotations that are enriched in at least one module at p-value < 10^-4 [Ulitsky et al. 2009].
• Specificity: the fraction of modules enriched with at least one annotation at p-value < 10^-4 [Ulitsky et al. 2009].
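The slide defines sensitivity and specificity but does not state how they are combined. The sketch below assumes the usual F-measure, the harmonic mean of the two. The input layout (a dictionary of p-values keyed by module and annotation), the function name `f_measure`, and the example values are all assumptions for illustration.

```python
def f_measure(p_values, modules, annotations, cutoff=1e-4):
    """Return (sensitivity, specificity, F) from enrichment p-values.
    `p_values` maps (module, annotation) pairs to p-values."""
    enriched = {(m, a) for (m, a), p in p_values.items() if p < cutoff}
    enriched_annotations = {a for _, a in enriched}
    enriched_modules = {m for m, _ in enriched}
    sensitivity = len(enriched_annotations) / len(annotations)
    specificity = len(enriched_modules) / len(modules)
    if sensitivity + specificity == 0:
        return sensitivity, specificity, 0.0
    # Assumed combination: harmonic mean of sensitivity and specificity.
    f = 2 * sensitivity * specificity / (sensitivity + specificity)
    return sensitivity, specificity, f

# Made-up p-values: m1 is enriched for t1 and t2, m2 for nothing.
p_values = {("m1", "t1"): 1e-6, ("m1", "t2"): 1e-5, ("m2", "t3"): 0.2}
print(f_measure(p_values, ["m1", "m2"], ["t1", "t2", "t3", "t4"]))
# (0.5, 0.5, 0.5)
```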

We compare our method with three newly developed methods
• CEZANNA [Ulitsky et al. 2009]
• CMC [Liu et al. 2009]
• Core [Leung et al. 2009]

Check experiment results by GO

Check experiment results by golden standard databases
• p-value: Given a golden standard database and a complex g, the p-value is the probability of observing x or more proteins of g in the cluster c.
• N: the number of proteins in a golden standard database.
• M: the number of proteins in a complex g of the golden standard database.
• n: the number of proteins of the cluster c.
• x: the number of proteins of the cluster c which also belong to the complex g.

Check experiment results by golden standard databases

A cluster of our prediction on yeast PPI
• [Figure labels:]
• RNA Polymerase I
• RNA Polymerase II
• RNA Polymerase III
• TFIIF for RNA polymerase II
• Common regulatory unit for RNA polymerase I, II
• Common module for RNA polymerase I, III
• Common module for RNA polymerase I, II, III

Threshold
• q-connected
  • We set q to 0.95, which corresponds to an "error probability" of 0.05.
• Correlation threshold t
  • Initiation: a complete graph whose edges are weighted by Pearson correlation.
  • Given a cutoff threshold, remove those edges whose Pearson correlation is less than or equal to the threshold.
  • [Figure: example graph with edge correlations 0.6, 0.6, 0.6, 0.7, 0.8, 0.9 and cutoff threshold = 0.6.]

Clustering coefficient
• The density of the network surrounding node i, characterized as the number of triangles through i.
• k_i: degree of node i
• E_i: number of edges between the neighbors of node i
• C_i = 2*E_i / (k_i*(k_i - 1))
• Example: the center node has 8 (grey) neighbors, and there are 4 edges between the neighbors, so C = 2*4 / (8*(8-1)) = 8/56 = 1/7.
• When C_i is averaged over the network, K is the number of nodes whose degree is larger than 1.
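The worked example (8 neighbors, 4 edges among them) can be reproduced with a short sketch. The function names are hypothetical, and averaging over the K nodes of degree larger than 1 is an assumption based on the slide's definition of K.

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i = 2*E_i / (k_i*(k_i - 1)); defined as 0.0 here when k_i < 2."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    edges = sum(1 for u, w in combinations(adj[i], 2) if w in adj[u])
    return 2 * edges / (k * (k - 1))

def average_clustering(adj):
    """Mean of C_i over the K nodes whose degree is larger than 1."""
    eligible = [i for i in adj if len(adj[i]) > 1]
    if not eligible:
        return 0.0
    return sum(local_clustering(adj, i) for i in eligible) / len(eligible)

# Center node "c" with 8 neighbors n0..n7 and 4 edges among them.
adj = {"c": {f"n{j}" for j in range(8)}}
for j in range(8):
    adj[f"n{j}"] = {"c"}
for j in range(0, 8, 2):             # edges n0-n1, n2-n3, n4-n5, n6-n7
    adj[f"n{j}"].add(f"n{j + 1}")
    adj[f"n{j + 1}"].add(f"n{j}")

print(local_clustering(adj, "c"))    # 8/56 = 1/7, about 0.142857
```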

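A threshold for Pearson correlation
• The authors conjectured that the removed links are likely to be noise as long as the difference between the observed clustering coefficient and its randomized counterpart increases monotonically [Elo et al. 2007].
• Candidate thresholds: r_0 = 0, r_1 = 0.01, ..., r_100 = 1.
• The threshold is chosen at the first local maximum of C(r_i) - C0(r_i), where C is the observed clustering coefficient and C0 is its randomized counterpart.
• [Figure: C(r_i) - C0(r_i) plotted against the threshold, with the first local maximum marked (labels: C*, a threshold).]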
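Once the differences C(r_i) - C0(r_i) have been computed on the threshold grid, picking the first local maximum is a simple scan. The sketch below assumes that a local maximum is a point strictly higher than its predecessor and at least as high as its successor; the function name and the numbers are illustrative and do not come from the study.

```python
def first_local_maximum(thresholds, differences):
    """Return the first threshold r_i at which C(r_i) - C0(r_i) stops rising,
    or None if the sequence has no interior local maximum."""
    for i in range(1, len(differences) - 1):
        rises = differences[i - 1] < differences[i]
        stops = differences[i] >= differences[i + 1]
        if rises and stops:
            return thresholds[i]
    return None

# Made-up differences on the first few grid points r_0, r_1, ...
thresholds = [0.00, 0.01, 0.02, 0.03, 0.04]
differences = [0.10, 0.12, 0.15, 0.14, 0.18]
print(first_local_maximum(thresholds, differences))  # 0.02
```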

References
• Elo LL, Jarvenpaa H, Oresic M, Lahesmaa R, Aittokallio T: Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics 2007, 23(16):2096-2103.
• Leung HC, Xiang Q, Yiu SM, Chin FY: Predicting protein complexes from PPI data: a core-attachment approach. J Comput Biol 2009, 16(2):133-144.
• Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics 2009, 25(15):1891-1897.
• Ulitsky I, Shamir R: Identifying functional modules using expression profiles and confidence-scored protein interactions. Bioinformatics 2009, 25(9):1158-1164.

Thank you for your attention!
